Sqoop: SQL to Hadoop
Scenario: the data lives in an RDBMS; how do we use Hive or Hadoop to analyze it?
1) RDBMS ==> Hadoop
2) Hadoop ==> RDBMS
One option is hand-rolled MapReduce with a custom InputFormat/OutputFormat, but there is a ready-made tool:
Sqoop: a bridge between RDBMS and Hadoop
Sqoop 1.x: 1.4.7
Under the hood it runs MapReduce, and the jobs are map-only (no reduce phase)
e.g. ruozedata.person ===> HDFS
(reading from the database via JDBC)
Sqoop 2.x: 1.99.7
RDBMS <==> Hadoop (import/export are named from Hadoop's point of view)
Import: RDBMS ==> Hadoop
Export: Hadoop ==> RDBMS
wget http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.7.0.tar.gz
tar -zxvf sqoop-1.4.6-cdh5.7.0.tar.gz -C ~/app
export SQOOP_HOME=/home/hadoop/app/sqoop-1.4.6-cdh5.7.0
export PATH=$SQOOP_HOME/bin:$PATH
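Reload the shell and sanity-check the install (assuming the two exports above went into ~/.bash_profile):
source ~/.bash_profile
sqoop version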
In $SQOOP_HOME/conf/sqoop-env.sh (copy it from sqoop-env-template.sh if it does not exist), set:
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.7.0
Copy the MySQL JDBC driver into $SQOOP_HOME/lib
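For example (the jar path and connector version below are placeholders, use whichever connector you downloaded):
cp ~/software/mysql-connector-java-5.1.27-bin.jar $SQOOP_HOME/lib/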
sqoop list-tables \
--connect jdbc:mysql://localhost:3306/ruozedata_basic03 \
--username root --password root
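The sibling command list-databases is a quick connectivity check at the server level:
sqoop list-databases \
--connect jdbc:mysql://localhost:3306 \
--username root --password root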
RDBMS ==> HDFS
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table emp -m 2 \
--mapreduce-job-name FromMySQLToHDFS \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--target-dir EMP_COLUMN_WHERE \
--fields-terminated-by '\t' \
--null-string '' --null-non-string '0' \
--where 'SAL>2000'
Defaults:
-m 4 (four map tasks)
target dir = the table name (emp) under the user's HDFS home directory
Sqoop code-generates a record class for the table and packages it as emp.jar
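To inspect what landed on HDFS (the relative --target-dir resolves under the user's home, here assumed to be /user/hadoop):
hadoop fs -ls EMP_COLUMN_WHERE
hadoop fs -cat 'EMP_COLUMN_WHERE/part-m-*'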
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
-m 2 \
--mapreduce-job-name FromMySQLToHDFS \
--delete-target-dir \
--target-dir EMP_COLUMN_QUERY \
--fields-terminated-by '\t' \
--null-string '' --null-non-string '0' \
--query "SELECT * FROM emp WHERE EMPNO>=7566 AND \$CONDITIONS" \
--split-by 'EMPNO'
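With --query there is no table for Sqoop to split on automatically, so --split-by is required when -m > 1: Sqoop looks up MIN/MAX of EMPNO and replaces $CONDITIONS in each mapper with that mapper's range, conceptually like this (illustrative boundary values, not actual output):
-- mapper 1
SELECT * FROM emp WHERE EMPNO>=7566 AND ( EMPNO >= 7566 AND EMPNO < 7788 )
-- mapper 2
SELECT * FROM emp WHERE EMPNO>=7566 AND ( EMPNO >= 7788 AND EMPNO <= 7934 )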
sqoop eval \
--connect jdbc:mysql://localhost:3306/imooc_project \
--username root --password root \
--query 'select * from day_video_access_topn_stat'
emp.opt (an options file: one option or value per line)
import
--connect
jdbc:mysql://localhost:3306/imooc_project
--username
root
--password
root
--table
day_video_access_topn_stat
--delete-target-dir
sqoop --options-file emp.opt
sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
-m 2 \
--mapreduce-job-name FromHDFSToMySQL \
--table emp_demo \
--export-dir /user/hadoop/emp
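sqoop export does not create the target table, so emp_demo must already exist in MySQL. A minimal DDL sketch (column names/types assumed to mirror the classic emp table):
CREATE TABLE emp_demo (
  empno int,
  ename varchar(10),
  job varchar(9),
  mgr int,
  hiredate varchar(30),
  sal decimal(7,2),
  comm decimal(7,2),
  deptno int
);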
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table emp -m 2 \
--mapreduce-job-name FromMySQLToHive \
--delete-target-dir \
--hive-database ruozedata \
--hive-table ruozedata_emp_partition \
--hive-import \
--hive-partition-key 'pt' \
--hive-partition-value '2018-08-08' \
--fields-terminated-by '\t' \
--hive-overwrite
--create-hive-table is not recommended; create the Hive table yourself first, then use Sqoop to import the data into it
create table ruozedata_emp_partition
(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double, deptno int)
partitioned by (pt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
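After the import, a quick sanity check in Hive:
SELECT * FROM ruozedata.ruozedata_emp_partition WHERE pt='2018-08-08' LIMIT 5;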
Hive To MySQL
Hive's data lives on HDFS,
so Hive-to-MySQL is really just HDFS-to-MySQL: export the table's directory on HDFS (sketch below)
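A sketch of such an export; the warehouse path below is an assumption (check hive.metastore.warehouse.dir for the actual location):
sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table emp_demo \
--input-fields-terminated-by '\t' \
--export-dir /user/hive/warehouse/ruozedata.db/ruozedata_emp \
-m 2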
sqoop job --create ruozejob -- \
import \
--connect jdbc:mysql://localhost:3306/imooc_project \
--username root --password root \
--table day_video_access_topn_stat -m 2 \
--mapreduce-job-name FromMySQLToHDFS \
--delete-target-dir
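Managing saved jobs (standard sqoop job subcommands; --exec prompts for the DB password unless the metastore is configured to store it):
sqoop job --list
sqoop job --show ruozejob
sqoop job --exec ruozejob
sqoop job --delete ruozejob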
Scheduling: run a saved sqoop job (or a sqoop --options-file invocation) from crontab, as sketched below.
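A crontab sketch (the 2 a.m. schedule is an example; cron does not load your login environment, so use the absolute path):
0 2 * * * /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/bin/sqoop job --exec ruozejob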
Requirement: find the TOP 3 most popular products in each area.
1) MySQL: city_info (static)
2) MySQL: product_info (static)
DROP TABLE product_info;
CREATE TABLE `product_info` (
  `product_id` int(11) DEFAULT NULL,
  `product_name` varchar(255) DEFAULT NULL,
  `extend_info` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
3) Hive: user_click (user-behavior log, partitioned by date)
user_id int
session_id string
action_time string
city_id int
product_id int
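A minimal user_click DDL consistent with the columns above (the '\t' delimiter and naming the date partition column day are assumptions):
CREATE TABLE user_click (
  user_id int,
  session_id string,
  action_time string,
  city_id int,
  product_id int
) PARTITIONED BY (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';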
To implement the requirement:
1) city_info ===> Hive
2) product_info ===> Hive (an import sketch for both follows this list)
3) join the three tables and take the TOP 3 per area (grouped by area) into a day-partitioned table (a query sketch follows the field list below)
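A sketch of steps 1)-2); the source MySQL database name and the full-refresh strategy are assumptions, and product_info is imported the same way:
sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--username root --password root \
--table city_info \
--hive-import --hive-overwrite \
--hive-database ruozedata \
--hive-table city_info \
--fields-terminated-by '\t' \
--delete-target-dir \
-m 1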
The final result columns:
product_id    product ID
product_name  product name
area          area
click_count   number of clicks / visits
rank          ranking within the area
day           date
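A sketch of the step-3 query using a window function; the city_info columns (city_id, area) are assumptions consistent with the requirement:
SELECT product_id, product_name, area, click_count, `rank`, day
FROM (
  SELECT t.product_id, p.product_name, t.area, t.click_count, t.day,
         row_number() OVER (PARTITION BY t.area ORDER BY t.click_count DESC) AS `rank`
  FROM (
    -- clicks per product per area for the given day
    SELECT u.product_id, c.area, count(1) AS click_count, u.day
    FROM user_click u JOIN city_info c ON u.city_id = c.city_id
    WHERE u.day = '20180808'
    GROUP BY u.product_id, c.area, u.day
  ) t
  JOIN product_info p ON t.product_id = p.product_id
) s
WHERE `rank` <= 3;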
====> MySQL
day : 20180808
Hive to MySQL: before re-exporting a day's result, first run DELETE FROM xxx WHERE day='20180808' on the MySQL side,
otherwise a re-run appends duplicates (the same record showing up again as id=10, id=11, ...); see the sketch below.
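Putting the last step together, a re-runnable sketch; the MySQL table name area_product_topn and the export directory are hypothetical, and the sketch assumes the exported files carry day as a regular column:
# 1) idempotency: clear this day's old rows first (sqoop eval can run DML too)
sqoop eval \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--query "DELETE FROM area_product_topn WHERE day='20180808'"
# 2) export the day's result files
sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table area_product_topn \
--input-fields-terminated-by '\t' \
--export-dir /user/hive/warehouse/ruozedata.db/area_product_topn_tmp \
-m 2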