My scenario: the free community edition of CDH 5.7.6, with Spark running on YARN. Starting with CDH 5.5, the Spark distro no longer ships the Thrift Server distributed SQL engine or the spark-sql script. The Thrift Server is one of the key entry points to Spark's vision of fusing heterogeneous data, and spark-sql is a handy tool for testing SQL, but Cloudera pushes its own Impala first:
why and when to use which engine (Hive, Impala, and Spark)
Please add Spark Thrift Server to the CDH Spark distro
This reply basically sums it up:
For interfacing with SQL BI tools like Tableau, Excel, etc, Impala is the best engine. It was specifically designed for BI tools like Tableau and is fully certified/supported with most major BI tools including Tableau. Please see the following blog for more details on why and when to use which engine (Hive, Impala, and Spark)
For those looking for a Spark server to develop applications against, the Thrift Server for Spark is architecturally limited to exposing just SQL (in addition to other architectural limitations around security, multi-tenancy, redundancy, concurrency, etc). As such Cloudera founded the Livy project which aims to enable an interface for applications to better interface with Spark broadly (available as community preview in Cloudera Labs for feedback and community participation): http://blog.cloudera.com/blog/2016/07/livy-the-open-source-rest-service-for-apache-spark-joins-cloudera-labs/?_ga=1.116860357.2120376933.1474491928
SQL may not be Spark's main business, but it is the gateway to Hive and relational databases, and Spark's SQL parser adds support for extra syntax such as registering temporary tables. Such a table can live in any relational data store (an RDB or Hive) as long as a JDBC driver is available, with no programming required, which is quite powerful. Since I don't want to use Impala or the Spark Job Server (SJS) for now, the only option is to build Spark myself and swap it into the CDH distro:
How to upgrade Spark on CDH5.5
The most attractive thing about building vanilla Spark is that "you can always run the latest version of Spark on CDH." Which raises the question: is it better to build vanilla Spark or the CDH Spark distro? The command for building the Spark binary distribution (the parameters are the same as for an mvn build):
make-distribution.sh -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.7.6 -Dscala-2.10.5 -DskipTests -Phive -Phive-thriftserver
The -Phive and -Phive-thriftserver profiles make the build compile and package the Hive dependencies and the Thrift Server, but the Hive version cannot be chosen; it defaults to Hive 1.2.1... This CDH community post:
CDH 5.5 does not have Spark Thrift Server
mentions: "The thrift server in Spark is not tested, and might not be compatible, with the Hive version that is in CDH. Hive in CDH is 1.1 (patched) and Spark uses Hive 1.2.1. You might see API issues during compilation or run time failures due to that."
The CDH Spark distro picks up fixes and improvements ahead of the community release, and even the current latest CDH 5.12 still ships this same Spark 1.6.0. Since this is going onto a production cluster, I decided to play it safe and build the CDH Spark distro. The build went through without a hitch; the steps are as follows:
1. Download the source package from archive.cloudera.com/cdh5/cdh/5/, and stop the CDH Spark service. There is actually not much to stop: for Spark it is just a single History Server role instance.
2. Build with the make-distribution command above. The three most important jars end up together in dist/lib; the least troublesome approach is to copy the assembly and examples jars plus spark-1.6.0-cdh5.7.6-yarn-shuffle.jar straight over the existing ones in /opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/jars on every cluster node. That directory is in fact on every node's Java classpath, as the environment information listed by the History Server on port 18088 shows; thanks to Java's on-demand class loading, all the jars can simply be dropped into that one place (a consolidated shell sketch of steps 2-5 follows the beeline notes below).
3. Upload the assembly jar to HDFS:/user/spark/share/lib, which is just a directory for holding the assembly and can be named however you like. Grant permissions on the jar and the spark directory with hdfs dfs -chmod 755 /user/spark so that the directory can be seen from namenode:50070.
4. Point the CDH Spark configuration at the uploaded assembly.
5. I run a separate Hive metastore dedicated to Spark; locate the configuration file /etc/hive/conf/hive-site.xml via the HIVE_CONF_DIR set in spark-env.sh and change the hive.metastore.uris property.
6. Copy the Thrift Server start/stop scripts to the cluster nodes and start it:
./start-thriftserver.sh --master yarn-client --hiveconf hive.server2.thrift.port=10001
7. Test: beeline -u jdbc:hive2://localhost:10001 -n hdfs
beeline parameters: -u is followed by a standard JDBC URL; -n is the user name; -p is the password, which can be left empty since the cluster has no security enabled. Note that (I'm not sure whether this is an HS2 quirk or a beeline one) specifying a database name in the connection URL has no effect; it always connects straight to the default database. For detailed beeline usage see:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineExample
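To recap steps 2 through 5 in one place, here is a minimal shell sketch. The host placeholder, the exact assembly jar file name, and the spark.yarn.jar property are illustrative assumptions (spark.yarn.jar is the usual Spark 1.x way of pointing YARN applications at an assembly on HDFS; the equivalent Spark JAR location setting in Cloudera Manager works too):
# Step 2: overwrite the Spark jars on every node (repeat per host, or loop over a host list)
scp dist/lib/spark-assembly-*.jar dist/lib/spark-examples-*.jar \
    dist/lib/spark-1.6.0-cdh5.7.6-yarn-shuffle.jar \
    root@NODE:/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/jars/
# Step 3: publish the assembly on HDFS and open read access
hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put -f dist/lib/spark-assembly-*.jar /user/spark/share/lib/
hdfs dfs -chmod 755 /user/spark
# Step 4: point Spark on YARN at the uploaded assembly (spark-defaults.conf; jar name assumed)
# spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly-1.6.0-cdh5.7.6-hadoop2.6.0-cdh5.7.6.jar
# Step 5: in /etc/hive/conf/hive-site.xml, point hive.metastore.uris at the dedicated metastore,
# e.g. thrift://metastore-host:9083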
CDH calls Spark, Hive, and the rest "services", and their components, such as Spark's History Server or Hive's metastore service, are "roles" belonging to the service; a role actually running on a particular node is a role instance. The so-called gateways are their clients, i.e. the nodes where CLIs such as spark-shell, hive, or beeline can run. CDH's SPARK_HOME is /opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/lib/spark. The configuration directories for the individual CDH services all live under /etc/, e.g. /etc/spark/conf for Spark, which holds Spark's two key files: the global defaults spark-defaults.conf and the environment initialization script spark-env.sh. From CDH's point of view Spark is rather "loose": unlike HDFS or Hive, it has no particular daemon tied to specific nodes.
With ST (the Spark Thrift Server) in place, it is time to test a join across MySQL and Hive through ST. The relationship: the gid column of the Hive table idp_pub_ans.dim_aclineend matches the resource_id column of the MySQL table hive.test. First, tell Spark about the MySQL driver; taking the vanilla spark-defaults.conf file as an example, configure an extra classpath like this:
spark.driver.extraClassPath=/usr/appsoft/spark/lib/*
spark.executor.extraClassPath=/usr/appsoft/spark/lib/*
These two lines configure an extra classpath for the driver and the executors respectively, pointing at every jar in one directory; the MySQL driver simply goes into that directory, which is convenient because any future extra jars can just be dropped in as well. The settings can be made in the CDH web UI and then pushed out to the cluster. Don't forget to upload the MySQL driver to the corresponding directory on every node; the directory path is arbitrary, as long as it exists on each node's local disk.
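For pushing the driver jar out, something like the following works; the hosts file name and the driver jar version are placeholders:
for h in $(cat cluster_hosts.txt); do
  ssh "$h" mkdir -p /usr/appsoft/spark/lib
  scp mysql-connector-java-*.jar "$h":/usr/appsoft/spark/lib/
done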
Then restart ST and have it execute a table-creation SQL statement that registers the MySQL table as a Spark SQL temporary table, under the name mySQLtest:
CREATE TEMPORARY table mySQLtest
USING org.apache.spark.sql.jdbc
OPTIONS(
url "jdbc:mysql://192.11.1.1:3306/hive",
dbtable "test",
user 'hive',
password 'hive'
);
Entered as a single line in beeline: CREATE TEMPORARY table mySQLtest USING org.apache.spark.sql.jdbc OPTIONS(url "jdbc:mysql://192.11.1.1:3306/hive",dbtable "test",user 'hive',password 'hive');
OK, the two heterogeneous sources can now be joined on dim_aclineend.gid = mySQLtest.resource_id. By the same token Spark can handle Oracle and other RDBs, as well as HDFS data files such as JSON or CSV, in much the same way; this is DF, Data Fusion. HBase is a little different and deserves a separate post about shc later. ST does have one weakness, though: it is launched through the spark-submit script as a single JVM process, and for each JDBC connection it receives it creates a new execution context, so a temporary table's lifetime is limited to a single session. If you have a lot of cross-source JOIN requirements, the more robust SJS is worth considering. See also the HiveServer2 introduction:
http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/
which notes: "For each client connection, it creates a new execution context that serves Hive SQL requests from the client." So HS2 is, by design, creating a separate context for each client connection.
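Back to the join itself: a minimal sketch of the cross-source query, using the column relationship described above (the selected columns are just for illustration):
SELECT h.gid, m.resource_id
FROM idp_pub_ans.dim_aclineend h
JOIN mySQLtest m ON h.gid = m.resource_id
LIMIT 10;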
The temporary-view flavor of the registration:
CREATE TEMPORARY VIEW mysqlInfo
USING org.apache.spark.sql.jdbc
OPTIONS (......
And via the spark-sql script:
bin/spark-sql --driver-class-path lib/mysql-connector-java.jar
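For example, a minimal interactive session (a sketch reusing the registration statement and join from above; it assumes spark-sql runs on a gateway node whose hive-site.xml points at the metastore):
spark-sql> CREATE TEMPORARY table mySQLtest USING org.apache.spark.sql.jdbc OPTIONS(url "jdbc:mysql://192.11.1.1:3306/hive", dbtable "test", user 'hive', password 'hive');
spark-sql> SELECT h.gid, m.resource_id FROM idp_pub_ans.dim_aclineend h JOIN mySQLtest m ON h.gid = m.resource_id LIMIT 10;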
Appendix: