Continuing from the previous post (http://www.lxweimin.com/p/b27545f6d730), this one deploys Spark on top of the Hadoop cluster set up there.
1. Install Scala
Download Scala from the official site; here I use the latest release, 2.12.1.
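For example, the tarball can be fetched and unpacked like this (the download URL and install path are assumptions; adjust them to your mirror and layout):
wget https://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.tgz
tar -zxf scala-2.12.1.tgz -C /home/spark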
Then add the following environment variables to ~/.bashrc:
export SCALA_HOME=/home/spark/scala-2.12.1
export PATH=$SCALA_HOME/bin:$PATH
[root@master jre]# source ~/.bashrc
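To confirm the installation, checking the version should report 2.12.1:
[root@master ~]# scala -version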
2. Install and Configure Spark
Download the Spark build pre-compiled against the installed Hadoop version; since the cluster runs Hadoop 2.7.3, I use spark-2.1.0-bin-hadoop2.7.
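As a sketch, the pre-built package can be fetched from the Apache archive and unpacked under /home/spark (the exact mirror URL is an assumption; any Spark download mirror works):
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
tar -zxf spark-2.1.0-bin-hadoop2.7.tgz -C /home/spark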
Configure Spark
Configure spark-env.sh:
cd /home/spark/spark-2.1.0-bin-hadoop2.7/conf   # enter the Spark conf directory
cp spark-env.sh.template spark-env.sh           # copy from the template
vi spark-env.sh                                 # add the settings below
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64/jre
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
SPARK_MASTER_HOST=master
SPARK_LOCAL_DIRS=/home/spark/spark-2.1.0-bin-hadoop2.7
SPARK_DRIVER_MEMORY=1G
Note: when setting the number of CPU cores and the amount of memory for the Worker process, stay within the node's actual hardware; if the configured values exceed what the Worker node really has, the Worker process will fail to start.
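If you do want to cap the Worker explicitly, the relevant spark-env.sh settings are SPARK_WORKER_CORES and SPARK_WORKER_MEMORY; the values below are only an illustration for a small node:
SPARK_WORKER_CORES=1      # cores this Worker may offer to executors
SPARK_WORKER_MEMORY=1g    # total memory this Worker may offer to executors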
Configure the slaves file, listing one worker hostname per line:
[root@master conf]# vim slaves
master
slave
Distribute the configured Spark directory to every slave node; here there is only one slave:
scp -r /home/spark/spark-2.1.0-bin-hadoop2.7 root@slave:/home/spark
3. Start Spark
[root@master spark-2.1.0-bin-hadoop2.7]# sbin/start-all.sh
Check whether the Spark processes started successfully.
On the master:
[root@master spark-2.1.0-bin-hadoop2.7]# jps
13312 ResourceManager
3716 Master
13158 SecondaryNameNode
12857 NameNode
8697 Jps
13451 NodeManager
12989 DataNode
3807 Worker
On the slave:
[root@localhost spark-2.1.0-bin-hadoop2.7]# jps
9300 NodeManager
15604 Jps
1480 Worker
9179 DataNode
Open the Spark web UI at http://192.168.1.240:8080
4. Run an Example
The example code below counts the lines in README.md that contain the letter 'a' and the letter 'b', respectively:
from pyspark import SparkContext

logFile = "/user/test1/README.md"  # should be a file that exists on your HDFS
sc = SparkContext(appName="Simple App")  # take the master from spark-submit instead of hard-coding "local"
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
sc.stop()
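Before submitting, the file must exist at that HDFS path. A minimal sketch of uploading it, using the README.md shipped in the Spark directory (paths are assumptions based on the setup above):
/usr/local/hadoop-2.7.3/bin/hdfs dfs -mkdir -p /user/test1
/usr/local/hadoop-2.7.3/bin/hdfs dfs -put README.md /user/test1/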
Submit the example with spark-submit:
[root@master spark-2.1.0-bin-hadoop2.7]# /home/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://192.168.1.240:7077 --deploy-mode client /home/code/spark_test/test1.py
Lines with a: 62, lines with b: 30