1. Preface
My project is a Maven project; with Maven you don't have to worry about jar dependency management yourself, which is very convenient. If you are not yet using Maven to manage your projects, I strongly recommend it, even though getting started takes some effort (mainly because Maven's default download mirror is overseas, which makes downloads slow).
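If the slow overseas mirror is the main pain point, the usual fix is to route requests for Maven Central through a domestic mirror in ~/.m2/settings.xml. A minimal sketch, assuming the Aliyun mirror of Maven Central (any Central mirror you trust works the same way):

<!-- ~/.m2/settings.xml -->
<settings>
    <mirrors>
        <!-- Serve requests for Maven Central from the Aliyun mirror -->
        <mirror>
            <id>aliyun-central</id>
            <mirrorOf>central</mirrorOf>
            <name>Aliyun Maven Central mirror</name>
            <url>https://maven.aliyun.com/repository/central</url>
        </mirror>
    </mirrors>
</settings>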
2. Setup Details
Below are screenshots of how I created the project:
[Screenshots 1.jpg through 11.jpg: the step-by-step project-creation wizard]
3. Testing the WordCount Program
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>cn.pwsoft</groupId>
    <artifactId>SparkStudy</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/io.netty/netty-all -->
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.1.4.Final</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <!-- In Spark 2.x the Kafka integration was split into -0-8 and -0-10
             artifacts; the old spark-streaming-kafka_2.11 only exists for 1.x -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.4</version>
        </dependency>
    </dependencies>
</project>
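Two notes on this pom. First, the _2.11 suffix on every Spark artifact is the Scala version; it must match the Scala version your code is compiled against. Second, the pom above only declares dependencies; if you want mvn package to compile the Scala sources itself instead of leaving that to the IDE, you would normally also add the scala-maven-plugin. A minimal sketch, assuming plugin version 3.2.2:

<build>
    <plugins>
        <!-- Compiles src/main/scala and src/test/scala during the build -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>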
WordCount program source:
import org.apache.spark.sql.SparkSession

object MyWordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession and get its SparkContext
    val spark = SparkSession
      .builder
      .appName("Spark Pi").master("local")
      .getOrCreate()
    val sc = spark.sparkContext

    // Read the file and print its line count
    val count = sc.textFile("F:\\vmware\\share\\soft\\spark-1.6.0-bin-hadoop2.6\\README.md").count()
    println(count)

    // val lines = spark.sparkContext.textFile("F:\\vmware\\share\\soft\\spark-1.6.0-bin-hadoop2.6\\README.md")
    // val wordcount = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
    // wordcount.foreach(pair => println(pair._1 + " : " + pair._2))

    /**
     * flatMap     produces a MapPartitionsRDD
     * map         produces a MapPartitionsRDD
     * reduceByKey produces a ShuffledRDD
     * sortByKey   produces a ShuffledRDD
     */
    // Word count sorted by frequency, descending: swap each pair to
    // (count, word) so sortByKey sorts by count, then swap back to print.
    spark.sparkContext.textFile("F:\\vmware\\share\\soft\\spark-1.6.0-bin-hadoop2.6\\README.md")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map(pair => (pair._2, pair._1))
      .sortByKey(false)
      .collect()
      .map(pair => (pair._2, pair._1))
      .foreach(pair => println(pair._1 + " : " + pair._2))

    // Spin so the program does not exit, keeping the web console available
    // for inspection (spark.stop() below is therefore never reached)
    while (true) {}
    spark.stop()
  }
}
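If you want to check the RDD types listed in the comment above for yourself, RDD.toDebugString prints the lineage of a transformation chain. A minimal sketch (the file path is a placeholder; point it at any text file):

import org.apache.spark.sql.SparkSession

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Lineage Demo").master("local").getOrCreate()

    // Same transformation chain as above, held before any action runs
    val counts = spark.sparkContext
      .textFile("README.md") // placeholder path
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Prints the lineage: a ShuffledRDD on top of MapPartitionsRDDs
    println(counts.toDebugString)
    spark.stop()
  }
}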
It runs successfully.