Introduction
This article is largely based on the Spark Programming Guide on the official Spark website. A previous article already covers the basic concepts involved, so they are not repeated here (see Spark常用概念).
The goal of this article is to walk through the commonly used APIs of JavaRDD and JavaPairRDD by reading code.
Let's get started...
Linking with Spark
Use Maven or SBT to create a local Java/Scala application project.
The following shows how to build and run Spark Java code standalone on Windows (the Scala version is similar).
Using IDEA
Create a new Maven project, with a pom.xml like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.paulHome.app</groupId>
<artifactId>learnSparkJavaApi</artifactId>
<version>1.0</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>com.github.fommil.netlib</groupId>
<artifactId>all</artifactId>
<version>1.1.2</version>
<type>pom</type>
</dependency>
</dependencies>
</project>
This pom is fairly complete: SQL, Streaming, and MLlib are all included. For other Maven-related configuration, see the other article 搭建虛擬機(jī)Spark環(huán)境. One more tip: in IDEA's Maven settings, enable automatic download of source files, which makes it much easier to read and study the Spark source later. The biggest benefit of debugging Spark programs locally is that you can set breakpoints and step through the source code to understand it.
Then create your project files however you like; my own setup is shown in the figure below. (Another side note: when installing the JDK, do not put it under the default Program Files directory, which contains a space. That is a trap you will step into if you also need HDFS, i.e. when you install Hadoop. I have not changed mine yet because I am not sure whether I will need Hadoop at home, but if I do I will certainly move it.)
Next, configure the Run settings. The key part is the VM options: -Dspark.master=local[4] (probably the first big takeaway of this article).
To make things easier to follow, here is the Java source code. It was written as I worked through the example statements on the official site one by one, so it has no particular theme; it is the code of my very first Spark application (yes, I have only just started learning).
/**
* Created by Paul Yang on 2017/4/15.
*/
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.util.AccumulatorV2;
import org.apache.spark.util.LongAccumulator;
import scala.Tuple2;
import scala.collection.immutable.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Pattern;
public class simpleRddMain {
//Used to sum
static int countSum = 0;
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("simple RDD opt")
.set("spark.hadoop.validateOutputSpecs", "false");
JavaSparkContext sc = new JavaSparkContext(conf);
//Parallelize a collection into an RDD
ArrayList<Integer> intList = new ArrayList<Integer>(){{
add(1);
add(2);
add(3);
add(4);
add(5);
}};
JavaRDD<Integer> integerRdd = sc.parallelize(intList); // Get an RDD from a list.
System.out.println("Integer RDD:");
integerRdd.collect();
//Lambda expressions
JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
intLineLength.persist(StorageLevel.MEMORY_ONLY());
int totalLen = intLineLength.reduce((a, b) -> a + b);
System.out.println("Lines(" + stringRdd.count() + ")<<<Lambda expressions>>>: Total len = " + totalLen);
//anonymous inner class or a name one
class GetLenFunc implements Function<String, Integer> {
@Override
public Integer call(String s) throws Exception {
return s.length();
}
}
JavaRDD<Integer> funcLineLengths = stringRdd.map( new GetLenFunc() );
int funcTotalLen = funcLineLengths.reduce( new Function2<Integer, Integer, Integer>() {
public Integer call (Integer a, Integer b) {return a + b;}
});
System.out.println("<<<anonymous inner class or a name one>>>: Total Len = " + funcTotalLen);
//Wordcount Process
// JavaRDD<String> wordsRdd = stringRdd.flatMap(new FlatMapFunction<String, String>() {
// @Override
// public Iterator<String> call(String line) throws Exception {
// return Arrays.asList( line.split(" ")).iterator();
// }
// });
JavaRDD<String> wordsRdd = stringRdd.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
JavaPairRDD<String, Integer> eachWordRdd = wordsRdd.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> wordCntRdd = eachWordRdd.reduceByKey( (a, b) -> a + b );
wordCntRdd.collect();
wordCntRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
@Override
public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
System.out.println(stringIntegerTuple2._1 + "@@@" + stringIntegerTuple2._2);
}
});
//Understanding closures
integerRdd.foreach(new VoidFunction<Integer>() {
@Override
public void call(Integer integer) throws Exception {
countSum += integer.intValue();
}
});
System.out.println("#~~~~~scope and life cycle of variables and methods~~~~~~# countSum = " + countSum);
//Working with Key-Value Pairs
JavaPairRDD<String, Integer> strIntPairRdd = stringRdd.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> strCountRdd = strIntPairRdd.reduceByKey((a, b) -> a + b);
//strCountRdd.sortByKey();
strCountRdd.collect();
System.out.println("###Working with Key-Value Pairs### :" + strCountRdd.toString());
strCountRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
@Override
public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
System.out.println(stringIntegerTuple2._1 + ":" + stringIntegerTuple2._2);
}
});
//Broadcast Variables
Broadcast<double[]> broadcastVar = sc.broadcast(new double[] {1.1, 2.2, 3.3});
broadcastVar.value();
//Accumulator
LongAccumulator longAccum = sc.sc().longAccumulator();
integerRdd.foreach(x -> longAccum.add(x));
System.out.println("\n\n\nAccumulator: " + longAccum.value() + "\n\n\n\n");
//AccumulatorV2
class MyVector {
double[] vals;
public MyVector(int vecLen) {
vals = new double[vecLen];
}
public void reset() {
for(int i = 0; i < vals.length; i++) {
vals[i] = 0;
}
}
public void add(MyVector inVec) {
for(int i = 0; i < vals.length; i++) {
vals[i] += inVec.vals[i];
}
}
}
class VectorAccumulatorV2 extends AccumulatorV2<MyVector,MyVector> {
private MyVector selfVect = null;
public VectorAccumulatorV2(int vecLen) {
selfVect = new MyVector(vecLen);
}
@Override
public boolean isZero() {
for(int i = 0; i < selfVect.vals.length; i++) {
if(selfVect.vals[i] != 0) return false;
}
return true;
}
@Override
public AccumulatorV2<MyVector, MyVector> copy() {
// Copy the current state instead of recursively calling copy(), which would never terminate.
VectorAccumulatorV2 ret = new VectorAccumulatorV2(selfVect.vals.length);
ret.selfVect.add(selfVect);
return ret;
}
@Override
public void reset() {
selfVect.reset();
}
@Override
public void add(MyVector v) {
selfVect.add(v);
}
@Override
public void merge(AccumulatorV2<MyVector, MyVector> other) {
MyVector minVec = null, maxVec = null;
if(other.value().vals.length < selfVect.vals.length) {
minVec = other.value();
maxVec = selfVect;
}
else {
minVec = selfVect;
maxVec = other.value();
}
//TODO: merge together.
}
@Override
public MyVector value() {
return selfVect;
}
}
VectorAccumulatorV2 myVecAcc = new VectorAccumulatorV2(5);
sc.sc().register(myVecAcc, "MyVectorAcc1");
}
}
After clicking Run you may hit two errors. The first is this one:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This error can actually be ignored: even though no runnable Hadoop bin is found, the program keeps going; you just cannot use HDFS.
If it bothers you and you want to get rid of the Error, it only takes a couple of steps:
- Download winutils.exe: the link here is the GitHub one for Hadoop 2.7, which is not the same as the older versions found on other blogs (although in practice they seem to work equally well).
- Set the environment variable: add the downloaded directory (the parent directory of bin) to the HADOOP_HOME environment variable.
- Restart IDEA (after some hesitation I decided to write this step down anyway).
If you like, you can step into the code that reports the error, add a breakpoint, and you will find another way to fix it: add a configuration statement in the program itself so that HADOOP_HOME does not need to be set at all. This is useful if your Windows machine already has Hadoop and you do not want to change it or add the winutils files to its bin directory.
The second error will probably be here:
JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
It reports that the file cannot be found. In fact, running the code above does not require downloading any release package from the Spark website at all, because Maven has already fetched everything for us; the file from the release package is used here only because I did not want to construct a separate data file. Just fix the path, and remember to include the drive letter, otherwise it will be looked up relative to the IDEA project directory.
Once these two errors are resolved you should be able to see the program run to completion.
When looking at the results you may find Spark's own log output too verbose. You can restrict it by lowering the log level: change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console to stop Spark from printing INFO-level logs to the console. If you want more detail instead, change INFO to DEBUG.
The content of log4j.properties is as follows:
log4j.rootLogger=${root.logger}
root.logger=WARN,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}
This file needs to go somewhere the program loads automatically, for example the resources directory:
With that in place, the log output looks much cleaner on the next run.
Initializing Spark
On the official site, this part mainly explains how to create a JavaSparkContext from a SparkConf in code. I already covered this in a previous article, so it is not repeated here.
Running Spark from the shell is also not the focus of this article; the focus is learning the hands-on RDD API.
So on to the main topic: RDDs...
RDDs
For the concept and characteristics of RDDs, again see my earlier article 《Spark常用概念》, which I think already summarizes them well.
Creating RDDs
Overall there are two ways to get an RDD:
- Parallelize a collection in code
/** Distribute a local Scala collection to form an RDD. */
def parallelize[T](list: java.util.List[T]): JavaRDD[T] =
parallelize(list, sc.defaultParallelism)
Reading the source shows that parallelize() takes a List<T> and returns a JavaRDD<T>.
- Read from a file
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(path: String): JavaRDD[String] = sc.textFile(path)
The textFile() method takes a string, which can be a specific file or a directory. If it is a directory, every file under it is read automatically, and a JavaRDD<String> is returned.
- Other APIs
There are also some other APIs for creating RDDs, for example the very useful one that creates an empty RDD:
/** Get an RDD that has no partitions or elements. */
def emptyRDD[T]: JavaRDD[T] = {
implicit val ctag: ClassTag[T] = fakeClassTag
JavaRDD.fromRDD(new EmptyRDD[T](sc))
}
as well as:
/** Distribute a local Scala collection to form an RDD. */
def parallelizePairs[K, V](list: java.util.List[Tuple2[K, V]]): JavaPairRDD[K, V] =
parallelizePairs(list, sc.defaultParallelism)
/** Distribute a local Scala collection to form an RDD. */
def parallelizeDoubles(list: java.util.List[java.lang.Double]): JavaDoubleRDD =
parallelizeDoubles(list, sc.defaultParallelism)
All of the methods above live on the JavaSparkContext class (which wraps SparkContext).
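As a quick illustration, here is a minimal sketch of calling these factory methods from Java, reusing the sc from the main example above (the variable names are made up):
JavaRDD<Integer> emptyInts = sc.emptyRDD();
JavaPairRDD<String, Integer> pairsRdd = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1),
        new Tuple2<>("b", 2)));
JavaDoubleRDD doublesRdd = sc.parallelizeDoubles(Arrays.asList(1.0, 2.5, 3.7));
System.out.println("mean = " + doublesRdd.mean());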
RDD Operations
There are two kinds of RDD operations: 1. transformations; 2. actions. A transformation maps each element of the RDD into something else: perhaps a 1-to-1 map, perhaps a 1-to-N (N >= 0) flatMap, or perhaps a mapToPair that attaches a key to form (Key, Value) pairs. An action computes over the qualifying elements of the RDD and returns a value. The most commonly used transformation and action APIs are introduced below.
They come mainly from JavaRDD and JavaPairRDD.
- Transformation
- map
This method applies an implementation of the Function interface to every element of the RDD and returns a new RDD.
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[R](f: JFunction[T, R]): JavaRDD[R] =
new JavaRDD(rdd.map(f)(fakeClassTag))(fakeClassTag)
For example, suppose the dataset format is: id,value
Then sc.textFile("pathOfFile") directly yields an RDD of String elements, which need to be parsed according to that format; the concrete map call is:
//Lambda version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(s -> s.split(",", -1));
//Non-Lambda version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(new Function<String, String[]>() {
@Override
public String[] call(String v1) throws Exception {
return v1.split(",", -1);
}
});
From here on, to show the return types more clearly, I will stop writing the examples in Lambda style.
- filter
This keeps the elements for which the filter function returns true and drops those for which it returns false, producing a new RDD.
Continuing the example above, suppose the dataset contains malformed or incomplete records, and we simply treat any String array whose length is not 2 as data to throw away. Then:
JavaRDD<String[]> strFiltedRdd = strArrayIdValue.filter(new Function<String[], Boolean>() {
@Override
public Boolean call(String[] v1) throws Exception {
return v1.length == 2;
}
});
- flatMap
Similar to map, except that each element is not necessarily mapped to exactly one new element but 1-to-N, where N >= 0. Suppose the value field in the example above is structured as value1|value2|value3... Then dropping the id and saving the individual values into an RDD looks like this:
JavaRDD<String> strValueNRdd = strFiltedRdd.flatMap(new FlatMapFunction<String[], String>() {
@Override
public Iterator<String> call(String[] strings) throws Exception {
return Arrays.asList(strings[1].split("\\|", -1)).iterator();
}
});
This splits the collection of values in each element into a new RDD where every element contains exactly one value.
- mapPartitions
This applies the mapping per partition of the RDD. The source definition is:
/**
* Return a new RDD by applying a function to each partition of this RDD.
*/
def mapPartitions[U](f: FlatMapFunction[JIterator[T], U]): JavaRDD[U] = {
def fn: (Iterator[T]) => Iterator[U] = {
(x: Iterator[T]) => f.call(x.asJava).asScala
}
JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
Converting the earlier map call to this method gives:
JavaRDD<String[]> strArrayIdValue = stringRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>() {
@Override
public Iterator<String[]> call(Iterator<String> stringIterator) throws Exception {
ArrayList<String[]> arrList = new ArrayList<String[]>();
// Consume every element of the partition, not just the first one.
while (stringIterator.hasNext()) {
arrList.add(stringIterator.next().split(",", -1));
}
return arrList.iterator();
}
});
- union
This merges the input RDD with the RDD it is called on and produces a new RDD. It is handy for reading several independent files into a single RDD, for example:
JavaRDD<String> unionAllFilesRdd = sc.emptyRDD();
for(String name : fileNames) {
unionAllFilesRdd = unionAllFilesRdd.union(sc.textFile(name));
}
- intersection
Returns the intersection of the input RDD and this RDD, without duplicates.
Source comment:
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
def intersection(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.intersection(other.rdd))
- distinct
Returns a new RDD containing only the distinct elements of this RDD, i.e. a deduplication. It takes no arguments.
Source:
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(): JavaRDD[T] = wrapRDD(rdd.distinct())
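A minimal usage sketch, reusing the wordsRdd from the word-count code above:
JavaRDD<String> distinctWordsRdd = wordsRdd.distinct();
System.out.println("distinct words: " + distinctWordsRdd.count());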
- subtract
This method returns the elements that exist in this RDD but not in the input RDD, collected into a new RDD.
/**
* Return an RDD with the elements from `this` that are not in `other`.
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be less than or equal to us.
*/
def subtract(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.subtract(other))
Combining the previous methods, suppose the task is: add the data in RDD A that differs from RDD B into B, and replace the overlapping part of B with A according to some condition (a sketch of the whole flow follows this list).
1. Get the part of A that is not in B: AoutB = A.subtract(B)
2. Find the subset of A that satisfies the replacement condition: replaceCandidateA = A.filter(condition)
3. Find the subset of A that can actually replace something: realReplaceA = replaceCandidateA.intersection(B)
4. Find the subset of B to discard: discardB = B.intersection(realReplaceA)
5. What remains of B after discarding: B = B.subtract(discardB)
6. Merge the replacement set and the newly added set into B: newB = B.union(realReplaceA).union(AoutB)
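Here is a minimal sketch of those six steps in Java, assuming A and B are JavaRDD<String> and using a made-up replacement condition purely for illustration:
JavaRDD<String> A = sc.parallelize(Arrays.asList("a", "b", "c"));
JavaRDD<String> B = sc.parallelize(Arrays.asList("b", "c", "d"));
JavaRDD<String> AoutB = A.subtract(B);                                     // 1. part of A not in B
JavaRDD<String> replaceCandidateA = A.filter(s -> s.compareTo("b") >= 0);  // 2. hypothetical condition
JavaRDD<String> realReplaceA = replaceCandidateA.intersection(B);          // 3. candidates that also exist in B
JavaRDD<String> discardB = B.intersection(realReplaceA);                   // 4. part of B to drop
JavaRDD<String> remainingB = B.subtract(discardB);                         // 5. B after discarding
JavaRDD<String> newB = remainingB.union(realReplaceA).union(AoutB);        // 6. final result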
- mapToPair
This API converts a JavaRDD into a JavaPairRDD, i.e. it pulls out a key. With the earlier id,value data, after converting each line into an [id, value] String array you can build a JavaPairRDD keyed by id:
JavaPairRDD<String, String> keyValuePairRdd = strFiltedRdd.mapToPair(new PairFunction<String[], String, String>() {
@Override
public Tuple2<String, String> call(String[] strings) throws Exception {
return new Tuple2<>(strings[0], strings[1]);
}
});
Next come some APIs that only exist on JavaPairRDD.
- groupByKey
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level.
*
* @note If you are grouping in order to perform an aggregation (such as a sum or average) over
* each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
* will provide much better performance.
*/
def groupByKey(): JavaPairRDD[K, JIterable[V]] =
fromRDD(groupByResultToJava(rdd.groupByKey()))
This API is not very pleasant to use: you cannot customize how the values are combined, and its execution details deserve attention. See Avoid GroupByKey and 深入理解groupByKey、reduceByKey.
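For completeness, a minimal usage sketch on the keyValuePairRdd built above, just to show the return type (reduceByKey below is usually the better choice):
JavaPairRDD<String, Iterable<String>> groupedRdd = keyValuePairRdd.groupByKey();
groupedRdd.foreach(t -> System.out.println(t._1 + " -> " + t._2));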
- reduceByKey
For each key in the PairRDD, this "adds up" the values; the concrete "addition" is defined by a class implementing the Function2 interface.
Source:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: JFunction2[V, V, V]): JavaPairRDD[K, V] = {
fromRDD(reduceByKey(defaultPartitioner(rdd), func))
}
Note that reduceByKey does not change the type of V. For example, merging the values of the earlier PairRDD looks like this:
JavaPairRDD<String, String> byIdValuesPairRdd = keyValuePairRdd.reduceByKey(new Function2<String, String, String>() {
@Override
public String call(String v1, String v2) throws Exception {
return v1+"|"+v2;
}
});
In this example V is a String. In practice you could also map each String into an ArrayList<String> first and then reduceByKey them into one big ArrayList.
So, is there a method that transforms a String directly into an ArrayList<String>? Read on.
- aggregateByKey
This method takes quite a few arguments, mainly because its goal is to combine the elements of the RDD per key into a U whose type differs from V. So you have to specify how a V is merged into a U (the second argument), how two U's are merged (the third argument), and an initial U (for example an empty collection, 0 for integer addition, or 1 for integer multiplication; the first argument). Here comes the good part; source first:
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's.
* The former operation is used for merging values within a partition, and the latter is used for
* merging values between partitions. To avoid memory allocation, both of these functions are
* allowed to modify and return their first argument instead of creating a new U.
*/
def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U], combFunc: JFunction2[U, U, U]):
JavaPairRDD[K, U] = {
implicit val ctag: ClassTag[U] = fakeClassTag
fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
}
Now for some hands-on practice:
JavaPairRDD<String, ArrayList<String>> keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), new Function2<ArrayList<String>, String, ArrayList<String>>() {
@Override
public ArrayList<String> call(ArrayList<String> v1, String v2) throws Exception {
v1.add(v2);
return v1;
}
}, new Function2<ArrayList<String>, ArrayList<String>, ArrayList<String>>() {
@Override
public ArrayList<String> call(ArrayList<String> v1, ArrayList<String> v2) throws Exception {
v1.addAll(v2);
return v1;
}
});
These three arguments accomplish the String -> ArrayList<String> transformation.
Such a good opportunity to show off a Lambda version cannot be missed:
val initialSet = mutable.HashSet.empty[String]
val addToSet = (s: mutable.HashSet[String], v: String) => s += v
val mergePartitionSets = (p1: mutable.HashSet[String], p2: mutable.HashSet[String]) => p1 ++= p2
val keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
Why did the style suddenly switch to Scala? I could not help it: at first I simply could not write this as a Java Lambda and did not know how to convert it... (is Java doomed?). If anyone knows how to write this with a Java Lambda, please let me know in the comments, many thanks!
JavaPairRDD<String, ArrayList<String>> keyValuelistLambda = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), (uList,vStr) -> {uList.add(vStr); return uList;}, (u1, u2) -> {u1.addAll(u2); return u1;});
Er... I did come up with a way myself after all (it would have been too embarrassing otherwise; I have only seriously written Java for three months, but that is no excuse), although it looks a bit odd. If anyone has a nicer way to write it, you are very welcome to share it in the comments. Thanks!
- sortByKey
When K is sortable, this method sorts the RDD by key, ascending by default. Source:
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements in
* ascending order. Calling `collect` or `save` on the resulting RDD will return or output an
* ordered list of records (in the `save` case, they will be written to multiple `part-X` files
* in the filesystem, in order of the keys).
*/
def sortByKey(): JavaPairRDD[K, V] = sortByKey(true)
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
def sortByKey(ascending: Boolean): JavaPairRDD[K, V] = {
val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[K]]
sortByKey(comp, ascending)
}
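A minimal usage sketch on the wordCntRdd from the word-count code; the boolean overload picks the direction:
JavaPairRDD<String, Integer> ascWordCntRdd = wordCntRdd.sortByKey();        // ascending
JavaPairRDD<String, Integer> descWordCntRdd = wordCntRdd.sortByKey(false);  // descending
System.out.println(ascWordCntRdd.first());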
- join
This method intersects two PairRDDs by key: if a key k exists in both this RDD and the input RDD, it goes into the returned RDD, and each element of the result is k, (v1, v2), where the second part is a Tuple with v1 from this RDD and v2 from the input RDD. Source:
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, W)] =
fromRDD(rdd.join(other, partitioner))
JavaPairRDD<String, Tuple2<String, String>> joinRDD = byIdValuesPairRdd.join(keyValuePairRdd);
- leftOuterJoin & rightOuterJoin & fullOuterJoin
join has three variants; their behavior is easiest to see in practice:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
add(new Tuple2<>(7, "str77"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));
JavaPairRDD<Integer, Tuple2<String, Optional<String>>> leftOuterJoinRdd = paralPairRdd.leftOuterJoin(otherParalPairRdd);
leftOuterJoinRdd.foreach(s -> System.out.println("leftOuterJoin*"+ s.toString()));
JavaPairRDD<Integer, Tuple2<Optional<String>, String>> rightOuterJoinRdd = paralPairRdd.rightOuterJoin(otherParalPairRdd);
rightOuterJoinRdd.foreach(s -> System.out.println("rightOuterJoin*"+ s.toString()));
JavaPairRDD<Integer, Tuple2<Optional<String>, Optional<String>>> fullOuterJoinRdd = paralPairRdd.fullOuterJoin(otherParalPairRdd);
fullOuterJoinRdd.foreach(s -> System.out.println("fullOuterJoin*"+ s.toString()));
The code above prints:
join*(2,(str2,str2))
join*(4,(str44,str4))
leftOuterJoin*(2,(str2,Optional[str2]))
leftOuterJoin*(4,(str44,Optional[str4]))
leftOuterJoin*(1,(str1,Optional.empty))
leftOuterJoin*(1,(str11,Optional.empty))
rightOuterJoin*(2,(Optional[str2],str2))
rightOuterJoin*(4,(Optional[str44],str4))
rightOuterJoin*(3,(Optional.empty,str3))
rightOuterJoin*(7,(Optional.empty,str77))
rightOuterJoin*(5,(Optional.empty,str5))
fullOuterJoin*(4,(Optional[str44],Optional[str4]))
fullOuterJoin*(2,(Optional[str2],Optional[str2]))
fullOuterJoin*(3,(Optional.empty,Optional[str3]))
fullOuterJoin*(7,(Optional.empty,Optional[str77]))
fullOuterJoin*(1,(Optional[str1],Optional.empty))
fullOuterJoin*(1,(Optional[str11],Optional.empty))
fullOuterJoin*(5,(Optional.empty,Optional[str5]))
So join takes the intersection; leftOuterJoin keeps every key of this RDD together with its value set; rightOuterJoin keeps every key of the input RDD; fullOuterJoin is the union. Also note that the order of keys is not guaranteed, only the order of values within a key.
- cogroup
This method combines two or more PairRDDs (if several RDDs are passed in, all of them are combined). If a key from one RDD does not appear in another, an empty collection is recorded for it. Talk is cheap, let's show you the code and the run result.
Source first:
/**
* For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
* list of values for that key in `this` as well as `other`.
*/
def cogroup[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (JIterable[V], JIterable[W])] =
fromRDD(cogroupResultToJava(rdd.cogroup(other)))
There are several overloaded versions; counting the RDD the method is called on, up to 4 RDDs can be cogrouped at once.
Hands-on practice:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<String>>> coGroupRdd = paralPairRdd.cogroup(otherParalPairRdd);
coGroupRdd.foreach(s -> System.out.println("+++"+ s.toString()));
The output:
+++(3,([],[str3]))
+++(1,([str1, str11],[]))
+++(2,([str2],[str2]))
+++(4,([str44],[str4]))
+++(5,([],[str5]))
You can see that the order of keys is not guaranteed, but the order of values within a key is.
- intersection
This method returns the intersection of this RDD and the input RDD. The difference from join is that join keeps pairs whose keys match even if the values differ, whereas intersection only keeps elements whose key and value are both equal:
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
def intersection(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
new JavaPairRDD[K, V](rdd.intersection(other.rdd))
An example:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
add(new Tuple2<>(7, "str77"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));
JavaPairRDD<Integer, String> intersectRdd = paralPairRdd.intersection(otherParalPairRdd);
intersectRdd.foreach(s -> System.out.println("intersection*" + s.toString()));
Output:
join*(4,(str44,str4))
join*(2,(str2,str2))
intersection*(2,str2)
So intersection does not change the element type of the result, but join does, because the value is turned into a Tuple.
- subtract
This method takes the elements that exist in this RDD but not in the input RDD and forms a new RDD of the same element type. Note that "different" is judged not only by the key but also by the value: if either the key or the value differs, the element is kept as part of the returned RDD.
/**
* Return an RDD with the elements from `this` that are not in `other`.
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be <= us.
*/
def subtract(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
fromRDD(rdd.subtract(other))
Example code:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
add(new Tuple2<>(7, "str77"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, String> substractRdd = paralPairRdd.subtract(otherParalPairRdd);
substractRdd.foreach(s -> System.out.println("substract*" + s.toString()));
Output:
substract*(4,str44)
substract*(1,str1)
substract*(1,str11)
But watch out for a pitfall: here the element type is Tuple2<Integer, String>. If it becomes Tuple2<Integer, String[]>, two elements are treated as different even when the String[] contents are identical. Keep that in mind!
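A minimal sketch of that pitfall, with made-up data: the two arrays below hold identical contents but are distinct objects, and Java arrays compare by reference, so subtract does not treat the elements as equal.
JavaPairRDD<Integer, String[]> left = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, new String[]{"x", "y"})));
JavaPairRDD<Integer, String[]> right = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, new String[]{"x", "y"})));
// The "logically equal" element survives the subtract instead of being removed.
left.subtract(right).foreach(t -> System.out.println(t._1 + ":" + Arrays.toString(t._2)));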
- coalesce
This method reduces the number of partitions of the RDD to the given argument. It is said to be efficient, but in my experience it does not feel convenient for getting the data saved into a single file in the end.
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*/
def coalesce(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.coalesce(numPartitions))
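A minimal usage sketch (the output path is hypothetical; one part-XXXXX file is written per remaining partition):
wordCntRdd.coalesce(2).saveAsTextFile("output/wordcount");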
- repartition
This method reshuffles the RDD's data, using a random partitioning to produce more or fewer partitions and balance them across the cluster. This always shuffles all data over the network.
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.repartition(numPartitions))
Sometimes it is used at the end to save the result RDD into a single part file:
coGroupRdd.repartition(1).saveAsTextFile(fileName);
- Actions
- reduce
This is similar to reduceByKey, except that reduceByKey outputs one new value per key and the result is still an RDD, whereas reduce aggregates over all elements of the RDD and produces a single value that is no longer an RDD. That is why this operation counts as an action, while reduceByKey is a transformation.
/**
* Reduces the elements of this RDD using the specified commutative and associative binary
* operator.
*/
def reduce(f: JFunction2[T, T, T]): T = rdd.reduce(f)
In practice:
JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
intLineLength.persist(StorageLevel.MEMORY_ONLY());
int totalLen = intLineLength.reduce((a, b) -> a + b);
- collect
Pay particular attention to the Note in the source:
/**
* Return an array that contains all of the elements in this RDD.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): JList[T] =
rdd.collect().toSeq.asJava
- collectAsMap
This is an API specific to JavaPairRDD; it returns a Map of the K,V relationships in the original RDD.
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collectAsMap(): java.util.Map[K, V] = mapAsSerializableJavaMap(rdd.collectAsMap())
One thing to note, though: the Map type it returns can cause problems when used with broadcast.
For example:
final Map<String, MyInfoClass> kvMap = keyValuePairRDD.collectAsMap();
final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);
The code above turns a JavaPairRDD into a Map and then broadcasts it so that other executors can use it in methods such as map(). Written this way, however, there is some chance of hitting the following error:
17/06/14 11:43:53 INFO scheduler.DAGScheduler: ShuffleMapStage 1 (repartition at FromBSID2Gps.java:214) failed in 1.182 s due to Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 19, s36.dc.taiyear, executor 3): java.io.IOException: java.lang.UnsupportedOperationException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1213)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
.... ....
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException
at java.util.AbstractMap.put(AbstractMap.java:203)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:217)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1206)
... 20 more
So, to avoid the value being treated as an AbstractMap, explicitly specify the Map type, as in the code below.
Map<String, MyInfoClass> kvMap = new HashMap<>();
kvMap.putAll(keyValuePairRDD.collectAsMap());
final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);
- count
/**
* Return the number of elements in the RDD.
*/
def count(): Long = rdd.count()
- first
/**
* Return the first element in this RDD.
*/
def first(): T = rdd.first()
- take
/**
* Take the first num elements of the RDD. This currently scans the partitions *one by one*, so
* it will be slow if a lot of partitions are required. In that case, use collect() to get the
* whole RDD instead.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def take(num: Int): JList[T] =
rdd.take(num).toSeq.asJava
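A quick sketch of these actions on the integerRdd from the main example:
long n = integerRdd.count();                              // 5
Integer head = integerRdd.first();                        // 1
java.util.List<Integer> firstThree = integerRdd.take(3);  // [1, 2, 3]
System.out.println(n + ", " + head + ", " + firstThree);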
- saveAsTextFile
This does not save to a single file but to a directory, generating one file per partition.
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit = {
rdd.saveAsTextFile(path)
}
- countByKey
Returns the result as a hash Map.
/** Count the number of elements for each key, and return the result to the master as a Map. */
def countByKey(): java.util.Map[K, jl.Long] =
mapAsSerializableJavaMap(rdd.countByKey()).asInstanceOf[java.util.Map[K, jl.Long]]
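A quick sketch on the eachWordRdd from the word-count code ("the" is just an illustrative key):
java.util.Map<String, Long> wordCounts = eachWordRdd.countByKey();
System.out.println("count of 'the': " + wordCounts.get("the"));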
- foreach
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: VoidFunction[T]) {
rdd.foreach(x => f.call(x))
}
Other RDD Types
A brief look at:
- JavaDoubleRDD
Shared Variables
Because a Spark application is serialized and shipped to the worker nodes for execution, special mechanisms are needed to obtain an effective, consistent global variable across the workers.
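The two mechanisms Spark offers for this are broadcast variables and accumulators, both of which already appear in the main example above; a minimal sketch:
// Broadcast: read-only data shipped once to every executor.
Broadcast<double[]> factors = sc.broadcast(new double[]{1.1, 2.2, 3.3});
// Accumulator: executors only add to it; the driver reads the final value.
LongAccumulator sum = sc.sc().longAccumulator("sum");
integerRdd.foreach(x -> sum.add(x));
System.out.println("factors[0] = " + factors.value()[0] + ", sum = " + sum.value());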