The Spark Official Programming Guide, Explained with Practice

Introduction

This article is largely drawn from the Spark Programming Guide on the official Spark website. A previous article already gave a basic introduction to some of the concepts involved, so they are not repeated here (see Spark Common Concepts).
The main idea of this article is to walk through the commonly used APIs of JavaRDD and JavaPairRDD by reading code.
Let's get started...

Linking with Spark

Use Maven or SBT to set up a local Java/Scala application project.
The following shows how to compile and run Spark Java code standalone in a Windows environment (the Scala case is similar).

Using IDEA

Create a new Maven project; the content of pom.xml is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.paulHome.app</groupId>
    <artifactId>learnSparkJavaApi</artifactId>
    <version>1.0</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>com.github.fommil.netlib</groupId>
            <artifactId>all</artifactId>
            <version>1.1.2</version>
            <type>pom</type>
        </dependency>

    </dependencies>

</project>

This pom is fairly complete: SQL, Streaming, and MLlib are all included (note that all Spark artifacts should share the same Scala version suffix, here _2.11). Other Maven-related configuration is covered in my other article on setting up a Spark environment in a virtual machine. One more tip: in IDEA's Maven settings, enable automatic downloading of sources, which makes reading and studying the code later much easier. The biggest benefit of debugging a Spark program locally is that you can set breakpoints, which is a great way to read and understand the source code.

然后根據(jù)你喜好,創(chuàng)建好自己的工程文件。我自己的情況見下圖所示(另外多說一句,安裝JDK的時(shí)候千萬別放在默認(rèn)的帶空格的目錄Program Files下面,這就是個(gè)坑,如果你還需要用到HDFS,也就是再安裝Haoop的時(shí)候就會踩到。不過我現(xiàn)在就沒改,因?yàn)椴淮_定是否要在家里用到Hadoop,不過后面用到的話我肯定會改的):

Project directory structure for learning Spark

Next, set up the Run configuration. The key point is the VM options: -Dspark.master=local[4] (probably the first big takeaway of this article).

Run configuration

To make it easier to follow along, here is the Java source code. It was written casually while working through the official guide one example statement at a time, so it has no particular theme; it is the code from my very first Spark application (yes, I only started learning recently).

/**
 * Created by Paul Yang on 2017/4/15.
 */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.util.AccumulatorV2;
import org.apache.spark.util.LongAccumulator;
import scala.Tuple2;
import scala.collection.immutable.List;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Pattern;

public class simpleRddMain {

    //Used to sum
    static int countSum = 0;

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("simple RDD opt")
                .set("spark.hadoop.validateOutputSpecs", "false");
        JavaSparkContext sc = new JavaSparkContext(conf);

        //parallel a RDD
        ArrayList<Integer> intList = new ArrayList<Integer>(){{
            add(1);
            add(2);
            add(3);
            add(4);
            add(5);
        }};

        JavaRDD<Integer> integerRdd = sc.parallelize(intList); // Get a RDD from a list.
        System.out.println("Integer RDD:");
        integerRdd.collect();

        //Lambda expressions
        JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
        JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
        intLineLength.persist(StorageLevel.MEMORY_ONLY());
        int totalLen = intLineLength.reduce((a, b) -> a + b);
        System.out.println("Lines(" + stringRdd.count() + ")<<<Lambda expressions>>>: Total len = " + totalLen);

        //anonymous inner class or a name one
        class GetLenFunc implements Function<String, Integer> {
            @Override
            public Integer call(String s) throws Exception {
                return s.length();
            }
        }
        JavaRDD<Integer> funcLineLengths = stringRdd.map( new GetLenFunc() );
        int funcTotalLen = funcLineLengths.reduce( new Function2<Integer, Integer, Integer>() {
           public Integer call (Integer a, Integer b) {return a + b;}
        });
        System.out.println("<<<anonymous inner class or a name one>>>: Total Len = " + funcTotalLen);


        //Wordcount Process
//        JavaRDD<String> wordsRdd = stringRdd.flatMap(new FlatMapFunction<String, String>() {
//            @Override
//            public Iterator<String> call(String line) throws Exception {
//                return Arrays.asList( line.split(" ")).iterator();
//            }
//        });
        JavaRDD<String> wordsRdd = stringRdd.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
        JavaPairRDD<String, Integer> eachWordRdd = wordsRdd.mapToPair(s -> new Tuple2(s, 1));
        JavaPairRDD<String, Integer> wordCntRdd = eachWordRdd.reduceByKey( (a, b) -> a + b );
        wordCntRdd.collect();
        wordCntRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                System.out.println(stringIntegerTuple2._1 + "@@@" + stringIntegerTuple2._2);
            }
        });

        //Understanding closures
        integerRdd.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer integer) throws Exception {
                countSum += integer.intValue();
            }
        });
        System.out.println("#~~~~~scope and life cycle of variables and methods~~~~~~# countSum = " + countSum);

        //Working with Key-Value Pairs
        JavaPairRDD<String, Integer> strIntPairRdd = stringRdd.mapToPair(s -> new Tuple2(s, 1));
        JavaPairRDD<String, Integer> strCountRdd = strIntPairRdd.reduceByKey((a, b) -> a + b);
        //strCountRdd.sortByKey();
        strCountRdd.collect();
        System.out.println("###Working with Key-Value Pairs### :" + strCountRdd.toString());
        strCountRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                System.out.println(stringIntegerTuple2._1 + ":" + stringIntegerTuple2._2);
            }
        });

        //Broadcast Variables
        Broadcast<double[]> broadcastVar = sc.broadcast(new double[] {1.1, 2.2, 3.3});
        broadcastVar.value();

        //Accumulator
        LongAccumulator longAccum = sc.sc().longAccumulator();
        integerRdd.foreach(x -> longAccum.add(x));
        System.out.println("\n\n\nAccumulator: " + longAccum.value() + "\n\n\n\n");

        //AccumulatorV2
        class MyVector {
            double[] vals;

            public MyVector(int vecLen) {
                vals = new double[vecLen];
            }

            public void reset() {
                for(int i = 0; i < vals.length; i++) {
                    vals[i] = 0;
                }
            }

            public void add(MyVector inVec) {
                for(int i = 0; i < vals.length; i++) {
                    vals[i] += inVec.vals[i];
                }
            }
        }
        class VectorAccumulatorV2 extends AccumulatorV2<MyVector,MyVector> {
            private MyVector selfVect = null;

            public VectorAccumulatorV2(int vecLen) {
                selfVect = new MyVector(vecLen);
            }

            @Override
            public boolean isZero() {
                for(int i = 0; i < selfVect.vals.length; i++) {
                    if(selfVect.vals[i] != 0) return false;
                }
                return true;
            }

            @Override
            public AccumulatorV2<MyVector, MyVector> copy() {
                // Return a new accumulator holding a copy of the current vector's values.
                VectorAccumulatorV2 ret = new VectorAccumulatorV2(selfVect.vals.length);
                ret.selfVect.add(selfVect);
                return ret;
            }

            @Override
            public void reset() {
                selfVect.reset();
            }

            @Override
            public void add(MyVector v) {
                selfVect.add(v);
            }

            @Override
            public void merge(AccumulatorV2<MyVector, MyVector> other) {
                MyVector minVec = null, maxVec = null;
                if(other.value().vals.length < selfVect.vals.length) {
                    minVec = other.value();
                    maxVec = selfVect;
                }
                else {
                    minVec = selfVect;
                    maxVec = other.value();
                }
                //TODO: merge together.
            }

            @Override
            public MyVector value() {
                return selfVect;
            }
        }
        VectorAccumulatorV2 myVecAcc = new VectorAccumulatorV2(5);
        sc.sc().register(myVecAcc, "MyVectorAcc1");


    }
}

After you click Run, you may hit two errors. The first one is this:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

This error can actually be ignored: even though no runnable Hadoop bin directory is found, the program keeps running; you simply cannot use HDFS.
If it bothers you and you want to get rid of the Error, it only takes a few steps:

  1. Download winutils.exe: the one I link to is the Hadoop 2.7 build on GitHub, which is different from the older versions you may find on other blogs (although in practice they seem to work the same).
  2. Set the environment variable: add the downloaded directory (the parent of bin) to the HADOOP_HOME environment variable.
  3. Restart IDEA (after some thought I decided to write this step down anyway).

If you like, you can set a breakpoint and trace the code that raises this error; you will find another fix, namely adding a configuration statement to the program so that HADOOP_HOME does not need to be set at all. This helps when your Windows machine already has a Hadoop installation you do not want to touch, or you do not want to drop the winutils files into its bin directory.
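A minimal sketch of that programmatic workaround, placed before the SparkContext is created. The path below is only an assumption; point it at whatever directory contains bin\winutils.exe on your machine:

        // Assumed location of the directory whose bin/ holds winutils.exe.
        System.setProperty("hadoop.home.dir", "G:/ImportantTools/hadoop-winutils");

        SparkConf conf = new SparkConf().setAppName("simple RDD opt");
        JavaSparkContext sc = new JavaSparkContext(conf);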

The second place that may report an error is:

JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");

It says the file cannot be found. In fact, running the code above does not require downloading any Spark release package at all, because Maven has already downloaded everything for us. I only use a file from the release package because I did not want to construct a separate data file. Just write the path correctly, including the drive letter, otherwise it will look inside the IDEA project directory by default.
After these two errors are resolved, you should be able to see the program run successfully.

When looking at the results you may find Spark's built-in log output too verbose. You can restrict it by changing the log level: change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console to stop Spark from printing INFO-level messages to the console; change INFO to DEBUG instead if you want more detail.
The content of log4j.properties is as follows:

log4j.rootLogger=${root.logger}
root.logger=WARN,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}

This file must be placed somewhere the program picks it up automatically, for example under the resources directory:

Location of log4j.properties, under the resources directory

With that in place, the log output looks much cleaner on the next run.

Initializing Spark

On the official site, this part mainly explains how to build a JavaSparkContext from a SparkConf in code. I already covered this in my earlier article, so I will not repeat it here; a minimal sketch is shown below.
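This sketch assumes local execution; if the master is already supplied via -Dspark.master=local[4] as configured above, the setMaster() call can be omitted:

        SparkConf conf = new SparkConf()
                .setAppName("simple RDD opt")
                .setMaster("local[4]"); // optional when passed via -Dspark.master
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build and operate on RDDs through sc ...
        sc.stop();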
Running Spark through the shell is likewise not the focus of this article; the focus is hands-on work with the RDD API.
So let's move on to the main topic: RDDs...

RDDs

For the concept and characteristics of RDDs, please refer to my article Spark Common Concepts mentioned above; I think it already summarizes them fairly well.

Creating RDDs

Broadly speaking, there are two ways to obtain an RDD:

  • Parallelize a collection in code
/** Distribute a local Scala collection to form an RDD. */
  def parallelize[T](list: java.util.List[T]): JavaRDD[T] =
    parallelize(list, sc.defaultParallelism)

Reading the source shows that parallelize() takes a List<T> as input and returns a JavaRDD<T>.

  • Read from a file
/**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(path: String): JavaRDD[String] = sc.textFile(path)

textFile() takes a string, which can be a concrete file or a directory. If it is a directory, all files under it are read automatically, and a JavaRDD<String> is returned.

  • Other APIs
    There are of course other APIs that produce RDDs, for example the very handy one that creates an empty RDD:
/** Get an RDD that has no partitions or elements. */
  def emptyRDD[T]: JavaRDD[T] = {
    implicit val ctag: ClassTag[T] = fakeClassTag
    JavaRDD.fromRDD(new EmptyRDD[T](sc))
  }

as well as:

/** Distribute a local Scala collection to form an RDD. */
  def parallelizePairs[K, V](list: java.util.List[Tuple2[K, V]]): JavaPairRDD[K, V] =
    parallelizePairs(list, sc.defaultParallelism)

/** Distribute a local Scala collection to form an RDD. */
  def parallelizeDoubles(list: java.util.List[java.lang.Double]): JavaDoubleRDD =
    parallelizeDoubles(list, sc.defaultParallelism)

All of the above are methods of the SparkContext / JavaSparkContext class; a short usage sketch follows.
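A minimal sketch of these creation APIs, assuming the same sc and README.md path used earlier in this article:

        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        JavaRDD<String> lines = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
        JavaRDD<String> empty = sc.emptyRDD();
        JavaPairRDD<Integer, String> pairs =
                sc.parallelizePairs(Arrays.asList(new Tuple2<>(1, "a"), new Tuple2<>(2, "b")));
        JavaDoubleRDD doubles = sc.parallelizeDoubles(Arrays.asList(1.0, 2.0, 3.0));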

RDD Operations

RDD operations come in two kinds: 1. transformations; 2. actions. A transformation maps each element of an RDD into something new: perhaps a one-to-one map, a one-to-N (N >= 0) flatMap, or a mapToPair that attaches a key to produce (Key, Value) pairs. An action, on the other hand, computes over the qualifying elements of the RDD and returns a value. Below are several commonly used transformation and action APIs, mainly from JavaRDD and JavaPairRDD.

  • Transformation
    • map
      This method applies an implementation of the Function interface to every element of the RDD and returns a new RDD.
/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[R](f: JFunction[T, R]): JavaRDD[R] =
    new JavaRDD(rdd.map(f)(fakeClassTag))(fakeClassTag)

For example, suppose I have a dataset in the format: id,value
Then sc.textFile("pathOfFile") directly gives an RDD of String elements, which need to be parsed according to that format; the concrete map call is:

//Lambda expression version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(s -> s.split(",", -1));
//Non-lambda version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(new Function<String, String[]>() {
            @Override
            public String[] call(String v1) throws Exception {
                return v1.split(",", -1);
            }
        });

However, to show the return types more clearly, I will not use the lambda form in the examples that follow.

  • filter
    This keeps the elements for which the given predicate returns true and drops those for which it returns false, producing a new RDD.
    Continuing the example above, suppose the dataset contains malformed or incomplete records, and we simply treat any record whose split array length is not 2 as one to discard:
        JavaRDD<String[]> strFiltedRdd = strArrayIdValue.filter(new Function<String[], Boolean>() {
            @Override
            public Boolean call(String[] v1) throws Exception {
                return v1.length == 2;
            }
        });
  • flatMap
    Similar to map, except that each element is not necessarily mapped to exactly one new element but to N of them, where N >= 0. Suppose the value in the example above is structured as value1|value2|value3...; then dropping the id and collecting each individual value into an RDD looks like this:
        JavaRDD<String> strValueNRdd = strFiltedRdd.flatMap(new FlatMapFunction<String[], String>() {
            @Override
            public Iterator<String> call(String[] strings) throws Exception {
                return Arrays.asList(strings[1].split("\\|", -1)).iterator();
            }
        });

This splits the collection of values inside each element into a new RDD in which every element holds exactly one value.

  • mapPartitions
    This maps the RDD one partition at a time. The source definition is:
  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   */
  def mapPartitions[U](f: FlatMapFunction[JIterator[T], U]): JavaRDD[U] = {
    def fn: (Iterator[T]) => Iterator[U] = {
      (x: Iterator[T]) => f.call(x.asJava).asScala
    }
    JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
  }

Rewriting the earlier map with mapPartitions gives:

        JavaRDD<String[]> strArrayIdValue = stringRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>() {
            @Override
            public Iterator<String[]> call(Iterator<String> stringIterator) throws Exception {
                ArrayList<String[]> arrList = new ArrayList<String[]>();
                // Consume every element of the partition, not just the first one.
                while (stringIterator.hasNext()) {
                    arrList.add(stringIterator.next().split(",", -1));
                }
                return arrList.iterator();
            }
        });
  • union
    This merges the input RDD with the RDD it is called on, producing a new RDD. It is useful when reading several independent files into a single RDD, for example:
        JavaRDD<String> unionAllFilesRdd = sc.emptyRDD();
        for(String name : fileNames) {
            unionAllFilesRdd = unionAllFilesRdd.union(sc.textFile(name));
        }
  • intersection
    Returns the intersection of this RDD and the input RDD, without duplicates.
    Source comment:
  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.intersection(other.rdd))
  • distinct
    Returns a new RDD containing only the distinct elements of this RDD, i.e. deduplication. It takes no arguments. Source (a one-line sketch follows):
  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): JavaRDD[T] = wrapRDD(rdd.distinct())
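A one-line sketch, assuming the wordsRdd built in the word-count example earlier:

        JavaRDD<String> distinctWords = wordsRdd.distinct(); // drop duplicate words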
  • subtract
    This method extracts the elements that exist in this RDD but not in the input RDD and combines them into a new RDD.
  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be less than or equal to us.
   */
  def subtract(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.subtract(other))

Combining the previous methods, suppose we have this task: add to RDD B the data from RDD A that differs from B, and replace the part of B that overlaps with A according to some condition (a sketch follows this list).
1. Compute A's difference with respect to B: AoutB = A.subtract(B)
2. Find the subset of A that satisfies the replacement condition: replaceCandidateA = A.filter(condition)
3. Find the subset of A that can actually replace something: realReplaceA = replaceCandidateA.intersection(B)
4. Find the subset of B to discard: discardB = B.intersection(realReplaceA)
5. Keep what remains of B after discarding: B = B.subtract(discardB)
6. Merge the replacement set and the newly added set into B: newB = B.union(realReplaceA).union(AoutB)
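A literal sketch of those six steps, assuming two JavaRDD<String> variables named A and B; the filter predicate here is purely illustrative:

        JavaRDD<String> AoutB = A.subtract(B);                              // 1. elements of A not in B
        JavaRDD<String> replaceCandidateA = A.filter(s -> s.length() > 2);  // 2. replacement candidates in A
        JavaRDD<String> realReplaceA = replaceCandidateA.intersection(B);   // 3. candidates that also exist in B
        JavaRDD<String> discardB = B.intersection(realReplaceA);            // 4. the part of B to discard
        JavaRDD<String> remainingB = B.subtract(discardB);                  // 5. B after discarding
        JavaRDD<String> newB = remainingB.union(realReplaceA).union(AoutB); // 6. final result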

  • mapToPair
    This API turns a JavaRDD into a JavaPairRDD, that is, it extracts a key. For the earlier id,value data, after parsing each line into an [id, value] String array we can build a JavaPairRDD keyed by id:
        JavaPairRDD<String, String> keyValuePairRdd = strFiltedRdd.mapToPair(new PairFunction<String[], String, String>() {
            @Override
            public Tuple2<String, String> call(String[] strings) throws Exception {
                return new Tuple2<>(strings[0], strings[1]);
            }
        });

The following APIs are specific to JavaPairRDD.

  • groupByKey
  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level.
   *
   * @note If you are grouping in order to perform an aggregation (such as a sum or average) over
   * each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
   * will provide much better performance.
   */
  def groupByKey(): JavaPairRDD[K, JIterable[V]] =
    fromRDD(groupByResultToJava(rdd.groupByKey()))

This API is not very pleasant to use, since you cannot customize how values are combined, and its execution details deserve attention; see Avoid GroupByKey and the articles comparing groupByKey and reduceByKey in depth.
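A minimal usage sketch, assuming the keyValuePairRdd built above; groupByKey collects all values of a key into an Iterable:

        JavaPairRDD<String, Iterable<String>> groupedRdd = keyValuePairRdd.groupByKey();
        groupedRdd.foreach(t -> System.out.println(t._1 + " -> " + t._2));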

  • reduceByKey
    This merges the values of the PairRDD per key; how two values are "added" is defined by an implementation of the Function2 interface.
    Source:
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: JFunction2[V, V, V]): JavaPairRDD[K, V] = {
    fromRDD(reduceByKey(defaultPartitioner(rdd), func))
  }

As you can see, reduceByKey does not change the type of V. For example, merging the values of the earlier PairRDD looks like this:

        JavaPairRDD<String, String> byIdValuesPairRdd = keyValuePairRdd.reduceByKey(new Function2<String, String, String>() {
            @Override
            public String call(String v1, String v2) throws Exception {
                return v1+"|"+v2;
            }
        });

這里的例子中V的類型是String,其實(shí)在實(shí)際用的時(shí)候也可以先將String map成ArrayList<String>然后再reduceByKey合成一個(gè)大的ArrayList。
那么有沒有一個(gè)方法可以從String直接transformation到ArrayList<String>呢?接著往下看吧。

  • aggregateByKey
    這個(gè)方法的入?yún)⒈容^多,主要原因是這個(gè)方法的目的是將RDD的每個(gè)元素按照Key合并成U,因?yàn)閁的類型不同于V,所以需要指明V如何和U合并(第二個(gè)入?yún)?/em>),以及U和U的合并方法(第三個(gè)入?yún)?/em>),而且還需要給出一個(gè)最初始的U(比如是一個(gè)空集合,或者是0對于整數(shù)想加,或者是1對于整數(shù)相乘;第一個(gè)入?yún)?/em>)。好戲來了,先看源碼:
  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's.
   * The former operation is used for merging values within a partition, and the latter is used for
   * merging values between partitions. To avoid memory allocation, both of these functions are
   * allowed to modify and return their first argument instead of creating a new U.
   */
  def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U], combFunc: JFunction2[U, U, U]):
      JavaPairRDD[K, U] = {
    implicit val ctag: ClassTag[U] = fakeClassTag
    fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
  }

Hands-on time:

        JavaPairRDD<String, ArrayList<String>> keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), new Function2<ArrayList<String>, String, ArrayList<String>>() {
            @Override
            public ArrayList<String> call(ArrayList<String> v1, String v2) throws Exception {
                v1.add(v2);
                return v1;
            }
        }, new Function2<ArrayList<String>, ArrayList<String>, ArrayList<String>>() {
            @Override
            public ArrayList<String> call(ArrayList<String> v1, ArrayList<String> v2) throws Exception {
                v1.addAll(v2);
                return v1;
            }
        });

With these three arguments we have achieved the String -> ArrayList<String> transformation.
Such a good opportunity to show off a lambda should of course not be missed:

    val initialSet = mutable.HashSet.empty[String]
    val addToSet = (s: mutable.HashSet[String], v: String) => s += v
    val mergePartitionSets = (p1: mutable.HashSet[String], p2: mutable.HashSet[String]) => p1 ++= p2

    val keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(initialSet)(addToSet, mergePartitionSets)

怎么畫風(fēng)突變成Scala了,我也沒辦法啊,我能有什么辦法,用Java實(shí)在寫不出來,不知道這個(gè)怎么轉(zhuǎn)成Lambda。。。。(Java藥丸啊)。如果有人知道這個(gè)用Java咋通過Lambda寫,麻煩在評論里告知下,不勝感謝!

        JavaPairRDD<String, ArrayList<String>> keyValuelistLambda = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), (uList,vStr) -> {uList.add(vStr); return uList;}, (u1, u2) -> {u1.addAll(u2); return u1;});

Er... I did come up with a version myself after all (anything else would be too embarrassing; I have only been writing Java seriously for three months, but that is no excuse), although it looks a bit odd. If you have a better way of writing it, you are very welcome to share it in the comments. Thanks!

  • sortByKey
    When K is sortable, this method sorts the RDD by key, ascending by default. Source (a quick sketch follows):
  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements in
   * ascending order. Calling `collect` or `save` on the resulting RDD will return or output an
   * ordered list of records (in the `save` case, they will be written to multiple `part-X` files
   * in the filesystem, in order of the keys).
   */
  def sortByKey(): JavaPairRDD[K, V] = sortByKey(true)
  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  def sortByKey(ascending: Boolean): JavaPairRDD[K, V] = {
    val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[K]]
    sortByKey(comp, ascending)
  }
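A quick sketch, assuming the wordCntRdd from the word-count example; passing false sorts in descending order:

        JavaPairRDD<String, Integer> sortedAsc = wordCntRdd.sortByKey();
        JavaPairRDD<String, Integer> sortedDesc = wordCntRdd.sortByKey(false);
        sortedAsc.take(5).forEach(t -> System.out.println(t._1 + ":" + t._2));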
  • join
    This method intersects two PairRDDs by key: if a key k exists in both this RDD and the input RDD, it is added to the returned RDD, whose elements have the form (k, (v1, v2)); the second part is a Tuple in which v1 comes from this RDD and v2 from the input RDD. Source:
  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, W)] =
    fromRDD(rdd.join(other, partitioner))
        JavaPairRDD<String, Tuple2<String, String>> joinRDD = byIdValuesPairRdd.join(keyValuePairRdd);
  • leftOuterJoin & rightOuterJoin & fullOuterJoin
    join also has three variant forms; let's see what each does in practice:
        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
                add(new Tuple2<>(7, "str77"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);

        JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
        joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));

        JavaPairRDD<Integer, Tuple2<String, Optional<String>>> leftOuterJoinRdd = paralPairRdd.leftOuterJoin(otherParalPairRdd);
        leftOuterJoinRdd.foreach(s -> System.out.println("leftOuterJoin*"+ s.toString()));

        JavaPairRDD<Integer, Tuple2<Optional<String>, String>> rightOuterJoinRdd = paralPairRdd.rightOuterJoin(otherParalPairRdd);
        rightOuterJoinRdd.foreach(s -> System.out.println("rightOuterJoin*"+ s.toString()));

        JavaPairRDD<Integer, Tuple2<Optional<String>, Optional<String>>> fullOuterJoinRdd = paralPairRdd.fullOuterJoin(otherParalPairRdd);
        fullOuterJoinRdd.foreach(s -> System.out.println("fullOuterJoin*"+ s.toString()));

The output of the code above is:

join*(2,(str2,str2))
join*(4,(str44,str4))

leftOuterJoin*(2,(str2,Optional[str2]))
leftOuterJoin*(4,(str44,Optional[str4]))
leftOuterJoin*(1,(str1,Optional.empty))
leftOuterJoin*(1,(str11,Optional.empty))

rightOuterJoin*(2,(Optional[str2],str2))
rightOuterJoin*(4,(Optional[str44],str4))
rightOuterJoin*(3,(Optional.empty,str3))
rightOuterJoin*(7,(Optional.empty,str77))
rightOuterJoin*(5,(Optional.empty,str5))

fullOuterJoin*(4,(Optional[str44],Optional[str4]))
fullOuterJoin*(2,(Optional[str2],Optional[str2]))
fullOuterJoin*(3,(Optional.empty,Optional[str3]))
fullOuterJoin*(7,(Optional.empty,Optional[str77]))
fullOuterJoin*(1,(Optional[str1],Optional.empty))
fullOuterJoin*(1,(Optional[str11],Optional.empty))
fullOuterJoin*(5,(Optional.empty,Optional[str5]))
As you can see, join is an intersection on keys; leftOuterJoin keeps every key of this RDD, rightOuterJoin keeps every key of the input RDD, and fullOuterJoin is the union. Also note that the order of keys is not guaranteed, only the order of values within a key.

  • cogroup
    This method groups two or more PairRDDs together (if several RDDs are passed in, all of them are combined). If a key of one RDD does not appear in another, an empty collection is recorded for that side. Talk is cheap, let's show you the codes and the run result.
    Source first:
  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (JIterable[V], JIterable[W])] =
    fromRDD(cogroupResultToJava(rdd.cogroup(other)))

There are several overloaded versions; counting the RDD the method is called on, up to four RDDs can be cogrouped at once.


The overloaded versions of cogroup

Hands-on:

        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
        JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<String>>> coGroupRdd = paralPairRdd.cogroup(otherParalPairRdd);
        coGroupRdd.foreach(s -> System.out.println("+++"+ s.toString()));

The output:

+++(3,([],[str3]))
+++(1,([str1, str11],[]))
+++(2,([str2],[str2]))
+++(4,([str44],[str4]))
+++(5,([],[str5]))
As you can see, the order of keys is not guaranteed, but the order of values within a key is.

  • intersection
    This method takes the intersection of this RDD and the input RDD. The difference from join is that join also keeps pairs whose keys match but whose values differ, while intersection only keeps elements whose key and value are both the same:
  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
    new JavaPairRDD[K, V](rdd.intersection(other.rdd))

An example:

        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
                add(new Tuple2<>(7, "str77"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);

        JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
        joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));

        JavaPairRDD<Integer, String> intersectRdd = paralPairRdd.intersection(otherParalPairRdd);
        intersectRdd.foreach(s -> System.out.println("intersection*" + s.toString()));

Output:

join*(4,(str44,str4))
join*(2,(str2,str2))

intersection*(2,str2)
So intersection does not change the element type, but join does, because the value becomes a Tuple.

  • subtract
    This method extracts the elements that exist in this RDD but not in the input RDD, forming a new RDD of the same element type. The comparison is not based on the key alone; it also includes the value. As long as either the key or the value differs, the element is kept as part of the returned RDD.
  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
    fromRDD(rdd.subtract(other))

Example code:

        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
                add(new Tuple2<>(7, "str77"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);

        JavaPairRDD<Integer, String> substractRdd = paralPairRdd.subtract(otherParalPairRdd);
        substractRdd.foreach(s -> System.out.println("substract*" + s.toString()));

Output:
substract*(4,str44)
substract*(1,str1)
substract*(1,str11)
But beware of a pitfall: the element type here is Tuple2<Integer, String>. If it were Tuple2<Integer, String[]>, two elements would still be treated as different even when the String[] contents are identical, because arrays are compared by reference. Keep that in mind!

  • coalesce
    This method reduces the RDD's partitions down to the given number. It is said to be quite efficient (by default it avoids a shuffle), although in my experience it alone did not get the data saved into a single output file in the end. A short sketch follows the source below.
  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   */
  def coalesce(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.coalesce(numPartitions))
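A short sketch, assuming the wordCntRdd built earlier; coalesce only shrinks the partition count (without a shuffle it cannot increase it):

        JavaPairRDD<String, Integer> fewerParts = wordCntRdd.coalesce(2);
        System.out.println("partitions after coalesce: " + fewerParts.getNumPartitions());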
  • repartition
    This method reshuffles the RDD's data randomly to create more or fewer partitions and balance the data across them. This always shuffles all data over the network.
 /**
  * Return a new RDD that has exactly numPartitions partitions.
  *
  * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
  * a shuffle to redistribute data.
  *
  * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
  * which can avoid performing a shuffle.
  */
 def repartition(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.repartition(numPartitions))

I sometimes use it to save the final result RDD into a single part file:

     coGroupRdd.repartition(1).saveAsTextFile(fileName);
  • Actions
    • reduce
      This is similar to reduceByKey, but the output of reduceByKey is still an RDD (the collection of new per-key values), whereas reduce aggregates over all elements of the RDD into a single output that is no longer an RDD. That is why reduce is classified as an action, while reduceByKey is a transformation.
  /**
   * Reduces the elements of this RDD using the specified commutative and associative binary
   * operator.
   */
  def reduce(f: JFunction2[T, T, T]): T = rdd.reduce(f)

In practice:

        JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
        JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
        intLineLength.persist(StorageLevel.MEMORY_ONLY());
        int totalLen = intLineLength.reduce((a, b) -> a + b);
  • collect
    Pay particular attention to the note in the source:
  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note this method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): JList[T] =
    rdd.collect().toSeq.asJava
  • collectAsMap
    This API is specific to JavaPairRDD; it returns a Map of the K-V mappings in the original RDD.
/**
   * Return the key-value pairs in this RDD to the master as a Map.
   *
   * @note this method should only be used if the resulting data is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collectAsMap(): java.util.Map[K, V] = mapAsSerializableJavaMap(rdd.collectAsMap())

不過這里需要注意一點(diǎn),這個(gè)返回的Map類型在廣播broadcast中可能會有問題。
比如:

                final Map<String, MyInfoClass> kvMap = keyValuePairRDD.collectAsMap();
                final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);

The code above collects a JavaPairRDD into a Map and then broadcasts it so that other executors can use it inside operations such as map(). Written this way, however, there is a certain chance of hitting this error:

17/06/14 11:43:53 INFO scheduler.DAGScheduler: ShuffleMapStage 1 (repartition at FromBSID2Gps.java:214) failed in 1.182 s due to Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 19, s36.dc.taiyear, executor 3): java.io.IOException: java.lang.UnsupportedOperationException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1213)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
.... ....
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException
at java.util.AbstractMap.put(AbstractMap.java:203)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:217)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1206)
... 20 more
So, to avoid the broadcast value being deserialized as an AbstractMap, explicitly specify the Map type, as in the following code.

                Map<String, MyInfoClass> kvMap = new HashMap<>();
                kvMap.putAll(keyValuePairRDD.collectAsMap());
                final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);
  • count
  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = rdd.count()
  • first
  /**
   * Return the first element in this RDD.
   */
  def first(): T = rdd.first()
  • take
  /**
   * Take the first num elements of the RDD. This currently scans the partitions *one by one*, so
   * it will be slow if a lot of partitions are required. In that case, use collect() to get the
   * whole RDD instead.
   *
   * @note this method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def take(num: Int): JList[T] =
    rdd.take(num).toSeq.asJava
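A quick sketch of these simple actions, assuming the stringRdd read from README.md earlier:

        long numLines = stringRdd.count();                    // number of elements
        String firstLine = stringRdd.first();                 // first element
        java.util.List<String> firstFive = stringRdd.take(5); // first five elements, pulled to the driver
        System.out.println(numLines + " lines, first: " + firstLine + ", sample: " + firstFive);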
  • saveAsTextFile
    This does not save into a single file but into a directory, generating as many files as there are partitions.
  /**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String): Unit = {
    rdd.saveAsTextFile(path)
  }
  • countByKey
    Returns the counts as a hash Map.
  /** Count the number of elements for each key, and return the result to the master as a Map. */
  def countByKey(): java.util.Map[K, jl.Long] =
    mapAsSerializableJavaMap(rdd.countByKey()).asInstanceOf[java.util.Map[K, jl.Long]]
  • foreach
  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: VoidFunction[T]) {
    rdd.foreach(x => f.call(x))
  }

Other RDD Types

Class hierarchy inherited from JavaRDDLike

A brief look at one of them:

  • JavaDoubleRDD
Even a quick glance shows that this RDD mainly provides statistics functionality over numeric data.
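A small sketch of its statistics helpers, assuming the same sc as above:

        JavaDoubleRDD doubleRdd = sc.parallelizeDoubles(Arrays.asList(1.0, 2.0, 3.0, 4.0));
        System.out.println("sum = " + doubleRdd.sum()
                + ", mean = " + doubleRdd.mean()
                + ", stdev = " + doubleRdd.stdev()
                + ", stats = " + doubleRdd.stats());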

Shared Variables

Because a Spark application is serialized and distributed to the worker nodes for execution, special mechanisms are needed to obtain an effective, consistent global variable across the workers.
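Both mechanisms already appear in the full listing above; here is a compact sketch that combines a read-only broadcast variable with a LongAccumulator counting on the executors:

        Broadcast<double[]> factors = sc.broadcast(new double[]{1.1, 2.2, 3.3});
        LongAccumulator counter = sc.sc().longAccumulator("counter");
        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).foreach(x -> {
            double scaled = factors.value()[0] * x; // read the broadcast value on the executor
            counter.add(1);                         // count processed elements on the executor
        });
        System.out.println("processed " + counter.value() + " elements");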
