Introduction
This article is largely based on the Spark Programming Guide on the official Spark website. A previous article already covers the basic concepts involved, so they are not repeated here (see Spark常用概念).
The goal of this article is to walk through the commonly used APIs of JavaRDD and JavaPairRDD by reading code.
Let's get started...
Linking with Spark
Use Maven or SBT to create a local Java/Scala application project.
The following shows how to build and run Spark Java code standalone on Windows (the Scala version is similar).
Using IDEA
Create a new Maven project, with a pom.xml like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.paulHome.app</groupId>
<artifactId>learnSparkJavaApi</artifactId>
<version>1.0</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>com.github.fommil.netlib</groupId>
<artifactId>all</artifactId>
<version>1.1.2</version>
<type>pom</type>
</dependency>
</dependencies>
</project>
This pom is fairly complete: SQL, Streaming, and MLlib are all included. For other Maven-related configuration, see the other article 搭建虛擬機(jī)Spark環(huán)境. One more tip: in IDEA's Maven settings, enable automatic download of source files, which makes it much easier to read and study the Spark source later. The biggest benefit of debugging Spark programs locally is that you can set breakpoints and step through the source code to understand it.
Then create your project files however you like; my own setup is shown in the figure below. (Another side note: when installing the JDK, do not put it under the default Program Files directory, which contains a space. That is a trap you will step into if you also need HDFS, i.e. when you install Hadoop. I have not changed mine yet because I am not sure whether I will need Hadoop at home, but if I do I will certainly move it.)
Next, configure the Run settings. The key part is the VM options: -Dspark.master=local[4] (probably the first big takeaway of this article).
To make things easier to follow, here is the Java source code. It was written as I worked through the example statements on the official site one by one, so it has no particular theme; it is the code of my very first Spark application (yes, I have only just started learning).
/**
* Created by Paul Yang on 2017/4/15.
*/
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.util.AccumulatorV2;
import org.apache.spark.util.LongAccumulator;
import scala.Tuple2;
import scala.collection.immutable.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Pattern;
public class simpleRddMain {
//Used to sum
static int countSum = 0;
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("simple RDD opt")
.set("spark.hadoop.validateOutputSpecs", "false");
JavaSparkContext sc = new JavaSparkContext(conf);
//Parallelize a collection into an RDD
ArrayList<Integer> intList = new ArrayList<Integer>(){{
add(1);
add(2);
add(3);
add(4);
add(5);
}};
JavaRDD<Integer> integerRdd = sc.parallelize(intList); // Get an RDD from a list.
System.out.println("Integer RDD:");
integerRdd.collect();
//Lambda expressions
JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
intLineLength.persist(StorageLevel.MEMORY_ONLY());
int totalLen = intLineLength.reduce((a, b) -> a + b);
System.out.println("Lines(" + stringRdd.count() + ")<<<Lambda expressions>>>: Total len = " + totalLen);
//anonymous inner class or a name one
class GetLenFunc implements Function<String, Integer> {
@Override
public Integer call(String s) throws Exception {
return s.length();
}
}
JavaRDD<Integer> funcLineLengths = stringRdd.map( new GetLenFunc() );
int funcTotalLen = funcLineLengths.reduce( new Function2<Integer, Integer, Integer>() {
public Integer call (Integer a, Integer b) {return a + b;}
});
System.out.println("<<<anonymous inner class or a name one>>>: Total Len = " + funcTotalLen);
//Wordcount Process
// JavaRDD<String> wordsRdd = stringRdd.flatMap(new FlatMapFunction<String, String>() {
// @Override
// public Iterator<String> call(String line) throws Exception {
// return Arrays.asList( line.split(" ")).iterator();
// }
// });
JavaRDD<String> wordsRdd = stringRdd.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
JavaPairRDD<String, Integer> eachWordRdd = wordsRdd.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> wordCntRdd = eachWordRdd.reduceByKey( (a, b) -> a + b );
wordCntRdd.collect();
wordCntRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
@Override
public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
System.out.println(stringIntegerTuple2._1 + "@@@" + stringIntegerTuple2._2);
}
});
//Understanding closures
integerRdd.foreach(new VoidFunction<Integer>() {
@Override
public void call(Integer integer) throws Exception {
countSum += integer.intValue();
}
});
System.out.println("#~~~~~scope and life cycle of variables and methods~~~~~~# countSum = " + countSum);
//Working with Key-Value Pairs
JavaPairRDD<String, Integer> strIntPairRdd = stringRdd.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> strCountRdd = strIntPairRdd.reduceByKey((a, b) -> a + b);
//strCountRdd.sortByKey();
strCountRdd.collect();
System.out.println("###Working with Key-Value Pairs### :" + strCountRdd.toString());
strCountRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
@Override
public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
System.out.println(stringIntegerTuple2._1 + ":" + stringIntegerTuple2._2);
}
});
//Broadcast Variables
Broadcast<double[]> broadcastVar = sc.broadcast(new double[] {1.1, 2.2, 3.3});
broadcastVar.value();
//Accumulator
LongAccumulator longAccum = sc.sc().longAccumulator();
integerRdd.foreach(x -> longAccum.add(x));
System.out.println("\n\n\nAccumulator: " + longAccum.value() + "\n\n\n\n");
//AccumulatorV2
class MyVector {
double[] vals;
public MyVector(int vecLen) {
vals = new double[vecLen];
}
public void reset() {
for(int i = 0; i < vals.length; i++) {
vals[i] = 0;
}
}
public void add(MyVector inVec) {
for(int i = 0; i < vals.length; i++) {
vals[i] += inVec.vals[i];
}
}
}
class VectorAccumulatorV2 extends AccumulatorV2<MyVector,MyVector> {
private MyVector selfVect = null;
public VectorAccumulatorV2(int vecLen) {
selfVect = new MyVector(vecLen);
}
@Override
public boolean isZero() {
for(int i = 0; i < selfVect.vals.length; i++) {
if(selfVect.vals[i] != 0) return false;
}
return true;
}
@Override
public AccumulatorV2<MyVector, MyVector> copy() {
// Copy the current state instead of recursively calling copy(), which would never terminate.
VectorAccumulatorV2 ret = new VectorAccumulatorV2(selfVect.vals.length);
ret.selfVect.add(selfVect);
return ret;
}
@Override
public void reset() {
selfVect.reset();
}
@Override
public void add(MyVector v) {
selfVect.add(v);
}
@Override
public void merge(AccumulatorV2<MyVector, MyVector> other) {
MyVector minVec = null, maxVec = null;
if(other.value().vals.length < selfVect.vals.length) {
minVec = other.value();
maxVec = selfVect;
}
else {
minVec = selfVect;
maxVec = other.value();
}
//TODO: merge together.
}
@Override
public MyVector value() {
return selfVect;
}
}
VectorAccumulatorV2 myVecAcc = new VectorAccumulatorV2(5);
sc.sc().register(myVecAcc, "MyVectorAcc1");
}
}
After clicking Run you may hit two errors. The first is this one:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This error can actually be ignored: even though no runnable Hadoop bin is found, the program keeps going; you just cannot use HDFS.
If it bothers you and you want to get rid of the Error, it only takes a couple of steps:
- Download winutils.exe: the link here is the GitHub one for Hadoop 2.7, which is not the same as the older versions found on other blogs (although in practice they seem to work equally well).
- Set the environment variable: add the downloaded directory (the parent directory of bin) to the HADOOP_HOME environment variable.
- Restart IDEA (after some hesitation I decided to write this step down anyway).
If you like, you can step into the code that reports the error, add a breakpoint, and you will find another way to fix it: add a configuration statement in the program itself so that HADOOP_HOME does not need to be set at all. This is useful if your Windows machine already has Hadoop and you do not want to change it or add the winutils files to its bin directory.
The second error will probably be here:
JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
It reports that the file cannot be found. In fact, running the code above does not require downloading any release package from the Spark website at all, because Maven has already fetched everything for us; the file from the release package is used here only because I did not want to construct a separate data file. Just fix the path, and remember to include the drive letter, otherwise it will be looked up relative to the IDEA project directory.
Once these two errors are resolved you should be able to see the program run to completion.
When looking at the results you may find Spark's own log output too verbose. You can restrict it by lowering the log level: change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console to stop Spark from printing INFO-level logs to the console. If you want more detail instead, change INFO to DEBUG.
The content of log4j.properties is as follows:
log4j.rootLogger=${root.logger}
root.logger=WARN,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}
This file needs to go somewhere the program loads automatically, for example the resources directory:
With that in place, the log output looks much cleaner on the next run.
Initializing Spark
On the official site, this part mainly explains how to create a JavaSparkContext from a SparkConf in code. I already covered this in a previous article, so it is not repeated here.
Running Spark from the shell is also not the focus of this article; the focus is learning the hands-on RDD API.
So on to the main topic: RDDs...
RDDs
For the concept and characteristics of RDDs, again see my earlier article 《Spark常用概念》, which I think already summarizes them well.
Creating RDDs
Overall there are two ways to get an RDD:
- Parallelize a collection in code
/** Distribute a local Scala collection to form an RDD. */
def parallelize[T](list: java.util.List[T]): JavaRDD[T] =
parallelize(list, sc.defaultParallelism)
Reading the source shows that parallelize() takes a List<T> and returns a JavaRDD<T>.
- Read from a file
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(path: String): JavaRDD[String] = sc.textFile(path)
The textFile() method takes a string, which can be a specific file or a directory. If it is a directory, every file under it is read automatically, and a JavaRDD<String> is returned.
- Other APIs
There are also some other APIs for creating RDDs, for example the very useful one that creates an empty RDD:
/** Get an RDD that has no partitions or elements. */
def emptyRDD[T]: JavaRDD[T] = {
implicit val ctag: ClassTag[T] = fakeClassTag
JavaRDD.fromRDD(new EmptyRDD[T](sc))
}
as well as:
/** Distribute a local Scala collection to form an RDD. */
def parallelizePairs[K, V](list: java.util.List[Tuple2[K, V]]): JavaPairRDD[K, V] =
parallelizePairs(list, sc.defaultParallelism)
/** Distribute a local Scala collection to form an RDD. */
def parallelizeDoubles(list: java.util.List[java.lang.Double]): JavaDoubleRDD =
parallelizeDoubles(list, sc.defaultParallelism)
All of the methods above live on the JavaSparkContext class (which wraps SparkContext).
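As a quick illustration, here is a minimal sketch of calling these factory methods from Java, reusing the sc from the main example above (the variable names are made up):
JavaRDD<Integer> emptyInts = sc.emptyRDD();
JavaPairRDD<String, Integer> pairsRdd = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1),
        new Tuple2<>("b", 2)));
JavaDoubleRDD doublesRdd = sc.parallelizeDoubles(Arrays.asList(1.0, 2.5, 3.7));
System.out.println("mean = " + doublesRdd.mean());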
RDD Operations
There are two kinds of RDD operations: 1. transformations; 2. actions. A transformation maps each element of the RDD into something else: perhaps a 1-to-1 map, perhaps a 1-to-N (N >= 0) flatMap, or perhaps a mapToPair that attaches a key to form (Key, Value) pairs. An action computes over the qualifying elements of the RDD and returns a value. The most commonly used transformation and action APIs are introduced below.
They come mainly from JavaRDD and JavaPairRDD.
- Transformation
- map
This method applies an implementation of the Function interface to every element of the RDD and returns a new RDD.
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[R](f: JFunction[T, R]): JavaRDD[R] =
new JavaRDD(rdd.map(f)(fakeClassTag))(fakeClassTag)
For example, suppose the dataset format is: id,value
Then sc.textFile("pathOfFile") directly yields an RDD of String elements, which need to be parsed according to that format; the concrete map call is:
//Lambda version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(s -> s.split(",", -1));
//Non-Lambda version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(new Function<String, String[]>() {
@Override
public String[] call(String v1) throws Exception {
return v1.split(",", -1);
}
});
From here on, to show the return types more clearly, I will stop writing the examples in Lambda style.
- filter
This keeps the elements for which the filter function returns true and drops those for which it returns false, producing a new RDD.
Continuing the example above, suppose the dataset contains malformed or incomplete records, and we simply treat any String array whose length is not 2 as data to throw away. Then:
JavaRDD<String[]> strFiltedRdd = strArrayIdValue.filter(new Function<String[], Boolean>() {
@Override
public Boolean call(String[] v1) throws Exception {
return v1.length == 2;
}
});
- flatMap
Similar to map, except that each element is not necessarily mapped to exactly one new element but 1-to-N, where N >= 0. Suppose the value field in the example above is structured as value1|value2|value3... Then dropping the id and saving the individual values into an RDD looks like this:
JavaRDD<String> strValueNRdd = strFiltedRdd.flatMap(new FlatMapFunction<String[], String>() {
@Override
public Iterator<String> call(String[] strings) throws Exception {
return Arrays.asList(strings[1].split("\\|", -1)).iterator();
}
});
This splits the collection of values in each element into a new RDD where every element contains exactly one value.
- mapPartitions
This applies the mapping per partition of the RDD. The source definition is:
/**
* Return a new RDD by applying a function to each partition of this RDD.
*/
def mapPartitions[U](f: FlatMapFunction[JIterator[T], U]): JavaRDD[U] = {
def fn: (Iterator[T]) => Iterator[U] = {
(x: Iterator[T]) => f.call(x.asJava).asScala
}
JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
Converting the earlier map call to this method gives:
JavaRDD<String[]> strArrayIdValue = stringRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>() {
@Override
public Iterator<String[]> call(Iterator<String> stringIterator) throws Exception {
ArrayList<String[]> arrList = new ArrayList<String[]>();
// Consume every element of the partition, not just the first one.
while (stringIterator.hasNext()) {
arrList.add(stringIterator.next().split(",", -1));
}
return arrList.iterator();
}
});
- union
This merges the input RDD with the RDD it is called on and produces a new RDD. It is handy for reading several independent files into a single RDD, for example:
JavaRDD<String> unionAllFilesRdd = sc.emptyRDD();
for(String name : fileNames) {
unionAllFilesRdd = unionAllFilesRdd.union(sc.textFile(name));
}
- intersection
Returns the intersection of the input RDD and this RDD, without duplicates.
Source comment:
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
def intersection(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.intersection(other.rdd))
- distinct
Returns a new RDD containing only the distinct elements of this RDD, i.e. a deduplication. It takes no arguments.
Source:
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(): JavaRDD[T] = wrapRDD(rdd.distinct())
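A minimal usage sketch, reusing the wordsRdd from the word-count code above:
JavaRDD<String> distinctWordsRdd = wordsRdd.distinct();
System.out.println("distinct words: " + distinctWordsRdd.count());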
- subtract
This method returns the elements that exist in this RDD but not in the input RDD, collected into a new RDD.
/**
* Return an RDD with the elements from `this` that are not in `other`.
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be less than or equal to us.
*/
def subtract(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.subtract(other))
Combining the previous methods, suppose the task is: add the data in RDD A that differs from RDD B into B, and replace the overlapping part of B with A according to some condition (a sketch of the whole flow follows this list).
1. Get the part of A that is not in B: AoutB = A.subtract(B)
2. Find the subset of A that satisfies the replacement condition: replaceCandidateA = A.filter(condition)
3. Find the subset of A that can actually replace something: realReplaceA = replaceCandidateA.intersection(B)
4. Find the subset of B to discard: discardB = B.intersection(realReplaceA)
5. What remains of B after discarding: B = B.subtract(discardB)
6. Merge the replacement set and the newly added set into B: newB = B.union(realReplaceA).union(AoutB)
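Here is a minimal sketch of those six steps in Java, assuming A and B are JavaRDD<String> and using a made-up replacement condition purely for illustration:
JavaRDD<String> A = sc.parallelize(Arrays.asList("a", "b", "c"));
JavaRDD<String> B = sc.parallelize(Arrays.asList("b", "c", "d"));
JavaRDD<String> AoutB = A.subtract(B);                                     // 1. part of A not in B
JavaRDD<String> replaceCandidateA = A.filter(s -> s.compareTo("b") >= 0);  // 2. hypothetical condition
JavaRDD<String> realReplaceA = replaceCandidateA.intersection(B);          // 3. candidates that also exist in B
JavaRDD<String> discardB = B.intersection(realReplaceA);                   // 4. part of B to drop
JavaRDD<String> remainingB = B.subtract(discardB);                         // 5. B after discarding
JavaRDD<String> newB = remainingB.union(realReplaceA).union(AoutB);        // 6. final result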
- mapToPair
This API converts a JavaRDD into a JavaPairRDD, i.e. it pulls out a key. With the earlier id,value data, after converting each line into an [id, value] String array you can build a JavaPairRDD keyed by id:
JavaPairRDD<String, String> keyValuePairRdd = strFiltedRdd.mapToPair(new PairFunction<String[], String, String>() {
@Override
public Tuple2<String, String> call(String[] strings) throws Exception {
return new Tuple2<>(strings[0], strings[1]);
}
});
Next come some APIs that only exist on JavaPairRDD.
- groupByKey
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level.
*
* @note If you are grouping in order to perform an aggregation (such as a sum or average) over
* each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
* will provide much better performance.
*/
def groupByKey(): JavaPairRDD[K, JIterable[V]] =
fromRDD(groupByResultToJava(rdd.groupByKey()))
This API is not very pleasant to use: you cannot customize how the values are combined, and its execution details deserve attention. See Avoid GroupByKey and 深入理解groupByKey、reduceByKey.
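For completeness, a minimal usage sketch on the keyValuePairRdd built above, just to show the return type (reduceByKey below is usually the better choice):
JavaPairRDD<String, Iterable<String>> groupedRdd = keyValuePairRdd.groupByKey();
groupedRdd.foreach(t -> System.out.println(t._1 + " -> " + t._2));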
- reduceByKey
For each key in the PairRDD, this "adds up" the values; the concrete "addition" is defined by a class implementing the Function2 interface.
Source:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: JFunction2[V, V, V]): JavaPairRDD[K, V] = {
fromRDD(reduceByKey(defaultPartitioner(rdd), func))
}
Note that reduceByKey does not change the type of V. For example, merging the values of the earlier PairRDD looks like this:
JavaPairRDD<String, String> byIdValuesPairRdd = keyValuePairRdd.reduceByKey(new Function2<String, String, String>() {
@Override
public String call(String v1, String v2) throws Exception {
return v1+"|"+v2;
}
});
In this example V is a String. In practice you could also map each String into an ArrayList<String> first and then reduceByKey them into one big ArrayList.
So, is there a method that transforms a String directly into an ArrayList<String>? Read on.
- aggregateByKey
This method takes quite a few arguments, mainly because its goal is to combine the elements of the RDD per key into a U whose type differs from V. So you have to specify how a V is merged into a U (the second argument), how two U's are merged (the third argument), and an initial U (for example an empty collection, 0 for integer addition, or 1 for integer multiplication; the first argument). Here comes the good part; source first:
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's.
* The former operation is used for merging values within a partition, and the latter is used for
* merging values between partitions. To avoid memory allocation, both of these functions are
* allowed to modify and return their first argument instead of creating a new U.
*/
def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U], combFunc: JFunction2[U, U, U]):
JavaPairRDD[K, U] = {
implicit val ctag: ClassTag[U] = fakeClassTag
fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
}
Now for some hands-on practice:
JavaPairRDD<String, ArrayList<String>> keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), new Function2<ArrayList<String>, String, ArrayList<String>>() {
@Override
public ArrayList<String> call(ArrayList<String> v1, String v2) throws Exception {
v1.add(v2);
return v1;
}
}, new Function2<ArrayList<String>, ArrayList<String>, ArrayList<String>>() {
@Override
public ArrayList<String> call(ArrayList<String> v1, ArrayList<String> v2) throws Exception {
v1.addAll(v2);
return v1;
}
});
These three arguments accomplish the String -> ArrayList<String> transformation.
Such a good opportunity to show off a Lambda version cannot be missed:
val initialSet = mutable.HashSet.empty[String]
val addToSet = (s: mutable.HashSet[String], v: String) => s += v
val mergePartitionSets = (p1: mutable.HashSet[String], p2: mutable.HashSet[String]) => p1 ++= p2
val keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
Why did the style suddenly switch to Scala? I could not help it: at first I simply could not write this as a Java Lambda and did not know how to convert it... (is Java doomed?). If anyone knows how to write this with a Java Lambda, please let me know in the comments, many thanks!
JavaPairRDD<String, ArrayList<String>> keyValuelistLambda = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), (uList,vStr) -> {uList.add(vStr); return uList;}, (u1, u2) -> {u1.addAll(u2); return u1;});
Er... I did come up with a way myself after all (it would have been too embarrassing otherwise; I have only seriously written Java for three months, but that is no excuse), although it looks a bit odd. If anyone has a nicer way to write it, you are very welcome to share it in the comments. Thanks!
- sortByKey
When K is sortable, this method sorts the RDD by key, ascending by default. Source:
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements in
* ascending order. Calling `collect` or `save` on the resulting RDD will return or output an
* ordered list of records (in the `save` case, they will be written to multiple `part-X` files
* in the filesystem, in order of the keys).
*/
def sortByKey(): JavaPairRDD[K, V] = sortByKey(true)
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
def sortByKey(ascending: Boolean): JavaPairRDD[K, V] = {
val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[K]]
sortByKey(comp, ascending)
}
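A minimal usage sketch on the wordCntRdd from the word-count code; the boolean overload picks the direction:
JavaPairRDD<String, Integer> ascWordCntRdd = wordCntRdd.sortByKey();        // ascending
JavaPairRDD<String, Integer> descWordCntRdd = wordCntRdd.sortByKey(false);  // descending
System.out.println(ascWordCntRdd.first());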
- join
This method intersects two PairRDDs by key: if a key k exists in both this RDD and the input RDD, it goes into the returned RDD, and each element of the result is k, (v1, v2), where the second part is a Tuple with v1 from this RDD and v2 from the input RDD. Source:
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, W)] =
fromRDD(rdd.join(other, partitioner))
JavaPairRDD<String, Tuple2<String, String>> joinRDD = byIdValuesPairRdd.join(keyValuePairRdd);
- leftOuterJoin & rightOuterJoin & fullOuterJoin
join has three variants; their behavior is easiest to see in practice:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
add(new Tuple2<>(7, "str77"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));
JavaPairRDD<Integer, Tuple2<String, Optional<String>>> leftOuterJoinRdd = paralPairRdd.leftOuterJoin(otherParalPairRdd);
leftOuterJoinRdd.foreach(s -> System.out.println("leftOuterJoin*"+ s.toString()));
JavaPairRDD<Integer, Tuple2<Optional<String>, String>> rightOuterJoinRdd = paralPairRdd.rightOuterJoin(otherParalPairRdd);
rightOuterJoinRdd.foreach(s -> System.out.println("rightOuterJoin*"+ s.toString()));
JavaPairRDD<Integer, Tuple2<Optional<String>, Optional<String>>> fullOuterJoinRdd = paralPairRdd.fullOuterJoin(otherParalPairRdd);
fullOuterJoinRdd.foreach(s -> System.out.println("fullOuterJoin*"+ s.toString()));
The code above prints:
join*(2,(str2,str2))
join*(4,(str44,str4))
leftOuterJoin*(2,(str2,Optional[str2]))
leftOuterJoin*(4,(str44,Optional[str4]))
leftOuterJoin*(1,(str1,Optional.empty))
leftOuterJoin*(1,(str11,Optional.empty))
rightOuterJoin*(2,(Optional[str2],str2))
rightOuterJoin*(4,(Optional[str44],str4))
rightOuterJoin*(3,(Optional.empty,str3))
rightOuterJoin*(7,(Optional.empty,str77))
rightOuterJoin*(5,(Optional.empty,str5))
fullOuterJoin*(4,(Optional[str44],Optional[str4]))
fullOuterJoin*(2,(Optional[str2],Optional[str2]))
fullOuterJoin*(3,(Optional.empty,Optional[str3]))
fullOuterJoin*(7,(Optional.empty,Optional[str77]))
fullOuterJoin*(1,(Optional[str1],Optional.empty))
fullOuterJoin*(1,(Optional[str11],Optional.empty))
fullOuterJoin*(5,(Optional.empty,Optional[str5]))
So join takes the intersection; leftOuterJoin keeps every key of this RDD together with its value set; rightOuterJoin keeps every key of the input RDD; fullOuterJoin is the union. Also note that the order of keys is not guaranteed, only the order of values within a key.
- cogroup
This method combines two or more PairRDDs (if several RDDs are passed in, all of them are combined). If a key from one RDD does not appear in another, an empty collection is recorded for it. Talk is cheap, let's show you the code and the run result.
Source first:
/**
* For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
* list of values for that key in `this` as well as `other`.
*/
def cogroup[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (JIterable[V], JIterable[W])] =
fromRDD(cogroupResultToJava(rdd.cogroup(other)))
There are several overloaded versions; counting the RDD the method is called on, up to 4 RDDs can be cogrouped at once.
Hands-on practice:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<String>>> coGroupRdd = paralPairRdd.cogroup(otherParalPairRdd);
coGroupRdd.foreach(s -> System.out.println("+++"+ s.toString()));
The output:
+++(3,([],[str3]))
+++(1,([str1, str11],[]))
+++(2,([str2],[str2]))
+++(4,([str44],[str4]))
+++(5,([],[str5]))
You can see that the order of keys is not guaranteed, but the order of values within a key is.
- intersection
This method returns the intersection of this RDD and the input RDD. The difference from join is that join keeps pairs whose keys match even if the values differ, whereas intersection only keeps elements whose key and value are both equal:
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
def intersection(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
new JavaPairRDD[K, V](rdd.intersection(other.rdd))
An example:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
add(new Tuple2<>(7, "str77"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));
JavaPairRDD<Integer, String> intersectRdd = paralPairRdd.intersection(otherParalPairRdd);
intersectRdd.foreach(s -> System.out.println("intersection*" + s.toString()));
Output:
join*(4,(str44,str4))
join*(2,(str2,str2))
intersection*(2,str2)
So intersection does not change the element type of the result, but join does, because the value is turned into a Tuple.
- subtract
This method takes the elements that exist in this RDD but not in the input RDD and forms a new RDD of the same element type. Note that "different" is judged not only by the key but also by the value: if either the key or the value differs, the element is kept as part of the returned RDD.
/**
* Return an RDD with the elements from `this` that are not in `other`.
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be <= us.
*/
def subtract(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
fromRDD(rdd.subtract(other))
Example code:
ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(1, "str1"));
add(new Tuple2<>(1, "str11"));
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(4, "str44"));
};
};
JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);
ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
{
add(new Tuple2<>(2, "str2"));
add(new Tuple2<>(3, "str3"));
add(new Tuple2<>(4, "str4"));
add(new Tuple2<>(5, "str5"));
add(new Tuple2<>(7, "str77"));
}
};
JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
JavaPairRDD<Integer, String> substractRdd = paralPairRdd.subtract(otherParalPairRdd);
substractRdd.foreach(s -> System.out.println("substract*" + s.toString()));
Output:
substract*(4,str44)
substract*(1,str1)
substract*(1,str11)
But watch out for a pitfall: here the element type is Tuple2<Integer, String>. If it becomes Tuple2<Integer, String[]>, two elements are treated as different even when the String[] contents are identical. Keep that in mind!
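A minimal sketch of that pitfall, with made-up data: the two arrays below hold identical contents but are distinct objects, and Java arrays compare by reference, so subtract does not treat the elements as equal.
JavaPairRDD<Integer, String[]> left = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, new String[]{"x", "y"})));
JavaPairRDD<Integer, String[]> right = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, new String[]{"x", "y"})));
// The "logically equal" element survives the subtract instead of being removed.
left.subtract(right).foreach(t -> System.out.println(t._1 + ":" + Arrays.toString(t._2)));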
- coalesce
This method reduces the number of partitions of the RDD to the given argument. It is said to be efficient, but in my experience it does not feel convenient for getting the data saved into a single file in the end.
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*/
def coalesce(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.coalesce(numPartitions))
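A minimal usage sketch (the output path is hypothetical; one part-XXXXX file is written per remaining partition):
wordCntRdd.coalesce(2).saveAsTextFile("output/wordcount");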
- repartition
This method reshuffles the RDD's data, using a random partitioning to produce more or fewer partitions and balance them across the cluster. This always shuffles all data over the network.
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.repartition(numPartitions))
Sometimes it is used at the end to save the result RDD into a single part file:
coGroupRdd.repartition(1).saveAsTextFile(fileName);
- Actions
- reduce
This is similar to reduceByKey, except that reduceByKey outputs one new value per key and the result is still an RDD, whereas reduce aggregates over all elements of the RDD and produces a single value that is no longer an RDD. That is why this operation counts as an action, while reduceByKey is a transformation.
/**
* Reduces the elements of this RDD using the specified commutative and associative binary
* operator.
*/
def reduce(f: JFunction2[T, T, T]): T = rdd.reduce(f)
In practice:
JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
intLineLength.persist(StorageLevel.MEMORY_ONLY());
int totalLen = intLineLength.reduce((a, b) -> a + b);
- collect
Pay particular attention to the Note in the source:
/**
* Return an array that contains all of the elements in this RDD.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): JList[T] =
rdd.collect().toSeq.asJava
- collectAsMap
This is an API specific to JavaPairRDD; it returns a Map of the K,V relationships in the original RDD.
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collectAsMap(): java.util.Map[K, V] = mapAsSerializableJavaMap(rdd.collectAsMap())
One thing to note, though: the Map type it returns can cause problems when used with broadcast.
For example:
final Map<String, MyInfoClass> kvMap = keyValuePairRDD.collectAsMap();
final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);
The code above turns a JavaPairRDD into a Map and then broadcasts it so that other executors can use it in methods such as map(). Written this way, however, there is some chance of hitting the following error:
17/06/14 11:43:53 INFO scheduler.DAGScheduler: ShuffleMapStage 1 (repartition at FromBSID2Gps.java:214) failed in 1.182 s due to Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 19, s36.dc.taiyear, executor 3): java.io.IOException: java.lang.UnsupportedOperationException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1213)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
.... ....
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException
at java.util.AbstractMap.put(AbstractMap.java:203)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:217)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1206)
... 20 more
So, to avoid the value being treated as an AbstractMap, explicitly specify the Map type, as in the code below.
Map<String, MyInfoClass> kvMap = new HashMap<>();
kvMap.putAll(keyValuePairRDD.collectAsMap());
final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);
- count
/**
* Return the number of elements in the RDD.
*/
def count(): Long = rdd.count()
- first
/**
* Return the first element in this RDD.
*/
def first(): T = rdd.first()
- take
/**
* Take the first num elements of the RDD. This currently scans the partitions *one by one*, so
* it will be slow if a lot of partitions are required. In that case, use collect() to get the
* whole RDD instead.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def take(num: Int): JList[T] =
rdd.take(num).toSeq.asJava
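A quick sketch of these actions on the integerRdd from the main example:
long n = integerRdd.count();                              // 5
Integer head = integerRdd.first();                        // 1
java.util.List<Integer> firstThree = integerRdd.take(3);  // [1, 2, 3]
System.out.println(n + ", " + head + ", " + firstThree);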
- saveAsTextFile
This does not save to a single file but to a directory, generating one file per partition.
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit = {
rdd.saveAsTextFile(path)
}
- countByKey
Returns the result as a hash Map.
/** Count the number of elements for each key, and return the result to the master as a Map. */
def countByKey(): java.util.Map[K, jl.Long] =
mapAsSerializableJavaMap(rdd.countByKey()).asInstanceOf[java.util.Map[K, jl.Long]]
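A quick sketch on the eachWordRdd from the word-count code ("the" is just an illustrative key):
java.util.Map<String, Long> wordCounts = eachWordRdd.countByKey();
System.out.println("count of 'the': " + wordCounts.get("the"));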
- foreach
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: VoidFunction[T]) {
rdd.foreach(x => f.call(x))
}
Other RDD Types
A brief look at:
- JavaDoubleRDD
Shared Variables
Because a Spark application is serialized and shipped to the worker nodes for execution, special mechanisms are needed to obtain an effective, consistent global variable across the workers.
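The two mechanisms Spark offers for this are broadcast variables and accumulators, both of which already appear in the main example above; a minimal sketch:
// Broadcast: read-only data shipped once to every executor.
Broadcast<double[]> factors = sc.broadcast(new double[]{1.1, 2.2, 3.3});
// Accumulator: executors only add to it; the driver reads the final value.
LongAccumulator sum = sc.sc().longAccumulator("sum");
integerRdd.foreach(x -> sum.add(x));
System.out.println("factors[0] = " + factors.value()[0] + ", sum = " + sum.value());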