[Spark Java API] Action (2): fold, countByKey

fold


Official documentation:

Aggregate the elements of each partition, and then the results for all the partitions, 
using a given associative and commutative function and a neutral "zero value". 
The function op(t1, t2) is allowed to modify t1 and return it as its result value 
to avoid object allocation; however, it should not modify t2.

Function prototype:

def fold(zeroValue: T)(f: JFunction2[T, T, T]): T

**
fold is a simplified form of aggregate: it uses a single function op in place of aggregate's separate seqOp and combOp.
**
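
This relationship can be sketched directly in the Java API (a minimal sketch, not from the original article, reusing the javaSparkContext from the examples below): fold(zero, op) gives the same result as aggregate(zero, op, op).

// Minimal sketch: fold(zero, op) behaves like aggregate(zero, op, op) --
// the same op serves as both seqOp and combOp.
JavaRDD<Integer> nums = javaSparkContext.parallelize(Arrays.asList(1, 2, 3, 4), 2);

Integer viaFold = nums.fold(0, new Function2<Integer, Integer, Integer>() {
  @Override
  public Integer call(Integer a, Integer b) throws Exception {
    return a + b;
  }
});

Integer viaAggregate = nums.aggregate(0,
  new Function2<Integer, Integer, Integer>() {  // seqOp
    @Override
    public Integer call(Integer a, Integer b) throws Exception {
      return a + b;
    }
  },
  new Function2<Integer, Integer, Integer>() {  // combOp
    @Override
    public Integer call(Integer a, Integer b) throws Exception {
      return a + b;
    }
  });

System.out.println(viaFold + " , " + viaAggregate);  // both are 10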

Source code analysis:

def fold(zeroValue: T)(op: (T, T) => T): T = withScope {  
  // Clone the zero value since we will also be serializing it as part of tasks 
  var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())  
  val cleanOp = sc.clean(op)  
  val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)  
  val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)  
  sc.runJob(this, foldPartition, mergeResult)  
  jobResult
}

**
As the source shows, zeroValue is first assigned to jobResult; then each partition is folded with op, starting from zeroValue; as task results come back, each taskResult is merged into jobResult with op, updating jobResult; finally, jobResult is returned.
**
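
One consequence worth noting, illustrated with a small hedged sketch (again reusing javaSparkContext): zeroValue participates once in every partition's fold and once more in the final merge on the driver, so a zero value that is not neutral for op is counted numPartitions + 1 times.

// Hedged sketch: zeroValue = 1 is not neutral for addition. With 3 partitions it is
// applied 3 + 1 = 4 times, so the result is (1 + 2 + 3) + 4 = 10 rather than 7.
JavaRDD<Integer> ints = javaSparkContext.parallelize(Arrays.asList(1, 2, 3), 3);
Integer folded = ints.fold(1, new Function2<Integer, Integer, Integer>() {
  @Override
  public Integer call(Integer a, Integer b) throws Exception {
    return a + b;
  }
});
System.out.println(folded);  // 10 -- which is why zeroValue should normally be neutral for op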

Example:

List<String> data = Arrays.asList("5", "1", "1", "3", "6", "2", "2");
JavaRDD<String> javaRDD = javaSparkContext.parallelize(data, 5);
// Tag each element with the index of the partition it lives in, so the partitioning is visible.
JavaRDD<String> partitionRDD = javaRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
  @Override
  public Iterator<String> call(Integer v1, Iterator<String> v2) throws Exception {
    LinkedList<String> linkedList = new LinkedList<String>();
    while (v2.hasNext()) {
      linkedList.add(v1 + "=" + v2.next());
    }
    return linkedList.iterator();
  }
}, false);

System.out.println(partitionRDD.collect());

// Fold with zeroValue "0": the zero value is used once in every partition and once more
// when the partition results are merged on the driver.
String foldRDD = javaRDD.fold("0", new Function2<String, String, String>() {
  @Override
  public String call(String v1, String v2) throws Exception {
    return v1 + " - " + v2;
  }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldRDD);

countByKey


Official documentation:

Count the number of elements for each key, collecting the results to a local Map.
Note that this method should only be used if the resulting map is expected to be small, 
as the whole thing is loaded into the driver's memory. To handle very large results, 
consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), 
which returns an RDD[T, Long] instead of a map.

Function prototype:

def countByKey(): java.util.Map[K, Long]

Source code analysis:

def countByKey(): Map[K, Long] = self.withScope {  
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

**
As the source shows, countByKey first maps every value to 1L, turning the data into (key, 1) pairs, then aggregates them with reduceByKey, and finally collects the result to the driver and converts it into a Map.
Note that countByKey therefore loads the entire result into the driver's memory; if the number of distinct keys is large, this can cause an OOM. In that case, prefer rdd.mapValues(_ => 1L).reduceByKey(_ + _), which returns an RDD[(K, Long)] instead of a Map.
**
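
For reference, a minimal Java sketch of that alternative (assuming a JavaPairRDD<String, String> such as the javaPairRDD built in the example below); the per-key counts stay in an RDD instead of being collected into a driver-side Map:

// Hedged sketch: distributed per-key counting, no driver-side Map.
JavaPairRDD<String, Long> counts = javaPairRDD
  .mapValues(new Function<String, Long>() {
    @Override
    public Long call(String v) throws Exception {
      return 1L;
    }
  })
  .reduceByKey(new Function2<Long, Long, Long>() {
    @Override
    public Long call(Long a, Long b) throws Exception {
      return a + b;
    }
  });
// counts can be saved or processed further without loading everything into driver memory.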

Example:

List<String> data = Arrays.asList("5", "1", "1", "3", "6", "2", "2");
JavaRDD<String> javaRDD = javaSparkContext.parallelize(data, 5);

// Tag each element with its partition index (same helper as in the fold example above).
JavaRDD<String> partitionRDD = javaRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
  @Override
  public Iterator<String> call(Integer v1, Iterator<String> v2) throws Exception {
    LinkedList<String> linkedList = new LinkedList<String>();
    while (v2.hasNext()) {
      linkedList.add(v1 + "=" + v2.next());
    }
    return linkedList.iterator();
  }
}, false);
System.out.println(partitionRDD.collect());
// Turn each element into a (key, value) pair, using the element itself as both key and value.
JavaPairRDD<String, String> javaPairRDD = javaRDD.mapToPair(new PairFunction<String, String, String>() {
  @Override
  public Tuple2<String, String> call(String s) throws Exception {
    return new Tuple2<String, String>(s, s);
  }
});
System.out.println(javaPairRDD.countByKey());
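
For this input, the printed map should contain the counts {5=1, 1=2, 3=1, 6=1, 2=2}; the ordering of the keys in the printed Map is not significant.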