[Spark Java API] Action (5): treeAggregate and treeReduce

treeAggregate


Official documentation:

Aggregates the elements of this RDD in a multi-level tree pattern.

Function signatures:

def treeAggregate[U](
    zeroValue: U,
    seqOp: JFunction2[U, T, U],
    combOp: JFunction2[U, U, U],
    depth: Int): U

def treeAggregate[U](
    zeroValue: U,
    seqOp: JFunction2[U, T, U],
    combOp: JFunction2[U, U, U]): U

**It can be understood as a more elaborate, multi-level version of aggregate.**
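For reference, here is a minimal sketch of the four-argument overload with an explicit depth, assuming an existing JavaSparkContext named javaSparkContext; the variable names are illustrative and not part of the original example:

List<Integer> nums = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
JavaRDD<Integer> numsRDD = javaSparkContext.parallelize(nums, 4);
// Sum the elements with an explicit tree depth of 3 instead of the default 2.
// Function2 is a functional interface, so Java 8 lambdas can be used here.
Integer sum = numsRDD.treeAggregate(
    0,
    (acc, v) -> acc + v,   // seqOp: fold one element into the per-partition accumulator
    (a, b) -> a + b,       // combOp: merge two partial accumulators
    3);                    // depth of the aggregation tree
System.out.println(sum);   // 36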

Source code analysis:

def treeAggregate[U: ClassTag](zeroValue: U)(    
    seqOp: (U, T) => U,    
    combOp: (U, U) => U,    
    depth: Int = 2): U = withScope {  
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")  
  if (partitions.length == 0) {    
    Utils.clone(zeroValue, context.env.closureSerializer.newInstance())  
  } else {    
    val cleanSeqOp = context.clean(seqOp)    
    val cleanCombOp = context.clean(combOp)    
    val aggregatePartition =      
      (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)    
    var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))    
    var numPartitions = partiallyAggregated.partitions.length    
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)    
    // If creating an extra level doesn't help reduce    
    // the wall-clock time, we stop tree aggregation.          
    // Don't trigger TreeAggregation when it doesn't save wall-clock time    
    while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {      
      numPartitions /= scale      
      val curNumPartitions = numPartitions      
      partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {        
        (i, iter) => iter.map((i % curNumPartitions, _))      
      }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values    
    }
    partiallyAggregated.reduce(cleanCombOp)
  }
}

**From the source we can see that treeAggregate first aggregates each partition locally using Scala's aggregate function; it also derives a scale factor from the depth parameter, and while the number of partitions is still too large, each partial result is keyed by i % curNumPartitions and merged with reduceByKey into fewer partitions; finally the remaining partial results are combined with a single reduce. Adjusting the depth therefore reduces the cost of that final reduce, as the sketch below illustrates.**
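As a concrete illustration of the loop above, the following sketch simply replays its arithmetic for a hypothetical RDD with 64 partitions and depth = 2; the numbers are only for illustration:

// scale = max(ceil(64^(1/2)), 2) = 8, so each level merges roughly 8 partitions into 1.
int numPartitions = 64;   // hypothetical initial partition count
int depth = 2;
int scale = Math.max((int) Math.ceil(Math.pow(numPartitions, 1.0 / depth)), 2);
while (numPartitions > scale + Math.ceil((double) numPartitions / scale)) {
    numPartitions /= scale;
    System.out.println("intermediate level: " + numPartitions + " partitions");
}
System.out.println("final reduce over " + numPartitions + " partial results");
// Prints one intermediate level of 8 partitions, so the final reduce
// combines 8 partial results instead of 64.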

Example:

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data,3);
// transformation: convert each Integer to a String
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {    
  @Override    
  public String call(Integer v1) throws Exception {        
    return Integer.toString(v1);    
  }
});

String result1 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {    
  @Override    
  public String call(String v1, String v2) throws Exception {        
    System.out.println(v1 + "=seq=" + v2);        
    return v1 + "=seq=" + v2;    
  }
}, new Function2<String, String, String>() {    
    @Override    
    public String call(String v1, String v2) throws Exception {        
      System.out.println(v1 + "<=comb=>" + v2);        
      return v1 + "<=comb=>" + v2;    
  }
});
System.out.println(result1);

treeReduce


Official documentation:

Reduces the elements of this RDD in a multi-level tree pattern.

Function signatures:

def treeReduce(f: JFunction2[T, T, T], depth: Int): T
def treeReduce(f: JFunction2[T, T, T]): T

**Similar to treeAggregate; it is effectively a treeAggregate whose seqOp and combOp are the same function.**
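A minimal sketch of the depth overload, assuming the same javaSparkContext as in the example above; the names are illustrative:

// treeReduce has no zero value, so the RDD must be non-empty.
JavaRDD<Integer> ints = javaSparkContext.parallelize(Arrays.asList(5, 1, 1, 4, 4, 2, 2), 4);
// Sum the elements with an explicit tree depth of 3 instead of the default 2.
Integer total = ints.treeReduce((a, b) -> a + b, 3);
System.out.println(total);   // 19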

Source code analysis:

def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {  
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")  
  val cleanF = context.clean(f)  
  val reducePartition: Iterator[T] => Option[T] = iter => {    
    if (iter.hasNext) {      
      Some(iter.reduceLeft(cleanF))    
    } else {      
      None    
    }  
  }  
  val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))  
  val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
    if (c.isDefined && x.isDefined) {
      Some(cleanF(c.get, x.get))
    } else if (c.isDefined) {
      c
    } else if (x.isDefined) {
      x
    } else {
      None
    }
  }
  partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
    .getOrElse(throw new UnsupportedOperationException("empty collection"))
}

**From the source we can see that treeReduce first reduces each partition locally using Scala's reduceLeft; the partially reduced RDD is then merged through treeAggregate, where seqOp and combOp are the same function and the zero value is Option.empty. In practice treeReduce can be used in place of reduce when a single driver-side reduce step would be expensive, since the depth parameter controls how much of the merging happens in intermediate levels, as the sketch below shows.**
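For example, on an RDD with many partitions the two calls below compute the same sum, but treeReduce merges most of the partial results on the executors instead of pulling them all to the driver. This is only a sketch; the partition count of 200 is hypothetical:

List<Integer> values = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> wideRDD = javaSparkContext.parallelize(values, 200);  // many, mostly empty partitions
// reduce merges every per-partition partial result on the driver.
Integer viaReduce = wideRDD.reduce((a, b) -> a + b);
// treeReduce inserts intermediate merge levels (depth = 2 here) before the driver-side step.
Integer viaTreeReduce = wideRDD.treeReduce((a, b) -> a + b, 2);
System.out.println(viaReduce + " == " + viaTreeReduce);   // prints 19 == 19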

Example:

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data,5);
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {    
    @Override    
    public String call(Integer v1) throws Exception {        
      return Integer.toString(v1);    
    }
});
String result = javaRDD1.treeReduce(new Function2<String, String, String>() {    
    @Override    
    public String call(String v1, String v2) throws Exception {        
      System.out.println(v1 + "=" + v2);        
      return v1 + "=" + v2;    
  }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + treeReduceRDD);