coalesce
Official documentation:
Return a new RDD that is reduced into `numPartitions` partitions.
Function prototypes:
def coalesce(numPartitions: Int): JavaRDD[T]
def coalesce(numPartitions: Int, shuffle: Boolean): JavaRDD[T]
Source code analysis:
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null) : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
**
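The source shows that when shuffle = false no shuffle takes place, so the question is simply which partitions of the parent RDD can be merged. CoalescedRDD groups the parent partitions into at most numPartitions groups, and each group becomes one partition of the result.
When shuffle = true a shuffle is performed, and the idea is simple. Each record in a partition is first turned into a key-value pair whose value is the record itself. The key is (new Random(index)).nextInt(numPartitions) + 1 for the first record and grows by 1 for every following record, where index is the index of the partition and numPartitions is the number of partitions in the CoalescedRDD. Because HashPartitioner takes each key modulo numPartitions, the shuffle produces a ShuffledRDD whose records are spread evenly across its partitions. CoalescedRDD then uses a more involved algorithm to link the ShuffledRDD's partitions to its own, and finally the keys are dropped, which leaves the RDD that coalesce returns.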
**
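The key assignment in the shuffle = true branch can be tried out without a Spark cluster. The following is a minimal sketch in plain Java, not Spark code: the class `ShuffleKeySketch` and the method `distribute` are hypothetical names, and `Math.floorMod` stands in for the non-negative modulo that HashPartitioner applies to an Int key. It counts how many records of one input partition land in each output partition.

```java
import java.util.Random;

// Simplified, Spark-free model of the shuffle = true path (illustration only).
class ShuffleKeySketch {

    // Counts how many of the `recordCount` records in input partition `index`
    // end up in each of the `numPartitions` output partitions.
    static int[] distribute(int index, int recordCount, int numPartitions) {
        int[] perTarget = new int[numPartitions];
        // Same starting point as distributePartition: seeded by the partition index.
        int position = new Random(index).nextInt(numPartitions);
        for (int i = 0; i < recordCount; i++) {
            position = position + 1; // key assigned to this record
            // Assumption: models HashPartitioner, where an Int key hashes to
            // itself and a non-negative modulo picks the target partition.
            int target = Math.floorMod(position, numPartitions);
            perTarget[target]++;
        }
        return perTarget;
    }

    public static void main(String[] args) {
        int[] counts = distribute(0, 7, 2);
        for (int p = 0; p < counts.length; p++) {
            System.out.println("output partition " + p + ": " + counts[p] + " records");
        }
    }
}
```

Because consecutive records receive consecutive keys, they are dealt out round-robin, so the per-partition counts for a single input partition differ by at most one. Only the starting partition depends on the random seed.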
Example:
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
// shuffle defaults to false
JavaRDD<Integer> coalesceRDD = javaRDD.coalesce(2);
System.out.println(coalesceRDD);
JavaRDD<Integer> coalesceRDD1 = javaRDD.coalesce(2,true);
System.out.println(coalesceRDD1);
Note:
**
coalesce() adjusts the number of partitions of the parent RDD, for example reducing it from 5 to 3 or increasing it from 5 to 10. Note that with shuffle = false the number of partitions cannot be increased (5 cannot become 10); if a larger number is requested, the RDD keeps its current number of partitions.
**
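To make the shuffle = false behaviour concrete, here is a rough sketch in plain Java of one way parent partitions could be grouped. It is an illustration under stated assumptions, not Spark's implementation: the class `CoalesceGroupingSketch` and the method `group` are hypothetical, and the sketch splits partitions purely by position, whereas the real CoalescedRDD also takes data locality into account.

```java
import java.util.ArrayList;
import java.util.List;

// Position-based grouping of parent partitions (illustration only; ignores locality).
class CoalesceGroupingSketch {

    // Returns, for each output partition, the indices of the parent partitions merged into it.
    static List<List<Integer>> group(int parentCount, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<Integer>> groups = new ArrayList<>();
        if (numPartitions >= parentCount) {
            // Without a shuffle a partition cannot be split, so the count cannot grow:
            // every parent partition simply becomes its own group.
            for (int i = 0; i < parentCount; i++) {
                List<Integer> single = new ArrayList<>();
                single.add(i);
                groups.add(single);
            }
            return groups;
        }
        for (int i = 0; i < numPartitions; i++) {
            // Contiguous slice of parent partitions assigned to output partition i.
            int start = (int) ((long) i * parentCount / numPartitions);
            int end = (int) ((long) (i + 1) * parentCount / numPartitions);
            List<Integer> slice = new ArrayList<>();
            for (int j = start; j < end; j++) {
                slice.add(j);
            }
            groups.add(slice);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group(5, 3));  // prints [[0], [1, 2], [3, 4]]
        System.out.println(group(5, 10)); // prints [[0], [1], [2], [3], [4]]
    }
}
```

Asking for 10 partitions from 5 still yields 5 groups, which mirrors the rule that shuffle = false cannot increase the partition count.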
repartition
Official documentation:
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD.
Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.
**
In particular, if you would use repartition only to reduce the number of partitions of an RDD, you can call coalesce with shuffle set to false instead, which avoids the shuffle and is more efficient.
**
Function prototype:
def repartition(numPartitions: Int): JavaRDD[T]
Source code analysis:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
**
The source shows that repartition is equivalent to coalesce(numPartitions, shuffle = true).
**
Example:
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
// equivalent to coalesce(numPartitions, shuffle = true)
JavaRDD<Integer> repartitionRDD = javaRDD.repartition(2);
System.out.println(repartitionRDD);