Abstract: Shuffle is the most time-consuming step in the MapReduce programming model. Spark splits the shuffle into two phases, Shuffle Write and Shuffle Read; in this article we look at Spark's Shuffle Write implementation in detail.
ShuffleWriter
The interface for Spark's shuffle write is org.apache.spark.shuffle.ShuffleWriter. Let's look at its definition:
private[spark] abstract class ShuffleWriter[K, V] {
/** Write a sequence of records to this task's output */
@throws[IOException]
def write(records: Iterator[Product2[K, V]]): Unit
/** Close this writer, passing along whether the map completed */
def stop(success: Boolean): Option[MapStatus]
}
It has three concrete implementations: BypassMergeSortShuffleWriter, SortShuffleWriter, and UnsafeShuffleWriter.
BypassMergeSortShuffleWriter
Assume the first stage (map) has m tasks and the second stage (reduce) has r partitions. BypassMergeSortShuffleWriter works in three steps:
1. For each ShuffleMapTask (each ShuffleMapTask processes one map-side partition), create r temporary files, one per reduce partition.
2. Iterate over the records of the map partition, use getPartition(key) to decide which reduce partition each record belongs to, and append it to the temporary file of that partitionId.
3. Merge the r files produced in step 2 into a single data file, and write the offset of each partitionId into an index file.
Key code walkthrough
public void write(Iterator<Product2<K, V>> records) throws IOException {
...
// Create one DiskWriter per partition of the downstream (reduce) stage
partitionWriters = new DiskBlockObjectWriter[numPartitions];
partitionWriterSegments = new FileSegment[numPartitions];
for (int i = 0; i < numPartitions; i++) {
final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
blockManager.diskBlockManager().createTempShuffleBlock();
final File file = tempShuffleBlockIdPlusFile._2();
final BlockId blockId = tempShuffleBlockIdPlusFile._1();
partitionWriters[i] =
blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
}
// Use getPartition(key) to find the reduce partitionId of this record, and write the (key, value) pair into that partition's temporary file
while (records.hasNext()) {
final Product2<K, V> record = records.next();
final K key = record._1();
partitionWriters[partitioner.getPartition(key)].write(key, record._2());
}
for (int i = 0; i < numPartitions; i++) {
final DiskBlockObjectWriter writer = partitionWriters[i];
partitionWriterSegments[i] = writer.commitAndGet();
writer.close();
}
File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
File tmp = Utils.tempFileWith(output);
try {
// Merge the per-partitionId temporary files into the `shuffle_${shuffleId}_${mapId}_${reduceId}.data` file
partitionLengths = writePartitionedFile(tmp);
// Write the offsets of all partitionIds into the `shuffle_${shuffleId}_${mapId}_${reduceId}.index` file
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
} finally {
if (tmp.exists() && !tmp.delete()) {
logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
}
}
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}
1. The default Partitioner implementation is HashPartitioner.
2. The default SerializerInstance implementation is JavaSerializerInstance.
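As a reminder of what HashPartitioner does with each key, here is a minimal illustrative sketch of its routing rule, a non-negative modulo of the key's hashCode; this is a simplified re-implementation for the example, not the Spark source:
// Illustrative: route a key to one of numPartitions reduce partitions, HashPartitioner-style.
// (HashPartitioner maps null keys to partition 0; that case is omitted here for brevity.)
def hashPartition(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0)
}
// hashPartition("a", 3) always returns the same partition id for the same key,
// which is what guarantees that all values of a key end up in the same reduce partition.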
FileSegment
Each intermediate temporary file of BypassMergeSortShuffleWriter is described by a FileSegment:
class FileSegment(val file: File, val offset: Long, val length: Long)
file records the physical file, offset the position where the segment starts inside it, and length the segment size; the lengths are what we need when merging multiple FileSegments and writing the index file.
Now let's look at writePartitionedFile, the method that merges the temporary files:
private long[] writePartitionedFile(File outputFile) throws IOException {
final long[] lengths = new long[numPartitions];
...
final FileChannel out = FileChannel.open(outputFile.toPath(), WRITE, APPEND, CREATE);
try {
for (int i = 0; i < numPartitions; i++) {
final File file = partitionWriterSegments[i].file();
if (file.exists()) {
final FileChannel in = FileChannel.open(file.toPath(), READ);
try {
long size = in.size();
// The key merge step: NIO's transferTo makes copying the file streams efficient
Utils.copyFileStreamNIO(in, out, 0, size);
lengths[i] = size;
} finally {
in.close(); // cleanup simplified in this excerpt; Spark uses Closeables.close
}
}
}
} finally {
out.close(); // likewise simplified
}
partitionWriters = null;
// Return each partition's length, used later to write the index file
return lengths;
}
The index file is written by writeIndexFileAndCommit:
def writeIndexFileAndCommit(
shuffleId: Int,
mapId: Int,
lengths: Array[Long],
dataTmp: File): Unit = {
val indexFile = getIndexFile(shuffleId, mapId)
val indexTmp = Utils.tempFileWith(indexFile)
try {
val out = new DataOutputStream(
new BufferedOutputStream(Files.newOutputStream(indexTmp.toPath)))
Utils.tryWithSafeFinally {
var offset = 0L
out.writeLong(offset)
for (length <- lengths) {
offset += length
out.writeLong(offset)
}
} {
out.close()
}
}
...
}
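To make the index layout concrete, here is a small illustrative computation (plain Scala, not Spark code) of how per-partition lengths turn into the offsets written above; reducer i later reads the byte range [offsets(i), offsets(i+1)) from the data file:
// Hypothetical partition lengths produced by one map task
val lengths = Array(10L, 0L, 5L)
// Offsets written to the index file: 0, 10, 10, 15 (a cumulative sum)
val offsets = lengths.scanLeft(0L)(_ + _)
// e.g. reducer 2 reads bytes [offsets(2), offsets(3)) = [10, 15) from the data file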
NOTE: 1. File merging relies on Java NIO's transferTo to make the merge efficient.
2. The complete code of BypassMergeSortShuffleWriter.
BypassMergeSortShuffleWriter Example
Let's walk through an example of how BypassMergeSortShuffleWriter works (a driver sketch follows below).
1. In real workloads the data inside a partition is usually unordered. The data simulated in this example happens to be ordered; don't be misled into thinking that BypassMergeSortShuffleWriter sorts our data.
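A minimal driver sketch that takes the bypass path (an assumed setup, not taken from the original post): with the default spark.shuffle.sort.bypassMergeThreshold of 200, three reduce partitions and no map-side combine, each map task writes three temporary files and then merges them into one data file plus one index file.
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object BypassExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[3]").setAppName("bypass demo")
    val sc = new SparkContext(conf)
    // partitionBy has no map-side combine and 3 <= bypassMergeThreshold,
    // so shouldBypassMergeSort returns true and BypassMergeSortShuffleWriter is used
    val partitioned = sc.parallelize(1 to 100, 2)
      .map(x => (x % 3, x))
      .partitionBy(new HashPartitioner(3))
    partitioned.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i has ${part.length} records")
    }
    sc.stop()
  }
}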
SortShuffleWriter
Prerequisites:
org.apache.spark.util.collection.AppendOnlyMap
org.apache.spark.util.collection.PartitionedPairBuffer
TimSort
SortShuffleWriter.write() implementation
Let's first look at the implementation of write:
override def write(records: Iterator[Product2[K, V]]): Unit = {
sorter = if (dep.mapSideCombine) {
require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
new ExternalSorter[K, V, C](
context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
} else {
new ExternalSorter[K, V, V](
context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}
sorter.insertAll(records)
val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
val tmp = Utils.tempFileWith(output)
try {
val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {
if (tmp.exists() && !tmp.delete()) {
logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
}
}
}
The SortShuffleWriter write path can be split into two steps: first insertAll, then merging the SpilledFiles that were spilled to disk.
ExternalSorter can be understood in four steps:
- Depending on whether a map-side combine is needed, the in-memory buffer is either a PartitionedAppendOnlyMap or a PartitionedPairBuffer. In both data structures the records are ordered first by partitionId and, within each partition, by key.
- When the buffered data reaches the memory limit or the record-count limit, we spill it to disk; each SpilledFile records how many records each partition contains.
- When an iterator or an output file is requested, all SpilledFiles are merged together with the data still in memory that has not been spilled.
- Finally, stop is called to delete the temporary files.
The implementation of ExternalSorter.insertAll:
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
val shouldCombine = aggregator.isDefined
// Whether a map-side combine is needed is determined by whether an aggregator is defined
if (shouldCombine) {
// Combine values in-memory first using our AppendOnlyMap
// Corresponds to the seqOp parameter of rdd.aggregateByKey
val mergeValue = aggregator.get.mergeValue
// Corresponds to the zeroValue parameter of rdd.aggregateByKey; zeroValue is used to create the combiner
val createCombiner = aggregator.get.createCombiner
var kv: Product2[K, V] = null
val update = (hadValue: Boolean, oldValue: C) => {
if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
}
while (records.hasNext) {
addElementsRead()
kv = records.next()
map.changeValue((getPartition(kv._1), kv._1), update)
maybeSpillCollection(usingMap = true)
}
} else {
// Stick values into our buffer
while (records.hasNext) {
addElementsRead()
val kv = records.next()
buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
maybeSpillCollection(usingMap = false)
}
}
}
Note that the key written into the map/buffer is the pair (partitionId, key), because the data going into one temporary file must be sorted first by partitionId and then by key; a comparator sketch follows below.
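The effect of keying by (partitionId, key) is that a single comparator can order records by partition first and then by key. A simplified sketch of such a comparator (mirroring what Spark's WritablePartitionedPairCollection provides, written here purely for illustration):
import java.util.Comparator

// Order (partitionId, key) pairs by partition id first, then by the per-key comparator
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val partitionDiff = a._1 - b._1
      if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
    }
  }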
When do we spill to disk?
A spill is triggered when either of two conditions is met:
- 1. Every 32 elements we check whether the current in-memory size has reached myMemoryThreshold, i.e. currentMemory >= myMemoryThreshold. currentMemory is obtained by estimating the current size of the map/buffer.
- 2. The number of records in the in-memory structure exceeds the forced-spill threshold, i.e. _elementsRead > numElementsForceSpillThreshold. This threshold is controlled by spark.shuffle.spill.numElementsForceSpillThreshold in SparkConf.
private def maybeSpillCollection(usingMap: Boolean): Unit = {
var estimatedSize = 0L
if (usingMap) {
// Estimate the in-memory size of the map
estimatedSize = map.estimateSize()
if (maybeSpill(map, estimatedSize)) {
// If the in-memory data was spilled to disk, reset the map
map = new PartitionedAppendOnlyMap[K, C]
}
} else {
// Estimate the in-memory size of the buffer
estimatedSize = buffer.estimateSize()
if (maybeSpill(buffer, estimatedSize)) {
// Same as for the map: reset the buffer after a spill
buffer = new PartitionedPairBuffer[K, C]
}
}
if (estimatedSize > _peakMemoryUsedBytes) {
_peakMemoryUsedBytes = estimatedSize
}
}
protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
var shouldSpill = false
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
val amountToRequest = 2 * currentMemory - myMemoryThreshold
val granted = acquireMemory(amountToRequest)
myMemoryThreshold += granted
shouldSpill = currentMemory >= myMemoryThreshold
}
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
if (shouldSpill) {
_spillCount += 1
logSpillage(currentMemory)
// Spill to disk
spill(collection)
_elementsRead = 0
_memoryBytesSpilled += currentMemory
releaseMemory()
}
shouldSpill
}
The spill-to-disk process
override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
// Sort the in-memory data using the TimSort algorithm
val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
// Write the in-memory data to disk
val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
// Append to the spills array
spills += spillFile
}
To sum up insertAll: records are written into the in-memory structure (PartitionedPairBuffer or PartitionedAppendOnlyMap) while the spill conditions are checked on the fly; each spill produces one temporary file on disk.
Reading a SpilledFile
The data in a SpilledFile is ordered by (partitionId, recordKey), and each SpilledFile records how many records every partition contains, so extracting one partition's data from a SpilledFile is straightforward. The class that reads a SpilledFile is SpillReader.
The merge process
private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
: Iterator[(Int, Iterator[Product2[K, C]])] = {
val readers = spills.map(new SpillReader(_))
val inMemBuffered = inMemory.buffered
(0 until numPartitions).iterator.map { p =>
val inMemIterator = new IteratorForPartition(p, inMemBuffered)
val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
if (aggregator.isDefined) {
(p, mergeWithAggregation(
iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
} else if (ordering.isDefined) {
(p, mergeSort(iterators, ordering.get))
} else {
(p, iterators.iterator.flatten)
}
}
}
The merge is the most involved step: its behaviour depends on whether the current Shuffle has an aggregator and an ordering. We go through each of these cases below.
no aggregator or sorter
partitionBy
case class TestIntKey(i: Int)
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", 4.toString)
val sc = new SparkContext(conf)
val testData = (1 to 100).toList
sc.parallelize(testData, 1)
.map(x => {
(TestIntKey(x % 3), x)
}).partitionBy(new HashPartitioner(3)).collect()
no aggregator but sorter
This case is easy to get confused about: it is tempting to think that sortByKey is exactly the "no aggregator but with a sorter" case. In fact, when SortShuffleWriter instantiates ExternalSorter it passes ordering = None, as the code shows:
sorter = if (dep.mapSideCombine) {
...
} else {
new ExternalSorter[K, V, V](
context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}
NOTE: the ordering logic of sortByKey is carried out during the Shuffle Read phase, which we will cover in a later article.
Still, let's take a quick look at the implementation of mergeSort. In our SpilledFiles the data within each partition is already sorted by record key, so all we have to do is take each SpilledFile's iterator for that partition and merge them:
private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
: Iterator[Product2[K, C]] =
{
// NOTE:(fchen) put all iterators of this partition into a priority queue; each read compares the heads of the iterators and takes the smallest
val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
type Iter = BufferedIterator[Product2[K, C]]
val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
})
heap.enqueue(bufferedIters: _*) // Will contain only the iterators with hasNext = true
new Iterator[Product2[K, C]] {
override def hasNext: Boolean = !heap.isEmpty
override def next(): Product2[K, C] = {
if (!hasNext) {
throw new NoSuchElementException
}
val firstBuf = heap.dequeue()
val firstPair = firstBuf.next()
if (firstBuf.hasNext) {
// Put the iterator back into the priority queue
heap.enqueue(firstBuf)
}
firstPair
}
}
}
Let's look at the whole mergeSort process through the following example.
As the illustration shows, one partition's data scattered across multiple SpilledFiles becomes, after mergeSort, a single iterator sorted by record key.
aggregator, but no sorter
reduceByKey
if (!totalOrder) {
new Iterator[Iterator[Product2[K, C]]] {
val sorted = mergeSort(iterators, comparator).buffered
// Buffers reused across elements to decrease memory allocation
val keys = new ArrayBuffer[K]
val combiners = new ArrayBuffer[C]
override def hasNext: Boolean = sorted.hasNext
override def next(): Iterator[Product2[K, C]] = {
if (!hasNext) {
throw new NoSuchElementException
}
keys.clear()
combiners.clear()
val firstPair = sorted.next()
keys += firstPair._1
combiners += firstPair._2
val key = firstPair._1
while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
val pair = sorted.next()
var i = 0
var foundKey = false
while (i < keys.size && !foundKey) {
if (keys(i) == pair._1) {
combiners(i) = mergeCombiners(combiners(i), pair._2)
foundKey = true
}
i += 1
}
if (!foundKey) {
keys += pair._1
combiners += pair._2
}
}
keys.iterator.zip(combiners.iterator)
}
}.flatMap(i => i)
}
Seeing this, we may wonder why the keys need to be stored in an ArrayBuffer. The reason is that when no ordering is defined the comparator orders keys by their hash code, so two different keys can compare as equal (a hash collision; for example "Aa" and "BB" have the same hashCode in Java). All keys in one comparator-equal group must therefore be kept, together with their own combiners, and matched by real equality. The example below constructs exactly such a collision.
reduceByKey Example:
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", (4).toString)
val sc = new SparkContext(conf)
val testData = (1 to 10).toList
val keys = Array("Aa", "BB")
val count = sc.parallelize(testData, 1)
.map(x => {
(keys(x % 2), x)
}).reduceByKey(_ + _, 3).collectPartitions().foreach(x => {
x.foreach(y => {
println(y._1 + "," + y._2)
})
})
The figure below illustrates the whole mergeWithAggregation process for reduceByKey when a hash collision occurs.
aggregator and sorter
Although this branch exists, I have not found an operation that carries both an aggregator and an ordering, so we will only skim over this logic.
Merging the SpilledFiles
Once the per-partition merge is done, the data and index files can be written. The writing itself is the same as in BypassMergeSortShuffleWriter, so we won't repeat the explanation here.
private[this] case class SpilledFile(
file: File,
blockId: BlockId,
serializerBatchSizes: Array[Long],
elementsPerPartition: Array[Long])
SortShuffleWriter summary
Records are serialized twice: once when writing the SpilledFiles and again when the SpilledFiles are merged.
UnsafeShuffleWriter
上面我們介紹了兩種在堆內做Shuffle write的方式,這種方式的缺點很明顯,就是在大對象的情況下,Jvm的垃圾回收性能表現比較差。所以就衍生了堆外內存的Shuffle write,即UnsafeShuffleWriter
。
At a high level UnsafeShuffleWriter is designed much like SortShuffleWriter: the map-side data is sorted by the reduce-side partitionId, records are spilled from memory to disk once a limit is exceeded, and finally the spill files are merged into a single MapOutputFile while recording each partition's offset.
From the two on-heap shuffle write models above we already know the overall shape of a shuffle write: order the records by partitionId, spill when necessary, and merge everything into one data file plus an index file.
Prerequisites
The memory paging model: records are addressed by a page number plus an offset within the page (see the TaskMemoryManager calls in the code below).
Implementation details
Before going into the details of UnsafeShuffleWriter, let's first look at the PackedRecordPointer class.
final class PackedRecordPointer {
...
public static long packPointer(long recordPointer, int partitionId) {
final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
return (((long) partitionId) << 40) | compressedAddress;
}
private long packedRecordPointer;
public void set(long packedRecordPointer) {
this.packedRecordPointer = packedRecordPointer;
}
public int getPartitionId() {
return (int) ((packedRecordPointer & MASK_LONG_UPPER_24_BITS) >>> 40);
}
public long getRecordPointer() {
final long pageNumber = (packedRecordPointer << 24) & MASK_LONG_UPPER_13_BITS;
final long offsetInPage = packedRecordPointer & MASK_LONG_LOWER_27_BITS;
return pageNumber | offsetInPage;
}
}
PackedRecordPointer packs partitionId, pageNumber and offsetInPage into a single long. A long has 64 bits, and from the code we can see the layout is:
[ 24 bit partitionId ] [ 13 bit pageNumber ] [ 27 bit offset in page ]
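To make the layout concrete, here is a small illustrative sketch (it does not use Spark's actual mask constants) that packs and unpacks the three fields with the same 24/13/27-bit split:
// Illustrative bit packing: [24 bit partitionId][13 bit pageNumber][27 bit offsetInPage]
object PackedPointerDemo {
  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Int): Long =
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | offsetInPage.toLong

  def partitionId(packed: Long): Int = (packed >>> 40).toInt
  def pageNumber(packed: Long): Int = ((packed >>> 27) & ((1L << 13) - 1)).toInt
  def offsetInPage(packed: Long): Int = (packed & ((1L << 27) - 1)).toInt

  def main(args: Array[String]): Unit = {
    val p = pack(partitionId = 5, pageNumber = 3, offsetInPage = 1024)
    // All three fields survive the round trip
    assert(partitionId(p) == 5 && pageNumber(p) == 3 && offsetInPage(p) == 1024)
  }
}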
The insertRecord method:
public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
throws IOException {
// Spill if the number of records in memory exceeds the forced-spill threshold
if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
spill();
}
growPointerArrayIfNecessary();
// Need 4 bytes to store the record length.
final int required = length + 4;
acquireNewPageIfNecessary(required);
assert(currentPage != null);
final Object base = currentPage.getBaseObject();
final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
Platform.putInt(base, pageCursor, length);
pageCursor += 4;
Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
pageCursor += length;
inMemSorter.insertRecord(recordAddress, partitionId);
}
The spill itself is just writing a file, i.e. calling writeSortedFile:
private void writeSortedFile(boolean isLastFile) {
...
// Sort inMemSorter, i.e. the PackedRecordPointers, by partitionId
final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
inMemSorter.getSortedIterator();
final byte[] writeBuffer = new byte[diskWriteBufferSize];
final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
blockManager.diskBlockManager().createTempShuffleBlock();
final File file = spilledFileInfo._2();
final TempShuffleBlockId blockId = spilledFileInfo._1();
final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);
final SerializerInstance ser = DummySerializerInstance.INSTANCE;
final DiskBlockObjectWriter writer =
blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);
int currentPartition = -1;
while (sortedRecords.hasNext()) {
sortedRecords.loadNext();
final int partition = sortedRecords.packedRecordPointer.getPartitionId();
if (partition != currentPartition) {
// Switch to the new partition
if (currentPartition != -1) {
final FileSegment fileSegment = writer.commitAndGet();
spillInfo.partitionLengths[currentPartition] = fileSegment.length();
}
currentPartition = partition;
}
final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
final Object recordPage = taskMemoryManager.getPage(recordPointer);
final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
long recordReadPosition = recordOffsetInPage + 4; // skip over record length
while (dataRemaining > 0) {
final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
Platform.copyMemory(
recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
writer.write(writeBuffer, 0, toTransfer);
recordReadPosition += toTransfer;
dataRemaining -= toTransfer;
}
writer.recordWritten();
}
final FileSegment committedSegment = writer.commitAndGet();
writer.close();
if (currentPartition != -1) {
spillInfo.partitionLengths[currentPartition] = committedSegment.length();
spills.add(spillInfo);
}
}
The figure below illustrates how the data is laid out in memory during this process.
由于UnsafeShuffleWriter
并沒有aggreate
和sort
操作,所以合并多個臨時文件中某一個partition
的數據就變得很簡單了,因為我們記錄了每個partition
的offset
private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
...
if (spills.length == 0) {
java.nio.file.Files.newOutputStream(outputFile.toPath()).close(); // Create an empty file
return new long[partitioner.numPartitions()];
} else if (spills.length == 1) {
Files.move(spills[0].file, outputFile);
return spills[0].partitionLengths;
} else {
final long[] partitionLengths;
if (fastMergeEnabled && fastMergeIsSupported) {
if (transferToEnabled && !encryptionEnabled) {
logger.debug("Using transferTo-based fast merge");
partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
} else {
logger.debug("Using fileStream-based fast merge");
partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
}
}
}
...
}
Comparison with SortShuffleWriter
- Data is kept in off-heap memory, which reduces GC overhead.
- Merging the spill files does not require deserializing them.
Trigger conditions
Let's first look at how SortShuffleManager decides which ShuffleWriter to use:
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
new BypassMergeSortShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
// Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
new SerializedShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
// Otherwise, buffer map outputs in a deserialized form:
new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
}
Bypass trigger conditions
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
// We cannot bypass sorting if we need to do map-side aggregation.
if (dep.mapSideCombine) {
require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
false
} else {
val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
dep.partitioner.numPartitions <= bypassMergeThreshold
}
}
1. The number of reduce-side partitions is no greater than spark.shuffle.sort.bypassMergeThreshold (200 by default).
2. There is no map-side combine.
UnsafeShuffleWriter trigger conditions
def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
val shufId = dependency.shuffleId
val numPartitions = dependency.partitioner.numPartitions
if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
s"${dependency.serializer.getClass.getName}, does not support object relocation")
false
} else if (dependency.aggregator.isDefined) {
log.debug(
s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
false
} else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
false
} else {
log.debug(s"Can use serialized shuffle for shuffle $shufId")
true
}
}
1. The Serializer supports relocation of serialized objects.
2. There is no map-side combine.
3. The number of reduce-side partitions is no greater than MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (2^24 = 16777216).
A driver sketch that meets these conditions is shown below.
SortShuffleWriter trigger conditions
If neither of the two writers above can be used, SortShuffleWriter is chosen.
Key points
- Why merge the shuffle intermediate files?
To reduce the number of file handles at read time. Each map task produces as many temporary files as there are reduce partitions, so with m map tasks and r reduce partitions there would be m * r files; when r is very large the executors would have to maintain a huge number of file handles. This was the problem with the HashShuffleWriter implementation, which had to read far too many files.
Notes
- This article is based on the latest master branch at the time of writing; Spark keeps evolving, so follow the analysis along with the code as it changes.
- Every source excerpt in this article keeps only the code that matters for the discussion and omits the rest; that does not mean the omitted code is unimportant.
Summary
1. A ShuffleWriter always produces files on disk.
2. At a high level, the ShuffleWriter phase groups the map-side data by the reduce-side partition (as decided by the Partitioner), writes it all into a single data file, and records each Partition's offset so that the reduce side can read its share.
- Future work
[SPARK-7271] Redesign shuffle interface for binary processing