Spark Source Code Walkthrough: Shuffle Writer

Abstract: Shuffle is the most time-consuming step in the MapReduce programming model. Spark splits the shuffle into two phases, Shuffle Write and Shuffle Read; this article walks through Spark's Shuffle Write implementation in detail.

ShuffleWriter

The Shuffle Write interface in Spark is org.apache.spark.shuffle.ShuffleWriter.

Let's look at the interface definition:

private[spark] abstract class ShuffleWriter[K, V] {

  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}

There are three implementations:

(Figure: the ShuffleWriter implementation classes)

BypassMergeSortShuffleWriter

For the discussion below, assume the first stage (the map stage) has m tasks and the second stage (the reduce stage) has r partitions.

The BypassMergeSortShuffleWriter write path can be broken down into three steps:

1. For each ShuffleMapTask (i.e. each map-side partition; one ShuffleMapTask processes one map-side partition), create r temporary files.
2. Iterate over the records of the map-side partition, group them by getPartition(key), and write each record into the temporary file of its partitionId.
3. Merge the r files produced in step 2 and write each partitionId's offset into an index file.

(Figure: BypassMergeSortShuffleWriter flow)

Key code walkthrough

public void write(Iterator<Product2<K, V>> records) throws IOException {
  ...
  // Create one DiskBlockObjectWriter per downstream (reduce-side) partition
  partitionWriters = new DiskBlockObjectWriter[numPartitions];
  partitionWriterSegments = new FileSegment[numPartitions];
  for (int i = 0; i < numPartitions; i++) {
    final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = tempShuffleBlockIdPlusFile._2();
    final BlockId blockId = tempShuffleBlockIdPlusFile._1();
    partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
  }

  // Use `getPartition(key)` to find the reduce-side partitionId of each record and write it to that partition's temporary file
  while (records.hasNext()) {
    final Product2<K, V> record = records.next();
    final K key = record._1();
    partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  }

  for (int i = 0; i < numPartitions; i++) {
    final DiskBlockObjectWriter writer = partitionWriters[i];
    partitionWriterSegments[i] = writer.commitAndGet();
    writer.close();
  }

  File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
  File tmp = Utils.tempFileWith(output);
  try {
    // Merge the per-partition temporary files into the `shuffle_${shuffleId}_${mapId}_${reduceId}.data` file
    partitionLengths = writePartitionedFile(tmp);
    // Write each partition's offset into the `shuffle_${shuffleId}_${mapId}_${reduceId}.index` file
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
    }
  }
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

1. The default Partitioner implementation is HashPartitioner (sketched below).
2. The default SerializerInstance implementation is JavaSerializerInstance.
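
Since every record is routed purely by Partitioner.getPartition, here is a minimal sketch of how HashPartitioner-style routing maps a key to a reduce partition: a non-negative hashCode modulo the partition count. This is a simplified stand-in for illustration, not the Spark class itself.

class SimpleHashPartitioner(numPartitions: Int) {
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      // Non-negative modulo so that negative hash codes still land in [0, numPartitions)
      val mod = key.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}

In the write loop above, partitionWriters[partitioner.getPartition(key)] is exactly this routing step.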

FileSegment

Each intermediate temporary file produced by BypassMergeSortShuffleWriter is described by a FileSegment:

class FileSegment(val file: File, val offset: Long, val length: Long)

file records the physical file and length records its size; the lengths are used to write the index file when the FileSegments are merged.
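
The index file written later is just the running total of these segment lengths. A tiny illustration, using hypothetical sizes for the r = 3 segments of one map task:

// Hypothetical FileSegment lengths for one map task with r = 3 reduce partitions
val lengths = Array(120L, 0L, 340L)
// The index file stores r + 1 longs: 0, then the cumulative offset after each partition
val offsets = lengths.scanLeft(0L)(_ + _)
println(offsets.mkString(", ")) // 0, 120, 120, 460

A reducer that wants partition i then reads the byte range [offsets(i), offsets(i + 1)) from the data file.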

Now let's look at writePartitionedFile, the method that merges the temporary files:

private long[] writePartitionedFile(File outputFile) throws IOException {
  final long[] lengths = new long[numPartitions];
  ...
  final FileChannel out = FileChannel.open(outputFile.toPath(), WRITE, APPEND, CREATE);
  try {
    for (int i = 0; i < numPartitions; i++) {
      final File file = partitionWriterSegments[i].file();
      if (file.exists()) {
        final FileChannel in = FileChannel.open(file.toPath(), READ);
        try {
          long size = in.size();
          // The key merge step: NIO's transferTo makes copying the file streams efficient
          Utils.copyFileStreamNIO(in, out, 0, size);
          lengths[i] = size;
        } finally {
          in.close();
        }
      }
    }
  } finally {
    out.close();
  }
  partitionWriters = null;
  // Return each segment's size, used to write the index file
  return lengths;
}

The index file is written by writeIndexFileAndCommit:

def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Int,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  val indexFile = getIndexFile(shuffleId, mapId)
  val indexTmp = Utils.tempFileWith(indexFile)
  try {
    val out = new DataOutputStream(
      new BufferedOutputStream(Files.newOutputStream(indexTmp.toPath)))
    Utils.tryWithSafeFinally {
      var offset = 0L
      out.writeLong(offset)
      for (length <- lengths) {
        offset += length
        out.writeLong(offset)
      }
    } {
      out.close()
    }
  }
  ...
}

NOTE: 1. File merging uses Java NIO's transferTo to make merging the file streams more efficient (see the sketch below).
2. See the BypassMergeSortShuffleWriter source for the complete code.
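
For readers unfamiliar with the NIO call, the sketch below shows the kind of channel-to-channel copy the merge relies on. It is a simplified stand-in in the spirit of Utils.copyFileStreamNIO, not Spark's utility itself.

import java.nio.channels.FileChannel
import java.nio.file.Paths
import java.nio.file.StandardOpenOption.{APPEND, CREATE, READ, WRITE}

// Simplified channel-to-channel append using FileChannel.transferTo.
def appendFileNIO(src: String, dst: String): Long = {
  val in  = FileChannel.open(Paths.get(src), READ)
  val out = FileChannel.open(Paths.get(dst), WRITE, APPEND, CREATE)
  try {
    val size = in.size()
    var pos = 0L
    while (pos < size) {
      // transferTo may copy fewer bytes than requested, so loop until everything is written
      pos += in.transferTo(pos, size - pos, out)
    }
    size
  } finally {
    in.close()
    out.close()
  }
}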

BypassMergeSortShuffleWriter Example

Let's look at how BypassMergeSortShuffleWriter works through the following example.

(Figure: BypassMergeSortShuffleWriter example)

1. In real workloads the data within a partition is usually unordered. The data in this example happens to be ordered; do not conclude from it that BypassMergeSortShuffleWriter sorts the data for us.
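
As a quick way to experiment with this writer, the driver sketch below keeps the number of reduce partitions well below spark.shuffle.sort.bypassMergeThreshold and performs no map-side combine, so the shuffle takes the bypass path. The app name and data are arbitrary.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Few output partitions + no map-side combine => BypassMergeSortShuffleWriter.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("bypass example")
val sc = new SparkContext(conf)

sc.parallelize(1 to 100, numSlices = 2)
  .map(x => (x % 3, x))                // no aggregator involved
  .partitionBy(new HashPartitioner(3)) // 3 <= bypassMergeThreshold (default 200)
  .collect()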

SortShuffleWriter

Prerequisites:

  • org.apache.spark.util.collection.AppendOnlyMap
  • org.apache.spark.util.collection.PartitionedPairBuffer
  • TimSorter
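
As background for insertAll below, the key operation of AppendOnlyMap is changeValue(key, updateFunc), which either creates a value or merges into the existing one in place. A rough sketch of that contract follows, using a plain mutable map rather than Spark's open-addressing implementation.

import scala.collection.mutable

// Rough sketch of the AppendOnlyMap.changeValue contract.
class TinyAppendOnlyMap[K, V] {
  private val data = mutable.Map.empty[K, V]

  // updateFunc(hadValue, oldValue) returns the new value, just like the `update`
  // closure that ExternalSorter.insertAll passes to map.changeValue below.
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    val newValue = data.get(key) match {
      case Some(old) => updateFunc(true, old)
      case None      => updateFunc(false, null.asInstanceOf[V])
    }
    data(key) = newValue
    newValue
  }
}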

The SortShuffleWriter.write() implementation

Let's start with the concrete implementation of write:

override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)

  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val tmp = Utils.tempFileWith(output)
  try {
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}

The SortShuffleWriter write path roughly consists of two steps: first insertAll, then merging the SpilledFiles that were spilled to disk.

ExternalSorter can be understood in four steps:

  • Depending on whether a combine is needed, the in-memory buffer is either a PartitionedAppendOnlyMap or a PartitionedPairBuffer. In both structures the data is sorted first by partitionId and then, within each partition, by key.
  • When the buffered data reaches the memory limit or the record-count limit, it is spilled to disk; each SpilledFile records how many records each partition holds.
  • When an iterator or a file is requested, all SpilledFiles and the in-memory data that has not been spilled are merged.
  • Finally, stop is called to delete the temporary files.

The implementation of ExternalSorter.insertAll:

def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  val shouldCombine = aggregator.isDefined
  // Whether map-side combine is needed depends on whether an aggregator is defined
  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    // Corresponds to the seqOp parameter of rdd.aggregateByKey
    val mergeValue = aggregator.get.mergeValue
    // Corresponds to the zeroValue parameter of rdd.aggregateByKey; zeroValue is used to create the combiner
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      addElementsRead()
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      maybeSpillCollection(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}

One thing to note: the key written into the map/buffer is the composite (partitionId, key), because the data in a temporary file must be sorted first by partitionId and then by key.
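
A sketch of what "partitionId first, then key" means for the composite (partitionId, key) entries; the shape follows the comparator used by Spark's partitioned collections, simplified here for illustration.

import java.util.Comparator

// Order (partitionId, key) pairs by partition first, then by the given key comparator.
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val byPartition = Integer.compare(a._1, b._1)
      if (byPartition != 0) byPartition else keyComparator.compare(a._2, b._2)
    }
  }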

When spills happen

A spill is triggered when either of the following two conditions is met.

  • 1. Every 32 elements, sample and check whether the current estimated memory usage exceeds myMemoryThreshold, i.e. currentMemory >= myMemoryThreshold; currentMemory is obtained by estimating the current size of the map/buffer.
  • 2. Check whether the number of records in the in-memory structure exceeds the forced-spill threshold, i.e. _elementsRead > numElementsForceSpillThreshold. The forced-spill threshold is controlled by spark.shuffle.spill.numElementsForceSpillThreshold in SparkConf.

private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    // Estimate the map's current size in memory
    estimatedSize = map.estimateSize()
    if (maybeSpill(map, estimatedSize)) {
      // If the in-memory data was spilled to disk, reset the map
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    // Estimate the buffer's current size in memory
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
      // Same as the map case: reset the buffer after a spill
      buffer = new PartitionedPairBuffer[K, C]
    }
  }

  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    // Spill the in-memory collection to disk
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}

The spill process

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  // Sort the in-memory data with TimSort
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  // Write the in-memory data to disk
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  // Add the new spill file to the spills array
  spills += spillFile
}

To sum up insertAll: records are written into the in-memory structure (PartitionedPairBuffer or PartitionedAppendOnlyMap) while continuously checking whether a spill condition is met; each spill produces one temporary file on disk.

Reading the SpilledFiles

The data in a SpilledFile is sorted by (partitionId, recordKey), and for each partition we record how many records it contains, so reading a given partition's data out of a SpilledFile is straightforward.

The class that reads a SpilledFile is SpillReader.

The merge process

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}

The merge is the more involved part of the process; its behavior depends on whether the current shuffle has an aggregator and/or an ordering. We go through each case below.

no aggregator or sorter

partitionBy

case class TestIntKey(i: Int)
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", 4.toString)
val sc = new SparkContext(conf)
val testData = (1 to 100).toList
sc.parallelize(testData, 1)
  .map(x => {
    (TestIntKey(x % 3), x)
  }).partitionBy(new HashPartitioner(3)).collect()

(Figure: partitionBy flow)

no aggregator but sorter

This part is easy to get confused about: it is tempting to assume that sortByKey is the "no aggregator but with ordering" case, yet SortShuffleWriter initializes ExternalSorter with ordering = None, as the code shows:

sorter = if (dep.mapSideCombine) {
  ...
} else {
  new ExternalSorter[K, V, V](
    context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}

NOTE: the ordering logic of sortByKey is carried out in the Shuffle Read phase, which we will cover in a later article.

Still, let's take a quick look at how mergeSort is implemented. Within a SpilledFile, each partition's data is already sorted by record key, so we only need to take each SpilledFile's iterator for that partition and merge them, repeatedly picking the smallest head:

private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
    : Iterator[Product2[K, C]] =
{
  // NOTE:(fchen) put all the iterators for this partition into a priority queue; each time we take an element, compare the heads of the iterators and return the smallest
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
    override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
  })
  heap.enqueue(bufferedIters: _*)  // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = !heap.isEmpty

    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        // Put the iterator back into the priority queue
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}

Let's walk through the whole mergeSort process with the following example:

(Figure: mergeSort)

As the figure shows, one partition's data, scattered across multiple SpilledFiles, becomes a single iterator sorted by record key after mergeSort.
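
To make the priority-queue merge concrete, here is a self-contained run of the same idea on three already-sorted iterators, standing in for one partition's data from three SpilledFiles (plain Ints instead of key/value records):

import java.util.Comparator
import scala.collection.mutable

// Self-contained k-way merge in the same style as mergeSort above.
def kWayMerge[T](iters: Seq[Iterator[T]], cmp: Comparator[T]): Iterator[T] = {
  type Buf = BufferedIterator[T]
  // Dequeue order: the iterator whose head is smallest according to `cmp`.
  val heap = mutable.PriorityQueue.empty[Buf](
    Ordering.fromLessThan[Buf]((x, y) => cmp.compare(x.head, y.head) > 0))
  heap.enqueue(iters.filter(_.hasNext).map(_.buffered): _*)
  new Iterator[T] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): T = {
      val first = heap.dequeue()
      val value = first.next()
      if (first.hasNext) heap.enqueue(first) // put the iterator back if it has more data
      value
    }
  }
}

// Three "SpilledFile" iterators for one partition, each already sorted:
val merged = kWayMerge(Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9)), Ordering.Int)
println(merged.toList) // List(1, 2, 3, 4, 5, 6, 7, 8, 9)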

aggregator, but no sorter

reduceByKey

if (!totalOrder) {
  new Iterator[Iterator[Product2[K, C]]] {
    val sorted = mergeSort(iterators, comparator).buffered

    // Buffers reused across elements to decrease memory allocation
    val keys = new ArrayBuffer[K]
    val combiners = new ArrayBuffer[C]

    override def hasNext: Boolean = sorted.hasNext

    override def next(): Iterator[Product2[K, C]] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      keys.clear()
      combiners.clear()
      val firstPair = sorted.next()
      keys += firstPair._1
      combiners += firstPair._2
      val key = firstPair._1
      while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
        val pair = sorted.next()
        var i = 0
        var foundKey = false
        while (i < keys.size && !foundKey) {
          if (keys(i) == pair._1) {
            combiners(i) = mergeCombiners(combiners(i), pair._2)
            foundKey = true
          }
          i += 1
        }
        if (!foundKey) {
          keys += pair._1
          combiners += pair._2
        }
      }

      keys.iterator.zip(combiners.iterator)
    }
  }.flatMap(i => i)
}

At this point you may wonder: why does the key storage need an ArrayBuffer? Because when there is no total ordering, the comparator only compares key hash codes, so different keys that collide on hash end up in the same group; within a group we therefore have to keep every distinct key together with its own combiner.

reduceByKey Example:

val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", (4).toString)
val sc = new SparkContext(conf)
val testData = (1 to 10).toList
val keys = Array("Aa", "BB")
sc.parallelize(testData, 1)
  .map(x => {
    (keys(x % 2), x)
  }).reduceByKey(_ + _, 3).collectPartitions().foreach(x => {
    x.foreach(y => {
      println(y._1 + "," + y._2)
    })
  })

The figure below shows the whole mergeWithAggregation process for reduceByKey when a hash collision occurs.

(Figure: reduceByKey with hash collisions)
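
A quick check explains why the example picks the keys "Aa" and "BB": they are the classic String hash collision, so the hash-based comparator treats them as equal even though they are different keys.

println("Aa".hashCode) // 2112
println("BB".hashCode) // 2112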

aggregator and sorter

Although this branch exists in the code, I have not found an operation that supplies both an aggregator and an ordering, so we only skim this logic here.

Merging the SpilledFiles

After the per-partition merge, the data and index files are written in exactly the same way as in BypassMergeSortShuffleWriter, so we will not repeat that here.

private[this] case class SpilledFile(
  file: File,
  blockId: BlockId,
  serializerBatchSizes: Array[Long],
  elementsPerPartition: Array[Long])

SortShuffleWriter summary

Records are serialized twice: once when writing the SpilledFiles, and once more when the SpilledFiles are merged.

UnsafeShuffleWriter

Both writers above do the shuffle write on the JVM heap. The drawback is clear: with large objects, JVM garbage collection performs poorly. This motivated an off-heap shuffle write, UnsafeShuffleWriter.

At a high level UnsafeShuffleWriter is designed much like SortShuffleWriter: the map-side data is sorted by the reduce-side partitionId, the in-memory records are spilled to disk once a limit is exceeded, and the spill files are finally merged into a single MapOutputFile, recording each partition's offset.

From the two on-heap shuffle write models above we can already see the common pattern: buffer the records, sort them by partitionId, spill, and merge. UnsafeShuffleWriter follows the same pattern, but keeps the buffered records in serialized form in off-heap memory.

Prerequisites

The paged memory management model

Implementation details

Before going into UnsafeShuffleWriter in detail, let's look at one building block, the PackedRecordPointer class.

final class PackedRecordPointer {
  ...
  public static long packPointer(long recordPointer, int partitionId) {
    final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
    final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
    return (((long) partitionId) << 40) | compressedAddress;
  }

  private long packedRecordPointer;

  public void set(long packedRecordPointer) {
    this.packedRecordPointer = packedRecordPointer;
  }

  public int getPartitionId() {
    return (int) ((packedRecordPointer & MASK_LONG_UPPER_24_BITS) >>> 40);
  }

  public long getRecordPointer() {
    final long pageNumber = (packedRecordPointer << 24) & MASK_LONG_UPPER_13_BITS;
    final long offsetInPage = packedRecordPointer & MASK_LONG_LOWER_27_BITS;
    return pageNumber | offsetInPage;
  }
}

PackedRecordPointer packs partitionId, pageNumber and offsetInPage into a single long. A long is 64 bits, and from the code we can read off the layout:

[ 24 bit partitionId ] [ 13 bit pageNumber] [ 27 bit offset in page]
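
A minimal sketch of packing and unpacking that 24/13/27-bit layout (illustrative only; Spark's PackedRecordPointer works on an already-encoded record pointer, as the code above shows):

// Illustrative 24-bit partitionId / 13-bit pageNumber / 27-bit offsetInPage packing.
object PackedPointerDemo {
  private val Mask27 = (1L << 27) - 1
  private val Mask13 = (1L << 13) - 1

  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long =
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | (offsetInPage & Mask27)

  def partitionId(packed: Long): Int   = (packed >>> 40).toInt
  def pageNumber(packed: Long): Int    = ((packed >>> 27) & Mask13).toInt
  def offsetInPage(packed: Long): Long = packed & Mask27

  def main(args: Array[String]): Unit = {
    val p = pack(partitionId = 5, pageNumber = 3, offsetInPage = 1024L)
    println((partitionId(p), pageNumber(p), offsetInPage(p))) // (5,3,1024)
  }
}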

The insertRecord method:

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
  throws IOException {
  // If the number of records buffered in memory reaches the forced-spill threshold, spill first
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    spill();
  }

  growPointerArrayIfNecessary();
  // Need 4 bytes to store the record length.
  final int required = length + 4;
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  final Object base = currentPage.getBaseObject();
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  Platform.putInt(base, pageCursor, length);
  pageCursor += 4;
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  pageCursor += length;
  inMemSorter.insertRecord(recordAddress, partitionId);
}

A spill is simply writing a file, i.e. calling writeSortedFile:

private void writeSortedFile(boolean isLastFile) {
  ...
  // Sort inMemSorter, i.e. the PackedRecordPointers, by partitionId
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
    inMemSorter.getSortedIterator();

  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempShuffleBlock();
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  final DiskBlockObjectWriter writer =
    blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);

  int currentPartition = -1;
  while (sortedRecords.hasNext()) {
    sortedRecords.loadNext();
    final int partition = sortedRecords.packedRecordPointer.getPartitionId();
    if (partition != currentPartition) {
      // Switch to the new partition
      if (currentPartition != -1) {
        final FileSegment fileSegment = writer.commitAndGet();
        spillInfo.partitionLengths[currentPartition] = fileSegment.length();
      }
      currentPartition = partition;
    }

    final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
    final Object recordPage = taskMemoryManager.getPage(recordPointer);
    final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
    int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
    long recordReadPosition = recordOffsetInPage + 4; // skip over record length
    while (dataRemaining > 0) {
      final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
      Platform.copyMemory(
        recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
      writer.write(writeBuffer, 0, toTransfer);
      recordReadPosition += toTransfer;
      dataRemaining -= toTransfer;
    }
    writer.recordWritten();
  }

  final FileSegment committedSegment = writer.commitAndGet();
  writer.close();
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }
}

The figure below shows how the data flows through memory:

(Figure: ShuffleExternalSorter)

由于UnsafeShuffleWriter并沒有aggreatesort操作,所以合并多個臨時文件中某一個partition的數據就變得很簡單了,因為我們記錄了每個partitionoffset

private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
  ...
  if (spills.length == 0) {
    java.nio.file.Files.newOutputStream(outputFile.toPath()).close(); // Create an empty file
    return new long[partitioner.numPartitions()];
  } else if (spills.length == 1) {
    Files.move(spills[0].file, outputFile);
    return spills[0].partitionLengths;
  } else {
    final long[] partitionLengths;
    if (fastMergeEnabled && fastMergeIsSupported) {
      if (transferToEnabled && !encryptionEnabled) {
        logger.debug("Using transferTo-based fast merge");
        partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
      } else {
        logger.debug("Using fileStream-based fast merge");
        partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
      }
    }
  }
  ...
}

Comparison with SortShuffleWriter

  • Data is kept in off-heap memory, reducing GC overhead.
  • Merging the spill files does not require deserializing them.

Trigger conditions

Let's first look at how SortShuffleManager decides which ShuffleWriter to use:

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

Bypass trigger conditions

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    false
  } else {
    val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

1. The number of reduce-side partitions is at most spark.shuffle.sort.bypassMergeThreshold (200 by default).
2. There is no map-side combine.

UnsafeShuffleWriter trigger conditions

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
  } else if (dependency.aggregator.isDefined) {
    log.debug(
      s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
    false
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}

1. The Serializer supports relocation of serialized objects.
2. There is no map-side combine.
3. The number of reduce-side partitions is at most MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (2^24, since the partitionId is stored in 24 bits).

SortShuffleWriter trigger conditions
If neither of the two writers above can be used, SortShuffleWriter is chosen.

Key points

  • Why do the shuffle intermediate results need to be merged?

To reduce the number of file handles needed on the read side. One map-side partition produces as many temporary files as there are reduce partitions, so with a very large reducer count the executor would have to maintain an enormous number of file handles. This was the problem with the HashShuffleWriter implementation, which had to read far too many files.
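
A back-of-the-envelope comparison, reusing the m map tasks and r reduce partitions notation from earlier (the numbers below are made up for illustration):

// Illustrative file counts for m map tasks and r reduce partitions.
val m = 1000
val r = 2000
val unmergedFiles = m.toLong * r // one temporary file per (map task, reduce partition)
val mergedFiles   = m.toLong * 2 // one .data file + one .index file per map task
println(s"without merging: $unmergedFiles files, with merging: $mergedFiles files")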

Notes

  • This article is based on the master branch at the time of writing; Spark keeps evolving, so re-check the analysis against the version you use.
  • Every source snippet keeps only the code that matters for the discussion and omits most of the code that is irrelevant to the analysis; that does not mean the omitted code is unimportant.

Summary

1. A ShuffleWriter always produces files on disk.
2. At a high level, the Shuffle Write phase groups the map-side data by reduce-side partition according to the Partitioner, writes it all into a single data file, and records each partition's offset so that the reduce side can read it.

  • Future work

[SPARK-7271] Redesign shuffle interface for binary processing
