Spark Source Code Walkthrough: Shuffle Writer

Abstract: Shuffle is the most time-consuming step in the MapReduce programming model. Spark splits the shuffle into two phases, Shuffle Write and Shuffle Read; this article walks through Spark's Shuffle Write implementation in detail.

ShuffleWriter

The Shuffle Write interface in Spark is org.apache.spark.shuffle.ShuffleWriter.

Let's look at the interface definition:

private[spark] abstract class ShuffleWriter[K, V] {

  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}

There are three implementations:

(Figure: the ShuffleWriter implementation classes)

BypassMergeSortShuffleWriter

For the discussion below, assume the first stage (the map stage) has m tasks and the second stage (the reduce stage) has r partitions.

The BypassMergeSortShuffleWriter write path can be broken down into three steps:

1. For each ShuffleMapTask (i.e. each map-side partition; one ShuffleMapTask processes one map-side partition), create r temporary files.
2. Iterate over the records of the map-side partition, group them by getPartition(key), and write each record into the temporary file of its partitionId.
3. Merge the r files produced in step 2 and write each partitionId's offset into an index file.

(Figure: BypassMergeSortShuffleWriter flow)

Key code walkthrough

public void write(Iterator<Product2<K, V>> records) throws IOException {
  ...
  // Create one DiskBlockObjectWriter per downstream (reduce-side) partition
  partitionWriters = new DiskBlockObjectWriter[numPartitions];
  partitionWriterSegments = new FileSegment[numPartitions];
  for (int i = 0; i < numPartitions; i++) {
    final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = tempShuffleBlockIdPlusFile._2();
    final BlockId blockId = tempShuffleBlockIdPlusFile._1();
    partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
  }

  // Use `getPartition(key)` to find the reduce-side partitionId of each record and write it to that partition's temporary file
  while (records.hasNext()) {
    final Product2<K, V> record = records.next();
    final K key = record._1();
    partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  }

  for (int i = 0; i < numPartitions; i++) {
    final DiskBlockObjectWriter writer = partitionWriters[i];
    partitionWriterSegments[i] = writer.commitAndGet();
    writer.close();
  }

  File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
  File tmp = Utils.tempFileWith(output);
  try {
    // Merge the per-partition temporary files into the `shuffle_${shuffleId}_${mapId}_${reduceId}.data` file
    partitionLengths = writePartitionedFile(tmp);
    // Write each partition's offset into the `shuffle_${shuffleId}_${mapId}_${reduceId}.index` file
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
    }
  }
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

1. The default Partitioner implementation is HashPartitioner (sketched below).
2. The default SerializerInstance implementation is JavaSerializerInstance.
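
Since every record is routed purely by Partitioner.getPartition, here is a minimal sketch of how HashPartitioner-style routing maps a key to a reduce partition: a non-negative hashCode modulo the partition count. This is a simplified stand-in for illustration, not the Spark class itself.

class SimpleHashPartitioner(numPartitions: Int) {
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      // Non-negative modulo so that negative hash codes still land in [0, numPartitions)
      val mod = key.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}

In the write loop above, partitionWriters[partitioner.getPartition(key)] is exactly this routing step.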

FileSegment

Each intermediate temporary file produced by BypassMergeSortShuffleWriter is described by a FileSegment:

class FileSegment(val file: File, val offset: Long, val length: Long)

file records the physical file and length records its size; the lengths are used to write the index file when the FileSegments are merged.
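
The index file written later is just the running total of these segment lengths. A tiny illustration, using hypothetical sizes for the r = 3 segments of one map task:

// Hypothetical FileSegment lengths for one map task with r = 3 reduce partitions
val lengths = Array(120L, 0L, 340L)
// The index file stores r + 1 longs: 0, then the cumulative offset after each partition
val offsets = lengths.scanLeft(0L)(_ + _)
println(offsets.mkString(", ")) // 0, 120, 120, 460

A reducer that wants partition i then reads the byte range [offsets(i), offsets(i + 1)) from the data file.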

Now let's look at writePartitionedFile, the method that merges the temporary files:

private long[] writePartitionedFile(File outputFile) throws IOException {
  final long[] lengths = new long[numPartitions];
  ...
  final FileChannel out = FileChannel.open(outputFile.toPath(), WRITE, APPEND, CREATE);
  try {
    for (int i = 0; i < numPartitions; i++) {
      final File file = partitionWriterSegments[i].file();
      if (file.exists()) {
        final FileChannel in = FileChannel.open(file.toPath(), READ);
        try {
          long size = in.size();
          // The key merge step: NIO's transferTo makes copying the file streams efficient
          Utils.copyFileStreamNIO(in, out, 0, size);
          lengths[i] = size;
        } finally {
          in.close();
        }
      }
    }
  } finally {
    out.close();
  }
  partitionWriters = null;
  // Return each segment's size, used to write the index file
  return lengths;
}

The index file is written by writeIndexFileAndCommit:

def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Int,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  val indexFile = getIndexFile(shuffleId, mapId)
  val indexTmp = Utils.tempFileWith(indexFile)
  try {
    val out = new DataOutputStream(
      new BufferedOutputStream(Files.newOutputStream(indexTmp.toPath)))
    Utils.tryWithSafeFinally {
      var offset = 0L
      out.writeLong(offset)
      for (length <- lengths) {
        offset += length
        out.writeLong(offset)
      }
    } {
      out.close()
    }
  }
  ...
}

NOTE: 1. File merging uses Java NIO's transferTo to make merging the file streams more efficient (see the sketch below).
2. See the BypassMergeSortShuffleWriter source for the complete code.
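
For readers unfamiliar with the NIO call, the sketch below shows the kind of channel-to-channel copy the merge relies on. It is a simplified stand-in in the spirit of Utils.copyFileStreamNIO, not Spark's utility itself.

import java.nio.channels.FileChannel
import java.nio.file.Paths
import java.nio.file.StandardOpenOption.{APPEND, CREATE, READ, WRITE}

// Simplified channel-to-channel append using FileChannel.transferTo.
def appendFileNIO(src: String, dst: String): Long = {
  val in  = FileChannel.open(Paths.get(src), READ)
  val out = FileChannel.open(Paths.get(dst), WRITE, APPEND, CREATE)
  try {
    val size = in.size()
    var pos = 0L
    while (pos < size) {
      // transferTo may copy fewer bytes than requested, so loop until everything is written
      pos += in.transferTo(pos, size - pos, out)
    }
    size
  } finally {
    in.close()
    out.close()
  }
}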

BypassMergeSortShuffleWriter Example

Let's look at how BypassMergeSortShuffleWriter works through the following example.

(Figure: BypassMergeSortShuffleWriter example)

1. In real workloads the data within a partition is usually unordered. The data in this example happens to be ordered; do not conclude from it that BypassMergeSortShuffleWriter sorts the data for us.
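
As a quick way to experiment with this writer, the driver sketch below keeps the number of reduce partitions well below spark.shuffle.sort.bypassMergeThreshold and performs no map-side combine, so the shuffle takes the bypass path. The app name and data are arbitrary.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Few output partitions + no map-side combine => BypassMergeSortShuffleWriter.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("bypass example")
val sc = new SparkContext(conf)

sc.parallelize(1 to 100, numSlices = 2)
  .map(x => (x % 3, x))                // no aggregator involved
  .partitionBy(new HashPartitioner(3)) // 3 <= bypassMergeThreshold (default 200)
  .collect()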

SortShuffleWriter

Prerequisites:

  • org.apache.spark.util.collection.AppendOnlyMap
  • org.apache.spark.util.collection.PartitionedPairBuffer
  • TimSorter
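
As background for insertAll below, the key operation of AppendOnlyMap is changeValue(key, updateFunc), which either creates a value or merges into the existing one in place. A rough sketch of that contract follows, using a plain mutable map rather than Spark's open-addressing implementation.

import scala.collection.mutable

// Rough sketch of the AppendOnlyMap.changeValue contract.
class TinyAppendOnlyMap[K, V] {
  private val data = mutable.Map.empty[K, V]

  // updateFunc(hadValue, oldValue) returns the new value, just like the `update`
  // closure that ExternalSorter.insertAll passes to map.changeValue below.
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    val newValue = data.get(key) match {
      case Some(old) => updateFunc(true, old)
      case None      => updateFunc(false, null.asInstanceOf[V])
    }
    data(key) = newValue
    newValue
  }
}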

The SortShuffleWriter.write() implementation

Let's start with the concrete implementation of write:

override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)

  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val tmp = Utils.tempFileWith(output)
  try {
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}

The SortShuffleWriter write path roughly consists of two steps: first insertAll, then merging the SpilledFiles that were spilled to disk.

ExternalSorter can be understood in four steps:

  • Depending on whether a combine is needed, the in-memory buffer is either a PartitionedAppendOnlyMap or a PartitionedPairBuffer. In both structures the data is sorted first by partitionId and then, within each partition, by key.
  • When the buffered data reaches the memory limit or the record-count limit, it is spilled to disk; each SpilledFile records how many records each partition holds.
  • When an iterator or a file is requested, all SpilledFiles and the in-memory data that has not been spilled are merged.
  • Finally, stop is called to delete the temporary files.

The implementation of ExternalSorter.insertAll:

def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  val shouldCombine = aggregator.isDefined
  // Whether map-side combine is needed depends on whether an aggregator is defined
  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    // Corresponds to the seqOp parameter of rdd.aggregateByKey
    val mergeValue = aggregator.get.mergeValue
    // Corresponds to the zeroValue parameter of rdd.aggregateByKey; zeroValue is used to create the combiner
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      addElementsRead()
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      maybeSpillCollection(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}

One thing to note: the key written into the map/buffer is the composite (partitionId, key), because the data in a temporary file must be sorted first by partitionId and then by key.
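
A sketch of what "partitionId first, then key" means for the composite (partitionId, key) entries; the shape follows the comparator used by Spark's partitioned collections, simplified here for illustration.

import java.util.Comparator

// Order (partitionId, key) pairs by partition first, then by the given key comparator.
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val byPartition = Integer.compare(a._1, b._1)
      if (byPartition != 0) byPartition else keyComparator.compare(a._2, b._2)
    }
  }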

When spills happen

A spill is triggered when either of the following two conditions is met.

  • 1. Every 32 elements, sample and check whether the current estimated memory usage exceeds myMemoryThreshold, i.e. currentMemory >= myMemoryThreshold; currentMemory is obtained by estimating the current size of the map/buffer.
  • 2. Check whether the number of records in the in-memory structure exceeds the forced-spill threshold, i.e. _elementsRead > numElementsForceSpillThreshold. The forced-spill threshold is controlled by spark.shuffle.spill.numElementsForceSpillThreshold in SparkConf.

private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    // Estimate the map's current size in memory
    estimatedSize = map.estimateSize()
    if (maybeSpill(map, estimatedSize)) {
      // If the in-memory data was spilled to disk, reset the map
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    // Estimate the buffer's current size in memory
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
      // Same as the map case: reset the buffer after a spill
      buffer = new PartitionedPairBuffer[K, C]
    }
  }

  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    // Spill the in-memory collection to disk
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}

The spill process

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  // Sort the in-memory data with TimSort
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  // Write the in-memory data to disk
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  // Add the new spill file to the spills array
  spills += spillFile
}

To sum up insertAll: records are written into the in-memory structure (PartitionedPairBuffer or PartitionedAppendOnlyMap) while continuously checking whether a spill condition is met; each spill produces one temporary file on disk.

Reading the SpilledFiles

The data in a SpilledFile is sorted by (partitionId, recordKey), and for each partition we record how many records it contains, so reading a given partition's data out of a SpilledFile is straightforward.

The class that reads a SpilledFile is SpillReader.

The merge process

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}

The merge is the more involved part of the process; its behavior depends on whether the current shuffle has an aggregator and/or an ordering. We go through each case below.

no aggregator or sorter

partitionBy

case class TestIntKey(i: Int)
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", 4.toString)
val sc = new SparkContext(conf)
val testData = (1 to 100).toList
sc.parallelize(testData, 1)
  .map(x => {
    (TestIntKey(x % 3), x)
  }).partitionBy(new HashPartitioner(3)).collect()

(Figure: partitionBy flow)

no aggregator but sorter

This part is easy to get confused about: it is tempting to assume that sortByKey is the "no aggregator but with ordering" case, yet SortShuffleWriter initializes ExternalSorter with ordering = None, as the code shows:

sorter = if (dep.mapSideCombine) {
  ...
} else {
  new ExternalSorter[K, V, V](
    context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}

NOTE: the ordering logic of sortByKey is carried out in the Shuffle Read phase, which we will cover in a later article.

Still, let's take a quick look at how mergeSort is implemented. Within a SpilledFile, each partition's data is already sorted by record key, so we only need to take each SpilledFile's iterator for that partition and merge them, repeatedly picking the smallest head:

private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
    : Iterator[Product2[K, C]] =
{
  // NOTE:(fchen) put all the iterators for this partition into a priority queue; each time we take an element, compare the heads of the iterators and return the smallest
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
    override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
  })
  heap.enqueue(bufferedIters: _*)  // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = !heap.isEmpty

    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        // Put the iterator back into the priority queue
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}

Let's walk through the whole mergeSort process with the following example:

(Figure: mergeSort)

As the figure shows, one partition's data, scattered across multiple SpilledFiles, becomes a single iterator sorted by record key after mergeSort.
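
To make the priority-queue merge concrete, here is a self-contained run of the same idea on three already-sorted iterators, standing in for one partition's data from three SpilledFiles (plain Ints instead of key/value records):

import java.util.Comparator
import scala.collection.mutable

// Self-contained k-way merge in the same style as mergeSort above.
def kWayMerge[T](iters: Seq[Iterator[T]], cmp: Comparator[T]): Iterator[T] = {
  type Buf = BufferedIterator[T]
  // Dequeue order: the iterator whose head is smallest according to `cmp`.
  val heap = mutable.PriorityQueue.empty[Buf](
    Ordering.fromLessThan[Buf]((x, y) => cmp.compare(x.head, y.head) > 0))
  heap.enqueue(iters.filter(_.hasNext).map(_.buffered): _*)
  new Iterator[T] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): T = {
      val first = heap.dequeue()
      val value = first.next()
      if (first.hasNext) heap.enqueue(first) // put the iterator back if it has more data
      value
    }
  }
}

// Three "SpilledFile" iterators for one partition, each already sorted:
val merged = kWayMerge(Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9)), Ordering.Int)
println(merged.toList) // List(1, 2, 3, 4, 5, 6, 7, 8, 9)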

aggregator, but no sorter

reduceByKey

if (!totalOrder) {
  new Iterator[Iterator[Product2[K, C]]] {
    val sorted = mergeSort(iterators, comparator).buffered

    // Buffers reused across elements to decrease memory allocation
    val keys = new ArrayBuffer[K]
    val combiners = new ArrayBuffer[C]

    override def hasNext: Boolean = sorted.hasNext

    override def next(): Iterator[Product2[K, C]] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      keys.clear()
      combiners.clear()
      val firstPair = sorted.next()
      keys += firstPair._1
      combiners += firstPair._2
      val key = firstPair._1
      while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
        val pair = sorted.next()
        var i = 0
        var foundKey = false
        while (i < keys.size && !foundKey) {
          if (keys(i) == pair._1) {
            combiners(i) = mergeCombiners(combiners(i), pair._2)
            foundKey = true
          }
          i += 1
        }
        if (!foundKey) {
          keys += pair._1
          combiners += pair._2
        }
      }

      keys.iterator.zip(combiners.iterator)
    }
  }.flatMap(i => i)
}

At this point you may wonder: why does the key storage need an ArrayBuffer? Because when there is no total ordering, the comparator only compares key hash codes, so different keys that collide on hash end up in the same group; within a group we therefore have to keep every distinct key together with its own combiner.

reduceByKey Example:

val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", (4).toString)
val sc = new SparkContext(conf)
val testData = (1 to 10).toList
val keys = Array("Aa", "BB")
sc.parallelize(testData, 1)
  .map(x => {
    (keys(x % 2), x)
  }).reduceByKey(_ + _, 3).collectPartitions().foreach(x => {
    x.foreach(y => {
      println(y._1 + "," + y._2)
    })
  })

The figure below shows the whole mergeWithAggregation process for reduceByKey when a hash collision occurs.

(Figure: reduceByKey with hash collisions)
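
A quick check explains why the example picks the keys "Aa" and "BB": they are the classic String hash collision, so the hash-based comparator treats them as equal even though they are different keys.

println("Aa".hashCode) // 2112
println("BB".hashCode) // 2112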

aggregator and sorter

Although this branch exists in the code, I have not found an operation that supplies both an aggregator and an ordering, so we only skim this logic here.

Merging the SpilledFiles

After the per-partition merge, the data and index files are written in exactly the same way as in BypassMergeSortShuffleWriter, so we will not repeat that here.

private[this] case class SpilledFile(
  file: File,
  blockId: BlockId,
  serializerBatchSizes: Array[Long],
  elementsPerPartition: Array[Long])

SortShuffleWriter summary

Records are serialized twice: once when writing the SpilledFiles, and once more when the SpilledFiles are merged.

UnsafeShuffleWriter

Both writers above do the shuffle write on the JVM heap. The drawback is clear: with large objects, JVM garbage collection performs poorly. This motivated an off-heap shuffle write, UnsafeShuffleWriter.

At a high level UnsafeShuffleWriter is designed much like SortShuffleWriter: the map-side data is sorted by the reduce-side partitionId, the in-memory records are spilled to disk once a limit is exceeded, and the spill files are finally merged into a single MapOutputFile, recording each partition's offset.

From the two on-heap shuffle write models above we can already see the common pattern: buffer the records, sort them by partitionId, spill, and merge. UnsafeShuffleWriter follows the same pattern, but keeps the buffered records in serialized form in off-heap memory.

Prerequisites

The paged memory management model

Implementation details

Before going into UnsafeShuffleWriter in detail, let's look at one building block, the PackedRecordPointer class.

final class PackedRecordPointer {
  ...
  public static long packPointer(long recordPointer, int partitionId) {
    final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
    final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
    return (((long) partitionId) << 40) | compressedAddress;
  }

  private long packedRecordPointer;

  public void set(long packedRecordPointer) {
    this.packedRecordPointer = packedRecordPointer;
  }

  public int getPartitionId() {
    return (int) ((packedRecordPointer & MASK_LONG_UPPER_24_BITS) >>> 40);
  }

  public long getRecordPointer() {
    final long pageNumber = (packedRecordPointer << 24) & MASK_LONG_UPPER_13_BITS;
    final long offsetInPage = packedRecordPointer & MASK_LONG_LOWER_27_BITS;
    return pageNumber | offsetInPage;
  }
}

PackedRecordPointer packs partitionId, pageNumber and offsetInPage into a single long. A long is 64 bits, and from the code we can read off the layout:

[ 24 bit partitionId ] [ 13 bit pageNumber] [ 27 bit offset in page]
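
A minimal sketch of packing and unpacking that 24/13/27-bit layout (illustrative only; Spark's PackedRecordPointer works on an already-encoded record pointer, as the code above shows):

// Illustrative 24-bit partitionId / 13-bit pageNumber / 27-bit offsetInPage packing.
object PackedPointerDemo {
  private val Mask27 = (1L << 27) - 1
  private val Mask13 = (1L << 13) - 1

  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long =
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | (offsetInPage & Mask27)

  def partitionId(packed: Long): Int   = (packed >>> 40).toInt
  def pageNumber(packed: Long): Int    = ((packed >>> 27) & Mask13).toInt
  def offsetInPage(packed: Long): Long = packed & Mask27

  def main(args: Array[String]): Unit = {
    val p = pack(partitionId = 5, pageNumber = 3, offsetInPage = 1024L)
    println((partitionId(p), pageNumber(p), offsetInPage(p))) // (5,3,1024)
  }
}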

The insertRecord method:

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
  throws IOException {
  // If the number of records buffered in memory reaches the forced-spill threshold, spill first
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    spill();
  }

  growPointerArrayIfNecessary();
  // Need 4 bytes to store the record length.
  final int required = length + 4;
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  final Object base = currentPage.getBaseObject();
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  Platform.putInt(base, pageCursor, length);
  pageCursor += 4;
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  pageCursor += length;
  inMemSorter.insertRecord(recordAddress, partitionId);
}

A spill is simply writing a file, i.e. calling writeSortedFile:

private void writeSortedFile(boolean isLastFile) {
  ...
  // Sort inMemSorter, i.e. the PackedRecordPointers, by partitionId
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
    inMemSorter.getSortedIterator();

  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempShuffleBlock();
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  final DiskBlockObjectWriter writer =
    blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);

  int currentPartition = -1;
  while (sortedRecords.hasNext()) {
    sortedRecords.loadNext();
    final int partition = sortedRecords.packedRecordPointer.getPartitionId();
    if (partition != currentPartition) {
      // Switch to the new partition
      if (currentPartition != -1) {
        final FileSegment fileSegment = writer.commitAndGet();
        spillInfo.partitionLengths[currentPartition] = fileSegment.length();
      }
      currentPartition = partition;
    }

    final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
    final Object recordPage = taskMemoryManager.getPage(recordPointer);
    final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
    int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
    long recordReadPosition = recordOffsetInPage + 4; // skip over record length
    while (dataRemaining > 0) {
      final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
      Platform.copyMemory(
        recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
      writer.write(writeBuffer, 0, toTransfer);
      recordReadPosition += toTransfer;
      dataRemaining -= toTransfer;
    }
    writer.recordWritten();
  }

  final FileSegment committedSegment = writer.commitAndGet();
  writer.close();
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }
}

The figure below shows how the data flows through memory:

(Figure: ShuffleExternalSorter)

由于UnsafeShuffleWriter并沒有aggreatesort操作,所以合并多個臨時文件中某一個partition的數據就變得很簡單了,因為我們記錄了每個partitionoffset

private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
  ...
  if (spills.length == 0) {
    java.nio.file.Files.newOutputStream(outputFile.toPath()).close(); // Create an empty file
    return new long[partitioner.numPartitions()];
  } else if (spills.length == 1) {
    Files.move(spills[0].file, outputFile);
    return spills[0].partitionLengths;
  } else {
    final long[] partitionLengths;
    if (fastMergeEnabled && fastMergeIsSupported) {
      if (transferToEnabled && !encryptionEnabled) {
        logger.debug("Using transferTo-based fast merge");
        partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
      } else {
        logger.debug("Using fileStream-based fast merge");
        partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
      }
    }
  }
  ...
}

Comparison with SortShuffleWriter

  • Data is kept in off-heap memory, reducing GC overhead.
  • Merging the spill files does not require deserializing them.

Trigger conditions

Let's first look at how SortShuffleManager decides which ShuffleWriter to use:

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

Bypass trigger conditions

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    false
  } else {
    val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

1. The number of reduce-side partitions is at most spark.shuffle.sort.bypassMergeThreshold (200 by default).
2. There is no map-side combine.

UnsafeShuffleWriter trigger conditions

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
  } else if (dependency.aggregator.isDefined) {
    log.debug(
      s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
    false
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}

1. The Serializer supports relocation of serialized objects.
2. There is no map-side combine.
3. The number of reduce-side partitions is at most MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (2^24, since the partitionId is stored in 24 bits).

SortShuffleWriter trigger conditions
If neither of the two writers above can be used, SortShuffleWriter is chosen.

Key points

  • Why do the shuffle intermediate results need to be merged?

To reduce the number of file handles needed on the read side. One map-side partition produces as many temporary files as there are reduce partitions, so with a very large reducer count the executor would have to maintain an enormous number of file handles. This was the problem with the HashShuffleWriter implementation, which had to read far too many files.
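
A back-of-the-envelope comparison, reusing the m map tasks and r reduce partitions notation from earlier (the numbers below are made up for illustration):

// Illustrative file counts for m map tasks and r reduce partitions.
val m = 1000
val r = 2000
val unmergedFiles = m.toLong * r // one temporary file per (map task, reduce partition)
val mergedFiles   = m.toLong * 2 // one .data file + one .index file per map task
println(s"without merging: $unmergedFiles files, with merging: $mergedFiles files")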

Notes

  • This article is based on the master branch at the time of writing; Spark keeps evolving, so re-check the analysis against the version you use.
  • Every source snippet keeps only the code that matters for the discussion and omits most of the code that is irrelevant to the analysis; that does not mean the omitted code is unimportant.

Summary

1. A ShuffleWriter always produces files on disk.
2. At a high level, the Shuffle Write phase groups the map-side data by reduce-side partition according to the Partitioner, writes it all into a single data file, and records each partition's offset so that the reduce side can read it.

  • Future work

[SPARK-7271] Redesign shuffle interface for binary processing
