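[SPARK][CORE] Interview Questions: A Walkthrough of the UnsafeShuffleWriter Flow (Part 2)

Follow the WeChat official account “Tim在路上”.
The Unsafe Shuffle implementation is, to a large extent, the main application of Tungsten's memory-management optimizations. Its flow is similar to that of SortShuffleWriter, but the data structures it maintains and operates on are different.

UnsafeShuffleWriter source code walkthrough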

@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
  // Keep track of success so we know if we encountered an exception
  // We do this rather than a standard try/catch/re-throw to handle
  // generic throwables.
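  // [1] Use success to record whether write() completed, so that an exception from the
  //     write phase can be told apart from one raised during cleanup
  boolean success = false;
  try {
    // [2] Iterate over all records and insert them into the ShuffleExternalSorter
    while (records.hasNext()) {
      insertRecordIntoSorter(records.next());
    }
    // [3] Close the sorter so that all data is written out to disk, and merge the spill files into one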
    closeAndWriteOutput();
    success = true;
  } finally {
    if (sorter != null) {
      try {
        // [4] Clean up and release resources
        sorter.cleanupResources();
      } catch (Exception e) {
        // Only throw this error if we won't be masking another
        // error.
        if (success) {
          throw e;
        } else {
          logger.error("In addition to a failure during writing, we failed during " +
                       "cleanup.", e);
        }
      }
    }
  }
}

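As the code above shows, the write path of UnsafeShuffleWriter is:

  • [1] Use success to record whether write() completed, so that an exception from the write phase can be told apart from one raised during cleanup
  • [2] Iterate over all records and insert them into the ShuffleExternalSorter
  • [3] Close the sorter so that all data is written out to disk, and merge the spill files into one
  • [4] Clean up and release resources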
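
The success flag in step [1] is easy to misread, so here is a minimal, self-contained sketch of the same control flow. It is not Spark code: SuccessFlagSketch and its boolean switches are hypothetical stand-ins for the write and cleanup phases, and the "log" list stands in for logger.error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a writer; not Spark code.
final class SuccessFlagSketch {
  /** Stands in for logger.error(...) so the suppressed error can be inspected. */
  static final List<String> log = new ArrayList<>();

  static void write(boolean failWrite, boolean failCleanup) {
    boolean success = false;
    try {
      if (failWrite) {
        throw new IllegalStateException("write failed");
      }
      success = true;
    } finally {
      try {
        if (failCleanup) {
          throw new IllegalStateException("cleanup failed");
        }
      } catch (RuntimeException e) {
        if (success) {
          // No other error is in flight, so the cleanup error is the one to report.
          throw e;
        } else {
          // A write error is already propagating; log instead of masking it.
          log.add("suppressed: " + e.getMessage());
        }
      }
    }
  }

  /** Returns the message of the exception that escapes write(), or "ok". */
  static String outcome(boolean failWrite, boolean failCleanup) {
    try {
      write(failWrite, failCleanup);
      return "ok";
    } catch (IllegalStateException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(outcome(false, false));
    System.out.println(outcome(true, true));
    System.out.println(outcome(false, true));
  }
}
```

When both phases fail, the caller sees the write error and the cleanup error is only logged. When only cleanup fails, the cleanup error is rethrown.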

// open() is called when UnsafeShuffleWriter is initialized. It creates the sorter and a byte-array output stream, and wraps that stream in a serialization stream
private void open() throws SparkException {
  assert (sorter == null);
  sorter = new ShuffleExternalSorter(
    memoryManager,
    blockManager,
    taskContext,
    initialSortBufferSize,
    partitioner.numPartitions(),
    sparkConf,
    writeMetrics);
  // MyByteArrayOutputStream is a thin wrapper around ByteArrayOutputStream that only exposes the internal byte[] array
  // DEFAULT_INITIAL_SER_BUFFER_SIZE is 1024 * 1024, so the buffer starts out at 1 MB
  serBuffer = new MyByteArrayOutputStream(DEFAULT_INITIAL_SER_BUFFER_SIZE);
  serOutputStream = serializer.serializeStream(serBuffer);
}

void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
  assert(sorter != null);
  // [1] Get the record's key and partitionId
  final K key = record._1();
  final int partitionId = partitioner.getPartition(key);
  // [2] Serialize the record to binary and write it into the byte-array output stream serBuffer
  serBuffer.reset();
  serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
  serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
  serOutputStream.flush();

  final int serializedRecordSize = serBuffer.size();
  assert (serializedRecordSize > 0);
  // [3] Insert it into the ShuffleExternalSorter
  sorter.insertRecord(
    serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}

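This step prepares a record for insertion by first serializing it to binary and holding it in memory:

  • [1] Get the record's key and partitionId
  • [2] Serialize the record to binary and write it into the byte-array output stream serBuffer
  • [3] Pass the serialized byte array, the partition id, and the length to the ShuffleExternalSorter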
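
The three steps above can be sketched with plain JDK classes. This is an illustration under assumptions, not Spark's code: ExposedByteArrayOutputStream imitates what the article says about MyByteArrayOutputStream, partitionFor is a hypothetical hash partitioner, and DataOutputStream stands in for Spark's serialization stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ReusableBufferSketch {
  /** Exposes the internal array so a caller can read the bytes without copying them. */
  static final class ExposedByteArrayOutputStream extends ByteArrayOutputStream {
    ExposedByteArrayOutputStream(int size) { super(size); }
    byte[] getBuf() { return buf; }
  }

  /** Hypothetical partitioner: non-negative hash modulo the partition count. */
  static int partitionFor(Object key, int numPartitions) {
    return Math.floorMod(key.hashCode(), numPartitions);
  }

  /** Serializes one key/value pair into the reused buffer and returns its size in bytes. */
  static int serialize(ExposedByteArrayOutputStream buffer, String key, int value) {
    buffer.reset(); // reuse the same backing array for every record
    try {
      DataOutputStream out = new DataOutputStream(buffer);
      out.writeUTF(key);
      out.writeInt(value);
      out.flush();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.size();
  }

  public static void main(String[] args) {
    ExposedByteArrayOutputStream buffer = new ExposedByteArrayOutputStream(64);
    int partitionId = partitionFor("ab", 4);
    int size = serialize(buffer, "ab", 7);
    // A sorter would now receive (buffer.getBuf(), size, partitionId).
    System.out.println(partitionId + " " + size);
  }
}
```

Because reset() only rewinds the write position, the backing array is allocated once and reused for every record, which is the point of exposing it through getBuf().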

So how is the data written inside the ShuffleExternalSorter?

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
  throws IOException {

  // [1] If the number of records in inMemSorter has reached the spill threshold
  //     (the maximum integer value by default), spill first
  // for tests
  assert(inMemSorter != null);
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    logger.info("Spilling data because number of spilledRecords crossed the threshold " +
      numElementsForSpillThreshold);
    spill();
  }
  // [2] Check whether inMemSorter has room for one more record; grow it if more memory
  //     can be acquired, otherwise spill
  growPointerArrayIfNecessary();
  final int uaoSize = UnsafeAlignedOffset.getUaoSize();
  // Need 4 or 8 bytes to store the record length.
  final int required = length + uaoSize;
  // [3] Check whether currentPage has enough free space; if not, request a new page,
  //     and spill if the request cannot be satisfied
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  // [4] Get the base object of currentPage and compute recordAddress
  final Object base = currentPage.getBaseObject();
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  // [5] Using base and pageCursor, write the length into the page first and advance the cursor
  UnsafeAlignedOffset.putSize(base, pageCursor, length);
  pageCursor += uaoSize;
  // [6] Then write the serialized data and advance the cursor
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  pageCursor += length;
  // [7] Insert recordAddress and partitionId into inMemSorter for sorting
  inMemSorter.insertRecord(recordAddress, partitionId);
}

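From the analysis above, inserting data into ShuffleExternalSorter takes 7 steps:

  • [1] If the number of records in inMemSorter has reached the spill threshold (the maximum integer value by default), spill first
  • [2] Check whether inMemSorter has room for one more record; grow it if more memory can be acquired, otherwise spill
  • [3] Check whether currentPage has enough free space; if not, request a new page, and spill if the request cannot be satisfied
  • [4] Get the base object of currentPage and compute recordAddress
  • [5] Write the length into the page first and advance the cursor
  • [6] Then write the serialized data and advance the cursor
  • [7] Insert recordAddress and partitionId into inMemSorter for sorting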
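
Steps [3] to [6] boil down to a length-prefixed append into a page with a moving cursor. The sketch below shows that layout on an ordinary heap byte[]; it is an assumption-laden simplification (PageWriteSketch, a fixed 4-byte length prefix, and returning -1 when the page is full are all hypothetical), whereas Spark works on MemoryBlock pages through the Unsafe API and acquires a new page or spills instead of failing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical single-page model of the record layout; not Spark code.
final class PageWriteSketch {
  static final int LENGTH_BYTES = 4; // assumed size of the length prefix

  private final byte[] page;
  private final ByteBuffer view; // used only for absolute int reads and writes
  private int pageCursor = 0;

  PageWriteSketch(int pageSize) {
    this.page = new byte[pageSize];
    this.view = ByteBuffer.wrap(page);
  }

  /** Appends [length][bytes] and returns the record's start offset, or -1 if it does not fit. */
  int insertRecord(byte[] record, int offset, int length) {
    int required = length + LENGTH_BYTES;
    if (page.length - pageCursor < required) {
      return -1; // a real sorter would acquire a new page or spill here
    }
    int recordAddress = pageCursor;
    view.putInt(pageCursor, length);                            // write the length first
    pageCursor += LENGTH_BYTES;
    System.arraycopy(record, offset, page, pageCursor, length); // then the serialized data
    pageCursor += length;
    return recordAddress;
  }

  /** Reads back the payload of the record that starts at recordAddress. */
  byte[] readRecord(int recordAddress) {
    int length = view.getInt(recordAddress);
    byte[] out = new byte[length];
    System.arraycopy(page, recordAddress + LENGTH_BYTES, out, 0, length);
    return out;
  }

  public static void main(String[] args) {
    PageWriteSketch sketch = new PageWriteSketch(64);
    byte[] payload = "hello".getBytes(StandardCharsets.US_ASCII);
    int address = sketch.insertRecord(payload, 0, payload.length);
    System.out.println(new String(sketch.readRecord(address), StandardCharsets.US_ASCII));
  }
}
```

The returned start offset plays the role of recordAddress: it is the only thing the in-memory sorter needs to keep, together with the partition id.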

As described above, the insertion path mainly involves two data structures: ShuffleExternalSorter and ShuffleInMemorySorter. Let's take a brief look at the ShuffleExternalSorter class.

final class ShuffleExternalSorter extends MemoryConsumer implements ShuffleChecksumSupport {

  private final int numPartitions;
  private final TaskMemoryManager taskMemoryManager;
  private final BlockManager blockManager;
  private final TaskContext taskContext;
  private final ShuffleWriteMetricsReporter writeMetrics;
  private final LinkedList<MemoryBlock> allocatedPages = new LinkedList<>();

  private final LinkedList<SpillInfo> spills = new LinkedList<>();

  /** Peak memory used by this sorter so far, in bytes. **/
  private long peakMemoryUsedBytes;

  // These variables are reset after spilling:
  @Nullable private ShuffleInMemorySorter inMemSorter;
  @Nullable private MemoryBlock currentPage = null;
  private long pageCursor = -1;
  ...
}

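As shown, each ShuffleExternalSorter wraps a ShuffleInMemorySorter, along with allocatedPages, spills, and currentPage. In other words, ShuffleExternalSorter stores the data in MemoryBlock pages, and each record consists of a length field followed by the K-V pair.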

ShuffleInMemorySorter, in turn, stores its data in a LongArray and implements the SortComparator used for sorting. The LongArray holds the location of each record, mainly the partition id, the page id, and the offset.

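ShuffleExternalSorter stores the data in MemoryBlock pages; each record consists of a length field and the K-V pair.
ShuffleInMemorySorter uses a long array to store each record's location (page number + offset) together with its PartitionId, 8 bytes per record in total.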
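
To make the 8-byte entry concrete, here is a sketch of packing a partition id, a page number, and an in-page offset into one long. The 24-bit partition field matches the (1 << 24) partition limit mentioned in this article; the 13-bit page and 27-bit offset widths are my understanding of Spark's PackedRecordPointer and should be treated as an assumption to check against your Spark version. PackedPointerSketch and its method names are hypothetical.

```java
// Hypothetical bit-packing sketch; field widths other than the 24-bit partition id are assumptions.
final class PackedPointerSketch {
  static final int PARTITION_BITS = 24;
  static final int PAGE_BITS = 13;
  static final int OFFSET_BITS = 27; // 24 + 13 + 27 = 64 bits
  static final int MAX_PARTITION_ID = (1 << PARTITION_BITS) - 1;

  /** Packs the three fields as [partition id | page number | offset in page]. */
  static long pack(int partitionId, int pageNumber, long offsetInPage) {
    if (partitionId < 0 || partitionId > MAX_PARTITION_ID) {
      throw new IllegalArgumentException("partition id out of range: " + partitionId);
    }
    if (pageNumber < 0 || pageNumber >= (1 << PAGE_BITS)) {
      throw new IllegalArgumentException("page number out of range: " + pageNumber);
    }
    if (offsetInPage < 0 || offsetInPage >= (1L << OFFSET_BITS)) {
      throw new IllegalArgumentException("offset out of range: " + offsetInPage);
    }
    return ((long) partitionId << (PAGE_BITS + OFFSET_BITS))
        | ((long) pageNumber << OFFSET_BITS)
        | offsetInPage;
  }

  static int partitionId(long packed) {
    // Unsigned shift, because the top partition bit may make the long negative.
    return (int) (packed >>> (PAGE_BITS + OFFSET_BITS));
  }

  static int pageNumber(long packed) {
    return (int) ((packed >>> OFFSET_BITS) & ((1 << PAGE_BITS) - 1));
  }

  static long offsetInPage(long packed) {
    return packed & ((1L << OFFSET_BITS) - 1);
  }

  public static void main(String[] args) {
    long packed = pack(5, 3, 100L);
    System.out.println(partitionId(packed) + " " + pageNumber(packed) + " " + offsetInPage(packed));
  }
}
```

Keeping the partition id in the highest bits means an entry can be ordered by partition without ever dereferencing the record it points to.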

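From the above it is clear that ShuffleExternalSorter and ShuffleInMemorySorter together amount to a Tungsten-based data structure similar to BytesToBytesMap, except that the array part, the LongArray, is wrapped in ShuffleInMemorySorter and the rest is split out into ShuffleExternalSorter.

ShuffleExternalSorter writes the data into the current memory page and writes each record's recordAddress and partitionId into ShuffleInMemorySorter. So how exactly are sorting and spilling implemented?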

private void writeSortedFile(boolean isLastFile) {

  // [1] Sort the data in inMemSorter and return a ShuffleSorterIterator
  // This call performs the actual sort.
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
    inMemSorter.getSortedIterator();

  // If there are no sorted records, so we don't need to create an empty spill file.
  if (!sortedRecords.hasNext()) {
    return;
  }

  final ShuffleWriteMetricsReporter writeMetricsToUse;

  ...

  // [2] Create the writeBuffer byte array to avoid inefficient small writes to DiskBlockObjectWriter
  // Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to
  // be an API to directly transfer bytes from managed memory to the disk writer, we buffer
  // data through a byte array. This array does not need to be large enough to hold a single
  // record;
  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  // Because this output will be read during shuffle, its compression codec must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more details.
  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempShuffleBlock();
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  int currentPartition = -1;
  final FileSegment committedSegment;
  try (DiskBlockObjectWriter writer =
      blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse)) {

    final int uaoSize = UnsafeAlignedOffset.getUaoSize();
    // [3] Walk the sorted pointers partition by partition, commit one FileSegment per partition, and record each partition's size
    while (sortedRecords.hasNext()) {
      sortedRecords.loadNext();
      final int partition = sortedRecords.packedRecordPointer.getPartitionId();
      assert (partition >= currentPartition);
      if (partition != currentPartition) {
        // Switch to the new partition
        if (currentPartition != -1) {
          final FileSegment fileSegment = writer.commitAndGet();
          spillInfo.partitionLengths[currentPartition] = fileSegment.length();
        }
        currentPartition = partition;
        if (partitionChecksums.length > 0) {
          writer.setChecksum(partitionChecksums[currentPartition]);
        }
      }
      // [4] Get the record pointer, then resolve the page and the offset within it
      final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
      final Object recordPage = taskMemoryManager.getPage(recordPointer);
      final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
      // [5] Read the length stored in front of the data, then move the read position past it
      int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      // [6] Copy the data into the buffer created above, hand it to the DiskBlockObjectWriter, and advance the read position
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
        Platform.copyMemory(
          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
        writer.write(writeBuffer, 0, toTransfer);
        recordReadPosition += toTransfer;
        dataRemaining -= toTransfer;
      }
      writer.recordWritten();
    }

    committedSegment = writer.commitAndGet();
  }
  // If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,
  // then the file might be empty. Note that it might be better to avoid calling
  // writeSortedFile() in that case.
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }

  if (!isLastFile) {  // i.e. this is a spill file
    writeMetrics.incRecordsWritten(
      ((ShuffleWriteMetrics)writeMetricsToUse).recordsWritten());
    taskContext.taskMetrics().incDiskBytesSpilled(
      ((ShuffleWriteMetrics)writeMetricsToUse).bytesWritten());
  }
}

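Writing a sorted spill file breaks down into two steps:

First, ShuffleInMemorySorter sorts the pointers, which yields a FileSegment and a length for each partition. Before a write or spill, the long array in ShuffleInMemorySorter is sorted by PartitionId using TimSort (or radix sort when spark.shuffle.sort.useRadixSort is enabled), so that records with the same PartitionId are grouped together and smaller PartitionIds come first. The FileSegments are then written out partition by partition, and the length of each partition is recorded.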


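Second, the data is spilled by following the sorted pointers. The elements of the long array in ShuffleInMemorySorter are read in order, and the page number and offset in each one are used to read the K-V pair from ShuffleExternalSorter and write it to the file. The bytes are first copied into writeBuffer and then written to the DiskBlockObjectWriter.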



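The detailed steps are:

  • [1] Sort the data in inMemSorter and return a ShuffleSorterIterator
  • [2] Create the writeBuffer byte array to avoid inefficient small writes to DiskBlockObjectWriter
  • [3] Walk the sorted pointers partition by partition, commit one FileSegment per partition, and record each partition's size
  • [4] Get the record pointer, then resolve the page and the offset within it
  • [5] Read the length stored in front of the data, then move the read position past it
  • [6] Copy the data into the writeBuffer created above, hand it to the DiskBlockObjectWriter, and advance the read position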
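
The sort-then-write loop above can be imitated in a few lines. This is a simplified sketch, not Spark's implementation: Rec is a hypothetical record that carries its own bytes (Spark sorts packed pointers and resolves them to pages), Arrays.sort on objects stands in for the sorter, and a ByteArrayOutputStream stands in for the spill file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical model of writing one sorted spill file; not Spark code.
final class SortedSpillSketch {
  /** A partition id plus already-serialized bytes. */
  static final class Rec {
    final int partitionId;
    final byte[] bytes;
    Rec(int partitionId, byte[] bytes) {
      this.partitionId = partitionId;
      this.bytes = bytes;
    }
  }

  /** Writes records grouped by partition and returns the byte length of each partition. */
  static long[] writeSorted(Rec[] records, int numPartitions, int bufferSize,
                            ByteArrayOutputStream out) {
    Rec[] sorted = records.clone();
    // Only the partition id is compared; record contents are never inspected.
    Arrays.sort(sorted, Comparator.comparingInt((Rec r) -> r.partitionId));

    long[] partitionLengths = new long[numPartitions];
    byte[] writeBuffer = new byte[bufferSize]; // may be smaller than a single record
    for (Rec r : sorted) {
      int position = 0;
      int remaining = r.bytes.length;
      while (remaining > 0) {
        int toTransfer = Math.min(bufferSize, remaining);
        System.arraycopy(r.bytes, position, writeBuffer, 0, toTransfer);
        out.write(writeBuffer, 0, toTransfer);
        position += toTransfer;
        remaining -= toTransfer;
      }
      partitionLengths[r.partitionId] += r.bytes.length;
    }
    return partitionLengths;
  }

  public static void main(String[] args) {
    Rec[] records = {
      new Rec(1, "bb".getBytes(StandardCharsets.US_ASCII)),
      new Rec(0, "a".getBytes(StandardCharsets.US_ASCII)),
    };
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long[] lengths = writeSorted(records, 2, 2, out);
    System.out.println(Arrays.toString(lengths));
  }
}
```

The per-partition lengths are what later allow a reader to locate one partition inside the file without scanning the whole thing.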

Finally, let's look at how UnsafeShuffleWriter merges the spill files at the end.

// UnsafeShuffleWriter
void closeAndWriteOutput() throws IOException {
  assert(sorter != null);
  updatePeakMemoryUsed();
  serBuffer = null;
  serOutputStream = null;
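  // [1] Close the sorter, spill all data remaining in it to disk, and return the SpillInfo array
  final SpillInfo[] spills = sorter.closeAndGetSpills();
  try {
    // [2] Merge the spill files, choosing the fastest merge strategy based on the number of spills and the IO compression codec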
    partitionLengths = mergeSpills(spills);
  } finally {
    sorter = null;
    for (SpillInfo spill : spills) {
      if (spill.file.exists() && !spill.file.delete()) {
        logger.error("Error while deleting spill file {}", spill.file.getPath());
      }
    }
  }
  mapStatus = MapStatus$.MODULE$.apply(
    blockManager.shuffleServerId(), partitionLengths, mapId);
}

private long[] mergeSpills(SpillInfo[] spills) throws IOException {
    long[] partitionLengths;
    // [1] If there are no spill files at all, write an empty file
    if (spills.length == 0) {
      final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
          .createMapOutputWriter(shuffleId, mapId, partitioner.numPartitions());
      return mapWriter.commitAllPartitions(
        ShuffleChecksumHelper.EMPTY_CHECKSUM_VALUE).getPartitionLengths();
    // [2] If there is exactly one spill file, transfer it directly to the output file
    } else if (spills.length == 1) {
      // [2.1] Create a map output writer for a single file
      Optional<SingleSpillShuffleMapOutputWriter> maybeSingleFileWriter =
          shuffleExecutorComponents.createSingleFileMapOutputWriter(shuffleId, mapId);
      if (maybeSingleFileWriter.isPresent()) {
        // Here, we don't need to perform any metrics updates because the bytes written to this
        // output file would have already been counted as shuffle bytes written.
        partitionLengths = spills[0].partitionLengths;
        logger.debug("Merge shuffle spills for mapId {} with length {}", mapId,
            partitionLengths.length);
        maybeSingleFileWriter.get()
          .transferMapSpillFile(spills[0].file, partitionLengths, sorter.getChecksums());
      } else {
        partitionLengths = mergeSpillsUsingStandardWriter(spills);
      }
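    // [3] With multiple spill files: if fast merge is enabled and supported, transferTo is
    //     enabled, and encryption is off, merge into the output file with NIO zero-copy;
    //     if transferTo is disabled or fast merge is unsupported, merge into the output
    //     file with blocking file streams that handle compression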
    } else {
      partitionLengths = mergeSpillsUsingStandardWriter(spills);
    }
    return partitionLengths;
  }

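The merging of multiple spills is ultimately implemented in methods such as mergeSpillsWithFileStream. To keep this article from growing too long, they are not expanded here.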

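Merging the spill files involves the following steps:

  • [1] Close the sorter, spill all data remaining in it to disk, and return the SpillInfo array

  • [2] Merge the spill files, choosing the fastest merge strategy based on the number of spills and the IO compression codec

     - [2.1] If there are no spill files at all, write an empty file

     - [2.2] If there is exactly one spill file, transfer it directly to the output file

     - [2.3] If there are multiple spill files: when fast merge is enabled and supported, transferTo is enabled, and encryption is off, merge into the output file with NIO zero-copy; when transferTo is disabled or fast merge is unsupported, merge into the output file with blocking file streams that handle compression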
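
Whatever strategy is chosen in [2.3], the shape of the merge is the same: for each partition in order, append that partition's segment from every spill. The sketch below models this with in-memory byte arrays; Spill and merge are hypothetical names, and real spills are files whose segments may be compressed or encrypted.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical in-memory model of a partition-wise spill merge; not Spark code.
final class MergeSpillsSketch {
  /** One spill: bytes already ordered by partition, plus the length of each partition. */
  static final class Spill {
    final byte[] data;
    final long[] partitionLengths;
    Spill(byte[] data, long[] partitionLengths) {
      this.data = data;
      this.partitionLengths = partitionLengths;
    }
  }

  /** Concatenates the spills partition by partition and returns the merged lengths. */
  static long[] merge(Spill[] spills, int numPartitions, ByteArrayOutputStream out) {
    long[] mergedLengths = new long[numPartitions];
    int[] readPositions = new int[spills.length]; // next unread byte in each spill
    for (int partition = 0; partition < numPartitions; partition++) {
      for (int i = 0; i < spills.length; i++) {
        int length = (int) spills[i].partitionLengths[partition];
        // Bytes are copied as-is; nothing is deserialized.
        out.write(spills[i].data, readPositions[i], length);
        readPositions[i] += length;
        mergedLengths[partition] += length;
      }
    }
    return mergedLengths;
  }

  public static void main(String[] args) {
    Spill first = new Spill("aabbb".getBytes(StandardCharsets.US_ASCII), new long[] {2, 3});
    Spill second = new Spill("cd".getBytes(StandardCharsets.US_ASCII), new long[] {1, 1});
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long[] lengths = merge(new Spill[] {first, second}, 2, out);
    System.out.println(Arrays.toString(lengths));
  }
}
```

Since each spill is already grouped by partition, the merge is pure concatenation and needs no comparison between records.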
    
    

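That concludes the walkthrough of the UnsafeShuffleWriter implementation.

Next, let's go over the advantages of UnsafeShuffleWriter:

  • ShuffleExternalSorter uses the Unsafe API to operate on serialized data rather than Java objects, which reduces memory usage and the GC time that comes with it. This optimization requires the Serializer to support relocation. ShuffleExternalSorter stores the raw data while ShuffleInMemorySorter stores the metadata as compressed pointers, only 8 bytes per record, and sorting never touches the raw data, so it is efficient.
  • Spilling and merging operate on data of the same partition, and because the Unsafe API works directly on serialized data, merging does not need to deserialize it.
  • Spilling and merging can use fastMerge (which calls NIO's transferTo) to improve efficiency. This requires spark.shuffle.unsafe.fastMergeEnabled to be true and, if compression is used, a compression codec that supports concatenation of serialized streams.
  • Sorting does not move the data itself; only the address pointers are sorted.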
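
As an illustration of the transferTo call mentioned in the fastMerge bullet, the sketch below concatenates two temporary files channel to channel with FileChannel.transferTo. It only demonstrates the JDK API; TransferToSketch, appendAll, and concat are hypothetical helpers, and Spark's own merge additionally tracks per-partition ranges within each spill.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo of channel-to-channel copying; not Spark code.
final class TransferToSketch {
  /** Appends the whole source file to the target channel and returns the bytes copied. */
  static long appendAll(Path source, FileChannel target) throws IOException {
    try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
      long size = in.size();
      long position = 0;
      // transferTo may copy fewer bytes than requested, so loop until done.
      while (position < size) {
        position += in.transferTo(position, size - position, target);
      }
      return size;
    }
  }

  /** Writes both inputs to temp files, concatenates them with transferTo, and reads the result. */
  static byte[] concat(byte[] first, byte[] second) {
    try {
      Path a = Files.createTempFile("spill-a", ".tmp");
      Path b = Files.createTempFile("spill-b", ".tmp");
      Path merged = Files.createTempFile("merged", ".tmp");
      try {
        Files.write(a, first);
        Files.write(b, second);
        try (FileChannel out = FileChannel.open(merged, StandardOpenOption.WRITE)) {
          appendAll(a, out);
          appendAll(b, out);
        }
        return Files.readAllBytes(merged);
      } finally {
        Files.deleteIfExists(a);
        Files.deleteIfExists(b);
        Files.deleteIfExists(merged);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(concat(new byte[] {1, 2, 3}, new byte[] {4, 5}).length);
  }
}
```

This style of copy only works when the bytes can be concatenated verbatim, which is why the codec must support concatenation of serialized streams.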

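To summarize, UnsafeShuffleWriter is one of the most important applications of Tungsten. It works much like SortShuffleWriter, but builds on the Unsafe API and uses the purpose-built ShuffleExternalSorter and ShuffleInMemorySorter to store and maintain the data.

The overall flow is as follows. Every record is serialized to a binary array before insertion and then inserted into ShuffleExternalSorter. ShuffleExternalSorter holds a ShuffleInMemorySorter, which stores each record's partitionId and recordAddress, and a list of MemoryBlock pages.

On insertRecord, ShuffleExternalSorter first checks whether ShuffleInMemorySorter and the current memory page have room for the new record. If not, it requests more memory, and if that request fails it spills.

When inserting, the length of the record is written first, then the data itself, and finally the recordAddress and partitionId are inserted into ShuffleInMemorySorter. On a spill, the data in ShuffleInMemorySorter is sorted, a FileSegment is produced per partition and its size recorded, and the pointer array is traversed to write out the data each address refers to. Merging works directly on the serialized data through the Unsafe API and returns the combined file.

Each map task that uses UnsafeShuffleWriter ends up with only two files: a data file ordered by partition and an index file. With M map tasks, the whole UnsafeShuffleWriter process therefore produces only 2 * M intermediate files.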
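
The relationship between the data file and the index file can be sketched as a running sum over the partition lengths. This assumes, as I understand Spark's index file format, that the index stores cumulative byte offsets with one more entry than there are partitions; IndexOffsetsSketch is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical sketch of deriving index offsets from partition lengths; not Spark code.
final class IndexOffsetsSketch {
  /** Returns offsets where partition i occupies [offsets[i], offsets[i + 1]) in the data file. */
  static long[] offsets(long[] partitionLengths) {
    long[] offsets = new long[partitionLengths.length + 1];
    for (int i = 0; i < partitionLengths.length; i++) {
      offsets[i + 1] = offsets[i] + partitionLengths[i];
    }
    return offsets;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(offsets(new long[] {3, 0, 4})));
  }
}
```

A reducer that needs partition i then reads only the byte range between offsets[i] and offsets[i + 1] of the data file.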

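That's it for today. Based on the discussion above, here are a few interview questions to think about:

  1. Why can't UnsafeShuffleWriter support map-side aggregation?
  2. Why is the maximum number of partitions for UnsafeShuffleWriter (1 << 24)?
  3. Is the implementation of ShuffleExternalSorter JVM-based? What optimizations does it apply to sorting?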
