Follow the WeChat official account “Tim在路上”.
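The Unsafe Shuffle implementation is, to a considerable extent, the main application of Tungsten's memory-management optimizations. Its write path resembles that of SortShuffleWriter, but the data structures it maintains and operates on are different.
UnsafeShuffleWriter source code walkthrough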
@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
// Keep track of success so we know if we encountered an exception
// We do this rather than a standard try/catch/re-throw to handle
// generic throwables.
// [1] Track in `success` whether the write succeeded, to tell a write-phase failure from a cleanup-phase failure
boolean success = false;
try {
// [2] Iterate over all records and insert each one into the ShuffleExternalSorter
while (records.hasNext()) {
insertRecordIntoSorter(records.next());
}
// [3] Close the sorter so that all data is written to disk, then merge the spill files into one
closeAndWriteOutput();
success = true;
} finally {
if (sorter != null) {
try {
// [4] Clean up and release resources
sorter.cleanupResources();
} catch (Exception e) {
// Only throw this error if we won't be masking another
// error.
if (success) {
throw e;
} else {
logger.error("In addition to a failure during writing, we failed during " +
"cleanup.", e);
}
}
}
}
}
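As the code above shows, the write path of UnsafeShuffleWriter proceeds as follows:
- [1] Track in success whether the write succeeded, to tell a write-phase exception from a cleanup-phase one
- [2] Iterate over all records and insert each one into the ShuffleExternalSorter
- [3] Close the sorter so that all data is written to disk, then merge the spill files into one
- [4] Clean up and release resources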
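The success-flag idea in step [1] is easy to try in isolation. Below is a minimal sketch of it; `Step` and `writeThenCleanup` are hypothetical names made up for illustration, not Spark APIs. A cleanup failure is rethrown only when the write itself succeeded, so it never masks the original write failure.

```java
// Hypothetical sketch: Step and writeThenCleanup are illustrative names, not Spark APIs.
final class SuccessFlagSketch {
  interface Step {
    void run() throws Exception;
  }

  static void writeThenCleanup(Step write, Step cleanup) throws Exception {
    boolean success = false;
    try {
      write.run();
      success = true;
    } finally {
      try {
        cleanup.run();
      } catch (Exception e) {
        // Rethrow the cleanup failure only if it would not mask a write failure.
        if (success) {
          throw e;
        }
        System.err.println("In addition to the write failure, cleanup failed: " + e.getMessage());
      }
    }
  }

  public static void main(String[] args) {
    try {
      writeThenCleanup(
          () -> { throw new IllegalStateException("write failed"); },
          () -> { throw new IllegalStateException("cleanup failed"); });
    } catch (Exception e) {
      // The caller sees the write failure; the cleanup failure is only logged.
      System.out.println(e.getMessage());
    }
  }
}
```

When both steps fail, the caller receives the write exception and the cleanup exception is only logged; when only the cleanup fails, the cleanup exception is what propagates.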
// open() is called while UnsafeShuffleWriter is being constructed; it creates the sorter and a byte-array output stream, and wraps that stream in a serialization stream
private void open() throws SparkException {
assert (sorter == null);
sorter = new ShuffleExternalSorter(
memoryManager,
blockManager,
taskContext,
initialSortBufferSize,
partitioner.numPartitions(),
sparkConf,
writeMetrics);
// MyByteArrayOutputStream is a thin wrapper around ByteArrayOutputStream that simply exposes the internal byte[] array
// DEFAULT_INITIAL_SER_BUFFER_SIZE is 1024 * 1024, so the buffer starts at 1 MB
serBuffer = new MyByteArrayOutputStream(DEFAULT_INITIAL_SER_BUFFER_SIZE);
serOutputStream = serializer.serializeStream(serBuffer);
}
void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
assert(sorter != null);
// [1] Get the record's key and partitionId
final K key = record._1();
final int partitionId = partitioner.getPartition(key);
// [2] Serialize the record to binary and write it into the byte-array output stream serBuffer
serBuffer.reset();
serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
serOutputStream.flush();
final int serializedRecordSize = serBuffer.size();
assert (serializedRecordSize > 0);
// [3] Insert it into the ShuffleExternalSorter
sorter.insertRecord(
serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}
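This step prepares a record for insertion: the record is first serialized to binary and held in memory.
- [1] Get the record's key and partitionId
- [2] Serialize the record to binary and write it into the byte-array output stream serBuffer
- [3] Pass the serialized byte array, its length, and the partition id to ShuffleExternalSorter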
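To make the reset-and-reuse buffer concrete, here is a small sketch under simplifying assumptions: `ExposedBuffer` is a stand-in for MyByteArrayOutputStream, and a plain `DataOutputStream` with an int key and a String value stands in for Spark's serialization stream. The point is only that one backing array is reused for every record and handed out without copying.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: DataOutputStream replaces Spark's SerializationStream here.
final class ReusableSerBufferSketch {
  // Stand-in for MyByteArrayOutputStream: exposes the internal array to avoid a copy.
  static final class ExposedBuffer extends ByteArrayOutputStream {
    ExposedBuffer(int size) { super(size); }
    byte[] getBuf() { return buf; }
  }

  private final ExposedBuffer serBuffer = new ExposedBuffer(64);
  private final DataOutputStream serOut = new DataOutputStream(serBuffer);

  // Serializes one record into the shared buffer and returns its size in bytes.
  int serialize(int key, String value) {
    try {
      serBuffer.reset();       // drop the previous record, keep the backing array
      serOut.writeInt(key);
      serOut.writeUTF(value);
      serOut.flush();
      return serBuffer.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  byte[] buffer() { return serBuffer.getBuf(); }

  public static void main(String[] args) {
    ReusableSerBufferSketch s = new ReusableSerBufferSketch();
    System.out.println(s.serialize(7, "ab")); // 4-byte int + 2-byte length + 2 bytes of text
    System.out.println(s.serialize(1, ""));
  }
}
```

Only the first `size()` bytes of the exposed array are valid for the current record, which is why the real code passes serializedRecordSize along with the array.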
So how is the data written inside ShuffleExternalSorter?
public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
throws IOException {
// [1] Check whether the number of records in inMemSorter has reached the spill threshold (Integer.MAX_VALUE by default); if so, spill first
// for tests
assert(inMemSorter != null);
if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
logger.info("Spilling data because number of spilledRecords crossed the threshold " +
numElementsForSpillThreshold);
spill();
}
// [2] Check whether inMemSorter has room for another entry; grow the pointer array if memory can be acquired, otherwise spill
growPointerArrayIfNecessary();
final int uaoSize = UnsafeAlignedOffset.getUaoSize();
// Need 4 or 8 bytes to store the record length.
final int required = length + uaoSize;
// [3] Check whether the current page has enough space; if not, request a new page, and spill if the request cannot be satisfied
acquireNewPageIfNecessary(required);
assert(currentPage != null);
// [4] Get the current page's base object and the record address
final Object base = currentPage.getBaseObject();
final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
// [5] Using base and pageCursor, write the length into the current page first, then advance the cursor
UnsafeAlignedOffset.putSize(base, pageCursor, length);
pageCursor += uaoSize;
// [6] Then write the serialized data and advance the cursor
Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
pageCursor += length;
// [7] Insert recordAddress and partitionId into inMemSorter for sorting
inMemSorter.insertRecord(recordAddress, partitionId);
}
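From the analysis above, inserting data into ShuffleExternalSorter takes 7 steps in total:
- [1] Check whether the number of records in inMemSorter has reached the spill threshold (Integer.MAX_VALUE by default); if so, spill first
- [2] Check whether inMemSorter has room for another entry; grow the pointer array if memory can be acquired, otherwise spill
- [3] Check whether the current page has enough space; if not, request a new page, and spill if the request cannot be satisfied
- [4] Get the current page's base object and the record address
- [5] Write the length into the current page first, then advance the cursor
- [6] Then write the serialized data and advance the cursor
- [7] Insert recordAddress and partitionId into inMemSorter for sorting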
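Steps [3] to [7] amount to appending a length-prefixed record to a page and remembering where it starts. The sketch below imitates that with an on-heap `ByteBuffer` in place of a Tungsten MemoryBlock, a fixed 4-byte length prefix in place of uaoSize, and a plain list in place of the pointer array. All names are hypothetical, and returning -1 stands in for "acquire a new page or spill".

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a ByteBuffer plays the role of the current page.
final class PageAppendSketch {
  private final ByteBuffer page;
  private final List<Integer> recordOffsets = new ArrayList<>(); // stand-in for the pointer array

  PageAppendSketch(int pageSize) {
    page = ByteBuffer.allocate(pageSize);
  }

  // Appends [length][bytes] and returns the record's offset, or -1 if the page is full.
  int insert(byte[] record, int length) {
    int required = Integer.BYTES + length;
    if (page.remaining() < required) {
      return -1; // the real sorter would request a new page or spill here
    }
    int recordOffset = page.position();
    page.putInt(length);          // write the length first, the cursor advances by 4
    page.put(record, 0, length);  // then the serialized bytes, the cursor advances by length
    recordOffsets.add(recordOffset);
    return recordOffset;
  }

  // Reads a record back from its offset: the length first, then the bytes after it.
  byte[] read(int recordOffset) {
    int length = page.getInt(recordOffset);
    byte[] out = new byte[length];
    ByteBuffer view = page.duplicate();
    view.position(recordOffset + Integer.BYTES);
    view.get(out, 0, length);
    return out;
  }

  public static void main(String[] args) {
    PageAppendSketch page = new PageAppendSketch(16);
    System.out.println(page.insert(new byte[] {1, 2, 3}, 3));
    System.out.println(page.insert(new byte[] {4, 5}, 2));
    System.out.println(page.insert(new byte[] {6}, 1)); // not enough room left in a 16-byte page
  }
}
```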
As this walkthrough shows, the insertion path mainly involves two data structures: ShuffleExternalSorter and ShuffleInMemorySorter. Let's take a quick look at the ShuffleExternalSorter class.
final class ShuffleExternalSorter extends MemoryConsumer implements ShuffleChecksumSupport {
private final int numPartitions;
private final TaskMemoryManager taskMemoryManager;
private final BlockManager blockManager;
private final TaskContext taskContext;
private final ShuffleWriteMetricsReporter writeMetrics;
private final LinkedList<MemoryBlock> allocatedPages = new LinkedList<>();
private final LinkedList<SpillInfo> spills = new LinkedList<>();
/** Peak memory used by this sorter so far, in bytes. **/
private long peakMemoryUsedBytes;
// These variables are reset after spilling:
@Nullable private ShuffleInMemorySorter inMemSorter;
@Nullable private MemoryBlock currentPage = null;
private long pageCursor = -1;
...
}
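So each ShuffleExternalSorter wraps a ShuffleInMemorySorter, along with allocatedPages, spills, and currentPage. In other words, ShuffleExternalSorter stores the data in MemoryBlock pages, where each record consists of its length followed by the K-V pair.
ShuffleInMemorySorter, in turn, stores its entries in a LongArray and defines a SortComparator for ordering them. Each entry in the LongArray holds a record's location information: partition id, page number, and offset.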
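Class | What it stores |
---|---|
ShuffleExternalSorter | Data in MemoryBlock pages; each record consists of its length followed by the K-V pair |
ShuffleInMemorySorter | A long array holding, per record, its location (page number + offset) and its PartitionId, 8 bytes in total |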
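To see how a partition id, page number, and offset fit into one 8-byte long, here is a sketch of the bit packing. The 24/13/27 split follows my understanding of Spark's PackedRecordPointer layout; treat the exact widths as an assumption, and note that the class and method names below are made up and skip the argument validation a real implementation needs.

```java
// Hypothetical sketch of packing [24-bit partition id][13-bit page number][27-bit offset].
final class PackedPointerSketch {
  static final int PARTITION_BITS = 24;
  static final int PAGE_BITS = 13;
  static final int OFFSET_BITS = 27;

  static long pack(int partitionId, int pageNumber, long offsetInPage) {
    return ((long) partitionId << (PAGE_BITS + OFFSET_BITS))
        | ((long) pageNumber << OFFSET_BITS)
        | offsetInPage;
  }

  static int partitionId(long packed) {
    return (int) (packed >>> (PAGE_BITS + OFFSET_BITS));
  }

  static int pageNumber(long packed) {
    return (int) ((packed >>> OFFSET_BITS) & ((1 << PAGE_BITS) - 1));
  }

  static long offsetInPage(long packed) {
    return packed & ((1L << OFFSET_BITS) - 1);
  }

  public static void main(String[] args) {
    long packed = pack(5, 3, 100L);
    System.out.println(partitionId(packed) + " " + pageNumber(packed) + " " + offsetInPage(packed));
  }
}
```

With the partition id in the highest bits, ordering entries by partition id never needs to look at the record bytes. A 24-bit partition field also means partition ids must stay below 1 << 24.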
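Taken together, ShuffleExternalSorter and ShuffleInMemorySorter essentially use Tungsten to build a data structure similar to BytesToBytesMap, except that the array part (the LongArray) is wrapped by ShuffleInMemorySorter and the rest lives in ShuffleExternalSorter.
ShuffleExternalSorter writes the data into the current page and writes each record's recordAddress and partitionId into ShuffleInMemorySorter. So how are sorting and spilling actually implemented?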
private void writeSortedFile(boolean isLastFile) {
// [1] Sort the data in inMemSorter and return a ShuffleSorterIterator
// This call performs the actual sort.
final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
inMemSorter.getSortedIterator();
// If there are no sorted records, so we don't need to create an empty spill file.
if (!sortedRecords.hasNext()) {
return;
}
final ShuffleWriteMetricsReporter writeMetricsToUse;
...
// [2] Create the writeBuffer byte array, to avoid inefficient small writes to DiskBlockObjectWriter
// Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to
// be an API to directly transfer bytes from managed memory to the disk writer, we buffer
// data through a byte array. This array does not need to be large enough to hold a single
// record;
final byte[] writeBuffer = new byte[diskWriteBufferSize];
// Because this output will be read during shuffle, its compression codec must be controlled by
// spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
// createTempShuffleBlock here; see SPARK-3426 for more details.
final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
blockManager.diskBlockManager().createTempShuffleBlock();
final File file = spilledFileInfo._2();
final TempShuffleBlockId blockId = spilledFileInfo._1();
final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);
// Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
// Our write path doesn't actually use this serializer (since we end up calling the `write()`
// OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
// around this, we pass a dummy no-op serializer.
final SerializerInstance ser = DummySerializerInstance.INSTANCE;
int currentPartition = -1;
final FileSegment committedSegment;
try (DiskBlockObjectWriter writer =
blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse)) {
final int uaoSize = UnsafeAlignedOffset.getUaoSize();
// [3] Walk the sorted pointers partition by partition, committing one FileSegment per partition and recording each partition's length
while (sortedRecords.hasNext()) {
sortedRecords.loadNext();
final int partition = sortedRecords.packedRecordPointer.getPartitionId();
assert (partition >= currentPartition);
if (partition != currentPartition) {
// Switch to the new partition
if (currentPartition != -1) {
final FileSegment fileSegment = writer.commitAndGet();
spillInfo.partitionLengths[currentPartition] = fileSegment.length();
}
currentPartition = partition;
if (partitionChecksums.length > 0) {
writer.setChecksum(partitionChecksums[currentPartition]);
}
}
// [4] Get the record pointer, then derive the page and the offset within it from the pointer
final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
final Object recordPage = taskMemoryManager.getPage(recordPointer);
final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
// [5] Read the length stored in front of the record, then move the read position past it
int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);
long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
// [6] Copy the data into the buffer created above, write it through to DiskBlockObjectWriter, and advance the read position
while (dataRemaining > 0) {
final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
Platform.copyMemory(
recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
writer.write(writeBuffer, 0, toTransfer);
recordReadPosition += toTransfer;
dataRemaining -= toTransfer;
}
writer.recordWritten();
}
committedSegment = writer.commitAndGet();
}
// If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,
// then the file might be empty. Note that it might be better to avoid calling
// writeSortedFile() in that case.
if (currentPartition != -1) {
spillInfo.partitionLengths[currentPartition] = committedSegment.length();
spills.add(spillInfo);
}
if (!isLastFile) { // i.e. this is a spill file
writeMetrics.incRecordsWritten(
((ShuffleWriteMetrics)writeMetricsToUse).recordsWritten());
taskContext.taskMetrics().incDiskBytesSpilled(
((ShuffleWriteMetrics)writeMetricsToUse).bytesWritten());
}
}
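Writing a sorted spill file boils down to two steps:
First, sort via ShuffleInMemorySorter and obtain a FileSegment and length for each partition. Before writing or spilling, the long array in ShuffleInMemorySorter is sorted by PartitionId (with TimSort, or with radix sort when spark.shuffle.sort.useRadixSort is enabled), so that entries with the same PartitionId are grouped together and smaller PartitionIds come first. The data is then written out as one FileSegment per partition, and each partition's length is recorded.
Second, spill the data by following the sorted pointers. The elements of the long array in ShuffleInMemorySorter are read in order, and the page number and offset in each one are used to read the K-V pair from ShuffleExternalSorter and write it to the file. The bytes are first copied into writeBuffer and then written to the DiskBlockObjectWriter.
The detailed steps are:
- [1] Sort the data in inMemSorter and return a ShuffleSorterIterator
- [2] Create the writeBuffer byte array, to avoid inefficient small writes to DiskBlockObjectWriter
- [3] Walk the sorted pointers partition by partition, committing one FileSegment per partition and recording each partition's length
- [4] Get the record pointer, then derive the page and the offset within it from the pointer
- [5] Read the length stored in front of the record, then move the read position past it
- [6] Copy the data into the writeBuffer created above, write it through to DiskBlockObjectWriter, and advance the read position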
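The essence of these steps is "sort the pointers, then copy the bytes they point to, partition by partition". Here is a toy version under loud assumptions: the pointer is simplified to (partitionId << 32 | recordIndex), records live in a plain byte[][] instead of memory pages, and a `ByteArrayOutputStream` stands in for the spill file. None of these names come from Spark.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: sorting 8-byte pointers instead of the records themselves.
final class SortedSpillSketch {
  // Simplified pointer: partition id in the high 32 bits, record index in the low 32 bits.
  static long pointer(int partitionId, int recordIndex) {
    return ((long) partitionId << 32) | recordIndex;
  }

  // Writes records in partition order and returns the number of bytes written per partition.
  static long[] writeSorted(
      long[] pointers, byte[][] records, int numPartitions, ByteArrayOutputStream out) {
    long[] sorted = pointers.clone();
    Arrays.sort(sorted); // groups equal partition ids together, smaller ids first
    long[] partitionLengths = new long[numPartitions];
    for (long p : sorted) {
      int partition = (int) (p >>> 32);
      byte[] record = records[(int) p]; // the low 32 bits locate the record
      out.write(record, 0, record.length);
      partitionLengths[partition] += record.length;
    }
    return partitionLengths;
  }

  public static void main(String[] args) {
    byte[][] records = {
      "bb".getBytes(StandardCharsets.US_ASCII),
      "a".getBytes(StandardCharsets.US_ASCII),
      "ccc".getBytes(StandardCharsets.US_ASCII)
    };
    long[] pointers = {pointer(1, 0), pointer(0, 1), pointer(1, 2)};
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long[] lengths = writeSorted(pointers, records, 2, out);
    System.out.println(out + " " + Arrays.toString(lengths)); // abbccc [1, 5]
  }
}
```

The records themselves never move during the sort; only the long array is reordered, and the per-partition lengths fall out of the write loop for free.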
Finally, let's look at how UnsafeShuffleWriter merges the spill files at the end.
// UnsafeShuffleWriter
void closeAndWriteOutput() throws IOException {
assert(sorter != null);
updatePeakMemoryUsed();
serBuffer = null;
serOutputStream = null;
// [1] Close the sorter, spill all of its remaining data to disk, and return the SpillInfo array
final SpillInfo[] spills = sorter.closeAndGetSpills();
try {
// [2] Merge the spill files, choosing the fastest merge strategy based on the number of spills and the IO compression codec
partitionLengths = mergeSpills(spills);
} finally {
sorter = null;
for (SpillInfo spill : spills) {
if (spill.file.exists() && !spill.file.delete()) {
logger.error("Error while deleting spill file {}", spill.file.getPath());
}
}
}
mapStatus = MapStatus$.MODULE$.apply(
blockManager.shuffleServerId(), partitionLengths, mapId);
}
private long[] mergeSpills(SpillInfo[] spills) throws IOException {
long[] partitionLengths;
// [1] If there are no spill files at all, write an empty file
if (spills.length == 0) {
final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
.createMapOutputWriter(shuffleId, mapId, partitioner.numPartitions());
return mapWriter.commitAllPartitions(
ShuffleChecksumHelper.EMPTY_CHECKSUM_VALUE).getPartitionLengths();
// [2] If there is only one spill file, write it directly to the output file
} else if (spills.length == 1) {
// [2.1] Create a map output writer for a single file
Optional<SingleSpillShuffleMapOutputWriter> maybeSingleFileWriter =
shuffleExecutorComponents.createSingleFileMapOutputWriter(shuffleId, mapId);
if (maybeSingleFileWriter.isPresent()) {
// Here, we don't need to perform any metrics updates because the bytes written to this
// output file would have already been counted as shuffle bytes written.
partitionLengths = spills[0].partitionLengths;
logger.debug("Merge shuffle spills for mapId {} with length {}", mapId,
partitionLengths.length);
maybeSingleFileWriter.get()
.transferMapSpillFile(spills[0].file, partitionLengths, sorter.getChecksums());
} else {
partitionLengths = mergeSpillsUsingStandardWriter(spills);
}
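// [3] If there are multiple spill files: when fast merge is enabled and supported, transferTo is enabled, and encryption is off, merge into the output file with NIO zero-copy; when transferTo is disabled or fast merge is not supported, merge into the output file with (compressed) BIO file streams instead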
} else {
partitionLengths = mergeSpillsUsingStandardWriter(spills);
}
return partitionLengths;
}
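The actual merging of multiple spills is implemented in mergeSpillsWithFileStream (and mergeSpillsWithTransferTo for the zero-copy path), both reached through mergeSpillsUsingStandardWriter; to keep this post from getting too long, we won't expand on them here.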
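Merging the spill files involves the following steps:
- [1] Close the sorter, spill all of its remaining data to disk, and return the SpillInfo array
- [2] Merge the spill files, choosing the fastest merge strategy based on the number of spills and the IO compression codec
  - [2.1] If there are no spill files at all, write an empty file
  - [2.2] If there is only one spill file, write it directly to the output file
  - [2.3] If there are multiple spill files: when fast merge is enabled and supported, transferTo is enabled, and encryption is off, merge into the output file with NIO zero-copy; when transferTo is disabled or fast merge is not supported, merge into the output file with (compressed) BIO file streams instead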
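Since each spill file is already laid out partition by partition, a merge only has to concatenate, for every partition in turn, that partition's byte range from each spill. The sketch below shows that idea with in-memory byte arrays; `Spill` and `merge` are hypothetical names, and this is neither Spark's transferTo path nor its file-stream path, just the shared access pattern.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: merging spills by concatenating per-partition byte ranges.
final class MergeSpillsSketch {
  static final class Spill {
    final byte[] data;             // partitions stored back to back, in partition order
    final long[] partitionLengths; // bytes per partition inside data

    Spill(byte[] data, long[] partitionLengths) {
      this.data = data;
      this.partitionLengths = partitionLengths;
    }
  }

  // Returns the merged per-partition lengths; the merged bytes go to out.
  static long[] merge(Spill[] spills, int numPartitions, ByteArrayOutputStream out) {
    long[] merged = new long[numPartitions];
    int[] readPos = new int[spills.length]; // current read position inside each spill
    for (int p = 0; p < numPartitions; p++) {
      for (int i = 0; i < spills.length; i++) {
        int len = (int) spills[i].partitionLengths[p];
        out.write(spills[i].data, readPos[i], len); // copied as raw bytes, no deserialization
        readPos[i] += len;
        merged[p] += len;
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Spill a = new Spill("xYY".getBytes(StandardCharsets.US_ASCII), new long[] {1, 2});
    Spill b = new Spill("xxY".getBytes(StandardCharsets.US_ASCII), new long[] {2, 1});
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long[] lengths = merge(new Spill[] {a, b}, 2, out);
    System.out.println(out + " " + Arrays.toString(lengths)); // xxxYYY [3, 3]
  }
}
```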
That completes the walkthrough of the UnsafeShuffleWriter implementation.
Next, let's discuss the advantages of UnsafeShuffleWriter:
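- ShuffleExternalSorter uses the Unsafe API to operate on serialized data rather than Java objects, which reduces memory usage and the GC time that comes with it. This optimization requires the Serializer to support relocation. ShuffleExternalSorter stores the raw data, while ShuffleInMemorySorter stores metadata as compressed pointers at only 8 bytes per record, and sorting never touches the raw data, so it is efficient.
- Spilling and merging operate on data of the same partition, and because the serialized data is manipulated directly through the Unsafe API, merging needs no deserialization.
- Spilling and merging can use fastMerge for better efficiency (via NIO's transferTo): set spark.shuffle.unsafe.fastMergeEnabled to true, and, if compression is used, the compression codec must support concatenation of serialized streams.
- Sorting does not reorder the data itself, only the address pointers to it.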
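To sum up, UnsafeShuffleWriter is the most important application of Tungsten. It works on the same principle as SortShuffleWriter, but it builds on the Unsafe API and uses its own ShuffleExternalSorter and ShuffleInMemorySorter to store and manage the data.
The overall flow is as follows. Every record is serialized into a binary array before insertion and then inserted into ShuffleExternalSorter. ShuffleExternalSorter holds a ShuffleInMemorySorter, used mainly to store each record's partitionId and recordAddress, plus a list of MemoryBlock pages.
On insertRecord, ShuffleExternalSorter first checks whether ShuffleInMemorySorter and the current page have enough room for the new record; if not, it requests more memory, and if the request fails it spills.
When inserting, it first writes the record's length, then the record's bytes, and finally inserts recordAddress and partitionId into ShuffleInMemorySorter. On spill, the entries in ShuffleInMemorySorter are sorted, one FileSegment is produced per partition with its size recorded, and the pointer array is traversed so that each record is written out from its address. During the merge, the serialized data can be handled directly without deserialization, producing the combined output file.
Each map task that goes through UnsafeShuffleWriter ends up with only two files: one data file holding all partitions, and one index file. With M map tasks, the whole UnsafeShuffleWriter process therefore produces only 2 * M intermediate files.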
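A single data file per map task is enough because the index file tells readers where each partition begins. The sketch below derives those positions from partitionLengths, assuming the index simply holds cumulative offsets (numPartitions + 1 values), which is my understanding of how Spark's index file is laid out. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: turning per-partition lengths into index-file offsets.
final class IndexOffsetsSketch {
  // offsets[i] is where partition i starts; offsets[i + 1] is where it ends.
  static long[] offsets(long[] partitionLengths) {
    long[] offsets = new long[partitionLengths.length + 1];
    for (int i = 0; i < partitionLengths.length; i++) {
      offsets[i + 1] = offsets[i] + partitionLengths[i];
    }
    return offsets;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(offsets(new long[] {3, 0, 5}))); // [0, 3, 3, 8]
  }
}
```

A reducer fetching partition i then reads the byte range from offsets[i] up to offsets[i + 1] in the data file; an empty partition shows up as two equal neighbouring offsets.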
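That's all for today. Based on the material above, here are a few interview questions to think about:
- Why can't UnsafeShuffleWriter support map-side aggregation?
- Why is the maximum number of partitions for UnsafeShuffleWriter (1 << 24)?
- Is the ShuffleExternalSorter implementation JVM-based? And what optimizations does it apply to sorting?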