1.ShuffleManager
When Spark initializes SparkEnv, the ShuffleManager is created inside the create() method:
// Let the user specify short names for shuffle managers
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
val shuffleMgrName = conf.get(config.SHUFFLE_MANAGER)
val shuffleMgrClass =
  shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
Both the "sort" and "tungsten-sort" short names map to the same SortShuffleManager class, and the ShuffleManager is created via reflection. ShuffleManager is a trait whose core methods are the following:
private[spark] trait ShuffleManager {

  /**
   * Register a shuffle and return a handle for it.
   */
  def registerShuffle[K, V, C](
      shuffleId: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  /** Get a writer for the given partition. Called on executors by map tasks. */
  def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Long,
      context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]

  /**
   * Get a reader for a range of reduce partitions. Called on executors by reduce tasks.
   */
  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]
  ...
}
2.SortShuffleManager
SortShuffleManager is the only concrete implementation of ShuffleManager. Its implementations of the three methods above are as follows:
2.1 registerShuffle
/**
 * Obtains a [[ShuffleHandle]] to pass to tasks.
 */
override def registerShuffle[K, V, C](
    shuffleId: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  // 1. First check whether bypass merge-sort can be used
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  // 2. Otherwise check whether the serialized (tungsten-sort) path can be used
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, dependency)
  }
}
1. First it checks whether bypass merge-sort can be used. Two conditions must hold: the shuffle dependency has no map-side aggregation, and the number of reduce partitions is no greater than spark.shuffle.sort.bypassMergeThreshold (200 by default). If both are satisfied, a BypassMergeSortShuffleHandle is returned and the bypass merge-sort shuffle mechanism is enabled.
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    false
  } else {
    // Default value is 200
    val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}
2. If the conditions above are not met, canUseSerializedShuffle() is checked. If all three of its conditions hold, a SerializedShuffleHandle is returned and the tungsten-sort shuffle mechanism is enabled.
def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  // The serializer must support relocation of serialized objects
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
  // There must be no map-side aggregation
  } else if (dependency.mapSideCombine) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
      s"map-side aggregation")
    false
  // The number of partitions must not exceed 16777215 + 1
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}
3. If neither of the two conditions above is met, a BaseShuffleHandle is returned and the regular sort shuffle mechanism is used.
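To make this three-way decision easier to follow, here is a minimal, self-contained Scala sketch of the selection logic. The ShuffleProps case class and chooseHandle function are illustrative stand-ins, not Spark APIs; the threshold values mirror the defaults mentioned above.

object HandleSelectionSketch {
  // Hypothetical summary of the properties registerShuffle() inspects.
  case class ShuffleProps(
      mapSideCombine: Boolean,
      numPartitions: Int,
      serializerSupportsRelocation: Boolean)

  val bypassMergeThreshold = 200        // default of spark.shuffle.sort.bypassMergeThreshold
  val maxSerializedPartitions = 1 << 24 // 16777216 = 16777215 + 1

  def chooseHandle(p: ShuffleProps): String =
    if (!p.mapSideCombine && p.numPartitions <= bypassMergeThreshold)
      "BypassMergeSortShuffleHandle"    // bypass merge-sort shuffle
    else if (p.serializerSupportsRelocation && !p.mapSideCombine &&
             p.numPartitions <= maxSerializedPartitions)
      "SerializedShuffleHandle"         // tungsten-sort shuffle
    else
      "BaseShuffleHandle"               // regular sort shuffle

  def main(args: Array[String]): Unit = {
    // e.g. a groupByKey into 100 partitions: no map-side combine, few partitions
    println(chooseHandle(ShuffleProps(mapSideCombine = false, numPartitions = 100,
      serializerSupportsRelocation = true)))   // BypassMergeSortShuffleHandle
    // e.g. a reduceByKey: map-side combine rules out both optimized paths
    println(chooseHandle(ShuffleProps(mapSideCombine = true, numPartitions = 1000,
      serializerSupportsRelocation = true)))   // BaseShuffleHandle
    // e.g. 5000 partitions with a relocation-friendly serializer and no combine
    println(chooseHandle(ShuffleProps(mapSideCombine = false, numPartitions = 5000,
      serializerSupportsRelocation = true)))   // SerializedShuffleHandle
  }
}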
2.2 getReader
/**
 * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).
 * Called on executors by reduce tasks.
 */
override def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] = {
  val blocksByAddress = SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(
    handle.shuffleId, startPartition, endPartition)
  new BlockStoreShuffleReader(
    handle.asInstanceOf[BaseShuffleHandle[K, _, C]], blocksByAddress, context, metrics,
    shouldBatchFetch = canUseBatchFetch(startPartition, endPartition, context))
}
This simply returns a BlockStoreShuffleReader.
2.3 getWriter
/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](
    handle: ShuffleHandle,
    mapId: Long,
    context: TaskContext,
    metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = {
  val mapTaskIds = taskIdMapsForShuffle.computeIfAbsent(
    handle.shuffleId, _ => new OpenHashSet[Long](16))
  mapTaskIds.synchronized { mapTaskIds.add(context.taskAttemptId()) }
  val env = SparkEnv.get
  // Pick a different ShuffleWriter depending on the handle
  handle match {
    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
      new UnsafeShuffleWriter(
        env.blockManager,
        context.taskMemoryManager(),
        unsafeShuffleHandle,
        mapId,
        context,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
      new BypassMergeSortShuffleWriter(
        env.blockManager,
        bypassMergeSortHandle,
        mapId,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
      new SortShuffleWriter(
        shuffleBlockResolver, other, mapId, context, shuffleExecutorComponents)
  }
}
A different ShuffleWriter is chosen here based on the handle: a SerializedShuffleHandle gets an UnsafeShuffleWriter, a BypassMergeSortShuffleHandle gets a BypassMergeSortShuffleWriter, and anything else gets a SortShuffleWriter.
3.The three ShuffleWriter implementations
As described above, when the bypass mechanism is enabled, BypassMergeSortShuffleWriter is used; if the serializer supports relocation, there is no map-side aggregation, and the number of partitions is no greater than 16777215 + 1 (all three conditions must hold), UnsafeShuffleWriter is used; otherwise SortShuffleWriter is used.
3.1 BypassMergeSortShuffleWriter
BypassMergeSortShuffleWriter extends ShuffleWriter and is implemented in Java. It merges the map task's per-partition output files into a single file and also produces an index file that records the start offset of each partition. Its write() method is as follows:
@Override
public void write(Iterator<Product2<K, V>> records) throws IOException {
  assert (partitionWriters == null);
  // Create a ShuffleMapOutputWriter
  ShuffleMapOutputWriter mapOutputWriter = shuffleExecutorComponents
      .createMapOutputWriter(shuffleId, mapId, numPartitions);
  try {
    // If there are no records at all
    if (!records.hasNext()) {
      // Commit and get the lengths written for all partitions
      partitionLengths = mapOutputWriter.commitAllPartitions();
      // Update mapStatus
      mapStatus = MapStatus$.MODULE$.apply(
          blockManager.shuffleServerId(), partitionLengths, mapId);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    // Create one DiskBlockObjectWriter and one FileSegment per partition
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    // For every partition
    for (int i = 0; i < numPartitions; i++) {
      // Create a temporary shuffle block
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
          blockManager.diskBlockManager().createTempShuffleBlock();
      // Get the temp block's file and id
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      // Create a DiskBlockObjectWriter for this partition
      partitionWriters[i] =
          blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    // If there is data
    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      // Write each record into the file of the partition its key maps to
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }

    for (int i = 0; i < numPartitions; i++) {
      try (DiskBlockObjectWriter writer = partitionWriters[i]) {
        // Commit this partition's writes
        partitionWriterSegments[i] = writer.commitAndGet();
      }
    }

    // Concatenate all per-partition files into a single output file
    partitionLengths = writePartitionedData(mapOutputWriter);
    // Update mapStatus
    mapStatus = MapStatus$.MODULE$.apply(
        blockManager.shuffleServerId(), partitionLengths, mapId);
  } catch (Exception e) {
    try {
      mapOutputWriter.abort(e);
    } catch (Exception e2) {
      logger.error("Failed to abort the writer after failing to write map output.", e2);
      e.addSuppressed(e2);
    }
    throw e;
  }
}
The file-concatenation method writePartitionedData() is shown below; by default it uses a zero-copy approach to merge the files (a small sketch of the idea follows the method):
private long[] writePartitionedData(ShuffleMapOutputWriter mapOutputWriter) throws IOException {
  // Track location of the partition starts in the output file
  if (partitionWriters != null) {
    // Start time
    final long writeStartTime = System.nanoTime();
    try {
      for (int i = 0; i < numPartitions; i++) {
        // Get the file of each partition
        final File file = partitionWriterSegments[i].file();
        ShufflePartitionWriter writer = mapOutputWriter.getPartitionWriter(i);
        if (file.exists()) {
          // Zero-copy path
          if (transferToEnabled) {
            // Using WritableByteChannelWrapper to make resource closing consistent between
            // this implementation and UnsafeShuffleWriter.
            Optional<WritableByteChannelWrapper> maybeOutputChannel = writer.openChannelWrapper();
            // This calls Utils.copyFileStreamNIO, which ultimately uses FileChannel.transferTo
            if (maybeOutputChannel.isPresent()) {
              writePartitionedDataWithChannel(file, maybeOutputChannel.get());
            } else {
              writePartitionedDataWithStream(file, writer);
            }
          } else {
            // Otherwise copy through streams
            writePartitionedDataWithStream(file, writer);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
    } finally {
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
  }
  return mapOutputWriter.commitAllPartitions();
}
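To illustrate the zero-copy path mentioned above: FileChannel.transferTo moves bytes between two channels without copying them through a user-space buffer. Below is a minimal, self-contained Scala sketch of that idea; it is not Spark code, and the method name appendZeroCopy is made up.

import java.io.{File, FileInputStream, FileOutputStream}

object TransferToSketch {
  // Append the contents of `src` onto `dst` with FileChannel.transferTo, looping because a
  // single call may transfer fewer bytes than requested.
  def appendZeroCopy(src: File, dst: File): Unit = {
    val in = new FileInputStream(src).getChannel
    val out = new FileOutputStream(dst, true).getChannel // open in append mode
    try {
      var pos = 0L
      val size = in.size()
      while (pos < size) {
        pos += in.transferTo(pos, size - pos, out)
      }
    } finally {
      in.close()
      out.close()
    }
  }
}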
3.2 UnsafeShuffleWriter
UnsafeShuffleWriter also extends ShuffleWriter and is implemented in Java. Its write() method is as follows:
@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
  // Keep track of success so we know if we encountered an exception
  // We do this rather than a standard try/catch/re-throw to handle
  // generic throwables.
  boolean success = false;
  try {
    while (records.hasNext()) {
      // Insert each record into the ShuffleExternalSorter for external sorting
      insertRecordIntoSorter(records.next());
    }
    // Merge the spill files and write the final output
    closeAndWriteOutput();
    success = true;
  } finally {
    if (sorter != null) {
      try {
        sorter.cleanupResources();
      } catch (Exception e) {
        // Only throw this error if we won't be masking another
        // error.
        if (success) {
          throw e;
        } else {
          logger.error("In addition to a failure during writing, we failed during " +
              "cleanup.", e);
        }
      }
    }
  }
}
Two methods do the main work here:
3.2.1 insertRecordIntoSorter()
@VisibleForTesting
void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
  assert(sorter != null);
  // Get the key and its partition
  final K key = record._1();
  final int partitionId = partitioner.getPartition(key);
  // Reset the serialization buffer
  serBuffer.reset();
  // Write the key and value into the buffer
  serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
  serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
  serOutputStream.flush();
  // Get the size of the serialized record
  final int serializedRecordSize = serBuffer.size();
  assert (serializedRecordSize > 0);
  // Hand the serialized bytes to the ShuffleExternalSorter
  sorter.insertRecord(
      serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}
This method serializes the record and passes the serialized bytes to the external sorter through insertRecord(), which looks like this:
public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
    throws IOException {

  // for tests
  assert(inMemSorter != null);
  // If the number of records exceeds the spill threshold, spill to disk immediately
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    logger.info("Spilling data because number of spilledRecords crossed the threshold " +
      numElementsForSpillThreshold);
    spill();
  }

  // Checks whether there is enough space to insert an additional record in to the sort pointer
  // array and grows the array if additional space is required. If the required space cannot be
  // obtained, then the in-memory data will be spilled to disk.
  growPointerArrayIfNecessary();
  final int uaoSize = UnsafeAlignedOffset.getUaoSize();
  // Need 4 or 8 bytes to store the record length.
  final int required = length + uaoSize;
  // If more memory is needed, request a new page from the TaskMemoryManager
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  final Object base = currentPage.getBaseObject();
  // Given a memory page and offset within that page, encode this address into a 64-bit long.
  // This address will remain valid as long as the corresponding page has not been freed.
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  // Write the record length
  UnsafeAlignedOffset.putSize(base, pageCursor, length);
  // Advance the page cursor
  pageCursor += uaoSize;
  // Copy the record bytes
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  // Advance the page cursor
  pageCursor += length;
  // Hand the encoded logical address and the partition id to the ShuffleInMemorySorter for sorting
  inMemSorter.insertRecord(recordAddress, partitionId);
}
Note that buffering and spilling here do not rely on higher-level data structures; the sorter operates on raw memory directly.
The growPointerArrayIfNecessary() method is shown below:
/**
 * Checks whether there is enough space to insert an additional record in to the sort pointer
 * array and grows the array if additional space is required. If the required space cannot be
 * obtained, then the in-memory data will be spilled to disk.
 */
private void growPointerArrayIfNecessary() throws IOException {
  assert(inMemSorter != null);
  // If there is no room for another record
  if (!inMemSorter.hasSpaceForAnotherRecord()) {
    // Current memory usage
    long used = inMemSorter.getMemoryUsage();
    LongArray array;
    try {
      // could trigger spilling
      // Allocate an array with twice the current capacity
      array = allocateArray(used / 8 * 2);
    } catch (TooLargePageException e) {
      // The pointer array is too big to fit in a single page, spill.
      // A page is 128 MB; see PackedRecordPointer:
      // static final int MAXIMUM_PAGE_SIZE_BYTES = 1 << 27;  // 128 megabytes
      spill();
      return;
    } catch (SparkOutOfMemoryError e) {
      // should have trigger spilling
      if (!inMemSorter.hasSpaceForAnotherRecord()) {
        logger.error("Unable to grow the pointer array");
        throw e;
      }
      return;
    }
    // check if spilling is triggered or not
    if (inMemSorter.hasSpaceForAnotherRecord()) {
      // Spilling freed up enough space, so the new array is not needed; release it
      freeArray(array);
    } else {
      // Otherwise copy the old array into the new one
      inMemSorter.expandPointerArray(array);
    }
  }
}
The spill() method is as follows:
@Override
public long spill(long size, MemoryConsumer trigger) throws IOException {
  if (trigger != this || inMemSorter == null || inMemSorter.numRecords() == 0) {
    return 0L;
  }

  logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",
    Thread.currentThread().getId(),
    Utils.bytesToString(getMemoryUsage()),
    spills.size(),
    spills.size() > 1 ? " times" : " time");

  // Sorts the in-memory records and writes the sorted records to an on-disk file.
  // This method does not free the sort data structures.
  writeSortedFile(false);
  final long spillSize = freeMemory();
  // Reset the ShuffleInMemorySorter
  inMemSorter.reset();
  // Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the
  // records. Otherwise, if the task is over allocated memory, then without freeing the memory
  // pages, we might not be able to get memory for the pointer array.
  taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
  return spillSize;
}
The writeSortedFile() method:
private void writeSortedFile(boolean isLastFile) {

  // This call performs the actual sort.
  // Returns an iterator over the records in sorted order
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
      inMemSorter.getSortedIterator();

  // If there are no sorted records, so we don't need to create an empty spill file.
  if (!sortedRecords.hasNext()) {
    return;
  }

  final ShuffleWriteMetricsReporter writeMetricsToUse;

  // If true, we are writing the final output file; otherwise a spill file
  if (isLastFile) {
    // We're writing the final non-spill file, so we _do_ want to count this as shuffle bytes.
    writeMetricsToUse = writeMetrics;
  } else {
    // We're spilling, so bytes written should be counted towards spill rather than write.
    // Create a dummy WriteMetrics object to absorb these metrics, since we don't want to count
    // them towards shuffle bytes written.
    writeMetricsToUse = new ShuffleWriteMetrics();
  }

  // Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to
  // be an API to directly transfer bytes from managed memory to the disk writer, we buffer
  // data through a byte array. This array does not need to be large enough to hold a single
  // record;
  // Create a byte buffer array (1 MB)
  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  // Because this output will be read during shuffle, its compression codec must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more details.
  // Create a temporary shuffle block
  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
      blockManager.diskBlockManager().createTempShuffleBlock();
  // Get the file and block id
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  // A no-op serializer: an instance is only needed to construct the DiskBlockObjectWriter
  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  int currentPartition = -1;
  final FileSegment committedSegment;
  try (DiskBlockObjectWriter writer =
      blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse)) {

    final int uaoSize = UnsafeAlignedOffset.getUaoSize();
    // Iterate over the sorted records
    while (sortedRecords.hasNext()) {
      sortedRecords.loadNext();
      final int partition = sortedRecords.packedRecordPointer.getPartitionId();
      assert (partition >= currentPartition);
      if (partition != currentPartition) {
        // Switch to the new partition
        // When switching to a new partition, commit the current one and record its length
        if (currentPartition != -1) {
          final FileSegment fileSegment = writer.commitAndGet();
          spillInfo.partitionLengths[currentPartition] = fileSegment.length();
        }
        // Then switch to the next partition
        currentPartition = partition;
      }

      // Get the packed pointer, and from it the page and the offset within the page
      final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
      final Object recordPage = taskMemoryManager.getPage(recordPointer);
      final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
      // Remaining bytes of the record
      int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);
      // Skip over the record length stored in front of the data
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
        // Copy the bytes into the buffer array
        Platform.copyMemory(
            recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
        // Write from the buffer array into the DiskBlockObjectWriter
        writer.write(writeBuffer, 0, toTransfer);
        // Advance the read position
        recordReadPosition += toTransfer;
        // Update the remaining byte count
        dataRemaining -= toTransfer;
      }
      writer.recordWritten();
    }

    // Commit the last partition
    committedSegment = writer.commitAndGet();
  }

  // If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,
  // then the file might be empty. Note that it might be better to avoid calling
  // writeSortedFile() in that case.
  // Record this file in the list of spill files
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }

  // If this is a spill file, update the spill metrics
  if (!isLastFile) {
    writeMetrics.incRecordsWritten(
        ((ShuffleWriteMetrics)writeMetricsToUse).recordsWritten());
    taskContext.taskMetrics().incDiskBytesSpilled(
        ((ShuffleWriteMetrics)writeMetricsToUse).bytesWritten());
  }
}
The encodePageNumberAndOffset() method is as follows:
public long encodePageNumberAndOffset(MemoryBlock page, long offsetInPage) {
  // With off-heap memory the offset is an absolute address that may need a full 64 bits to
  // encode; because of the page-size limit we subtract the page's base address to turn it into
  // a relative offset
  if (tungstenMemoryMode == MemoryMode.OFF_HEAP) {
    // In off-heap mode, an offset is an absolute address that may require a full 64 bits to
    // encode. Due to our page size limitation, though, we can convert this into an offset that's
    // relative to the page's base offset; this relative offset will fit in 51 bits.
    offsetInPage -= page.getBaseOffset();
  }
  return encodePageNumberAndOffset(page.pageNumber, offsetInPage);
}

@VisibleForTesting
public static long encodePageNumberAndOffset(int pageNumber, long offsetInPage) {
  assert (pageNumber >= 0) : "encodePageNumberAndOffset called with invalid page";
  // The high 13 bits hold the page number and the low 51 bits hold the offset:
  // shift the page number left by 51 bits, then OR in the offset masked with
  // 0x7FFFFFFFFFFFFL (all of the lower 51 bits set to 1)
  return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LONG_LOWER_51_BITS);
}
ShuffleInMemorySorter's insertRecord() method is as follows:
public void insertRecord(long recordPointer, int partitionId) {
  if (!hasSpaceForAnotherRecord()) {
    throw new IllegalStateException("There is no space for new record");
  }
  array.set(pos, PackedRecordPointer.packPointer(recordPointer, partitionId));
  pos++;
}
The PackedRecordPointer.packPointer() method (a worked example of the bit layout follows the code):
public static long packPointer(long recordPointer, int partitionId) {
  assert (partitionId <= MAXIMUM_PARTITION_ID);
  // Note that without word alignment we can address 2^27 bytes = 128 megabytes per page.
  // Also note that this relies on some internals of how TaskMemoryManager encodes its addresses.
  // Shift the 13-bit page number right by 24 bits and combine it with the low 27 bits of the
  // offset, compressing the logical address into 40 bits
  final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
  final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
  // Put the partition id in the upper 24 bits
  return (((long) partitionId) << 40) | compressedAddress;
}
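To make the two-level encoding concrete, here is a small self-contained Scala sketch that reproduces the bit arithmetic above with assumed values. The constants mirror the ones referenced in the comments; the object and method names are illustrative only.

object PointerPackingSketch {
  val OffsetBits = 51
  val MaskLower51 = 0x7FFFFFFFFFFFFL        // low 51 bits set
  val MaskUpper13 = ~MaskLower51            // high 13 bits set
  val MaskLower27 = (1L << 27) - 1          // low 27 bits set

  // TaskMemoryManager-style encoding: 13-bit page number, 51-bit in-page offset.
  def encodePageNumberAndOffset(pageNumber: Int, offsetInPage: Long): Long =
    (pageNumber.toLong << OffsetBits) | (offsetInPage & MaskLower51)

  // PackedRecordPointer-style compression: 24-bit partition id, 13-bit page, 27-bit offset.
  def packPointer(recordPointer: Long, partitionId: Int): Long = {
    val pageNumber = (recordPointer & MaskUpper13) >>> 24
    val compressedAddress = pageNumber | (recordPointer & MaskLower27)
    (partitionId.toLong << 40) | compressedAddress
  }

  def main(args: Array[String]): Unit = {
    val addr = encodePageNumberAndOffset(pageNumber = 2, offsetInPage = 1024)
    val packed = packPointer(addr, partitionId = 7)
    println(f"address = 0x$addr%016x, packed = 0x$packed%016x")
    println(packed >>> 40)                  // 7: the partition id lives in bits 40..63
  }
}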
The getSortedIterator() method:
public ShuffleSorterIterator getSortedIterator() {
  int offset = 0;
  // Sort the in-memory records by partition id with radix sort. Radix sort is faster, but it
  // requires extra reserved memory while pointers are being added
  if (useRadixSort) {
    offset = RadixSort.sort(
        array, pos,
        PackedRecordPointer.PARTITION_ID_START_BYTE_INDEX,
        PackedRecordPointer.PARTITION_ID_END_BYTE_INDEX, false, false);
  // Otherwise fall back to TimSort
  } else {
    MemoryBlock unused = new MemoryBlock(
        array.getBaseObject(),
        array.getBaseOffset() + pos * 8L,
        (array.size() - pos) * 8L);
    LongArray buffer = new LongArray(unused);
    Sorter<PackedRecordPointer, LongArray> sorter =
        new Sorter<>(new ShuffleSortDataFormat(buffer));
    sorter.sort(array, 0, pos, SORT_COMPARATOR);
  }
  return new ShuffleSorterIterator(pos, array, offset);
}
3.2.2 closeAndWriteOutput()
@VisibleForTesting
void closeAndWriteOutput() throws IOException {
  assert(sorter != null);
  updatePeakMemoryUsed();
  serBuffer = null;
  serOutputStream = null;
  // Get the spill files
  final SpillInfo[] spills = sorter.closeAndGetSpills();
  sorter = null;
  final long[] partitionLengths;
  try {
    // Merge the spill files
    partitionLengths = mergeSpills(spills);
  } finally {
    // Delete the spill files
    for (SpillInfo spill : spills) {
      if (spill.file.exists() && !spill.file.delete()) {
        logger.error("Error while deleting spill file {}", spill.file.getPath());
      }
    }
  }
  // Update mapStatus
  mapStatus = MapStatus$.MODULE$.apply(
      blockManager.shuffleServerId(), partitionLengths, mapId);
}
The mergeSpills() method:
private long[] mergeSpills(SpillInfo[] spills) throws IOException {
  long[] partitionLengths;
  // If there are no spill files, commit an empty output
  if (spills.length == 0) {
    final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
        .createMapOutputWriter(shuffleId, mapId, partitioner.numPartitions());
    return mapWriter.commitAllPartitions();
  // If there is exactly one spill file, transfer it directly to the output
  } else if (spills.length == 1) {
    Optional<SingleSpillShuffleMapOutputWriter> maybeSingleFileWriter =
        shuffleExecutorComponents.createSingleFileMapOutputWriter(shuffleId, mapId);
    if (maybeSingleFileWriter.isPresent()) {
      // Here, we don't need to perform any metrics updates because the bytes written to this
      // output file would have already been counted as shuffle bytes written.
      partitionLengths = spills[0].partitionLengths;
      maybeSingleFileWriter.get().transferMapSpillFile(spills[0].file, partitionLengths);
    } else {
      partitionLengths = mergeSpillsUsingStandardWriter(spills);
    }
  // If there are several spill files, merge them into the output; the merge can use either
  // NIO or BIO
  } else {
    partitionLengths = mergeSpillsUsingStandardWriter(spills);
  }
  return partitionLengths;
}
3.3 SortShuffleWriter
SortShuffleWriter sorts data in memory using a PartitionedAppendOnlyMap or a PartitionedPairBuffer; if the memory limit is exceeded, the data is spilled to files. When producing the final, globally ordered output file, it merge-sorts all previously spilled files together with the data still in memory, and elements with the same key are combined using the user-defined function. The entry point is the write() method:
override def write(records: Iterator[Product2[K, V]]): Unit = {
  // Create an external sorter; if there is map-side combine, pass in the aggregator and
  // keyOrdering, otherwise neither is needed
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  // Feed the records into the ExternalSorter
  sorter.insertAll(records)

  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  // Create an output writer
  val mapOutputWriter = shuffleExecutorComponents.createMapOutputWriter(
    dep.shuffleId, mapId, dep.partitioner.numPartitions)
  // Write the externally sorted data through the writer
  sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
  val partitionLengths = mapOutputWriter.commitAllPartitions()
  // Update mapStatus
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths, mapId)
}
The insertAll() method:
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  // TODO: stop combining if we find that the reduction factor isn't high
  // Whether map-side combine is needed
  val shouldCombine = aggregator.isDefined

  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    // mergeValue() folds a new value into the current combined value
    val mergeValue = aggregator.get.mergeValue
    // createCombiner() creates the initial combined value
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    // If the key already has a combined value, merge into it; otherwise create the initial value
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    // Iterate over the records
    while (records.hasNext) {
      // Count the record as read
      addElementsRead()
      kv = records.next()
      // map is a PartitionedAppendOnlyMap: (partition, key) is the key, the combined value is the value
      map.changeValue((getPartition(kv._1), kv._1), update)
      // Check whether we need to spill to disk
      maybeSpillCollection(usingMap = true)
    }
  // If no map-side combine is needed
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      // buffer is a PartitionedPairBuffer; insert (partition, key, value)
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      // Check whether we need to spill to disk
      maybeSpillCollection(usingMap = false)
    }
  }
}
This method decides, while inserting records, whether map-side pre-aggregation is needed, and uses one of two data structures accordingly.
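For illustration only, here is what the three aggregator functions could look like for a simple sum (a reduceByKey-style combine). The types, values, and object name are assumed, not taken from Spark.

object AggregatorSketch {
  val createCombiner: Int => Long = v => v.toLong       // first value seen for a key
  val mergeValue: (Long, Int) => Long = (c, v) => c + v // fold another value into the combiner
  val mergeCombiners: (Long, Long) => Long = _ + _      // used later when merging spills

  def main(args: Array[String]): Unit = {
    // The `update` closure in insertAll() behaves like this for each (partition, key) slot:
    def update(hadValue: Boolean, oldValue: Long, newValue: Int): Long =
      if (hadValue) mergeValue(oldValue, newValue) else createCombiner(newValue)

    println(update(false, 0L, 3)) // 3: first occurrence of the key
    println(update(true, 3L, 4))  // 7: subsequent occurrence is merged in
  }
}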
maybeSpillCollection() calls maybeSpill() to check whether a spill is needed; if a spill happens, a fresh map or buffer is created and caching starts over, as shown below (a worked example of the threshold arithmetic follows the code):
private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    estimatedSize = map.estimateSize()
    // Check whether a spill is needed
    if (maybeSpill(map, estimatedSize)) {
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    estimatedSize = buffer.estimateSize()
    // Check whether a spill is needed
    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]
    }
  }

  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  // If the number of records read is a multiple of 32 and the estimated size of the map or
  // buffer is at least the current threshold (5 MB by default)
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Claim up to double our current memory from the shuffle memory pool
    // Try to acquire 2 * currentMemory - myMemoryThreshold of additional memory
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    // Raise the threshold by what was granted
    myMemoryThreshold += granted
    // If we were granted too little memory to grow further (either tryToAcquire returned 0,
    // or we already had more memory than myMemoryThreshold), spill the current collection
    // If memory is still insufficient, we must spill
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  // Even if shouldSpill is false, spill when the number of records read exceeds
  // numElementsForceSpillThreshold (Integer.MAX_VALUE by default)
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  // Actually spill
  if (shouldSpill) {
    // Increment the spill count
    _spillCount += 1
    logSpillage(currentMemory)
    // Spill the cached collection
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    // Release the memory
    releaseMemory()
  }
  shouldSpill
}
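As a worked example of the arithmetic in maybeSpill(), with assumed numbers (the 5 MB starting threshold corresponds to the default initial memory threshold mentioned above):

object MaybeSpillArithmetic {
  def main(args: Array[String]): Unit = {
    val initialThreshold = 5L * 1024 * 1024  // 5 MB starting value of myMemoryThreshold
    val currentMemory = 12L * 1024 * 1024    // estimated size of the in-memory map/buffer
    val amountToRequest = 2 * currentMemory - initialThreshold // 19 MB requested from the pool
    val granted = 4L * 1024 * 1024           // suppose only 4 MB of the 19 MB is granted
    val newThreshold = initialThreshold + granted              // threshold grows to 9 MB
    val shouldSpill = currentMemory >= newThreshold            // 12 MB >= 9 MB, so we spill
    println((amountToRequest, newThreshold, shouldSpill))
  }
}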
maybeSpill() calls spill() to do the actual spilling:
override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  // Sort with the given comparator and return an iterator over the sorted results
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  // Spill the iterator's contents to a disk file
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  // spills is an ArrayBuffer recording every spilled file
  spills += spillFile
}
The spillMemoryIteratorToDisk() method is as follows:
private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
    : SpilledFile = {
  // Because these files may be read during shuffle, their compression must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more context.
  // Create a temporary block
  val (blockId, file) = diskBlockManager.createTempShuffleBlock()

  // These variables are reset after each flush
  var objectsWritten: Long = 0
  val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics
  // Create a DiskBlockObjectWriter for the spill file
  val writer: DiskBlockObjectWriter =
    blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)

  // List of batch sizes (bytes) in the order they are written to disk
  // Records the size of each written batch
  val batchSizes = new ArrayBuffer[Long]

  // How many elements we have in each partition
  // Records the number of elements per partition
  val elementsPerPartition = new Array[Long](numPartitions)

  // Flush the disk writer's contents to disk, and update relevant variables.
  // The writer is committed at the end of this process.
  // Flushes the in-memory data to disk one batch at a time
  def flush(): Unit = {
    val segment = writer.commitAndGet()
    batchSizes += segment.length
    _diskBytesSpilled += segment.length
    objectsWritten = 0
  }

  var success = false
  try {
    // Iterate over the records in the map or buffer
    while (inMemoryIterator.hasNext) {
      val partitionId = inMemoryIterator.nextPartition()
      require(partitionId >= 0 && partitionId < numPartitions,
        s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
      // Write the record and update the counters
      inMemoryIterator.writeNext(writer)
      elementsPerPartition(partitionId) += 1
      objectsWritten += 1

      // Once 10000 records have been written, flush the batch to disk
      if (objectsWritten == serializerBatchSize) {
        flush()
      }
    }
    // After the loop, flush whatever is left
    if (objectsWritten > 0) {
      flush()
    } else {
      writer.revertPartialWritesAndClose()
    }
    success = true
  } finally {
    if (success) {
      writer.close()
    } else {
      // This code path only happens if an exception was thrown above before we set success;
      // close our stuff and let the exception be thrown further
      writer.revertPartialWritesAndClose()
      if (file.exists()) {
        if (!file.delete()) {
          logWarning(s"Error deleting ${file}")
        }
      }
    }
  }

  // Return the spilled file
  SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
}
Next comes the sort-and-merge step, implemented by ExternalSorter.writePartitionedMapOutput():
def writePartitionedMapOutput(
    shuffleId: Int,
    mapId: Long,
    mapOutputWriter: ShuffleMapOutputWriter): Unit = {
  var nextPartitionId = 0
  // If nothing was spilled
  if (spills.isEmpty) {
    // Case where we only have in-memory data
    val collection = if (aggregator.isDefined) map else buffer
    // Sort with the given comparator
    val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
    while (it.hasNext()) {
      val partitionId = it.nextPartition()
      var partitionWriter: ShufflePartitionWriter = null
      var partitionPairsWriter: ShufflePartitionPairsWriter = null
      TryUtils.tryWithSafeFinally {
        partitionWriter = mapOutputWriter.getPartitionWriter(partitionId)
        val blockId = ShuffleBlockId(shuffleId, mapId, partitionId)
        partitionPairsWriter = new ShufflePartitionPairsWriter(
          partitionWriter,
          serializerManager,
          serInstance,
          blockId,
          context.taskMetrics().shuffleWriteMetrics)
        // Write out the records of this partition one by one
        while (it.hasNext && it.nextPartition() == partitionId) {
          it.writeNext(partitionPairsWriter)
        }
      } {
        if (partitionPairsWriter != null) {
          partitionPairsWriter.close()
        }
      }
      nextPartitionId = partitionId + 1
    }
  // If spills happened, merge-sort the spill files with the in-memory data, then write each
  // partition in turn into a ShufflePartitionPairsWriter
  } else {
    // We must perform merge-sort; get an iterator by partition and write everything directly.
    // The merge sort happens inside partitionedIterator
    for ((id, elements) <- this.partitionedIterator) {
      val blockId = ShuffleBlockId(shuffleId, mapId, id)
      var partitionWriter: ShufflePartitionWriter = null
      var partitionPairsWriter: ShufflePartitionPairsWriter = null
      TryUtils.tryWithSafeFinally {
        partitionWriter = mapOutputWriter.getPartitionWriter(id)
        partitionPairsWriter = new ShufflePartitionPairsWriter(
          partitionWriter,
          serializerManager,
          serInstance,
          blockId,
          context.taskMetrics().shuffleWriteMetrics)
        if (elements.hasNext) {
          for (elem <- elements) {
            partitionPairsWriter.write(elem._1, elem._2)
          }
        }
      } {
        if (partitionPairsWriter != null) {
          partitionPairsWriter.close()
        }
      }
      nextPartitionId = id + 1
    }
  }

  context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
  context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
  context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)
}
The partitionedIterator() method:
def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
  val usingMap = aggregator.isDefined
  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
  if (spills.isEmpty) {
    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
    // we don't even need to sort by anything other than partition ID
    // No spills and no key ordering: sort by partition id only
    if (ordering.isEmpty) {
      // The user hasn't requested sorted keys, so only sort by partition ID, not key
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
    // No spills but a key ordering was requested: sort by partition id first, then by key
    } else {
      // We do need to sort by both partition ID and key
      groupByPartition(destructiveIterator(
        collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
    }
  } else {
    // Merge spilled and in-memory data
    // If there were spills, merge-sort the spill files together with the in-memory data
    merge(spills, destructiveIterator(
      collection.partitionedDestructiveSortedIterator(comparator)))
  }
}
The merge method is as follows (a sketch of the k-way merge idea follows the code):
private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  // Open a reader for each spill file
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  // Iterate over the partitions
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    // Combine the spill-file iterators with the in-memory iterator
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    // If there is aggregation logic, aggregate within the partition, ordering keys with keyComparator
    if (aggregator.isDefined) {
      // Perform partial aggregation across partitions
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    // If there is no aggregation but an ordering is defined, merge-sort according to the ordering
    } else if (ordering.isDefined) {
      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      (p, mergeSort(iterators, ordering.get))
    // With neither, simply concatenate the iterators
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}
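The mergeSort() call above performs a k-way merge over one iterator per spill file plus the in-memory iterator. A minimal, self-contained sketch of that idea, with simplified types, made-up names, and no Spark dependencies, might look like this:

object KWayMergeSketch {
  def kWayMerge[T](inputs: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
    // Min-heap of buffered iterators, ordered by their current head element.
    val heap = new scala.collection.mutable.PriorityQueue[BufferedIterator[T]]()(
      Ordering.by[BufferedIterator[T], T](_.head).reverse)
    inputs.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
    new Iterator[T] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): T = {
        val it = heap.dequeue()    // the iterator with the smallest head
        val elem = it.next()
        if (it.hasNext) heap.enqueue(it)
        elem
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val merged = kWayMerge(Seq(Iterator(1, 4, 7), Iterator(2, 5), Iterator(3, 6)))
    println(merged.mkString(", "))   // 1, 2, 3, 4, 5, 6, 7
  }
}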
Back in write(), commitAllPartitions() is called to produce the output; it in turn calls writeIndexFileAndCommit() to write out the data and index files, as shown below (a small illustration of the resulting index layout follows the method):
def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Long,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  // Create the index file and a temporary index file
  val indexFile = getIndexFile(shuffleId, mapId)
  val indexTmp = Utils.tempFileWith(indexFile)
  try {
    // Get the shuffle data file
    val dataFile = getDataFile(shuffleId, mapId)
    // There is only one IndexShuffleBlockResolver per executor, this synchronization make sure
    // the following check and rename are atomic.
    synchronized {
      // Check whether an index file already matches the data file
      val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
      if (existingLengths != null) {
        // Another attempt for the same task has already written our map outputs successfully,
        // so just use the existing partition lengths and delete our temporary map outputs.
        // The shuffle write has already completed, so just delete the temporary output
        System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
        if (dataTmp != null && dataTmp.exists()) {
          dataTmp.delete()
        }
      } else {
        // Otherwise create a BufferedOutputStream for the temporary index file
        // This is the first successful attempt in writing the map outputs for this task,
        // so override any existing index and data files with the ones we wrote.
        val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
        Utils.tryWithSafeFinally {
          // We take in lengths of each block, need to convert it to offsets.
          // Accumulate each partition's length into offsets and write them to the temp index file
          var offset = 0L
          out.writeLong(offset)
          for (length <- lengths) {
            offset += length
            out.writeLong(offset)
          }
        } {
          out.close()
        }

        // Delete any index file that may already exist
        if (indexFile.exists()) {
          indexFile.delete()
        }
        // Delete any data file that may already exist
        if (dataFile.exists()) {
          dataFile.delete()
        }
        // Rename the temporary files to their final names
        if (!indexTmp.renameTo(indexFile)) {
          throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
        }
        if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
          throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
        }
      }
    }
  } finally {
    if (indexTmp.exists() && !indexTmp.delete()) {
      logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
    }
  }
}
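As a small illustration of the index-file layout produced above: the index file holds numPartitions + 1 cumulative offsets, and reducer i's data is the byte range [offsets(i), offsets(i + 1)) of the data file. The sketch below uses made-up lengths and illustrative names.

object IndexFileSketch {
  // What writeIndexFileAndCommit() writes: numPartitions + 1 cumulative offsets.
  def offsetsFor(lengths: Array[Long]): Array[Long] =
    lengths.scanLeft(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    val partitionLengths = Array(100L, 0L, 250L, 50L)
    val offsets = offsetsFor(partitionLengths)
    println(offsets.mkString(", "))  // 0, 100, 100, 350, 400
    val reduceId = 2
    println(s"reducer $reduceId reads bytes [${offsets(reduceId)}, ${offsets(reduceId + 1)}) of the data file")
  }
}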