Motivation
The various 2G limits in Spark.
- When reading a data block cached on the local disk, the following code fragment is called:
val iterToReturn: Iterator[Any] = {
  val diskBytes = diskStore.getBytes(blockId)
  if (level.deserialized) {
    val diskValues = serializerManager.dataDeserializeStream(
      blockId,
      diskBytes.toInputStream(dispose = true))(info.classTag)
    maybeCacheDiskValuesInMemory(info, blockId, level, diskValues)
  } else {
    val stream = maybeCacheDiskBytesInMemory(info, blockId, level, diskBytes)
      .map { _.toInputStream(dispose = false) }
      .getOrElse { diskBytes.toInputStream(dispose = true) }
    serializerManager.dataDeserializeStream(blockId, stream)(info.classTag)
  }
}
def getBytes(blockId: BlockId): ChunkedByteBuffer = {
  val file = diskManager.getFile(blockId.name)
  val channel = new RandomAccessFile(file, "r").getChannel
  Utils.tryWithSafeFinally {
    // For small files, directly read rather than memory map
    if (file.length < minMemoryMapBytes) {
      val buf = ByteBuffer.allocate(file.length.toInt)
      channel.position(0)
      while (buf.remaining() != 0) {
        if (channel.read(buf) == -1) {
          throw new IOException("Reached EOF before filling buffer\n" +
            s"offset=0\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}")
        }
      }
      buf.flip()
      new ChunkedByteBuffer(buf)
    } else {
      new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length))
    }
  } {
    channel.close()
  }
}
The above code has the following problems:
* channel.map(MapMode.READ_ONLY, 0, file.length) returns a MappedByteBuffer instance, and the size of a MappedByteBuffer cannot exceed 2 GB.
* Generating the Iterator[Any] requires loading all of the data into memory, which may consume a large amount of (off-heap) memory.
* MappedByteBuffer maps the file into memory managed by the operating system's page cache; the JVM cannot control that memory.
- When serializing data with Kryo, the following code fragment is called:
override def serialize[T: ClassTag](t: T): ByteBuffer = {
  output.clear()
  val kryo = borrowKryo()
  try {
    kryo.writeClassAndObject(output, t)
  } catch {
    case e: KryoException if e.getMessage.startsWith("Buffer overflow") =>
      throw new SparkException(s"Kryo serialization failed: ${e.getMessage}. To avoid this, " +
        "increase spark.kryoserializer.buffer.max value.")
  } finally {
    releaseKryo(kryo)
  }
  ByteBuffer.wrap(output.toBytes)
}
The above code has the following problems:
* The serialized data is stored in the internal byte[] of output, and the size of a byte[] cannot exceed 2 GB.
- When RPC writes the data to be sent into a Channel, the following code fragment is called:
public long transferTo(final WritableByteChannel target, final long position) throws IOException {
  Preconditions.checkArgument(position == totalBytesTransferred, "Invalid position.");
  // Bytes written for header in this call.
  long writtenHeader = 0;
  if (header.readableBytes() > 0) {
    writtenHeader = copyByteBuf(header, target);
    totalBytesTransferred += writtenHeader;
    if (header.readableBytes() > 0) {
      return writtenHeader;
    }
  }
  // Bytes written for body in this call.
  long writtenBody = 0;
  if (body instanceof FileRegion) {
    writtenBody = ((FileRegion) body).transferTo(target, totalBytesTransferred - headerLength);
  } else if (body instanceof ByteBuf) {
    writtenBody = copyByteBuf((ByteBuf) body, target);
  }
  totalBytesTransferred += writtenBody;
  return writtenHeader + writtenBody;
}
The above code has the following problems:
* The size of a ByteBuf cannot exceed 2 GB.
* In-memory data larger than 2 GB cannot be transferred.
- When decoding a received RPC message, the following code fragment is called:
public final class MessageDecoder extends MessageToMessageDecoder<ByteBuf> {
  private static final Logger logger = LoggerFactory.getLogger(MessageDecoder.class);

  @Override
  public void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
    Message.Type msgType = Message.Type.decode(in);
    Message decoded = decode(msgType, in);
    assert decoded.type() == msgType;
    logger.trace("Received message {}: {}", msgType, decoded);
    out.add(decoded);
  }

  private Message decode(Message.Type msgType, ByteBuf in) {
    switch (msgType) {
      case ChunkFetchRequest:
        return ChunkFetchRequest.decode(in);
      case ChunkFetchSuccess:
        return ChunkFetchSuccess.decode(in);
      case ChunkFetchFailure:
        return ChunkFetchFailure.decode(in);
      default:
        throw new IllegalArgumentException("Unexpected message type: " + msgType);
    }
  }
}
The above code has the following problems:
* The size of a ByteBuf cannot exceed 2 GB.
* Decoding can only start after all of the data has been received.
Goals
- Eliminate the various 2G limits in Spark. (The 2G limit 1, 2, 3, 4)
- Support back-pressure flow control for remote data reading (experimental goal). (The 2G limit 4)
- Add a buffer pool (long-range goal).
Design
Eliminating the various 2G limits in Spark.
Replace ByteBuffer with ChunkedByteBuffer. (The 2G limit 1, 2)
ChunkedByteBuffer introduction:
- Stores data in multiple ByteBuffer instances.
- Supports reference counting (see Netty's "Reference counted objects"), a prerequisite for the buffer pool feature.
- Supports serialization for easy transport.
- Supports slice, duplicate and copy operations similar to ByteBuffer, making the data easy to work with.
- Can be efficiently converted to InputStream, ByteBuffer, byte[], ByteBuf, etc., to interoperate with other interfaces.
- Data can be written conveniently to an OutputStream.
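To make the idea concrete, here is a minimal, illustrative sketch of a chunked buffer. The class and method names below are simplified stand-ins, not the actual Spark ChunkedByteBuffer API, and the InputStream conversion copies bytes for brevity while the real class avoids that.

import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.ByteBuffer
import scala.collection.JavaConverters._

// Simplified sketch of the chunked-buffer idea; the real ChunkedByteBuffer carries more
// functionality (reference counting, ByteBuf/byte[] conversions, dispose, copy, ...).
class SimpleChunkedByteBuffer(val chunks: Array[ByteBuffer]) {
  // The logical size is a Long, so it is not capped at Integer.MAX_VALUE.
  def size: Long = chunks.map(_.remaining().toLong).sum

  // Expose all chunks as one continuous InputStream.
  def toInputStream: InputStream = {
    val streams = chunks.iterator.map { chunk =>
      val dup = chunk.duplicate()          // leave the original position untouched
      val bytes = new Array[Byte](dup.remaining())
      dup.get(bytes)
      new ByteArrayInputStream(bytes): InputStream
    }
    new SequenceInputStream(streams.asJavaEnumeration)
  }

  // Duplicate every underlying chunk; the bytes themselves are shared, not copied.
  def duplicate(): SimpleChunkedByteBuffer =
    new SimpleChunkedByteBuffer(chunks.map(_.duplicate()))
}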
- Move the ChunkedByteBuffer class to common/network-common/src/main/java/org/apache/spark/network/buffer/.
- Modify the return value of ManagedBuffer.nioByteBuffer to a ChunkedByteBuffer instance. (The 2G limit 1)
- Further standardize the use of ManagedBuffer and ChunkedByteBuffer:
  - Data from memory, the network, disk and other sources is represented by ManagedBuffer.
  - ChunkedByteBuffer only represents data in memory.
  - ManagedBuffer.nioByteBuffer is called only when it is confirmed that there is enough memory to hold the data.
- Modify the parameter of SerializerInstance.deserialize and the return value of SerializerInstance.serialize to ChunkedByteBuffer instances. (The 2G limit 2)
def serialize[T: ClassTag](t: T): ChunkedByteBuffer = {
  output.clear()
  val out = ChunkedByteBufferOutputStream.newInstance()
  // The serialized data is written to the OutputStream rather than to the
  // internal byte[] of the output object.
  output.setOutputStream(out)
  val kryo = borrowKryo()
  try {
    kryo.writeClassAndObject(output, t)
    output.close()
  } finally {
    // Return the Kryo instance to the pool, as in the existing serialize method.
    releaseKryo(kryo)
  }
  out.toChunkedByteBuffer
}
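The deserialize side would change in the same spirit. A hedged sketch, assuming a toInputStream method on ChunkedByteBuffer and the same borrowKryo/releaseKryo pool helpers and input field as above:

import scala.reflect.ClassTag

def deserialize[T: ClassTag](bytes: ChunkedByteBuffer): T = {
  val kryo = borrowKryo()
  try {
    // Read from the chunked buffer's stream instead of a single byte[]/ByteBuffer,
    // so the total payload is no longer limited to Integer.MAX_VALUE bytes.
    input.setInputStream(bytes.toInputStream())
    kryo.readClassAndObject(input).asInstanceOf[T]
  } finally {
    releaseKryo(kryo)
  }
}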
- Other changes.
Replace ByteBuf with InputStream.
1. Add an InputStreamManagedBuffer class to convert an InputStream instance into a ManagedBuffer instance. (The 2G limit 4)
2. Modify the NioManagedBuffer.convertToNetty method to return an InputStream instance when the data size is larger than Integer.MAX_VALUE. (The 2G limit 3)
3. Modify the MessageWithHeader class to support an InputStream body. (The 2G limit 3)
   - Changes 2 and 3 combined support transferring in-memory data larger than 2 GB.
4. Modify the parameter of the Encodable.encode method to an OutputStream instance. (The 2G limit 4)
5. Add a toInputStream method to UploadBlock to handle mixed-storage data (metadata in memory, block data on disk). (The 2G limit 3)
public InputStream toInputStream() throws IOException {
  ChunkedByteBufferOutputStream out = ChunkedByteBufferOutputStream.newInstance();
  Encoders.Bytes.encode(out, type().id());
  encodeWithoutBlockData(out);
  // out.toChunkedByteBuffer().toInputStream(): data in memory
  // blockData.createInputStream(): data on disk (FileInputStream)
  return new SequenceInputStream(out.toChunkedByteBuffer().toInputStream(),
    blockData.createInputStream());
}
   - Changes 2, 3, 4, and 5 combined resolve the 2G limit in the RPC message encoding and sending process.
6. Modify the parameters of the decode methods of the classes that implement the Encodable interface to an InputStream instance. (The 2G limit 4)
7. Modify the TransportFrameDecoder class to represent a frame with a LinkedList<ByteBuf>, removing the size limit on frames. (The 2G limit 4)
8. Add a ByteBufInputStream class to wrap a LinkedList<ByteBuf> as an InputStream; once a ByteBuf has been fully read, ByteBuf.release is called to free it (see the sketch after this list). (The 2G limit 4)
9. Modify the parameter of the RpcHandler.receive method to an InputStream instance. (The 2G limit 4)
   - Changes 6, 7, 8, and 9 combined resolve the 2G limit in the RPC message receiving and decoding process.
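As mentioned in item 8 above, a ByteBufInputStream can expose the frame's ByteBuf list as a stream and release each buffer as soon as it is consumed. A minimal sketch, simplified relative to whatever the final class looks like:

import java.io.InputStream
import java.util.LinkedList
import io.netty.buffer.ByteBuf

// Minimal sketch: expose a list of ByteBufs as one InputStream and release each
// ByteBuf as soon as it has been fully read.
class ByteBufInputStream(buffers: LinkedList[ByteBuf]) extends InputStream {
  override def read(): Int = {
    val current = currentBuf()
    if (current == null) -1 else current.readByte() & 0xFF
  }

  override def read(dest: Array[Byte], off: Int, len: Int): Int = {
    val current = currentBuf()
    if (current == null) {
      -1
    } else {
      val n = math.min(len, current.readableBytes())
      current.readBytes(dest, off, n)
      n
    }
  }

  // Return the first ByteBuf that still has readable bytes, releasing exhausted ones.
  private def currentBuf(): ByteBuf = {
    while (!buffers.isEmpty && !buffers.getFirst.isReadable) {
      buffers.removeFirst().release()
    }
    if (buffers.isEmpty) null else buffers.getFirst
  }
}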
Read data
Local data
- Only data stored in memory is represented by ChunkedByteBuffer; all other data is represented by ManagedBuffer. (The 2G limit 1)
- Modify the return type of DiskStore.getBytes to a ManagedBuffer instance; ManagedBuffer.nioByteBuffer is called only when there is enough memory to hold the ManagedBuffer's data.
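A hedged sketch of what the modified DiskStore.getBytes could look like, assuming the existing FileSegmentManagedBuffer from network-common is reused and that diskManager and a transportConf are in scope (the latter is an assumption for illustration):

import java.io.File
import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer}

// Hedged sketch: return a ManagedBuffer that merely points at the file segment;
// no bytes are read or memory-mapped here. The caller later chooses between
// nioByteBuffer() (load into memory) and createInputStream() (stream from disk).
def getBytes(blockId: BlockId): ManagedBuffer = {
  val file: File = diskManager.getFile(blockId.name)
  new FileSegmentManagedBuffer(transportConf, file, 0, file.length())
}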
Remote Data (The 2G limit 4)
There are three options:
- Add an InputStreamInterceptor to propagate back-pressure to the shuffle server (this option has been implemented):
  - When the number of ByteBufs in the cache exceeds a certain amount, call channel.config().setAutoRead(false) to disable AUTO_READ, so channel.read() is no longer called automatically.
  - When the number of ByteBufs in the cache falls below a certain amount, call channel.config().setAutoRead(true) to re-enable AUTO_READ.
  - The advantage of this option is that it supports propagating back-pressure; the drawback is that it changes the semantics of the existing API, and in some cases the IO retry function stops working.
  - References (Chinese articles):
    - Netty的read事件與AUTO_READ模式 (Netty's read event and the AUTO_READ mode)
    - TCP/IP詳解--舉例明白發送/接收緩沖區、滑動窗口協議之間的關系 (TCP/IP Illustrated: the relationship between send/receive buffers and the sliding window protocol, by example)
    - TCP 滑動窗口協議 詳解 (The TCP sliding window protocol in detail)
  - InputStreamInterceptor design:
    - Create a fixed-size, thread-safe buffer pool.
    - The netty thread puts received ByteBufs into the pool; when the buffered ByteBufs exceed 90% of the pool capacity, call channel.config().setAutoRead(false) to stop receiving data automatically, so the peer's writes block.
    - The data-processing thread takes ByteBufs out of the pool; when the buffered ByteBufs fall below 10% of the pool capacity, call channel.config().setAutoRead(true) to re-enable automatic reading.
    - After a ByteBuf has been processed, release it and call channel.read() to receive more data.
- When the size of a message exceeds a certain threshold, write the message to disk instead of keeping it in memory.
  - The advantage of this option is that it uses very little memory; the disadvantage is that it increases disk IO.
- Combine this with the buffer pool and keep data in memory as far as possible.
  - Write the message into the buffer pool while it has enough memory; write to disk only when memory is insufficient.
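To illustrate option 1, a rough sketch of the AUTO_READ toggling described above. The class name, watermarks and threading model here are assumptions for illustration, not the actual InputStreamInterceptor implementation:

import java.util.concurrent.ArrayBlockingQueue
import io.netty.buffer.ByteBuf
import io.netty.channel.{ChannelHandlerContext, SimpleChannelInboundHandler}

// Illustrative back-pressure handler: buffer incoming ByteBufs in a bounded queue and
// toggle AUTO_READ at 90% / 10% watermarks so the sender's writes eventually block.
class BackPressureHandler(capacity: Int)
  extends SimpleChannelInboundHandler[ByteBuf](false) {   // false: do not auto-release

  private val queue = new ArrayBlockingQueue[ByteBuf](capacity)

  override protected def channelRead0(ctx: ChannelHandlerContext, buf: ByteBuf): Unit = {
    queue.put(buf)   // the 90% watermark below should keep this call from ever blocking
    if (queue.size() > capacity * 0.9) {
      ctx.channel().config().setAutoRead(false)   // stop reading; the peer's writes back up
    }
  }

  // Called by the data-processing thread to obtain the next buffer to handle.
  def nextBuffer(): ByteBuf = queue.take()

  // Called by the data-processing thread after it has finished with one buffer.
  def onBufferProcessed(ctx: ChannelHandlerContext, buf: ByteBuf): Unit = {
    buf.release()
    if (queue.size() < capacity * 0.1) {
      ctx.channel().config().setAutoRead(true)    // resume automatic reads
    }
    ctx.read()                                    // explicitly pull the next chunk
  }
}

A separate processing thread would loop on nextBuffer(), consume the data, and then call onBufferProcessed().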
Add buffer pool
The buffer pool can reduce memory allocation and GC time and improve the performance of Spark core.
- It reduces the number of large objects created in the Eden space; according to Twitter's experience, using buffer pools can significantly reduce GC counts (see "Netty 4 Reduces GC Overhead by 5x at Twitter").
- It reduces the number of memory allocations and zero-fills.
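For reference, netty's pooled allocator already behaves this way for ByteBufs; the snippet below only demonstrates that existing netty API, as a model for what a Spark-side buffer pool would do:

import io.netty.buffer.{ByteBuf, PooledByteBufAllocator}

// Demonstration of netty's pooled allocation: buffers come from (and return to) arena
// pools instead of being newly allocated in the Eden space and reclaimed by GC.
object PooledAllocationDemo {
  def main(args: Array[String]): Unit = {
    val allocator = PooledByteBufAllocator.DEFAULT
    val buf: ByteBuf = allocator.directBuffer(64 * 1024)   // 64 KB pooled direct buffer
    try {
      buf.writeBytes(new Array[Byte](1024))                 // write some zero bytes
      println(s"readable bytes: ${buf.readableBytes()}")
    } finally {
      buf.release()   // reference count drops to 0, memory returns to the pool
    }
  }
}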
Using as a generic library
The difficulties in implementing this feature are:
- Spark currently does not release ByteBuffers explicitly; they are reclaimed by the Java GC.
- Adding reference counting to release memory proactively would reduce GC pressure, but it requires adding reference-counting and memory-leak detection code, which is a large change.
- Reuse netty's buffer code to get memory-leak detection and dynamic resizing.
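For illustration, a minimal manual reference-counting sketch (the class and its methods are hypothetical; in practice the intent is to reuse netty's ReferenceCounted and ResourceLeakDetector machinery rather than hand-rolling this):

import java.nio.ByteBuffer
import java.util.concurrent.atomic.AtomicInteger

// Hedged sketch of manual reference counting over a set of ByteBuffer chunks, so the
// memory can be released deterministically instead of waiting for the JVM GC.
class RefCountedChunks(chunks: Array[ByteBuffer]) {
  private val refCnt = new AtomicInteger(1)

  def retain(): this.type = { refCnt.incrementAndGet(); this }

  def release(): Boolean = {
    val remaining = refCnt.decrementAndGet()
    if (remaining == 0) {
      deallocate()   // free the memory as soon as the last reference is dropped
      true
    } else if (remaining < 0) {
      throw new IllegalStateException("buffer released too many times")
    } else {
      false
    }
  }

  private def deallocate(): Unit = {
    // For direct buffers, the native memory would be freed here;
    // heap buffers simply become unreachable.
  }
}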