Preface
I have recently been working on StarRocks integration with data lakes (specifically adapting SR 3.2 to Paimon 0.6), reading and modifying parts of the code along the way, and found the StarRocks JNI Connector to be a genuinely well-crafted module: it gives all kinds of big data components a convenient entry point for plugging into StarRocks, thereby enriching its federated query capabilities. This post attempts a brief analysis of its design, and then uses the Paimon Reader to show how one of StarRocks' data lake readers is built on top of it.
Overall Design
The idea behind the StarRocks JNI Connector is straightforward: it is an abstract intermediate layer sitting between the C++-based BE and Java-based big data components, which lets StarRocks reuse the components' Java SDKs directly, avoiding both intrusive changes to the BE code and the many inconveniences of accessing big data storage from C++. A schematic diagram follows.
As the diagram shows, the JNI Connector defines a unified contract on the Java side: an integrating party implements three methods, open() / getNext() / close() (all declared in the ConnectorScanner abstract class), supplies the necessary information (such as data types), and the data can then be read out. Internally, the JNI Connector writes the data to be processed into a native memory region that C++ can see (off-heap memory, from Java's point of view), and the BE side performs its Scan by reading that memory. The code of the ConnectorScanner abstract class is as follows.
public abstract class ConnectorScanner {
private OffHeapTable offHeapTable;
private String[] fields;
private ColumnType[] types;
private int tableSize;
/**
* Initialize the reader with parameters passed by the class constructor and allocate necessary resources.
* Developers can call {@link ConnectorScanner#initOffHeapTableWriter(ColumnType[], String[], int)} method here
* to allocate memory spaces.
*/
public abstract void open() throws IOException;
/**
* Close the reader and release resources.
*/
public abstract void close() throws IOException;
/**
* Scan the original data and save it to the off-heap table.
*
* @return The number of rows scanned.
* The specific implementation needs to call the {@link ConnectorScanner#appendData(int, Object)} method
* to save data to off-heap table.
* The number of rows scanned must be less than or equal to {@link ConnectorScanner#tableSize}.
*/
public abstract int getNext() throws IOException;
/**
* This method needs to be called before {@link ConnectorScanner#getNext()}.
*
* @param requiredTypes column types to scan
* @param requiredFields columns names to scan
* @param fetchSize number of rows
*/
protected void initOffHeapTableWriter(ColumnType[] requiredTypes, String[] requiredFields, int fetchSize) {
this.tableSize = fetchSize;
this.types = requiredTypes;
this.fields = requiredFields;
}
protected void appendData(int index, ColumnValue value) {
offHeapTable.appendData(index, value);
}
protected int getTableSize() {
return tableSize;
}
public OffHeapTable getOffHeapTable() {
return offHeapTable;
}
public long getNextOffHeapChunk() throws IOException {
initOffHeapTable();
int numRows = 0;
try {
numRows = getNext();
} catch (IOException e) {
releaseOffHeapTable();
throw e;
}
return finishOffHeapTable(numRows);
}
private void initOffHeapTable() {
offHeapTable = new OffHeapTable(types, fields, tableSize);
}
private long finishOffHeapTable(int numRows) {
offHeapTable.setNumRows(numRows);
return offHeapTable.getMetaNativeAddress();
}
protected void releaseOffHeapColumnVector(int fieldId) {
offHeapTable.releaseOffHeapColumnVector(fieldId);
}
protected void releaseOffHeapTable() {
if (offHeapTable != null) {
offHeapTable.close();
}
}
}
Two points in this design and code deserve particular attention: first, how to accommodate the storage types of different big data components at read time (ColumnValue and ColumnType); second, how to guarantee that the BE side accesses the memory region holding the external table data correctly and efficiently (OffHeapColumnVector and OffHeapTable). The two are discussed in turn below.
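Before diving into those two topics, it helps to see the whole per-scan call sequence in one place. The comment block below is my own summary, not code from the repo; every step corresponds to a method visible in ConnectorScanner above or in the BE-side JniScanner code later in this post.
// Per-scan call sequence, driven by the BE over JNI:
//
//   new <ScannerImpl>(fetchSize, params)  // reflective construction (see JniScanner later)
//   open()                                // implementer: resolve types, call initOffHeapTableWriter()
//   loop until no rows remain:
//     getNextOffHeapChunk()               // base class: allocate an OffHeapTable,
//                                         //   run getNext() -> appendData() per cell,
//                                         //   return the native address of the meta vector
//     // BE copies the columns out of native memory (_fill_column),
//     // then asks the Java side to release the chunk (releaseOffHeapTable)
//   close()                               // implementer: release reader resources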
Type Compatibility
The JNI Connector defines the ColumnValue interface as the value-access contract for the various data types of different components. There are currently three implementations, for Hive, Hudi and Paimon respectively; the class diagram is shown below.
As the diagram shows, both common primitive types and composite types are supported. Taking PaimonColumnValue from the paimon-reader module as an example, part of the code for reading primitive type values is as follows.
@Override
public long getLong() {
return (long) fieldData;
}
@Override
public double getDouble() {
return (double) fieldData;
}
@Override
public String getString(ColumnType.TypeValue type) {
if (type == ColumnType.TypeValue.DATE) {
int epoch = (int) fieldData;
LocalDate date = LocalDate.ofEpochDay(epoch);
return PaimonScannerUtils.formatDate(date);
} else {
return fieldData.toString();
}
}
@Override
public String getTimestamp(ColumnType.TypeValue type) {
if (type == ColumnType.TypeValue.DATETIME_MILLIS) {
Timestamp ts = (Timestamp) fieldData;
LocalDateTime dateTime = ts.toLocalDateTime();
return PaimonScannerUtils.formatDateTime(dateTime);
} else {
return fieldData.toString();
}
}
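Note the DATE branch above: Paimon represents a DATE value as an int counting days since the Unix epoch, which is exactly why getString() goes through LocalDate.ofEpochDay(). A quick sanity check:
// Paimon stores DATE as days since 1970-01-01, hence LocalDate.ofEpochDay():
LocalDate d = LocalDate.ofEpochDay(19723); // 2024-01-01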
For composite types (i.e., Array, Map and Struct), each element has to be handled individually, as in the PaimonColumnValue#unpackMap() method:
@Override
public void unpackMap(List<ColumnValue> keys, List<ColumnValue> values) {
InternalMap map = (InternalMap) fieldData;
DataType keyType;
DataType valueType;
if (dataType instanceof MapType) {
keyType = ((MapType) dataType).getKeyType();
valueType = ((MapType) dataType).getValueType();
} else {
throw new UnsupportedOperationException("Unsupported type: " + dataType);
}
InternalArray keyArray = map.keyArray();
toPaimonColumnValue(keys, keyArray, keyType);
InternalArray valueArray = map.valueArray();
toPaimonColumnValue(values, valueArray, valueType);
}
private void toPaimonColumnValue(List<ColumnValue> values, InternalArray array, DataType dataType) {
for (int i = 0; i < array.size(); i++) {
PaimonColumnValue cv = null;
Object o = InternalRowUtils.get(array, i, dataType);
if (o != null) {
cv = new PaimonColumnValue(o, dataType);
}
values.add(cv);
}
}
As for the ColumnType class mentioned above, it mainly enumerates the types StarRocks supports: 19 fixed-length primitive types (INT, LONG, TIMESTAMP, etc.), 2 variable-length primitive types (VARCHAR, BINARY) and 3 composite types. It also provides helper methods for parsing composite types; interested readers can consult the code themselves. The sketch below illustrates the core trick those helpers rely on.
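Parsing a nested type string boils down to splitting on commas only at the outermost nesting level. The method below is my own illustrative sketch of that idea, not the actual ColumnType code.
import java.util.ArrayList;
import java.util.List;

// Split "varchar,array<int>" into ["varchar", "array<int>"], ignoring commas
// that sit inside angle brackets. Recursing on each part after peeling off the
// outer "map<...>" / "array<...>" / "struct<...>" wrapper yields the type tree.
static List<String> splitTopLevel(String s) {
    List<String> parts = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '<') {
            depth++;
        } else if (c == '>') {
            depth--;
        } else if (c == ',' && depth == 0) { // split only at the outermost level
            parts.add(s.substring(start, i));
            start = i + 1;
        }
    }
    parts.add(s.substring(start));
    return parts;
}
// e.g. for "map<varchar,array<int>>", peel off "map<...>" and then
// splitTopLevel("varchar,array<int>") yields ["varchar", "array<int>"].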
Off-Heap Memory Layout and Access
For SR BE to be able to process the table data read by the JNI Connector, the data in off-heap memory must necessarily be stored in a layout the BE can natively recognize. The native storage of the three categories of types in the BE's C++ code is as follows:
- A fixed-length primitive column, FixedLengthColumn, needs one container: vector<CppType> _data;
- A variable-length primitive column, BinaryColumnBase, needs two containers: a data container vector<uint8_t> _bytes and an offset container vector<uint32_t> _offsets, which record each row's data and each row's starting position respectively (see the illustration after this list);
- Composite columns vary case by case. For example, the array type ArrayColumn needs two containers, ColumnPtr _elements and FixedLengthColumn<uint32_t>::Ptr _offsets (ColumnPtr is used to accommodate nested types), while the map type MapColumn needs three containers, which readers can work out for themselves.
In particular, if a column is nullable, then per the definition of NullableColumn an extra null-flag container, FixedLengthColumn<uint8_t>::Ptr _null_column, is needed to record whether each row of the column is null.
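As a concrete illustration of the variable-length layout combined with the null flags, here is how a nullable VARCHAR column holding the three rows "ab", NULL, "cde" would be laid out; note that n rows need n + 1 offsets:
// _null_column: [0, 1, 0]             // 1 = this row is NULL
// _offsets:     [0, 2, 2, 5]          // row i spans bytes [_offsets[i], _offsets[i+1])
// _bytes:       ['a','b','c','d','e'] // concatenated row payloads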
OffHeapColumnVector
Building on the above, the JNI Connector designed OffHeapColumnVector, an off-heap-memory-backed container for external-table column data. Its key fields, annotated below, are easy to follow.
// null flags, corresponding to _null_column
private long nulls;
// actual data, corresponding to _data
private long data;
// offsets, corresponding to _offsets
private long offsetData;
// initial capacity
private int capacity;
// column type
private ColumnType type;
// number of null elements
private int numNulls;
// number of elements appended so far
protected int elementsAppended;
// nested OffHeapColumnVectors, used for variable-length and composite types
private OffHeapColumnVector[] childColumns;
OffHeapColumnVector allocates memory in a style similar to Spark Tungsten (covered in an earlier post on this blog): in test environments it calls the JVM Unsafe API, while in production it calls the native API of the SR BE memory tracker. The reserveInternal() method shows the allocation logic for the different kinds of types quite clearly.
private void reserveInternal(int newCapacity) {
int oldCapacity = (nulls == 0L) ? 0 : capacity;
long oldOffsetSize = (nulls == 0) ? 0 : (capacity + 1) * 4L;
long newOffsetSize = (newCapacity + 1) * 4L;
int typeSize = type.getPrimitiveTypeValueSize();
if (type.isUnknown()) {
// don't do anything.
} else if (typeSize != -1) {
this.data = Platform.reallocateMemory(data, oldCapacity * typeSize, newCapacity * typeSize);
} else if (type.isByteStorageType()) {
this.offsetData = Platform.reallocateMemory(offsetData, oldOffsetSize, newOffsetSize);
int childCapacity = newCapacity * DEFAULT_STRING_LENGTH;
this.childColumns = new OffHeapColumnVector[1];
this.childColumns[0] = new OffHeapColumnVector(childCapacity, new ColumnType(type.name + "#data",
ColumnType.TypeValue.BYTE));
} else if (type.isArray() || type.isMap() || type.isStruct()) {
if (type.isArray() || type.isMap()) {
this.offsetData = Platform.reallocateMemory(offsetData, oldOffsetSize, newOffsetSize);
}
int size = type.childTypes.size();
this.childColumns = new OffHeapColumnVector[size];
for (int i = 0; i < size; i++) {
this.childColumns[i] = new OffHeapColumnVector(newCapacity, type.childTypes.get(i));
}
} else {
throw new RuntimeException("Unhandled type: " + type);
}
this.nulls = Platform.reallocateMemory(nulls, oldCapacity, newCapacity);
Platform.setMemory(nulls + oldCapacity, (byte) 0, newCapacity - oldCapacity);
capacity = newCapacity;
if (offsetData != 0) {
// offsetData[0] == 0 always.
// we have to set it explicitly otherwise it's undefined value here.
Platform.putInt(null, offsetData, 0);
}
}
Now look at how fixed-length types are written and read, taking INT as an example. It is again Spark Tungsten style, much like manipulating a C++ pointer. The Platform class is in fact the class of the same name from Spark with minor modifications; both getInt() and putInt() call the Unsafe API directly.
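For reference, the essence of those two Platform methods (as in Spark's original) is nothing more than a thin wrapper over sun.misc.Unsafe; passing null as the base object makes the offset an absolute off-heap address:
// Simplified from Spark's Platform; UNSAFE is the singleton sun.misc.Unsafe instance.
public static void putInt(Object object, long offset, int value) {
    UNSAFE.putInt(object, offset, value);
}
public static int getInt(Object object, long offset) {
    return UNSAFE.getInt(object, offset);
}
With that in mind, the append and read paths for INT are: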
public int appendInt(int v) {
reserve(elementsAppended + 1);
putInt(elementsAppended, v);
return elementsAppended++;
}
private void putInt(int rowId, int value) {
Platform.putInt(null, data + 4L * rowId, value);
}
public int getInt(int rowId) {
return Platform.getInt(null, data + 4L * rowId);
}
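A quick usage sketch, runnable in a test environment where Platform falls back to Unsafe. It assumes the constructor is accessible (it is invoked the same way reserveInternal() above creates child vectors) and that TypeValue.INT is one of the 19 fixed-length types:
OffHeapColumnVector v = new OffHeapColumnVector(4, new ColumnType("c0", ColumnType.TypeValue.INT));
int row = v.appendInt(42);         // reserves capacity, then writes at data + 4 * row
System.out.println(v.getInt(row)); // reads the same slot back: 42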
For variable-length types, the per-row offsets must additionally be written; on the read path, the offsets are used to slice the corresponding chunk out of the byte stream and convert it.
private int appendByteArray(byte[] value, int offset, int length) {
int copiedOffset = arrayData().appendBytes(length, value, offset);
reserve(elementsAppended + 1);
putArrayOffset(elementsAppended, copiedOffset, length);
return elementsAppended++;
}
private void putArrayOffset(int rowId, int offset, int length) {
Platform.putInt(null, offsetData + 4L * rowId, offset);
Platform.putInt(null, offsetData + 4L * (rowId + 1), offset + length);
}
public String getUTF8String(int rowId) {
if (isNullAt(rowId)) {
return null;
}
int start = getArrayOffset(rowId);
int end = getArrayOffset(rowId + 1);
int size = end - start;
byte[] bytes = arrayData().getBytes(start, size);
return new String(bytes, StandardCharsets.UTF_8);
}
Composite types mostly involve operations on childColumns, following the same idea as above, so they are not repeated here; still, a small sketch may help, see below.
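The following is my own simplified sketch (not the actual StarRocks method) of how appending one array cell could work inside OffHeapColumnVector: the element payloads go into the child vector, and the parent records the [start, start + length) range in its offset buffer, exactly mirroring the byte-array case above.
// Hypothetical sketch: append one array-typed cell to this vector.
public int appendArraySketch(List<ColumnValue> elements) {
    int start = childColumns[0].elementsAppended;   // where this row's elements begin
    for (ColumnValue e : elements) {
        childColumns[0].appendValue(e);             // e may be null for a NULL element
    }
    reserve(elementsAppended + 1);
    putArrayOffset(elementsAppended, start, elements.size());
    return elementsAppended++;
}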
OffHeapTable
As the name suggests, OffHeapTable is the component that centrally manages all of the OffHeapColumnVectors belonging to one table. Its implementation is also very concise; a partial excerpt follows.
public class OffHeapTable {
public OffHeapColumnVector[] vectors;
public String[] fields;
public OffHeapColumnVector meta;
public int numRows;
public boolean[] released;
public OffHeapTable(ColumnType[] types, String[] fields, int capacity) {
this.fields = fields;
this.vectors = new OffHeapColumnVector[types.length];
this.released = new boolean[types.length];
int metaSize = 0;
for (int i = 0; i < types.length; i++) {
vectors[i] = new OffHeapColumnVector(capacity, types[i]);
metaSize += types[i].computeColumnSize();
released[i] = false;
}
this.meta = new OffHeapColumnVector(metaSize, new ColumnType("#meta", ColumnType.TypeValue.LONG));
this.numRows = 0;
}
public void appendData(int fieldId, ColumnValue o) {
vectors[fieldId].appendValue(o);
}
public void releaseOffHeapColumnVector(int fieldId) {
if (!released[fieldId]) {
vectors[fieldId].close();
released[fieldId] = true;
}
}
public long getMetaNativeAddress() {
meta.appendLong(numRows);
for (OffHeapColumnVector v : vectors) {
v.updateMeta(meta);
}
return meta.valuesNativeAddress();
}
}
Besides the essentials such as column names, row count and release flags, note in particular that OffHeapTable additionally maintains a metadata OffHeapColumnVector named meta, which holds the starting native addresses of all the data containers so they can be located quickly. The metadata update operation is shown below.
public void updateMeta(OffHeapColumnVector meta) {
if (type.isUnknown()) {
meta.appendLong(0);
} else if (type.isByteStorageType()) {
meta.appendLong(nullsNativeAddress());
meta.appendLong(arrayOffsetNativeAddress());
meta.appendLong(arrayDataNativeAddress());
} else if (type.isArray() || type.isMap() || type.isStruct()) {
meta.appendLong(nullsNativeAddress());
if (type.isArray() || type.isMap()) {
meta.appendLong(arrayOffsetNativeAddress());
}
for (OffHeapColumnVector c : childColumns) {
c.updateMeta(meta);
}
} else {
meta.appendLong(nullsNativeAddress());
meta.appendLong(valuesNativeAddress());
}
}
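To make the meta layout concrete, here is what the meta vector would contain for a hypothetical table with two nullable columns (id INT, name VARCHAR); each entry is appended by exactly one call in getMetaNativeAddress() / updateMeta() above:
// meta[0] = numRows
// meta[1] = nulls address of id      (nullsNativeAddress)
// meta[2] = data address of id       (valuesNativeAddress)
// meta[3] = nulls address of name    (nullsNativeAddress)
// meta[4] = offsets address of name  (arrayOffsetNativeAddress)
// meta[5] = bytes address of name    (arrayDataNativeAddress)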
From Paimon Reader to BE Scan
With the designs for type compatibility and off-heap memory access covered, we can now use PaimonSplitScanner in the Paimon Reader to see how the JNI Connector works in concert with the BE.
The JNI Connector requires that every ConnectorScanner implementation take two fixed constructor parameters: the number of rows to read per fetch, and table-type-specific parameters (column information, predicate conditions, etc.):
public PaimonSplitScanner(int fetchSize, Map<String, String> params) {
this.fetchSize = fetchSize;
this.requiredFields = params.get("required_fields").split(",");
this.nestedFields = params.getOrDefault("nested_fields", "").split(",");
this.splitInfo = params.get("split_info");
this.predicateInfo = params.get("predicate_info");
this.encodedTable = params.get("native_table");
this.classLoader = this.getClass().getClassLoader();
}
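As a hypothetical illustration of what the BE hands over (the split / predicate / table entries are encoded serialized objects that PaimonScannerUtils.decodeStringToObject() later restores; the concrete values and the fetch size of 4096 are made up):
Map<String, String> params = new HashMap<>();
params.put("required_fields", "id,name");
params.put("nested_fields", "");
params.put("split_info", "<encoded Paimon Split>");
params.put("predicate_info", "<encoded List<Predicate>>");
params.put("native_table", "<encoded Paimon Table>");
PaimonSplitScanner scanner = new PaimonSplitScanner(4096, params);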
As mentioned earlier, an integrating party needs to implement the three methods open() / getNext() / close(). Let us look at the implementation of PaimonSplitScanner#open().
@Override
public void open() throws IOException {
try (ThreadContextClassLoader ignored = new ThreadContextClassLoader(classLoader)) {
table = PaimonScannerUtils.decodeStringToObject(encodedTable);
parseRequiredTypes();
initOffHeapTableWriter(requiredTypes, requiredFields, fetchSize);
initReader();
} catch (Exception e) {
close();
String msg = "Failed to open the paimon reader.";
LOG.error(msg, e);
throw new IOException(msg, e);
}
}
private void initReader() throws IOException {
ReadBuilder readBuilder = table.newReadBuilder();
RowType rowType = table.rowType();
List<String> fieldNames = PaimonScannerUtils.fieldNames(rowType);
int[] projected = Arrays.stream(requiredFields).mapToInt(fieldNames::indexOf).toArray();
readBuilder.withProjection(projected);
List<Predicate> predicates = PaimonScannerUtils.decodeStringToObject(predicateInfo);
readBuilder.withFilter(predicates);
Split split = PaimonScannerUtils.decodeStringToObject(splitInfo);
RecordReader<InternalRow> reader = readBuilder.newRead().executeFilter().createReader(split);
iterator = new RecordReaderIterator<>(reader);
}
As shown, the open() method deserializes the encoded table object, obtains the column and type information, creates the OffHeapTable instance, and calls into the Paimon SDK to build a RecordReader carrying column pruning, predicate pushdown and other information, finally producing the iterator that actually reads the Paimon data. The getNext() method then reads rows through this iterator, converts them into the PaimonColumnValue instances defined earlier, and calls the base class method to write them into the individual OffHeapColumnVectors. Everything falls into place.
// Ultimately invoked by ConnectorScanner#getNextOffHeapChunk()
@Override
public int getNext() throws IOException {
try (ThreadContextClassLoader ignored = new ThreadContextClassLoader(classLoader)) {
int numRows = 0;
while (iterator.hasNext() && numRows < fetchSize) {
InternalRow row = iterator.next();
if (row == null) {
break;
}
for (int i = 0; i < requiredFields.length; i++) {
Object fieldData = InternalRowUtils.get(row, i, logicalTypes[i]);
if (fieldData == null) {
appendData(i, null);
} else {
ColumnValue fieldValue = new PaimonColumnValue(fieldData, logicalTypes[i]);
appendData(i, fieldValue);
}
}
numRows++;
}
return numRows;
} catch (Exception e) {
close();
String msg = "Failed to get the next off-heap table chunk of paimon.";
LOG.error(msg, e);
throw new IOException(msg, e);
}
}
As the last step, let us see how the BE makes use of everything described above. The corresponding C++ class on the BE side is JniScanner. In the JniScanner::_init_jni_table_scanner() method we can see the logic that locates the ConnectorScanner implementation class and prepares its parameters through a ScannerFactory factory instance (Java code omitted):
Status JniScanner::_init_jni_table_scanner(JNIEnv* _jni_env, RuntimeState* runtime_state) {
jclass scanner_factory_class = _jni_env->FindClass(_jni_scanner_factory_class.c_str());
jmethodID scanner_factory_constructor = _jni_env->GetMethodID(scanner_factory_class, "<init>", "()V");
jobject scanner_factory_obj = _jni_env->NewObject(scanner_factory_class, scanner_factory_constructor);
jmethodID get_scanner_method =
_jni_env->GetMethodID(scanner_factory_class, "getScannerClass", "()Ljava/lang/Class;");
_jni_scanner_cls = (jclass)_jni_env->CallObjectMethod(scanner_factory_obj, get_scanner_method);
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to init the scanner class."));
_jni_env->DeleteLocalRef(scanner_factory_class);
_jni_env->DeleteLocalRef(scanner_factory_obj);
jmethodID scanner_constructor = _jni_env->GetMethodID(_jni_scanner_cls, "<init>", "(ILjava/util/Map;)V");
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to get a scanner class constructor."));
jclass hashmap_class = _jni_env->FindClass("java/util/HashMap");
jmethodID hashmap_constructor = _jni_env->GetMethodID(hashmap_class, "<init>", "(I)V");
jobject hashmap_object = _jni_env->NewObject(hashmap_class, hashmap_constructor, _jni_scanner_params.size());
jmethodID hashmap_put =
_jni_env->GetMethodID(hashmap_class, "put", "(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;");
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to get the HashMap methods."));
    // ... building the parameter HashMap and constructing the scanner instance is omitted here ...
}
as well as the logic that resolves the JNI method IDs for the open() / getNext() / close() family of ConnectorScanner methods:
Status JniScanner::_init_jni_method(JNIEnv* _jni_env) {
// init jmethod
_jni_scanner_open = _jni_env->GetMethodID(_jni_scanner_cls, "open", "()V");
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to get `open` jni method"));
_jni_scanner_get_next_chunk = _jni_env->GetMethodID(_jni_scanner_cls, "getNextOffHeapChunk", "()J");
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to get `getNextOffHeapChunk` jni method"));
_jni_scanner_close = _jni_env->GetMethodID(_jni_scanner_cls, "close", "()V");
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to get `close` jni method"));
_jni_scanner_release_column = _jni_env->GetMethodID(_jni_scanner_cls, "releaseOffHeapColumnVector", "(I)V");
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to get `releaseOffHeapColumnVector` jni method"));
_jni_scanner_release_table = _jni_env->GetMethodID(_jni_scanner_cls, "releaseOffHeapTable", "()V");
RETURN_IF_ERROR(_check_jni_exception(_jni_env, "Failed to get `releaseOffHeapTable` jni method"));
return Status::OK();
}
After the BE has called open() and getNextOffHeapChunk() via JNI to fetch data into native memory, it goes on to call the JniScanner::_fill_column() method to obtain each column's row data and metadata (i.e., the memory addresses), then invokes a type-specific append routine per column. Note that since an OffHeapColumnVector always carries the null-flag field, every column reconstructed from this memory is a NullableColumn, and C++'s memcpy() is used to copy the memory regions pointed to by the metadata into the actual NullableColumn.
Status JniScanner::_fill_column(FillColumnArgs* pargs) {
FillColumnArgs& args = *pargs;
if (args.must_nullable && !args.column->is_nullable()) {
return Status::DataQualityError(fmt::format("NOT NULL column[{}] is not supported.", args.slot_name));
}
void* ptr = next_chunk_meta_as_ptr();
if (ptr == nullptr) {
// struct field mismatch.
args.column->append_default(args.num_rows);
return Status::OK();
}
if (args.column->is_nullable()) {
// if column is nullable, we parse `null_column`,
// and update `args.nulls` and set `data_column` to `args.column`
bool* null_column_ptr = static_cast<bool*>(ptr);
auto* nullable_column = down_cast<NullableColumn*>(args.column);
NullData& null_data = nullable_column->null_column_data();
null_data.resize(args.num_rows);
memcpy(null_data.data(), null_column_ptr, args.num_rows);
nullable_column->update_has_null();
auto* data_column = nullable_column->data_column().get();
pargs->column = data_column;
pargs->nulls = null_data.data();
} else {
// otherwise we skip this chunk meta, because in Java side
// we assume every column starts with `null_column`.
}
LogicalType column_type = args.slot_type.type;
if (column_type == LogicalType::TYPE_BOOLEAN) {
RETURN_IF_ERROR((_append_primitive_data<TYPE_BOOLEAN>(args)));
} else if (column_type == LogicalType::TYPE_TINYINT) {
RETURN_IF_ERROR((_append_primitive_data<TYPE_TINYINT>(args)));
} else if (column_type == LogicalType::TYPE_SMALLINT) {
RETURN_IF_ERROR((_append_primitive_data<TYPE_SMALLINT>(args)));
}
// ... remaining type branches omitted ...
else {
return Status::InternalError(fmt::format("Type {} is not supported for off-heap table scanner", column_type));
}
return Status::OK();
}
With that, the data read by the Paimon Reader has been handed over to the BE Scan pipeline, and subsequent computation can proceed.
The End
Time to relax; I am off to watch the Liverpool vs. Manchester United match.
Good night, everyone.