We all know that Kafka processes messages very quickly; a single machine can reach a TPS on the order of a million messages. A large part of this comes from the producer merging many small messages and sending them as a single batch. This article walks through the design of the KafkaProducer send flow in detail from the source-code perspective.
The code version used here is 0.10.1.
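For orientation, here is a minimal sketch of a producer whose configuration exercises the batching path discussed below; the broker address, topic name, and the batch.size / linger.ms values are placeholders chosen for illustration:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BatchingProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("batch.size", 16384);       // per-partition batch size in bytes
            props.put("linger.ms", 5);            // wait up to 5 ms for more records to join a batch
            props.put("buffer.memory", 33554432); // total buffering memory, see totalMemorySize below

            // try-with-resources: Producer extends Closeable, so the producer is flushed and closed on exit
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
            }
        }
    }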
- Constructor: public KafkaProducer(Map<String, Object> configs) {...}
- Member fields of KafkaProducer
// If the user does not configure "client.id", "producer-" + PRODUCER_CLIENT_ID_SEQUENCE.getAndIncrement() is used as the clientId
private static final AtomicInteger PRODUCER_CLIENT_ID_SEQUENCE = new AtomicInteger(1);
// Prefix for the MBean names registered in JMX
private static final String JMX_PREFIX = "kafka.producer";
private String clientId;
// The partition method of this class decides which partition of the target topic each record is routed to
// Users can plug in their own implementation via the "partitioner.class" config
private final Partitioner partitioner;
// Limits the size of a single message; configured via "max.request.size"
private final int maxRequestSize;
// Total amount of memory the producer may use to buffer messages; if there is not enough left when the next message asks for space, the call blocks. Configured via "buffer.memory"
// Used concretely by the BufferPool inside RecordAccumulator
private final long totalMemorySize;
// Metadata class used to obtain cluster information
private final Metadata metadata;
// Accumulator for messages; every message sent is first appended to the RecordAccumulator
private final RecordAccumulator accumulator;
// Send loop that takes sendable messages from the accumulator and sends them
// It runs inside ioThread and is started when the producer is instantiated
private final Sender sender;
// Metrics monitoring class
private final Metrics metrics;
// Thread that runs the sender
private final Thread ioThread;
// Compression type used for data on the wire
private final CompressionType compressionType;
// Sensor that records failed sends
private final Sensor errors;
private final Time time;
// Serializes the record key into the byte[] that is transmitted
// Users can supply their own implementation via the "key.serializer" config
private final Serializer<K> keySerializer;
// Serializes the record value into the byte[] that is transmitted
// Likewise user-replaceable via the "value.serializer" config
private final Serializer<V> valueSerializer;
// Holds the user-supplied configuration passed into the KafkaProducer
private final ProducerConfig producerConfig;
// Maximum time send() may block when the buffer is full or metadata is unavailable; configured via "max.block.ms" (as of 0.10.1)
private final long maxBlockTimeMs;
// Maximum time to wait for a send request to complete; configured via "request.timeout.ms" (as of 0.10.1)
private final int requestTimeoutMs;
// List of interceptors that get a chance to process each ProducerRecord before it is sent
private final ProducerInterceptors<K, V> interceptors;
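As a hedged illustration of the partitioner hook listed above ("partitioner.class"), a user-supplied Partitioner could look like the sketch below; the class name ModuloPartitioner and its behaviour are made up for this example and are not part of Kafka:

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    // Hypothetical partitioner: routes records by a murmur2 hash of the serialized key,
    // falling back to partition 0 when the key is null.
    public class ModuloPartitioner implements Partitioner {

        @Override
        public void configure(Map<String, ?> configs) { /* no configuration needed */ }

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null)
                return 0;
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        @Override
        public void close() { /* nothing to release */ }
    }

It would be enabled with props.put("partitioner.class", ModuloPartitioner.class.getName()).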
- The send method: KafkaProducer.send(ProducerRecord<K, V> record, Callback callback)
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
// intercept the record, which can be potentially modified; this method does not throw exceptions
ProducerRecord<K, V> interceptedRecord = this.interceptors == null ? record : this.interceptors.onSend(record);
return doSend(interceptedRecord, callback);
}
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
TopicPartition tp = null;
try {
//Make sure metadata is available for record.topic and return the cluster plus waitedOnMetadataMs (how long this call blocked)
//Metadata is considered available when partitionsCount for this topic in the cluster metadata != null, and either
//1> the user did not specify a partition, or
//2> the user specified a partition and partition < partitionsCount (partitions are 0-based)
ClusterAndWaitTime clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);
//Compute the maximum time the remaining steps below are allowed to block
long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
Cluster cluster = clusterAndWaitTime.cluster;
//Serialize the key and value of the ProducerRecord into byte[] with the configured serializers
byte[] serializedKey;
try {
serializedKey = keySerializer.serialize(record.topic(), record.key());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
" specified in key.serializer");
}
byte[] serializedValue;
try {
serializedValue = valueSerializer.serialize(record.topic(), record.value());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
" specified in value.serializer");
}
//If the user specified a partition (record.partition() != null) it is used directly,
//otherwise the configured Partitioner assigns one
int partition = partition(record, serializedKey, serializedValue, cluster);
//Serialized size of the record plus the log overhead: SIZE_LENGTH (an int, 4 bytes) + OFFSET_LENGTH (a long, 8 bytes)
int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);
//Validate the size of this single record: it must not exceed maxRequestSize or totalMemorySize
ensureValidRecordSize(serializedSize);
//Build the TopicPartition the record will be sent to
tp = new TopicPartition(record.topic(), partition);
long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
// producer callback will make sure to call both 'callback' and interceptor callback
Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);
//Append the record to the accumulator
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
//If the batch is now ready to be sent, wake up the sender thread so it sends the data
//The wake-up conditions are:
//1> the Deque<RecordBatch> for this TopicPartition in RecordAccumulator.batches
//   has size() > 1, or the current RecordBatch.isFull();
//2> a new RecordBatch was just created (a newly created batch always contains data)
if (result.batchIsFull || result.newBatchCreated) {
log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
this.sender.wakeup();
}
return result.future;
} catch (...) {
// handling exceptions and record the errors;
// for API exceptions return them in the future,
// for other exceptions throw directly
}
}
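Seen from the caller's side, the Future<RecordMetadata> returned by this path can be consumed asynchronously through the Callback or synchronously via get(); a small sketch, assuming the producer instance built earlier and a placeholder topic name:

    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    // Asynchronous send: the callback is invoked on the producer I/O thread once the broker responds
    producer.send(new ProducerRecord<>("demo-topic", "k", "v"), new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null)
                exception.printStackTrace();   // delivery failed after all retries
            else
                System.out.printf("acked %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
        }
    });

    // Synchronous send: block on the future until the record is acknowledged
    try {
        RecordMetadata metadata = producer.send(new ProducerRecord<>("demo-topic", "k", "v")).get();
    } catch (InterruptedException | ExecutionException e) {
        // handle the failed send
    }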
- Batching the record: accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs)
/**
* Add a record to the accumulator, return the append result
* <p>
* The append result will contain the future metadata, and flag for whether the appended batch is full or a new batch is created
* <p>
*
* @param tp The topic/partition to which this record is being sent
* @param timestamp The timestamp of the record
* @param key The key for the record
* @param value The value for the record
* @param callback The user-supplied callback to execute when the request is complete
* @param maxTimeToBlock The maximum time in milliseconds to block waiting for buffer memory to become available
*/
public RecordAppendResult append(TopicPartition tp,
long timestamp,
byte[] key,
byte[] value,
Callback callback,
long maxTimeToBlock) throws InterruptedException {
//Track how many append operations are currently in progress
appendsInProgress.incrementAndGet();
try {
// Return the message deque for this TopicPartition from batches if it exists, otherwise create one
Deque<RecordBatch> dq = getOrCreateDeque(tp);
synchronized (dq) {
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
//Try to put the record into dq: take the last RecordBatch in dq; if there is none, return null
//          if there is one, try to append the record to it; if the batch has no room left, return null
//          if there is room, append the record and produce a FutureRecordMetadata,
//==> then wrap callback + FutureRecordMetadata into a Thunk added to thunks, to be invoked when the response comes back
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
if (appendResult != null)
return appendResult;
}
//If the append above did not succeed, allocate new memory, create a MemoryRecords, and append into that
//The allocation size is the maximum of batchSize and the size this record needs
//When free (the BufferPool) allocates memory: 1: if size == poolableSize (i.e. batchSize), take a ByteBuffer from the Deque<ByteBuffer> free list;
//             if that list is empty, allocate a fresh batchSize ByteBuffer, first checking whether availableMemory covers the requested size;
//             if it does, allocate directly, otherwise wait until memory is released
//          2: if size != poolableSize, the allocation strategy is the same as in 1 when the free list is empty
int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
synchronized (dq) {
// Need to check if producer is closed again after grabbing the dequeue lock.
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
if (appendResult != null) {
// Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
free.deallocate(buffer);
return appendResult;
}
MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));
dq.addLast(batch);
incomplete.add(batch);
return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
}
} finally {
appendsInProgress.decrementAndGet();
}
}
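The control flow of append() (an optimistic tryAppend under the deque lock, buffer allocation outside the lock, then a second tryAppend after re-acquiring the lock) can be boiled down to the sketch below; MiniAccumulator and Batch are invented stand-ins for illustration, not Kafka classes:

    import java.nio.ByteBuffer;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Simplified illustration of the double-checked append-or-allocate pattern used by RecordAccumulator.
    final class MiniAccumulator {
        static final class Batch {
            final ByteBuffer buffer;
            Batch(ByteBuffer buffer) { this.buffer = buffer; }
            boolean tryAppend(byte[] payload) {
                if (buffer.remaining() < payload.length)
                    return false;          // batch is full, caller must allocate a new one
                buffer.put(payload);
                return true;
            }
        }

        private final Deque<Batch> deque = new ArrayDeque<>();
        private final int batchSize;

        MiniAccumulator(int batchSize) { this.batchSize = batchSize; }

        void append(byte[] payload) {
            synchronized (deque) {                     // first attempt: reuse the last open batch
                Batch last = deque.peekLast();
                if (last != null && last.tryAppend(payload))
                    return;
            }
            // Allocate outside the lock, like BufferPool.allocate(), so other threads keep appending
            ByteBuffer buffer = ByteBuffer.allocate(Math.max(batchSize, payload.length));
            synchronized (deque) {                     // second attempt: another thread may have added a batch
                Batch last = deque.peekLast();
                if (last != null && last.tryAppend(payload))
                    return;                            // the new buffer is dropped here (Kafka calls free.deallocate)
                Batch batch = new Batch(buffer);
                batch.tryAppend(payload);
                deque.addLast(batch);
            }
        }
    }

Allocating the buffer outside the lock lets other threads keep filling the current batch while one thread waits on BufferPool memory; the second check after re-locking is exactly why free.deallocate(buffer) appears in the real code.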
- The Sender work loop running in ioThread
public void run() {
// Main call flow, executed in a loop
while (running) {
try {
run(time.milliseconds());
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");
// On a non-forced close, keep running while the accumulator or the client still holds unsent messages
while (!forceClose && (this.accumulator.hasUnsent() || this.client.inFlightRequestCount() > 0)) {
try {
run(time.milliseconds());
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
// On a forced close, the data left in the accumulator is simply aborted
if (forceClose) {
// We need to fail all the incomplete batches and wake up the threads waiting on
// the futures.
this.accumulator.abortIncompleteBatches();
}
try {
this.client.close();
} catch (Exception e) {
log.error("Failed to close network client", e);
}
log.debug("Shutdown of Kafka producer I/O thread has completed.");
}
/**
* Notes on this method:
* 1: The guaranteeMessageOrder field decides whether ordered delivery must be guaranteed.
*    To keep messages in order, Kafka mutes a partition while a RecordBatch for it is in flight and unmutes it
*    once the response arrives; only then can the next batch for that partition be sent.
* 2: Resending relies on the attempts + lastAttemptMs fields of RecordBatch; attempts > 0 marks a batch being retried,
*    and it may only be sent again once batch.lastAttemptMs + retryBackoffMs <= nowMs (the retry backoff has elapsed).
* 3: this.client.ready(node, now) requires that
*    the connection is in state ConnectionState.CONNECTED,
*    the channel has been authenticated where authentication is required, and
*    InFlightRequests.canSendMore(node): the request queue for the node is empty,
*    or its first request is complete and queue.size() < this.maxInFlightRequestsPerConnection
*/
void run(long now) {
Cluster cluster = metadata.fetch();
//Look at the data currently held in the accumulator's batches and compute readyNodes + nextReadyCheckDelayMs + unknownLeaderTopics
//readyNodes: nodes that satisfy both of the conditions below
// 1.there is data that can be sent, i.e. any one of:
//   a.a full batch exists: deque.size() > 1, or the first batch in the deque is full
//   b.the batch has waited long enough (its linger time has expired)
//   c.the BufferPool has threads queued up waiting for memory to be released
//   d.a flush is in progress on the accumulator, triggered by the user calling KafkaProducer.flush()
// 2.for a batch being retried, the retry backoff time has elapsed, so it may be resent
//nextReadyCheckDelayMs: for partitions whose data is not yet sendable, the time to wait before it could become sendable, i.e. the delay before the next readiness check
//unknownLeaderTopics: TopicPartitions in batches whose leader cannot be found in the cluster while !deque.isEmpty() (there is data to send)
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// if there are any partitions whose leaders are not known yet, force metadata update
// If any topics in the result have an unknown leader, add them to metadata and request a metadata update
if (!result.unknownLeaderTopics.isEmpty()) {
for (String topic : result.unknownLeaderTopics)
this.metadata.add(topic);
this.metadata.requestUpdate();
}
// Remove from readyNodes any node the client is not yet able to send to
// notReadyTimeout is fed into this.client.poll(pollTimeout, now), i.e. the maximum time that call may block
Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
Node node = iter.next();
if (!this.client.ready(node, now)) {
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
}
}
// create produce requests
Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,
result.readyNodes,
this.maxRequestSize,
now);
if (guaranteeMessageOrder) {
// Mute all the partitions drained
for (List<RecordBatch> batchList : batches.values()) {
for (RecordBatch batch : batchList)
this.accumulator.mutePartition(batch.topicPartition);
}
}
//Abort RecordBatches that have timed out
List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
// update sensors
for (RecordBatch expiredBatch : expiredBatches)
this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);
sensors.updateProduceRequestMetrics(batches);
List<ClientRequest> requests = createProduceRequests(batches, now);
// If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
// loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data
// that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes
// with sendable data that aren't ready to send since they would cause busy looping.
long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
if (result.readyNodes.size() > 0) {
log.trace("Nodes with data ready to send: {}", result.readyNodes);
log.trace("Created {} produce requests: {}", requests.size(), requests);
pollTimeout = 0;
}
for (ClientRequest request : requests)
client.send(request, now);
// if some partitions are already ready to be sent, the select time would be 0;
// otherwise if some partition already has some data accumulated but not ready yet,
// the select time will be the time difference between now and its linger expiry time;
// otherwise the select time will be the time difference between now and the metadata expiry time;
this.client.poll(pollTimeout, now);
}