KafkaProducer Flow Analysis

Kafka is known for processing messages very quickly; a single node can reach a TPS on the order of a million messages. A major reason is that the producer merges many small messages and sends them as a single batch. This article analyzes the design of the KafkaProducer flow in detail from a source-code perspective.

The code version used here is 0.10.1.
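
For orientation, here is a minimal usage sketch of the producer whose internals are analyzed below (the broker address, topic name and message contents are placeholders):

    import java.util.Properties;
    import java.util.concurrent.Future;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class MinimalProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // send() only appends the record to the accumulator; the Sender thread transmits batches asynchronously
            Future<RecordMetadata> future = producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
            RecordMetadata metadata = future.get(); // block until the broker acknowledges
            System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
            producer.close();
        }
    }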

  • Constructor: public KafkaProducer(Map<String, Object> configs) {...}
  • Member fields of KafkaProducer
    // If the user does not configure "client.id", "producer-" + PRODUCER_CLIENT_ID_SEQUENCE.getAndIncrement() is used as the clientId
    private static final AtomicInteger PRODUCER_CLIENT_ID_SEQUENCE = new AtomicInteger(1);

    // Prefix of the MBean name in JMX
    private static final String JMX_PREFIX = "kafka.producer";

    private String clientId;

    //The `partition` method of this class routes each message to a partition of the target topic
    //Users can plug in their own implementation via "partitioner.class" (see the sketch after this field list)
    private final Partitioner partitioner;

    //Limits the size of a single message, configurable via "max.request.size"
    private final int maxRequestSize;

    //Total memory available on the producer side for buffering messages; if there is not enough left when the next message requests space, the producer waits. Configurable via "buffer.memory"
    //Used concretely by the BufferPool inside RecordAccumulator
    private final long totalMemorySize;

    //Metadata class used to obtain cluster information
    private final Metadata metadata;

    //Accumulator for messages; every message sent is appended into the RecordAccumulator
    private final RecordAccumulator accumulator;

    //Message-sending logic: fetches sendable messages from the accumulator and sends them
    //Runs inside ioThread, which is started when the producer is instantiated
    private final Sender sender;

    //Metrics monitoring class
    private final Metrics metrics;

    //Thread that runs the sender
    private final Thread ioThread;

    //Compression type used for data transfer
    private final CompressionType compressionType;

    //Sensor that records message send failures
    private final Sensor errors;

    private final Time time;

    //Serializes the message key into the byte[] that is transferred
    //Users can supply their own serializer, configurable via "key.serializer"
    private final Serializer<K> keySerializer;

    //Serializes the message value into the byte[] that is transferred
    //Likewise user-replaceable, configurable via "value.serializer"
    private final Serializer<V> valueSerializer;

    //User configuration passed in as the input of the KafkaProducer constructor
    private final ProducerConfig producerConfig;

    //Maximum time send() may block when the buffer is full or metadata is unavailable, configurable via "max.block.ms" (0.10.1)
    private final long maxBlockTimeMs;

    //Maximum timeout of a send request, configurable via "request.timeout.ms" (0.10.1)
    private final int requestTimeoutMs;

    //List of interceptors applied to each ProducerRecord before it is sent
    private final ProducerInterceptors<K, V> interceptors; 
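
As noted for the partitioner field above, "partitioner.class" lets the user plug in custom partitioning logic. A minimal sketch under the 0.10.x Partitioner interface follows; the class name OrderKeyPartitioner and its hashing rule are illustrative assumptions, not Kafka code:

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    // Hypothetical partitioner: hash the serialized key across all partitions, fall back to partition 0 when there is no key
    public class OrderKeyPartitioner implements Partitioner {
        @Override
        public void configure(Map<String, ?> configs) { }

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null)
                return 0;
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        @Override
        public void close() { }
    }

It would be enabled with props.put("partitioner.class", "com.example.OrderKeyPartitioner") before constructing the producer.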

  • The send method: KafkaProducer.send(ProducerRecord<K, V> record, Callback callback)
    public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
        // intercept the record, which can be potentially modified; this method does not throw exceptions
        ProducerRecord<K, V> interceptedRecord = this.interceptors == null ? record : this.interceptors.onSend(record);
        return doSend(interceptedRecord, callback);
    }

    private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
        TopicPartition tp = null;
        try {
            //Make sure metadata is available for record.topic, and return the cluster plus waitedOnMetadataMs (how long this call blocked)
            //Availability condition: in the metadata, the cluster's partitionsCount for this topic != null, and either
            //1> the user did not specify a partition, or
            //2> the user specified one, in which case partition < partitionsCount (partitions are numbered from 0)
            ClusterAndWaitTime clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);
            //Compute the maximum time the operations below may still block
            long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
            Cluster cluster = clusterAndWaitTime.cluster;
            //Serialize the key and value of the ProducerRecord into byte[] using the configured serializer classes
            byte[] serializedKey;
            try {
                serializedKey = keySerializer.serialize(record.topic(), record.key());
            } catch (ClassCastException cce) {
                throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
                        " to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
                        " specified in key.serializer");
            }
            byte[] serializedValue;
            try {
                serializedValue = valueSerializer.serialize(record.topic(), record.value());
            } catch (ClassCastException cce) {
                throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
                        " to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
                        " specified in value.serializer");
            }
            //If the user did not specify a partition (record.partition() == null),
            //the configured Partitioner decides which partition the message goes to; otherwise the given partition is used
            int partition = partition(record, serializedKey, serializedValue, cluster);
            //Serialized size of the message plus the log overhead: SIZE_LENGTH (an int, 4 bytes) + OFFSET_LENGTH (a long, 8 bytes)
            int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);
            //Validate the size of a single message: it must not exceed maxRequestSize or totalMemorySize
            ensureValidRecordSize(serializedSize);
            //Build the TopicPartition the message will be sent to
            tp = new TopicPartition(record.topic(), partition);
            long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
            log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
            // producer callback will make sure to call both 'callback' and interceptor callback
            Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);
            //Append the message to the accumulator
            RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
            //If the append result says a batch is sendable, wake up the sender thread to transmit it.
            //Conditions for waking the sender:
            //1> the Deque<RecordBatch> for this TopicPartition in RecordAccumulator.batches has size() > 1,
            //   or the current RecordBatch.isFull();
            //2> the current RecordBatch was newly created (a new batch always contains data)
            if (result.batchIsFull || result.newBatchCreated) {
                log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
                this.sender.wakeup();
            }
            return result.future;
        } catch (...) {
            // handling exceptions and record the errors;
            // for API exceptions return them in the future,
            // for other exceptions throw directly
        }
    }
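
Because send() returns as soon as the record has been appended to the accumulator, the delivery result is observed either through the returned Future or through the Callback, both completed from the Sender's I/O thread. A usage sketch, reusing the producer variable from the first example (topic and contents remain placeholders):

    producer.send(new ProducerRecord<>("demo-topic", "key", "value"), new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            // Invoked once the broker responds, or once the batch fails or expires
            if (exception != null)
                exception.printStackTrace();
            else
                System.out.printf("acked: partition=%d offset=%d%n", metadata.partition(), metadata.offset());
        }
    });
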
  • The batching method: accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs)
    /**
     * Add a record to the accumulator, return the append result
     * <p>
     * The append result will contain the future metadata, and flag for whether the appended batch is full or a new batch is created
     * <p>
     *
     * @param tp The topic/partition to which this record is being sent
     * @param timestamp The timestamp of the record
     * @param key The key for the record
     * @param value The value for the record
     * @param callback The user-supplied callback to execute when the request is complete
     * @param maxTimeToBlock The maximum time in milliseconds to block waiting for buffer memory to become available
     */
    public RecordAppendResult append(TopicPartition tp,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Callback callback,
                                     long maxTimeToBlock) throws InterruptedException {
        //Track the number of appends currently in progress
        appendsInProgress.incrementAndGet();
        try {
            // If batches already has a message deque for this TopicPartition, return it; otherwise create one
            Deque<RecordBatch> dq = getOrCreateDeque(tp);
            synchronized (dq) {
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");
                //Try to put the message into dq: take the last RecordBatch in dq; if there is none, return null.
                //If one exists but has no room left for the message, return null.
                //If it has room, append the message, create a FutureRecordMetadata,
                //wrap callback + FutureRecordMetadata into a Thunk, and add it to thunks so it can be invoked when the response arrives
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                if (appendResult != null)
                    return appendResult;
            }

            //If the append above did not succeed, allocate new memory, create a Records object, and append into it.
            //The allocation size is the maximum of batchSize and the size required by the current message.
            //When free (the BufferPool) allocates memory:
            //  1: if size == poolableSize (i.e. batchSize), take a ByteBuffer from the Deque<ByteBuffer> free list;
            //     if the free list is empty, allocate a new ByteBuffer of batchSize, first checking that availableMemory
            //     is at least the requested size; if so allocate directly, otherwise wait for memory to be released
            //  2: if size != poolableSize, use the same strategy as in 1 when the free list is empty
            int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
            log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
            ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
            synchronized (dq) {
                // Need to check if producer is closed again after grabbing the dequeue lock.
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");

                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                if (appendResult != null) {
                    // Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
                    free.deallocate(buffer);
                    return appendResult;
                }
                MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
                RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
                FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));

                dq.addLast(batch);
                incomplete.add(batch);
                return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
            }
        } finally {
            appendsInProgress.decrementAndGet();
        }
    }
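
The shape of append(), trying to append under the deque lock, allocating a buffer outside the lock, then re-checking under the lock before using it, keeps the potentially blocking BufferPool allocation out of the critical section. A simplified stand-alone sketch of that pattern (hypothetical names, not the real RecordAccumulator code):

    import java.nio.ByteBuffer;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative only: mirrors the "try append, allocate outside the lock, re-check" structure of append()
    final class AppendSketch {
        private final Deque<ByteBuffer> deque = new ArrayDeque<>();

        ByteBuffer append(byte[] payload, int batchSize) {
            synchronized (deque) {
                ByteBuffer last = deque.peekLast();
                if (last != null && last.remaining() >= payload.length) {
                    last.put(payload);             // fast path: the current batch still has room
                    return last;
                }
            }
            // Slow path: allocate outside the lock, since allocation may block waiting for memory
            ByteBuffer fresh = ByteBuffer.allocate(Math.max(batchSize, payload.length));
            synchronized (deque) {
                ByteBuffer last = deque.peekLast();
                if (last != null && last.remaining() >= payload.length) {
                    last.put(payload);             // another thread created a usable batch meanwhile;
                    return last;                   // the fresh buffer would be handed back to the pool here
                }
                fresh.put(payload);
                deque.addLast(fresh);
                return fresh;
            }
        }
    }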

  • The Sender workflow inside ioThread
    public void run() {

        // Main call flow, executed in a loop while running
        while (running) {
            try {
                run(time.milliseconds());
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }

        log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");

        // On a non-forced close, if the accumulator or the client still has unsent messages, keep running to send them
        while (!forceClose && (this.accumulator.hasUnsent() || this.client.inFlightRequestCount() > 0)) {
            try {
                run(time.milliseconds());
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }

        // On a forced close, simply abort the data remaining in the accumulator
        if (forceClose) {
            // We need to fail all the incomplete batches and wake up the threads waiting on
            // the futures.
            this.accumulator.abortIncompleteBatches();
        }
        try {
            this.client.close();
        } catch (Exception e) {
            log.error("Failed to close network client", e);
        }

        log.debug("Shutdown of Kafka producer I/O thread has completed.");
    }

    /**
     * Notes on this method:
     * 1: The guaranteeMessageOrder field controls whether ordered delivery must be guaranteed.
     *    To preserve ordering, Kafka mutes a partition when a batch for it is sent and unmutes it
     *    after the response arrives, so the next batch for that partition can then be sent.
     * 2: For resends, RecordBatch tracks attempts + lastAttemptMs; attempts > 0 marks a batch being retried,
     *    which may only be resent once the backoff has elapsed, i.e. batch.lastAttemptMs + retryBackoffMs <= nowMs.
     * 3: this.client.ready(node, now) requires:
     *    the connection is in state ConnectionState.CONNECTED;
     *    for requests that need authentication, the channel is already authenticated;
     *    InFlightRequests.canSendMore(node): the node's request queue is empty,
     *        or the first request in the queue is complete and queue.size() < this.maxInFlightRequestsPerConnection
     */
    void run(long now) {
        Cluster cluster = metadata.fetch();
        //Collect readiness info from the accumulator's batches: readyNodes + nextReadyCheckDelayMs + unknownLeaderTopics
        //readyNodes: nodes whose batches satisfy both of the following
        //  1.there is data that can be sent, i.e. any one of:
        //      a.a full batch exists: deque.size() > 1 (then at least one batch is full), or the first batch in the deque is full
        //      b.the batch has waited long enough (its linger time has expired)
        //      c.the BufferPool has a queue of threads waiting for memory to be released
        //      d.a flush is in progress on the accumulator, triggered by the user calling KafkaProducer.flush()
        //  2.if it is retry data, the retry backoff time has elapsed, so it can be resent
        //nextReadyCheckDelayMs: for data that is not yet sendable, how long to wait before it becomes sendable, i.e. the delay until the next readiness check
        //unknownLeaderTopics: TopicPartitions in batches whose leader cannot be found in the cluster and whose deque is not empty (there is data to send)
        RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

        // if there are any partitions whose leaders are not known yet, force metadata update
        // If some returned topics have an unknown leader, add them to the metadata and request a metadata update
        if (!result.unknownLeaderTopics.isEmpty()) {
            for (String topic : result.unknownLeaderTopics)
                this.metadata.add(topic);
            this.metadata.requestUpdate();
        }

        // Remove from readyNodes any node that is not ready to send data
        // notReadyTimeout is used for this.client.poll(pollTimeout, now), i.e. the maximum time that call may block
        Iterator<Node> iter = result.readyNodes.iterator();
        long notReadyTimeout = Long.MAX_VALUE;
        while (iter.hasNext()) {
            Node node = iter.next();
            if (!this.client.ready(node, now)) {
                iter.remove();
                notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
            }
        }

        // create produce requests
        Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,
                                                                         result.readyNodes,
                                                                         this.maxRequestSize,
                                                                         now);
        if (guaranteeMessageOrder) {
            // Mute all the partitions drained
            for (List<RecordBatch> batchList : batches.values()) {
                for (RecordBatch batch : batchList)
                    this.accumulator.mutePartition(batch.topicPartition);
            }
        }

        //Abort RecordBatches that have expired (timed out)
        List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
        // update sensors
        for (RecordBatch expiredBatch : expiredBatches)
            this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);

        sensors.updateProduceRequestMetrics(batches);
        List<ClientRequest> requests = createProduceRequests(batches, now);
        // If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
        // loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data
        // that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes
        // with sendable data that aren't ready to send since they would cause busy looping.
        long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
        if (result.readyNodes.size() > 0) {
            log.trace("Nodes with data ready to send: {}", result.readyNodes);
            log.trace("Created {} produce requests: {}", requests.size(), requests);
            pollTimeout = 0;
        }
        for (ClientRequest request : requests)
            client.send(request, now);

        // if some partitions are already ready to be sent, the select time would be 0;
        // otherwise if some partition already has some data accumulated but not ready yet,
        // the select time will be the time difference between now and its linger expiry time;
        // otherwise the select time will be the time difference between now and the metadata expiry time;
        this.client.poll(pollTimeout, now);
    }
