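Overview

I have recently been reading the Netty source code and took a closer look at how its queues are implemented. Netty provides a different thread (event loop) implementation for each IO model:

BIO: ThreadPerChannelEventLoop
One thread per Channel; the queue used is LinkedBlockingQueue
NIO: NioEventLoop (level-triggered)
One Selector per thread, on which multiple Channels can be registered; the queue used is MpscChunkedArrayQueue or MpscLinkedAtomicQueue
Epoll: EpollEventLoop (edge-triggered)
Same as NIO above
So why are different Queue implementations used? Let's look at how each of them is implemented.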

LinkedBlockingQueue

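LinkedBlockingQueue is provided by the JDK. It stores its data in a linked list and uses ReentrantLock and Condition to handle contention and to support blocking.

A linked list needs a node class; in LinkedBlockingQueue that class is: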

static class Node<E> {
    E item;
    Node<E> next;

    Node(E x) { item = x;}
}

The implementation is simple: a singly linked list in which next points to the following node; if next is null, the node is the tail.

The member fields of LinkedBlockingQueue are:

// Capacity: unlike ArrayList, the queue is bounded
private final int capacity;
// Current number of elements
private final AtomicInteger count = new AtomicInteger(0);
// Head node
private transient Node<E> head;
// Tail node
private transient Node<E> last;
// Dequeue lock: must be held while taking data from the queue
private final ReentrantLock takeLock = new ReentrantLock();
// Not-empty condition: consumer threads wait on it while the queue is empty
private final Condition notEmpty = takeLock.newCondition();
// Enqueue lock: must be held while adding data to the queue
private final ReentrantLock putLock = new ReentrantLock();
// Not-full condition: producer threads wait on it while the queue is full
private final Condition notFull = putLock.newCondition();

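From these fields we can roughly see that:

A capacity can be set, but there is no distinction between an initial capacity and a maximum capacity;
It is a first-in, first-out queue; both enqueue and dequeue must acquire a lock, so it is thread-safe;
Enqueue and dequeue use two separate locks;

Take the offer method as an example of enqueuing (Netty uses the queue through the Queue interface rather than BlockingQueue, so only the non-blocking methods are analyzed here):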

public boolean offer(E e) {
    if (e == null) throw new NullPointerException();// the argument must not be null
    final AtomicInteger count = this.count;// number of elements in the queue
    if (count.get() == capacity)// queue is full, nothing can be added, return false
        return false;
    int c = -1;
    Node<E> node = new Node<E>(e);// wrap the element in a node
    final ReentrantLock putLock = this.putLock;
    putLock.lock();// acquire the lock; all enqueue operations share this one lock
    try {
        if (count.get() < capacity) {// only add if the queue is still not full
            enqueue(node);// enqueue
            c = count.getAndIncrement();
            if (c + 1 < capacity)// still not full after adding: notFull holds, wake up a waiting producer
                notFull.signal();
        }
    } finally {
        putLock.unlock();// release the lock
    }
    if (c == 0)
        signalNotEmpty();// the queue was empty before this call, so signal the notEmpty condition
    return c >= 0;
}
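
To make the non-blocking behaviour of offer concrete, here is a small sketch that uses only the JDK API. The class name OfferDemo, the helper fill and the capacity of 2 are arbitrary choices made for illustration.

```java
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

class OfferDemo {
    // Tries to add `attempts` elements and returns how many offers were accepted.
    static int fill(Queue<Integer> q, int attempts) {
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            if (q.offer(i)) { // never blocks; returns false once the queue is full
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        Queue<Integer> q = new LinkedBlockingQueue<>(2); // bounded to 2 elements
        System.out.println("accepted: " + fill(q, 5));
        System.out.println("head: " + q.poll());
    }
}
```

Once the capacity is reached, the remaining offers simply return false, which is why a caller that must not lose elements has to retry (as the benchmark further down does).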

ArrayBlockingQueue

As the name suggests, ArrayBlockingQueue stores its data in an array. Its member fields are:

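// Array that stores the data
final Object[] items;
// ArrayBlockingQueue maintains two indices, one for dequeue and one for enqueue
int takeIndex;
int putIndex;
// Current number of elements in the queue
int count;
// Reentrant lock
final ReentrantLock lock;
// Not-empty condition: consumer threads wait on it while the queue is empty
private final Condition notEmpty;
// Not-full condition: producer threads wait on it while the queue is full
private final Condition notFull;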

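From the above we can see that:

Enqueue and dequeue share a single lock, so producers and consumers compete with each other for it;
Indices record the current dequeue and enqueue positions, which avoids shifting array elements;
Because of the shared lock, under high concurrency lock contention should make it slower than the linked-list implementation;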
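
The two points above can be sketched as a small ring buffer guarded by one lock. This is a simplified illustration, not the JDK source: the class name RingBufferSketch is hypothetical, and the conditions used for blocking are left out.

```java
import java.util.concurrent.locks.ReentrantLock;

class RingBufferSketch<E> {
    private final Object[] items;
    private int takeIndex; // next slot to read
    private int putIndex;  // next slot to write
    private int count;     // number of stored elements
    private final ReentrantLock lock = new ReentrantLock(); // one lock for both ends

    RingBufferSketch(int capacity) {
        items = new Object[capacity];
    }

    boolean offer(E e) {
        lock.lock();
        try {
            if (count == items.length) return false;      // full
            items[putIndex] = e;
            if (++putIndex == items.length) putIndex = 0; // wrap around, nothing is shifted
            count++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    E poll() {
        lock.lock(); // same lock as offer, so producers and consumers contend
        try {
            if (count == 0) return null;                    // empty
            E e = (E) items[takeIndex];
            items[takeIndex] = null;
            if (++takeIndex == items.length) takeIndex = 0; // wrap around
            count--;
            return e;
        } finally {
            lock.unlock();
        }
    }
}
```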

MpscChunkedArrayQueue

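MpscChunkedArrayQueue is also array-based. As its name indicates, it supports multiple producers and a single consumer (Multi Producer Single Consumer). That use case differs from the one the two queues above target, but it happens to match exactly how Netty uses its queue. It is optimized for this specific scenario in several ways:

CacheLine Padding
In LinkedBlockingQueue head and last are declared next to each other, and so are takeIndex and putIndex in ArrayBlockingQueue. The CPU loads data into its cache one cache line at a time, so it can happen that last was never modified, yet a dequeue that modified head invalidates the whole cache line and forces it to be reloaded (false sharing);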

// Fields from several classes in the hierarchy are merged here for easier reading
long p01, p02, p03, p04, p05, p06, p07;
long p10, p11, p12, p13, p14, p15, p16, p17;
protected long producerIndex;
long p01, p02, p03, p04, p05, p06, p07;
long p10, p11, p12, p13, p14, p15, p16, p17;
protected long maxQueueCapacity;
protected long producerMask;
protected E[] producerBuffer;
protected volatile long producerLimit;
protected boolean isFixedChunkSize = false;
long p0, p1, p2, p3, p4, p5, p6, p7;
long p10, p11, p12, p13, p14, p15, p16, p17;
protected long consumerMask;
protected E[] consumerBuffer;
protected long consumerIndex;

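In the listing above, producerIndex is followed by 15 long padding fields (15 * 8 = 120 bytes) and the consumer fields are preceded by another 16 (128 bytes), which keeps the producer index and the consumer index on different cache lines; the cache line of most CPUs is 64 bytes. On Linux the cache line size can be checked as follows: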

cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
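
The padding idea can be sketched with a small class hierarchy. All class and field names below are hypothetical. The sketch assumes that the JVM lays out superclass fields before subclass fields, which is typical for HotSpot but is not guaranteed by the language; putting each group of fields into its own class is a way to stop the JVM from reordering them next to each other.

```java
// Eight longs (64 bytes) in front of the producer index.
class PadHead { long p00, p01, p02, p03, p04, p05, p06, p07; }

class ProducerIndexField extends PadHead { volatile long producerIndex; }

// Eight longs between the two hot fields, so they should not share a 64-byte line.
class PadMiddle extends ProducerIndexField { long p10, p11, p12, p13, p14, p15, p16, p17; }

class ConsumerIndexField extends PadMiddle { volatile long consumerIndex; }

class PaddedIndices extends ConsumerIndexField {
    // Eight more longs behind the consumer index.
    long p20, p21, p22, p23, p24, p25, p26, p27;

    // Size in bytes of one padding group.
    static final int PAD_BYTES = 8 * Long.BYTES;

    // Number of elements between the two indices.
    long size() {
        return producerIndex - consumerIndex;
    }

    public static void main(String[] args) {
        PaddedIndices idx = new PaddedIndices();
        idx.producerIndex = 5;
        idx.consumerIndex = 3;
        System.out.println(idx.size() + " elements, padding group of " + PAD_BYTES + " bytes");
    }
}
```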

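Fewer locks, CAS plus spinning:
Locks cause thread context switches, which cost resources, so MpscChunkedArrayQueue uses no locks at all and spins instead. This is similar to Disruptor's BusySpinWaitStrategy: when the queue is kept busy, spinning is very efficient. It also drives CPU usage up, so it is advisable to pin these threads to specific CPUs;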
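
As a rough illustration of CAS plus spinning, the sketch below lets several producers claim slot indices without a lock. It is not the JCTools code: the class name CasSpinSketch and the methods claim and release are made up for this example, and the actual element storage is omitted.

```java
import java.util.concurrent.atomic.AtomicLong;

class CasSpinSketch {
    private final AtomicLong producerIndex = new AtomicLong();
    private final AtomicLong consumerIndex = new AtomicLong();
    private final long capacity;

    CasSpinSketch(long capacity) {
        this.capacity = capacity;
    }

    // Called by any producer thread: claims the next free index, or returns -1 when full.
    long claim() {
        while (true) {
            long p = producerIndex.get();
            if (p - consumerIndex.get() >= capacity) {
                return -1; // full: give up instead of blocking
            }
            if (producerIndex.compareAndSet(p, p + 1)) {
                return p; // this thread now owns slot p
            }
            // another producer won the race; spin and retry with a fresh value
        }
    }

    // Called by the single consumer thread after it has emptied one slot.
    void release() {
        consumerIndex.lazySet(consumerIndex.get() + 1);
    }

    public static void main(String[] args) {
        CasSpinSketch s = new CasSpinSketch(2);
        System.out.println(s.claim() + " " + s.claim() + " " + s.claim());
    }
}
```

A producer that loses the compareAndSet never sleeps; it just reads the index again, which is the trade of CPU time for latency described above.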
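Growth is supported:
MpscChunkedArrayQueue uses arrays as its internal storage, so how does it grow? The first idea that comes to mind is to create a new array and move the old data into it, but MpscChunkedArrayQueue takes a different approach that avoids copying arrays;
An example:
Suppose the queue is initialized with a size of 4; the initial buffer array then has 4+1 slots. Why +1? Because the last slot stores the pointer to the next buffer. Suppose 8 elements have been stored in the queue; the arrays then look as follows (the JUMP at index 7 leads to a further buffer that is not shown):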

buffer

Array index	0	1	2	3	4
Content	e0	e1	e2	JUMP	next[5]

Array index	5	6	7	8	9
Content	e4	e5	JUMP	e3	next

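As shown, every buffer array has the same fixed size (earlier versions supported both fixed and non-fixed sizes), namely the size given by initialCapacity. The last slot of each array holds a pointer to the next array. When reading, a JUMP means that reading must continue in the next buffer array;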

public E poll() {// consume one element from the queue
    final E[] buffer = consumerBuffer;
    final long index = consumerIndex;
    final long mask = consumerMask;
    // Elements are read through Unsafe.getObjectVolatile(buffer, offset),
    // so the array index has to be converted into a memory offset first
    final long offset = modifiedCalcElementOffset(index, mask);
    Object e = lvElement(buffer, offset);
    if (e == null) {
        // e == null does not necessarily mean the queue is empty: on enqueue producerIndex
        // is updated first and the array slot afterwards, so producerIndex must be checked
        if (index != lvProducerIndex()) {
            // spin until the element becomes visible
            do {
                e = lvElement(buffer, offset);
            } while (e == null);
        }
        else {
            return null;
        }
    }
    if (e == JUMP) {// continue in the next buffer
        final E[] nextBuffer = getNextBuffer(buffer, mask);
        return newBufferPoll(nextBuffer, index);
    }
    // after the element has been taken, clear its slot in the array
    soElement(buffer, offset, null);
    // indices advance in steps of 2; the low bit is used as a resize flag on the producer side
    soConsumerIndex(index + 2);
    return (E) e;
}
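
The chunk linking can be reproduced with a much simplified, single-threaded model. This is a sketch under stated assumptions and not the JCTools implementation: the class ChunkedSketch is hypothetical, it uses plain fields instead of volatile access and CAS, it has no maximum capacity, and its indices advance by 1 rather than 2. With a chunk size of 4 and no polling in between, it produces the same layout as the table above.

```java
class ChunkedSketch<E> {
    private static final Object JUMP = new Object(); // marker: continue in the next chunk

    private final int chunkSize; // data slots per chunk, assumed to be a power of two
    private final int mask;
    private Object[] producerBuffer;
    private Object[] consumerBuffer;
    private long producerIndex;
    private long consumerIndex;
    private long producerLimit; // the producer may write freely below this index
    private int chunks = 1;

    ChunkedSketch(int chunkSize) {
        this.chunkSize = chunkSize;
        this.mask = chunkSize - 1;
        this.producerBuffer = this.consumerBuffer = new Object[chunkSize + 1]; // +1: link slot
        this.producerLimit = mask; // always keep one slot free for a JUMP
    }

    void offer(E e) {
        if (e == null) throw new NullPointerException();
        int offset = (int) (producerIndex & mask);
        if (producerIndex >= producerLimit) {
            long newLimit = consumerIndex + mask;
            if (newLimit > producerIndex) {
                producerLimit = newLimit; // the consumer freed slots: reuse the current chunk
            } else {
                Object[] next = new Object[chunkSize + 1];
                next[offset] = e;                 // the element goes into the new chunk
                producerBuffer[chunkSize] = next; // the last slot links to the new chunk
                producerBuffer[offset] = JUMP;    // tells the consumer to follow the link
                producerBuffer = next;
                producerLimit = producerIndex + mask;
                producerIndex++;
                chunks++;
                return;
            }
        }
        producerBuffer[offset] = e;
        producerIndex++;
    }

    @SuppressWarnings("unchecked")
    E poll() {
        int offset = (int) (consumerIndex & mask);
        Object e = consumerBuffer[offset];
        if (e == null) return null; // single-threaded model: null means empty
        if (e == JUMP) {
            Object[] next = (Object[]) consumerBuffer[chunkSize];
            consumerBuffer[chunkSize] = null; // unlink the exhausted chunk
            consumerBuffer = next;
            e = next[offset]; // same offset, but in the next chunk
        }
        consumerBuffer[offset] = null;
        consumerIndex++;
        return (E) e;
    }

    int chunks() {
        return chunks;
    }
}
```

Offering 8 elements to a ChunkedSketch with chunk size 4 allocates three chunks and never copies an element; if the consumer keeps up, the first chunk is simply reused as a ring.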

Performance comparison

I found a piece of benchmark code online and modified it slightly:

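import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Assumption: the Mpsc* queues come from JCTools; Netty bundles a shaded copy
// under io.netty.util.internal.shaded.org.jctools
import org.jctools.queues.MpscArrayQueue;
import org.jctools.queues.MpscChunkedArrayQueue;
import org.jctools.queues.atomic.MpscLinkedAtomicQueue;

public class TestQueue {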
    private static int PRD_THREAD_NUM;
    private static int C_THREAD_NUM=1;

    private static int N = 1<<20;
    private static ExecutorService executor;

    public static void main(String[] args) throws Exception {
        System.out.println("Producer\tConsumer\tcapacity \t LinkedBlockingQueue \t ArrayBlockingQueue \t MpscLinkedAtomicQueue \t MpscChunkedArrayQueue \t MpscArrayQueue");

        for (int j = 1; j < 8; j++) {
            PRD_THREAD_NUM = (int) Math.pow(2, j);
            executor = Executors.newFixedThreadPool(PRD_THREAD_NUM * 2);

            for (int i = 9; i < 12; i++) {
                int length = 1<< i;
                System.out.print(PRD_THREAD_NUM + "\t\t");
                System.out.print(C_THREAD_NUM + "\t\t");
                System.out.print(length + "\t\t");
                System.out.print(doTest2(new LinkedBlockingQueue<Integer>(length), N) + "/s\t\t");
                System.out.print(doTest2(new ArrayBlockingQueue<Integer>(length), N) + "/s\t\t");
                System.out.print(doTest2(new MpscLinkedAtomicQueue<Integer>(), N) + "/s\t\t");
                System.out.print(doTest2(new MpscChunkedArrayQueue<Integer>(length), N) + "/s\t\t");
                System.out.print(doTest2(new MpscArrayQueue<Integer>(length), N) + "/s");
                System.out.println();
            }

            executor.shutdown();
        }
    }

    private static class Producer implements Runnable {
        int n;
        Queue<Integer> q;

        public Producer(int initN, Queue<Integer> initQ) {
            n = initN;
            q = initQ;
        }

        public void run() {
            while (n > 0) {
                if (q.offer(n)) {
                    n--;
                }
            }
        }
    }

    private static class Consumer implements Callable<Long> {
        int n;
        Queue<Integer> q;

        public Consumer(int initN, Queue<Integer> initQ) {
            n = initN;
            q = initQ;
        }

        public Long call() {
            long sum = 0;
            Integer e = null;
            while (n > 0) {
                if ((e = q.poll()) != null) {
                    sum += e;
                    n--;
                }

            }
            return sum;
        }
    }

    private static long doTest2(final Queue<Integer> q, final int n)
            throws Exception {
        CompletionService<Long> completionServ = new ExecutorCompletionService<>(executor);

        long t = System.nanoTime();
        for (int i = 0; i < PRD_THREAD_NUM; i++) {
            executor.submit(new Producer(n / PRD_THREAD_NUM, q));
        }
        for (int i = 0; i < C_THREAD_NUM; i++) {
            completionServ.submit(new Consumer(n / C_THREAD_NUM, q));
        }

        for (int i = 0; i < C_THREAD_NUM; i++) {
            completionServ.take().get();
        }

        t = System.nanoTime() - t;
        return (long) (1000000000.0 * N / t); // Throughput, items/sec
    }
}

[Figure: chart.png, results of the benchmark above]

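From the results we can see that:

The Mpsc*Queue implementations perform best, and their performance is also the most stable;
At low concurrency, the array-based queues outperform the linked-list-based ones. A likely reason is that an array is allocated contiguously in memory, so loads make good use of cache lines and fewer reads are needed, whereas the nodes of a linked list are scattered in memory and random reads are expensive;
At high concurrency, the linked-list-based queues outperform the array-based ones. LinkedBlockingQueue uses different locks for enqueue and dequeue, so its lock contention should be lower than that of ArrayBlockingQueue. MpscLinkedAtomicQueue has no capacity limit and only needs the atomic exchange (XCHG) provided by AtomicReference to update its links when enqueuing and dequeuing, which is very efficient;
Compared with MpscArrayQueue, MpscChunkedArrayQueue adds the ability to grow dynamically;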
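
The XCHG-based enqueue mentioned for the linked MPSC queue can be sketched as follows. This is a minimal illustration, not the JCTools source: MpscLinkedSketch and drainSum are hypothetical names, and unlike a production queue this poll returns null instead of spinning when a producer has swapped the tail but not yet linked its node.

```java
import java.util.concurrent.atomic.AtomicReference;

class MpscLinkedSketch<E> {
    private static final class Node<E> {
        E value;
        volatile Node<E> next;
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<E>> producerNode; // tail, shared by all producers
    private Node<E> consumerNode;                        // head, touched by the consumer only

    MpscLinkedSketch() {
        Node<E> stub = new Node<>(null);
        producerNode = new AtomicReference<>(stub);
        consumerNode = stub;
    }

    boolean offer(E e) {
        Node<E> node = new Node<>(e);
        Node<E> prev = producerNode.getAndSet(node); // atomic exchange, no lock and no retry loop
        prev.next = node;                            // link the previous tail to the new node
        return true;                                 // unbounded, so offer always succeeds
    }

    E poll() {
        Node<E> next = consumerNode.next;
        if (next == null) return null; // empty, or a producer has not linked its node yet
        E value = next.value;
        next.value = null;             // the consumed node becomes the new stub
        consumerNode = next;
        return value;
    }

    // Starts `producers` threads that each offer 0..perProducer-1; the caller consumes and sums.
    static long drainSum(int producers, int perProducer) {
        MpscLinkedSketch<Integer> q = new MpscLinkedSketch<>();
        for (int p = 0; p < producers; p++) {
            new Thread(() -> {
                for (int i = 0; i < perProducer; i++) q.offer(i);
            }).start();
        }
        long sum = 0;
        int remaining = producers * perProducer;
        while (remaining > 0) { // the calling thread is the single consumer
            Integer v = q.poll();
            if (v != null) {
                sum += v;
                remaining--;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(drainSum(4, 1000));
    }
}
```

Each offer costs one atomic exchange and one plain link write, independent of how many producers compete, which fits the stable throughput observed at high concurrency.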
