I've been digging into high-performance techniques lately, which reminded me of the anti-DDoS project at my previous job. That project was built on DPDK, and one of the components it relied on was the lock-free ring, so I went back through the source code and wrote this post.
Lock-free techniques have been around for quite a while; related topics such as the ABA problem, CAS, and wait-free algorithms are worth studying on your own if you're interested.
This post focuses on how DPDK implements its lock-free ring queue. A typical use case: in a single-process, multi-threaded design, a main thread distributes incoming requests evenly across several worker threads, and this must be efficient, minimizing synchronization, mutual exclusion, and blocking. Mature, efficient solutions to this class of problem exist elsewhere, for example in the muduo library; DPDK's implementation works on much the same principle as the Linux kernel's kfifo.
First, the implementation notes from the source:
73 * The Ring Manager is a fixed-size queue, implemented as a table of
74 * pointers. Head and tail pointers are modified atomically, allowing
75 * concurrent access to it. It has the following features:
76 *
77 * - FIFO (First In First Out)
78 * - Maximum size is fixed; the pointers are stored in a table.
79 * - Lockless implementation.
80 * - Multi- or single-consumer dequeue.
81 * - Multi- or single-producer enqueue.
82 * - Bulk dequeue.
83 * - Bulk enqueue.
The ring's core data structure:
152 struct rte_ring {
159 int flags; /**< Flags supplied at creation. */
160 const struct rte_memzone *memzone;
161 /**< Memzone, if any, containing the rte_ring */
162
163 /** Ring producer status. */
164 struct prod {
165 uint32_t watermark; /**< Maximum items before EDQUOT. */
166 uint32_t sp_enqueue; /**< True, if single producer. */
167 uint32_t size; /**< Size of ring. */
168 uint32_t mask; /**< Mask (size-1) of ring. */
169 volatile uint32_t head; /**< Producer head. */
170 volatile uint32_t tail; /**< Producer tail. */
171 } prod __rte_cache_aligned;
172
173 /** Ring consumer status. */
174 struct cons {
175 uint32_t sc_dequeue; /**< True, if single consumer. */
176 uint32_t size; /**< Size of the ring. */
177 uint32_t mask; /**< Mask (size-1) of ring. */
178 volatile uint32_t head; /**< Consumer head. */
179 volatile uint32_t tail; /**< Consumer tail. */
180 #ifdef RTE_RING_SPLIT_PROD_CONS
181 } cons __rte_cache_aligned;
182 #else
183 } cons;
184 #endif
190 void * ring[0] __rte_cache_aligned; /**< Memory space of ring starts here.
191 * not volatile so need to be careful
192 * about compiler re-ordering */
193 };
Here `prod` and `cons` describe the producer and consumer state respectively. Both are cache-line aligned, which keeps CPU reads fast and isolates producer-thread data from consumer-thread data so that they contend on different cache lines. `flags` records whether the ring is single-producer/single-consumer or some other combination; the multi-producer/multi-consumer variants need CAS for correctness, but for now assume the single/single case. `memzone` records which memory zone the ring was allocated from, for use when it is freed. The `ring` array holds the pointers stored in the ring; declaring it as `void * ring[0]` (a zero-length flexible array) costs no space, keeps the allocation contiguous, and makes freeing the memory simple and fast. The `mask` field in both `prod` and `cons` is a power of two minus one, used to index the array; there is no risk of running off the end, because `index & mask` always lands inside the table. This is much faster than a modulo, and when an index increments past the maximum `uint32_t` value it simply wraps back around to 0.
Note that `prod` and `cons` each have both a `head` and a `tail`; their roles are explained below.
Creating and initializing a ring:
160 /* create the ring */
161 struct rte_ring *
162 rte_ring_create(const char *name, unsigned count, int socket_id,
163 unsigned flags)
164 {
166 struct rte_ring *r;
167 struct rte_tailq_entry *te;
168 const struct rte_memzone *mz;
169 ssize_t ring_size;
170 int mz_flags = 0;
171 struct rte_ring_list* ring_list = NULL;
172 int ret;
173
174 ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
175
176 ring_size = rte_ring_get_memsize(count);
177 if (ring_size < 0) {
178 rte_errno = ring_size;
179 return NULL;
180 }
181 ...more code
189 te = rte_zmalloc("RING_TAILQ_ENTRY", sizeof(*te), 0);
190 if (te == NULL) {
191 RTE_LOG(ERR, RING, "Cannot reserve memory for tailq\n");
192 rte_errno = ENOMEM;
193 return NULL;
194 }
195
196 rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
197
198 /* reserve a memory zone for this ring. If we can't get rte_config or
199 * we are secondary process, the memzone_reserve function will set
200 * rte_errno for us appropriately - hence no check in this this function */
201 mz = rte_memzone_reserve(mz_name, ring_size, socket_id, mz_flags);
202 if (mz != NULL) {
203 r = mz->addr;
204 /* no need to check return value here, we already checked the
205 * arguments above */
206 rte_ring_init(r, name, count, flags);
207
208 te->data = (void *) r;
209 r->memzone = mz;
210
211 TAILQ_INSERT_TAIL(ring_list, te, next);
212 } else {
213 r = NULL;
214 RTE_LOG(ERR, RING, "Cannot reserve memory\n");
215 rte_free(te);
216 }
217 rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
218
219 return r;
220 }
In `rte_ring_create`, `rte_ring_get_memsize` first checks that `count` is a power of two and does not exceed the maximum size given by the macro [#define RTE_RING_SZ_MASK (unsigned)(0x0fffffff)], then rounds the size up to a cache-line boundary:
153 #define RTE_ALIGN_CEIL(val, align) \
154 RTE_ALIGN_FLOOR(((val) + ((typeof(val)) (align) - 1)), align)
135 #define RTE_ALIGN_FLOOR(val, align) \
136 (typeof(val))((val) & (~((typeof(val))((align) - 1))))
This yields the total memory required. A `struct rte_tailq_entry` recording the ring's address is allocated (used later when freeing), a write lock is taken, the ring's members are set up, and the lock is released; this part of the logic is straightforward. It does rely on more involved interfaces such as `rte_memzone_reserve`, whose main job is to carve out a block of memory close to the given socket-id node (via `rte_memzone_reserve_thread_safe`), along with some extra bookkeeping; here only `mz->addr` matters, since the ring lives there. That code may get its own analysis later if time and space permit. Next comes `rte_ring_init`.
In `rte_ring_init`, the `RTE_BUILD_BUG_ON` macro performs compile-time cache-line alignment checks on the structure layout (its mechanics were covered in an earlier post); the rest initializes the ring's members:
121 int
122 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
123 unsigned flags)
124 {
125 int ret;
126
127 /* compilation-time checks */
128 RTE_BUILD_BUG_ON((sizeof(struct rte_ring) &
129 RTE_CACHE_LINE_MASK) != 0);
130 #ifdef RTE_RING_SPLIT_PROD_CONS
131 RTE_BUILD_BUG_ON((offsetof(struct rte_ring, cons) &
132 RTE_CACHE_LINE_MASK) != 0);
133 #endif
134 RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
135 RTE_CACHE_LINE_MASK) != 0);
142
143 /* init the ring structure */
144 memset(r, 0, sizeof(*r));
148 r->flags = flags;
149 r->prod.watermark = count;
150 r->prod.sp_enqueue = !!(flags & RING_F_SP_ENQ);
151 r->cons.sc_dequeue = !!(flags & RING_F_SC_DEQ);
152 r->prod.size = r->cons.size = count;
153 r->prod.mask = r->cons.mask = count-1;
154 r->prod.head = r->cons.head = 0;
155 r->prod.tail = r->cons.tail = 0;
156
157 return 0;
158 }
#define RTE_BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
#define RTE_CACHE_LINE_MASK (RTE_CACHE_LINE_SIZE-1)
#define offsetof(t, m) ((size_t) &((t *)0)->m)
Freeing a ring:
223 void
224 rte_ring_free(struct rte_ring *r)
225 {
226 struct rte_ring_list *ring_list = NULL;
227 struct rte_tailq_entry *te;
228
229 if (r == NULL)
230 return;
231
232 /*
233 * Ring was not created with rte_ring_create,
234 * therefore, there is no memzone to free.
235 */
236 if (r->memzone == NULL) {
238 return;
239 }
240
241 if (rte_memzone_free(r->memzone) != 0) {
243 return;
244 }
245
246 ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
247 rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
248
249 /* find out tailq entry */
250 TAILQ_FOREACH(te, ring_list, next) {
251 if (te->data == (void *) r)
252 break;
253 }
254
255 if (te == NULL) {
256 rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
257 return;
258 }
259
260 TAILQ_REMOVE(ring_list, te, next);
262 rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
264 rte_free(te);
265 }
There is little to analyze in the teardown: the ring first frees its memzone with `rte_memzone_free(r->memzone)`, then unlinks itself from the global tailq list and frees its own tailq entry.
Before looking at enqueue, note that there are two enqueue/dequeue modes, `enum rte_ring_queue_behavior`:
RTE_RING_QUEUE_FIXED: enqueue/dequeue a fixed number of items, or fail
RTE_RING_QUEUE_VARIABLE: enqueue/dequeue as many items as possible
Let's first walk through the single-producer/single-consumer implementation, flagged by RING_F_SP_ENQ and RING_F_SC_DEQ, and see how it achieves lock-free performance. The producer path:
539 static inline int __attribute__((always_inline))
540 __rte_ring_sp_do_enqueue(struct rte_ring *r, void * const *obj_table,
541 unsigned n, enum rte_ring_queue_behavior behavior)
542 {
543 uint32_t prod_head, cons_tail;
544 uint32_t prod_next, free_entries;
545 unsigned i;
546 uint32_t mask = r->prod.mask;
547 int ret;
548
549 prod_head = r->prod.head;
550 cons_tail = r->cons.tail;
551 /* The subtraction is done between two unsigned 32bits value
552 * (the result is always modulo 32 bits even if we have
553 * prod_head > cons_tail). So 'free_entries' is always between 0
554 * and size(ring)-1. */
555 free_entries = mask + cons_tail - prod_head;
557 /* check that we have enough room in ring */
558 if (unlikely(n > free_entries)) {
559 if (behavior == RTE_RING_QUEUE_FIXED) {
560 __RING_STAT_ADD(r, enq_fail, n);
561 return -ENOBUFS;
562 }
563 else {
564 /* No free entry available */
565 if (unlikely(free_entries == 0)) {
566 __RING_STAT_ADD(r, enq_fail, n);
567 return 0;
568 }
569
570 n = free_entries;
571 }
572 }
573
574 prod_next = prod_head + n;
575 r->prod.head = prod_next;
577 /* write entries in ring */
578 ENQUEUE_PTRS();
579 rte_smp_wmb();
580
581 /* if we exceed the watermark */
582 if (unlikely(((mask + 1) - free_entries + n) > r->prod.watermark)) {
583 ret = (behavior == RTE_RING_QUEUE_FIXED) ? -EDQUOT :
584 (int)(n | RTE_RING_QUOT_EXCEED);
585 __RING_STAT_ADD(r, enq_quota, n);
586 }
587 else {
588 ret = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : n;
589 __RING_STAT_ADD(r, enq_success, n);
590 }
591
592 r->prod.tail = prod_next;
593 return ret;
594 }
Lines 555-572 check whether the number of items to enqueue exceeds the available space, handling the shortfall according to `behavior` and `free_entries`; this is only validation. The value `free_entries = mask + cons_tail - prod_head` always falls in the range [0, size-1] thanks to unsigned modular arithmetic.
Initially `r->prod.head` and `r->cons.tail` are both 0. The producer computes `prod_next = prod_head + n` and advances the head, `r->prod.head = prod_next` (A1); then it writes the entries with `ENQUEUE_PTRS()` (A2):
356 #define ENQUEUE_PTRS() do { \
357 const uint32_t size = r->prod.size; \
358 uint32_t idx = prod_head & mask; \
359 if (likely(idx + n < size)) { \
360 for (i = 0; i < (n & ((~(unsigned)0x3))); i+=4, idx+=4) { \
361 r->ring[idx] = obj_table[i]; \
362 r->ring[idx+1] = obj_table[i+1]; \
363 r->ring[idx+2] = obj_table[i+2]; \
364 r->ring[idx+3] = obj_table[i+3]; \
365 } \
366 switch (n & 0x3) { \
367 case 3: r->ring[idx++] = obj_table[i++]; \
368 case 2: r->ring[idx++] = obj_table[i++]; \
369 case 1: r->ring[idx++] = obj_table[i++]; \
370 } \
371 } else { \
372 for (i = 0; idx < size; i++, idx++)\
373 r->ring[idx] = obj_table[i]; \
374 for (idx = 0; i < n; i++, idx++) \
375 r->ring[idx] = obj_table[i]; \
376 } \
377 } while(0)
The `likely` branch covers the case where the items to enqueue do not wrap past the end of the array. The loop strides by four, assigning slots one by one: effectively a four-way loop unrolling, which saves loop-index arithmetic and branch predictions. If the write does wrap, the slots [idx, size) are filled first and the remaining items continue from the start of the array.
After the copy comes `rte_smp_wmb()` (A3). On ARM this expands to `do { asm volatile ("dmb st" : : : "memory"); } while (0)`, which forces all earlier store instructions to complete before any later store executes (on x86 it reduces to a compiler barrier, since x86 does not reorder stores with other stores). Its purpose is exactly what the comment says:
54 /**
55 * Write memory barrier.
56 *
57 * Guarantees that the STORE operations generated before the barrier
58 * occur before the STORE operations generated after.
59 */
Memory barriers deserve a post of their own, so I won't go deep here. The short version: the compiler and the CPU both reorder our code as an optimization, and to preserve correctness `wmb` guarantees that writes before it are not moved after it and writes after it are not moved before it; `rmb` does the same for reads, and `mb` constrains both.
Lines 582-590 check whether the enqueue exceeded the watermark [Quota exceeded: the objects have been enqueued, but the high water mark is exceeded]. I initially wondered when this branch could ever fire, given the `unlikely` hint. The answer, as far as I can tell: `rte_ring_init` sets `watermark = count`, and occupancy can never exceed size-1, so with the default the branch never fires; it only matters after the application lowers the threshold with `rte_ring_set_water_mark`, which is presumably why it is marked `unlikely`.
Finally the producer publishes `r->prod.tail = prod_next` (A4).
Note that (A1) and (A4) sit on opposite sides of the barrier (A3): the head is reserved before the data is written, and the tail is published only after.
Now the consumer path:
722 static inline int __attribute__((always_inline))
723 __rte_ring_sc_do_dequeue(struct rte_ring *r, void **obj_table,
724 unsigned n, enum rte_ring_queue_behavior behavior)
725 {
726 uint32_t cons_head, prod_tail;
727 uint32_t cons_next, entries;
728 unsigned i;
729 uint32_t mask = r->prod.mask;
730
731 cons_head = r->cons.head;
732 prod_tail = r->prod.tail;
733 /* The subtraction is done between two unsigned 32bits value
734 * (the result is always modulo 32 bits even if we have
735 * cons_head > prod_tail). So 'entries' is always between 0
736 * and size(ring)-1. */
737 entries = prod_tail - cons_head;
738
739 if (n > entries) {
740 if (behavior == RTE_RING_QUEUE_FIXED) {
741 __RING_STAT_ADD(r, deq_fail, n);
742 return -ENOENT;
743 }
744 else {
745 if (unlikely(entries == 0)){
746 __RING_STAT_ADD(r, deq_fail, n);
747 return 0;
748 }
749
750 n = entries;
751 }
752 }
754 cons_next = cons_head + n;
755 r->cons.head = cons_next;
756
757 /* copy in table */
758 DEQUEUE_PTRS();
759 rte_smp_rmb();
760
761 __RING_STAT_ADD(r, deq_success, n);
762 r->cons.tail = cons_next;
763 return behavior == RTE_RING_QUEUE_FIXED ? 0 : n;
764 }
731 cons_head = r->cons.head;
732 prod_tail = r->prod.tail;
These two lines snapshot the consumer's head and the producer's published tail, then compute how many entries are available; since the arithmetic is on `uint32_t`, `entries` always lands in [0, size-1].
Lines 739-752 use `behavior` to decide whether n elements can be popped. Then `cons_next` is computed and the head advanced with `r->cons.head = cons_next` (B1), followed by `DEQUEUE_PTRS()` (B2), the mirror image of `ENQUEUE_PTRS`:
382 #define DEQUEUE_PTRS() do { \
383 uint32_t idx = cons_head & mask; \
384 const uint32_t size = r->cons.size; \
385 if (likely(idx + n < size)) { \
386 for (i = 0; i < (n & (~(unsigned)0x3)); i+=4, idx+=4) {\
387 obj_table[i] = r->ring[idx]; \
388 obj_table[i+1] = r->ring[idx+1]; \
389 obj_table[i+2] = r->ring[idx+2]; \
390 obj_table[i+3] = r->ring[idx+3]; \
391 } \
392 switch (n & 0x3) { \
393 case 3: obj_table[i++] = r->ring[idx++]; \
394 case 2: obj_table[i++] = r->ring[idx++]; \
395 case 1: obj_table[i++] = r->ring[idx++]; \
396 } \
397 } else { \
398 for (i = 0; idx < size; i++, idx++) \
399 obj_table[i] = r->ring[idx]; \
400 for (idx = 0; i < n; i++, idx++) \
401 obj_table[i] = r->ring[idx]; \
402 } \
403 } while (0)
Next comes the read memory barrier `rte_smp_rmb()` (on some targets the generic fallback maps to `__sync_synchronize`), and finally the consumer publishes `r->cons.tail = cons_next` (B3). As on the producer side, the copy-out (B2) plus the barrier sit between (B1) and (B3).
Consider a simple push/pop scenario with one producer thread A and one consumer thread B sharing a ring, and ask a few questions. Could B observe that data is available yet fail to read it (because A carelessly updated `prod.tail` before actually pushing the data)? Could A overwrite entries that B has not yet popped, because B updated `cons.tail` too early? And so on.
Because of the memory barriers, none of these can happen. The single-producer/single-consumer implementation above is fairly simple; multiple readers and writers on one ring are more involved.
Now the harder case: multiple threads pushing and popping concurrently. I'll only analyze the enqueue and dequeue source here and skip the smaller, less important details.
Below is the multi-producer implementation; the source describes it as follows:
408 * This function uses a "compare and set" instruction to move the
409 * producer index atomically.
430 static inline int __attribute__((always_inline))
431 __rte_ring_mp_do_enqueue(struct rte_ring *r, void * const *obj_table,
432 unsigned n, enum rte_ring_queue_behavior behavior)
433 {
434 uint32_t prod_head, prod_next;
435 uint32_t cons_tail, free_entries;
436 const unsigned max = n;
437 int success;
438 unsigned i, rep = 0;
439 uint32_t mask = r->prod.mask;
440 int ret;
442 /* Avoid the unnecessary cmpset operation below, which is also
443 * potentially harmful when n equals 0. */
444 if (n == 0)
445 return 0;
446
447 /* move prod.head atomically */
448 do {
449 /* Reset n to the initial burst count */
450 n = max;
451
452 prod_head = r->prod.head;
453 cons_tail = r->cons.tail;
454 /* The subtraction is done between two unsigned 32bits value
455 * (the result is always modulo 32 bits even if we have
456 * prod_head > cons_tail). So 'free_entries' is always between 0
457 * and size(ring)-1. */
458 free_entries = (mask + cons_tail - prod_head);
459
460 /* check that we have enough room in ring */
461 if (unlikely(n > free_entries)) {
462 if (behavior == RTE_RING_QUEUE_FIXED) {
463 __RING_STAT_ADD(r, enq_fail, n);
464 return -ENOBUFS;
465 }
466 else {
467 /* No free entry available */
468 if (unlikely(free_entries == 0)) {
469 __RING_STAT_ADD(r, enq_fail, n);
470 return 0;
471 }
472
473 n = free_entries;
474 }
475 }
477 prod_next = prod_head + n;
478 success = rte_atomic32_cmpset(&r->prod.head, prod_head,
479 prod_next);
480 } while (unlikely(success == 0));
481
482 /* write entries in ring */
483 ENQUEUE_PTRS();
484 rte_smp_wmb();
485
486 /* if we exceed the watermark */
487 if (unlikely(((mask + 1) - free_entries + n) > r->prod.watermark)) {
488 ret = (behavior == RTE_RING_QUEUE_FIXED) ? -EDQUOT :
489 (int)(n | RTE_RING_QUOT_EXCEED);
490 __RING_STAT_ADD(r, enq_quota, n);
491 }
492 else {
493 ret = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : n;
494 __RING_STAT_ADD(r, enq_success, n);
495 }
497 /*
498 * If there are other enqueues in progress that preceded us,
499 * we need to wait for them to complete
500 */
501 while (unlikely(r->prod.tail != prod_head)) {
502 rte_pause();
503
504 /* Set RTE_RING_PAUSE_REP_COUNT to avoid spin too long waiting
505 * for other thread finish. It gives pre-empted thread a chance
506 * to proceed and finish with ring dequeue operation. */
507 if (RTE_RING_PAUSE_REP_COUNT &&
508 ++rep == RTE_RING_PAUSE_REP_COUNT) {
509 rep = 0;
510 sched_yield();
511 }
512 }
513 r->prod.tail = prod_next;
514 return ret;
515 }
Lines 434-446 declare and initialize locals; the early return for n == 0 is an optimization (and avoids a potentially harmful cmpset), as the comment notes.
Lines 448-480 are the CAS loop: each producer keeps trying to advance `prod.head` with a compare-and-set until it succeeds, or returns if there is not enough room; the `free_entries` computation is the same as analyzed above.
Then the enqueue and the write barrier, exactly as in the single-producer path:
483 ENQUEUE_PTRS();
484 rte_smp_wmb();
Lines 487-495 handle the watermark, as analyzed above.
Lines 501-512 are what keeps multiple producers from trampling `prod.tail`: producers must publish the tail in the order in which they reserved their slots, so a producer whose predecessors have not finished spins briefly with `rte_pause` (cheaper than raw busy-waiting) until `r->prod.tail` catches up to its own `prod_head`. To avoid spinning too long when another thread has been preempted mid-enqueue, a counter `rep` is kept; when it reaches `RTE_RING_PAUSE_REP_COUNT` (a build-time option) the thread calls `sched_yield` to give up the CPU and let the other thread finish. This path is marked `unlikely`, so contention here is expected to be rare.
Finally the producer publishes `prod.tail`.
The multi-consumer implementation is not analyzed in detail here: the CAS part mirrors the multi-producer enqueue above, and the copy-out matches the single-consumer path. In outline:
do {
    cons_head = r->cons.head;
    prod_tail = r->prod.tail;
    entries = prod_tail - cons_head;
    /* ... check entries against n ... */
    cons_next = cons_head + n;
    success = rte_atomic32_cmpset(&r->cons.head, cons_head, cons_next);
} while (unlikely(success == 0));
DEQUEUE_PTRS();
rte_smp_rmb();
/* wait for earlier dequeues that preceded us to finish */
while (unlikely(r->cons.tail != cons_head))
    rte_pause();
r->cons.tail = cons_next;
That covers the overall design of the lock-free ring queue. Some of the lower-level details may not be fully explained here, partly because they depend on the hardware and partly due to the limits of my own understanding; either way, I'll keep studying this area.
One more note:
On the cost of CAS, quoting (in translation) a technical article from the DPDK open-source community: "When two cores execute a CAS instruction against the same address at the same time, each is really trying to modify its own copy of the cache line. Suppose both cores hold the cache line for that address, each in state S. To modify it, a core must first move the line from S to E or M, which requires invalidating the line in the other core, so both cores issue an invalidate for this address on the ring bus. The bus then arbitrates, according to its protocol, whether core0 or core1 wins the invalidate; the winner completes the operation, while the loser must accept the outcome, invalidate its own copy of the line, read the winner's updated value, and start over.
From this we can see that the MESIF protocol greatly reduces read latency without making writes slower, while preserving coherence. For our CAS operation, then, the lock has not disappeared; it has merely been shifted into the ring bus's arbitration protocol. Moreover, many cores hammering CAS on the same address will repeatedly invalidate the same cache line in each other, a ping-pong effect that likewise hurts performance. CAS-based operations should not be abused: avoid them unless there is truly no alternative, and in the common case a data-separation design is the better pattern."
References:
https://coolshell.cn/articles/8239.html
https://en.wikipedia.org/wiki/Non-blocking_algorithm
http://in355hz.iteye.com/blog/1797829
https://en.wikipedia.org/wiki/Memory_barrier
https://yq.aliyun.com/articles/95441
http://www.lenky.info/archives/2012/11/2028
https://mp.weixin.qq.com/s?__biz=MzI3NDA4ODY4MA==&mid=2653334228&idx=1&sn=8a106aed154ded89283146ddb6a02cf8&chksm=f0cb5d53c7bcd445704592eb7c06407f1b18f1bf94bed3d33345d2443454b15f9c859b64371c&scene=21#wechat_redirect
http://www.man7.org/linux/man-pages/man2/sched_yield.2.html