1 Overview
The CFS code is implemented in kernel/sched_fair.c. Its four most important parts are:
Time accounting
Process selection
The scheduler entry point
Sleeping and waking up
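2 Time Accounting
The scheduler must account for how long each process runs, so that it knows how much CPU time the process has consumed. In a traditional timeslice-based scheduler, the running process is preempted once its timeslice has dropped to 0, in favor of another runnable process whose timeslice is nonzero. CFS has no fixed timeslice of this kind, but it still has to keep the same kind of record for every process.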
2.1 The Scheduler Entity Structure
In Linux, CFS uses the scheduler entity structure, struct sched_entity (defined in <linux/sched.h>), to keep track of process accounting. The structure is embedded in the process descriptor, struct task_struct, as a member variable named se. The definition from the header file is shown below:
struct sched_entity {
    struct load_weight load; /* for load-balancing */
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime; /* virtual runtime */
    u64 prev_sum_exec_runtime;
    u64 nr_migrations;
#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq *my_q;
#endif
};
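Because the entity is embedded in the process descriptor rather than pointed to, the scheduler can get from a struct sched_entity back to the task that owns it with pointer arithmetic; this is what task_of() does in the excerpts later in this post. The following is a minimal userspace sketch of that idea. The two structures are cut-down stand-ins with most fields omitted, and the container_of macro here is a simplified version written for this example, not the kernel's own definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-ins for the kernel structures; most fields are omitted. */
struct sched_entity {
    uint64_t exec_start;
    uint64_t sum_exec_runtime;
    uint64_t vruntime;
};

struct task_struct {
    int pid;
    struct sched_entity se;     /* embedded member, not a pointer */
};

/* Step back from a pointer to a member to the structure that contains it. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the task that owns a scheduler entity. */
struct task_struct *task_of(struct sched_entity *se)
{
    assert(se != NULL);
    return container_of(se, struct task_struct, se);
}
```

Given a struct task_struct t, task_of(&t.se) yields &t again, so the red-black tree only needs to link entities and never has to store a separate pointer to the task.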
2.2 The Virtual Runtime
In the Linux kernel, CFS is built on the virtual runtime, which is stored in the vruntime member of the scheduler entity structure. vruntime holds the virtual runtime of a process: its actual runtime (the total time it has spent running) after normalization, that is, after weighting. Normalization here means that the actual runtime is scaled by the processor weight that the process's nice value maps to.
As mentioned in the discussion of scheduling policy, CFS conceptually tracks two quantities when it selects a process, the actual runtime and the ideal runtime, and chooses the next process from the relationship between the two. In practice, Linux's CFS implements this with the virtual runtime: the time accounting code records how long a process has already run, and therefore how much longer it ought to run. This is done by update_curr(), defined in kernel/sched_fair.c. Note that update_curr() is invoked periodically by the system timer, and also whenever a process becomes runnable or blocks, so vruntime is an accurate measure of how long a process has run.
update_curr() calculates the execution time of the current process and stores it in delta_exec. It then passes that value to __update_curr(), which weights it by the load weight of the entity (through calc_delta_fair()) and adds the weighted value to the vruntime of the currently running process. The two functions are implemented as follows.
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr; // the currently running entity
    u64 now = rq_of(cfs_rq)->clock; // the current time
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;
    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;
    __update_curr(cfs_rq, curr, delta_exec);
    curr->exec_start = now;
    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);
        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}

static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
              unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->statistics.exec_max,
                  max((u64)delta_exec, curr->statistics.exec_max));
    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);
    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
}
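3 Process Selection
As noted in the discussion of scheduling policy, CFS conceptually selects a process by comparing its actual runtime with its ideal runtime. The implementation, however, uses only vruntime: the CFS scheduler picks the process with the smallest vruntime to run next.
3.1 Lookup
To find the process with the smallest vruntime quickly, CFS organizes runnable processes in a red-black tree ordered by vruntime. By the ordering property of the tree, the process with the smallest vruntime is always the leftmost node (which is not necessarily a leaf), so finding the next process only requires following left children from the root. In practice not even that walk is needed, because the leftmost node is cached in rb_leftmost. The selection is implemented by pick_next_entity() in kernel/sched_fair.c, which obtains the leftmost entity from __pick_next_entity(). Note that when the tree has no leftmost node (that is, when it is empty), there is no runnable process; the CFS scheduler does not wait in that case, and the idle task is scheduled instead.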
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct sched_entity *se = __pick_next_entity(cfs_rq);
    struct sched_entity *left = se;

    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;
    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;
    clear_buddies(cfs_rq, se);
    return se;
}
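The lookup itself can be sketched in a few lines of userspace C. This is only an illustration under simplifying assumptions: struct node is a plain binary search tree node invented for the example, not the kernel's struct rb_node, and the idle fallback is folded into one helper here, whereas in the kernel it is the idle scheduling class that supplies the idle task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node; the kernel uses struct rb_node instead. */
struct node {
    uint64_t vruntime;
    struct node *left, *right;
};

/* Follow left children from the root; NULL means the tree is empty. */
struct node *leftmost(struct node *root)
{
    if (!root)
        return NULL;
    while (root->left)
        root = root->left;
    return root;
}

/* Return the smallest-vruntime node, or the idle node if none is runnable. */
struct node *pick_leftmost_or_idle(struct node *root, struct node *idle)
{
    struct node *n = leftmost(root);

    assert(idle != NULL);       /* an idle task always exists */
    return n ? n : idle;
}
```

Note that the node returned by leftmost() may still have a right child, which is why it is more precise to call it the leftmost node than the leftmost leaf.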
3.2 Adding Processes to the Tree
When a process is added to the red-black tree of runnable processes, CFS uses a cache: if the newly inserted process becomes the leftmost node, it replaces the cached leftmost node; otherwise the cached leftmost node is left unchanged. This avoids walking the tree to find the leftmost node. The operation is implemented by enqueue_entity() in kernel/sched_fair.c. That function only updates the runtime and some statistics; the actual insertion is done by __enqueue_entity().
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
        se->vruntime += cfs_rq->min_vruntime;
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    account_entity_enqueue(cfs_rq, se);
    if (flags & ENQUEUE_WAKEUP) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }
    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);
}

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se);
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) { // walk down to find the insertion point
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (key < entity_key(cfs_rq, entry)) { // the new key is smaller than this node's key
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }
    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost) // cache the frequently used leftmost node
        cfs_rq->rb_leftmost = &se->run_node;
    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}
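The caching trick in __enqueue_entity() does not depend on the tree being a red-black tree. The sketch below reproduces it on a plain, unbalanced binary search tree so that it can be compiled and run in userspace. struct node, struct timeline and timeline_insert() are names invented for this example, and the rebalancing that rb_insert_color() performs is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node and tree types for this sketch. */
struct node {
    uint64_t key;               /* plays the role of vruntime */
    struct node *left, *right;
};

struct timeline {
    struct node *root;
    struct node *leftmost;      /* cached node with the smallest key */
};

/* Insert n and keep the leftmost cache current; no rebalancing is done. */
void timeline_insert(struct timeline *tl, struct node *n)
{
    struct node **link;
    int is_leftmost = 1;

    assert(tl != NULL && n != NULL);
    link = &tl->root;
    n->left = n->right = NULL;
    while (*link) {
        if (n->key < (*link)->key) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            is_leftmost = 0;    /* went right at least once */
        }
    }
    *link = n;
    if (is_leftmost)            /* only ever went left: new minimum */
        tl->leftmost = n;
}
```

The new node is the minimum exactly when the descent never turned right, so a single flag is enough to keep the cache valid. A node whose key equals the current minimum goes to the right and does not displace the cached node.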
3.3 Removing Processes from the Tree
There is also a function for removing a node: when a process blocks, it takes the node that represents the process out of the red-black tree. This is implemented by dequeue_entity() in kernel/sched_fair.c. As with insertion, that function only does bookkeeping; the actual removal is done by __dequeue_entity().
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    u64 min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    update_stats_dequeue(cfs_rq, se);
    if (flags & DEQUEUE_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
            struct task_struct *tsk = task_of(se);
            if (tsk->state & TASK_INTERRUPTIBLE)
                se->statistics.sleep_start = rq_of(cfs_rq)->clock;
            if (tsk->state & TASK_UNINTERRUPTIBLE)
                se->statistics.block_start = rq_of(cfs_rq)->clock;
        }
#endif
    }
    clear_buddies(cfs_rq, se);
    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    account_entity_dequeue(cfs_rq, se);
    min_vruntime = cfs_rq->min_vruntime;
    update_min_vruntime(cfs_rq);
    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!(flags & DEQUEUE_SLEEP))
        se->vruntime -= min_vruntime;
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;
        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }
    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}
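The interesting case in __dequeue_entity() is the removal of the cached leftmost node, where the cache has to move on to the in-order successor before the node is unlinked. The following userspace sketch shows that case on a plain binary search tree with parent pointers. All names are invented for this example; next_node() plays the role of rb_next(), and the rebalancing done by rb_erase() is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node and tree types for this sketch. */
struct node {
    uint64_t key;
    struct node *parent, *left, *right;
};

struct timeline {
    struct node *root;
    struct node *leftmost;      /* cached node with the smallest key */
};

/* In-order successor of n, or NULL if n is the last node. */
static struct node *next_node(struct node *n)
{
    struct node *p;

    if (n->right) {             /* smallest node of the right subtree */
        n = n->right;
        while (n->left)
            n = n->left;
        return n;
    }
    while ((p = n->parent) && n == p->right)
        n = p;                  /* climb until we arrive from a left child */
    return p;
}

/* Remove the cached leftmost node and advance the cache to its successor. */
struct node *timeline_pop_leftmost(struct timeline *tl)
{
    struct node *n;

    assert(tl != NULL);
    n = tl->leftmost;
    if (!n)
        return NULL;            /* empty tree */
    assert(n->left == NULL);    /* holds for the leftmost node */
    tl->leftmost = next_node(n);    /* compute before unlinking */
    if (n->right)
        n->right->parent = n->parent;
    if (n->parent)
        n->parent->left = n->right; /* leftmost is always a left child */
    else
        tl->root = n->right;
    n->parent = n->left = n->right = NULL;
    return n;
}
```

Because the leftmost node never has a left child, unlinking it only requires splicing its right child into its place; removing an arbitrary node is more involved and is left to rb_erase() in the kernel.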
4 The Scheduler Entry Point
The main entry point into the process scheduler is schedule(), defined in kernel/sched.c. It is the function the rest of the kernel uses to invoke the scheduler. schedule() finds the highest-priority scheduler class that has a runnable process (each scheduler class has its own run queue) and asks that class which process to run next. The most important work in schedule() is done by pick_next_task(), which goes through the scheduler classes in priority order, from highest to lowest, and selects the highest-priority process in the highest-priority class that has one.
/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_note_context_switch(cpu);
    prev = rq->curr;
    switch_count = &prev->nivcsw;
    release_kernel_lock(prev);
need_resched_nonpreemptible:
    schedule_debug(prev);
    if (sched_feat(HRTICK))
        hrtick_clear(rq);
    raw_spin_lock_irq(&rq->lock);
    clear_tsk_need_resched(prev);
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
        switch_count = &prev->nvcsw;
    }
    pre_schedule(rq, prev);
    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);
    put_prev_task(rq, prev);
    next = pick_next_task(rq);
    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next);
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * the context switch might have flipped the stack from under
         * us, hence refresh the local variables.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);
    post_schedule(rq);
    if (unlikely(reacquire_kernel_lock(current) < 0)) {
        prev = rq->curr;
        switch_count = &prev->nivcsw;
        goto need_resched_nonpreemptible;
    }
    preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }
    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        /*
         * Will never be NULL as the idle class always
         * returns a non-NULL p:
         */
        class = class->next;
    }
}
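The class walk in pick_next_task() can be tried out in userspace with a small model. In this sketch the run queue is reduced to at most one candidate task per class, which is an assumption made purely to keep the example short; the structure and function names only mimic the kernel's and are defined here from scratch.

```c
#include <assert.h>
#include <stddef.h>

struct task { const char *name; };

/* Toy run queue: at most one candidate per class, plus the idle task. */
struct rq {
    struct task *rt_task;       /* NULL if no real-time task is runnable */
    struct task *fair_task;     /* NULL if no CFS task is runnable */
    struct task idle;
};

struct sched_class {
    const struct sched_class *next;     /* next lower-priority class */
    struct task *(*pick_next_task)(struct rq *rq);
};

static struct task *pick_rt(struct rq *rq)   { return rq->rt_task; }
static struct task *pick_fair(struct rq *rq) { return rq->fair_task; }
static struct task *pick_idle(struct rq *rq) { return &rq->idle; }  /* never NULL */

/* Classes chained from highest to lowest priority. */
static const struct sched_class idle_class = { NULL, pick_idle };
static const struct sched_class fair_class = { &idle_class, pick_fair };
static const struct sched_class rt_class   = { &fair_class, pick_rt };

/* Ask each class in priority order; the first task offered wins. */
struct task *pick_next_task(struct rq *rq)
{
    const struct sched_class *class;

    assert(rq != NULL);
    for (class = &rt_class; class; class = class->next) {
        struct task *p = class->pick_next_task(rq);
        if (p)
            return p;
    }
    return NULL;    /* not reached: the idle class always returns a task */
}
```

With an empty run queue the idle task is returned; once a fair task is runnable it is chosen, and a runnable real-time task takes precedence over both. The kernel's fast path for the case where every runnable task belongs to the fair class is not modelled here.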
5 Sleeping and Waking Up
A blocked process is in a non-runnable state, also called the sleeping state, and the scheduler cannot select a sleeping process to run. A process is in this state while it waits for some event. Waking up is the reverse: when the event that a sleeping process is waiting for occurs, the process becomes runnable and is moved back into the red-black tree of runnable processes.
5.1 Wait Queues
Sleeping processes are kept on wait queues, which the kernel represents with wait_queue_head_t. A wait queue can be created in two ways: statically with DECLARE_WAIT_QUEUE_HEAD(), or dynamically with init_waitqueue_head(). When it blocks, a process puts itself on a wait queue and marks itself as not runnable; when the event it is blocked on occurs, the processes on the queue are woken up.
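A process adds itself to a wait queue with the following steps:
1. Create a wait queue entry with the DEFINE_WAIT() macro;
2. Call add_wait_queue() to add itself to the queue. The queue wakes the process when the condition it is waiting for occurs, so code elsewhere must call wake_up() on the queue when the event happens;
3. Call prepare_to_wait() to change the process state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. This function also adds the process back to the wait queue if necessary, which is needed on later iterations of the loop;
4. If the state is TASK_INTERRUPTIBLE, a signal can wake the process. This is a spurious wakeup, because the event the process is waiting for has not occurred;
5. When the process wakes up, it checks whether the event has occurred. If it has, the process leaves the loop; if not, it calls schedule() again and repeats;
6. Once the event has occurred, the process sets itself to TASK_RUNNING and removes itself from the wait queue with finish_wait().
Note that if the event occurs before the process goes to sleep, the loop terminates and the process does not go to sleep at all.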
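The shape of this loop, where the condition is checked before sleeping and checked again after every wakeup, is not specific to the kernel. As a userspace analogy only, the sketch below uses a POSIX condition variable instead of the kernel wait queue API; struct event and the three functions are names made up for this example, and a condition variable is not how the kernel implements wait queues.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event object: a flag protected by a mutex. */
struct event {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool happened;
};

void event_init(struct event *ev)
{
    assert(ev != NULL);
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->cond, NULL);
    ev->happened = false;
}

/* Sleep until the event has happened; returns how many times we slept. */
int event_wait(struct event *ev)
{
    int sleeps = 0;

    pthread_mutex_lock(&ev->lock);
    while (!ev->happened) {     /* re-check after every wakeup */
        pthread_cond_wait(&ev->cond, &ev->lock);
        sleeps++;
    }
    pthread_mutex_unlock(&ev->lock);
    return sleeps;
}

/* Record the event and wake every waiter, in the spirit of wake_up(). */
void event_signal(struct event *ev)
{
    pthread_mutex_lock(&ev->lock);
    ev->happened = true;
    pthread_cond_broadcast(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}
```

If event_signal() runs before event_wait(), the loop body is never entered and the caller does not sleep, which mirrors the note above. The while loop, rather than a single if, is what makes spurious wakeups harmless.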
inotify_read() is a typical use of wait queues. It reads events from the notification file descriptor and is implemented in fs/notify/inotify/inotify_user.c.
static ssize_t inotify_read(struct file *file, char __user *buf,
                            size_t count, loff_t *pos)
{
    struct fsnotify_group *group;
    struct fsnotify_event *kevent;
    char __user *start;
    int ret;
    DEFINE_WAIT(wait);

    start = buf;
    group = file->private_data;
    while (1) {
        prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
        mutex_lock(&group->notification_mutex);
        kevent = get_one_event(group, count);
        mutex_unlock(&group->notification_mutex);
        if (kevent) {
            ret = PTR_ERR(kevent);
            if (IS_ERR(kevent))
                break;
            ret = copy_event_to_user(group, kevent, buf);
            fsnotify_put_event(kevent);
            if (ret < 0)
                break;
            buf += ret;
            count -= ret;
            continue;
        }
        ret = -EAGAIN;
        if (file->f_flags & O_NONBLOCK)
            break;
        ret = -EINTR;
        if (signal_pending(current))
            break;
        if (start != buf)
            break;
        schedule();
    }
    finish_wait(&group->notification_waitq, &wait);
    if (start != buf && ret != -EFAULT)
        ret = buf - start;
    return ret;
}
5.2 Waking Up
As mentioned above, waking is done by wake_up(), which wakes up all the processes on the given wait queue. wake_up() proceeds as follows:
First, it calls try_to_wake_up(), which sets the process state to TASK_RUNNING;
Next, it calls enqueue_task() to add the process to the red-black tree of runnable processes;
Finally, if the awakened process has a higher priority than the currently running process, it sets the need_resched flag.
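The three steps can be modelled in a few lines of userspace C. This is a toy model under stated assumptions, not the kernel's try_to_wake_up(): the run queue is reduced to a counter and a flag, enqueueing is represented by setting on_rq, priority is a plain integer where a smaller value means higher priority, and wake_task() is a name invented for this example. The steps follow the order given in the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE };

struct task {
    enum task_state state;
    int prio;                   /* smaller value = higher priority (assumed) */
    bool on_rq;
};

/* Toy run queue: only what the three steps need. */
struct rq {
    struct task *curr;
    unsigned int nr_running;
    bool need_resched;
};

/* Wake p on rq; returns false if p was already runnable. */
bool wake_task(struct rq *rq, struct task *p)
{
    assert(rq != NULL && p != NULL);
    if (p->state == TASK_RUNNING)
        return false;
    p->state = TASK_RUNNING;            /* step 1: mark runnable */
    p->on_rq = true;                    /* step 2: stand-in for enqueue_task() */
    rq->nr_running++;
    if (rq->curr && p->prio < rq->curr->prio)
        rq->need_resched = true;        /* step 3: request a reschedule */
    return true;
}
```

Waking a lower-priority task leaves need_resched clear, while waking a higher-priority one sets it, so the current process is preempted at the next opportunity rather than immediately.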
These are some notes I made while reading Linux Kernel Development (《Linux内核设计与实现》). They include some of my own understanding of operating systems and are an attempt to organize what I already know. If you find any mistakes, corrections are welcome.