2. Linux Kernel Study: A First Look at Process Scheduling (2): Implementing Process Scheduling (CFS)

1 Overview

The CFS code lives in kernel/sched_fair.c. Its four most important parts are:
Time accounting
Process selection
The scheduler entry point
Sleeping and waking up

2 Time accounting

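The scheduler must account for how long each process runs. With that record it knows how much CPU time a process has consumed, so that once the running process has used up its share (its timeslice reaches 0), another process that still has time left can preempt it.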

2.1 The scheduler entity structure

In Linux, CFS tracks this accounting in the scheduler entity structure, struct sched_entity, defined in <linux/sched.h>. The structure is embedded in the process descriptor task_struct as a member named se. Its definition in the header is:

struct sched_entity {
    struct load_weight  load;       /* for load-balancing */
    struct rb_node      run_node;
    struct list_head    group_node;
    unsigned int        on_rq;
    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;              /* virtual runtime */
    u64         prev_sum_exec_runtime;
    u64         nr_migrations;
#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq       *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq       *my_q;
#endif
};
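
Because se is embedded in task_struct rather than referenced through a pointer, code that holds only a scheduler entity can recover the owning task by subtracting the member offset. The following is a minimal user-space sketch of that idea, assuming heavily simplified stand-ins for both structures (only a few fields are kept, and pid is included purely for illustration). The kernel's own task_of() helper relies on the same container_of() technique.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only a few fields are kept. */
struct sched_entity {
    unsigned long long vruntime;          /* weighted run time, in ns */
    unsigned long long sum_exec_runtime;  /* real run time, in ns */
};

struct task_struct {
    int pid;
    struct sched_entity se;               /* embedded, not a pointer */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* User-space counterpart of the kernel's task_of() helper. */
static struct task_struct *task_of(struct sched_entity *se)
{
    return container_of(se, struct task_struct, se);
}
```

Given a struct task_struct t, task_of(&t.se) yields &t again, which is how the scheduler can work on entities and still reach the task when it needs to.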

2.2 Virtual runtime

In the Linux kernel, CFS is built on virtual runtime, stored in the vruntime field of the scheduler entity. vruntime holds the process's virtual runtime: the time it has actually spent running, normalized (weighted). Normalization here means scaling the run time by the load weight that the process's nice value maps to.
As noted in the post on scheduling policy, CFS conceptually compares two quantities when selecting a process, the actual run time and the ideal run time, and picks the next process from their ratio. In practice Linux CFS implements this with virtual runtime: the time accounting code records how long a process has run and therefore how much longer it should run. The accounting is done by update_curr(), defined in kernel/sched_fair.c. Note that this function is invoked periodically by the system timer, and also whenever a process becomes runnable or blocks, so vruntime is an accurate measure of a process's run time.
update_curr() computes how long the current process has just run and stores the result in delta_exec. It then passes that value to __update_curr(), which weights it through calc_delta_fair() and adds the weighted value to the vruntime of the current process. Both functions are shown below.

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;   /* entity currently running on this queue */
    u64 now = rq_of(cfs_rq)->clock;             /* current run-queue clock */
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;

    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;

    __update_curr(cfs_rq, curr, delta_exec);
    curr->exec_start = now;

    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}

static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->statistics.exec_max,
              max((u64)delta_exec, curr->statistics.exec_max));

    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);

    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
}
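
To make the weighting step concrete, here is a minimal user-space sketch of the scaling that calc_delta_fair() is responsible for: the real run time is multiplied by the nice-0 weight and divided by the entity's own weight. The function names calc_delta_fair_sketch() and charge() are hypothetical, and the plain division is an assumption made for readability; the kernel reaches the same result with a precomputed inverse weight to avoid dividing on every update.

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL   /* load weight of a nice-0 task */

/*
 * Scale a real run-time delta (ns) into a virtual run-time delta.
 * Heavier entities (lower nice value) accumulate vruntime more slowly.
 * Simplified: a plain division stands in for the kernel's inverse-weight math.
 */
static uint64_t calc_delta_fair_sketch(uint64_t delta_exec_ns, uint64_t weight)
{
    if (weight == NICE_0_LOAD)
        return delta_exec_ns;               /* common case: no scaling needed */
    return delta_exec_ns * NICE_0_LOAD / weight;
}

/* Charge delta_exec_ns of CPU time to an entity's vruntime. */
static void charge(uint64_t *vruntime, uint64_t delta_exec_ns, uint64_t weight)
{
    *vruntime += calc_delta_fair_sketch(delta_exec_ns, weight);
}
```

With this scaling, 1 ms of CPU time advances the vruntime of a nice-0 task (weight 1024) by exactly 1 ms, while a task with weight 335 (the kernel's weight for nice 5) advances by roughly 3.06 ms and therefore moves to the right of the tree much faster.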

3 Process selection

The post on scheduling policy said that CFS selects a process by comparing its actual run time with its ideal run time. The implementation, however, uses only vruntime: the CFS scheduler always picks the process with the smallest vruntime to run next.

3.1 Lookup

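To find the process with the smallest vruntime quickly, CFS keeps the runnable processes in a red-black tree keyed by vruntime. By the ordering property of the tree, the process with the smallest vruntime is always the leftmost node, so finding the next process only requires following left children down from the root (and, as section 3.2 explains, the kernel caches that node so that even this walk is avoided). This is implemented by pick_next_entity() in kernel/sched_fair.c, which obtains the leftmost node through __pick_next_entity(). Note that when the tree has no leftmost node (the tree is empty), there is no runnable process; the scheduler does not wait in that case but schedules the idle task instead.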

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct sched_entity *se = __pick_next_entity(cfs_rq);
    struct sched_entity *left = se;

    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;

    clear_buddies(cfs_rq, se);

    return se;
}
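
The "leftmost node holds the smallest vruntime" property can be illustrated with a minimal user-space sketch. It assumes a plain binary search tree node (struct node) in place of the kernel's rb_node, and a hypothetical pick_next() that takes the idle entity as an argument; in the kernel the idle fallback comes from the idle scheduling class rather than from CFS itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified tree node: a plain binary search tree ordered by vruntime. */
struct node {
    uint64_t vruntime;
    struct node *left, *right;
};

/* Follow left children until there are none; NULL for an empty tree. */
static struct node *leftmost(struct node *root)
{
    if (!root)
        return NULL;
    while (root->left)
        root = root->left;
    return root;
}

/* Hypothetical picker: fall back to the idle entity when nothing is runnable. */
static struct node *pick_next(struct node *root, struct node *idle)
{
    struct node *n = leftmost(root);

    return n ? n : idle;
}
```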

3.2 Adding a runnable process

When a process is added to the red-black tree of runnable processes, CFS uses a cache: if the new process becomes the leftmost node, it replaces the cached leftmost node; otherwise the cache is left unchanged. This avoids repeatedly walking the tree to find the leftmost node. Insertion is implemented by enqueue_entity() in kernel/sched_fair.c. That function only updates the run time and some statistics; the actual insertion into the tree is done by __enqueue_entity().

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through callig update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    account_entity_enqueue(cfs_rq, se);

    if (flags & ENQUEUE_WAKEUP) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }

    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);
}

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se);
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {     /* walk down the tree to find the insertion point */
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (key < entity_key(cfs_rq, entry)) {  /* new key is smaller than this node's key */
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost)   /* cache the frequently used leftmost node */
        cfs_rq->rb_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}
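
The leftmost-caching trick in __enqueue_entity() does not depend on the tree being a red-black tree. The sketch below shows it on an unbalanced binary search tree, assuming hypothetical struct ent and struct timeline types; the rebalancing that rb_insert_color() performs in the kernel is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

struct ent {
    uint64_t key;                 /* stands in for the vruntime-based key */
    struct ent *left, *right;
};

struct timeline {
    struct ent *root;
    struct ent *leftmost;         /* cached node with the smallest key */
};

/* Insert n and keep the leftmost cache up to date (no rebalancing). */
static void timeline_insert(struct timeline *t, struct ent *n)
{
    struct ent **link = &t->root;
    int is_leftmost = 1;

    n->left = n->right = NULL;
    while (*link) {
        if (n->key < (*link)->key) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            is_leftmost = 0;      /* went right at least once: not the minimum */
        }
    }
    *link = n;
    if (is_leftmost)
        t->leftmost = n;
}
```

A new node becomes the cached minimum only if the descent never turned right. Equal keys go to the right, so an entity that ties with the current minimum does not displace it.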

3.3 Removing a process that is no longer runnable

There is of course a matching removal function, which takes the node for a process out of the red-black tree when the process blocks. It is implemented by dequeue_entity() in kernel/sched_fair.c. As with insertion, this function only does bookkeeping; the actual removal is done by __dequeue_entity().

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    u64 min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    update_stats_dequeue(cfs_rq, se);
    if (flags & DEQUEUE_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
            struct task_struct *tsk = task_of(se);

            if (tsk->state & TASK_INTERRUPTIBLE)
                se->statistics.sleep_start = rq_of(cfs_rq)->clock;
            if (tsk->state & TASK_UNINTERRUPTIBLE)
                se->statistics.block_start = rq_of(cfs_rq)->clock;
        }
#endif
    }

    clear_buddies(cfs_rq, se);

    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    account_entity_dequeue(cfs_rq, se);

    min_vruntime = cfs_rq->min_vruntime;
    update_min_vruntime(cfs_rq);

    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!(flags & DEQUEUE_SLEEP))
        se->vruntime -= min_vruntime;
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}
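
When the cached leftmost node is the one being removed, __dequeue_entity() replaces the cache with that node's in-order successor, obtained from rb_next(). For the leftmost node this successor is easy to characterize, as the following user-space sketch shows. It assumes a simplified node with an explicit parent pointer, and next_after_leftmost() is a hypothetical name that covers only this special case, not the general rb_next().

```c
#include <stddef.h>

struct node {
    unsigned long long key;
    struct node *parent, *left, *right;
};

/*
 * In-order successor of the leftmost node. That node has no left child,
 * so its successor is the smallest node of its right subtree if it has one,
 * and its parent otherwise.
 */
static struct node *next_after_leftmost(struct node *lm)
{
    struct node *n;

    if (!lm->right)
        return lm->parent;        /* NULL when lm was the only node */

    n = lm->right;
    while (n->left)
        n = n->left;
    return n;
}
```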

4 The scheduler entry point

The main entry point into the process scheduler is schedule(), defined in kernel/sched.c. It is the function through which the rest of the kernel invokes the scheduler. schedule() finds the highest-priority scheduler class that has a runnable process (each class has its own run queue) and asks that class which process should run next. The most important part of this work is done by pick_next_task(), which walks the scheduler classes in priority order, from highest to lowest, and returns the highest-priority process from the highest-priority class that has one.

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_note_context_switch(cpu);
    prev = rq->curr;
    switch_count = &prev->nivcsw;

    release_kernel_lock(prev);
need_resched_nonpreemptible:

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    raw_spin_lock_irq(&rq->lock);
    clear_tsk_need_resched(prev);

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);

    put_prev_task(rq, prev);
    next = pick_next_task(rq);

    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next);

        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * the context switch might have flipped the stack from under
         * us, hence refresh the local variables.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    if (unlikely(reacquire_kernel_lock(current) < 0)) {
        prev = rq->curr;
        switch_count = &prev->nivcsw;
        goto need_resched_nonpreemptible;
    }

    preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }

    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        /*
         * Will never be NULL as the idle class always
         * returns a non-NULL p:
         */
        class = class->next;
    }
}
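
The class walk in pick_next_task() can be sketched in user space as a linked list of classes, each with its own pick function. Everything below is a simplified assumption: struct task, the three-field struct rq and the pick_rt()/pick_fair()/pick_idle() callbacks are hypothetical stand-ins, and the CFS fast path at the top of the kernel function is omitted.

```c
#include <stddef.h>

struct task { const char *name; };

/* Toy run queue: at most one candidate task per class. */
struct rq {
    struct task *rt_task;
    struct task *fair_task;
    struct task *idle;            /* always set in a real system */
};

struct sched_class {
    const struct sched_class *next;                 /* next lower-priority class */
    struct task *(*pick_next_task)(struct rq *rq);
};

static struct task *pick_rt(struct rq *rq)   { return rq->rt_task; }
static struct task *pick_fair(struct rq *rq) { return rq->fair_task; }
static struct task *pick_idle(struct rq *rq) { return rq->idle; }

/* Classes chained from highest to lowest priority: rt -> fair -> idle. */
static const struct sched_class idle_class = { NULL, pick_idle };
static const struct sched_class fair_class = { &idle_class, pick_fair };
static const struct sched_class rt_class   = { &fair_class, pick_rt };

/* Ask each class in priority order; the first non-NULL answer wins. */
static struct task *pick_next_task_sketch(struct rq *rq)
{
    const struct sched_class *class;

    for (class = &rt_class; class; class = class->next) {
        struct task *p = class->pick_next_task(rq);

        if (p)
            return p;
    }
    return NULL;                  /* only reached if rq->idle is unset */
}
```

Because the idle class always has a task to offer, the walk terminates with a valid task as long as rq->idle is set, which mirrors the comment in the kernel code that the result is never NULL.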

5 Sleeping and waking up

A blocked process is in a non-runnable state, also called sleeping. The scheduler cannot select a sleeping process to run; a process is in this state while it waits for some event. Waking up is the reverse: when the event a sleeping process is waiting for occurs, the process becomes runnable again and is moved back into the red-black tree of runnable processes.

5.1 Wait queues

Sleeping processes are kept on wait queues, which the kernel represents with wait_queue_head_t. A wait queue can be created in two ways: statically with DECLARE_WAIT_QUEUE_HEAD(), or dynamically with init_waitqueue_head(). When it blocks, a process puts itself on a wait queue and marks itself as not runnable; when the event it is blocked on occurs, the processes on the queue are woken up.
A process adds itself to a wait queue with the following steps:

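1. Call the DEFINE_WAIT() macro to create a wait queue entry;
2. Call add_wait_queue() to add the entry to the queue. The process is woken when the condition it is waiting for occurs, so code elsewhere must call wake_up() on the queue when the event happens;
3. Call prepare_to_wait() to change the process state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. On later iterations of the loop, this function also adds the process back to the wait queue if necessary;
4. A process in the TASK_INTERRUPTIBLE state can also be woken by a signal. This is a spurious wakeup, because the event being waited for has not occurred, so signals must be checked for and handled;
5. When the process wakes up, it checks whether the event has occurred. If it has, the process leaves the loop; if not, it calls schedule() again and repeats;
6. Once the event has occurred, the process sets itself to TASK_RUNNING and removes itself from the wait queue by calling finish_wait().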

Note that if the event occurs before the process goes to sleep, the loop terminates and the process never sleeps.
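
The control flow of the steps above can be modelled in user space. The sketch below is a toy model and not kernel code: struct waiter, the cond_fn and sched_fn callbacks and the countdown event source are all hypothetical, the callback passed as sched merely stands in for schedule() returning after a wakeup, and signal handling is left out. What it does show is the shape of the loop: mark the task as sleeping, check the condition, give up the CPU only if the condition is still false, and clean up once at the end.

```c
#include <stdbool.h>
#include <stddef.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct waiter {
    enum task_state state;
    bool queued;                  /* is the entry on the wait queue? */
    int  sleeps;                  /* how many times the CPU was given up */
};

/* Hypothetical callbacks: the event check and a stand-in for schedule(). */
typedef bool (*cond_fn)(void *ctx);
typedef void (*sched_fn)(void *ctx);

static void wait_for_event(struct waiter *w, cond_fn cond, sched_fn sched, void *ctx)
{
    for (;;) {
        /* Models prepare_to_wait(): (re)queue the entry and mark as sleeping. */
        w->queued = true;
        w->state = TASK_INTERRUPTIBLE;

        if (cond(ctx))            /* re-checked after every wakeup */
            break;

        w->sleeps++;
        sched(ctx);               /* models schedule(): returns on wakeup */
    }
    /* Models finish_wait(): runnable again and off the queue. */
    w->state = TASK_RUNNING;
    w->queued = false;
}

/* Example event source: the event "arrives" after a fixed number of wakeups. */
struct countdown { int wakeups_left; };

static bool countdown_done(void *ctx)
{
    return ((struct countdown *)ctx)->wakeups_left == 0;
}

static void countdown_tick(void *ctx)
{
    ((struct countdown *)ctx)->wakeups_left--;
}
```

Checking the condition after the state change but before giving up the CPU is what makes the loop safe against an event that arrives early: with a countdown that starts at zero, wait_for_event() returns without sleeping at all.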
inotify_read() is a typical user of wait queues. It reads information from the notification file descriptor and is implemented in fs/notify/inotify/inotify_user.c.

static ssize_t inotify_read(struct file *file, char __user *buf,
                size_t count, loff_t *pos)
{
    struct fsnotify_group *group;
    struct fsnotify_event *kevent;
    char __user *start;
    int ret;
    DEFINE_WAIT(wait);

    start = buf;
    group = file->private_data;

    while (1) {
        prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);

        mutex_lock(&group->notification_mutex);
        kevent = get_one_event(group, count);
        mutex_unlock(&group->notification_mutex);

        if (kevent) {
            ret = PTR_ERR(kevent);
            if (IS_ERR(kevent))
                break;
            ret = copy_event_to_user(group, kevent, buf);
            fsnotify_put_event(kevent);
            if (ret < 0)
                break;
            buf += ret;
            count -= ret;
            continue;
        }

        ret = -EAGAIN;
        if (file->f_flags & O_NONBLOCK)
            break;
        ret = -EINTR;
        if (signal_pending(current))
            break;

        if (start != buf)
            break;

        schedule();
    }

    finish_wait(&group->notification_waitq, &wait);
    if (start != buf && ret != -EFAULT)
        ret = buf - start;
    return ret;
}

5.2 Waking up

As mentioned above, waking is done through wake_up(), which wakes all processes on the given wait queue. wake_up() works as follows:
First, it calls try_to_wake_up(), which sets the process state to TASK_RUNNING;
Next, enqueue_task() is called to put the process into the red-black tree of runnable processes;
Finally, if the awakened process has a higher priority than the currently running process, the need_resched flag is set.
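
The three steps can be summarized in a small user-space model. This is a sketch under simplifying assumptions, not the kernel's try_to_wake_up(): struct task, struct rq and wake_up_task() are hypothetical, the enqueue is reduced to a flag and a counter, and the preemption check is reduced to comparing a single priority number, whereas the kernel delegates that decision to the scheduling class.

```c
#include <stdbool.h>
#include <stddef.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct task {
    enum task_state state;
    int  prio;                    /* lower value = higher priority */
    bool on_rq;
};

struct rq {
    struct task *curr;            /* task currently on the CPU */
    unsigned int nr_running;
    bool need_resched;
};

/* Returns true if the task was woken, false if it was already runnable. */
static bool wake_up_task(struct rq *rq, struct task *p)
{
    if (p->state == TASK_RUNNING)
        return false;

    p->state = TASK_RUNNING;      /* step 1: mark the task runnable */

    p->on_rq = true;              /* step 2: stand-in for enqueue_task() */
    rq->nr_running++;

    if (rq->curr && p->prio < rq->curr->prio)
        rq->need_resched = true;  /* step 3: request a reschedule */

    return true;
}
```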

These are some notes I took while reading Linux Kernel Development. They include some of my own understanding of operating systems and are a way of organizing what I know so far. If you find any mistakes, corrections are welcome.
