Load Tracking (Part 3): The Introduction of util_est


Contents

  • 1. util_est - the data structure that records the estimated utilization of a task and a cfs_rq
  • 2. Work done at dequeue time
    • 2.1 dequeue_task_fair - subtract and update on task dequeue
    • 2.2 util_est_dequeue - subtract the task's util_est from the cfs_rq
    • 2.3 util_est_update - update the task's util_est on dequeue
      • 2.3.1 within_margin - check whether |value| < margin
    • 2.4 When the UTIL_AVG_UNCHANGED flag is updated
      • 2.4.1 __update_load_avg_se - update an se's load statistics
      • 2.4.2 cfs_se_util_change - clear the UTIL_AVG_UNCHANGED flag
  • 3. Work done at enqueue time
    • 3.1 enqueue_task_fair - add the task's util_est to the cfs_rq on enqueue
    • 3.2 util_est_enqueue - accumulate on enqueue
  • 4. Interfaces that expose the util of a task or a cpu
    • 4.1 task_util_est - get a task's util
    • 4.2 _task_util_est - get the task's util_est from before it slept
    • 4.3 cpu_util_cfs - get a cpu's util
    • 4.4 cpu_util - get a cpu's util
    • 4.5 cpu_util_without - get a cpu's util, excluding a given task


Estimated utilization (util_est) was proposed against the following background:

  • From the PELT analysis in the previous chapter, we know that the load signals {load|runnable|util}_avg are updated with a period of 1ms (strictly speaking, 1024us), so load accumulates rather slowly
  • CPU frequency scaling (schedutil) depends on the CPU's util value: a small util means a low frequency, and a slowly changing util means slow frequency scaling
  • Task migration also depends on util: a task with a very small util is unlikely to be migrated to a big core

 

Consider the following scenario: a heavily loaded thread is about to go to sleep for some reason. Before sleeping, it has already accumulated a large util_avg. During sleep no new load is added and the historical load is only decayed, so after a long sleep the once-large util_avg has decayed to a very small value and the thread wakes up with a tiny util_avg. Because PELT accumulates load slowly, this causes two problems: the CPU frequency ramps up very slowly, and the task has little chance of being placed on a big core.

To remedy this shortcoming of PELT, the concept of estimated utilization was introduced. In essence, before a task is dequeued to sleep, the util_est of the task is saved; when the task wakes up and is enqueued again, this util_est is added back onto the cfs_rq.

 

So the util_est mechanism mainly does the following work:

  • Provide a data structure to record the util_est of tasks and cfs_rqs
  • At dequeue, update the task's util_est and subtract it from the cfs_rq
  • At enqueue, add the task's util_est back onto the cfs_rq
  • Expose interfaces for querying the util of a task or a cpu, where the returned util takes util_est into account

 

Note: util_est records the util of a task/cpu from just before it went to sleep (more precisely, a moving average of that util), and util_est does not decay with sleep time. For example, if a task's util_est was 800 before sleeping, then whether it sleeps for 10s or 100s, task_util_est will still return 800 afterwards. The sketch below contrasts this with PELT's decay.
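For intuition, here is a rough userspace sketch (not kernel code) contrasting the two signals. It assumes only the well-known PELT property that a blocked task's util_avg halves every 32ms (the decay factor y satisfies y^32 = 0.5); fractional periods are ignored.

#include <stdio.h>

/* Rough model: a blocked task's util_avg halves every 32ms (y^32 = 0.5). */
static unsigned int pelt_decayed(unsigned int util, unsigned int sleep_ms)
{
    unsigned int halvings = sleep_ms / 32;   /* ignore fractional periods */
    return halvings >= 32 ? 0 : util >> halvings;
}

int main(void)
{
    unsigned int util_est = 800;             /* frozen at dequeue time */

    for (unsigned int ms = 0; ms <= 320; ms += 64)
        printf("slept %3ums: util_avg ~%3u, util_est still %u\n",
               ms, pelt_decayed(800, ms), util_est);
    return 0;
}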

 

The following sections analyze each part in detail.

 

1. util_est - the data structure that records the estimated utilization of a task and a cfs_rq

/**
 * struct util_est - Estimation utilization of FAIR tasks
 * @enqueued: instantaneous estimated utilization of a task/cpu
 * @ewma:     the Exponential Weighted Moving Average (EWMA)
 *            utilization of a task
 *
 * Support data structure to track an Exponential Weighted Moving Average
 * (EWMA) of a FAIR task's utilization. New samples are added to the moving
 * average each time a task completes an activation. Sample's weight is chosen
 * so that the EWMA will be relatively insensitive to transient changes to the
 * task's workload.
 *
 * The enqueued attribute has a slightly different meaning for tasks and cpus:
 * - task:   the task's util_avg at last task dequeue time
 * - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
 * Thus, the util_est.enqueued of a task represents the contribution on the
 * estimated utilization of the CPU where that task is currently enqueued.
 *
 * Only for tasks we track a moving average of the past instantaneous
 * estimated utilization. This allows to absorb sporadic drops in utilization
 * of an otherwise almost periodic task.
 *
 * The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
 * updates. When a task is dequeued, its util_est should not be updated if its
 * util_avg has not been updated in the meantime.
 * This information is mapped into the MSB bit of util_est.enqueued at dequeue
 * time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
 * for a task) it is safe to use MSB.
 */
struct util_est {
    //Records the estimated util of the task or cfs_rq before its last sleep.
    //The MSB marks whether util_avg has been updated: 0 = updated, 1 = not updated.
    //util_est is only updated after util_avg has been updated, for two reasons:
    //a) for frequently running "small" tasks, updating util_est on every dequeue
    //   would add unnecessary overhead
    //b) the util_est computation is based on util_avg; if util_avg has not been
    //   updated, updating util_est is pointless
    //Note: the minimum update period of util_avg is 1ms (1 PELT period = 1024us)
    unsigned int enqueued;

    //the exponentially weighted moving average
    unsigned int ewma;
#define UTIL_EST_WEIGHT_SHIFT  2            //weight of a new sample (w = 1/4)
#define UTIL_AVG_UNCHANGED     0x80000000
} __attribute__((__aligned__(sizeof(u64))));
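Since the flag lives in the MSB of enqueued while a task's util never exceeds 1024, the two can share the same field. A minimal userspace sketch of the packing (illustration only):

#include <assert.h>

#define UTIL_AVG_UNCHANGED 0x80000000u

int main(void)
{
    unsigned int enq = 800 | UTIL_AVG_UNCHANGED;  /* value + "not updated" flag */

    assert((enq & ~UTIL_AVG_UNCHANGED) == 800);   /* value: flag masked off */
    assert(enq & UTIL_AVG_UNCHANGED);             /* util_avg not updated yet */
    return 0;
}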

 

2. Work done at dequeue time

As stated in the introduction, a task's dequeue needs to do two things:

  • subtract the task's util_est from the cfs_rq
  • update the task's util_est

 

2.1 dequeue_task_fair - subtract and update on task dequeue

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    int task_sleep = flags & DEQUEUE_SLEEP;
    ...

    //1.first subtract this task's util_est from the cfs_rq
    util_est_dequeue(&rq->cfs, p);

    ...

    //2.update the task's util_est
    util_est_update(&rq->cfs, p, task_sleep);
    ...
}

 

2.2 util_est_dequeue - subtract the task's util_est from the cfs_rq

static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
{
    unsigned int enqueued;

    //1.if the feature is not enabled, return immediately
    if (!sched_feat(UTIL_EST))
        return;

    /* Update root cfs_rq's estimated utilization */
    //2.when the task is removed from the cfs_rq, subtract the task's util_est
    //  from the cfs_rq; min_t clamps the result at 0 so the unsigned sum can
    //  never underflow
    enqueued = cfs_rq->avg.util_est.enqueued;
    enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
    WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);

    //3.tracing
    trace_sched_util_est_cfs_tp(cfs_rq);
}

 

2.3 util_est_update - update the task's util_est on dequeue

This function updates the task's own util_est (the cfs_rq-side subtraction was already done in util_est_dequeue above) and needs to do two things:

  • store the task's current util in ue.enqueued;
  • update the ue.ewma moving average;

 

static inline void util_est_update(struct cfs_rq *cfs_rq,
                                   struct task_struct *p,  //the task being dequeued
                                   bool task_sleep)        //whether the task is about to sleep
{
    long last_ewma_diff, last_enqueued_diff;
    struct util_est ue;

    //1.if the feature is not enabled, return immediately
    if (!sched_feat(UTIL_EST))
        return;

    /*
     * Skip update of task's estimated utilization when the task has not
     * yet completed an activation, e.g. being migrated.
     */
    //2.only update the task's util_est when the task is about to sleep;
    //  if the task is merely being migrated to another cpu, skip the update
    if (!task_sleep)
        return;

    /*
     * If the PELT values haven't changed since enqueue time,
     * skip the util_est update.
     */
    //3.if the task's util_avg has not been updated, skip.
    //  The MSB of the task's util_est.enqueued marks whether util_avg has been
    //  updated. util_est is only recomputed after util_avg has changed, so for
    //  small tasks that run for mere microseconds (util_avg is updated at a
    //  minimum granularity of 1ms), util_est is not recomputed on every dequeue.
    //  This reduces the update overhead and avoids pointless work while
    //  util_avg is unchanged.
    ue = p->se.avg.util_est;
    if (ue.enqueued & UTIL_AVG_UNCHANGED)
        return;

    //4.remember the task's util_est before the update; it is used by the
    //  margin check below
    last_enqueued_diff = ue.enqueued;

    /*
     * Reset EWMA on utilization increases, the moving average is used only
     * to smooth utilization decreases.
     */
    //5.read the task's current util_avg as the new util_est sample.
    //  UTIL_EST_FASTUP is the fast path: on an increase, the ewma is reset
    //  directly to the latest util_avg
    ue.enqueued = task_util(p);
    if (sched_feat(UTIL_EST_FASTUP)) {
        if (ue.ewma < ue.enqueued) {
            ue.ewma = ue.enqueued;
            goto done;
        }
    }

    /*
     * Skip update of task's estimated utilization when its members are
     * already ~1% close to its last activation value.
     */
    //6.a ~1% margin: if the new sample is within 1% of the ewma, skip the
    //  ewma update (and skip the write-back entirely if enqueued also
    //  barely changed)
    last_ewma_diff = ue.enqueued - ue.ewma;
    last_enqueued_diff -= ue.enqueued;
    if (within_margin(last_ewma_diff, UTIL_EST_MARGIN)) {
        if (!within_margin(last_enqueued_diff, UTIL_EST_MARGIN))
            goto done;

        return;
    }

    /*
     * To avoid overestimation of actual task utilization, skip updates if
     * we cannot grant there is idle time in this CPU.
     */
    //7.if util_avg exceeds the cpu's capacity, do not update either
    if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
        return;

    /*
     * Update Task's estimated utilization
     *
     * When *p completes an activation we can consolidate another sample
     * of the task size. This is done by storing the current PELT value
     * as ue.enqueued and by using this value to update the Exponential
     * Weighted Moving Average (EWMA):
     *
     *  ewma(t) =  w *  task_util(p) + (1-w) * ewma(t-1)
     *          =  w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
     *          =  w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
     *          =  w * (      last_ewma_diff             ) +     ewma(t-1)
     *          =  w * (last_ewma_diff  +  ewma(t-1) / w)
     *
     * Where 'w' is the weight of new samples, which is configured to be
     * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
     */
    //8.compute the moving average; the formula is given in the comment above.
    //  The task's util_avg and the previous ewma are combined proportionally:
    //  with w=0.25 the new sample gets 1/4 of the weight and the old ewma 3/4.
    //  The smaller w is, the stronger the smoothing, i.e. the more slowly
    //  the ewma changes.
    ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
    ue.ewma  += last_ewma_diff;
    ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
    //9.the MSB marks whether util_avg has been updated since the last
    //  util_est update; util_avg has just been consumed here, so set the bit
    //  to mark it as "stale" until PELT updates util_avg again
    ue.enqueued |= UTIL_AVG_UNCHANGED;

    //10.write the newly computed util_est back into the task
    WRITE_ONCE(p->se.avg.util_est, ue);

    //11.tracing
    trace_sched_util_est_se_tp(&p->se);
}
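A quick userspace sketch of step 8, with hypothetical numbers, shows how the shift-based update converges when utilization drops (increases are handled by the UTIL_EST_FASTUP reset above):

#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT 2   /* w = 1/4 */

/* new_ewma = w * sample + (1 - w) * ewma, done with shifts as in step 8 */
static unsigned int ewma_step(unsigned int ewma, unsigned int sample)
{
    long last_ewma_diff = (long)sample - (long)ewma;

    ewma <<= UTIL_EST_WEIGHT_SHIFT;
    ewma  += last_ewma_diff;
    ewma >>= UTIL_EST_WEIGHT_SHIFT;
    return ewma;
}

int main(void)
{
    unsigned int ewma = 800;

    /* the task's util drops to 400 and stays there */
    for (int i = 0; i < 5; i++) {
        ewma = ewma_step(ewma, 400);
        printf("activation %d: ewma = %u\n", i + 1, ewma);  /* 700 625 568 ... */
    }
    return 0;
}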

 

 

2.3.1 within_margin - check whether |value| < margin

/*
 * Check if a (signed) value is within a specified (unsigned) margin,
 * based on the observation that:
 *
 *     abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
 *
 * NOTE: this only works when value + margin < INT_MAX.
 */
static inline bool within_margin(int value, int margin)
{
    return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}
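The bit trick folds the two comparisons of abs() into a single unsigned compare. A userspace sanity check against abs(), under the stated value + margin < INT_MAX constraint (UTIL_EST_MARGIN in the kernel is ~1% of SCHED_CAPACITY_SCALE, i.e. 1024 / 100):

#include <assert.h>
#include <stdlib.h>

static int within_margin(int value, int margin)
{
    return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}

int main(void)
{
    const int margin = 1024 / 100;   /* UTIL_EST_MARGIN */

    for (int x = -2 * margin; x <= 2 * margin; x++)
        assert(within_margin(x, margin) == (abs(x) < margin));
    return 0;
}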

 

2.4 When the UTIL_AVG_UNCHANGED flag is updated

When PELT updates an se's load statistics, the UTIL_AVG_UNCHANGED flag is updated as follows.

 

2.4.1 __update_load_avg_se - update an se's load statistics

int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    //1.update {load|runnable|util}_sum
    if (___update_load_sum(now, &se->avg,
                !!se->on_rq,
                se_runnable(se),
                cfs_rq->curr == se)) {

        //2.update {load|runnable|util}_avg
        ___update_load_avg(&se->avg, se_weight(se));

        //3.clear the UTIL_AVG_UNCHANGED flag in se->avg.util_est
        cfs_se_util_change(&se->avg);

        //4.tracing
        trace_pelt_se_tp(se);

        //5.return 1 if the load was updated, 0 otherwise
        return 1;
    }

    return 0;
}

 

2.4.2 cfs_se_util_change - clear the UTIL_AVG_UNCHANGED flag

Clear the MSB of util_est.enqueued to mark that util_avg has changed and that util_est is waiting to be updated at the next opportunity.

The MSB of util_est.enqueued marks whether the task's PELT util_avg has been updated since the task was enqueued. If it has not (e.g. a tiny task that runs for mere microseconds, while util_avg is updated at a minimum granularity of 1ms), there is no need to update util_est either, which reduces the update overhead.

static inline void cfs_se_util_change(struct sched_avg *avg)
{
    unsigned int enqueued;

    if (!sched_feat(UTIL_EST))
        return;

    /* Avoid store if the flag has been already reset */
    //1.if the bit is already 0, util_avg has already been marked as
    //  changed; nothing to do
    enqueued = avg->util_est.enqueued;
    if (!(enqueued & UTIL_AVG_UNCHANGED))
        return;

    /* Reset flag to report util_avg has been updated */
    //2.clear the MSB to mark that util_avg has changed
    enqueued &= ~UTIL_AVG_UNCHANGED;
    WRITE_ONCE(avg->util_est.enqueued, enqueued);
}
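Putting 2.3 and 2.4 together, the flag's round trip can be modeled in a few lines of userspace C (illustration only; the toy_* names are hypothetical):

#include <stdio.h>

#define UTIL_AVG_UNCHANGED 0x80000000u

static unsigned int enqueued = 700;        /* a task's util_est.enqueued */

/* cfs_se_util_change() analogue: PELT just updated util_avg */
static void toy_util_avg_updated(void)
{
    enqueued &= ~UTIL_AVG_UNCHANGED;
}

/* util_est_update() analogue: recompute only if util_avg changed */
static void toy_dequeue(unsigned int new_util)
{
    if (enqueued & UTIL_AVG_UNCHANGED) {
        printf("util_avg unchanged, skip\n");
        return;
    }
    enqueued = new_util | UTIL_AVG_UNCHANGED;
    printf("util_est <- %u, flag set\n", new_util);
}

int main(void)
{
    toy_dequeue(720);          /* flag clear -> updated, flag set */
    toy_dequeue(740);          /* short re-activation, no PELT tick -> skipped */
    toy_util_avg_updated();    /* a later PELT update clears the flag */
    toy_dequeue(740);          /* now the update happens again */
    return 0;
}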

 

3. Work done at enqueue time

3.1 enqueue_task_fair - add the task's util_est to the cfs_rq on enqueue

static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    ...

    /*
     * The code below (indirectly) updates schedutil which looks at
     * the cfs_rq utilization to select a frequency.
     * Let's add the task's estimated utilization to the cfs_rq's
     * estimated utilization, before we update schedutil.
     */
    //1.on enqueue, add the task's util_est onto the cfs_rq
    util_est_enqueue(&rq->cfs, p);

    ...
}

 

3.2 util_est_enqueue - accumulate on enqueue

static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
{
    unsigned int enqueued;

    //1.if the feature is not enabled, return immediately
    if (!sched_feat(UTIL_EST))
        return;

    /* Update root cfs_rq's estimated utilization */
    //2.when the task is added to the cfs_rq, add the task's util_est
    //  onto the cfs_rq
    enqueued = cfs_rq->avg.util_est.enqueued;
    enqueued += _task_util_est(p);
    WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);

    //3.tracing
    trace_sched_util_est_cfs_tp(cfs_rq);
}
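The cfs_rq side is therefore plain sum bookkeeping: enqueue adds, dequeue subtracts with a clamp at zero. A toy model (not kernel code):

#include <stdio.h>

static unsigned int rq_util_est;   /* models cfs_rq->avg.util_est.enqueued */

static void toy_util_est_enqueue(unsigned int task_est)
{
    rq_util_est += task_est;
}

static void toy_util_est_dequeue(unsigned int task_est)
{
    /* same clamp as min_t() in util_est_dequeue(): never underflow */
    rq_util_est -= task_est < rq_util_est ? task_est : rq_util_est;
}

int main(void)
{
    toy_util_est_enqueue(300);
    toy_util_est_enqueue(500);
    toy_util_est_dequeue(300);
    printf("cfs_rq util_est = %u\n", rq_util_est);   /* 500 */
    return 0;
}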

 

4. Interfaces that expose the util of a task or a cpu

Other modules use the interfaces below to obtain the util of a task or a cpu. The returned util already takes the estimated utilization into account: because util_est does not decay with sleep time, even a task that has slept for a long time reports the util it had before going to sleep, rather than a near-zero value. The implementations are as follows.

 

4.1 task_util_est - get a task's util

With util_est in place, task_util_est can be used in place of the plain task_util interface:

static inline unsigned long task_util_est(struct task_struct *p)
{
    //takes util_est into account
    return max(task_util(p), _task_util_est(p));
}

 

4.2 _task_util_est - get the task's util_est from before it slept

static inline unsigned long _task_util_est(struct task_struct *p)
{
    struct util_est ue = READ_ONCE(p->se.avg.util_est);

    //mask off the UTIL_AVG_UNCHANGED flag in the MSB, then take the max of
    //the last pre-sleep sample and the moving average
    return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
}
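Why take the max of the two fields? With hypothetical numbers, a mostly periodic task whose last activation happened to be unusually short still reports a sensible size, as the struct util_est comment promises ("absorb sporadic drops"). A userspace sketch:

#include <stdio.h>

#define UTIL_AVG_UNCHANGED 0x80000000u

struct util_est { unsigned int enqueued, ewma; };

static unsigned long toy_task_util_est(struct util_est ue)
{
    unsigned int enq = ue.enqueued & ~UTIL_AVG_UNCHANGED;

    return ue.ewma > enq ? ue.ewma : enq;
}

int main(void)
{
    /* a mostly periodic ~800-util task had one unusually short activation */
    struct util_est ue = { .enqueued = 400 | UTIL_AVG_UNCHANGED, .ewma = 700 };

    printf("%lu\n", toy_task_util_est(ue));   /* 700: the ewma absorbs the dip */
    return 0;
}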

 

4.3 cpu_util_cfs - get a cpu's util

Frequency scaling obtains the cpu's util through the interface below; this is where util_est takes effect on frequency selection (a sketch of how schedutil turns this value into a frequency follows the code). cpu_util_cfs is implemented as follows:

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
    unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

    if (sched_feat(UTIL_EST)) {
        util = max_t(unsigned long, util,
             READ_ONCE(rq->cfs.avg.util_est.enqueued));
    }

    return util;
}
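For context on how this value drives frequency selection: schedutil's get_next_freq() applies a 25% headroom to the util/capacity ratio, roughly next_freq = 1.25 * max_freq * util / max_capacity (see map_util_freq() in include/linux/sched/cpufreq.h). A simplified userspace sketch with hypothetical numbers:

#include <stdio.h>

/* same shape as the kernel's map_util_freq(): the 1.25 factor via freq >> 2 */
static unsigned long map_util_freq(unsigned long util,
                                   unsigned long freq, unsigned long cap)
{
    return (freq + (freq >> 2)) * util / cap;
}

int main(void)
{
    /* hypothetical CPU: 2 GHz max, capacity 1024, current util 512 */
    printf("target %lu kHz\n", map_util_freq(512, 2000000, 1024)); /* 1250000 */
    return 0;
}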

 

4.4 cpu_util - get a cpu's util

/**
 * Amount of capacity of a CPU that is (estimated to be) used by CFS tasks
 * @cpu: the CPU to get the utilization of
 *
 * The unit of the return value must be the one of capacity so we can compare
 * the utilization with the capacity of the CPU that is available for CFS task
 * (ie cpu_capacity).
 *
 * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on a CPU. It represents
 * the amount of utilization of a CPU in the range [0..capacity_orig] where
 * capacity_orig is the cpu_capacity available at the highest frequency
 * (arch_scale_freq_capacity()).
 * The utilization of a CPU converges towards a sum equal to or less than the
 * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
 * the running time on this CPU scaled by capacity_curr.
 *
 * The estimated utilization of a CPU is defined to be the maximum between its
 * cfs_rq.avg.util_avg and the sum of the estimated utilization of the tasks
 * currently RUNNABLE on that CPU.
 * This allows to properly represent the expected utilization of a CPU which
 * has just got a big task running since a long sleep period. At the same time
 * however it preserves the benefits of the "blocked utilization" in
 * describing the potential for other tasks waking up on the same CPU.
 *
 * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
 * higher than capacity_orig because of unfortunate rounding in
 * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
 * the average stabilizes with the new running time. We need to check that the
 * utilization stays within the range of [0..capacity_orig] and cap it if
 * necessary. Without utilization capping, a group could be seen as overloaded
 * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
 * available capacity. We allow utilization to overshoot capacity_curr (but not
 * capacity_orig) as it useful for predicting the capacity required after task
 * migrations (scheduler-driven DVFS).
 *
 * Return: the (estimated) utilization for the specified CPU
 */
static inline unsigned long cpu_util(int cpu)
{
    struct cfs_rq *cfs_rq;
    unsigned int util;

    //1.first read this cpu's util_avg
    cfs_rq = &cpu_rq(cpu)->cfs;
    util = READ_ONCE(cfs_rq->avg.util_avg);

    //2.take the max of util_avg and util_est
    if (sched_feat(UTIL_EST))
        util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

    //3.the returned util must not exceed the cpu's capacity
    return min_t(unsigned long, util, capacity_orig_of(cpu));
}

 

4.5 cpu_util_without - get a cpu's util, excluding a given task

/*
 * cpu_util_without: compute cpu utilization without any contributions from *p
 * @cpu: the CPU which utilization is requested
 * @p: the task which utilization should be discounted
 *
 * The utilization of a CPU is defined by the utilization of tasks currently
 * enqueued on that CPU as well as tasks which are currently sleeping after an
 * execution on that CPU.
 *
 * This method returns the utilization of the specified CPU by discounting the
 * utilization of the specified task, whenever the task is currently
 * contributing to the CPU utilization.
 */
static unsigned long cpu_util_without(
            int cpu,                  //which cpu's util to compute
            struct task_struct *p)    //which task's util to discount
{
    struct cfs_rq *cfs_rq;
    unsigned int util;

    /* Task has no contribution or is new */
    //1.if the task does not belong to this cpu, or has not accumulated any
    //  load yet, there is nothing to discount; return the cpu's util directly
    if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
        return cpu_util(cpu);

    //2.read this cpu's util_avg
    cfs_rq = &cpu_rq(cpu)->cfs;
    util = READ_ONCE(cfs_rq->avg.util_avg);

    /* Discount task's util from CPU's util */
    //3.subtract the task's util (clamped at zero; see lsub_positive below)
    lsub_positive(&util, task_util(p));

    /*
     * Covered cases:
     *
     * a) if *p is the only task sleeping on this CPU, then:
     *      cpu_util (== task_util) > util_est (== 0)
     *    and thus we return:
     *      cpu_util_without = (cpu_util - task_util) = 0
     *
     * b) if other tasks are SLEEPING on this CPU, which is now exiting
     *    IDLE, then:
     *      cpu_util >= task_util
     *      cpu_util > util_est (== 0)
     *    and thus we discount *p's blocked utilization to return:
     *      cpu_util_without = (cpu_util - task_util) >= 0
     *
     * c) if other tasks are RUNNABLE on that CPU and
     *      util_est > cpu_util
     *    then we use util_est since it returns a more restrictive
     *    estimation of the spare capacity on that CPU, by just
     *    considering the expected utilization of tasks already
     *    runnable on that CPU.
     *
     * Cases a) and b) are covered by the above code, while case c) is
     * covered by the following code when estimated utilization is
     * enabled.
     */
    //4.take util_est into account
    if (sched_feat(UTIL_EST)) {
        unsigned int estimated =
            READ_ONCE(cfs_rq->avg.util_est.enqueued);

        /*
         * Despite the following checks we still have a small window
         * for a possible race, when an execl's select_task_rq_fair()
         * races with LB's detach_task():
         *
         *   detach_task()
         *     p->on_rq = TASK_ON_RQ_MIGRATING;
         *     ---------------------------------- A
         *     deactivate_task()                   \
         *       dequeue_task()                     + RaceTime
         *         util_est_dequeue()              /
         *     ---------------------------------- B
         *
         * The additional check on "current == p" it's required to
         * properly fix the execl regression and it helps in further
         * reducing the chances for the above race.
         */
        if (unlikely(task_on_rq_queued(p) || current == p))
            lsub_positive(&estimated, _task_util_est(p));

        util = max(util, estimated);
    }

    /*
     * Utilization (estimated) can exceed the CPU capacity, thus let's
     * clamp to the maximum CPU capacity to ensure consistency with
     * the cpu_util call.
     */
    //5.the returned util must not exceed the cpu's capacity
    return min_t(unsigned long, util, capacity_orig_of(cpu));
}
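The lsub_positive() helper used above is not shown in this article; in kernel/sched/fair.c it is a subtract-and-clamp on a local variable, along these lines:

/*
 * Remove and clamp on negative, from a local variable.
 * Subtracts _val from *_ptr but never lets it drop below zero.
 */
#define lsub_positive(_ptr, _val) do {              \
    typeof(_ptr) ptr = (_ptr);                      \
    *ptr -= min_t(typeof(*ptr), *ptr, _val);        \
} while (0)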

 

 

 
