Load Tracking (Part 3): The Introduction of util_est


Contents

  • 1. util_est - the data structure that records the estimated utilization of a task and a cfs_rq
  • 2. Work done at dequeue time
    • 2.1 dequeue_task_fair - subtract and update on task dequeue
    • 2.2 util_est_dequeue - subtract the task's util_est from the cfs_rq
    • 2.3 util_est_update - update the task's util_est on dequeue
      • 2.3.1 within_margin - check whether |value| < margin
    • 2.4 When the UTIL_AVG_UNCHANGED flag is updated
      • 2.4.1 __update_load_avg_se - update an se's load statistics
      • 2.4.2 cfs_se_util_change - clear the UTIL_AVG_UNCHANGED flag
  • 3. Work done at enqueue time
    • 3.1 enqueue_task_fair - add the task's util_est to the cfs_rq on enqueue
    • 3.2 util_est_enqueue - accumulate on enqueue
  • 4. Interfaces that expose the util of a task or a cpu
    • 4.1 task_util_est - get a task's util
    • 4.2 _task_util_est - get the task's util_est from before it slept
    • 4.3 cpu_util_cfs - get a cpu's util
    • 4.4 cpu_util - get a cpu's util
    • 4.5 cpu_util_without - get a cpu's util, excluding a given task


Estimated utilization (util_est) was proposed against the following background:

  • From the PELT analysis in the previous chapter, we know that the load signals {load|runnable|util}_avg are updated with a period of 1ms (strictly speaking, 1024us), so load accumulates rather slowly
  • CPU frequency scaling (schedutil) depends on the CPU's util value: a small util means a low frequency, and a slowly changing util means slow frequency scaling
  • Task migration also depends on util: a task with a very small util is unlikely to be migrated to a big core

 

Consider the following scenario: a heavily loaded thread is about to go to sleep for some reason. Before sleeping, it has already accumulated a large util_avg. During sleep no new load is added and the historical load is only decayed, so after a long sleep the once-large util_avg has decayed to a very small value and the thread wakes up with a tiny util_avg. Because PELT accumulates load slowly, this causes two problems: the CPU frequency ramps up very slowly, and the task has little chance of being placed on a big core.

To remedy this shortcoming of PELT, the concept of estimated utilization was introduced. In essence, before a task is dequeued to sleep, the util_est of the task is saved; when the task wakes up and is enqueued again, this util_est is added back onto the cfs_rq.

 

So the util_est mechanism mainly does the following work:

  • Provide a data structure to record the util_est of tasks and cfs_rqs
  • At dequeue, update the task's util_est and subtract it from the cfs_rq
  • At enqueue, add the task's util_est back onto the cfs_rq
  • Expose interfaces for querying the util of a task or a cpu, where the returned util takes util_est into account

 

Note: util_est records the util of a task/cpu from just before it went to sleep (more precisely, a moving average of that util), and util_est does not decay with sleep time. For example, if a task's util_est was 800 before sleeping, then whether it sleeps for 10s or 100s, task_util_est will still return 800 afterwards. The sketch below contrasts this with PELT's decay.
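For intuition, here is a rough userspace sketch (not kernel code) contrasting the two signals. It assumes only the well-known PELT property that a blocked task's util_avg halves every 32ms (the decay factor y satisfies y^32 = 0.5); fractional periods are ignored.

#include <stdio.h>

/* Rough model: a blocked task's util_avg halves every 32ms (y^32 = 0.5). */
static unsigned int pelt_decayed(unsigned int util, unsigned int sleep_ms)
{
    unsigned int halvings = sleep_ms / 32;   /* ignore fractional periods */
    return halvings >= 32 ? 0 : util >> halvings;
}

int main(void)
{
    unsigned int util_est = 800;             /* frozen at dequeue time */

    for (unsigned int ms = 0; ms <= 320; ms += 64)
        printf("slept %3ums: util_avg ~%3u, util_est still %u\n",
               ms, pelt_decayed(800, ms), util_est);
    return 0;
}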

 

The following sections analyze each part in detail.

 

1. util_est - the data structure that records the estimated utilization of a task and a cfs_rq

/**
 * struct util_est - Estimation utilization of FAIR tasks
 * @enqueued: instantaneous estimated utilization of a task/cpu
 * @ewma:     the Exponential Weighted Moving Average (EWMA)
 *            utilization of a task
 *
 * Support data structure to track an Exponential Weighted Moving Average
 * (EWMA) of a FAIR task's utilization. New samples are added to the moving
 * average each time a task completes an activation. Sample's weight is chosen
 * so that the EWMA will be relatively insensitive to transient changes to the
 * task's workload.
 *
 * The enqueued attribute has a slightly different meaning for tasks and cpus:
 * - task:   the task's util_avg at last task dequeue time
 * - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
 * Thus, the util_est.enqueued of a task represents the contribution on the
 * estimated utilization of the CPU where that task is currently enqueued.
 *
 * Only for tasks we track a moving average of the past instantaneous
 * estimated utilization. This allows to absorb sporadic drops in utilization
 * of an otherwise almost periodic task.
 *
 * The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
 * updates. When a task is dequeued, its util_est should not be updated if its
 * util_avg has not been updated in the meantime.
 * This information is mapped into the MSB bit of util_est.enqueued at dequeue
 * time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
 * for a task) it is safe to use MSB.
 */
struct util_est {
    //Records the estimated util of the task or cfs_rq before its last sleep.
    //The MSB marks whether util_avg has been updated: 0 = updated, 1 = not updated.
    //util_est is only updated after util_avg has been updated, for two reasons:
    //a) for frequently running "small" tasks, updating util_est on every dequeue
    //   would add unnecessary overhead
    //b) the util_est computation is based on util_avg; if util_avg has not been
    //   updated, updating util_est is pointless
    //Note: the minimum update period of util_avg is 1ms (1 PELT period = 1024us)
    unsigned int enqueued;

    //the exponentially weighted moving average
    unsigned int ewma;
#define UTIL_EST_WEIGHT_SHIFT  2            //weight of a new sample (w = 1/4)
#define UTIL_AVG_UNCHANGED     0x80000000
} __attribute__((__aligned__(sizeof(u64))));
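Since the flag lives in the MSB of enqueued while a task's util never exceeds 1024, the two can share the same field. A minimal userspace sketch of the packing (illustration only):

#include <assert.h>

#define UTIL_AVG_UNCHANGED 0x80000000u

int main(void)
{
    unsigned int enq = 800 | UTIL_AVG_UNCHANGED;  /* value + "not updated" flag */

    assert((enq & ~UTIL_AVG_UNCHANGED) == 800);   /* value: flag masked off */
    assert(enq & UTIL_AVG_UNCHANGED);             /* util_avg not updated yet */
    return 0;
}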

 

2. Work done at dequeue time

As stated in the introduction, a task's dequeue needs to do two things:

  • subtract the task's util_est from the cfs_rq
  • update the task's util_est

 

2.1 dequeue_task_fair - subtract and update on task dequeue

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    int task_sleep = flags & DEQUEUE_SLEEP;
    ...

    //1.first subtract this task's util_est from the cfs_rq
    util_est_dequeue(&rq->cfs, p);

    ...

    //2.update the task's util_est
    util_est_update(&rq->cfs, p, task_sleep);
    ...
}

 

2.2 util_est_dequeue - subtract the task's util_est from the cfs_rq

static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
{
    unsigned int enqueued;

    //1.if the feature is not enabled, return immediately
    if (!sched_feat(UTIL_EST))
        return;

    /* Update root cfs_rq's estimated utilization */
    //2.when the task is removed from the cfs_rq, subtract the task's util_est
    //  from the cfs_rq; min_t clamps the result at 0 so the unsigned sum can
    //  never underflow
    enqueued = cfs_rq->avg.util_est.enqueued;
    enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
    WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);

    //3.tracing
    trace_sched_util_est_cfs_tp(cfs_rq);
}

 

2.3 util_est_update - update the task's util_est on dequeue

This function updates the task's own util_est (the cfs_rq-side subtraction was already done in util_est_dequeue above) and needs to do two things:

  • store the task's current util in ue.enqueued;
  • update the ue.ewma moving average;

 

static inline void util_est_update(struct cfs_rq *cfs_rq,
                                   struct task_struct *p,  //the task being dequeued
                                   bool task_sleep)        //whether the task is about to sleep
{
    long last_ewma_diff, last_enqueued_diff;
    struct util_est ue;

    //1.if the feature is not enabled, return immediately
    if (!sched_feat(UTIL_EST))
        return;

    /*
     * Skip update of task's estimated utilization when the task has not
     * yet completed an activation, e.g. being migrated.
     */
    //2.only update the task's util_est when the task is about to sleep;
    //  if the task is merely being migrated to another cpu, skip the update
    if (!task_sleep)
        return;

    /*
     * If the PELT values haven't changed since enqueue time,
     * skip the util_est update.
     */
    //3.if the task's util_avg has not been updated, skip.
    //  The MSB of the task's util_est.enqueued marks whether util_avg has been
    //  updated. util_est is only recomputed after util_avg has changed, so for
    //  small tasks that run for mere microseconds (util_avg is updated at a
    //  minimum granularity of 1ms), util_est is not recomputed on every dequeue.
    //  This reduces the update overhead and avoids pointless work while
    //  util_avg is unchanged.
    ue = p->se.avg.util_est;
    if (ue.enqueued & UTIL_AVG_UNCHANGED)
        return;

    //4.remember the task's util_est before the update; it is used by the
    //  margin check below
    last_enqueued_diff = ue.enqueued;

    /*
     * Reset EWMA on utilization increases, the moving average is used only
     * to smooth utilization decreases.
     */
    //5.read the task's current util_avg as the new util_est sample.
    //  UTIL_EST_FASTUP is the fast path: on an increase, the ewma is reset
    //  directly to the latest util_avg
    ue.enqueued = task_util(p);
    if (sched_feat(UTIL_EST_FASTUP)) {
        if (ue.ewma < ue.enqueued) {
            ue.ewma = ue.enqueued;
            goto done;
        }
    }

    /*
     * Skip update of task's estimated utilization when its members are
     * already ~1% close to its last activation value.
     */
    //6.a ~1% margin: if the new sample is within 1% of the ewma, skip the
    //  ewma update (and skip the write-back entirely if enqueued also
    //  barely changed)
    last_ewma_diff = ue.enqueued - ue.ewma;
    last_enqueued_diff -= ue.enqueued;
    if (within_margin(last_ewma_diff, UTIL_EST_MARGIN)) {
        if (!within_margin(last_enqueued_diff, UTIL_EST_MARGIN))
            goto done;

        return;
    }

    /*
     * To avoid overestimation of actual task utilization, skip updates if
     * we cannot grant there is idle time in this CPU.
     */
    //7.if util_avg exceeds the cpu's capacity, do not update either
    if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
        return;

    /*
     * Update Task's estimated utilization
     *
     * When *p completes an activation we can consolidate another sample
     * of the task size. This is done by storing the current PELT value
     * as ue.enqueued and by using this value to update the Exponential
     * Weighted Moving Average (EWMA):
     *
     *  ewma(t) =  w *  task_util(p) + (1-w) * ewma(t-1)
     *          =  w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
     *          =  w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
     *          =  w * (      last_ewma_diff             ) +     ewma(t-1)
     *          =  w * (last_ewma_diff  +  ewma(t-1) / w)
     *
     * Where 'w' is the weight of new samples, which is configured to be
     * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
     */
    //8.compute the moving average; the formula is given in the comment above.
    //  The task's util_avg and the previous ewma are combined proportionally:
    //  with w=0.25 the new sample gets 1/4 of the weight and the old ewma 3/4.
    //  The smaller w is, the stronger the smoothing, i.e. the more slowly
    //  the ewma changes.
    ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
    ue.ewma  += last_ewma_diff;
    ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
    //9.the MSB marks whether util_avg has been updated since the last
    //  util_est update; util_avg has just been consumed here, so set the bit
    //  to mark it as "stale" until PELT updates util_avg again
    ue.enqueued |= UTIL_AVG_UNCHANGED;

    //10.write the newly computed util_est back into the task
    WRITE_ONCE(p->se.avg.util_est, ue);

    //11.tracing
    trace_sched_util_est_se_tp(&p->se);
}
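A quick userspace sketch of step 8, with hypothetical numbers, shows how the shift-based update converges when utilization drops (increases are handled by the UTIL_EST_FASTUP reset above):

#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT 2   /* w = 1/4 */

/* new_ewma = w * sample + (1 - w) * ewma, done with shifts as in step 8 */
static unsigned int ewma_step(unsigned int ewma, unsigned int sample)
{
    long last_ewma_diff = (long)sample - (long)ewma;

    ewma <<= UTIL_EST_WEIGHT_SHIFT;
    ewma  += last_ewma_diff;
    ewma >>= UTIL_EST_WEIGHT_SHIFT;
    return ewma;
}

int main(void)
{
    unsigned int ewma = 800;

    /* the task's util drops to 400 and stays there */
    for (int i = 0; i < 5; i++) {
        ewma = ewma_step(ewma, 400);
        printf("activation %d: ewma = %u\n", i + 1, ewma);  /* 700 625 568 ... */
    }
    return 0;
}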

 

 

2.3.1 within_margin - check whether |value| < margin

/*
 * Check if a (signed) value is within a specified (unsigned) margin,
 * based on the observation that:
 *
 *     abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
 *
 * NOTE: this only works when value + margin < INT_MAX.
 */
static inline bool within_margin(int value, int margin)
{
    return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}
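The bit trick folds the two comparisons of abs() into a single unsigned compare. A userspace sanity check against abs(), under the stated value + margin < INT_MAX constraint (UTIL_EST_MARGIN in the kernel is ~1% of SCHED_CAPACITY_SCALE, i.e. 1024 / 100):

#include <assert.h>
#include <stdlib.h>

static int within_margin(int value, int margin)
{
    return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}

int main(void)
{
    const int margin = 1024 / 100;   /* UTIL_EST_MARGIN */

    for (int x = -2 * margin; x <= 2 * margin; x++)
        assert(within_margin(x, margin) == (abs(x) < margin));
    return 0;
}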

 

2.4 When the UTIL_AVG_UNCHANGED flag is updated

When PELT updates an se's load statistics, the UTIL_AVG_UNCHANGED flag is updated as follows.

 

2.4.1 __update_load_avg_se - update an se's load statistics

int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    //1.update {load|runnable|util}_sum
    if (___update_load_sum(now, &se->avg,
                !!se->on_rq,
                se_runnable(se),
                cfs_rq->curr == se)) {

        //2.update {load|runnable|util}_avg
        ___update_load_avg(&se->avg, se_weight(se));

        //3.clear the UTIL_AVG_UNCHANGED flag in se->avg.util_est
        cfs_se_util_change(&se->avg);

        //4.tracing
        trace_pelt_se_tp(se);

        //5.return 1 if the load was updated, 0 otherwise
        return 1;
    }

    return 0;
}

 

2.4.2 cfs_se_util_change - clear the UTIL_AVG_UNCHANGED flag

Clear the MSB of util_est.enqueued to mark that util_avg has changed and that util_est is waiting to be updated at the next opportunity.

The MSB of util_est.enqueued marks whether the task's PELT util_avg has been updated since the task was enqueued. If it has not (e.g. a tiny task that runs for mere microseconds, while util_avg is updated at a minimum granularity of 1ms), there is no need to update util_est either, which reduces the update overhead.

static inline void cfs_se_util_change(struct sched_avg *avg)
{
    unsigned int enqueued;

    if (!sched_feat(UTIL_EST))
        return;

    /* Avoid store if the flag has been already reset */
    //1.if the bit is already 0, util_avg has already been marked as
    //  changed; nothing to do
    enqueued = avg->util_est.enqueued;
    if (!(enqueued & UTIL_AVG_UNCHANGED))
        return;

    /* Reset flag to report util_avg has been updated */
    //2.clear the MSB to mark that util_avg has changed
    enqueued &= ~UTIL_AVG_UNCHANGED;
    WRITE_ONCE(avg->util_est.enqueued, enqueued);
}
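Putting 2.3 and 2.4 together, the flag's round trip can be modeled in a few lines of userspace C (illustration only; the toy_* names are hypothetical):

#include <stdio.h>

#define UTIL_AVG_UNCHANGED 0x80000000u

static unsigned int enqueued = 700;        /* a task's util_est.enqueued */

/* cfs_se_util_change() analogue: PELT just updated util_avg */
static void toy_util_avg_updated(void)
{
    enqueued &= ~UTIL_AVG_UNCHANGED;
}

/* util_est_update() analogue: recompute only if util_avg changed */
static void toy_dequeue(unsigned int new_util)
{
    if (enqueued & UTIL_AVG_UNCHANGED) {
        printf("util_avg unchanged, skip\n");
        return;
    }
    enqueued = new_util | UTIL_AVG_UNCHANGED;
    printf("util_est <- %u, flag set\n", new_util);
}

int main(void)
{
    toy_dequeue(720);          /* flag clear -> updated, flag set */
    toy_dequeue(740);          /* short re-activation, no PELT tick -> skipped */
    toy_util_avg_updated();    /* a later PELT update clears the flag */
    toy_dequeue(740);          /* now the update happens again */
    return 0;
}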

 

3. Work done at enqueue time

3.1 enqueue_task_fair - add the task's util_est to the cfs_rq on enqueue

static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    ...

    /*
     * The code below (indirectly) updates schedutil which looks at
     * the cfs_rq utilization to select a frequency.
     * Let's add the task's estimated utilization to the cfs_rq's
     * estimated utilization, before we update schedutil.
     */
    //1.on enqueue, add the task's util_est onto the cfs_rq
    util_est_enqueue(&rq->cfs, p);

    ...
}

 

3.2 util_est_enqueue - accumulate on enqueue

static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
{
    unsigned int enqueued;

    //1.if the feature is not enabled, return immediately
    if (!sched_feat(UTIL_EST))
        return;

    /* Update root cfs_rq's estimated utilization */
    //2.when the task is added to the cfs_rq, add the task's util_est
    //  onto the cfs_rq
    enqueued = cfs_rq->avg.util_est.enqueued;
    enqueued += _task_util_est(p);
    WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);

    //3.tracing
    trace_sched_util_est_cfs_tp(cfs_rq);
}
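The cfs_rq side is therefore plain sum bookkeeping: enqueue adds, dequeue subtracts with a clamp at zero. A toy model (not kernel code):

#include <stdio.h>

static unsigned int rq_util_est;   /* models cfs_rq->avg.util_est.enqueued */

static void toy_util_est_enqueue(unsigned int task_est)
{
    rq_util_est += task_est;
}

static void toy_util_est_dequeue(unsigned int task_est)
{
    /* same clamp as min_t() in util_est_dequeue(): never underflow */
    rq_util_est -= task_est < rq_util_est ? task_est : rq_util_est;
}

int main(void)
{
    toy_util_est_enqueue(300);
    toy_util_est_enqueue(500);
    toy_util_est_dequeue(300);
    printf("cfs_rq util_est = %u\n", rq_util_est);   /* 500 */
    return 0;
}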

 

4. Interfaces that expose the util of a task or a cpu

Other modules use the interfaces below to obtain the util of a task or a cpu. The returned util already takes the estimated utilization into account: because util_est does not decay with sleep time, even a task that has slept for a long time reports the util it had before going to sleep, rather than a near-zero value. The implementations are as follows.

 

4.1 task_util_est - get a task's util

With util_est in place, task_util_est can be used in place of the plain task_util interface:

static inline unsigned long task_util_est(struct task_struct *p)
{
    //takes util_est into account
    return max(task_util(p), _task_util_est(p));
}

 

4.2 _task_util_est - get the task's util_est from before it slept

static inline unsigned long _task_util_est(struct task_struct *p)
{
    struct util_est ue = READ_ONCE(p->se.avg.util_est);

    //mask off the UTIL_AVG_UNCHANGED flag in the MSB, then take the max of
    //the last pre-sleep sample and the moving average
    return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
}
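Why take the max of the two fields? With hypothetical numbers, a mostly periodic task whose last activation happened to be unusually short still reports a sensible size, as the struct util_est comment promises ("absorb sporadic drops"). A userspace sketch:

#include <stdio.h>

#define UTIL_AVG_UNCHANGED 0x80000000u

struct util_est { unsigned int enqueued, ewma; };

static unsigned long toy_task_util_est(struct util_est ue)
{
    unsigned int enq = ue.enqueued & ~UTIL_AVG_UNCHANGED;

    return ue.ewma > enq ? ue.ewma : enq;
}

int main(void)
{
    /* a mostly periodic ~800-util task had one unusually short activation */
    struct util_est ue = { .enqueued = 400 | UTIL_AVG_UNCHANGED, .ewma = 700 };

    printf("%lu\n", toy_task_util_est(ue));   /* 700: the ewma absorbs the dip */
    return 0;
}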

 

4.3 cpu_util_cfs - get a cpu's util

Frequency scaling obtains the cpu's util through the interface below; this is where util_est takes effect on frequency selection (a sketch of how schedutil turns this value into a frequency follows the code). cpu_util_cfs is implemented as follows:

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
    unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

    if (sched_feat(UTIL_EST)) {
        util = max_t(unsigned long, util,
             READ_ONCE(rq->cfs.avg.util_est.enqueued));
    }

    return util;
}
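For context on how this value drives frequency selection: schedutil's get_next_freq() applies a 25% headroom to the util/capacity ratio, roughly next_freq = 1.25 * max_freq * util / max_capacity (see map_util_freq() in include/linux/sched/cpufreq.h). A simplified userspace sketch with hypothetical numbers:

#include <stdio.h>

/* same shape as the kernel's map_util_freq(): the 1.25 factor via freq >> 2 */
static unsigned long map_util_freq(unsigned long util,
                                   unsigned long freq, unsigned long cap)
{
    return (freq + (freq >> 2)) * util / cap;
}

int main(void)
{
    /* hypothetical CPU: 2 GHz max, capacity 1024, current util 512 */
    printf("target %lu kHz\n", map_util_freq(512, 2000000, 1024)); /* 1250000 */
    return 0;
}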

 

4.4 cpu_util - get a cpu's util

/**
 * Amount of capacity of a CPU that is (estimated to be) used by CFS tasks
 * @cpu: the CPU to get the utilization of
 *
 * The unit of the return value must be the one of capacity so we can compare
 * the utilization with the capacity of the CPU that is available for CFS task
 * (ie cpu_capacity).
 *
 * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on a CPU. It represents
 * the amount of utilization of a CPU in the range [0..capacity_orig] where
 * capacity_orig is the cpu_capacity available at the highest frequency
 * (arch_scale_freq_capacity()).
 * The utilization of a CPU converges towards a sum equal to or less than the
 * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
 * the running time on this CPU scaled by capacity_curr.
 *
 * The estimated utilization of a CPU is defined to be the maximum between its
 * cfs_rq.avg.util_avg and the sum of the estimated utilization of the tasks
 * currently RUNNABLE on that CPU.
 * This allows to properly represent the expected utilization of a CPU which
 * has just got a big task running since a long sleep period. At the same time
 * however it preserves the benefits of the "blocked utilization" in
 * describing the potential for other tasks waking up on the same CPU.
 *
 * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
 * higher than capacity_orig because of unfortunate rounding in
 * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
 * the average stabilizes with the new running time. We need to check that the
 * utilization stays within the range of [0..capacity_orig] and cap it if
 * necessary. Without utilization capping, a group could be seen as overloaded
 * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
 * available capacity. We allow utilization to overshoot capacity_curr (but not
 * capacity_orig) as it useful for predicting the capacity required after task
 * migrations (scheduler-driven DVFS).
 *
 * Return: the (estimated) utilization for the specified CPU
 */
static inline unsigned long cpu_util(int cpu)
{
    struct cfs_rq *cfs_rq;
    unsigned int util;

    //1.first read this cpu's util_avg
    cfs_rq = &cpu_rq(cpu)->cfs;
    util = READ_ONCE(cfs_rq->avg.util_avg);

    //2.take the max of util_avg and util_est
    if (sched_feat(UTIL_EST))
        util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

    //3.the returned util must not exceed the cpu's capacity
    return min_t(unsigned long, util, capacity_orig_of(cpu));
}

 

4.5 cpu_util_without - get a cpu's util, excluding a given task

/*
 * cpu_util_without: compute cpu utilization without any contributions from *p
 * @cpu: the CPU which utilization is requested
 * @p: the task which utilization should be discounted
 *
 * The utilization of a CPU is defined by the utilization of tasks currently
 * enqueued on that CPU as well as tasks which are currently sleeping after an
 * execution on that CPU.
 *
 * This method returns the utilization of the specified CPU by discounting the
 * utilization of the specified task, whenever the task is currently
 * contributing to the CPU utilization.
 */
static unsigned long cpu_util_without(
            int cpu,                  //which cpu's util to compute
            struct task_struct *p)    //which task's util to discount
{
    struct cfs_rq *cfs_rq;
    unsigned int util;

    /* Task has no contribution or is new */
    //1.if the task does not belong to this cpu, or has not accumulated any
    //  load yet, there is nothing to discount; return the cpu's util directly
    if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
        return cpu_util(cpu);

    //2.read this cpu's util_avg
    cfs_rq = &cpu_rq(cpu)->cfs;
    util = READ_ONCE(cfs_rq->avg.util_avg);

    /* Discount task's util from CPU's util */
    //3.subtract the task's util (clamped at zero; see lsub_positive below)
    lsub_positive(&util, task_util(p));

    /*
     * Covered cases:
     *
     * a) if *p is the only task sleeping on this CPU, then:
     *      cpu_util (== task_util) > util_est (== 0)
     *    and thus we return:
     *      cpu_util_without = (cpu_util - task_util) = 0
     *
     * b) if other tasks are SLEEPING on this CPU, which is now exiting
     *    IDLE, then:
     *      cpu_util >= task_util
     *      cpu_util > util_est (== 0)
     *    and thus we discount *p's blocked utilization to return:
     *      cpu_util_without = (cpu_util - task_util) >= 0
     *
     * c) if other tasks are RUNNABLE on that CPU and
     *      util_est > cpu_util
     *    then we use util_est since it returns a more restrictive
     *    estimation of the spare capacity on that CPU, by just
     *    considering the expected utilization of tasks already
     *    runnable on that CPU.
     *
     * Cases a) and b) are covered by the above code, while case c) is
     * covered by the following code when estimated utilization is
     * enabled.
     */
    //4.take util_est into account
    if (sched_feat(UTIL_EST)) {
        unsigned int estimated =
            READ_ONCE(cfs_rq->avg.util_est.enqueued);

        /*
         * Despite the following checks we still have a small window
         * for a possible race, when an execl's select_task_rq_fair()
         * races with LB's detach_task():
         *
         *   detach_task()
         *     p->on_rq = TASK_ON_RQ_MIGRATING;
         *     ---------------------------------- A
         *     deactivate_task()                   \
         *       dequeue_task()                     + RaceTime
         *         util_est_dequeue()              /
         *     ---------------------------------- B
         *
         * The additional check on "current == p" it's required to
         * properly fix the execl regression and it helps in further
         * reducing the chances for the above race.
         */
        if (unlikely(task_on_rq_queued(p) || current == p))
            lsub_positive(&estimated, _task_util_est(p));

        util = max(util, estimated);
    }

    /*
     * Utilization (estimated) can exceed the CPU capacity, thus let's
     * clamp to the maximum CPU capacity to ensure consistency with
     * the cpu_util call.
     */
    //5.the returned util must not exceed the cpu's capacity
    return min_t(unsigned long, util, capacity_orig_of(cpu));
}
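The lsub_positive() helper used above is not shown in this article; in kernel/sched/fair.c it is a subtract-and-clamp on a local variable, along these lines:

/*
 * Remove and clamp on negative, from a local variable.
 * Subtracts _val from *_ptr but never lets it drop below zero.
 */
#define lsub_positive(_ptr, _val) do {              \
    typeof(_ptr) ptr = (_ptr);                      \
    *ptr -= min_t(typeof(*ptr), *ptr, _val);        \
} while (0)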

 

 

 
