Estimated utilization (util_est) was proposed against the following three observations:

- From the analysis of PELT in the previous chapter, we know that the {load|runnable|util}_avg values in PELT are updated with a period of 1ms (strictly speaking, 1024us), so load accumulates rather slowly.
- CPU frequency scaling (schedutil) depends on the CPU's util value: a small util means a low frequency, and a slowly changing util means slow frequency scaling.
- Task migration between cores also depends on the util value: a task with a very small util is unlikely to be migrated to a big core.
Consider this scenario: a heavily loaded thread is about to go to sleep for some reason. Before sleeping it has accumulated a large util_avg, but while it sleeps no new load is added and the historical load is only decayed. After a long sleep, the formerly large util_avg has decayed to a very small value, so the thread wakes up with a tiny util_avg. And since PELT updates load slowly, this means on the one hand that the CPU frequency ramps up slowly, and on the other that the task is unlikely to be placed on a big core.

To remedy this shortcoming of PELT, the concept of estimated utilization was introduced. In essence, before a task is dequeued for sleep, the util_est of the task is recorded and its contribution is removed from the cfs_rq; when the task wakes up and is re-enqueued, this util_est is added back onto the cfs_rq.
So the util_est mechanism mainly does the following work:

- Add data structures to record the util_est of tasks and cfs_rqs.
- At dequeue, update the task's util_est and subtract it from the cfs_rq.
- At enqueue, add the task's util_est back onto the cfs_rq.
- Provide interfaces for obtaining the util of a task or CPU, with util_est taken into account.
Note: util_est records the util of a task/cpu before it went to sleep (more precisely, the EWMA-smoothed util), and util_est does not decay with sleep time. For example, if a task's util_est is 800 before it sleeps, then whether it sleeps for 10s or 100s, the util obtained via task_util_est is still 800.

The implementation of each part is analyzed in detail below.
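To make the contrast concrete, here is a minimal standalone sketch (not kernel code) comparing the PELT decay of util_avg with the non-decaying util_est across a sleep; it assumes only the standard PELT half-life of 32ms (compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
        double util_avg = 800.0;        /* util_avg at dequeue time */
        unsigned int util_est = 800;    /* saved at dequeue, never decayed while asleep */

        for (int sleep_ms = 0; sleep_ms <= 10000; sleep_ms += 2500) {
                /* PELT halves blocked utilization every 32ms */
                double decayed = util_avg * pow(0.5, sleep_ms / 32.0);

                /* task_util_est() returns max(util_avg, util_est) */
                double est = decayed > util_est ? decayed : util_est;
                printf("sleep=%5dms  util_avg=%8.3f  task_util_est=%.0f\n",
                       sleep_ms, decayed, est);
        }
        return 0;       /* util_avg collapses toward 0; task_util_est stays 800 */
}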
1. util_est - the data structure recording the estimated load of a task or cfs_rq
/**
 * struct util_est - Estimation utilization of FAIR tasks
 * @enqueued: instantaneous estimated utilization of a task/cpu
 * @ewma:     the Exponential Weighted Moving Average (EWMA)
 *            utilization of a task
 *
 * Support data structure to track an Exponential Weighted Moving Average
 * (EWMA) of a FAIR task's utilization. New samples are added to the moving
 * average each time a task completes an activation. Sample's weight is chosen
 * so that the EWMA will be relatively insensitive to transient changes to the
 * task's workload.
 *
 * The enqueued attribute has a slightly different meaning for tasks and cpus:
 * - task:   the task's util_avg at last task dequeue time
 * - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
 * Thus, the util_est.enqueued of a task represents the contribution on the
 * estimated utilization of the CPU where that task is currently enqueued.
 *
 * Only for tasks we track a moving average of the past instantaneous
 * estimated utilization. This allows to absorb sporadic drops in utilization
 * of an otherwise almost periodic task.
 *
 * The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
 * updates. When a task is dequeued, its util_est should not be updated if its
 * util_avg has not been updated in the meantime.
 * This information is mapped into the MSB bit of util_est.enqueued at dequeue
 * time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
 * for a task) it is safe to use MSB.
 */
struct util_est {
        /*
         * The estimated util of the task/cfs_rq before its last sleep.
         * The MSB flags whether util_avg has been updated since the last
         * util_est update: 0 = updated, 1 = not updated.
         * util_est is only refreshed after util_avg has been updated, for
         * two reasons:
         * a) for frequently running "small" tasks, refreshing util_est on
         *    every dequeue would add needless overhead
         * b) util_est is derived from util_avg; if util_avg has not changed,
         *    refreshing util_est is meaningless
         * Note: util_avg is updated at most once per PELT period
         * (1024us, roughly 1ms).
         */
        unsigned int                    enqueued;
        /* the exponentially weighted moving average */
        unsigned int                    ewma;
#define UTIL_EST_WEIGHT_SHIFT           2       /* weight: w = 1/4 */
#define UTIL_AVG_UNCHANGED              0x80000000
} __attribute__((__aligned__(sizeof(u64))));
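Because a task's util_est never exceeds 1024, the MSB of enqueued is free to double as the UTIL_AVG_UNCHANGED flag. Below is a minimal standalone sketch (not kernel code) of the flag's lifecycle: set at dequeue by util_est_update, cleared by PELT in cfs_se_util_change, and masked off by readers such as _task_util_est:

#include <stdio.h>

#define UTIL_AVG_UNCHANGED 0x80000000u

int main(void)
{
        unsigned int enqueued = 800;    /* util_est saved at dequeue */

        /* dequeue path: flag "util_avg unchanged since this util_est update" */
        enqueued |= UTIL_AVG_UNCHANGED;
        printf("stored    = 0x%08x\n", enqueued);

        /* readers always mask the flag off to get the real value */
        printf("value     = %u\n", enqueued & ~UTIL_AVG_UNCHANGED);

        /* PELT path (cfs_se_util_change): util_avg changed, clear the flag */
        enqueued &= ~UTIL_AVG_UNCHANGED;
        printf("unflagged = %u\n", enqueued);
        return 0;
}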
2. Work to be done at dequeue
From the introduction above, the dequeue path needs to do two things for a task:

- subtract the task's util_est from the cfs_rq;
- update the task's util_est.
2.1 dequeue_task_fair - subtract and update when a task is dequeued
/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        int task_sleep = flags & DEQUEUE_SLEEP;
        ...

        /* 1. first subtract this task's util_est from the cfs_rq */
        util_est_dequeue(&rq->cfs, p);

        ...

        /* 2. update the task's own util_est */
        util_est_update(&rq->cfs, p, task_sleep);
        ...
}
2.2 util_est_dequeue - subtract the task's util_est from the cfs_rq
static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
{
        unsigned int enqueued;

        /* 1. nothing to do if the feature is disabled */
        if (!sched_feat(UTIL_EST))
                return;

        /* Update root cfs_rq's estimated utilization */
        /* 2. the task is leaving the cfs_rq: remove its util_est contribution */
        enqueued  = cfs_rq->avg.util_est.enqueued;
        enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
        WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);

        /* 3. tracepoint */
        trace_sched_util_est_cfs_tp(cfs_rq);
}
2.3 util_est_update - update the task's util_est when it is dequeued
This function needs to do two things:

- store the task's current util in ue.enqueued;
- update ue.ewma.

(Removing the task's util_est from the cfs_rq was already done by util_est_dequeue in 2.2.)
static inline void util_est_update(struct cfs_rq *cfs_rq,
                                   struct task_struct *p,  /* task being dequeued */
                                   bool task_sleep)        /* is it about to sleep? */
{
        long last_ewma_diff, last_enqueued_diff;
        struct util_est ue;

        /* 1. nothing to do if the feature is disabled */
        if (!sched_feat(UTIL_EST))
                return;

        /*
         * Skip update of task's estimated utilization when the task has not
         * yet completed an activation, e.g. being migrated.
         */
        /*
         * 2. only update the task's util_est when it is about to sleep;
         *    if the task is merely being migrated to another core, skip it
         */
        if (!task_sleep)
                return;

        /*
         * If the PELT values haven't changed since enqueue time,
         * skip the util_est update.
         */
        /*
         * 3. skip if the task's util_avg has not been updated. The MSB of
         *    util_est.enqueued flags whether util_avg has changed; util_est
         *    is only refreshed after util_avg has. For small tasks that run
         *    for microseconds (util_avg is updated at a granularity of ~1ms),
         *    this avoids a pointless util_est update on every dequeue and
         *    keeps the overhead of maintaining util_est low.
         */
        ue = p->se.avg.util_est;
        if (ue.enqueued & UTIL_AVG_UNCHANGED)
                return;

        /* 4. remember the pre-update util_est; used by the margin check below */
        last_enqueued_diff = ue.enqueued;

        /*
         * Reset EWMA on utilization increases, the moving average is used only
         * to smooth utilization decreases.
         */
        /*
         * 5. take the task's current util_avg as the new util_est.
         *    UTIL_EST_FASTUP is the fast path: on an increase, the EWMA is
         *    reset directly to the latest util_avg.
         */
        ue.enqueued = task_util(p);
        if (sched_feat(UTIL_EST_FASTUP)) {
                if (ue.ewma < ue.enqueued) {
                        ue.ewma = ue.enqueued;
                        goto done;
                }
        }

        /*
         * Skip update of task's estimated utilization when its members are
         * already ~1% close to its last activation value.
         */
        /*
         * 6. ~1% margin: if the EWMA has barely moved, skip the EWMA update;
         *    if enqueued still moved by more than 1%, record it via "done"
         */
        last_ewma_diff = ue.enqueued - ue.ewma;
        last_enqueued_diff -= ue.enqueued;
        if (within_margin(last_ewma_diff, UTIL_EST_MARGIN)) {
                if (!within_margin(last_enqueued_diff, UTIL_EST_MARGIN))
                        goto done;

                return;
        }

        /*
         * To avoid overestimation of actual task utilization, skip updates if
         * we cannot grant there is idle time in this CPU.
         */
        /* 7. also skip the update when util_avg exceeds the CPU's capacity */
        if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
                return;

        /*
         * Update Task's estimated utilization
         *
         * When *p completes an activation we can consolidate another sample
         * of the task size. This is done by storing the current PELT value
         * as ue.enqueued and by using this value to update the Exponential
         * Weighted Moving Average (EWMA):
         *
         *  ewma(t) =  w *  task_util(p) + (1-w) * ewma(t-1)
         *          =  w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
         *          =  w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
         *          =  w * (      last_ewma_diff            ) +     ewma(t-1)
         *          =  w * (last_ewma_diff  +  ewma(t-1) / w)
         *
         * Where 'w' is the weight of new samples, which is configured to be
         * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
         */
        /*
         * 8. compute the moving average per the formula above: the task's
         *    util_avg and the previous EWMA are blended proportionally; with
         *    w = 0.25 the util_avg contributes 1/4 and the old EWMA 3/4.
         *    The smaller w is, the stronger the smoothing, i.e. the more
         *    gently the EWMA moves.
         */
        ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
        ue.ewma  += last_ewma_diff;
        ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
        /*
         * 9. set the MSB to flag that util_avg has not changed since this
         *    util_est update; PELT clears it the next time util_avg changes
         */
        ue.enqueued |= UTIL_AVG_UNCHANGED;

        /* 10. write the newly computed util_est back into the task */
        WRITE_ONCE(p->se.avg.util_est, ue);

        /* 11. tracepoint */
        trace_sched_util_est_se_tp(&p->se);
}
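To see the shift-based EWMA of step 8 in action, here is a standalone sketch (not kernel code; the input values are made up) of a periodic task whose util_avg drops from 800 to 400:

#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT   2       /* w = 1/4 */

/* same shift trick as util_est_update(): ewma = (4*ewma + diff) / 4 */
static unsigned int ewma_update(unsigned int ewma, unsigned int task_util)
{
        long last_ewma_diff = (long)task_util - (long)ewma;

        ewma <<= UTIL_EST_WEIGHT_SHIFT;
        ewma  += last_ewma_diff;
        ewma >>= UTIL_EST_WEIGHT_SHIFT;
        return ewma;
}

int main(void)
{
        unsigned int ewma = 800;

        /* utilization drops: the EWMA only smooths decreases */
        for (int i = 1; i <= 5; i++) {
                ewma = ewma_update(ewma, 400);
                printf("activation %d: ewma=%u\n", i, ewma);
        }
        return 0;       /* prints 700, 625, 568, 526, 494 */
}

An increase would instead reset the EWMA immediately under UTIL_EST_FASTUP, so the smoothing only applies on the way down.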
2.3.1 within_margin - check whether abs(value) < margin
/*
 * Check if a (signed) value is within a specified (unsigned) margin,
 * based on the observation that:
 *
 *      abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
 *
 * NOTE: this only works when value + margin < INT_MAX.
 */
static inline bool within_margin(int value, int margin)
{
        return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}
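A quick standalone check (not kernel code) of the identity this helper relies on, using UTIL_EST_MARGIN = SCHED_CAPACITY_SCALE / 100 = 10:

#include <stdio.h>
#include <stdlib.h>

static int within_margin(int value, int margin)
{
        return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}

int main(void)
{
        int margin = 10;        /* UTIL_EST_MARGIN = 1024 / 100 = 10 */

        /* for sufficiently negative values, value + margin - 1 wraps to a
         * huge unsigned number and fails the comparison, as required */
        for (int x = -12; x <= 12; x += 3)
                printf("x=%3d: within_margin=%d  abs(x)<margin=%d\n",
                       x, within_margin(x, margin), abs(x) < margin);
        return 0;       /* both columns agree for every x */
}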
2.4 When the UTIL_AVG_UNCHANGED flag is updated

When PELT updates an se's load, the UTIL_AVG_UNCHANGED flag is updated as well, as follows.
2.4.1 __update_load_avg_se - update the se's load
int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        /* 1. update {load|runnable|util}_sum */
        if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
                               cfs_rq->curr == se)) {

                /* 2. update {load|runnable|util}_avg */
                ___update_load_avg(&se->avg, se_weight(se));

                /* 3. clear the UTIL_AVG_UNCHANGED flag in se->avg.util_est */
                cfs_se_util_change(&se->avg);

                /* 4. tracepoint */
                trace_pelt_se_tp(se);

                /* 5. return 1 if the load was updated, 0 otherwise */
                return 1;
        }

        return 0;
}
2.4.2 cfs_se_util_change - clear the UTIL_AVG_UNCHANGED flag

Clears the MSB of util_est.enqueued to flag that util_avg has changed, so that util_est can be refreshed at the next opportunity.

The MSB of util_est.enqueued flags whether the task's PELT util_avg has been updated since the task was enqueued. If it has not (think of a task that runs for only microseconds, while util_avg is updated at a granularity of roughly 1ms), there is no need to update util_est either, which reduces the overhead of maintaining it.
static inline void cfs_se_util_change(struct sched_avg *avg)
{
        unsigned int enqueued;

        if (!sched_feat(UTIL_EST))
                return;

        /* Avoid store if the flag has been already reset */
        /* 1. if the flag is already clear, util_avg was already marked changed */
        enqueued = avg->util_est.enqueued;
        if (!(enqueued & UTIL_AVG_UNCHANGED))
                return;

        /* Reset flag to report util_avg has been updated */
        /* 2. clear the MSB to flag that util_avg has changed */
        enqueued &= ~UTIL_AVG_UNCHANGED;
        WRITE_ONCE(avg->util_est.enqueued, enqueued);
}
3. Work to be done at enqueue

3.1 enqueue_task_fair - add the task's util_est onto the cfs_rq at enqueue
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        ...

        /*
         * The code below (indirectly) updates schedutil which looks at
         * the cfs_rq utilization to select a frequency.
         * Let's add the task's estimated utilization to the cfs_rq's
         * estimated utilization, before we update schedutil.
         */
        /* 1. on enqueue, accumulate the task's util_est into the cfs_rq */
        util_est_enqueue(&rq->cfs, p);

        ...
}
3.2 util_est_enqueue - accumulate at enqueue
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
{
        unsigned int enqueued;

        /* 1. nothing to do if the feature is disabled */
        if (!sched_feat(UTIL_EST))
                return;

        /* Update root cfs_rq's estimated utilization */
        /* 2. the task is joining the cfs_rq: add its util_est contribution */
        enqueued  = cfs_rq->avg.util_est.enqueued;
        enqueued += _task_util_est(p);
        WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);

        /* 3. tracepoint */
        trace_sched_util_est_cfs_tp(cfs_rq);
}
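The enqueue/dequeue pair keeps cfs_rq->avg.util_est.enqueued equal to the sum over the runnable tasks. A toy round-trip (not kernel code; the values are made up) showing the add on enqueue and the clamped subtract on dequeue:

#include <stdio.h>

int main(void)
{
        unsigned int cfs_rq_enqueued = 0;       /* cfs_rq->avg.util_est.enqueued */
        unsigned int task_a = 300, task_b = 500;

        cfs_rq_enqueued += task_a;              /* util_est_enqueue(A) */
        cfs_rq_enqueued += task_b;              /* util_est_enqueue(B) */
        printf("A and B runnable: %u\n", cfs_rq_enqueued);      /* 800 */

        /* util_est_dequeue(A): subtract min(enqueued, task_a), never underflow,
         * like the min_t() clamp in the kernel */
        cfs_rq_enqueued -= (task_a < cfs_rq_enqueued) ? task_a : cfs_rq_enqueued;
        printf("A sleeps:         %u\n", cfs_rq_enqueued);      /* 500 */
        return 0;
}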
4. Interfaces for obtaining the util of a task or CPU

Other modules obtain the util of a task/CPU through the interfaces below. The returned util already takes estimated utilization into account: even if a task has slept for a long time, the util obtained right after it wakes is still its pre-sleep util rather than something tiny, because util_est does not decay with sleep time. The implementations are as follows.
4.1 task_util_est - get a task's util

With util_est available, task_util_est() can be used in place of the plain task_util() interface:
static inline unsigned long task_util_est(struct task_struct *p)
{
        /* takes util_est into account */
        return max(task_util(p), _task_util_est(p));
}
4.2 _task_util_est - get the util_est the task recorded before sleeping
static inline unsigned long _task_util_est(struct task_struct *p)
{
        struct util_est ue = READ_ONCE(p->se.avg.util_est);

        return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
}
4.3 cpu_util_cfs - get a CPU's util

During frequency scaling, the CPU's util is obtained through the interface below; the value returned already takes util_est into account, which is where the effect described above shows up.

cpu_util_cfs is implemented as follows:
static inline unsigned long cpu_util_cfs(struct rq *rq)
{
        unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

        if (sched_feat(UTIL_EST)) {
                util = max_t(unsigned long, util,
                             READ_ONCE(rq->cfs.avg.util_est.enqueued));
        }

        return util;
}
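As a hedged illustration of why this matters for frequency scaling: schedutil adds roughly 25% headroom on top of the utilization before mapping it onto the frequency range. The sketch below is a simplification (not the actual schedutil code; the max_cap and max_freq values are made up) showing how a wakeup with util_est = 800 requests a high frequency immediately, whereas the decayed util_avg alone would not:

#include <stdio.h>

static unsigned long next_freq(unsigned long util, unsigned long max_cap,
                               unsigned long max_freq)
{
        util += util >> 2;                      /* ~25% headroom */
        if (util > max_cap)
                util = max_cap;
        return max_freq * util / max_cap;       /* map util onto [0..max_freq] */
}

int main(void)
{
        /* decayed util_avg right after a long sleep vs. preserved util_est */
        printf("util=10  -> %lu kHz\n", next_freq(10,  1024, 2000000));
        printf("util=800 -> %lu kHz\n", next_freq(800, 1024, 2000000));
        return 0;
}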
4.4 cpu_util - get a CPU's util value
/**
 * Amount of capacity of a CPU that is (estimated to be) used by CFS tasks
 * @cpu: the CPU to get the utilization of
 *
 * The unit of the return value must be the one of capacity so we can compare
 * the utilization with the capacity of the CPU that is available for CFS task
 * (ie cpu_capacity).
 *
 * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on a CPU. It represents
 * the amount of utilization of a CPU in the range [0..capacity_orig] where
 * capacity_orig is the cpu_capacity available at the highest frequency
 * (arch_scale_freq_capacity()).
 * The utilization of a CPU converges towards a sum equal to or less than the
 * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
 * the running time on this CPU scaled by capacity_curr.
 *
 * The estimated utilization of a CPU is defined to be the maximum between its
 * cfs_rq.avg.util_avg and the sum of the estimated utilization of the tasks
 * currently RUNNABLE on that CPU.
 * This allows to properly represent the expected utilization of a CPU which
 * has just got a big task running since a long sleep period. At the same time
 * however it preserves the benefits of the "blocked utilization" in
 * describing the potential for other tasks waking up on the same CPU.
 *
 * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
 * higher than capacity_orig because of unfortunate rounding in
 * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
 * the average stabilizes with the new running time. We need to check that the
 * utilization stays within the range of [0..capacity_orig] and cap it if
 * necessary. Without utilization capping, a group could be seen as overloaded
 * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
 * available capacity. We allow utilization to overshoot capacity_curr (but not
 * capacity_orig) as it useful for predicting the capacity required after task
 * migrations (scheduler-driven DVFS).
 *
 * Return: the (estimated) utilization for the specified CPU
 */
static inline unsigned long cpu_util(int cpu)
{
        struct cfs_rq *cfs_rq;
        unsigned int util;

        /* 1. read this CPU's util_avg */
        cfs_rq = &cpu_rq(cpu)->cfs;
        util = READ_ONCE(cfs_rq->avg.util_avg);

        /* 2. take the max of util_avg and util_est */
        if (sched_feat(UTIL_EST))
                util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

        /* 3. the util must not exceed the CPU's capacity */
        return min_t(unsigned long, util, capacity_orig_of(cpu));
}
4.5 cpu_util_without - get a CPU's util, excluding a given task
/*
 * cpu_util_without: compute cpu utilization without any contributions from *p
 * @cpu: the CPU which utilization is requested
 * @p: the task which utilization should be discounted
 *
 * The utilization of a CPU is defined by the utilization of tasks currently
 * enqueued on that CPU as well as tasks which are currently sleeping after an
 * execution on that CPU.
 *
 * This method returns the utilization of the specified CPU by discounting the
 * utilization of the specified task, whenever the task is currently
 * contributing to the CPU utilization.
 */
static unsigned long cpu_util_without(int cpu,                 /* CPU whose util is wanted */
                                      struct task_struct *p)   /* task whose util to discount */
{
        struct cfs_rq *cfs_rq;
        unsigned int util;

        /* Task has no contribution or is new */
        /*
         * 1. if the task does not belong to this CPU, or has not accumulated
         *    any load yet, there is nothing to discount: return cpu_util()
         */
        if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
                return cpu_util(cpu);

        /* 2. read this CPU's util_avg */
        cfs_rq = &cpu_rq(cpu)->cfs;
        util = READ_ONCE(cfs_rq->avg.util_avg);

        /* Discount task's util from CPU's util */
        /* 3. subtract the task's util, clamped at zero */
        lsub_positive(&util, task_util(p));

        /*
         * Covered cases:
         *
         * a) if *p is the only task sleeping on this CPU, then:
         *      cpu_util (== task_util) > util_est (== 0)
         *    and thus we return:
         *      cpu_util_without = (cpu_util - task_util) = 0
         *
         * b) if other tasks are SLEEPING on this CPU, which is now exiting
         *    IDLE, then:
         *      cpu_util >= task_util
         *      cpu_util > util_est (== 0)
         *    and thus we discount *p's blocked utilization to return:
         *      cpu_util_without = (cpu_util - task_util) >= 0
         *
         * c) if other tasks are RUNNABLE on that CPU and
         *      util_est > cpu_util
         *    then we use util_est since it returns a more restrictive
         *    estimation of the spare capacity on that CPU, by just
         *    considering the expected utilization of tasks already
         *    runnable on that CPU.
         *
         * Cases a) and b) are covered by the above code, while case c) is
         * covered by the following code when estimated utilization is
         * enabled.
         */
        /* 4. take util_est into account */
        if (sched_feat(UTIL_EST)) {
                unsigned int estimated =
                        READ_ONCE(cfs_rq->avg.util_est.enqueued);

                /*
                 * Despite the following checks we still have a small window
                 * for a possible race, when an execl's select_task_rq_fair()
                 * races with LB's detach_task():
                 *
                 *   detach_task()
                 *     p->on_rq = TASK_ON_RQ_MIGRATING;
                 *     ---------------------------------- A
                 *     deactivate_task()                   \
                 *       dequeue_task()                     + RaceTime
                 *         util_est_dequeue()              /
                 *     ---------------------------------- B
                 *
                 * The additional check on "current == p" it's required to
                 * properly fix the execl regression and it helps in further
                 * reducing the chances for the above race.
                 */
                if (unlikely(task_on_rq_queued(p) || current == p))
                        lsub_positive(&estimated, _task_util_est(p));

                util = max(util, estimated);
        }

        /*
         * Utilization (estimated) can exceed the CPU capacity, thus let's
         * clamp to the maximum CPU capacity to ensure consistency with
         * the cpu_util call.
         */
        /* 5. the returned util must not exceed this CPU's capacity */
        return min_t(unsigned long, util, capacity_orig_of(cpu));
}
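To make cases a) and c) of the comment above concrete, here is a toy model (not kernel code; all numbers made up) of the discounting logic:

#include <stdio.h>

/* underflow-safe subtract, like the kernel's lsub_positive() macro */
static unsigned long lsub_positive(unsigned long a, unsigned long b)
{
        return a > b ? a - b : 0;
}

static unsigned long util_without(unsigned long cpu_util_avg,
                                  unsigned long cpu_util_est,
                                  unsigned long task_util,
                                  unsigned long task_util_est,
                                  int task_queued,
                                  unsigned long capacity)
{
        unsigned long util = lsub_positive(cpu_util_avg, task_util);
        unsigned long est  = cpu_util_est;

        if (task_queued)        /* only discount util_est while p is queued */
                est = lsub_positive(est, task_util_est);
        if (est > util)
                util = est;
        return util < capacity ? util : capacity;
}

int main(void)
{
        /* case a): p is the only (sleeping) contributor -> 0 */
        printf("case a: %lu\n", util_without(300, 0, 300, 300, 0, 1024));

        /* case c): runnable tasks make util_est the tighter bound -> 500 */
        printf("case c: %lu\n", util_without(200, 600, 100, 100, 1, 1024));
        return 0;
}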