DumpStack

Slow down, and enjoy the technology!

The cpuidle subsystem, part 4: the governor layer

March 18, 2022 · 2234 views · 1 like · 2 comments


Table of Contents

  • I. Computing the latency the system can tolerate
    • 1.1 cpuidle_governor_latency_req - compute the latency the system can tolerate
  • II. menu
    • 2.1 Design principles
    • 2.2 menu_device - data structure
    • 2.3 menu_governor
    • 2.4 init_menu - menu initialization
    • 2.5 menu_enable_device - preparation before the governor runs
    • 2.6 menu_select - select a C state
      • 2.6.1 menu_update
      • 2.6.2 tick_nohz_get_sleep_length - estimate how long the current sleep will last
      • 2.6.3 nr_iowait_cpu - get the number of iowaiters on this cpu
      • 2.6.4 which_bucket - determine which bucket's correction factor to use
      • 2.6.5 get_typical_interval - compute the standard deviation
    • 2.7 menu_reflect - update the related data structures
  • III. ladder
    • 3.1 Data structures
      • 3.1.1 ladder_device_state - ladder's supplementary description of a C state
      • 3.1.2 ladder_device - ladder's supplementary description of a cpu device
    • 3.2 ladder_governor
    • 3.3 init_ladder - initialization
    • 3.4 ladder_enable_device - initialize the data structures when the governor is enabled
    • 3.5 ladder_select_state - pick a C state
      • 3.5.1 ladder_do_selection
    • 3.6 ladder_reflect - update the related data structures
  • IV. haltpoll
    • 4.1 A few global variables
    • 4.2 haltpoll_governor
    • 4.3 init_haltpoll - initialization
    • 4.4 haltpoll_enable_device - initialization when the governor is enabled
    • 4.5 haltpoll_select - pick an idle level
    • 4.6 haltpoll_reflect
      • 4.6.1 adjust_poll_limit
  • V. teo
    • 5.1 Data structures
      • 5.1.1 teo_idle_state - describes a C state
      • 5.1.2 teo_cpu - describes a cpu
    • 5.2 teo_governor
    • 5.3 teo_governor_init - initialization
    • 5.4 teo_enable_device - initialize the data when the governor is enabled
    • 5.5 teo_select - select an idle level
      • 5.5.1 teo_update
    • 5.6 teo_reflect - update the data

 

 

 

Note: for reasons of personal time and energy, this article is unfinished; I will fill in the rest when I get the chance!

 

The CPU provides multiple idle levels, which differ mainly in "power consumed while idle" and "latency when exiting idle". The driver layer defines these idle states (i.e. describes each state's power consumption and latency) and implements the entry and exit operations. The driver then hands this information to the governor layer, which decides, based on the actual workload, which idle level to use. On what basis? The CPU's load? The number of iowaiters in the system?

So the governor layer's job is to set the rules of the game and, following those rules, pick a suitable C state.

 

I. Computing the latency the system can tolerate

1.1 cpuidle_governor_latency_req - compute the latency the system can tolerate

This is computed via PM QoS; PM QoS itself will be covered in a separate article later.

drivers/cpuidle/governor.c (linux-5.10.61)

/**
 * cpuidle_governor_latency_req - Compute a latency constraint for CPU
 * @cpu: Target CPU
 */
s64 cpuidle_governor_latency_req(unsigned int cpu)
{
    struct device *device = get_cpu_device(cpu);

    //1. Compute the latency the system can tolerate via PM QoS
    //   (PM QoS itself will be covered in a later article)
    int device_req = dev_pm_qos_raw_resume_latency(device);
    int global_req = cpu_latency_qos_limit();

    //2. Take the minimum, i.e. the stricter of the two constraints
    if (device_req > global_req)
        device_req = global_req;

    //3. Convert from microseconds to nanoseconds
    return (s64)device_req * NSEC_PER_USEC;
}

 

II. menu

ladder is used on periodic-tick systems, while menu is used on tickless systems.

 

2.1 Design principles

Note: the following section is adapted from WoWo Tech (蜗窝科技) - respect to the master!

 

The governor's main responsibility is to choose a suitable idle state (in standard kernel terminology also called a C state, i.e. CPU state) based on how the system is running. The algorithm rests on the following two considerations:

 

1) The cost of switching (called the "minimum residency" in this article)

The point of entering a C state is to save power, but switching the CPU between a C state and the normal state itself costs power. That cost is captured by the target_residency field of struct cpuidle_state (the "minimum residency" in this article): when registering a cpuidle_state, the driver must state this switching cost explicitly, and given that cost, entering the C state only pays off if the CPU stays in it longer than a certain time.

Therefore, when selecting a C state, the governor must predict how long the CPU is expected to stay idle, compare that against the target_residency field of each candidate cpuidle_state, and only select C states satisfying "predicted residency > minimum residency (target_residency)".

 

2) The system's tolerance for "exit latency"

Among the candidate C states, "power" and "exit latency" are an irreconcilable pair: the deeper the idle level, the deeper the sleep and the lower the power, but the longer the exit latency when leaving idle; conversely, the shallower the sleep, the higher the power and the shorter the exit latency. The goal of power management is to save as much power as possible while keeping the exit latency within what the system can accept.

To describe "power" and "exit latency", struct cpuidle_state provides two more fields:

  • the CPU's power consumption in a given C state: the power_usage field of struct cpuidle_state
  • the latency of exiting that C state: the exit_latency field of struct cpuidle_state

 

So if we know "the latency the system can currently tolerate" (the latency_req variable in the code), we can pick, among all C states whose exit_latency is below latency_req, the one with the lowest power. The governor's problem thus reduces to obtaining "the latency the system can currently tolerate" - and that is exactly what PM QoS is good at.

 

Based on the considerations above, the menu governor's work reduces to two tasks:

  • predict, from the system's behavior, how long the CPU will stay in the C state (variable predicted_us)
  • obtain, via the PM QoS framework, "the latency the system can currently tolerate" (variable latency_req)

 

For task 1, menu proceeds as follows:

As mentioned above, the menu governor targets tickless systems. As a simplification, menu takes "the time until the next tick" (measured via the next timer event, stored in menu_device::next_timer_us) as the baseline residency estimate predicted_us.

Of course this baseline predicted_us is inaccurate: within that window, wakeup events other than the next timer event can fire at any moment. To improve the prediction, a "correction factor" is introduced (menu_device::correction_factor), equal to the ratio between past actual residency times and next_timer_us. For example, if actual wakeup events always arrived at half the predicted next-timer time, the factor would be 0.5. For better accuracy still, menu maintains the factor as a running (decaying) average.

Furthermore, the correction factor carries different weight for different magnitudes of next_timer_us. For instance, for next timer events of 50 ms and 500 ms that are both cut short by a wakeup at 10 ms, the correction clearly matters much more in the 500 ms case. Averaging the two together would hurt prediction accuracy, so correction factors are kept separately for different tiers of next_timer_us (the "buckets" described below).

Likewise, whether the cpu has outstanding io wait also changes how the correction factor behaves.

Based on these considerations, menu keeps a whole set of correction factors, the menu_device::correction_factor[BUCKETS] array - currently 12 of them - covering the different next_timer_us tiers with and without io wait.

Finally, in some cases the next_timer_us prediction is completely wrong, e.g. in the presence of fixed-period interrupts (audio and the like). For those, menu uses a different predictor: it computes the standard deviation of the last 8 residency times, and if that is below a threshold, uses the average of those 8 residency times as the prediction.

 

Task 2 is estimating "the exit latency the system can tolerate" (variable latency_req). menu combines two factors:

1) The CPU and DMA latency requirements obtained from PM QoS - a hard constraint set by the system.

2) A rule of thumb: the busier the system, the tighter its latency requirement (i.e. the shorter the acceptable exit latency). Combining the residency prediction from task 1 (predicted_us) with the current CPU load average and the number of iowaiters (obtained via get_iowait_load), menu computes another tolerable latency with the following empirical formula:

latency_req = predicted_us / (1 + 2 * loadavg + 10 * iowaiters)

 

This formula sets the allowed exit latency in proportion to the expected residency: the larger loadavg and iowaiters are, the tighter the exit-latency requirement, i.e. the lower the allowed exit latency.

Finally, latency_req is taken as the minimum of the two estimates.

 

The design rationale behind menu is documented in detail in the kernel source:

drivers/cpuidle/governors/menu.c (linux-5.10.61)

/*
 * Concepts and ideas behind the menu governor
 *
 * For the menu governor, there are 3 decision factors for picking a C
 * state:
 * 1) Energy break even point
 * 2) Performance impact
 * 3) Latency tolerance (from pmqos infrastructure)
 * These three factors are treated independently.
 *
 * Energy break even point
 * -----------------------
 * C state entry and exit have an energy cost, and a certain amount of time in
 * the C state is required to actually break even on this cost. CPUIDLE
 * provides us this duration in the "target_residency" field. So all that we
 * need is a good prediction of how long we'll be idle. Like the traditional
 * menu governor, we start with the actual known "next timer event" time.
 *
 * Since there are other source of wakeups (interrupts for example) than
 * the next timer event, this estimation is rather optimistic. To get a
 * more realistic estimate, a correction factor is applied to the estimate,
 * that is based on historic behavior. For example, if in the past the actual
 * duration always was 50% of the next timer tick, the correction factor will
 * be 0.5.
 *
 * menu uses a running average for this correction factor, however it uses a
 * set of factors, not just a single factor. This stems from the realization
 * that the ratio is dependent on the order of magnitude of the expected
 * duration; if we expect 500 milliseconds of idle time the likelihood of
 * getting an interrupt very early is much higher than if we expect 50 micro
 * seconds of idle time. A second independent factor that has big impact on
 * the actual factor is if there is (disk) IO outstanding or not.
 * (as a special twist, we consider every sleep longer than 50 milliseconds
 * as perfect; there are no power gains for sleeping longer than this)
 *
 * For these two reasons we keep an array of 12 independent factors, that gets
 * indexed based on the magnitude of the expected duration as well as the
 * "is IO outstanding" property.
 *
 * Repeatable-interval-detector
 * ----------------------------
 * There are some cases where "next timer" is a completely unusable predictor:
 * Those cases where the interval is fixed, for example due to hardware
 * interrupt mitigation, but also due to fixed transfer rate devices such as
 * mice.
 * For this, we use a different predictor: We track the duration of the last 8
 * intervals and if the standard deviation of these 8 intervals is below a
 * threshold value, we use the average of these intervals as prediction.
 *
 * Limiting Performance Impact
 * ---------------------------
 * C states, especially those with large exit latencies, can have a real
 * noticeable impact on workloads, which is not acceptable for most sysadmins,
 * and in addition, less performance has a power price of its own.
 *
 * As a general rule of thumb, menu assumes that the following heuristic
 * holds:
 * The busier the system, the less impact of C states is acceptable
 *
 * This rule-of-thumb is implemented using a performance-multiplier:
 * If the exit latency times the performance multiplier is longer than
 * the predicted duration, the C state is not considered a candidate
 * for selection due to a too high performance impact. So the higher
 * this multiplier is, the longer we need to be idle to pick a deep C
 * state, and thus the less likely a busy CPU will hit such a deep
 * C state.
 *
 * Two factors are used in determining this multiplier:
 * a value of 10 is added for each point of "per cpu load average" we have.
 * a value of 5 points is added for each process that is waiting for
 * IO on this CPU.
 * (these values are experimentally determined)
 *
 * The load average factor gives a longer term (few seconds) input to the
 * decision, while the iowait value gives a cpu local instantaneous input.
 * The iowait factor may look low, but realize that this is also already
 * represented in the system load average.
 *
 */

 

2.2 menu_device - data structure

struct menu_device {
    //Each time the CPU returns from a C state, the kernel calls the governor's
    //reflect hook so the governor can take the outcome of that state switch
    //into account and update its statistics, in order to pick a better C state
    //next time. For menu, the reflect hook only sets the needs_update flag and
    //the real update happens at the next select. Why not update directly in
    //reflect? Because reflect runs with interrupts disabled and must not do
    //much work: it has to finish quickly so interrupts can be re-enabled and
    //the interrupt handlers can run; see the detailed description below
    int needs_update;

    //whether the last wakeup from idle was delivered by the timer tick
    //(set in reflect via tick_nohz_idle_got_tick())
    int tick_wakeup;

    //time until the next timer event
    u64 next_timer_ns;

    //which correction factor to use when selecting a C state, i.e. the index
    //into correction_factor[] below
    unsigned int bucket;

    //the array of correction factors; there are BUCKETS of them, currently 12
    unsigned int correction_factor[BUCKETS];

    //see the description above: used to compute the standard deviation of the
    //recent residency times; the current code keeps 8 samples (INTERVALS)
    unsigned int intervals[INTERVALS];

    //index into the intervals array above
    int interval_ptr;
};

 

2.3 menu_governor

static struct cpuidle_governor menu_governor = {
    .name =     "menu",
    .rating =   20,
    .enable =   menu_enable_device,
    .select =   menu_select,
    .reflect =  menu_reflect,
};

 

2.4 init_menu - menu initialization

/**
 * init_menu - initializes the governor
 */
static int __init init_menu(void)
{
    return cpuidle_register_governor(&menu_governor);
}
postcore_initcall(init_menu);                   //runs during driver initialization

 

2.5 menu_enable_device - preparation before the governor runs

This hook is reached from cpuidle_register_device through cpuidle_curr_governor->enable. As the call paths of cpuidle_enable_device below show, it is invoked when cpus are registered at boot and on cpu hotplug, to initialize the governor state for that cpu; for menu this mainly means initializing the correction factors.

  • cpuidle_register_device -> cpuidle_enable_device
  • acpi_processor_hotplug -> cpuidle_enable_device
  • acpi_processor_power_state_has_changed -> cpuidle_enable_device

 

/**
 * menu_enable_device - scans a CPU's states and does setup
 * @drv: cpuidle driver
 * @dev: the CPU
 */
static int menu_enable_device(
            struct cpuidle_driver *drv,
            struct cpuidle_device *dev)
{
    //1. Each cpu has its own menu_device structure
    struct menu_device *data = &per_cpu(menu_devices, dev->cpu);
    int i;

    memset(data, 0, sizeof(struct menu_device));

    /*
     * if the correction factor is 0 (eg first time init or cpu hotplug
     * etc), we actually want to start out with a unity factor.
     */
    //2. Initialize the correction factors to unity
    //   (BUCKETS=12, RESOLUTION=1024, DECAY=8)
    for(i = 0; i < BUCKETS; i++)
        data->correction_factor[i] = RESOLUTION * DECAY;

    return 0;
}

 

2.6 menu_select - select a C state

The call path is shown below; in cpuidle_idle_call, a suitable C state is first selected and then entered:

  • cpuidle_idle_call -> cpuidle_select

 

/**
 * menu_select - selects the next idle state to enter
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 * @stop_tick: indication on whether or not to stop the tick
 */
static int menu_select(
            struct cpuidle_driver *drv,
            struct cpuidle_device *dev,
            bool *stop_tick)                    //out: whether the tick should be stopped
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    //1. Compute the "exit latency the system can tolerate", latency_req,
    //   via PM QoS
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

    //predicted_us is "how long we expect to stay in idle"
    unsigned int predicted_us;
    u64 predicted_ns;
    u64 interactivity_req;
    unsigned long nr_iowaiters;
    ktime_t delta_next;
    int i, idx;

    //2. If needs_update is set, call menu_update to refresh the statistics.
    //   This accounts for the previous idle period. The work would naturally
    //   belong in reflect, but reflect runs with interrupts disabled and must
    //   not do much, so it was moved here
    if (data->needs_update) {
        menu_update(drv, dev);
        data->needs_update = 0;
    }

    /* determine the expected residency time, round up */
    //3. Get the expected length of this sleep, i.e. the time until the next
    //   tick arrives
    data->next_timer_ns = tick_nohz_get_sleep_length(&delta_next);

    //4. Get the number of iowaiters on this cpu
    nr_iowaiters = nr_iowait_cpu(dev->cpu);

    //5. From the expected sleep time and the iowaiter count, decide which
    //   bucket's correction factor to use
    data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);

    //6. If any of the following holds, return 0 (state 0 means no real idle):
    //  a) the driver has only one state, the normal state, so there is no idle
    //     state to enter
    //  b) the computed "tolerable exit latency" is 0, i.e. the system is too
    //     constrained to enter idle
    //  c) even the shallowest real idle state fails the checks, so sleeping is
    //     not worthwhile:
    //    i) switching cost: cpuidle_driver::state[] is ordered from shallow to
    //       deep; earlier entries sleep more shallowly, exit faster and burn
    //       more power. If the expected sleep data->next_timer_ns is below the
    //       "minimum residency" of state 1, no idle state pays for itself
    //   ii) exit latency: if the "tolerable exit latency" is below state 1's
    //       exit latency (deeper states exit even more slowly), idle is also
    //       unsuitable
    //  iii) !dev->states_usage[0].disable means state 0 is still enabled, i.e.
    //       it can actually be used as the fallback; if state 0 were disabled,
    //       this shortcut could not return it
    if (unlikely(drv->state_count <= 1 || latency_req == 0) ||
        ((data->next_timer_ns < drv->states[1].target_residency_ns ||
          latency_req < drv->states[1].exit_latency_ns) &&
         !dev->states_usage[0].disable)) {
        /*
         * In this case state[0] will be used no matter what, so return
         * it right away and keep the tick running if state[0] is a
         * polling one.
         */
        //6.1 Reaching here means real idle is unsuitable; use drv->states[0]
        //    (the normal state). Whether the tick keeps running there depends
        //    on CPUIDLE_FLAG_POLLING
        *stop_tick = !(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
        return 0;
    }

    /* Round up the result for half microseconds. */
    //7. Compute the expected idle residency predicted_us: multiply
    //   next_timer_ns by the correction factor, handling overflow and precision
    //  a) RESOLUTION (1024) is the fixed-point scale of the correction factor
    //  b) DECAY (8) is the decay divisor of the running average; a factor of
    //     RESOLUTION * DECAY means "the prediction was perfect"
    //  c) adding (RESOLUTION * DECAY * NSEC_PER_USEC) / 2 rounds to nearest
    predicted_us = div_u64(data->next_timer_ns *
                           data->correction_factor[data->bucket] +
                           (RESOLUTION * DECAY * NSEC_PER_USEC) / 2,
                           RESOLUTION * DECAY * NSEC_PER_USEC);

    /* Use the lowest expected idle interval to pick the idle state. */
    //8. Call get_typical_interval to check for a fixed-period pattern by
    //   computing the standard deviation of the last 8 residency times; if one
    //   exists, its average caps predicted_us
    predicted_ns = (u64)min(predicted_us,
                            get_typical_interval(data, predicted_us)) *
                            NSEC_PER_USEC;

    if (tick_nohz_tick_stopped()) {
        /*
         * If the tick is already stopped, the cost of possible short
         * idle duration misprediction is much higher, because the CPU
         * may be stuck in a shallow idle state for a long time as a
         * result of it. In that case say we might mispredict and use
         * the known time till the closest timer event for the idle
         * state selection.
         */
        if (predicted_ns < TICK_NSEC)
            predicted_ns = delta_next;
    } else {
        /*
         * Use the performance multiplier and the user-configurable
         * latency_req to determine the maximum exit latency.
         */
        //9. Compute the other "latency the system can tolerate": estimate a
        //   tolerance from predicted_ns and the load situation (iowaiters),
        //   then take the minimum with latency_req
        interactivity_req = div64_u64(predicted_ns,
                                      performance_multiplier(nr_iowaiters));
        if (latency_req > interactivity_req)
            latency_req = interactivity_req;
    }

    //10. At this point both the "tolerable exit latency" latency_req and the
    //    "expected idle residency" predicted_ns are known; use them to pick an
    //    idle level

    /*
     * Find the idle state with the lowest power while satisfying
     * our constraints.
     */
    idx = -1;

    //cpuidle_driver->states[] is ordered from shallow to deep; the first
    //entries sleep most shallowly and burn the most power, so this loop walks
    //from high power to low power
    for (i = 0; i < drv->state_count; i++) {
        struct cpuidle_state *s = &drv->states[i];

        //skip disabled C states
        if (dev->states_usage[i].disable)
            continue;

        if (idx == -1)
            idx = i; /* first enabled state */

        //this condition means the expected idle residency does not satisfy the
        //"minimum residency" of the state being examined
        if (s->target_residency_ns > predicted_ns) {
            /*
             * Use a physical idle state, not busy polling, unless
             * a timer is going to trigger soon enough.
             */

            //idx still holds the state accepted so far. If that state is a
            //polling one, prefer this physical idle state over busy polling,
            //as long as its exit latency and target residency still fit
            if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
                s->exit_latency_ns <= latency_req &&
                s->target_residency_ns <= data->next_timer_ns) {
                predicted_ns = s->target_residency_ns;
                idx = i;
                break;
            }
            if (predicted_ns < TICK_NSEC)
                break;

            if (!tick_nohz_tick_stopped()) {
                /*
                 * If the state selected so far is shallow,
                 * waking up early won't hurt, so retain the
                 * tick in that case and let the governor run
                 * again in the next iteration of the loop.
                 */
                predicted_ns = drv->states[idx].target_residency_ns;
                break;
            }

            /*
             * If the state selected so far is shallow and this
             * state's target residency matches the time till the
             * closest timer event, select this one to avoid getting
             * stuck in the shallow one for too long.
             */
            if (drv->states[idx].target_residency_ns < TICK_NSEC &&
                s->target_residency_ns <= delta_next)
                idx = i;

            return idx;
        }

        //this checks the exit latency: if the state's exit latency exceeds
        //what the system tolerates, stop here, since deeper states have even
        //longer exit latencies
        if (s->exit_latency_ns > latency_req)
            break;

        idx = i;
    }

    //no usable C state found; fall back to 0, the normal state
    if (idx == -1)
        idx = 0; /* No states enabled. Must use 0. */

    /*
     * Don't stop the tick if the selected state is a polling one or if the
     * expected idle duration is shorter than the tick period length.
     */
    //at this point a candidate state has been found
    if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
         predicted_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
        *stop_tick = false;

        if (idx > 0 && drv->states[idx].target_residency_ns > delta_next) {
            /*
             * The tick is not going to be stopped and the target
             * residency of the state to be returned is not within
             * the time until the next timer event including the
             * tick, so try to correct that.
             */
            for (i = idx - 1; i >= 0; i--) {
                if (dev->states_usage[i].disable)
                    continue;

                idx = i;
                if (drv->states[i].target_residency_ns <= delta_next)
                    break;
            }
        }
    }

    return idx;
}

 

2.6.1 menu_update

Each time the CPU returns from a C state, the kernel calls the governor's reflect hook so the governor can take the outcome of this state switch into account and update some statistics, in order to pick a better C state next time.

For menu, the reflect hook merely sets the needs_update flag; the real update is deferred to the next select, which calls menu_update to do the bookkeeping.

 

/**
 * menu_update - attempts to guess what happened after entry
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 */
static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    //1. Get the C state this cpu entered last time
    int last_idx = dev->last_state_idx;
    struct cpuidle_state *target = &drv->states[last_idx];
    u64 measured_ns;
    unsigned int new_factor;

    //2. As the comment below explains, we now work out how long the cpu
    //   actually slept in the last C state. Whatever method is used, the
    //   measurement includes the exit latency, because what matters is when
    //   the wakeup began, not when it completed; the exit latency is therefore
    //   subtracted, and if the measured time is smaller than the exit latency,
    //   we assume the state was never actually entered
    /*
     * Try to figure out how much time passed between entry to low
     * power state and occurrence of the wakeup event.
     *
     * If the entered idle state didn't support residency measurements,
     * we use them anyway if they are short, and if long,
     * truncate to the whole expected time.
     *
     * Any measured amount of time will include the exit latency.
     * Since we are interested in when the wakeup begun, not when it
     * was completed, we must subtract the exit latency. However, if
     * the measured amount of time is less than the exit latency,
     * assume the state was never reached and the exit latency is 0.
     */

    //3. This measurement feeds the next prediction: if the prediction is off,
    //   the wrong C state gets chosen (predict too short and the sleep is too
    //   shallow; predict too long and it is too deep)
    if (data->tick_wakeup && data->next_timer_ns > TICK_NSEC) {
        //tick_wakeup was set in reflect, right after waking from idle; true
        //means the wakeup was delivered by the tick.
        //next_timer_ns was set in the previous menu_select, i.e. before
        //entering idle: the computed "time until the next tick".
        //Both conditions holding means the nohz code expected no event within
        //the tick boundary, yet the cpu was woken by a tick after all.
        //MAX_INTERESTING below is 50 ms
        /*
         * The nohz code said that there wouldn't be any events within
         * the tick boundary (if the tick was stopped), but the idle
         * duration predictor had a differing opinion. Since the CPU
         * was woken up by a tick (that wasn't stopped after all), the
         * predictor was not quite right, so assume that the CPU could
         * have been idle long (but not forever) to help the idle
         * duration predictor do a better job next time.
         */
        measured_ns = 9 * MAX_INTERESTING / 10;
    } else if ((drv->states[last_idx].flags & CPUIDLE_FLAG_POLLING) &&
               dev->poll_time_limit) {
        /*
         * The CPU exited the "polling" state due to a time limit, so
         * the idle duration prediction leading to the selection of that
         * state was inaccurate. If a better prediction had been made,
         * the CPU might have been woken up from idle by the next timer.
         * Assume that to be the case.
         */
        //the cpu left the polling state because poll_time_limit expired, so
        //the prediction that selected it was inaccurate; with a better
        //prediction the cpu might have been woken by the next timer, so
        //assume exactly that
        measured_ns = data->next_timer_ns;
    } else {
        /* measured value */
        //the time actually spent in idle last time
        measured_ns = dev->last_residency_ns;

        /* Deduct exit latency */
        //subtract the exit latency
        if (measured_ns > 2 * target->exit_latency_ns)
            measured_ns -= target->exit_latency_ns;
        else
            measured_ns /= 2;
    }

    /* Make sure our coefficients do not exceed unity */
    //clamp the measured time to the predicted next-timer interval, so the
    //correction factor can never exceed unity
    if (measured_ns > data->next_timer_ns)
        measured_ns = data->next_timer_ns;

    /* Update our correction ratio */
    //compute the new correction factor: take the old one and decay it by
    //1/DECAY
    new_factor = data->correction_factor[data->bucket];
    new_factor -= new_factor / DECAY;

    //then add this sleep's contribution, normalized to RESOLUTION (1024);
    //MAX_INTERESTING is 50 ms
    if (data->next_timer_ns > 0 && measured_ns < MAX_INTERESTING)
        new_factor += div64_u64(RESOLUTION * measured_ns, data->next_timer_ns);
    else
        /*
         * we were idle so long that we count it as a perfect
         * prediction
         */
        new_factor += RESOLUTION;

    /*
     * We don't want 0 as factor; we always want at least
     * a tiny bit of estimated time. Fortunately, due to rounding,
     * new_factor will stay nonzero regardless of measured_us values
     * and the compiler can eliminate this test as long as DECAY > 1.
     */
    //DECAY is 8 in menu, so this branch is compiled away; it would only
    //matter if DECAY were 1
    if (DECAY == 1 && unlikely(new_factor == 0))
        new_factor = 1;

    //store the updated correction factor back into its bucket
    data->correction_factor[data->bucket] = new_factor;

    /* update the repeating-pattern data */
    //record this measured residency (in us) for the repeating-pattern detector
    data->intervals[data->interval_ptr++] = ktime_to_us(measured_ns);

    //wrap the index
    if (data->interval_ptr >= INTERVALS)
        data->interval_ptr = 0;
}

 

2.6.2 tick_nohz_get_sleep_length - estimate how long the current sleep will last

//With CONFIG_NO_HZ_COMMON defined, the tick may be stopped
#ifdef CONFIG_NO_HZ_COMMON
/**
 * tick_nohz_get_sleep_length - return the expected length of the current sleep
 * @delta_next: duration until the next event if the tick cannot be stopped
 *
 * Called from power state control code with interrupts disabled
 */
ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
    struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
    struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
    int cpu = smp_processor_id();

    /*
     * The idle entry time is expected to be a sufficient approximation of
     * the current time at this point.
     */
    //use the idle-entry timestamp as "now"
    ktime_t now = ts->idle_entrytime;
    ktime_t next_event;

    WARN_ON_ONCE(!ts->inidle);

    //compute how long we may sleep in idle: dev->next_event is when the next
    //event fires, and its difference from now is the possible sleep length
    *delta_next = ktime_sub(dev->next_event, now);

    //can the tick be stopped at all?
    if (!can_stop_idle_tick(cpu, ts))
        return *delta_next;

    //get the timestamp of the next tick event into next_event
    next_event = tick_nohz_next_event(ts, cpu);
    if (!next_event)
        return *delta_next;

    /*
     * If the next highres timer to expire is earlier than next_event, the
     * idle governor needs to know that.
     */
    //account for high-resolution timers: the next event may fire even earlier
    next_event = min_t(u64, next_event,
                       hrtimer_next_event_without(&ts->sched_timer));

    //recompute the possible idle sleep length
    return ktime_sub(next_event, now);
}
#else
//Without CONFIG_NO_HZ_COMMON, the tick arrives strictly periodically
static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
    *delta_next = TICK_NSEC;
    return *delta_next;
}
#endif

 

2.6.3 nr_iowait_cpu - get the number of iowaiters on this cpu

unsigned long nr_iowait_cpu(int cpu)
{
    return atomic_read(&cpu_rq(cpu)->nr_iowait);
}

 

2.6.4 which_bucket - determine which bucket's correction factor to use

static inline int which_bucket(
            u64 duration_ns,                //expected sleep length
            unsigned long nr_iowaiters)     //number of iowaiters on this cpu
{
    int bucket = 0;

    /*
     * We keep two groups of stats; one with no
     * IO pending, one with.
     * This allows us to calculate
     * E(duration)|iowait
     */
    //BUCKETS is 12; per the comment above, buckets 0-5 serve the no-iowaiter
    //case and buckets 6-11 the iowaiter case
    if (nr_iowaiters)
        bucket = BUCKETS/2;

    //within each group, the bucket is chosen by order of magnitude of the
    //expected sleep: <10us, <100us, <1ms, <10ms, <100ms, and everything above
    if (duration_ns < 10ULL * NSEC_PER_USEC)
        return bucket;
    if (duration_ns < 100ULL * NSEC_PER_USEC)
        return bucket + 1;
    if (duration_ns < 1000ULL * NSEC_PER_USEC)
        return bucket + 2;
    if (duration_ns < 10000ULL * NSEC_PER_USEC)
        return bucket + 3;
    if (duration_ns < 100000ULL * NSEC_PER_USEC)
        return bucket + 4;
    return bucket + 5;
}

 

2.6.5 get_typical_interval - compute the standard deviation

As the comment below explains, get_typical_interval tries to detect repeating patterns by tracking the last 8 intervals and checking whether the standard deviation of that set of points is below a threshold; if it is, the average of the 8 points is used as the estimate.

/*
 * Try detecting repeating patterns by keeping track of the last 8
 * intervals, and checking if the standard deviation of that set
 * of points is below a threshold. If it is... then use the
 * average of these 8 points as the estimated value.
 */
static unsigned int get_typical_interval(
            struct menu_device *data,
            unsigned int predicted_us)
{
    int i, divisor;
    unsigned int min, max, thresh, avg;
    uint64_t sum, variance;

    //upper cutoff; samples above it are discarded as outliers
    thresh = INT_MAX; /* Discard outliers above this value */

again:

    /* First calculate the average of past intervals */
    min = UINT_MAX;
    max = 0;
    sum = 0;
    divisor = 0;

    //this loop accumulates the sum and the sample count for the average
    //below; INTERVALS is 8, i.e. a window of 8 samples
    for (i = 0; i < INTERVALS; i++) {
        unsigned int value = data->intervals[i];
        if (value <= thresh) {
            sum += value;
            divisor++;

            //these two ifs track the maximum and minimum
            if (value > max)
                max = value;

            if (value < min)
                min = value;
        }
    }

    /*
     * If the result of the computation is going to be discarded anyway,
     * avoid the computation altogether.
     */
    //as the comment says: if even the minimum sample is larger than the
    //baseline prediction passed in, the result would be discarded anyway
    //(the caller takes the min), so skip the work and return
    if (min >= predicted_us)
        return UINT_MAX;

    //average; a shift is cheaper when all 8 samples were used
    if (divisor == INTERVALS)
        avg = sum >> INTERVAL_SHIFT;
    else
        avg = div_u64(sum, divisor);

    /* Then try to determine variance */
    //now compute the variance over the same samples
    variance = 0;
    for (i = 0; i < INTERVALS; i++) {
        unsigned int value = data->intervals[i];
        if (value <= thresh) {
            int64_t diff = (int64_t)value - avg;
            variance += diff * diff;
        }
    }
    if (divisor == INTERVALS)
        variance >>= INTERVAL_SHIFT;
    else
        do_div(variance, divisor);

    /*
     * The typical interval is obtained when standard deviation is
     * small (stddev <= 20 us, variance <= 400 us^2) or standard
     * deviation is small compared to the average interval (avg >
     * 6*stddev, avg^2 > 36*variance). The average is smaller than
     * UINT_MAX aka U32_MAX, so computing its square does not
     * overflow a u64. We simply reject this candidate average if
     * the standard deviation is greater than 715 s (which is
     * rather unlikely).
     *
     * Use this result only if there is no timer to wake us up sooner.
     */
    //accept the average when the spread is small enough: either the variance
    //is at most 400 us^2 (stddev <= 20 us), or the average exceeds six
    //standard deviations (avg^2 > 36*variance) with at least 3/4 of the
    //samples surviving
    if (likely(variance <= U64_MAX/36)) {
        if ((((u64)avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3))
                            || variance <= 400) {
            return avg;
        }
    }

    /*
     * If we have outliers to the upside in our distribution, discard
     * those by setting the threshold to exclude these outliers, then
     * calculate the average and standard deviation again. Once we get
     * down to the bottom 3/4 of our samples, stop excluding samples.
     *
     * This can deal with workloads that have long pauses interspersed
     * with sporadic activity with a bunch of short pauses.
     */
    //if high outliers remain, lower the threshold to exclude them and try
    //again; once fewer than 3/4 of the samples survive, give up. This copes
    //with workloads of long pauses interspersed with bursts of short ones
    if ((divisor * 4) <= INTERVALS * 3)
        return UINT_MAX;

    thresh = max - 1;
    goto again;
}

 

2.7 menu_reflect - update the related data structures

Each time the CPU returns from a C state, the kernel calls the governor's reflect hook so the governor can take the outcome of this state switch into account and update some statistics, in order to pick a better C state next time.

For menu, the reflect hook merely sets the needs_update flag; the real update happens at the next select. Why not do the update directly in reflect? Because reflect runs with interrupts disabled, so it must do as little work as possible and return quickly so interrupts can be re-enabled and serviced.

The function is reached via the call path below: in the second half of cpuidle_idle_call, on the way out of idle, cpuidle_reflect is called to let the governor do its own bookkeeping:

  • cpuidle_idle_call -> cpuidle_reflect

 

/**

* menu_reflect - records that data structures need update

* @dev: the CPU

* @index: the index of actual entered state

*

* NOTE: it's important to be fast here because this operation will add to

* the overall exit latency.

*/

static void menu_reflect(

            struct cpuidle_device *dev,

            int index)                        //the C state just exited

{

    //1. Each CPU has its own menu_device; this also shows that each CPU can

    // enter idle on its own, unaffected by the other CPUs in its cluster

    struct menu_device *data = this_cpu_ptr(&menu_devices);

 

    dev->last_state_idx = index;

 

    //2. Set data->needs_update so the statistics are refreshed on the next

    // select. Why defer the update instead of doing it right here? As the

    // idle-loop analysis showed, local irqs are disabled before the wfi

    // instruction puts the CPU to sleep; when an interrupt wakes it out of

    // wfi, local irqs are still disabled, so the CPU cannot jump to the

    // vector table yet - it keeps executing until irqs are re-enabled and

    // only then runs the handler. reflect therefore runs with interrupts

    // off and must return as fast as possible so the interrupt can be serviced

    data->needs_update = 1;

 

    //3. The return value below is bool: true means a tick occurred while idle,
    // i.e. the wakeup came from the scheduler tick

    data->tick_wakeup = tick_nohz_idle_got_tick();

}

 

三、ladder

ladder means exactly that - a ladder: the governor climbs up and down the idle-state rungs one step at a time

ladder is meant for systems with a periodic timer tick; menu is meant for tickless (NO_HZ) systems

 

3.1 Data structures

The core layer already provides cpuidle_device and cpuidle_state to describe a CPU and a C state, but for the ladder governor these two structures are not enough, so ladder defines the two structures below as supplements

 

3.1.1 ladder_device_state - ladder's supplementary per-C-state data

struct ladder_device_state {

    struct {

        u32 promotion_count;                //default 4

        u32 demotion_count;                //default 1

 

        //The actual sleep time in this C state must fall inside

        //[demotion_time_ns, promotion_time_ns]; outside that range the state

        //no longer fits and a deeper or shallower one must be chosen.

        //promotion_time_ns is the promotion threshold: an actual sleep longer

        //than this means the chosen C state is too shallow, so pick a deeper one.

        //demotion_time_ns is the demotion threshold: an actual sleep shorter

        //than this means the chosen C state is too deep, so pick a shallower one.

        //Note: the sleep time here excludes the exit latency - it is the time

        //genuinely spent asleep

        u64 promotion_time_ns;

        u64 demotion_time_ns;

    } threshold;

 

    struct {

        int promotion_count;

        int demotion_count;

    } stats;

};

 

3.1.2 ladder_device - ladder's supplementary per-CPU data

struct ladder_device {

    struct ladder_device_state states[CPUIDLE_STATE_MAX];

};

 

3.2 ladder_governor

static struct cpuidle_governor ladder_governor = {

    .name =        "ladder",

    .rating =    10,

    .enable =    ladder_enable_device,

    .select =    ladder_select_state,

    .reflect =    ladder_reflect,

};

 

3.3 init_ladder - initialization

/**

* init_ladder - initializes the governor

*/

static int __init init_ladder(void)

{

    /*

     * When NO_HZ is disabled, or when booting with nohz=off, the ladder

     * governor is better so give it a higher rating than the menu

     * governor.

     */

    if (!tick_nohz_enabled)

        ladder_governor.rating = 25;

 

    return cpuidle_register_governor(&ladder_governor);

}

postcore_initcall(init_ladder);                //registered during driver initialization

 

3.4 ladder_enable_device - initialize the data structures when the governor is enabled

/**

* ladder_enable_device - setup for the governor

* @drv: cpuidle driver

* @dev: the CPU

*/

static int ladder_enable_device(

            struct cpuidle_driver *drv,

            struct cpuidle_device *dev)

{

    int i;

 

    //1. Determine the first C state to search from.

    // If the state at idx=0 carries CPUIDLE_FLAG_POLLING, it is a busy-wait

    // rather than a true idle state, so it is skipped here.

    //first_idx below restricts the search to genuine idle states

    int first_idx = drv->states[0].flags & CPUIDLE_FLAG_POLLING ? 1 : 0;

    struct ladder_device *ldev = &per_cpu(ladder_devices, dev->cpu);

    struct ladder_device_state *lstate;

    struct cpuidle_state *state;

 

    dev->last_state_idx = first_idx;

 

    //2. Walk all the C states, resetting counters and seeding thresholds

    for (i = first_idx; i < drv->state_count; i++) {

        state = &drv->states[i];

        lstate = &ldev->states[i];

 

        lstate->stats.promotion_count = 0;

        lstate->stats.demotion_count = 0;

 

        lstate->threshold.promotion_count = PROMOTION_COUNT;        //4

        lstate->threshold.demotion_count = DEMOTION_COUNT;            //1

 

        if (i < drv->state_count - 1)

            lstate->threshold.promotion_time_ns = state->exit_latency_ns;

        if (i > first_idx)

            lstate->threshold.demotion_time_ns = state->exit_latency_ns;

    }

 

    return 0;

}

 

3.5 ladder_select_state - pick a C state

/**

* ladder_select_state - selects the next state to enter

* @drv: cpuidle driver

* @dev: the CPU

* @dummy: not used

*/

static int ladder_select_state(

            struct cpuidle_driver *drv,

            struct cpuidle_device *dev,

            bool *dummy)

{

    struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);

    struct ladder_device_state *last_state;

 

    //1. The idle state this CPU was in last time

    int last_idx = dev->last_state_idx;

 

    //2. CPUIDLE_FLAG_POLLING means the CPU busy-waits instead of truly idling;

    // first_idx below restricts the search to genuine idle states

    int first_idx = drv->states[0].flags & CPUIDLE_FLAG_POLLING ? 1 : 0;

 

    //3. Compute the exit latency this CPU can tolerate

    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

    s64 last_residency;

 

    /* Special case when user has set very strict latency requirement */

    //4. latency_req == 0 means the system is in a demanding state that

    // cannot tolerate idle at all; return 0, i.e. the shallowest state

    if (unlikely(latency_req == 0)) {

        ladder_do_selection(dev, ldev, last_idx, 0);

        return 0;

    }

 

    //5. The idle state this CPU was in last time

    last_state = &ldev->states[last_idx];

 

    //6. Compute how long the CPU actually slept last time (excluding exit latency):

    // dev->last_residency_ns is the total time this CPU spent in the idle

    // state, measured after wakeup, so it includes the exit latency below;

    // drv->states[last_idx].exit_latency_ns is that C state's exit latency;

    // the difference is the time genuinely spent asleep

    last_residency = dev->last_residency_ns - drv->states[last_idx].exit_latency_ns;

 

    /* consider promotion */

    //7. Promotion, i.e. enter a deeper sleep state, one rung at a time.

    // The four conditions below are:

    // a) last_idx is not the deepest C state, so there is room to go up

    // b) the candidate state, last_idx + 1, is not disabled

    // c) the last genuine sleep time exceeded last_idx's promotion threshold

    // d) the candidate state's exit latency satisfies the tolerated latency

    if (last_idx < drv->state_count - 1 &&

     !dev->states_usage[last_idx + 1].disable &&

     last_residency > last_state->threshold.promotion_time_ns &&

     drv->states[last_idx + 1].exit_latency_ns <= latency_req) {

 

        //7.1 First update the statistics for the last sleep: bump the promotion count, clear the demotion count

        last_state->stats.promotion_count++;

        last_state->stats.demotion_count = 0;

 

        //7.2 As this if shows, not every promotion request is honored:

        // the promotion only happens once enough requests have accumulated -

        // by default, 4 consecutive promotion requests

        if (last_state->stats.promotion_count >= last_state->threshold.promotion_count) {

            ladder_do_selection(dev, ldev, last_idx, last_idx + 1);

            return last_idx + 1;

        }

    }

 

    //8. Reaching here means the promotion above did not happen: perhaps no

    // promotion was needed, the consecutive-request count has not reached the

    // threshold, or the target state's exit latency exceeds what the system

    // can currently tolerate

 

    /* consider demotion */

    //9. Demotion. Several conditions apply, but first handle two urgent cases:

    // a) the C state at last_idx has been disabled, or

    // b) its exit latency no longer satisfies the system's pressing constraint

    if (last_idx > first_idx &&

     (dev->states_usage[last_idx].disable ||

     drv->states[last_idx].exit_latency_ns > latency_req)) {

        int i;

 

        //9.1 Search downwards for a C state whose exit latency meets the

        // system's constraint. As the for loop shows, if no state qualifies,

        // first_idx is used - the shallowest genuine idle state

        for (i = last_idx - 1; i > first_idx; i--) {

            if (drv->states[i].exit_latency_ns <= latency_req)

                break;

        }

 

        //9.2 Reaching here, a suitable shallower C state has been found

        ladder_do_selection(dev, ldev, last_idx, i);

        return i;

    }

 

    //10. Keep considering demotion; the two conditions below mean:

    // a) last_idx is not the first C state, so there is room to go down

    // b) the last genuine sleep time fell below the demotion threshold

    if (last_idx > first_idx &&

     last_residency < last_state->threshold.demotion_time_ns) {

 

        //10.1 update the statistics

        last_state->stats.demotion_count++;

        last_state->stats.promotion_count = 0;

 

        //10.2 As this if shows, not every demotion request is honored -

        // only after enough requests accumulate is the demotion performed.

        // But the default threshold is 1, so a single request is enough

        if (last_state->stats.demotion_count >= last_state->threshold.demotion_count) {

            ladder_do_selection(dev, ldev, last_idx, last_idx - 1);

            return last_idx - 1;

        }

    }

 

    /* otherwise remain at the current state */

    //11. Otherwise stay where we are

    return last_idx;

}
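The promote/demote walk above boils down to a small state machine. The sketch below is my own simplification (the hypothetical `ladder_step` helper keeps a single counter pair rather than per-state counters, and omits the disable/latency checks); it only illustrates the "four consecutive promotion requests, one demotion request" behaviour:

```c
#include <assert.h>
#include <stdint.h>

#define PROMOTION_COUNT 4   /* consecutive long sleeps needed to go deeper */
#define DEMOTION_COUNT  1   /* short sleeps needed to go shallower */

struct lstate {
    int promotion_count;        /* consecutive "slept too long" events */
    int demotion_count;         /* consecutive "woke too early" events */
    int64_t promotion_time_ns;  /* sleep longer than this: too shallow */
    int64_t demotion_time_ns;   /* sleep shorter than this: too deep */
};

/* Return the new state index given the last one and the measured residency
 * (exit latency already subtracted), moving one rung at a time. */
static int ladder_step(struct lstate *s, int last_idx, int max_idx,
                       int64_t residency_ns)
{
    if (last_idx < max_idx && residency_ns > s->promotion_time_ns) {
        s->demotion_count = 0;
        if (++s->promotion_count >= PROMOTION_COUNT) {
            s->promotion_count = 0;
            return last_idx + 1;
        }
    } else if (last_idx > 0 && residency_ns < s->demotion_time_ns) {
        s->promotion_count = 0;
        if (++s->demotion_count >= DEMOTION_COUNT) {
            s->demotion_count = 0;
            return last_idx - 1;
        }
    }
    return last_idx;
}
```

The asymmetric defaults make the ladder climb slowly but bail out quickly: going deeper risks a costly wakeup, so it takes four hits, while going shallower is cheap and happens on the first miss.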

 

3.5.1 ladder_do_selection

/**

* ladder_do_selection - prepares private data for a state change

* @ldev: the ladder device

* @old_idx: the current state index

* @new_idx: the new target state index

*/

static inline void ladder_do_selection(

            struct cpuidle_device *dev,

            struct ladder_device *ldev,

            int old_idx,

            int new_idx)

{

    //Statistics bookkeeping:

    //after a promotion or demotion completes, clear the counters

    ldev->states[old_idx].stats.promotion_count = 0;

    ldev->states[old_idx].stats.demotion_count = 0;

    dev->last_state_idx = new_idx;

}

 

3.6 ladder_reflect - update the bookkeeping

For ladder this mainly updates last_state_idx, i.e. which idle state the CPU was in last time

/**

* ladder_reflect - update the correct last_state_idx

* @dev: the CPU

* @index: the index of actual state entered

*/

static void ladder_reflect(struct cpuidle_device *dev, int index)

{

    //In reflect, just record which C state the CPU woke from

    if (index > 0)

        dev->last_state_idx = index;

}

 

四、haltpoll

haltpoll defines no extra data structures

 

4.1 A few global variables (module parameters)

static unsigned int guest_halt_poll_ns __read_mostly = 200000;

module_param(guest_halt_poll_ns, uint, 0644);

 

/* division factor to shrink halt_poll_ns */

static unsigned int guest_halt_poll_shrink __read_mostly = 2;

module_param(guest_halt_poll_shrink, uint, 0644);

 

/* multiplication factor to grow per-cpu poll_limit_ns */

static unsigned int guest_halt_poll_grow __read_mostly = 2;

module_param(guest_halt_poll_grow, uint, 0644);

 

/* value in us to start growing per-cpu halt_poll_ns */

static unsigned int guest_halt_poll_grow_start __read_mostly = 50000;

module_param(guest_halt_poll_grow_start, uint, 0644);

 

/* allow shrinking guest halt poll */

static bool guest_halt_poll_allow_shrink __read_mostly = true;

module_param(guest_halt_poll_allow_shrink, bool, 0644);

 

4.2 haltpoll_governor

static struct cpuidle_governor haltpoll_governor = {

    .name =            "haltpoll",

    .rating =        9,

    .enable =        haltpoll_enable_device,

    .select =        haltpoll_select,

    .reflect =        haltpoll_reflect,

};

 

4.3 init_haltpoll - initialization

static int __init init_haltpoll(void)

{

    if (kvm_para_available())

        return cpuidle_register_governor(&haltpoll_governor);

 

    return 0;

}

postcore_initcall(init_haltpoll);            //registered during driver initialization

 

4.4 haltpoll_enable_device - initialization when the governor is enabled

/**

* haltpoll_enable_device - scans a CPU's states and does setup

* @drv: cpuidle driver

* @dev: the CPU

*/

static int haltpoll_enable_device(

            struct cpuidle_driver *drv,

            struct cpuidle_device *dev)

{

    dev->poll_limit_ns = 0;

 

    return 0;

}

 

4.5 haltpoll_select - pick an idle state

This function only ever returns 0 or 1: the dedicated haltpoll driver (drivers/cpuidle/cpuidle-haltpoll.c) defines exactly two states, where state[0] is the poll state and state[1] enters the default halt path

/**

* haltpoll_select - selects the next idle state to enter

* @drv: cpuidle driver containing state data

* @dev: the CPU

* @stop_tick: indication on whether or not to stop the tick

*/

static int haltpoll_select(

            struct cpuidle_driver *drv,

            struct cpuidle_device *dev,

            bool *stop_tick)                //out: whether the chosen C state should stop the tick

{

    //1. The wakeup latency the system can tolerate

    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

 

    //2. latency_req == 0 means the system is in a very demanding state:

    // return 0 to use the C state at idx=0, i.e. poll instead of really

    // sleeping, and do not stop the tick either

    if (!drv->state_count || latency_req == 0) {

        *stop_tick = false;

        return 0;

    }

 

    //3. poll_limit_ns == 0 means no poll budget has built up yet, so polling
    // would be pointless: go straight to state 1 (halt)

    if (dev->poll_limit_ns == 0)

        return 1;

 

    /* Last state was poll? */

    //4. As the comment above says, the last state was poll - in the haltpoll
    // driver, idx=0 is the polling state rather than a plain running mode

    if (dev->last_state_idx == 0) {

        /* Halt if no event occurred on poll window */

        if (dev->poll_time_limit == true)

            return 1;

 

        *stop_tick = false;

        /* Otherwise, poll again */

        return 0;

    }

 

    *stop_tick = false;

    /* Last state was halt: poll */

    return 0;

}
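So haltpoll simply flips between the driver's two states. The hypothetical `haltpoll_pick` helper below sketches that decision under my reading of the code (the latency fast path and tick handling are omitted):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of haltpoll_select's core decision: only two states exist,
 * 0 = poll, 1 = halt. poll_time_limit is set when the previous poll
 * window expired without any event arriving. */
static int haltpoll_pick(uint64_t poll_limit_ns, int last_idx,
                         int poll_time_limit)
{
    if (poll_limit_ns == 0)
        return 1;                           /* no poll budget yet: halt */

    if (last_idx == 0)                      /* last state was poll */
        return poll_time_limit ? 1 : 0;     /* idle window: halt; else poll again */

    return 0;                               /* last state was halt: try polling */
}
```

The intent is the same as KVM's host-side halt polling: spin briefly before halting so that a wakeup arriving within the poll window avoids a costly VM exit.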

 

4.6 haltpoll_reflect

/**

* haltpoll_reflect - update variables and update poll time

* @dev: the CPU

* @index: the index of actual entered state

*/

static void haltpoll_reflect(struct cpuidle_device *dev, int index)

{

    dev->last_state_idx = index;

 

    if (index != 0)

        adjust_poll_limit(dev, dev->last_residency_ns);

}

 

4.6.1 adjust_poll_limit

static void adjust_poll_limit(struct cpuidle_device *dev, u64 block_ns)

{

    unsigned int val;

 

    /* Grow cpu_halt_poll_us if

     * cpu_halt_poll_us < block_ns < guest_halt_poll_us

     */

    if (block_ns > dev->poll_limit_ns && block_ns <= guest_halt_poll_ns) {

        val = dev->poll_limit_ns * guest_halt_poll_grow;

 

        //clamp val to the range [guest_halt_poll_grow_start, guest_halt_poll_ns]

        if (val < guest_halt_poll_grow_start)

            val = guest_halt_poll_grow_start;

        if (val > guest_halt_poll_ns)

            val = guest_halt_poll_ns;

 

        dev->poll_limit_ns = val;

    } else if (block_ns > guest_halt_poll_ns && guest_halt_poll_allow_shrink) {

        unsigned int shrink = guest_halt_poll_shrink;

 

        val = dev->poll_limit_ns;

        if (shrink == 0)

            val = 0;

        else

            val /= shrink;

        dev->poll_limit_ns = val;

    }

}
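The grow/shrink arithmetic can be exercised in isolation. The sketch below mirrors adjust_poll_limit with the module parameters folded in as plain variables holding the defaults from the listing above (`adjust` is a hypothetical name, my own simplification):

```c
#include <assert.h>
#include <stdint.h>

/* Defaults taken from the module parameters listed earlier. */
static uint32_t guest_halt_poll_ns = 200000;
static uint32_t guest_halt_poll_shrink = 2;
static uint32_t guest_halt_poll_grow = 2;
static uint32_t guest_halt_poll_grow_start = 50000;
static int guest_halt_poll_allow_shrink = 1;

/* Grow the per-cpu poll limit when the observed block time lands between
 * the current limit and the global ceiling; shrink it when the guest
 * blocked for longer than the ceiling (polling clearly wasted time). */
static uint64_t adjust(uint64_t poll_limit_ns, uint64_t block_ns)
{
    if (block_ns > poll_limit_ns && block_ns <= guest_halt_poll_ns) {
        uint64_t val = poll_limit_ns * guest_halt_poll_grow;

        if (val < guest_halt_poll_grow_start)
            val = guest_halt_poll_grow_start;
        if (val > guest_halt_poll_ns)
            val = guest_halt_poll_ns;
        return val;
    }

    if (block_ns > guest_halt_poll_ns && guest_halt_poll_allow_shrink)
        return guest_halt_poll_shrink ? poll_limit_ns / guest_halt_poll_shrink : 0;

    return poll_limit_ns;
}
```

Note the asymmetry: growth is multiplicative but jump-started at guest_halt_poll_grow_start, while a single oversized block time halves the budget.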

 

五、teo

teo stands for "timer events oriented", i.e. the governor is organized around timer events

 

5.1 Data structures

5.1.1 teo_idle_state - describes a C state

/**

* struct teo_idle_state - Idle state data used by the TEO cpuidle governor.

* @early_hits: "Early" CPU wakeups "matching" this state.

* @hits: "On time" CPU wakeups "matching" this state.

* @misses: CPU wakeups "missing" this state.

*

* A CPU wakeup is "matched" by a given idle state if the idle duration measured

* after the wakeup is between the target residency of that state and the target

* residency of the next one (or if this is the deepest available idle state, it

* "matches" a CPU wakeup when the measured idle duration is at least equal to

* its target residency).

*

* Also, from the TEO governor perspective, a CPU wakeup from idle is "early" if

* it occurs significantly earlier than the closest expected timer event (that

* is, early enough to match an idle state shallower than the one matching the

* time till the closest timer event). Otherwise, the wakeup is "on time", or

* it is a "hit".

*

* A "miss" occurs when the given state doesn't match the wakeup, but it matches

* the time till the closest timer event used for idle state selection.

*/

struct teo_idle_state {

    unsigned int early_hits;

    unsigned int hits;

    unsigned int misses;

};

 

5.1.2 teo_cpu - describes a CPU

/**

* struct teo_cpu - CPU data used by the TEO cpuidle governor.

* @time_span_ns: Time between idle state selection and post-wakeup update.

* @sleep_length_ns: Time till the closest timer event (at the selection time).

* @states: Idle states data corresponding to this CPU.

* @interval_idx: Index of the most recent saved idle interval.

* @intervals: Saved idle duration values.

*/

struct teo_cpu {

    u64 time_span_ns;

    u64 sleep_length_ns;

    struct teo_idle_state states[CPUIDLE_STATE_MAX];

    int interval_idx;

    u64 intervals[INTERVALS];

};

 

5.2 teo_governor

static struct cpuidle_governor teo_governor = {

    .name =        "teo",

    .rating =    19,

    .enable =    teo_enable_device,

    .select =    teo_select,

    .reflect =    teo_reflect,

};

 

 

5.3 teo_governor_init - initialization

static int __init teo_governor_init(void)

{

    return cpuidle_register_governor(&teo_governor);

}

postcore_initcall(teo_governor_init);

 

5.4 teo_enable_device - initialize the data when the governor is enabled

/**

* teo_enable_device - Initialize the governor's data for the target CPU.

* @drv: cpuidle driver (not used).

* @dev: Target CPU.

*/

static int teo_enable_device(

            struct cpuidle_driver *drv,

            struct cpuidle_device *dev)

{

    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

    int i;

 

    memset(cpu_data, 0, sizeof(*cpu_data));

 

    //INTERVALS is 8 in teo: a ring of the 8 most recent idle durations

    for (i = 0; i < INTERVALS; i++)

        cpu_data->intervals[i] = U64_MAX;

 

    return 0;

}

 

5.5 teo_select - pick an idle state

/**

* teo_select - Selects the next idle state to enter.

* @drv: cpuidle driver containing state data.

* @dev: Target CPU.

* @stop_tick: Indication on whether or not to stop the scheduler tick.

*/

static int teo_select(

            struct cpuidle_driver *drv,

            struct cpuidle_device *dev,

            bool *stop_tick)                    //out: whether to stop the tick

{

    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

 

    //1. The latency the system can tolerate

    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

    u64 duration_ns;

    unsigned int hits, misses, early_hits;

    int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;

    ktime_t delta_tick;

 

    //2. last_state_idx >= 0 means the CPU did enter an idle state last time,
    // so fold that outcome into the statistics first

    if (dev->last_state_idx >= 0) {

        teo_update(drv, dev);

        dev->last_state_idx = -1;

    }

 

    cpu_data->time_span_ns = local_clock();

 

    duration_ns = tick_nohz_get_sleep_length(&delta_tick);

    cpu_data->sleep_length_ns = duration_ns;

 

    hits = 0;

    misses = 0;

    early_hits = 0;

    max_early_idx = -1;

    prev_max_early_idx = -1;

    constraint_idx = drv->state_count;

    idx = -1;

 

    for (i = 0; i < drv->state_count; i++) {

        struct cpuidle_state *s = &drv->states[i];

 

        if (dev->states_usage[i].disable) {

            /*

             * Ignore disabled states with target residencies beyond

             * the anticipated idle duration.

             */

            if (s->target_residency_ns > duration_ns)

                continue;

 

            /*

             * This state is disabled, so the range of idle duration

             * values corresponding to it is covered by the current

             * candidate state, but still the "hits" and "misses"

             * metrics of the disabled state need to be used to

             * decide whether or not the state covering the range in

             * question is good enough.

             */

            hits = cpu_data->states[i].hits;

            misses = cpu_data->states[i].misses;

 

            if (early_hits >= cpu_data->states[i].early_hits ||

             idx < 0)

                continue;

 

            /*

             * If the current candidate state has been the one with

             * the maximum "early hits" metric so far, the "early

             * hits" metric of the disabled state replaces the

             * current "early hits" count to avoid selecting a

             * deeper state with lower "early hits" metric.

             */

            if (max_early_idx == idx) {

                early_hits = cpu_data->states[i].early_hits;

                continue;

            }

 

            /*

             * The current candidate state is closer to the disabled

             * one than the current maximum "early hits" state, so

             * replace the latter with it, but in case the maximum

             * "early hits" state index has not been set so far,

             * check if the current candidate state is not too

             * shallow for that role.

             */

            if (teo_time_ok(drv->states[idx].target_residency_ns)) {

                prev_max_early_idx = max_early_idx;

                early_hits = cpu_data->states[i].early_hits;

                max_early_idx = idx;

            }

 

            continue;

        }

 

        if (idx < 0) {

            idx = i; /* first enabled state */

            hits = cpu_data->states[i].hits;

            misses = cpu_data->states[i].misses;

        }

 

        if (s->target_residency_ns > duration_ns)

            break;

 

        if (s->exit_latency_ns > latency_req && constraint_idx > i)

            constraint_idx = i;

 

        idx = i;

        hits = cpu_data->states[i].hits;

        misses = cpu_data->states[i].misses;

 

        if (early_hits < cpu_data->states[i].early_hits &&

         teo_time_ok(drv->states[i].target_residency_ns)) {

            prev_max_early_idx = max_early_idx;

            early_hits = cpu_data->states[i].early_hits;

            max_early_idx = i;

        }

    }

 

    /*

     * If the "hits" metric of the idle state matching the sleep length is

     * greater than its "misses" metric, that is the one to use. Otherwise,

     * it is more likely that one of the shallower states will match the

     * idle duration observed after wakeup, so take the one with the maximum

     * "early hits" metric, but if that cannot be determined, just use the

     * state selected so far.

     */

    if (hits <= misses) {

        /*

         * The current candidate state is not suitable, so take the one

         * whose "early hits" metric is the maximum for the range of

         * shallower states.

         */

        if (idx == max_early_idx)

            max_early_idx = prev_max_early_idx;

 

        if (max_early_idx >= 0) {

            idx = max_early_idx;

            duration_ns = drv->states[idx].target_residency_ns;

        }

    }

 

    /*

     * If there is a latency constraint, it may be necessary to use a

     * shallower idle state than the one selected so far.

     */

    if (constraint_idx < idx)

        idx = constraint_idx;

 

    if (idx < 0) {

        idx = 0; /* No states enabled. Must use 0. */

    } else if (idx > 0) {

        unsigned int count = 0;

        u64 sum = 0;

 

        /*

         * Count and sum the most recent idle duration values less than

         * the current expected idle duration value.

         */

        for (i = 0; i < INTERVALS; i++) {

            u64 val = cpu_data->intervals[i];

 

            if (val >= duration_ns)

                continue;

 

            count++;

            sum += val;

        }

 

        /*

         * Give up unless the majority of the most recent idle duration

         * values are in the interesting range.

         */

        if (count > INTERVALS / 2) {

            u64 avg_ns = div64_u64(sum, count);

 

            /*

             * Avoid spending too much time in an idle state that

             * would be too shallow.

             */

            if (teo_time_ok(avg_ns)) {

                duration_ns = avg_ns;

                if (drv->states[idx].target_residency_ns > avg_ns)

                    idx = teo_find_shallower_state(drv, dev,

                                 idx, avg_ns);

            }

        }

    }

 

    /*

     * Don't stop the tick if the selected state is a polling one or if the

     * expected idle duration is shorter than the tick period length.

     */

    if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||

     duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {

        *stop_tick = false;

 

        /*

         * The tick is not going to be stopped, so if the target

         * residency of the state to be returned is not within the time

         * till the closest timer including the tick, try to correct

         * that.

         */

        if (idx > 0 && drv->states[idx].target_residency_ns > delta_tick)

            idx = teo_find_shallower_state(drv, dev, idx, delta_tick);

    }

 

    return idx;

}
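The pattern-detection step near the end of teo_select - average the recent idle intervals shorter than the expected sleep, but only when they form a majority of the samples - can be sketched on its own (the `teo_recent_avg` helper is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define INTERVALS 8     /* ring of the most recent idle durations */

/* Average the saved idle intervals that are shorter than the expected
 * sleep length; use that average only if a majority of the samples fall
 * below the expectation. Returns 0 when no correction applies. */
static uint64_t teo_recent_avg(const uint64_t intervals[INTERVALS],
                               uint64_t duration_ns)
{
    uint64_t sum = 0;
    unsigned int count = 0;

    for (int i = 0; i < INTERVALS; i++) {
        if (intervals[i] < duration_ns) {
            count++;
            sum += intervals[i];
        }
    }

    return (count > INTERVALS / 2) ? sum / count : 0;
}
```

When the helper returns a non-zero average, teo_select shortens its expected idle duration accordingly and may drop to a shallower state via teo_find_shallower_state.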

 

5.5.1 teo_update

/**

* teo_update - Update CPU data after wakeup.

* @drv: cpuidle driver containing state data.

* @dev: Target CPU.

*/

static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)

{

    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

    int i, idx_hit = -1, idx_timer = -1;

    u64 measured_ns;

 

    if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {

        /*

         * One of the safety nets has triggered or the wakeup was close

         * enough to the closest timer event expected at the idle state

         * selection time to be discarded.

         */

        measured_ns = U64_MAX;

    } else {

        u64 lat_ns = drv->states[dev->last_state_idx].exit_latency_ns;

 

        /*

         * The computations below are to determine whether or not the

         * (saved) time till the next timer event and the measured idle

         * duration fall into the same "bin", so use last_residency_ns

         * for that instead of time_span_ns which includes the cpuidle

         * overhead.

         */

        measured_ns = dev->last_residency_ns;

        /*

         * The delay between the wakeup and the first instruction

         * executed by the CPU is not likely to be worst-case every

         * time, so take 1/2 of the exit latency as a very rough

         * approximation of the average of it.

         */

        if (measured_ns >= lat_ns)

            measured_ns -= lat_ns / 2;

        else

            measured_ns /= 2;

    }

 

    /*

     * Decay the "early hits" metric for all of the states and find the

     * states matching the sleep length and the measured idle duration.

     */

    for (i = 0; i < drv->state_count; i++) {

        unsigned int early_hits = cpu_data->states[i].early_hits;

 

        cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;

 

        if (drv->states[i].target_residency_ns <= cpu_data->sleep_length_ns) {

            idx_timer = i;

            if (drv->states[i].target_residency_ns <= measured_ns)

                idx_hit = i;

        }

    }

 

    /*

     * Update the "hits" and "misses" data for the state matching the sleep

     * length. If it matches the measured idle duration too, this is a hit,

     * so increase the "hits" metric for it then. Otherwise, this is a

     * miss, so increase the "misses" metric for it. In the latter case

     * also increase the "early hits" metric for the state that actually

     * matches the measured idle duration.

     */

    if (idx_timer >= 0) {

        unsigned int hits = cpu_data->states[idx_timer].hits;

        unsigned int misses = cpu_data->states[idx_timer].misses;

 

        hits -= hits >> DECAY_SHIFT;

        misses -= misses >> DECAY_SHIFT;

 

        if (idx_timer > idx_hit) {

            misses += PULSE;

            if (idx_hit >= 0)

                cpu_data->states[idx_hit].early_hits += PULSE;

        } else {

            hits += PULSE;

        }

 

        cpu_data->states[idx_timer].misses = misses;

        cpu_data->states[idx_timer].hits = hits;

    }

 

    /*

     * Save idle duration values corresponding to non-timer wakeups for

     * pattern detection.

     */

    cpu_data->intervals[cpu_data->interval_idx++] = measured_ns;

    if (cpu_data->interval_idx >= INTERVALS)

        cpu_data->interval_idx = 0;

}
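The decay-and-pulse bookkeeping in teo_update can be illustrated standalone. Assuming teo.c's DECAY_SHIFT = 3 and PULSE = 1024 (worth double-checking against your kernel version), each update keeps 7/8 of the old metric and adds a fixed pulse to whichever of hits/misses describes this wakeup, so recent behaviour dominates:

```c
#include <assert.h>

#define DECAY_SHIFT 3   /* each update keeps 7/8 of the old metric */
#define PULSE 1024      /* weight credited to the current wakeup */

/* One update of the "hits"/"misses" pair for the state matching the sleep
 * length: both metrics decay, then the one describing this wakeup gets a
 * PULSE added. This is an exponential moving average in integer form. */
static void decay_and_pulse(unsigned int *hits, unsigned int *misses,
                            int was_hit)
{
    *hits -= *hits >> DECAY_SHIFT;
    *misses -= *misses >> DECAY_SHIFT;

    if (was_hit)
        *hits += PULSE;
    else
        *misses += PULSE;
}
```

Because the decayed sum converges to PULSE * 2^DECAY_SHIFT = 8192, teo_select's `hits <= misses` comparison effectively asks which outcome has dominated the recent past.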

 

5.6 teo_reflect - record data for the next update

/**

* teo_reflect - Note that governor data for the CPU need to be updated.

* @dev: Target CPU.

* @state: Entered state.

*/

static void teo_reflect(

            struct cpuidle_device *dev,

            int state)                //the C state just exited

{

    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

 

    //1. Record which C state was used last

    dev->last_state_idx = state;

    /*

     * If the wakeup was not "natural", but triggered by one of the safety

     * nets, assume that the CPU might have been idle for the entire sleep

     * length time.

     */

    if (dev->poll_time_limit ||

     (tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {

        dev->poll_time_limit = false;

        cpu_data->time_span_ns = cpu_data->sleep_length_ns;

    } else {

        cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;

    }

}

 

 


This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License
Last updated: March 18, 2022

tmmdh

Comments

  • Clarke Li

    haltpoll is a cpuidle governor used to optimize virtual-machine performance. drivers/cpuidle/cpuidle-haltpoll.c implements a dedicated haltpoll_driver that defines only two states: state[0] is the poll state, and state[1] enters the default idle handler via the default_enter_idle callback.
    https://lwn.net/Articles/792618/

    September 1, 2022
    • tmmdh

      @Clarke Li Thanks for the reply - I will study this part and improve the article later! :smile:

      January 8, 2023

    COPYRIGHT © 2022 dumpstack.cn. ALL RIGHTS RESERVED.

    浙ICP备2022000966号