author     Peter Zijlstra <peterz@infradead.org>    2012-07-03 07:53:26 -0400
committer  Ingo Molnar <mingo@kernel.org>           2012-10-24 04:27:33 -0400
commit     e9c84cb8d5f1b1ea6fcbe6190d51dc84b6975938 (patch)
tree       80433b41204b5f730b3e7d4d3990eac226c9b65a /kernel/sched
parent     f4e26b120b9de84cb627bc7361ba43cfdc51341f (diff)
sched: Describe CFS load-balancer
Add some scribbles on how and why the load-balancer works..
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'kernel/sched')
-rw-r--r--  kernel/sched/fair.c  118
1 file changed, 116 insertions, 2 deletions
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e6a3531fa90..a319d56c7605 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3456,8 +3456,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 #ifdef CONFIG_SMP
 /**************************************************
- * Fair scheduling class load-balancing methods:
- */
+ * Fair scheduling class load-balancing methods.
+ *
+ * BASICS
+ *
+ * The purpose of load-balancing is to achieve the same basic fairness the
+ * per-cpu scheduler provides, namely provide a proportional amount of compute
+ * time to each task. This is expressed in the following equation:
+ *
+ *   W_i,n/P_i == W_j,n/P_j for all i,j                                (1)
+ *
+ * Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
+ * W_i,0 is defined as:
+ *
+ *   W_i,0 = \Sum_j w_i,j                                              (2)
+ *
+ * Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
+ * is derived from the nice value as per prio_to_weight[].
+ *
+ * The weight average is an exponential decay average of the instantaneous
+ * weight:
+ *
+ *   W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0                (3)
+ *
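To make (3) concrete, here is a minimal user-space C sketch of one decay step and its fixed point; the function and variable names are illustrative, not the kernel's:

    #include <stdio.h>

    /*
     * One step of the exponential decay average from (3):
     *   W'_i,n = (2^n - 1)/2^n * W_i,n + 1/2^n * W_i,0
     * A larger shift n makes the average favour history over the
     * instantaneous weight.
     */
    static double decay_weight_avg(double W_avg, double W_inst, unsigned int n)
    {
            double f = (double)((1u << n) - 1) / (double)(1u << n);

            return f * W_avg + (1.0 - f) * W_inst;
    }

    int main(void)
    {
            double W = 0.0;
            int i;

            /* A constant instantaneous weight of 1024; W converges to it. */
            for (i = 0; i < 64; i++)
                    W = decay_weight_avg(W, 1024.0, 3);

            printf("W = %f\n", W);
            return 0;
    }

With n = 3 each step keeps 7/8 of the history, so W approaches the steady-state weight geometrically.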
+ * P_i is the cpu power (or compute capacity) of cpu i; typically it is the
+ * fraction of 'recent' time available for SCHED_OTHER task execution. But it
+ * can also include other factors [XXX].
+ *
+ * To achieve this balance we define a measure of imbalance which follows
+ * directly from (1):
+ *
+ *   imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }     (4)
+ *
+ * We then move tasks around to minimize the imbalance. In the continuous
+ * function space it is obvious this converges; in the discrete case we get
+ * a few fun cases generally called infeasible weight scenarios.
+ *
+ * [XXX expand on:
+ *   - infeasible weights;
+ *   - local vs global optima in the discrete case. ]
+ *
+ *
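A hedged sketch of the imbalance measure (4), with hypothetical W[] and P[] arrays standing in for the per-cpu weight averages and cpu powers:

    #include <stdio.h>

    /*
     * imb_i,j per (4):
     *   max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }
     */
    static double imbalance(const double *W, const double *P, int nr_cpus,
                            int i, int j)
    {
            double avg = 0.0, hi, lo;
            int k;

            for (k = 0; k < nr_cpus; k++)
                    avg += W[k] / P[k];
            avg /= nr_cpus;

            hi = W[i] / P[i] > avg ? W[i] / P[i] : avg;
            lo = W[j] / P[j] < avg ? W[j] / P[j] : avg;

            return hi - lo;
    }

    int main(void)
    {
            double W[] = { 3072, 1024 };    /* weight averages, made up   */
            double P[] = {    1,    1 };    /* cpu power, normalized to 1 */

            printf("imb_0,1 = %f\n", imbalance(W, P, 2, 0, 1));
            return 0;
    }

In the continuous case, moving weight from the overloaded cpu to the underloaded one shrinks hi and grows lo, driving (4) toward zero.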
+ * SCHED DOMAINS
+ *
+ * In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
+ * for all i,j solution, we create a tree of cpus that follows the hardware
+ * topology where each level pairs two lower groups (or better). This results
+ * in O(log n) layers. Furthermore we reduce the number of cpus going up the
+ * tree to only the first of the previous level, and we decrease the frequency
+ * of load-balance at each level in inverse proportion to the number of cpus
+ * in the groups.
+ *
+ * This yields:
+ *
+ *  log_2 n      1       n
+ *   \Sum  {  -----  *  -----  *  2^i  } = O(n)                        (5)
+ *   i = 0     2^i      2^i
+ *                                  `- size of each group
+ *               |          |
+ *               |          `- number of cpus doing load-balance
+ *               `- freq
+ *   `- sum over all levels
+ *
+ * Coupled with a limit on how many tasks we can migrate every balance pass,
+ * this makes (5) the runtime complexity of the balancer.
+ *
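The claim that (5) is O(n) can be spot-checked numerically; a throwaway C loop, assuming n is a power of two:

    #include <stdio.h>

    int main(void)
    {
            unsigned int n;

            /* Sum_{i=0}^{log2 n} 1/2^i * n/2^i * 2^i = n * Sum 1/2^i < 2n */
            for (n = 2; n <= 1024; n *= 2) {
                    double sum = 0.0;
                    unsigned int p;

                    for (p = 1; p <= n; p *= 2)
                            sum += (1.0 / p) * ((double)n / p) * p;

                    printf("n = %4u  sum = %7.2f  sum/n = %.3f\n",
                           n, sum, sum / n);
            }
            return 0;
    }

sum/n settles near 2: the per-level work forms a geometric series, so the whole balance pass stays linear in the number of cpus.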
+ * An important property here is that each CPU is still (indirectly) connected
+ * to every other cpu in at most O(log n) steps:
+ *
+ * The adjacency matrix of the resulting graph is given by:
+ *
+ *            log_2 n
+ *   A_i,j = \Union (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)       (6)
+ *            k = 0
+ *
+ * And you'll find that:
+ *
+ *   A^(log_2 n)_i,j != 0  for all i,j                                 (7)
+ *
+ * Showing there's indeed a path between every pair of cpus in at most
+ * O(log n) steps. The task movement gives a factor of O(m), giving a
+ * convergence complexity of:
+ *
+ *   O(nm log n),  n := nr_cpus, m := nr_tasks                         (8)
+ *
+ *
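Equation (7) can likewise be spot-checked with a boolean matrix power. A small C sketch, assuming n is a power of two: it builds A per (6), squares it log_2 n times, and reports any unreachable pair (it prints nothing when (7) holds):

    #include <stdio.h>
    #include <string.h>

    #define N 8     /* nr_cpus, a power of two */

    /* dst = boolean matrix product of a and b */
    static void bool_matmul(int dst[N][N], int a[N][N], int b[N][N])
    {
            int t[N][N] = { { 0 } }, i, j, k;

            for (i = 0; i < N; i++)
                    for (j = 0; j < N; j++)
                            for (k = 0; k < N; k++)
                                    t[i][j] |= a[i][k] && b[k][j];
            memcpy(dst, t, sizeof(t));
    }

    int main(void)
    {
            int A[N][N] = { { 0 } }, i, j, k;

            /* A_i,j per (6): union over the tree levels k */
            for (i = 0; i < N; i++)
                    for (j = 0; j < N; j++)
                            for (k = 1; k <= N; k *= 2)
                                    if (i % k == 0 &&
                                        i / (2 * k) == j / (2 * k))
                                            A[i][j] = 1;

            /* log_2 N squarings cover every path of length <= N */
            for (k = 1; k < N; k *= 2)
                    bool_matmul(A, A, A);

            for (i = 0; i < N; i++)
                    for (j = 0; j < N; j++)
                            if (!A[i][j])
                                    printf("no path %d -> %d\n", i, j);
            return 0;
    }

Since A has self-loops, the matrix power is monotone, so checking A raised past log_2 n is a valid over-approximation of (7).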
+ * WORK CONSERVING
+ *
+ * In order to avoid CPUs going idle while there's still work to do, new idle
+ * balancing is more aggressive and has the newly idle cpu iterate up the
+ * domain tree itself instead of relying on other CPUs to bring it work.
+ *
+ * This adds some complexity to both (5) and (8) but it reduces the total
+ * idle time.
+ *
+ * [XXX more?]
+ *
+ *
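The shape of the newly idle walk described above, as a heavily simplified sketch; sd_parent() and pull_from_busiest() are hypothetical stand-ins, not the kernel's API:

    struct sched_domain;    /* opaque here */

    extern struct sched_domain *sd_parent(struct sched_domain *sd);
    extern int pull_from_busiest(struct sched_domain *sd, int this_cpu);

    /* A newly idle cpu climbs its own domain tree looking for work. */
    static void newidle_balance_sketch(struct sched_domain *sd, int this_cpu)
    {
            for (; sd; sd = sd_parent(sd))
                    if (pull_from_busiest(sd, this_cpu))
                            return; /* found work; stop climbing */
    }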
+ * CGROUPS
+ *
+ * Cgroups make a horror show out of (2); instead of a simple sum we get:
+ *
+ *                                   s_k,i
+ *   W_i,0 = \Sum_j \Prod_k w_k  *  -----                              (9)
+ *                                    S_k
+ *
+ * Where
+ *
+ *   s_k,i = \Sum_j w_i,j,k  and  S_k = \Sum_i s_k,i                   (10)
+ *
+ * w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on
+ * cpu i.
+ *
+ * The big problem is S_k: it is a global sum needed to compute a local (W_i)
+ * property.
+ *
+ * [XXX write more on how we solve this.. _after_ merging pjt's patches that
+ *      rewrite all of this once again.]
+ */
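To see why S_k hurts, a toy user-space C evaluation of (9) and (10) for a single cgroup level; all names and numbers here are made up:

    #include <stdio.h>

    #define NR_CPUS  2
    #define NR_TASKS 3      /* runnable tasks of the cgroup, per cpu */

    int main(void)
    {
            /* w[i][j]: weight of the j-th runnable cgroup task on cpu i */
            double w[NR_CPUS][NR_TASKS] = {
                    { 1024, 512, 0 },
                    { 2048,   0, 0 },
            };
            double tg_weight = 1024;   /* the cgroup's weight, w_k in (9) */
            double s[NR_CPUS] = { 0 }, S = 0;
            int i, j;

            /* (10): s_k,i is per-cpu, but S_k sums over all cpus */
            for (i = 0; i < NR_CPUS; i++) {
                    for (j = 0; j < NR_TASKS; j++)
                            s[i] += w[i][j];
                    S += s[i];
            }

            /* (9), one level: cpu i's slice of the group weight */
            for (i = 0; i < NR_CPUS; i++)
                    printf("W_%d = %.1f\n", i, tg_weight * s[i] / S);

            return 0;
    }

Any cpu that wants its local W_i must know the global S_k, which is exactly the scalability problem the comment points at.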
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 