path: root/kernel/sched
author		Peter Zijlstra <peterz@infradead.org>	2012-07-03 07:53:26 -0400
committer	Ingo Molnar <mingo@kernel.org>	2012-10-24 04:27:33 -0400
commit		e9c84cb8d5f1b1ea6fcbe6190d51dc84b6975938 (patch)
tree		80433b41204b5f730b3e7d4d3990eac226c9b65a /kernel/sched
parent		f4e26b120b9de84cb627bc7361ba43cfdc51341f (diff)
sched: Describe CFS load-balancer
Add some scribbles on how and why the load-balancer works.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'kernel/sched')
-rw-r--r--	kernel/sched/fair.c	118
1 file changed, 116 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e6a3531fa90..a319d56c7605 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3456,8 +3456,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 #ifdef CONFIG_SMP
 /**************************************************
- * Fair scheduling class load-balancing methods:
- */
+ * Fair scheduling class load-balancing methods.
+ *
 * BASICS
 *
 * The purpose of load-balancing is to achieve the same basic fairness the
 * per-cpu scheduler provides, namely provide a proportional amount of compute
 * time to each task. This is expressed in the following equation:
 *
 *   W_i,n/P_i == W_j,n/P_j for all i,j                               (1)
 *
 * Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
 * W_i,0 is defined as:
 *
 *   W_i,0 = \Sum_j w_i,j                                             (2)
 *
 * Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
 * is derived from the nice value as per prio_to_weight[].
 *
 * The weight average is an exponential decay average of the instantaneous
 * weight:
 *
 *   W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0               (3)
 *
 * P_i is the cpu power (or compute capacity) of cpu i; typically it is the
 * fraction of 'recent' time available for SCHED_OTHER task execution. But it
 * can also include other factors [XXX].
 *
 * To achieve this balance we define a measure of imbalance which follows
 * directly from (1):
 *
 *   imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }    (4)
 *
 * We then move tasks around to minimize the imbalance. In the continuous
 * function space it is obvious this converges; in the discrete case we get
 * a few fun cases generally called infeasible weight scenarios.
 *
 * [XXX expand on:
 *   - infeasible weights;
 *   - local vs global optima in the discrete case. ]
 *
 *
 * SCHED DOMAINS
 *
 * In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
 * for all i,j solution, we create a tree of cpus that follows the hardware
 * topology where each level pairs two lower groups (or better). This results
 * in O(log n) layers. Furthermore we reduce the number of cpus going up the
 * tree to only the first of the previous level, and we decrease the frequency
 * of load-balance at each level inversely proportional to the number of cpus
 * in the groups.
 *
 * This yields:
 *
 *   log_2 n
 *    \Sum  { 1/2^i * n/2^i * 2^i } = O(n)                            (5)
 *   i = 0      |       |      `- size of each group
 *      |       |       `- number of cpus doing load-balance
 *      |       `- freq
 *      `- sum over all levels
 *
 * Coupled with a limit on how many tasks we can migrate every balance pass,
 * this makes (5) the runtime complexity of the balancer.
 *
 * An important property here is that each CPU is still (indirectly) connected
 * to every other cpu in at most O(log n) steps:
 *
 * The adjacency matrix of the resulting graph is given by:
 *
 *            log_2 n
 *   A_i,j = \Union  (i % 2^k == 0) && (i / 2^(k+1) == j / 2^(k+1))   (6)
 *            k = 0
 *
 * And you'll find that:
 *
 *   A^(log_2 n)_i,j != 0  for all i,j                                (7)
 *
 * Showing there's indeed a path between every cpu in at most O(log n) steps.
 * The task movement gives a factor of O(m), giving a convergence complexity
 * of:
 *
 *   O(nm log n),  n := nr_cpus, m := nr_tasks                        (8)
 *
 *
 * WORK CONSERVING
 *
 * In order to avoid CPUs going idle while there's still work to do, new idle
 * balancing is more aggressive and has the newly idle cpu iterate up the domain
 * tree itself instead of relying on other CPUs to bring it work.
 *
 * This adds some complexity to both (5) and (8) but it reduces the total idle
 * time.
 *
 * [XXX more?]
 *
 *
 * CGROUPS
 *
 * Cgroups make a horror show out of (2); instead of a simple sum we get:
 *
 *                                   s_k,i
 *   W_i,0 = \Sum_j \Prod_k w_k * -----                               (9)
 *                                    S_k
 *
 * Where
 *
 *   s_k,i = \Sum_j w_i,j,k  and  S_k = \Sum_i s_k,i                  (10)
 *
 * w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on cpu i.
 *
 * The big problem is S_k; it's a global sum needed to compute a local (W_i)
 * property.
 *
 * [XXX write more on how we solve this.. _after_ merging pjt's patches that
 *   rewrite all of this once again.]
 */
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 