author:    Peter Zijlstra <peterz@infradead.org>   2012-07-03 07:53:26 -0400
committer: Ingo Molnar <mingo@kernel.org>          2012-10-24 04:27:33 -0400
commit:    e9c84cb8d5f1b1ea6fcbe6190d51dc84b6975938
tree:      80433b41204b5f730b3e7d4d3990eac226c9b65a /kernel/sched/fair.c
parent:    f4e26b120b9de84cb627bc7361ba43cfdc51341f
sched: Describe CFS load-balancer
Add some scribbles on how and why the load-balancer works..
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'kernel/sched/fair.c')

 kernel/sched/fair.c | 118
 1 file changed, 116 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e6a3531fa90..a319d56c7605 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3456,8 +3456,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
 
 #ifdef CONFIG_SMP
 /**************************************************
- * Fair scheduling class load-balancing methods:
- */
+ * Fair scheduling class load-balancing methods.
+ *
+ * BASICS
+ *
+ * The purpose of load-balancing is to achieve the same basic fairness the
+ * per-cpu scheduler provides, namely provide a proportional amount of compute
+ * time to each task. This is expressed in the following equation:
+ *
+ *   W_i,n/P_i == W_j,n/P_j for all i,j                                (1)
+ *
+ * Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
+ * W_i,0 is defined as:
+ *
+ *   W_i,0 = \Sum_j w_i,j                                              (2)
+ *
+ * Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
+ * is derived from the nice value as per prio_to_weight[].
+ *
+ * The weight average is an exponential decay average of the instantaneous
+ * weight:
+ *
+ *   W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0                (3)
+ *
+ * P_i is the cpu power (or compute capacity) of cpu i; typically it is the
+ * fraction of 'recent' time available for SCHED_OTHER task execution. But it
+ * can also include other factors [XXX].
+ *
+ * To achieve this balance we define a measure of imbalance which follows
+ * directly from (1):
+ *
+ *   imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }     (4)
+ *
+ * We then move tasks around to minimize the imbalance. In the continuous
+ * function space it is obvious this converges; in the discrete case we get
+ * a few fun cases generally called infeasible weight scenarios.
+ *
3494 | * | ||
3495 | * [XXX expand on: | ||
3496 | * - infeasible weights; | ||
3497 | * - local vs global optima in the discrete case. ] | ||
3498 | * | ||
3499 | * | ||
3500 | * SCHED DOMAINS | ||
3501 | * | ||
+ * In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
+ * for all i,j solution, we create a tree of cpus that follows the hardware
+ * topology where each level pairs two lower groups (or better). This results
+ * in O(log n) layers. Furthermore we reduce the number of cpus going up the
+ * tree to only the first of the previous level and we decrease the frequency
+ * of load-balance at each level inversely proportional to the number of cpus
+ * in the groups.
+ *
+ * This yields:
+ *
+ *   log_2 n    1      n
+ *    \Sum  {  --- *  ---  * 2^i } = O(n)                             (5)
+ *    i = 0    2^i    2^i
+ *                           `- size of each group
+ *     |        |     `- number of cpus doing load-balance
+ *     |        `- freq
+ *     `- sum over all levels
+ *
+ * Coupled with a limit on how many tasks we can migrate every balance pass,
+ * this makes (5) the runtime complexity of the balancer.
+ *
+ * An important property here is that each cpu is still (indirectly) connected
+ * to every other cpu in at most O(log n) steps:
+ *
+ * The adjacency matrix of the resulting graph is given by:
+ *
+ *           log_2 n
+ *   A_i,j = \Union (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)       (6)
+ *           k = 0
+ *
+ * And you'll find that:
+ *
+ *   A^(log_2 n)_i,j != 0  for all i,j                                 (7)
+ *
+ * Showing there's indeed a path between every pair of cpus in at most
+ * O(log n) steps. The task movement gives a factor of O(m), giving a
+ * convergence complexity of:
+ *
+ *   O(nm log n),  n := nr_cpus, m := nr_tasks                         (8)
+ *
+ *
+ * WORK CONSERVING
+ *
+ * In order to avoid CPUs going idle while there's still work to do, new idle
+ * balancing is more aggressive and has the newly idle cpu iterate up the
+ * domain tree itself instead of relying on other CPUs to bring it work.
+ *
+ * This adds some complexity to both (5) and (8) but it reduces the total idle
+ * time.
+ *
+ * [XXX more?]
+ *
+ *
+ * CGROUPS
+ *
+ * Cgroups make a horror show out of (2); instead of a simple sum we get:
+ *
+ *                                 s_k,i
+ *   W_i,0 = \Sum_j \Prod_k w_k * -----                                (9)
+ *                                  S_k
+ *
+ * Where
+ *
+ *   s_k,i = \Sum_j w_i,j,k  and  S_k = \Sum_i s_k,i                  (10)
+ *
+ * w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on
+ * cpu i.
+ *
+ * The big problem is S_k: it's a global sum needed to compute a local (W_i)
+ * property.
+ *
+ * [XXX write more on how we solve this.. _after_ merging pjt's patches that
+ *  rewrite all of this once again.]
+ */
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
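
A few worked sketches may make the math in the comment concrete. First, equations (2) and (3): the standalone program below (illustrative only; cpu_weight(), decay_avg() and DECAY_SHIFT are invented names, not fair.c code) sums per-task weights into W_i,0 and folds that into a decaying average. Only the prio_to_weight[] values are real: 1024 for nice 0, 3121 for nice -5.

    #include <stdio.h>

    #define DECAY_SHIFT 3   /* the n in (3): keep (2^n - 1)/2^n of the old average */

    /* W_i,0 = \Sum_j w_i,j -- equation (2) */
    static unsigned long cpu_weight(const unsigned long *w, int nr)
    {
        unsigned long sum = 0;

        for (int j = 0; j < nr; j++)
            sum += w[j];
        return sum;
    }

    /* W'_i,n = (2^n - 1)/2^n * W_i,n + 1/2^n * W_i,0 -- equation (3) */
    static unsigned long decay_avg(unsigned long avg, unsigned long w_inst)
    {
        return avg - (avg >> DECAY_SHIFT) + (w_inst >> DECAY_SHIFT);
    }

    int main(void)
    {
        unsigned long w[] = { 1024, 1024, 3121 };  /* two nice-0 tasks, one nice -5 */
        unsigned long inst = cpu_weight(w, 3);
        unsigned long avg = 0;

        for (int n = 0; n < 32; n++)
            avg = decay_avg(avg, inst);
        printf("W_i,0 = %lu, decayed average = %lu\n", inst, avg);
        return 0;
    }

The shift form is the usual fixed-point way to evaluate (3) without a division; the average converges geometrically toward W_i,0.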
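
Next, equation (4) in miniature, for just two cpus where avg(W/P) is simply the mean of the two ratios. imbalance() is a hypothetical helper for illustration; the kernel's real computation lives in find_busiest_group() and is considerably more involved:

    #include <stdio.h>

    /* imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j } -- (4) */
    static double imbalance(double w_i, double p_i, double w_j, double p_j)
    {
        double ri = w_i / p_i, rj = w_j / p_j;
        double avg = (ri + rj) / 2.0;

        return (ri > avg ? ri : avg) - (rj < avg ? rj : avg);
    }

    int main(void)
    {
        /* three nice-0 tasks on cpu0, one on cpu1, equal cpu power */
        printf("imb = %.2f\n", imbalance(3072, 1024, 1024, 1024));  /* 2.00 */
        /* after moving one 1024-weight task across */
        printf("imb = %.2f\n", imbalance(2048, 1024, 2048, 1024));  /* 0.00 */
        return 0;
    }

Moving one task from the busy cpu to the idle one drives the measure to zero, which is exactly the minimization the balancer performs.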
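
The O(n) bound in (5) is easy to check numerically. This toy loop (not kernel code) evaluates the sum for doubling n; the per-period balancing work stays below 2n:

    #include <stdio.h>

    int main(void)
    {
        for (int n = 2; n <= 4096; n <<= 1) {
            double sum = 0.0;

            /* \Sum_{i=0}^{log2 n} 1/2^i * n/2^i * 2^i -- equation (5) */
            for (int i = 0; (1 << i) <= n; i++)
                sum += (1.0 / (1 << i)) * ((double)n / (1 << i)) * (1 << i);
            printf("n = %4d: work = %6.1f (2n = %d)\n", n, sum, 2 * n);
        }
        return 0;
    }

Each term reduces to n/2^i, so the series telescopes to 2n - 1, confirming the linear total despite the log n levels.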
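
Equations (6) and (7) can likewise be verified for a small machine. The sketch below (illustrative; n = 4 hard-coded) builds the adjacency matrix per (6) and raises it to the log2(n)-th power; every entry of A^2 comes out non-zero, so any two cpus are connected within 2 balance steps:

    #include <stdio.h>
    #include <string.h>

    #define N     4
    #define LOG2N 2

    int main(void)
    {
        int a[N][N] = { 0 }, p[N][N], t[N][N];

        /* A_i,j per equation (6): union over levels k = 0..log2 n */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k <= LOG2N; k++)
                    if (i % (1 << k) == 0 &&
                        i / (1 << (k + 1)) == j / (1 << (k + 1)))
                        a[i][j] = 1;

        /* p = A^LOG2N by repeated multiplication */
        memcpy(p, a, sizeof(a));
        for (int s = 1; s < LOG2N; s++) {
            memset(t, 0, sizeof(t));
            for (int i = 0; i < N; i++)
                for (int k = 0; k < N; k++)
                    for (int j = 0; j < N; j++)
                        t[i][j] += p[i][k] * a[k][j];
            memcpy(p, t, sizeof(t));
        }

        for (int i = 0; i < N; i++) {   /* (7): every entry non-zero */
            for (int j = 0; j < N; j++)
                printf("%2d ", p[i][j]);
            printf("\n");
        }
        return 0;
    }

For N = 4 this prints a matrix with no zeros (rows 3 3 3 3 / 2 2 1 1 / 3 3 3 3 / 1 1 2 2), matching (7).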
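
Finally, a small number example makes the S_k problem of (9) and (10) concrete. Assume (all values invented) one cgroup k of weight 1024 whose runnable tasks contribute s_k,0 = 2048 on cpu 0 and s_k,1 = 1024 on cpu 1:

    #include <stdio.h>

    int main(void)
    {
        double w_k = 1024.0;                 /* cgroup weight */
        double s_k[2] = { 2048.0, 1024.0 };  /* per-cpu group load s_k,i */
        double S_k = s_k[0] + s_k[1];        /* global sum, per (10) */

        /* each cpu's slice of the group weight: the w_k * s_k,i / S_k term of (9) */
        for (int i = 0; i < 2; i++)
            printf("cpu%d: w_k * s_k,%d / S_k = %.1f\n",
                   i, i, w_k * s_k[i] / S_k);
        return 0;
    }

This prints 682.7 for cpu 0 and 341.3 for cpu 1: any update to some s_k,i shifts every other cpu's share through the global S_k, which is exactly why computing the local W_i,0 cheaply is the hard part.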