Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/cpusets.txt | 141
1 file changed, 139 insertions(+), 2 deletions(-)
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index 85eeab5e7e32..141bef1c8599 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -19,7 +19,8 @@ CONTENTS:
   1.4 What are exclusive cpusets ?
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
-  1.7 How do I use cpusets ?
+  1.7 What is sched_load_balance ?
+  1.8 How do I use cpusets ?
  2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -359,8 +360,144 @@ policy, especially for jobs that might have one thread reading in the
 data set, the memory allocation across the nodes in the jobs cpuset
 can become very uneven.
 
+1.7 What is sched_load_balance ?
+--------------------------------
 
-1.7 How do I use cpusets ?
+The kernel scheduler (kernel/sched.c) automatically load balances
+tasks.  If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced.  So the scheduler
+has support to partition the system's CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, except those
+marked isolated using the kernel boot time "isolcpus=" argument.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+
+When the per-cpuset flag "sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwise pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "sched_load_balance" is disabled, the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the top cpuset flag
+"sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets should have this flag enabled.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendant cpusets.  Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "sched_load_balance", as those tasks aren't going anywhere
+else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest.  Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding.  So if each of two partially
+overlapping cpusets enables the flag 'sched_load_balance', then we
+form a single sched domain that is a superset of both.  We won't move
+a task to a CPU outside its cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "sched_load_balance" enabled,
+and the sched domain configuration.  If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+
+If two cpusets have partially overlapping 'cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above.  In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+
+The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+to most cpuset flags.)  When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(makes sure that all the CPUs in the cpus_allowed of that cpuset are
+in the same sched domain.)
+
+If two overlapping cpusets both have 'sched_load_balance' enabled,
+then they will be (must be) both in the same sched domain.
+
+If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+
+The kernel commits to user space that it will avoid load balancing
+where it can.  It will pick as fine a granularity partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+
+The internal kernel cpuset to scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system.  This partition is a set of subsets (represented
+as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
+the CPUs that must be load balanced.
+
+Whenever the 'sched_load_balance' flag changes, or CPUs come or go
+from a cpuset with this flag enabled, or a cpuset with this flag
+enabled is removed, the cpuset code builds a new such partition and
+passes it to the scheduler sched domain setup code, to have the sched
+domains rebuilt as necessary.
+
+This partition exactly defines what sched domains the scheduler should
+set up - one sched domain for each element (cpumask_t) in the partition.
+
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current, and updates its sched domains,
+removing the old and adding the new, for each change.
+
+1.8 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel