Diffstat (limited to 'Documentation/cpusets.txt')

 Documentation/cpusets.txt | 226 ++++++++++++++++++++++++++++++++++----------

1 file changed, 178 insertions(+), 48 deletions(-)
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index ec9de6917f01..141bef1c8599 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net
 Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
+Modified by Paul Menage <menage@google.com>
 
 CONTENTS:
 =========
@@ -16,9 +17,9 @@ CONTENTS:
   1.2 Why are cpusets needed ?
   1.3 How are cpusets implemented ?
   1.4 What are exclusive cpusets ?
-  1.5 What does notify_on_release do ?
-  1.6 What is memory_pressure ?
-  1.7 What is memory spread ?
+  1.5 What is memory_pressure ?
+  1.6 What is memory spread ?
+  1.7 What is sched_load_balance ?
   1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
@@ -44,18 +45,19 @@ hierarchy visible in a virtual file system.  These are the essential
 hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.
 
-Each task has a pointer to a cpuset.  Multiple tasks may reference
-the same cpuset.  Requests by a task, using the sched_setaffinity(2)
-system call to include CPUs in its CPU affinity mask, and using the
-mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
-in its memory policy, are both filtered through that tasks cpuset,
-filtering out any CPUs or Memory Nodes not in that cpuset.  The
-scheduler will not schedule a task on a CPU that is not allowed in
-its cpus_allowed vector, and the kernel page allocator will not
-allocate a page on a node that is not allowed in the requesting tasks
-mems_allowed vector.
-
-User level code may create and destroy cpusets by name in the cpuset
+Cpusets use the generic cgroup subsystem described in
+Documentation/cgroup.txt.
+
+Requests by a task, using the sched_setaffinity(2) system call to
+include CPUs in its CPU affinity mask, and using the mbind(2) and
+set_mempolicy(2) system calls to include Memory Nodes in its memory
+policy, are both filtered through that task's cpuset, filtering out any
+CPUs or Memory Nodes not in that cpuset.  The scheduler will not
+schedule a task on a CPU that is not allowed in its cpus_allowed
+vector, and the kernel page allocator will not allocate a page on a
+node that is not allowed in the requesting task's mems_allowed vector.
+
+User level code may create and destroy cpusets by name in the cgroup
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
 specify and query to which cpuset a task is assigned, and list the
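
As a quick illustration of the filtering described in the hunk above, the effect is visible from an ordinary shell.  A minimal sketch, assuming a shell already attached to a cpuset whose 'cpus' is "2-3" (taskset(1) wraps sched_setaffinity(2)):

  # Request affinity to CPUs 0-7; the request is filtered through
  # the cpuset, so only CPUs 2 and 3 can actually be granted.
  taskset -p 0xff $$
  # The effective mask reflects the cpuset, not the full request.
  grep Cpus_allowed /proc/$$/status
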
@@ -115,7 +117,7 @@ Cpusets extends these two mechanisms as follows:
  - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
    kernel.
  - Each task in the system is attached to a cpuset, via a pointer
-   in the task structure to a reference counted cpuset structure.
+   in the task structure to a reference counted cgroup structure.
  - Calls to sched_setaffinity are filtered to just those CPUs
    allowed in that task's cpuset.
  - Calls to mbind and set_mempolicy are filtered to just
@@ -145,15 +147,10 @@ into the rest of the kernel, none in performance critical paths:
  - in page_alloc.c, to restrict memory to allowed nodes.
  - in vmscan.c, to restrict page recovery to the current cpuset.
 
-In addition a new file system, of type "cpuset" may be mounted,
-typically at /dev/cpuset, to enable browsing and modifying the cpusets
-presently known to the kernel.  No new system calls are added for
-cpusets - all support for querying and modifying cpusets is via
-this cpuset file system.
-
-Each task under /proc has an added file named 'cpuset', displaying
-the cpuset name, as the path relative to the root of the cpuset file
-system.
+You should mount the "cgroup" filesystem type in order to enable
+browsing and modifying the cpusets presently known to the kernel.  No
+new system calls are added for cpusets - all support for querying and
+modifying cpusets is via this cpuset file system.
 
 The /proc/<pid>/status file for each task has two added lines,
 displaying the task's cpus_allowed (on which CPUs it may be scheduled)
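
The mount invocation referred to above appears verbatim in the usage examples later in this patch; for convenience, the pair of commands is:

  mkdir /dev/cpuset
  mount -t cgroup -ocpuset cpuset /dev/cpuset
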
@@ -163,16 +160,15 @@ in the format seen in the following example:
   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   Mems_allowed:   ffffffff,ffffffff
 
-Each cpuset is represented by a directory in the cpuset file system
-containing the following files describing that cpuset:
+Each cpuset is represented by a directory in the cgroup file system
+containing (on top of the standard cgroup files) the following
+files describing that cpuset:
 
  - cpus: list of CPUs in that cpuset
  - mems: list of Memory Nodes in that cpuset
  - memory_migrate flag: if set, move pages to cpuset's nodes
  - cpu_exclusive flag: is cpu placement exclusive?
  - mem_exclusive flag: is memory placement exclusive?
- - tasks: list of tasks (by pid) attached to that cpuset
- - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
  - memory_pressure: measure of how much paging pressure in cpuset
 
 In addition, the root cpuset only has the following file:
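
Once mounted, these control files appear as ordinary files in each cpuset's directory.  For instance, in the top cpuset (using the /dev/cpuset mount point assumed throughout this document):

  ls /dev/cpuset         # cpus, mems, the flag files, tasks, ...
  cat /dev/cpuset/cpus   # e.g. "0-7" on an 8 CPU machine
  cat /dev/cpuset/mems   # e.g. "0" on a single memory node machine
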
@@ -237,21 +233,7 @@ such as requests from interrupt handlers, is allowed to be taken
 outside even a mem_exclusive cpuset.
 
 
-1.5 What does notify_on_release do ?
-------------------------------------
-
-If the notify_on_release flag is enabled (1) in a cpuset, then whenever
-the last task in the cpuset leaves (exits or attaches to some other
-cpuset) and the last child cpuset of that cpuset is removed, then
-the kernel runs the command /sbin/cpuset_release_agent, supplying the
-pathname (relative to the mount point of the cpuset file system) of the
-abandoned cpuset.  This enables automatic removal of abandoned cpusets.
-The default value of notify_on_release in the root cpuset at system
-boot is disabled (0).  The default value of other cpusets at creation
-is the current value of their parents notify_on_release setting.
-
-
-1.6 What is memory_pressure ?
+1.5 What is memory_pressure ?
 -----------------------------
 The memory_pressure of a cpuset provides a simple per-cpuset metric
 of the rate that the tasks in a cpuset are attempting to free up in
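
This metric is designed to be cheap to poll.  A sketch of how a batch manager might read it, assuming a cpuset directory /dev/cpuset/Charlie and assuming the root cpuset's pressure-reporting flag (the root-only file mentioned above) has been enabled:

  # One open/read per poll; per the surrounding text, the value is
  # a recent rate of reclaim attempts by tasks in the cpuset, in
  # attempts per second, times 1000.
  cat /dev/cpuset/Charlie/memory_pressure
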
@@ -308,7 +290,7 @@ the tasks in the cpuset, in units of reclaims attempted per second,
 times 1000.
 
 
-1.7 What is memory spread ?
+1.6 What is memory spread ?
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
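
As a sketch of typical use, and assuming the two flag files are named memory_spread_page and memory_spread_slab (their names do not appear in this hunk):

  # Spread page cache and slab cache allocations evenly over the
  # cpuset's memory nodes instead of preferring the current node.
  echo 1 > /dev/cpuset/Charlie/memory_spread_page
  echo 1 > /dev/cpuset/Charlie/memory_spread_slab
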
@@ -378,6 +360,142 @@ policy, especially for jobs that might have one thread reading in the
 data set, the memory allocation across the nodes in the job's cpuset
 can become very uneven.
 
+1.7 What is sched_load_balance ?
+--------------------------------
+
+The kernel scheduler (kernel/sched.c) automatically load balances
+tasks.  If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced.  So the scheduler
+has support to partition the system's CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, except those
+marked isolated using the kernel boot time "isolcpus=" argument.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+
+When the per-cpuset flag "sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwise pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "sched_load_balance" is disabled, then the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the top cpuset flag
+"sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets have this flag enabled.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendant cpusets.  Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "sched_load_balance" as those tasks aren't going anywhere
+else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest.  Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding.  So if each of two partially
+overlapping cpusets enables the flag 'sched_load_balance', then we
+form a single sched domain that is a superset of both.  We won't move
+a task to a CPU outside its cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "sched_load_balance" enabled,
+and the sched domain configuration.  If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+
+If two cpusets have partially overlapping 'cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above.  In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+
+The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+to most cpuset flags).  When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(makes sure that all the CPUs in the cpus_allowed of that cpuset are
+in the same sched domain).
+
+If two overlapping cpusets both have 'sched_load_balance' enabled,
+then they will be (must be) both in the same sched domain.
+
+If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+
+The kernel commits to user space that it will avoid load balancing
+where it can.  It will pick as fine a granularity partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+
+The internal kernel cpuset to scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system.  This partition is a set of subsets (represented
+as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
+the CPUs that must be load balanced.
+
+Whenever the 'sched_load_balance' flag changes, or CPUs come or go
+from a cpuset with this flag enabled, or a cpuset with this flag
+enabled is removed, the cpuset code builds a new such partition and
+passes it to the scheduler sched domain setup code, to have the sched
+domains rebuilt as necessary.
+
+This partition exactly defines what sched domains the scheduler should
+set up - one sched domain for each element (cpumask_t) in the partition.
+
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current, and updates its sched domains,
+removing the old and adding the new, for each change.
+
 
 1.8 How do I use cpusets ?
 --------------------------
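
Tying section 1.7 above to concrete commands: for the two situations it lists (large partitioned systems, and realtime CPUs), a plausible sketch is to disable balancing at the top and re-enable it only inside each job's cpuset, reusing the mount point and the "Charlie" cpuset from the examples elsewhere in this document:

  # Tear down the single all-CPU sched domain...
  echo 0 > /dev/cpuset/sched_load_balance
  # ...then balance only within the job's own CPUs.
  mkdir /dev/cpuset/Charlie
  echo 2-3 > /dev/cpuset/Charlie/cpus
  echo 1 > /dev/cpuset/Charlie/mems
  echo 1 > /dev/cpuset/Charlie/sched_load_balance   # the default, shown for clarity
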
@@ -469,7 +587,7 @@ than stress the kernel.
 To start a new job that is to be contained within a cpuset, the steps are:
 
  1) mkdir /dev/cpuset
- 2) mount -t cpuset none /dev/cpuset
+ 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /dev/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
@@ -481,7 +599,7 @@ For example, the following sequence of commands will setup a cpuset
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
 and then start a subshell 'sh' in that cpuset:
 
-  mount -t cpuset none /dev/cpuset
+  mount -t cgroup -ocpuset cpuset /dev/cpuset
   cd /dev/cpuset
   mkdir Charlie
   cd Charlie
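
The hunk stops at "cd Charlie", but the surrounding text says where the example is headed (CPUs 2-3, Memory Node 1, then a subshell).  A sketch of the typical continuation, consistent with that description:

  /bin/echo 2-3 > cpus
  /bin/echo 1 > mems
  /bin/echo $$ > tasks   # attach the current shell to Charlie
  sh                     # this subshell now runs inside cpuset Charlie
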
@@ -513,7 +631,7 @@ Creating, modifying, using the cpusets can be done through the cpuset
 virtual filesystem.
 
 To mount it, type:
-# mount -t cpuset none /dev/cpuset
+# mount -t cgroup -o cpuset cpuset /dev/cpuset
 
 Then under /dev/cpuset you can find a tree that corresponds to the
 tree of the cpusets in the system.  For instance, /dev/cpuset
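
Since every cpuset is a directory under the mount point, the whole hierarchy can be listed with standard tools:

  # Each directory under /dev/cpuset is a cpuset.
  find /dev/cpuset -type d
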
@@ -556,6 +674,18 @@ To remove a cpuset, just use rmdir:
 This will fail if the cpuset is in use (has cpusets inside, or has
 processes attached).
 
+Note that for legacy reasons, the "cpuset" filesystem exists as a
+wrapper around the cgroup filesystem.
+
+The command
+
+mount -t cpuset X /dev/cpuset
+
+is equivalent to
+
+mount -t cgroup -ocpuset X /dev/cpuset
+echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
+
 2.2 Adding/removing cpus
 ------------------------
 
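
Referring back to the legacy-wrapper note in the hunk above, a small sketch to confirm which flavour is mounted on a running system (mount point assumed as elsewhere in this document):

  # A legacy "cpuset" mount shows up as a cgroup mount with the
  # cpuset subsystem enabled.
  grep cpuset /proc/mounts
  # The wrapper also preconfigures the release agent, as shown above.
  cat /dev/cpuset/release_agent
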