Diffstat (limited to 'Documentation/cpusets.txt')
-rw-r--r--  Documentation/cpusets.txt | 226
 1 file changed, 178 insertions, 48 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index ec9de6917f01..141bef1c8599 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net
 Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
+Modified by Paul Menage <menage@google.com>
 
 CONTENTS:
 =========
@@ -16,9 +17,9 @@ CONTENTS:
   1.2 Why are cpusets needed ?
   1.3 How are cpusets implemented ?
   1.4 What are exclusive cpusets ?
-  1.5 What does notify_on_release do ?
-  1.6 What is memory_pressure ?
-  1.7 What is memory spread ?
+  1.5 What is memory_pressure ?
+  1.6 What is memory spread ?
+  1.7 What is sched_load_balance ?
   1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
@@ -44,18 +45,19 @@ hierarchy visible in a virtual file system. These are the essential
 hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.
 
-Each task has a pointer to a cpuset. Multiple tasks may reference
-the same cpuset. Requests by a task, using the sched_setaffinity(2)
-system call to include CPUs in its CPU affinity mask, and using the
-mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
-in its memory policy, are both filtered through that tasks cpuset,
-filtering out any CPUs or Memory Nodes not in that cpuset. The
-scheduler will not schedule a task on a CPU that is not allowed in
-its cpus_allowed vector, and the kernel page allocator will not
-allocate a page on a node that is not allowed in the requesting tasks
-mems_allowed vector.
-
-User level code may create and destroy cpusets by name in the cpuset
+Cpusets use the generic cgroup subsystem described in
+Documentation/cgroup.txt.
+
+Requests by a task, using the sched_setaffinity(2) system call to
+include CPUs in its CPU affinity mask, and using the mbind(2) and
+set_mempolicy(2) system calls to include Memory Nodes in its memory
+policy, are both filtered through that tasks cpuset, filtering out any
+CPUs or Memory Nodes not in that cpuset. The scheduler will not
+schedule a task on a CPU that is not allowed in its cpus_allowed
+vector, and the kernel page allocator will not allocate a page on a
+node that is not allowed in the requesting tasks mems_allowed vector.
+
+User level code may create and destroy cpusets by name in the cgroup
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
 specify and query to which cpuset a task is assigned, and list the
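For illustration (not part of the patch itself): the filtering described in this hunk can be observed from a shell that has been attached to a cpuset. Assuming a cpuset whose 'cpus' file holds 2-3, as in the "Charlie" example later in this document, a request for a wider CPU affinity mask is trimmed down to the cpuset's CPUs; taskset(1) is used here only as a convenient front end to sched_setaffinity(2):

  taskset -p 0xff $$                    # ask for CPUs 0-7 for this shell
  grep Cpus_allowed /proc/$$/status     # resulting mask covers only the cpuset's CPUs 2-3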
@@ -115,7 +117,7 @@ Cpusets extends these two mechanisms as follows:
  - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
    kernel.
  - Each task in the system is attached to a cpuset, via a pointer
-   in the task structure to a reference counted cpuset structure.
+   in the task structure to a reference counted cgroup structure.
  - Calls to sched_setaffinity are filtered to just those CPUs
    allowed in that tasks cpuset.
  - Calls to mbind and set_mempolicy are filtered to just
@@ -145,15 +147,10 @@ into the rest of the kernel, none in performance critical paths:
  - in page_alloc.c, to restrict memory to allowed nodes.
  - in vmscan.c, to restrict page recovery to the current cpuset.
 
-In addition a new file system, of type "cpuset" may be mounted,
-typically at /dev/cpuset, to enable browsing and modifying the cpusets
-presently known to the kernel. No new system calls are added for
-cpusets - all support for querying and modifying cpusets is via
-this cpuset file system.
-
-Each task under /proc has an added file named 'cpuset', displaying
-the cpuset name, as the path relative to the root of the cpuset file
-system.
+You should mount the "cgroup" filesystem type in order to enable
+browsing and modifying the cpusets presently known to the kernel. No
+new system calls are added for cpusets - all support for querying and
+modifying cpusets is via this cpuset file system.
 
 The /proc/<pid>/status file for each task has two added lines,
 displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
@@ -163,16 +160,15 @@ in the format seen in the following example:
 Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
 Mems_allowed:   ffffffff,ffffffff
 
-Each cpuset is represented by a directory in the cpuset file system
-containing the following files describing that cpuset:
+Each cpuset is represented by a directory in the cgroup file system
+containing (on top of the standard cgroup files) the following
+files describing that cpuset:
 
  - cpus: list of CPUs in that cpuset
  - mems: list of Memory Nodes in that cpuset
  - memory_migrate flag: if set, move pages to cpusets nodes
  - cpu_exclusive flag: is cpu placement exclusive?
  - mem_exclusive flag: is memory placement exclusive?
- - tasks: list of tasks (by pid) attached to that cpuset
- - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
  - memory_pressure: measure of how much paging pressure in cpuset
 
 In addition, the root cpuset only has the following file:
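For illustration (not part of the patch itself): once the hierarchy is mounted at /dev/cpuset as described later in this document, each cpuset directory contains the control files listed above alongside the standard cgroup files (such as 'tasks'). A hypothetical look at the "Charlie" cpuset from the later example:

  ls /dev/cpuset/Charlie        # cpus, mems, the flag files above, plus the standard cgroup files
  cat /dev/cpuset/Charlie/cpus  # the CPU list assigned to this cpuset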
@@ -237,21 +233,7 @@ such as requests from interrupt handlers, is allowed to be taken
 outside even a mem_exclusive cpuset.
 
 
-1.5 What does notify_on_release do ?
-------------------------------------
-
-If the notify_on_release flag is enabled (1) in a cpuset, then whenever
-the last task in the cpuset leaves (exits or attaches to some other
-cpuset) and the last child cpuset of that cpuset is removed, then
-the kernel runs the command /sbin/cpuset_release_agent, supplying the
-pathname (relative to the mount point of the cpuset file system) of the
-abandoned cpuset. This enables automatic removal of abandoned cpusets.
-The default value of notify_on_release in the root cpuset at system
-boot is disabled (0). The default value of other cpusets at creation
-is the current value of their parents notify_on_release setting.
-
-
-1.6 What is memory_pressure ?
+1.5 What is memory_pressure ?
 -----------------------------
 The memory_pressure of a cpuset provides a simple per-cpuset metric
 of the rate that the tasks in a cpuset are attempting to free up in
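For illustration (not part of the patch itself): memory_pressure is a read-only per-cpuset file, and it is only computed while the root cpuset's memory_pressure_enabled flag is set (that flag is the root-only file referred to earlier; its name comes from the rest of cpusets.txt, not from the hunks shown here). A hedged sketch, reusing the hypothetical "Charlie" cpuset:

  echo 1 > /dev/cpuset/memory_pressure_enabled   # turn the metric on system-wide
  cat /dev/cpuset/Charlie/memory_pressure        # recent reclaim rate, in reclaims/sec times 1000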
@@ -308,7 +290,7 @@ the tasks in the cpuset, in units of reclaims attempted per second,
 times 1000.
 
 
-1.7 What is memory spread ?
+1.6 What is memory spread ?
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
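For illustration (not part of the patch itself): the two boolean flag files this section refers to are memory_spread_page and memory_spread_slab (those names come from the rest of cpusets.txt, not from the hunk shown here). Turning page cache spreading on for one cpuset might look like:

  echo 1 > /dev/cpuset/Charlie/memory_spread_page   # spread page cache evenly over Charlie's mems
  cat /dev/cpuset/Charlie/memory_spread_page        # reads back 1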
@@ -378,6 +360,142 @@ policy, especially for jobs that might have one thread reading in the
 data set, the memory allocation across the nodes in the jobs cpuset
 can become very uneven.
 
+1.7 What is sched_load_balance ?
+--------------------------------
+
+The kernel scheduler (kernel/sched.c) automatically load balances
+tasks. If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced. So the scheduler
+has support to partition the systems CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, except those
+marked isolated using the kernel boot time "isolcpus=" argument.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+
+When the per-cpuset flag "sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpusets allowed 'cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwise pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "sched_load_balance" is disabled, then the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the top cpuset flag
+"sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets have this flag enabled.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendent cpusets. Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "sched_load_balance" as those tasks aren't going anywhere
+else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest. Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding. So if each of two partially
+overlapping cpusets enables the flag 'sched_load_balance', then we
+form a single sched domain that is a superset of both. We won't move
+a task to a CPU outside its cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "sched_load_balance" enabled,
+and the sched domain configuration. If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+
+If two cpusets have partially overlapping 'cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above. In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+
+The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+to most cpuset flags.) When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(makes sure that all the CPUs in the cpus_allowed of that cpuset are
+in the same sched domain.)
+
+If two overlapping cpusets both have 'sched_load_balance' enabled,
+then they will be (must be) both in the same sched domain.
+
+If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+
+The kernel commits to user space that it will avoid load balancing
+where it can. It will pick as fine a granularity partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+
+The internal kernel cpuset to scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system. This partition is a set of subsets (represented
+as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
+the CPUs that must be load balanced.
+
+Whenever the 'sched_load_balance' flag changes, or CPUs come or go
+from a cpuset with this flag enabled, or a cpuset with this flag
+enabled is removed, the cpuset code builds a new such partition and
+passes it to the scheduler sched domain setup code, to have the sched
+domains rebuilt as necessary.
+
+This partition exactly defines what sched domains the scheduler should
+setup - one sched domain for each element (cpumask_t) in the partition.
+
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current, and updates its sched domains,
+removing the old and adding the new, for each change.
 
 1.8 How do I use cpusets ?
 --------------------------
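For illustration (not part of the patch itself): the usual way to handle the two situations described in the sched_load_balance section above is to clear the flag in the top cpuset and set it only in the child cpusets that should each form their own sched domain. A minimal sketch, assuming the hierarchy is mounted at /dev/cpuset and that child cpusets "batch" and "rt" already exist with disjoint 'cpus' (both names are hypothetical):

  echo 0 > /dev/cpuset/sched_load_balance        # no single system-wide sched domain
  echo 1 > /dev/cpuset/batch/sched_load_balance  # balance only within this job's CPUs
  echo 0 > /dev/cpuset/rt/sched_load_balance     # leave the realtime CPUs unbalanced

After such writes the cpuset code rebuilds the sched domain partition and hands it to partition_sched_domains(), as described above.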
@@ -469,7 +587,7 @@ than stress the kernel.
 To start a new job that is to be contained within a cpuset, the steps are:
 
  1) mkdir /dev/cpuset
- 2) mount -t cpuset none /dev/cpuset
+ 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
  3) Create the new cpuset by doing mkdir's and write's (or echo's) in
     the /dev/cpuset virtual file system.
  4) Start a task that will be the "founding father" of the new job.
@@ -481,7 +599,7 @@ For example, the following sequence of commands will setup a cpuset
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
 and then start a subshell 'sh' in that cpuset:
 
-  mount -t cpuset none /dev/cpuset
+  mount -t cgroup -ocpuset cpuset /dev/cpuset
   cd /dev/cpuset
   mkdir Charlie
   cd Charlie
@@ -513,7 +631,7 @@ Creating, modifying, using the cpusets can be done through the cpuset
 virtual filesystem.
 
 To mount it, type:
-# mount -t cpuset none /dev/cpuset
+# mount -t cgroup -o cpuset cpuset /dev/cpuset
 
 Then under /dev/cpuset you can find a tree that corresponds to the
 tree of the cpusets in the system. For instance, /dev/cpuset
@@ -556,6 +674,18 @@ To remove a cpuset, just use rmdir:
 This will fail if the cpuset is in use (has cpusets inside, or has
 processes attached).
 
+Note that for legacy reasons, the "cpuset" filesystem exists as a
+wrapper around the cgroup filesystem.
+
+The command
+
+mount -t cpuset X /dev/cpuset
+
+is equivalent to
+
+mount -t cgroup -ocpuset X /dev/cpuset
+echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
+
 2.2 Adding/removing cpus
 ------------------------
 
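For illustration (not part of the patch itself): two quick checks that tie the tail of this patch together. The first confirms how the hierarchy ended up mounted (whether via the legacy "cpuset" wrapper or directly as "cgroup"); the rest sketches the kind of 'cpus' edit that section 2.2, whose body lies outside this hunk, goes on to cover. Paths and values reuse the hypothetical "Charlie" cpuset:

  grep /dev/cpuset /proc/mounts      # shows the cgroup (or legacy cpuset) mount at /dev/cpuset
  cd /dev/cpuset/Charlie
  cat cpus                           # e.g. 2-3, as set in the earlier example
  /bin/echo 2-4 > cpus               # rewrite this cpuset's CPU list to CPUs 2 through 4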