| author | Li Zefan <lizf@cn.fujitsu.com> | 2009-02-20 18:38:48 -0500 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2009-02-20 20:57:49 -0500 |
| commit | 3fd076dd955a34c35dc456f4ef676e03cdced044 (patch) | |
| tree | c3eff65f38b43224d0142d1db1dbd9def0edbd4a | |
| parent | 152de30bced150617e5731a9fe2364c9d04fe26c (diff) | |
cpuset: various documentation fixes and updates
I noticed the old commit 8f5aa26c75b7722e80c0c5c5bb833d41865d7019
("cpusets: update_cpumask documentation fix") is not a complete fix,
resulting in inconsistent paragraphs. This patch fixes it and does other
fixes and updates:
- s/migrate_all_tasks()/migrate_live_tasks()/
- describe more cpuset control files
- s/cpumask_t/struct cpumask/
- document cpu hotplug and change of 'sched_relax_domain_level' may cause
domain rebuild
- document various ways to query and modify cpusets
- the equivalent of "mount -t cpuset" is "mount -t cgroup -o cpuset,noprefix"
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| -rw-r--r-- | Documentation/cgroups/cpusets.txt | 65 |
1 files changed, 37 insertions, 28 deletions
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5c86c258c791..0611e9528c7c 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
| @@ -142,7 +142,7 @@ into the rest of the kernel, none in performance critical paths: | |||
| 142 | - in fork and exit, to attach and detach a task from its cpuset. | 142 | - in fork and exit, to attach and detach a task from its cpuset. |
| 143 | - in sched_setaffinity, to mask the requested CPUs by what's | 143 | - in sched_setaffinity, to mask the requested CPUs by what's |
| 144 | allowed in that tasks cpuset. | 144 | allowed in that tasks cpuset. |
| 145 | - in sched.c migrate_all_tasks(), to keep migrating tasks within | 145 | - in sched.c migrate_live_tasks(), to keep migrating tasks within |
| 146 | the CPUs allowed by their cpuset, if possible. | 146 | the CPUs allowed by their cpuset, if possible. |
| 147 | - in the mbind and set_mempolicy system calls, to mask the requested | 147 | - in the mbind and set_mempolicy system calls, to mask the requested |
| 148 | Memory Nodes by what's allowed in that tasks cpuset. | 148 | Memory Nodes by what's allowed in that tasks cpuset. |
| @@ -175,6 +175,10 @@ files describing that cpuset: | |||
| 175 | - mem_exclusive flag: is memory placement exclusive? | 175 | - mem_exclusive flag: is memory placement exclusive? |
| 176 | - mem_hardwall flag: is memory allocation hardwalled | 176 | - mem_hardwall flag: is memory allocation hardwalled |
| 177 | - memory_pressure: measure of how much paging pressure in cpuset | 177 | - memory_pressure: measure of how much paging pressure in cpuset |
| 178 | - memory_spread_page flag: if set, spread page cache evenly on allowed nodes | ||
| 179 | - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes | ||
| 180 | - sched_load_balance flag: if set, load balance within CPUs on that cpuset | ||
| 181 | - sched_relax_domain_level: the searching range when migrating tasks | ||
| 178 | 182 | ||
| 179 | In addition, the root cpuset only has the following file: | 183 | In addition, the root cpuset only has the following file: |
| 180 | - memory_pressure_enabled flag: compute memory_pressure? | 184 | - memory_pressure_enabled flag: compute memory_pressure? |
| @@ -252,7 +256,7 @@ is causing. | |||
| 252 | 256 | ||
| 253 | This is useful both on tightly managed systems running a wide mix of | 257 | This is useful both on tightly managed systems running a wide mix of |
| 254 | submitted jobs, which may choose to terminate or re-prioritize jobs that | 258 | submitted jobs, which may choose to terminate or re-prioritize jobs that |
| 255 | are trying to use more memory than allowed on the nodes assigned them, | 259 | are trying to use more memory than allowed on the nodes assigned to them, |
| 256 | and with tightly coupled, long running, massively parallel scientific | 260 | and with tightly coupled, long running, massively parallel scientific |
| 257 | computing jobs that will dramatically fail to meet required performance | 261 | computing jobs that will dramatically fail to meet required performance |
| 258 | goals if they start to use more memory than allowed to them. | 262 | goals if they start to use more memory than allowed to them. |
| @@ -378,7 +382,7 @@ as cpusets and sched_setaffinity. | |||
| 378 | The algorithmic cost of load balancing and its impact on key shared | 382 | The algorithmic cost of load balancing and its impact on key shared |
| 379 | kernel data structures such as the task list increases more than | 383 | kernel data structures such as the task list increases more than |
| 380 | linearly with the number of CPUs being balanced. So the scheduler | 384 | linearly with the number of CPUs being balanced. So the scheduler |
| 381 | has support to partition the systems CPUs into a number of sched | 385 | has support to partition the systems CPUs into a number of sched |
| 382 | domains such that it only load balances within each sched domain. | 386 | domains such that it only load balances within each sched domain. |
| 383 | Each sched domain covers some subset of the CPUs in the system; | 387 | Each sched domain covers some subset of the CPUs in the system; |
| 384 | no two sched domains overlap; some CPUs might not be in any sched | 388 | no two sched domains overlap; some CPUs might not be in any sched |
| @@ -485,17 +489,22 @@ of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | |||
| 485 | The internal kernel cpuset to scheduler interface passes from the | 489 | The internal kernel cpuset to scheduler interface passes from the |
| 486 | cpuset code to the scheduler code a partition of the load balanced | 490 | cpuset code to the scheduler code a partition of the load balanced |
| 487 | CPUs in the system. This partition is a set of subsets (represented | 491 | CPUs in the system. This partition is a set of subsets (represented |
| 488 | as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all | 492 | as an array of struct cpumask) of CPUs, pairwise disjoint, that cover |
| 489 | the CPUs that must be load balanced. | 493 | all the CPUs that must be load balanced. |
| 490 | 494 | ||
| 491 | Whenever the 'sched_load_balance' flag changes, or CPUs come or go | 495 | The cpuset code builds a new such partition and passes it to the |
| 492 | from a cpuset with this flag enabled, or a cpuset with this flag | 496 | scheduler sched domain setup code, to have the sched domains rebuilt |
| 493 | enabled is removed, the cpuset code builds a new such partition and | 497 | as necessary, whenever: |
| 494 | passes it to the scheduler sched domain setup code, to have the sched | 498 | - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes, |
| 495 | domains rebuilt as necessary. | 499 | - or CPUs come or go from a cpuset with this flag enabled, |
| 500 | - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs | ||
| 501 | and with this flag enabled changes, | ||
| 502 | - or a cpuset with non-empty CPUs and with this flag enabled is removed, | ||
| 503 | - or a cpu is offlined/onlined. | ||
| 496 | 504 | ||
| 497 | This partition exactly defines what sched domains the scheduler should | 505 | This partition exactly defines what sched domains the scheduler should |
| 498 | setup - one sched domain for each element (cpumask_t) in the partition. | 506 | setup - one sched domain for each element (struct cpumask) in the |
| 507 | partition. | ||
| 499 | 508 | ||
| 500 | The scheduler remembers the currently active sched domain partitions. | 509 | The scheduler remembers the currently active sched domain partitions. |
| 501 | When the scheduler routine partition_sched_domains() is invoked from | 510 | When the scheduler routine partition_sched_domains() is invoked from |
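Note on the hunk above: the text describes the partition handed to the scheduler as a set of CPU subsets that are pairwise disjoint and cover all CPUs that must be load balanced. The following is a minimal Python sketch of that property only, not the kernel code. CPU masks are modelled as Python sets of CPU numbers, and the name `is_valid_partition` is hypothetical.

```python
from itertools import combinations


def is_valid_partition(partition, balanced_cpus):
    """Return True if `partition` is pairwise disjoint and covers `balanced_cpus`.

    `partition` is a list of sets of CPU numbers (a stand-in for the array
    of struct cpumask); `balanced_cpus` is the set of CPUs that must be
    load balanced. Both names are illustrative, not kernel identifiers.
    """
    # No two sched domains may share a CPU.
    for a, b in combinations(partition, 2):
        if a & b:
            return False
    # Together the subsets must cover every load-balanced CPU.
    covered = set().union(*partition)
    return balanced_cpus <= covered
```

For example, `[{0, 1}, {2, 3}]` is a valid partition of CPUs 0-3, while `[{0, 1}, {1, 2}]` is rejected because CPU 1 appears in two subsets.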
| @@ -559,7 +568,7 @@ domain, the largest value among those is used. Be careful, if one | |||
| 559 | requests 0 and others are -1 then 0 is used. | 568 | requests 0 and others are -1 then 0 is used. |
| 560 | 569 | ||
| 561 | Note that modifying this file will have both good and bad effects, | 570 | Note that modifying this file will have both good and bad effects, |
| 562 | and whether it is acceptable or not will be depend on your situation. | 571 | and whether it is acceptable or not depends on your situation. |
| 563 | Don't modify this file if you are not sure. | 572 | Don't modify this file if you are not sure. |
| 564 | 573 | ||
| 565 | If your situation is: | 574 | If your situation is: |
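Note on the hunk above: the context lines state that when several cpusets request different 'sched_relax_domain_level' values for one sched domain, the largest value is used, and that a request of 0 wins over -1. A minimal Python sketch of that rule follows. The function name `effective_relax_level` is hypothetical, and the sketch models only the selection rule, not how the kernel applies the result.

```python
def effective_relax_level(requested_levels):
    """Pick the level used when several cpusets overlap in one sched domain.

    -1 means "no request"; per the text, the largest requested value wins,
    so a single explicit 0 overrides any number of -1 entries.
    """
    if not requested_levels:
        raise ValueError("need at least one requested level")
    return max(requested_levels)
```

This makes the warning in the text concrete: `effective_relax_level([-1, 0, -1])` selects 0, so one cpuset that writes 0 determines the level for the whole domain.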
| @@ -600,19 +609,15 @@ to allocate a page of memory for that task. | |||
| 600 | 609 | ||
| 601 | If a cpuset has its 'cpus' modified, then each task in that cpuset | 610 | If a cpuset has its 'cpus' modified, then each task in that cpuset |
| 602 | will have its allowed CPU placement changed immediately. Similarly, | 611 | will have its allowed CPU placement changed immediately. Similarly, |
| 603 | if a tasks pid is written to a cpusets 'tasks' file, in either its | 612 | if a tasks pid is written to another cpusets 'tasks' file, then its |
| 604 | current cpuset or another cpuset, then its allowed CPU placement is | 613 | allowed CPU placement is changed immediately. If such a task had been |
| 605 | changed immediately. If such a task had been bound to some subset | 614 | bound to some subset of its cpuset using the sched_setaffinity() call, |
| 606 | of its cpuset using the sched_setaffinity() call, the task will be | 615 | the task will be allowed to run on any CPU allowed in its new cpuset, |
| 607 | allowed to run on any CPU allowed in its new cpuset, negating the | 616 | negating the effect of the prior sched_setaffinity() call. |
| 608 | affect of the prior sched_setaffinity() call. | ||
| 609 | 617 | ||
| 610 | In summary, the memory placement of a task whose cpuset is changed is | 618 | In summary, the memory placement of a task whose cpuset is changed is |
| 611 | updated by the kernel, on the next allocation of a page for that task, | 619 | updated by the kernel, on the next allocation of a page for that task, |
| 612 | but the processor placement is not updated, until that tasks pid is | 620 | and the processor placement is updated immediately. |
| 613 | rewritten to the 'tasks' file of its cpuset. This is done to avoid | ||
| 614 | impacting the scheduler code in the kernel with a check for changes | ||
| 615 | in a tasks processor placement. | ||
| 616 | 621 | ||
| 617 | Normally, once a page is allocated (given a physical page | 622 | Normally, once a page is allocated (given a physical page |
| 618 | of main memory) then that page stays on whatever node it | 623 | of main memory) then that page stays on whatever node it |
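Note on the hunk above: the updated text says that moving a task to another cpuset changes its allowed CPUs immediately and discards any earlier sched_setaffinity() binding, while an earlier hunk notes that sched_setaffinity requests are masked by the task's cpuset. The toy Python model below illustrates both rules under those assumptions. The class and method names are hypothetical, and nothing here calls into the kernel.

```python
class Task:
    """Toy model of a task's allowed CPUs; not kernel code."""

    def __init__(self, cpuset_cpus):
        self.cpuset_cpus = set(cpuset_cpus)
        self.allowed = set(cpuset_cpus)

    def setaffinity(self, requested):
        # sched_setaffinity: the request is masked by the cpuset's CPUs.
        masked = set(requested) & self.cpuset_cpus
        if not masked:
            raise ValueError("no requested CPU is allowed by the cpuset")
        self.allowed = masked

    def move_to_cpuset(self, new_cpuset_cpus):
        # Writing the pid to another cpuset's 'tasks' file: placement
        # changes immediately and any earlier affinity binding is dropped.
        self.cpuset_cpus = set(new_cpuset_cpus)
        self.allowed = set(new_cpuset_cpus)
```

A task created in a cpuset with CPUs 0-3 that requests CPUs {1, 7} ends up allowed on CPU 1 only; after `move_to_cpuset({4, 5})` it is allowed on both CPUs 4 and 5, regardless of the earlier binding.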
| @@ -681,10 +686,14 @@ and then start a subshell 'sh' in that cpuset: | |||
| 681 | # The next line should display '/Charlie' | 686 | # The next line should display '/Charlie' |
| 682 | cat /proc/self/cpuset | 687 | cat /proc/self/cpuset |
| 683 | 688 | ||
| 684 | In the future, a C library interface to cpusets will likely be | 689 | There are ways to query or modify cpusets: |
| 685 | available. For now, the only way to query or modify cpusets is | 690 | - via the cpuset file system directly, using the various cd, mkdir, echo, |
| 686 | via the cpuset file system, using the various cd, mkdir, echo, cat, | 691 | cat, rmdir commands from the shell, or their equivalent from C. |
| 687 | rmdir commands from the shell, or their equivalent from C. | 692 | - via the C library libcpuset. |
| 693 | - via the C library libcgroup. | ||
| 694 | (http://sourceforge.net/proects/libcg/) | ||
| 695 | - via the python application cset. | ||
| 696 | (http://developer.novell.com/wiki/index.php/Cpuset) | ||
| 688 | 697 | ||
| 689 | The sched_setaffinity calls can also be done at the shell prompt using | 698 | The sched_setaffinity calls can also be done at the shell prompt using |
| 690 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | 699 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy |
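Note on the hunk above: the first listed way to query cpusets is through the cpuset file system, using shell commands or their equivalent from a program. As a hedged illustration, the Python sketch below reads the same files that `cat` would. The helper names are hypothetical, and the `/dev/cpuset` location and unprefixed file names such as `cpus` assume the mount shown elsewhere in this document.

```python
from pathlib import Path


def current_cpuset(proc_file="/proc/self/cpuset"):
    """Return the cpuset path of the calling process, e.g. '/Charlie'."""
    return Path(proc_file).read_text().strip()


def read_cpuset_file(cpuset_dir, name):
    """Return the stripped contents of one control file in a cpuset directory.

    `cpuset_dir` is a directory such as /dev/cpuset/Charlie and `name` is a
    control file such as 'cpus', 'mems' or 'tasks'.
    """
    return (Path(cpuset_dir) / name).read_text().strip()
```

Modifying a cpuset would follow the same pattern with a write to the control file, which typically requires appropriate privileges.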
| @@ -756,7 +765,7 @@ mount -t cpuset X /dev/cpuset | |||
| 756 | 765 | ||
| 757 | is equivalent to | 766 | is equivalent to |
| 758 | 767 | ||
| 759 | mount -t cgroup -ocpuset X /dev/cpuset | 768 | mount -t cgroup -ocpuset,noprefix X /dev/cpuset |
| 760 | echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | 769 | echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent |
| 761 | 770 | ||
| 762 | 2.2 Adding/removing cpus | 771 | 2.2 Adding/removing cpus |
