Diffstat (limited to 'Documentation/cgroups/cpusets.txt')
-rw-r--r-- Documentation/cgroups/cpusets.txt | 65
1 file changed, 37 insertions, 28 deletions
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5c86c258c791..0611e9528c7c 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -142,7 +142,7 @@ into the rest of the kernel, none in performance critical paths:
  - in fork and exit, to attach and detach a task from its cpuset.
  - in sched_setaffinity, to mask the requested CPUs by what's
    allowed in that tasks cpuset.
- - in sched.c migrate_all_tasks(), to keep migrating tasks within
+ - in sched.c migrate_live_tasks(), to keep migrating tasks within
    the CPUs allowed by their cpuset, if possible.
  - in the mbind and set_mempolicy system calls, to mask the requested
    Memory Nodes by what's allowed in that tasks cpuset.
@@ -175,6 +175,10 @@ files describing that cpuset:
  - mem_exclusive flag: is memory placement exclusive?
  - mem_hardwall flag: is memory allocation hardwalled
  - memory_pressure: measure of how much paging pressure in cpuset
+ - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - sched_relax_domain_level: the searching range when migrating tasks
 
 In addition, the root cpuset only has the following file:
  - memory_pressure_enabled flag: compute memory_pressure?
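The four files added to the list in this hunk are ordinary files in each cpuset directory, so they can be read with standard shell tools. The following is a minimal sketch, not part of the patch: `show_cpuset_files` is a hypothetical helper name, and it assumes the cpuset hierarchy is mounted with unprefixed file names (for example under /dev/cpuset, as in the document's own examples).

```shell
# Hypothetical helper (not part of the patch): print the four control
# files that this hunk adds to the documented list, for one cpuset
# directory. Assumes unprefixed file names, as with "mount -t cpuset".
show_cpuset_files() {
    dir=$1
    for f in memory_spread_page memory_spread_slab \
             sched_load_balance sched_relax_domain_level; do
        if [ -r "$dir/$f" ]; then
            # Each file holds a single short value.
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s: (not present)\n' "$f"
        fi
    done
}
```

A call such as `show_cpuset_files /dev/cpuset/Charlie` would print one line per file; on a kernel without these files, each line reads "(not present)".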
@@ -252,7 +256,7 @@ is causing.
 
 This is useful both on tightly managed systems running a wide mix of
 submitted jobs, which may choose to terminate or re-prioritize jobs that
-are trying to use more memory than allowed on the nodes assigned them,
+are trying to use more memory than allowed on the nodes assigned to them,
 and with tightly coupled, long running, massively parallel scientific
 computing jobs that will dramatically fail to meet required performance
 goals if they start to use more memory than allowed to them.
@@ -378,7 +382,7 @@ as cpusets and sched_setaffinity.
 The algorithmic cost of load balancing and its impact on key shared
 kernel data structures such as the task list increases more than
 linearly with the number of CPUs being balanced. So the scheduler
-has support to partition the systems CPUs into a number of sched
+has support to partition the systems CPUs into a number of sched
 domains such that it only load balances within each sched domain.
 Each sched domain covers some subset of the CPUs in the system;
 no two sched domains overlap; some CPUs might not be in any sched
@@ -485,17 +489,22 @@ of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
 The internal kernel cpuset to scheduler interface passes from the
 cpuset code to the scheduler code a partition of the load balanced
 CPUs in the system. This partition is a set of subsets (represented
-as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
-the CPUs that must be load balanced.
+as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
+all the CPUs that must be load balanced.
 
-Whenever the 'sched_load_balance' flag changes, or CPUs come or go
-from a cpuset with this flag enabled, or a cpuset with this flag
-enabled is removed, the cpuset code builds a new such partition and
-passes it to the scheduler sched domain setup code, to have the sched
-domains rebuilt as necessary.
+The cpuset code builds a new such partition and passes it to the
+scheduler sched domain setup code, to have the sched domains rebuilt
+as necessary, whenever:
+ - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - or CPUs come or go from a cpuset with this flag enabled,
+ - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
+   and with this flag enabled changes,
+ - or a cpuset with non-empty CPUs and with this flag enabled is removed,
+ - or a cpu is offlined/onlined.
 
 This partition exactly defines what sched domains the scheduler should
-setup - one sched domain for each element (cpumask_t) in the partition.
+setup - one sched domain for each element (struct cpumask) in the
+partition.
 
 The scheduler remembers the currently active sched domain partitions.
 When the scheduler routine partition_sched_domains() is invoked from
@@ -559,7 +568,7 @@ domain, the largest value among those is used. Be careful, if one
 requests 0 and others are -1 then 0 is used.
 
 Note that modifying this file will have both good and bad effects,
-and whether it is acceptable or not will be depend on your situation.
+and whether it is acceptable or not depends on your situation.
 Don't modify this file if you are not sure.
 
 If your situation is:
@@ -600,19 +609,15 @@ to allocate a page of memory for that task.
 
 If a cpuset has its 'cpus' modified, then each task in that cpuset
 will have its allowed CPU placement changed immediately. Similarly,
-if a tasks pid is written to a cpusets 'tasks' file, in either its
-current cpuset or another cpuset, then its allowed CPU placement is
-changed immediately. If such a task had been bound to some subset
-of its cpuset using the sched_setaffinity() call, the task will be
-allowed to run on any CPU allowed in its new cpuset, negating the
-affect of the prior sched_setaffinity() call.
+if a tasks pid is written to another cpusets 'tasks' file, then its
+allowed CPU placement is changed immediately. If such a task had been
+bound to some subset of its cpuset using the sched_setaffinity() call,
+the task will be allowed to run on any CPU allowed in its new cpuset,
+negating the effect of the prior sched_setaffinity() call.
 
 In summary, the memory placement of a task whose cpuset is changed is
 updated by the kernel, on the next allocation of a page for that task,
-but the processor placement is not updated, until that tasks pid is
-rewritten to the 'tasks' file of its cpuset. This is done to avoid
-impacting the scheduler code in the kernel with a check for changes
-in a tasks processor placement.
+and the processor placement is updated immediately.
 
 Normally, once a page is allocated (given a physical page
 of main memory) then that page stays on whatever node it
@@ -681,10 +686,14 @@ and then start a subshell 'sh' in that cpuset:
   # The next line should display '/Charlie'
   cat /proc/self/cpuset
 
-In the future, a C library interface to cpusets will likely be
-available. For now, the only way to query or modify cpusets is
-via the cpuset file system, using the various cd, mkdir, echo, cat,
-rmdir commands from the shell, or their equivalent from C.
+There are ways to query or modify cpusets:
+ - via the cpuset file system directly, using the various cd, mkdir, echo,
+   cat, rmdir commands from the shell, or their equivalent from C.
+ - via the C library libcpuset.
+ - via the C library libcgroup.
+   (http://sourceforge.net/proects/libcg/)
+ - via the python application cset.
+   (http://developer.novell.com/wiki/index.php/Cpuset)
 
 The sched_setaffinity calls can also be done at the shell prompt using
 SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
@@ -756,7 +765,7 @@ mount -t cpuset X /dev/cpuset
 
 is equivalent to
 
-mount -t cgroup -ocpuset X /dev/cpuset
+mount -t cgroup -ocpuset,noprefix X /dev/cpuset
 echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
 
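The `noprefix` option added in this hunk matters because a cgroup mount without it names the control files `cpuset.cpus`, `cpuset.mems`, and so on, while the legacy `mount -t cpuset` form exposes plain `cpus`, `mems`. A rough sketch of how a script might tell the two apart, assuming a hypothetical helper name `cpuset_name_style` that is not part of the patch:

```shell
# Hypothetical helper (not part of the patch): report which file-name
# style a mounted cpuset hierarchy uses.
#   noprefix : files are "cpus", "mems", ... (mount -t cpuset, or
#              mount -t cgroup -ocpuset,noprefix)
#   prefixed : files are "cpuset.cpus", "cpuset.mems", ...
#              (mount -t cgroup -ocpuset without noprefix)
cpuset_name_style() {
    dir=$1
    if [ -e "$dir/cpus" ]; then
        echo noprefix
    elif [ -e "$dir/cpuset.cpus" ]; then
        echo prefixed
    else
        # Not a cpuset mount point, or nothing is mounted there.
        echo unknown
        return 1
    fi
}
```

For example, `cpuset_name_style /dev/cpuset` should print `noprefix` after either of the two mount commands that the patched text calls equivalent.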
 2.2 Adding/removing cpus