aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/cpusets.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/cpusets.txt')
-rw-r--r--Documentation/cpusets.txt165
1 files changed, 138 insertions, 27 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index a09a8eb80665..990998ee10b6 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -14,7 +14,10 @@ CONTENTS:
14 1.1 What are cpusets ? 14 1.1 What are cpusets ?
15 1.2 Why are cpusets needed ? 15 1.2 Why are cpusets needed ?
16 1.3 How are cpusets implemented ? 16 1.3 How are cpusets implemented ?
17 1.4 How do I use cpusets ? 17 1.4 What are exclusive cpusets ?
18 1.5 What does notify_on_release do ?
19 1.6 What is memory_pressure ?
20 1.7 How do I use cpusets ?
182. Usage Examples and Syntax 212. Usage Examples and Syntax
19 2.1 Basic Usage 22 2.1 Basic Usage
20 2.2 Adding/removing cpus 23 2.2 Adding/removing cpus
@@ -49,29 +52,6 @@ its cpus_allowed vector, and the kernel page allocator will not
49allocate a page on a node that is not allowed in the requesting tasks 52allocate a page on a node that is not allowed in the requesting tasks
50mems_allowed vector. 53mems_allowed vector.
51 54
52If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
53ancestor or descendent, may share any of the same CPUs or Memory Nodes.
54A cpuset that is cpu exclusive has a sched domain associated with it.
55The sched domain consists of all cpus in the current cpuset that are not
56part of any exclusive child cpusets.
57This ensures that the scheduler load balacing code only balances
58against the cpus that are in the sched domain as defined above and not
59all of the cpus in the system. This removes any overhead due to
60load balancing code trying to pull tasks outside of the cpu exclusive
61cpuset only to be prevented by the tasks' cpus_allowed mask.
62
63A cpuset that is mem_exclusive restricts kernel allocations for
64page, buffer and other data commonly shared by the kernel across
65multiple users. All cpusets, whether mem_exclusive or not, restrict
66allocations of memory for user space. This enables configuring a
67system so that several independent jobs can share common kernel
68data, such as file system pages, while isolating each jobs user
69allocation in its own cpuset. To do this, construct a large
70mem_exclusive cpuset to hold all the jobs, and construct child,
71non-mem_exclusive cpusets for each individual job. Only a small
72amount of typical kernel memory, such as requests from interrupt
73handlers, is allowed to be taken outside even a mem_exclusive cpuset.
74
75User level code may create and destroy cpusets by name in the cpuset 55User level code may create and destroy cpusets by name in the cpuset
76virtual file system, manage the attributes and permissions of these 56virtual file system, manage the attributes and permissions of these
77cpusets and which CPUs and Memory Nodes are assigned to each cpuset, 57cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
@@ -155,7 +135,7 @@ Cpusets extends these two mechanisms as follows:
155The implementation of cpusets requires a few, simple hooks 135The implementation of cpusets requires a few, simple hooks
156into the rest of the kernel, none in performance critical paths: 136into the rest of the kernel, none in performance critical paths:
157 137
158 - in main/init.c, to initialize the root cpuset at system boot. 138 - in init/main.c, to initialize the root cpuset at system boot.
159 - in fork and exit, to attach and detach a task from its cpuset. 139 - in fork and exit, to attach and detach a task from its cpuset.
160 - in sched_setaffinity, to mask the requested CPUs by what's 140 - in sched_setaffinity, to mask the requested CPUs by what's
161 allowed in that tasks cpuset. 141 allowed in that tasks cpuset.
@@ -166,7 +146,7 @@ into the rest of the kernel, none in performance critical paths:
166 and related changes in both sched.c and arch/ia64/kernel/domain.c 146 and related changes in both sched.c and arch/ia64/kernel/domain.c
167 - in the mbind and set_mempolicy system calls, to mask the requested 147 - in the mbind and set_mempolicy system calls, to mask the requested
168 Memory Nodes by what's allowed in that tasks cpuset. 148 Memory Nodes by what's allowed in that tasks cpuset.
169 - in page_alloc, to restrict memory to allowed nodes. 149 - in page_alloc.c, to restrict memory to allowed nodes.
170 - in vmscan.c, to restrict page recovery to the current cpuset. 150 - in vmscan.c, to restrict page recovery to the current cpuset.
171 151
172In addition a new file system, of type "cpuset" may be mounted, 152In addition a new file system, of type "cpuset" may be mounted,
@@ -192,9 +172,15 @@ containing the following files describing that cpuset:
192 172
193 - cpus: list of CPUs in that cpuset 173 - cpus: list of CPUs in that cpuset
194 - mems: list of Memory Nodes in that cpuset 174 - mems: list of Memory Nodes in that cpuset
175 - memory_migrate flag: if set, move pages to cpusets nodes
195 - cpu_exclusive flag: is cpu placement exclusive? 176 - cpu_exclusive flag: is cpu placement exclusive?
196 - mem_exclusive flag: is memory placement exclusive? 177 - mem_exclusive flag: is memory placement exclusive?
197 - tasks: list of tasks (by pid) attached to that cpuset 178 - tasks: list of tasks (by pid) attached to that cpuset
179 - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
180 - memory_pressure: measure of how much paging pressure in cpuset
181
182In addition, the root cpuset only has the following file:
183 - memory_pressure_enabled flag: compute memory_pressure?
198 184
199New cpusets are created using the mkdir system call or shell 185New cpusets are created using the mkdir system call or shell
200command. The properties of a cpuset, such as its flags, allowed 186command. The properties of a cpuset, such as its flags, allowed
@@ -228,7 +214,108 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
228to represent the cpuset hierarchy provides for a familiar permission 214to represent the cpuset hierarchy provides for a familiar permission
229and name space for cpusets, with a minimum of additional kernel code. 215and name space for cpusets, with a minimum of additional kernel code.
230 216
2311.4 How do I use cpusets ? 217
2181.4 What are exclusive cpusets ?
219--------------------------------
220
221If a cpuset is cpu or mem exclusive, no other cpuset, other than
222a direct ancestor or descendent, may share any of the same CPUs or
223Memory Nodes.
224
225A cpuset that is cpu_exclusive has a scheduler (sched) domain
226associated with it. The sched domain consists of all CPUs in the
227current cpuset that are not part of any exclusive child cpusets.
228This ensures that the scheduler load balancing code only balances
229against the CPUs that are in the sched domain as defined above and
230not all of the CPUs in the system. This removes any overhead due to
231load balancing code trying to pull tasks outside of the cpu_exclusive
232cpuset only to be prevented by the tasks' cpus_allowed mask.
233
234A cpuset that is mem_exclusive restricts kernel allocations for
235page, buffer and other data commonly shared by the kernel across
236multiple users. All cpusets, whether mem_exclusive or not, restrict
237allocations of memory for user space. This enables configuring a
238system so that several independent jobs can share common kernel data,
239such as file system pages, while isolating each jobs user allocation in
240its own cpuset. To do this, construct a large mem_exclusive cpuset to
241hold all the jobs, and construct child, non-mem_exclusive cpusets for
242each individual job. Only a small amount of typical kernel memory,
243such as requests from interrupt handlers, is allowed to be taken
244outside even a mem_exclusive cpuset.
245
246
2471.5 What does notify_on_release do ?
248------------------------------------
249
250If the notify_on_release flag is enabled (1) in a cpuset, then whenever
251the last task in the cpuset leaves (exits or attaches to some other
252cpuset) and the last child cpuset of that cpuset is removed, then
253the kernel runs the command /sbin/cpuset_release_agent, supplying the
254pathname (relative to the mount point of the cpuset file system) of the
255abandoned cpuset. This enables automatic removal of abandoned cpusets.
256The default value of notify_on_release in the root cpuset at system
257boot is disabled (0). The default value of other cpusets at creation
258is the current value of their parents notify_on_release setting.
259
260
2611.6 What is memory_pressure ?
262-----------------------------
263The memory_pressure of a cpuset provides a simple per-cpuset metric
264of the rate that the tasks in a cpuset are attempting to free up in
265use memory on the nodes of the cpuset to satisfy additional memory
266requests.
267
268This enables batch managers monitoring jobs running in dedicated
269cpusets to efficiently detect what level of memory pressure that job
270is causing.
271
272This is useful both on tightly managed systems running a wide mix of
273submitted jobs, which may choose to terminate or re-prioritize jobs that
274are trying to use more memory than allowed on the nodes assigned them,
275and with tightly coupled, long running, massively parallel scientific
276computing jobs that will dramatically fail to meet required performance
277goals if they start to use more memory than allowed to them.
278
279This mechanism provides a very economical way for the batch manager
280to monitor a cpuset for signs of memory pressure. It's up to the
281batch manager or other user code to decide what to do about it and
282take action.
283
284==> Unless this feature is enabled by writing "1" to the special file
285 /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
286 code of __alloc_pages() for this metric reduces to simply noticing
287 that the cpuset_memory_pressure_enabled flag is zero. So only
288 systems that enable this feature will compute the metric.
289
290Why a per-cpuset, running average:
291
292 Because this meter is per-cpuset, rather than per-task or mm,
293 the system load imposed by a batch scheduler monitoring this
294 metric is sharply reduced on large systems, because a scan of
295 the tasklist can be avoided on each set of queries.
296
297 Because this meter is a running average, instead of an accumulating
298 counter, a batch scheduler can detect memory pressure with a
299 single read, instead of having to read and accumulate results
300 for a period of time.
301
302 Because this meter is per-cpuset rather than per-task or mm,
303 the batch scheduler can obtain the key information, memory
304 pressure in a cpuset, with a single read, rather than having to
305 query and accumulate results over all the (dynamically changing)
306 set of tasks in the cpuset.
307
308A per-cpuset simple digital filter (requires a spinlock and 3 words
309of data per-cpuset) is kept, and updated by any task attached to that
310cpuset, if it enters the synchronous (direct) page reclaim code.
311
312A per-cpuset file provides an integer number representing the recent
313(half-life of 10 seconds) rate of direct page reclaims caused by
314the tasks in the cpuset, in units of reclaims attempted per second,
315times 1000.
316
317
3181.7 How do I use cpusets ?
232-------------------------- 319--------------------------
233 320
234In order to minimize the impact of cpusets on critical kernel 321In order to minimize the impact of cpusets on critical kernel
@@ -277,6 +364,30 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid
277impacting the scheduler code in the kernel with a check for changes 364impacting the scheduler code in the kernel with a check for changes
278in a tasks processor placement. 365in a tasks processor placement.
279 366
367Normally, once a page is allocated (given a physical page
368of main memory) then that page stays on whatever node it
369was allocated, so long as it remains allocated, even if the
370cpusets memory placement policy 'mems' subsequently changes.
371If the cpuset flag file 'memory_migrate' is set true, then when
372tasks are attached to that cpuset, any pages that task had
373allocated to it on nodes in its previous cpuset are migrated
374to the tasks new cpuset. Depending on the implementation,
375this migration may either be done by swapping the page out,
376so that the next time the page is referenced, it will be paged
377into the tasks new cpuset, usually on the node where it was
378referenced, or this migration may be done by directly copying
379the pages from the tasks previous cpuset to the new cpuset,
380where possible to the same node, relative to the new cpuset,
381as the node that held the page, relative to the old cpuset.
382Also if 'memory_migrate' is set true, then if that cpusets
383'mems' file is modified, pages allocated to tasks in that
384cpuset, that were on nodes in the previous setting of 'mems',
385will be moved to nodes in the new setting of 'mems.' Again,
386depending on the implementation, this might be done by swapping,
387or by direct copying. In either case, pages that were not in
388the tasks prior cpuset, or in the cpusets prior 'mems' setting,
389will not be moved.
390
280There is an exception to the above. If hotplug functionality is used 391There is an exception to the above. If hotplug functionality is used
281to remove all the CPUs that are currently assigned to a cpuset, 392to remove all the CPUs that are currently assigned to a cpuset,
282then the kernel will automatically update the cpus_allowed of all 393then the kernel will automatically update the cpus_allowed of all