author		Paul Jackson <pj@sgi.com>	2006-01-08 04:01:50 -0500
committer	Linus Torvalds <torvalds@g5.osdl.org>	2006-01-08 23:13:43 -0500
commit		bd5e09cf7054878a3db6a8c8bab1c2fabcd4f072 (patch)
tree		8038087b61ba0852495d20561bc227f0c3ae04f2
parent		3e0d98b9f1eb757fc98efc84e74e54a08308aa73 (diff)

[PATCH] cpuset: document additional features

Document the additional cpuset features:
    notify_on_release
    marker_pid
    memory_pressure
    memory_pressure_enabled

Rearrange and improve formatting of existing documentation for
cpu_exclusive and mem_exclusive features.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

 Documentation/cpusets.txt | 178 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 153 insertions(+), 25 deletions(-)
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index e2d9afc30d2d..ef83b2dd0cb3 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -14,7 +14,11 @@ CONTENTS:
   1.1 What are cpusets ?
   1.2 Why are cpusets needed ?
   1.3 How are cpusets implemented ?
-  1.4 How do I use cpusets ?
+  1.4 What are exclusive cpusets ?
+  1.5 What does notify_on_release do ?
+  1.6 What is a marker_pid ?
+  1.7 What is memory_pressure ?
+  1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -49,29 +53,6 @@ its cpus_allowed vector, and the kernel page allocator will not
 allocate a page on a node that is not allowed in the requesting tasks
 mems_allowed vector.
 
-If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
-ancestor or descendent, may share any of the same CPUs or Memory Nodes.
-A cpuset that is cpu exclusive has a sched domain associated with it.
-The sched domain consists of all cpus in the current cpuset that are not
-part of any exclusive child cpusets.
-This ensures that the scheduler load balacing code only balances
-against the cpus that are in the sched domain as defined above and not
-all of the cpus in the system. This removes any overhead due to
-load balancing code trying to pull tasks outside of the cpu exclusive
-cpuset only to be prevented by the tasks' cpus_allowed mask.
-
-A cpuset that is mem_exclusive restricts kernel allocations for
-page, buffer and other data commonly shared by the kernel across
-multiple users. All cpusets, whether mem_exclusive or not, restrict
-allocations of memory for user space. This enables configuring a
-system so that several independent jobs can share common kernel
-data, such as file system pages, while isolating each jobs user
-allocation in its own cpuset. To do this, construct a large
-mem_exclusive cpuset to hold all the jobs, and construct child,
-non-mem_exclusive cpusets for each individual job. Only a small
-amount of typical kernel memory, such as requests from interrupt
-handlers, is allowed to be taken outside even a mem_exclusive cpuset.
-
 User level code may create and destroy cpusets by name in the cpuset
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
@@ -196,6 +177,12 @@ containing the following files describing that cpuset:
  - cpu_exclusive flag: is cpu placement exclusive?
  - mem_exclusive flag: is memory placement exclusive?
  - tasks: list of tasks (by pid) attached to that cpuset
+ - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
+ - marker_pid: pid of user task in co-ordinated operation sequence
+ - memory_pressure: measure of how much paging pressure in cpuset
+
+In addition, the root cpuset only has the following file:
+ - memory_pressure_enabled flag: compute memory_pressure?
 
 New cpusets are created using the mkdir system call or shell
 command. The properties of a cpuset, such as its flags, allowed
@@ -229,7 +216,148 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
 to represent the cpuset hierarchy provides for a familiar permission
 and name space for cpusets, with a minimum of additional kernel code.
 
-1.4 How do I use cpusets ?
+
+1.4 What are exclusive cpusets ?
+--------------------------------
+
+If a cpuset is cpu or mem exclusive, no other cpuset, other than
+a direct ancestor or descendent, may share any of the same CPUs or
+Memory Nodes.
+
+A cpuset that is cpu_exclusive has a scheduler (sched) domain
+associated with it. The sched domain consists of all CPUs in the
+current cpuset that are not part of any exclusive child cpusets.
+This ensures that the scheduler load balancing code only balances
+against the CPUs that are in the sched domain as defined above and
+not all of the CPUs in the system. This removes any overhead due to
+load balancing code trying to pull tasks outside of the cpu_exclusive
+cpuset only to be prevented by the tasks' cpus_allowed mask.
+
+A cpuset that is mem_exclusive restricts kernel allocations for
+page, buffer and other data commonly shared by the kernel across
+multiple users. All cpusets, whether mem_exclusive or not, restrict
+allocations of memory for user space. This enables configuring a
+system so that several independent jobs can share common kernel data,
+such as file system pages, while isolating each job's user allocation in
+its own cpuset. To do this, construct a large mem_exclusive cpuset to
+hold all the jobs, and construct child, non-mem_exclusive cpusets for
+each individual job. Only a small amount of typical kernel memory,
+such as requests from interrupt handlers, is allowed to be taken
+outside even a mem_exclusive cpuset.
+
+
+1.5 What does notify_on_release do ?
+------------------------------------
+
+If the notify_on_release flag is enabled (1) in a cpuset, then whenever
+the last task in the cpuset leaves (exits or attaches to some other
+cpuset) and the last child cpuset of that cpuset is removed, then
+the kernel runs the command /sbin/cpuset_release_agent, supplying the
+pathname (relative to the mount point of the cpuset file system) of the
+abandoned cpuset. This enables automatic removal of abandoned cpusets.
+The default value of notify_on_release in the root cpuset at system
+boot is disabled (0). The default value of other cpusets at creation
+is the current value of their parent's notify_on_release setting.
+
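The parent/child mem_exclusive construction described above can be sketched as a dry run against a scratch directory standing in for a mounted cpuset file system (mount -t cpuset cpuset /dev/cpuset), so the sequence can be tried without root. The names "jobs" and "job1" and the CPU/node numbers are made up for illustration; on a real system the kernel creates the control files when you mkdir.

```shell
# Scratch directory standing in for /dev/cpuset (hypothetical layout).
base=$(mktemp -d)

mkdir "$base/jobs"                    # large parent cpuset holding all jobs
echo 0-3 > "$base/jobs/cpus"
echo 0   > "$base/jobs/mems"
echo 1   > "$base/jobs/mem_exclusive" # kernel data may be shared across jobs

mkdir "$base/jobs/job1"               # child cpuset, NOT mem_exclusive:
echo 0-1 > "$base/jobs/job1/cpus"     # job1's user space pages stay on its
echo 0   > "$base/jobs/job1/mems"     # own nodes, kernel data is shared
```

On a real system the echo commands would be rejected unless the values are consistent with the parent cpuset and the exclusivity rules above.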
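Since abandoned cpusets are empty directories, a minimal release agent could simply remove the directory it is told about. A hedged sketch of what such an /sbin/cpuset_release_agent might contain; the CPUSET_MNT override is hypothetical, added here so the logic can be exercised against a scratch directory instead of a real cpuset mount.

```shell
# Sketch of a minimal cpuset release agent. The kernel passes the
# abandoned cpuset's path, relative to the cpuset mount point, as $1.
release_agent() {
    # An abandoned cpuset has no tasks and no children, so its
    # directory is empty and rmdir suffices to remove it.
    rmdir "${CPUSET_MNT:-/dev/cpuset}/$1"
}
```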
+
+1.6 What is a marker_pid ?
+--------------------------
+
+The marker_pid helps manage cpuset changes safely from user space.
+
+The interface presented to user space for cpusets uses system wide
+numbering of CPUs and Memory Nodes. It is the responsibility of
+user level code, presumably in a library, to present cpuset-relative
+numbering to applications when that would be more useful to them.
+
+However if a task is moved to a different cpuset, or if the 'cpus' or
+'mems' of a cpuset are changed, then we need a way for such library
+code to detect that its cpuset-relative numbering has changed, when
+expressed using system wide numbering.
+
+The kernel cannot safely allow user code to lock kernel resources.
+The kernel could deliver out-of-band notice of cpuset changes by
+such mechanisms as signals or usermodehelper callbacks, however
+this can't be synchronously delivered to library code linked in
+applications without intruding on the IPC mechanisms available to
+the app. The kernel could require user level code to do all the work,
+tracking the cpuset state before and during changes, to verify no
+unexpected change occurred, but this becomes an onerous task.
+
+The "marker_pid" cpuset field provides a simple way to make this task
+less onerous on user library code. A task writes its pid to a cpuset's
+"marker_pid" at the start of a sequence of queries and updates,
+and checks as it goes that the cpuset's marker_pid doesn't change.
+The pread(2) system call does a seek and read in a single call.
+If the marker_pid changes, the user code should retry the required
+sequence of operations.
+
+Anytime that a task modifies the "cpus" or "mems" of a cpuset,
+unless its pid is in the cpuset's marker_pid field, the kernel zeros
+this field.
+
+The above was inspired by the load linked and store conditional
+(ll/sc) instructions in the MIPS II instruction set.
+
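The ll/sc-style protocol above can be sketched as a retry loop, shown here as a dry run in a scratch directory standing in for a cpuset directory. The file names match the documentation; the CPU/node values are illustrative, and with no concurrent writer present the first pass trivially succeeds.

```shell
# Scratch stand-in for a cpuset directory (hypothetical contents).
cs=$(mktemp -d)
echo 0-3 > "$cs/cpus"
echo 0   > "$cs/mems"
: > "$cs/marker_pid"

while :; do
    echo $$ > "$cs/marker_pid"      # claim the marker with our pid
    cpus=$(cat "$cs/cpus")          # ...sequence of queries...
    mems=$(cat "$cs/mems")
    # If marker_pid still holds our pid, no other task changed the
    # cpuset's cpus/mems underneath us and the snapshot is consistent;
    # otherwise the kernel zeroed the marker and we must retry.
    [ "$(cat "$cs/marker_pid")" = "$$" ] && break
done
```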
+
+1.7 What is memory_pressure ?
+-----------------------------
+The memory_pressure of a cpuset provides a simple per-cpuset metric
+of the rate that the tasks in a cpuset are attempting to free up
+in-use memory on the nodes of the cpuset to satisfy additional memory
+requests.
+
+This enables batch managers monitoring jobs running in dedicated
+cpusets to efficiently detect what level of memory pressure that job
+is causing.
+
+This is useful both on tightly managed systems running a wide mix of
+submitted jobs, which may choose to terminate or re-prioritize jobs that
+are trying to use more memory than allowed on the nodes assigned them,
+and with tightly coupled, long running, massively parallel scientific
+computing jobs that will dramatically fail to meet required performance
+goals if they start to use more memory than allowed to them.
+
+This mechanism provides a very economical way for the batch manager
+to monitor a cpuset for signs of memory pressure. It's up to the
+batch manager or other user code to decide what to do about it and
+take action.
+
+==> Unless this feature is enabled by writing "1" to the special file
+    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
+    code of __alloc_pages() for this metric reduces to simply noticing
+    that the cpuset_memory_pressure_enabled flag is zero. So only
+    systems that enable this feature will compute the metric.
+
+Why a per-cpuset, running average:
+
+    Because this meter is per-cpuset, rather than per-task or mm,
+    the system load imposed by a batch scheduler monitoring this
+    metric is sharply reduced on large systems, because a scan of
+    the tasklist can be avoided on each set of queries.
+
+    Because this meter is a running average, instead of an accumulating
+    counter, a batch scheduler can detect memory pressure with a
+    single read, instead of having to read and accumulate results
+    for a period of time.
+
+    Because this meter is per-cpuset rather than per-task or mm,
+    the batch scheduler can obtain the key information, memory
+    pressure in a cpuset, with a single read, rather than having to
+    query and accumulate results over all the (dynamically changing)
+    set of tasks in the cpuset.
+
+A per-cpuset simple digital filter (requires a spinlock and 3 words
+of data per-cpuset) is kept, and updated by any task attached to that
+cpuset, if it enters the synchronous (direct) page reclaim code.
+
+A per-cpuset file provides an integer number representing the recent
+(half-life of 10 seconds) rate of direct page reclaims caused by
+the tasks in the cpuset, in units of reclaims attempted per second,
+times 1000.
+
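Since the reported integer is the recent reclaim rate scaled by 1000, a monitor converts it back by dividing. A small sketch; the sample value 4200 is made up, and on a real system it would come from reading the cpuset's memory_pressure file.

```shell
# On a real system: value=$(cat /dev/cpuset/jobs/memory_pressure),
# where "jobs" is a hypothetical cpuset name.  4200 is a sample reading.
value=4200
# The file reports reclaims attempted per second, times 1000.
rate=$(awk "BEGIN { printf \"%g\", $value / 1000 }")
echo "$rate direct page reclaims/sec attempted recently"
```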
+
+1.8 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel