author | Ingo Molnar <mingo@elte.hu> | 2009-01-18 14:15:05 -0500 |
---|---|---|
committer | Ingo Molnar <mingo@elte.hu> | 2009-01-18 14:15:05 -0500 |
commit | 4092762aebfe55c1f8e31440b80a053c2dbe519b (patch) | |
tree | 8fb9fd14131194174c12daf5d8195afd3b62bc3e /Documentation/cgroups/cpusets.txt | |
parent | 745b1626dd71ce9661a05ea4db57859ed5c773d2 (diff) | |
parent | 1de9e8e70f5acc441550ca75433563d91b269bbe (diff) |
Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc2' into tracing/core
Diffstat (limited to 'Documentation/cgroups/cpusets.txt')
-rw-r--r-- | Documentation/cgroups/cpusets.txt | 808 |
1 files changed, 808 insertions, 0 deletions
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
new file mode 100644
index 000000000000..5c86c258c791
--- /dev/null
+++ b/Documentation/cgroups/cpusets.txt
@@ -0,0 +1,808 @@ | |||
1 | CPUSETS | ||
2 | ------- | ||
3 | |||
4 | Copyright (C) 2004 BULL SA. | ||
5 | Written by Simon.Derr@bull.net | ||
6 | |||
7 | Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | ||
8 | Modified by Paul Jackson <pj@sgi.com> | ||
9 | Modified by Christoph Lameter <clameter@sgi.com> | ||
10 | Modified by Paul Menage <menage@google.com> | ||
11 | Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> | ||
12 | |||
13 | CONTENTS: | ||
14 | ========= | ||
15 | |||
16 | 1. Cpusets | ||
17 | 1.1 What are cpusets ? | ||
18 | 1.2 Why are cpusets needed ? | ||
19 | 1.3 How are cpusets implemented ? | ||
20 | 1.4 What are exclusive cpusets ? | ||
21 | 1.5 What is memory_pressure ? | ||
22 | 1.6 What is memory spread ? | ||
23 | 1.7 What is sched_load_balance ? | ||
24 | 1.8 What is sched_relax_domain_level ? | ||
25 | 1.9 How do I use cpusets ? | ||
26 | 2. Usage Examples and Syntax | ||
27 | 2.1 Basic Usage | ||
28 | 2.2 Adding/removing cpus | ||
29 | 2.3 Setting flags | ||
30 | 2.4 Attaching processes | ||
31 | 3. Questions | ||
32 | 4. Contact | ||
33 | |||
34 | 1. Cpusets | ||
35 | ========== | ||
36 | |||
37 | 1.1 What are cpusets ? | ||
38 | ---------------------- | ||
39 | |||
40 | Cpusets provide a mechanism for assigning a set of CPUs and Memory | ||
41 | Nodes to a set of tasks. In this document "Memory Node" refers to | ||
42 | an on-line node that contains memory. | ||
43 | |||
44 | Cpusets constrain the CPU and Memory placement of tasks to only | ||
45 | the resources within a task's current cpuset. They form a nested | ||
46 | hierarchy visible in a virtual file system. These are the essential | ||
47 | hooks, beyond what is already present, required to manage dynamic | ||
48 | job placement on large systems. | ||
49 | |||
50 | Cpusets use the generic cgroup subsystem described in | ||
51 | Documentation/cgroups/cgroups.txt. | ||
52 | |||
53 | Requests by a task, using the sched_setaffinity(2) system call to | ||
54 | include CPUs in its CPU affinity mask, and using the mbind(2) and | ||
55 | set_mempolicy(2) system calls to include Memory Nodes in its memory | ||
56 | policy, are both filtered through that task's cpuset, filtering out any | ||
57 | CPUs or Memory Nodes not in that cpuset. The scheduler will not | ||
58 | schedule a task on a CPU that is not allowed in its cpus_allowed | ||
59 | vector, and the kernel page allocator will not allocate a page on a | ||
60 | node that is not allowed in the requesting task's mems_allowed vector. | ||
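
As a rough illustration of this filtering (a sketch in Python, not kernel
code; the name filter_affinity is hypothetical), the effective CPU set can
be thought of as the intersection of what the task requests with what its
cpuset allows. The same idea applies to Memory Nodes.

```python
def filter_affinity(requested_cpus, cpuset_cpus):
    """Return the requested CPUs that the task's cpuset also allows."""
    return set(requested_cpus) & set(cpuset_cpus)


# A task in a cpuset holding CPUs 0-3 asks for CPUs 2-5.
# Only CPUs 2 and 3 survive the filter.
print(sorted(filter_affinity({2, 3, 4, 5}, {0, 1, 2, 3})))
```

How the kernel reacts when the intersection is empty is not covered by
this sketch.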
61 | |||
62 | User level code may create and destroy cpusets by name in the cgroup | ||
63 | virtual file system, manage the attributes and permissions of these | ||
64 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | ||
65 | specify and query to which cpuset a task is assigned, and list the | ||
66 | task pids assigned to a cpuset. | ||
67 | |||
68 | |||
69 | 1.2 Why are cpusets needed ? | ||
70 | ---------------------------- | ||
71 | |||
72 | The management of large computer systems, with many processors (CPUs), | ||
73 | complex memory cache hierarchies and multiple Memory Nodes having | ||
74 | non-uniform access times (NUMA) presents additional challenges for | ||
75 | the efficient scheduling and memory placement of processes. | ||
76 | |||
77 | Frequently more modest sized systems can be operated with adequate | ||
78 | efficiency just by letting the operating system automatically share | ||
79 | the available CPU and Memory resources amongst the requesting tasks. | ||
80 | |||
81 | But larger systems, which benefit more from careful processor and | ||
82 | memory placement to reduce memory access times and contention, | ||
83 | and which typically represent a larger investment for the customer, | ||
84 | can benefit from explicitly placing jobs on properly sized subsets of | ||
85 | the system. | ||
86 | |||
87 | This can be especially valuable on: | ||
88 | |||
89 | * Web Servers running multiple instances of the same web application, | ||
90 | * Servers running different applications (for instance, a web server | ||
91 | and a database), or | ||
92 | * NUMA systems running large HPC applications with demanding | ||
93 | performance characteristics. | ||
94 | |||
95 | These subsets, or "soft partitions", must be able to be dynamically | ||
96 | adjusted, as the job mix changes, without impacting other concurrently | ||
97 | executing jobs. The location of the running job's pages may also be moved | ||
98 | when the memory locations are changed. | ||
99 | |||
100 | The kernel cpuset patch provides the minimum essential kernel | ||
101 | mechanisms required to efficiently implement such subsets. It | ||
102 | leverages existing CPU and Memory Placement facilities in the Linux | ||
103 | kernel to avoid any additional impact on the critical scheduler or | ||
104 | memory allocator code. | ||
105 | |||
106 | |||
107 | 1.3 How are cpusets implemented ? | ||
108 | --------------------------------- | ||
109 | |||
110 | Cpusets provide a Linux kernel mechanism to constrain which CPUs and | ||
111 | Memory Nodes are used by a process or set of processes. | ||
112 | |||
113 | The Linux kernel already has a pair of mechanisms to specify on which | ||
114 | CPUs a task may be scheduled (sched_setaffinity) and on which Memory | ||
115 | Nodes it may obtain memory (mbind, set_mempolicy). | ||
116 | |||
117 | Cpusets extend these two mechanisms as follows: | ||
118 | |||
119 | - Cpusets are sets of allowed CPUs and Memory Nodes, known to the | ||
120 | kernel. | ||
121 | - Each task in the system is attached to a cpuset, via a pointer | ||
122 | in the task structure to a reference counted cgroup structure. | ||
123 | - Calls to sched_setaffinity are filtered to just those CPUs | ||
124 | allowed in that task's cpuset. | ||
125 | - Calls to mbind and set_mempolicy are filtered to just | ||
126 | those Memory Nodes allowed in that task's cpuset. | ||
127 | - The root cpuset contains all the system's CPUs and Memory | ||
128 | Nodes. | ||
129 | - For any cpuset, one can define child cpusets containing a subset | ||
130 | of the parent's CPU and Memory Node resources. | ||
131 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for | ||
132 | browsing and manipulation from user space. | ||
133 | - A cpuset may be marked exclusive, which ensures that no other | ||
134 | cpuset (except direct ancestors and descendants) may contain | ||
135 | any overlapping CPUs or Memory Nodes. | ||
136 | - You can list all the tasks (by pid) attached to any cpuset. | ||
137 | |||
138 | The implementation of cpusets requires a few, simple hooks | ||
139 | into the rest of the kernel, none in performance critical paths: | ||
140 | |||
141 | - in init/main.c, to initialize the root cpuset at system boot. | ||
142 | - in fork and exit, to attach and detach a task from its cpuset. | ||
143 | - in sched_setaffinity, to mask the requested CPUs by what's | ||
144 | allowed in that task's cpuset. | ||
145 | - in sched.c migrate_all_tasks(), to keep migrating tasks within | ||
146 | the CPUs allowed by their cpuset, if possible. | ||
147 | - in the mbind and set_mempolicy system calls, to mask the requested | ||
148 | Memory Nodes by what's allowed in that task's cpuset. | ||
149 | - in page_alloc.c, to restrict memory to allowed nodes. | ||
150 | - in vmscan.c, to restrict page recovery to the current cpuset. | ||
151 | |||
152 | You should mount the "cgroup" filesystem type in order to enable | ||
153 | browsing and modifying the cpusets presently known to the kernel. No | ||
154 | new system calls are added for cpusets - all support for querying and | ||
155 | modifying cpusets is via this cpuset file system. | ||
156 | |||
157 | The /proc/<pid>/status file for each task has four added lines, | ||
158 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) | ||
159 | and mems_allowed (on which Memory Nodes it may obtain memory), | ||
160 | in the two formats seen in the following example: | ||
161 | |||
162 | Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff | ||
163 | Cpus_allowed_list: 0-127 | ||
164 | Mems_allowed: ffffffff,ffffffff | ||
165 | Mems_allowed_list: 0-63 | ||
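
The relation between the two formats can be sketched as follows. This is
an illustrative Python helper (the name mask_to_list is hypothetical), which
assumes the mask is a comma-separated run of hexadecimal words with the
most significant word first, as in the example above.

```python
def mask_to_list(mask):
    """Convert a hex mask such as 'ffffffff,ffffffff' to list format."""
    value = int(mask.replace(",", ""), 16)
    bits = [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

    # Collapse consecutive bit numbers into [first, last] ranges.
    ranges = []
    for bit in bits:
        if ranges and bit == ranges[-1][1] + 1:
            ranges[-1][1] = bit
        else:
            ranges.append([bit, bit])

    return ",".join(
        str(first) if first == last else f"{first}-{last}"
        for first, last in ranges
    )


print(mask_to_list("ffffffff,ffffffff,ffffffff,ffffffff"))  # 0-127
print(mask_to_list("ffffffff,ffffffff"))                    # 0-63
```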
166 | |||
167 | Each cpuset is represented by a directory in the cgroup file system | ||
168 | containing (on top of the standard cgroup files) the following | ||
169 | files describing that cpuset: | ||
170 | |||
171 | - cpus: list of CPUs in that cpuset | ||
172 | - mems: list of Memory Nodes in that cpuset | ||
173 | - memory_migrate flag: if set, move pages to cpuset's nodes | ||
174 | - cpu_exclusive flag: is cpu placement exclusive? | ||
175 | - mem_exclusive flag: is memory placement exclusive? | ||
176 | - mem_hardwall flag: is memory allocation hardwalled? | ||
177 | - memory_pressure: measure of how much paging pressure in cpuset | ||
178 | |||
179 | In addition, only the root cpuset has the following file: | ||
180 | - memory_pressure_enabled flag: compute memory_pressure? | ||
181 | |||
182 | New cpusets are created using the mkdir system call or shell | ||
183 | command. The properties of a cpuset, such as its flags, allowed | ||
184 | CPUs and Memory Nodes, and attached tasks, are modified by writing | ||
185 | to the appropriate file in that cpuset's directory, as listed above. | ||
186 | |||
187 | The named hierarchical structure of nested cpusets allows partitioning | ||
188 | a large system into nested, dynamically changeable, "soft-partitions". | ||
189 | |||
190 | The attachment of each task, automatically inherited at fork by any | ||
191 | children of that task, to a cpuset allows organizing the work load | ||
192 | on a system into related sets of tasks such that each set is constrained | ||
193 | to using the CPUs and Memory Nodes of a particular cpuset. A task | ||
194 | may be re-attached to any other cpuset, if allowed by the permissions | ||
195 | on the necessary cpuset file system directories. | ||
196 | |||
197 | Such management of a system "in the large" integrates smoothly with | ||
198 | the detailed placement done on individual tasks and memory regions | ||
199 | using the sched_setaffinity, mbind and set_mempolicy system calls. | ||
200 | |||
201 | The following rules apply to each cpuset: | ||
202 | |||
203 | - Its CPUs and Memory Nodes must be a subset of its parent's. | ||
204 | - It can't be marked exclusive unless its parent is. | ||
205 | - If its cpu or memory is exclusive, they may not overlap any sibling. | ||
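
The three rules above can be sketched as a small check. This is
illustrative Python, not the kernel's validation code; the name
check_cpuset and its arguments are hypothetical, and only CPUs are
modelled (Memory Nodes would follow the same pattern). The sketch also
assumes that an overlap is rejected when either of the two siblings is
exclusive.

```python
def check_cpuset(cpus, exclusive, parent_cpus, parent_exclusive, siblings):
    """Return the list of rules that a proposed cpuset would break.

    siblings is a list of (cpus, exclusive) pairs describing the other
    children of the same parent.
    """
    problems = []
    if not set(cpus) <= set(parent_cpus):
        problems.append("not a subset of parent")
    if exclusive and not parent_exclusive:
        problems.append("exclusive but parent is not")
    for sib_cpus, sib_exclusive in siblings:
        # Assumption: overlap is a problem if either side is exclusive.
        if (exclusive or sib_exclusive) and set(cpus) & set(sib_cpus):
            problems.append("overlaps a sibling")
            break
    return problems


# Non-exclusive siblings may share CPU 1; an exclusive one may not.
print(check_cpuset({0, 1}, False, {0, 1, 2, 3}, True, [({1, 2}, False)]))
print(check_cpuset({0, 1}, True, {0, 1, 2, 3}, True, [({1, 2}, False)]))
```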
206 | |||
207 | These rules, and the natural hierarchy of cpusets, enable efficient | ||
208 | enforcement of the exclusive guarantee, without having to scan all | ||
209 | cpusets every time any of them change to ensure nothing overlaps an | ||
210 | exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | ||
211 | to represent the cpuset hierarchy provides for a familiar permission | ||
212 | and name space for cpusets, with a minimum of additional kernel code. | ||
213 | |||
214 | The cpus and mems files in the root (top_cpuset) cpuset are | ||
215 | read-only. The cpus file automatically tracks the value of | ||
216 | cpu_online_map using a CPU hotplug notifier, and the mems file | ||
217 | automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e., | ||
218 | nodes with memory--using the cpuset_track_online_nodes() hook. | ||
219 | |||
220 | |||
221 | 1.4 What are exclusive cpusets ? | ||
222 | -------------------------------- | ||
223 | |||
224 | If a cpuset is cpu or mem exclusive, no other cpuset, other than | ||
225 | a direct ancestor or descendant, may share any of the same CPUs or | ||
226 | Memory Nodes. | ||
227 | |||
228 | A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", | ||
229 | i.e. it restricts kernel allocations for page, buffer and other data | ||
230 | commonly shared by the kernel across multiple users. All cpusets, | ||
231 | whether hardwalled or not, restrict allocations of memory for user | ||
232 | space. This enables configuring a system so that several independent | ||
233 | jobs can share common kernel data, such as file system pages, while | ||
234 | isolating each job's user allocation in its own cpuset. To do this, | ||
235 | construct a large mem_exclusive cpuset to hold all the jobs, and | ||
236 | construct child, non-mem_exclusive cpusets for each individual job. | ||
237 | Only a small amount of typical kernel memory, such as requests from | ||
238 | interrupt handlers, is allowed to be taken outside even a | ||
239 | mem_exclusive cpuset. | ||
240 | |||
241 | |||
242 | 1.5 What is memory_pressure ? | ||
243 | ----------------------------- | ||
244 | The memory_pressure of a cpuset provides a simple per-cpuset metric | ||
245 | of the rate that the tasks in a cpuset are attempting to free up | ||
246 | in-use memory on the nodes of the cpuset to satisfy additional memory | ||
247 | requests. | ||
248 | |||
249 | This enables batch managers monitoring jobs running in dedicated | ||
250 | cpusets to efficiently detect what level of memory pressure that job | ||
251 | is causing. | ||
252 | |||
253 | This is useful both on tightly managed systems running a wide mix of | ||
254 | submitted jobs, which may choose to terminate or re-prioritize jobs that | ||
255 | are trying to use more memory than allowed on the nodes assigned them, | ||
256 | and with tightly coupled, long running, massively parallel scientific | ||
257 | computing jobs that will dramatically fail to meet required performance | ||
258 | goals if they start to use more memory than allowed to them. | ||
259 | |||
260 | This mechanism provides a very economical way for the batch manager | ||
261 | to monitor a cpuset for signs of memory pressure. It's up to the | ||
262 | batch manager or other user code to decide what to do about it and | ||
263 | take action. | ||
264 | |||
265 | ==> Unless this feature is enabled by writing "1" to the special file | ||
266 | /dev/cpuset/memory_pressure_enabled, the hook in the rebalance | ||
267 | code of __alloc_pages() for this metric reduces to simply noticing | ||
268 | that the cpuset_memory_pressure_enabled flag is zero. So only | ||
269 | systems that enable this feature will compute the metric. | ||
270 | |||
271 | Why a per-cpuset, running average: | ||
272 | |||
273 | Because this meter is per-cpuset, rather than per-task or mm, | ||
274 | the system load imposed by a batch scheduler monitoring this | ||
275 | metric is sharply reduced on large systems, because a scan of | ||
276 | the tasklist can be avoided on each set of queries. | ||
277 | |||
278 | Because this meter is a running average, instead of an accumulating | ||
279 | counter, a batch scheduler can detect memory pressure with a | ||
280 | single read, instead of having to read and accumulate results | ||
281 | for a period of time. | ||
282 | |||
283 | Because this meter is per-cpuset rather than per-task or mm, | ||
284 | the batch scheduler can obtain the key information, memory | ||
285 | pressure in a cpuset, with a single read, rather than having to | ||
286 | query and accumulate results over all the (dynamically changing) | ||
287 | set of tasks in the cpuset. | ||
288 | |||
289 | A per-cpuset simple digital filter (requires a spinlock and 3 words | ||
290 | of data per-cpuset) is kept, and updated by any task attached to that | ||
291 | cpuset, if it enters the synchronous (direct) page reclaim code. | ||
292 | |||
293 | A per-cpuset file provides an integer number representing the recent | ||
294 | (half-life of 10 seconds) rate of direct page reclaims caused by | ||
295 | the tasks in the cpuset, in units of reclaims attempted per second, | ||
296 | times 1000. | ||
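
The behaviour of such a meter can be sketched in a few lines. This is a
simplified floating-point model in Python, not the kernel's filter; the
class name PressureMeter and the one-second tick are assumptions made for
illustration. Each second the old value decays so that it halves every
10 seconds, and the reclaim events seen in that second are mixed in,
scaled by 1000.

```python
HALF_LIFE_SECONDS = 10.0
# Per-second decay factor that halves the value every 10 seconds.
DECAY_PER_SECOND = 0.5 ** (1.0 / HALF_LIFE_SECONDS)
SCALE = 1000  # the reported value is reclaims per second times 1000


class PressureMeter:
    """Simplified model of a decaying per-cpuset event-rate meter."""

    def __init__(self):
        self.value = 0.0
        self.pending = 0  # events seen since the last tick

    def mark_event(self):
        """Record one direct page reclaim attempt."""
        self.pending += 1

    def tick(self):
        """Advance the meter by one second."""
        self.value = (self.value * DECAY_PER_SECOND
                      + (1.0 - DECAY_PER_SECOND) * SCALE * self.pending)
        self.pending = 0

    def read(self):
        return round(self.value)


meter = PressureMeter()
for _ in range(300):          # five reclaims per second for five minutes
    for _ in range(5):
        meter.mark_event()
    meter.tick()
print(meter.read())           # close to 5000
for _ in range(10):           # ten quiet seconds: one half-life
    meter.tick()
print(meter.read())           # close to 2500
```

A steady rate of N reclaims per second settles near N * 1000, and a single
read is enough to see it, which is the point of keeping a running average
rather than a counter.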
297 | |||
298 | |||
299 | 1.6 What is memory spread ? | ||
300 | --------------------------- | ||
301 | There are two boolean flag files per cpuset that control where the | ||
302 | kernel allocates pages for the file system buffers and related | ||
303 | in-kernel data structures. They are called 'memory_spread_page' and | ||
304 | 'memory_spread_slab'. | ||
305 | |||
306 | If the per-cpuset boolean flag file 'memory_spread_page' is set, then | ||
307 | the kernel will spread the file system buffers (page cache) evenly | ||
308 | over all the nodes that the faulting task is allowed to use, instead | ||
309 | of preferring to put those pages on the node where the task is running. | ||
310 | |||
311 | If the per-cpuset boolean flag file 'memory_spread_slab' is set, | ||
312 | then the kernel will spread some file system related slab caches, | ||
313 | such as for inodes and dentries, evenly over all the nodes that the | ||
314 | faulting task is allowed to use, instead of preferring to put those | ||
315 | pages on the node where the task is running. | ||
316 | |||
317 | The setting of these flags does not affect anonymous data segment or | ||
318 | stack segment pages of a task. | ||
319 | |||
320 | By default, both kinds of memory spreading are off, and memory | ||
321 | pages are allocated on the node local to where the task is running, | ||
322 | except perhaps as modified by the task's NUMA mempolicy or cpuset | ||
323 | configuration, so long as sufficient free memory pages are available. | ||
324 | |||
325 | When new cpusets are created, they inherit the memory spread settings | ||
326 | of their parent. | ||
327 | |||
328 | Setting memory spreading causes allocations for the affected page | ||
329 | or slab caches to ignore the task's NUMA mempolicy and be spread | ||
330 | instead. Tasks using mbind() or set_mempolicy() calls to set NUMA | ||
331 | mempolicies will not notice any change in these calls as a result of | ||
332 | their containing cpuset's memory spread settings. If memory spreading | ||
333 | is turned off, then the currently specified NUMA mempolicy once again | ||
334 | applies to memory page allocations. | ||
335 | |||
336 | Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | ||
337 | files. By default they contain "0", meaning that the feature is off | ||
338 | for that cpuset. If a "1" is written to that file, then that turns | ||
339 | the named feature on. | ||
340 | |||
341 | The implementation is simple. | ||
342 | |||
343 | Setting the flag 'memory_spread_page' turns on a per-process flag | ||
344 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | ||
345 | joins that cpuset. The page allocation calls for the page cache | ||
346 | are modified to perform an inline check for this PF_SPREAD_PAGE task | ||
347 | flag, and if set, a call to a new routine cpuset_mem_spread_node() | ||
348 | returns the node to prefer for the allocation. | ||
349 | |||
350 | Similarly, setting 'memory_spread_slab' turns on the flag | ||
351 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate | ||
352 | pages from the node returned by cpuset_mem_spread_node(). | ||
353 | |||
354 | The cpuset_mem_spread_node() routine is also simple. It uses the | ||
355 | value of a per-task rotor cpuset_mem_spread_rotor to select the next | ||
356 | node in the current task's mems_allowed to prefer for the allocation. | ||
357 | |||
358 | This memory placement policy is also known (in other contexts) as | ||
359 | round-robin or interleave. | ||
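
The rotor selection described above can be sketched as follows. This is
illustrative Python rather than the kernel routine; the name
next_spread_node is hypothetical, and mems_allowed is modelled as a plain
set of node numbers.

```python
def next_spread_node(rotor, mems_allowed):
    """Return the next allowed node after rotor, wrapping to the first."""
    nodes = sorted(mems_allowed)
    for node in nodes:
        if node > rotor:
            return node
    return nodes[0]


# A task allowed on nodes 0, 2 and 5 cycles through them in turn.
rotor = 0
order = []
for _ in range(4):
    rotor = next_spread_node(rotor, {0, 2, 5})
    order.append(rotor)
print(order)  # [2, 5, 0, 2]
```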
360 | |||
361 | This policy can provide substantial improvements for jobs that need | ||
362 | to place thread local data on the corresponding node, but that need | ||
363 | to access large file system data sets that need to be spread across | ||
364 | the several nodes in the job's cpuset in order to fit. Without this | ||
365 | policy, especially for jobs that might have one thread reading in the | ||
366 | data set, the memory allocation across the nodes in the job's cpuset | ||
367 | can become very uneven. | ||
368 | |||
369 | 1.7 What is sched_load_balance ? | ||
370 | -------------------------------- | ||
371 | |||
372 | The kernel scheduler (kernel/sched.c) automatically load balances | ||
373 | tasks. If one CPU is underutilized, kernel code running on that | ||
374 | CPU will look for tasks on other more overloaded CPUs and move those | ||
375 | tasks to itself, within the constraints of such placement mechanisms | ||
376 | as cpusets and sched_setaffinity. | ||
377 | |||
378 | The algorithmic cost of load balancing and its impact on key shared | ||
379 | kernel data structures such as the task list increases more than | ||
380 | linearly with the number of CPUs being balanced. So the scheduler | ||
381 | has support to partition the system's CPUs into a number of sched | ||
382 | domains such that it only load balances within each sched domain. | ||
383 | Each sched domain covers some subset of the CPUs in the system; | ||
384 | no two sched domains overlap; some CPUs might not be in any sched | ||
385 | domain and hence won't be load balanced. | ||
386 | |||
387 | Put simply, it costs less to balance between two smaller sched domains | ||
388 | than one big one, but doing so means that overloads in one of the | ||
389 | two domains won't be load balanced to the other one. | ||
390 | |||
391 | By default, there is one sched domain covering all CPUs, except those | ||
392 | marked isolated using the kernel boot time "isolcpus=" argument. | ||
393 | |||
394 | This default load balancing across all CPUs is not well suited for | ||
395 | the following two situations: | ||
396 | 1) On large systems, load balancing across many CPUs is expensive. | ||
397 | If the system is managed using cpusets to place independent jobs | ||
398 | on separate sets of CPUs, full load balancing is unnecessary. | ||
399 | 2) Systems supporting realtime on some CPUs need to minimize | ||
400 | system overhead on those CPUs, including avoiding task load | ||
401 | balancing if that is not needed. | ||
402 | |||
403 | When the per-cpuset flag "sched_load_balance" is enabled (the default | ||
404 | setting), it requests that all the CPUs in that cpuset's allowed 'cpus' | ||
405 | be contained in a single sched domain, ensuring that load balancing | ||
406 | can move a task (not otherwise pinned, as by sched_setaffinity) | ||
407 | from any CPU in that cpuset to any other. | ||
408 | |||
409 | When the per-cpuset flag "sched_load_balance" is disabled, then the | ||
410 | scheduler will avoid load balancing across the CPUs in that cpuset, | ||
411 | --except-- in so far as is necessary because some overlapping cpuset | ||
412 | has "sched_load_balance" enabled. | ||
413 | |||
414 | So, for example, if the top cpuset has the flag "sched_load_balance" | ||
415 | enabled, then the scheduler will have one sched domain covering all | ||
416 | CPUs, and the setting of the "sched_load_balance" flag in any other | ||
417 | cpusets won't matter, as we're already fully load balancing. | ||
418 | |||
419 | Therefore in the above two situations, the top cpuset flag | ||
420 | "sched_load_balance" should be disabled, and only some of the smaller, | ||
421 | child cpusets have this flag enabled. | ||
422 | |||
423 | When doing this, you don't usually want to leave any unpinned tasks in | ||
424 | the top cpuset that might use non-trivial amounts of CPU, as such tasks | ||
425 | may be artificially constrained to some subset of CPUs, depending on | ||
426 | the particulars of this flag setting in descendant cpusets. Even if | ||
427 | such a task could use spare CPU cycles in some other CPUs, the kernel | ||
428 | scheduler might not consider the possibility of load balancing that | ||
429 | task to that underused CPU. | ||
430 | |||
431 | Of course, tasks pinned to a particular CPU can be left in a cpuset | ||
432 | that disables "sched_load_balance" as those tasks aren't going anywhere | ||
433 | else anyway. | ||
434 | |||
435 | There is an impedance mismatch here, between cpusets and sched domains. | ||
436 | Cpusets are hierarchical and nest. Sched domains are flat; they don't | ||
437 | overlap and each CPU is in at most one sched domain. | ||
438 | |||
439 | It is necessary for sched domains to be flat because load balancing | ||
440 | across partially overlapping sets of CPUs would risk unstable dynamics | ||
441 | that would be beyond our understanding. So if each of two partially | ||
442 | overlapping cpusets enables the flag 'sched_load_balance', then we | ||
443 | form a single sched domain that is a superset of both. We won't move | ||
444 | a task to a CPU outside its cpuset, but the scheduler load balancing | ||
445 | code might waste some compute cycles considering that possibility. | ||
446 | |||
447 | This mismatch is why there is not a simple one-to-one relation | ||
448 | between which cpusets have the flag "sched_load_balance" enabled, | ||
449 | and the sched domain configuration. If a cpuset enables the flag, it | ||
450 | will get balancing across all its CPUs, but if it disables the flag, | ||
451 | it will only be assured of no load balancing if no other overlapping | ||
452 | cpuset enables the flag. | ||
453 | |||
454 | If two cpusets have partially overlapping 'cpus' allowed, and only | ||
455 | one of them has this flag enabled, then the other may find its | ||
456 | tasks only partially load balanced, just on the overlapping CPUs. | ||
457 | This is just the general case of the top_cpuset example given a few | ||
458 | paragraphs above. In the general case, as in the top cpuset case, | ||
459 | don't leave tasks that might use non-trivial amounts of CPU in | ||
460 | such partially load balanced cpusets, as they may be artificially | ||
461 | constrained to some subset of the CPUs allowed to them, for lack of | ||
462 | load balancing to the other CPUs. | ||
463 | |||
464 | 1.7.1 sched_load_balance implementation details. | ||
465 | ------------------------------------------------ | ||
466 | |||
467 | The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary | ||
468 | to most cpuset flags.) When enabled for a cpuset, the kernel will | ||
469 | ensure that it can load balance across all the CPUs in that cpuset | ||
470 | (makes sure that all the CPUs in the cpus_allowed of that cpuset are | ||
471 | in the same sched domain.) | ||
472 | |||
473 | If two overlapping cpusets both have 'sched_load_balance' enabled, | ||
474 | then they will be (must be) both in the same sched domain. | ||
475 | |||
476 | If, as is the default, the top cpuset has 'sched_load_balance' enabled, | ||
477 | then by the above that means there is a single sched domain covering | ||
478 | the whole system, regardless of any other cpuset settings. | ||
479 | |||
480 | The kernel commits to user space that it will avoid load balancing | ||
481 | where it can. It will pick as fine a granularity partition of sched | ||
482 | domains as it can while still providing load balancing for any set | ||
483 | of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | ||
484 | |||
485 | The internal kernel cpuset to scheduler interface passes from the | ||
486 | cpuset code to the scheduler code a partition of the load balanced | ||
487 | CPUs in the system. This partition is a set of subsets (represented | ||
488 | as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all | ||
489 | the CPUs that must be load balanced. | ||
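
One way to picture how such a partition arises is sketched below. This is
illustrative Python, not the kernel's algorithm; the name build_partition
is hypothetical, and the input is assumed to be the 'cpus' of every cpuset
that has 'sched_load_balance' enabled. Overlapping sets are merged until
the result is pairwise disjoint.

```python
def build_partition(balanced_cpusets):
    """Merge overlapping CPU sets into pairwise disjoint sched domains."""
    domains = []
    for cpus in balanced_cpusets:
        merged = set(cpus)
        if not merged:
            continue
        remaining = []
        for domain in domains:
            if domain & merged:
                merged |= domain      # overlap: fold into one domain
            else:
                remaining.append(domain)
        remaining.append(merged)
        domains = remaining
    # Sorted lists give a stable, readable result.
    return [sorted(d) for d in sorted(domains, key=min)]


# Two overlapping cpusets share one domain; the third stays separate.
print(build_partition([{0, 1}, {1, 2}, {4, 5}]))  # [[0, 1, 2], [4, 5]]
```

If the top cpuset, holding every CPU, is among the inputs, everything
merges into a single domain, which matches the default described above.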
490 | |||
491 | Whenever the 'sched_load_balance' flag changes, or CPUs come or go | ||
492 | from a cpuset with this flag enabled, or a cpuset with this flag | ||
493 | enabled is removed, the cpuset code builds a new such partition and | ||
494 | passes it to the scheduler sched domain setup code, to have the sched | ||
495 | domains rebuilt as necessary. | ||
496 | |||
497 | This partition exactly defines what sched domains the scheduler should | ||
498 | set up - one sched domain for each element (cpumask_t) in the partition. | ||
499 | |||
500 | The scheduler remembers the currently active sched domain partitions. | ||
501 | When the scheduler routine partition_sched_domains() is invoked from | ||
502 | the cpuset code to update these sched domains, it compares the new | ||
503 | partition requested with the current, and updates its sched domains, | ||
504 | removing the old and adding the new, for each change. | ||
505 | |||
506 | |||
507 | 1.8 What is sched_relax_domain_level ? | ||
508 | -------------------------------------- | ||
509 | |||
510 | Within a sched domain, the scheduler migrates tasks in two ways: by | ||
511 | periodic load balancing on tick, and at the time of some schedule events. | ||
512 | |||
513 | When a task is woken up, the scheduler tries to move it to an idle CPU. | ||
514 | For example, if a task A running on CPU X activates another task B | ||
515 | on the same CPU X, and if CPU Y is X's sibling and is idle, | ||
516 | then the scheduler migrates task B to CPU Y so that task B can start on | ||
517 | CPU Y without waiting for task A on CPU X. | ||
518 | |||
519 | And if a CPU runs out of tasks in its runqueue, the CPU tries to pull | ||
520 | extra tasks from other busy CPUs to help them before it goes | ||
521 | idle. | ||
522 | |||
523 | Of course, finding movable tasks and/or idle CPUs has a searching cost, | ||
524 | so the scheduler might not search all CPUs in the domain | ||
525 | every time. In fact, on some architectures, the searching range on | ||
526 | events is limited to the socket or node where the CPU is located, | ||
527 | while the load balance on tick searches all CPUs. | ||
528 | |||
529 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | ||
530 | is idle while CPU X and its siblings are busy, the scheduler can't migrate | ||
531 | woken task B from X to Z since Z is outside its searching range. | ||
532 | As a result, task B on CPU X needs to wait for task A or for load balancing | ||
533 | on the next tick. For some applications in special situations, waiting | ||
534 | 1 tick may be too long. | ||
535 | |||
536 | The 'sched_relax_domain_level' file allows you to request changing | ||
537 | this searching range as you like. This file takes an int value that | ||
538 | ideally indicates the size of the searching range in levels, as follows; | ||
539 | the initial value, -1, indicates that the cpuset has no request. | ||
540 | |||
541 | -1 : no request. use system default or follow request of others. | ||
542 | 0 : no search. | ||
543 | 1 : search siblings (hyperthreads in a core). | ||
544 | 2 : search cores in a package. | ||
545 | 3 : search cpus in a node [= system wide on non-NUMA system] | ||
546 | ( 4 : search nodes in a chunk of node [on NUMA system] ) | ||
547 | ( 5 : search system wide [on NUMA system] ) | ||
548 | |||
549 | The system default is architecture dependent. The system default | ||
550 | can be changed using the relax_domain_level= boot parameter. | ||
551 | |||
552 | This file is per-cpuset and affects the sched domain to which the cpuset | ||
553 | belongs. Therefore if the flag 'sched_load_balance' of a cpuset | ||
554 | is disabled, then 'sched_relax_domain_level' has no effect since | ||
555 | there is no sched domain belonging to the cpuset. | ||
556 | |||
557 | If multiple cpusets are overlapping and hence they form a single sched | ||
558 | domain, the largest value among those is used. Be careful: if one | ||
559 | requests 0 and the others are -1, then 0 is used. | ||
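The "largest value wins" rule above can be sketched as a small shell
function. This only models the rule as stated in the text; the name
effective_relax_level is hypothetical and nothing here talks to the kernel.

```shell
# effective_relax_level LEVEL...
# Hypothetical helper: print the value that would be used for a sched
# domain formed by overlapping cpusets, given the value requested by
# each cpuset. -1 means "no request"; the largest value wins.
effective_relax_level() (
    max=-1
    for level in "$@"; do
        if [ "$level" -gt "$max" ]; then
            max=$level
        fi
    done
    printf '%s\n' "$max"
)
```

For instance, "effective_relax_level -1 0 -1" prints 0, which is the
case the text warns about: a single request of 0 overrides the cpusets
that made no request.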
560 | |||
561 | Note that modifying this file will have both good and bad effects, | ||
562 | and whether it is acceptable or not will depend on your situation. | ||
563 | Don't modify this file if you are not sure. | ||
564 | |||
565 | If your situation is: | ||
566 | - The migration costs between each cpu can be assumed considerably | ||
567 | small (for you) due to your special application's behavior or | ||
568 | special hardware support for CPU cache etc. | ||
569 | - The searching cost doesn't have an impact (for you) or you can make | ||
570 | the searching cost small enough by keeping cpusets compact etc. | ||
571 | - Low latency is required even if it sacrifices cache hit rate etc. | ||
572 | then increasing 'sched_relax_domain_level' would benefit you. | ||
573 | |||
574 | |||
575 | 1.9 How do I use cpusets ? | ||
576 | -------------------------- | ||
577 | |||
578 | In order to minimize the impact of cpusets on critical kernel | ||
579 | code, such as the scheduler, and due to the fact that the kernel | ||
580 | does not support one task updating the memory placement of another | ||
581 | task directly, the impact on a task of changing its cpuset CPU | ||
582 | or Memory Node placement, or of changing to which cpuset a task | ||
583 | is attached, is subtle. | ||
584 | |||
585 | If a cpuset has its Memory Nodes modified, then for each task attached | ||
586 | to that cpuset, the next time that the kernel attempts to allocate | ||
587 | a page of memory for that task, the kernel will notice the change | ||
588 | in the task's cpuset, and update its per-task memory placement to | ||
589 | remain within the new cpuset's memory placement. If the task was using | ||
590 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with | ||
591 | its new cpuset, then the task will continue to use whatever subset | ||
592 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task | ||
593 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | ||
594 | in the new cpuset, then the task will be essentially treated as if it | ||
595 | was MPOL_BIND bound to the new cpuset (even though its NUMA placement, | ||
596 | as queried by get_mempolicy(), doesn't change). If a task is moved | ||
597 | from one cpuset to another, then the kernel will adjust the task's | ||
598 | memory placement, as above, the next time that the kernel attempts | ||
599 | to allocate a page of memory for that task. | ||
600 | |||
601 | If a cpuset has its 'cpus' modified, then each task in that cpuset | ||
602 | will have its allowed CPU placement changed immediately. Similarly, | ||
603 | if a task's pid is written to a cpuset's 'tasks' file, in either its | ||
604 | current cpuset or another cpuset, then its allowed CPU placement is | ||
605 | changed immediately. If such a task had been bound to some subset | ||
606 | of its cpuset using the sched_setaffinity() call, the task will be | ||
607 | allowed to run on any CPU allowed in its new cpuset, negating the | ||
608 | effect of the prior sched_setaffinity() call. | ||
609 | |||
610 | In summary, the memory placement of a task whose cpuset is changed is | ||
611 | updated by the kernel, on the next allocation of a page for that task, | ||
612 | and the processor placement is updated immediately. | ||
616 | |||
617 | Normally, once a page is allocated (given a physical page | ||
618 | of main memory) then that page stays on whatever node it | ||
619 | was allocated on, so long as it remains allocated, even if the | ||
620 | cpuset's memory placement policy 'mems' subsequently changes. | ||
621 | If the cpuset flag file 'memory_migrate' is set true, then when | ||
622 | a task is attached to that cpuset, any pages that task had | ||
623 | allocated to it on nodes in its previous cpuset are migrated | ||
624 | to the task's new cpuset. The relative placement of the page within | ||
625 | the cpuset is preserved during these migration operations if possible. | ||
626 | For example, if the page was on the second valid node of the prior cpuset | ||
627 | then the page will be placed on the second valid node of the new cpuset. | ||
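The positional rule in the example above can be illustrated with a
short shell sketch. It is not the kernel's migration code: the function
name remap_node is hypothetical, the node lists are passed as
space-separated strings, and the case where the new cpuset has no node
at the same position (the "if possible" caveat) is deliberately not
modeled.

```shell
# remap_node "OLD_NODES" "NEW_NODES" NODE
# Hypothetical helper: print the node in NEW_NODES that sits at the same
# position as NODE does in OLD_NODES. Fails if NODE is not in OLD_NODES
# or if NEW_NODES has no node at that position.
remap_node() (
    old_nodes=$1
    new_nodes=$2
    node=$3
    pos=0
    found=0
    for n in $old_nodes; do
        pos=$((pos + 1))
        if [ "$n" = "$node" ]; then
            found=$pos
            break
        fi
    done
    # A page that was not on a node of the prior cpuset has no position.
    [ "$found" -gt 0 ] || return 1
    i=0
    for n in $new_nodes; do
        i=$((i + 1))
        if [ "$i" -eq "$found" ]; then
            printf '%s\n' "$n"
            return 0
        fi
    done
    return 1
)
```

With a prior cpuset on nodes "0 1" and a new cpuset on nodes "4 5",
"remap_node '0 1' '4 5' 1" prints 5: the second node maps to the second
node.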
628 | |||
629 | Also if 'memory_migrate' is set true, then if that cpuset's | ||
630 | 'mems' file is modified, pages allocated to tasks in that | ||
631 | cpuset, that were on nodes in the previous setting of 'mems', | ||
632 | will be moved to nodes in the new setting of 'mems'. | ||
633 | Pages that were not in the task's prior cpuset, or in the cpuset's | ||
634 | prior 'mems' setting, will not be moved. | ||
635 | |||
636 | There is an exception to the above. If hotplug functionality is used | ||
637 | to remove all the CPUs that are currently assigned to a cpuset, | ||
638 | then all the tasks in that cpuset will be moved to the nearest ancestor | ||
639 | with non-empty cpus. But the moving of some (or all) tasks might fail if | ||
640 | the cpuset is bound with another cgroup subsystem which has some restrictions | ||
641 | on task attaching. In this failing case, those tasks will stay | ||
642 | in the original cpuset, and the kernel will automatically update | ||
643 | their cpus_allowed to allow all online CPUs. When memory hotplug | ||
644 | functionality for removing Memory Nodes is available, a similar exception | ||
645 | is expected to apply there as well. In general, the kernel prefers to | ||
646 | violate cpuset placement, over starving a task that has had all | ||
647 | its allowed CPUs or Memory Nodes taken offline. | ||
648 | |||
649 | There is a second exception to the above. GFP_ATOMIC requests are | ||
650 | kernel internal allocations that must be satisfied immediately. | ||
651 | The kernel may drop some requests, in rare cases even panic, if a | ||
652 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within | ||
653 | the current task's cpuset, then we relax the cpuset, and look for | ||
654 | memory anywhere we can find it. It's better to violate the cpuset | ||
655 | than stress the kernel. | ||
656 | |||
657 | To start a new job that is to be contained within a cpuset, the steps are: | ||
658 | |||
659 | 1) mkdir /dev/cpuset | ||
660 | 2) mount -t cgroup -ocpuset cpuset /dev/cpuset | ||
661 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in | ||
662 | the /dev/cpuset virtual file system. | ||
663 | 4) Start a task that will be the "founding father" of the new job. | ||
664 | 5) Attach that task to the new cpuset by writing its pid to the | ||
665 | /dev/cpuset tasks file for that cpuset. | ||
666 | 6) fork, exec or clone the job tasks from this founding father task. | ||
667 | |||
668 | For example, the following sequence of commands will set up a cpuset | ||
669 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | ||
670 | and then start a subshell 'sh' in that cpuset: | ||
671 | |||
672 | mount -t cgroup -ocpuset cpuset /dev/cpuset | ||
673 | cd /dev/cpuset | ||
674 | mkdir Charlie | ||
675 | cd Charlie | ||
676 | /bin/echo 2-3 > cpus | ||
677 | /bin/echo 1 > mems | ||
678 | /bin/echo $$ > tasks | ||
679 | sh | ||
680 | # The subshell 'sh' is now running in cpuset Charlie | ||
681 | # The next line should display '/Charlie' | ||
682 | cat /proc/self/cpuset | ||
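The create-and-configure part of the sequence above could be wrapped in
a function, sketched below. The name make_cpuset and its argument order
are hypothetical. Passing the mount point as an argument is only a
convenience so that the function can be tried against a scratch
directory before it is pointed at /dev/cpuset.

```shell
# make_cpuset ROOT NAME CPUS MEMS
# Hypothetical helper: create cpuset NAME under the cpuset mount point
# ROOT and give it the CPUs and Memory Nodes listed in CPUS and MEMS.
# /bin/echo is used so that a failed write shows up in the exit status.
make_cpuset() (
    root=$1
    name=$2
    cpus=$3
    mems=$4
    mkdir -p "$root/$name" || return 1
    cd "$root/$name" || return 1
    /bin/echo "$cpus" > cpus || return 1
    /bin/echo "$mems" > mems || return 1
)
```

Assuming the cpuset file system is mounted on /dev/cpuset, the Charlie
example would then read:

  make_cpuset /dev/cpuset Charlie 2-3 1
  /bin/echo $$ > /dev/cpuset/Charlie/tasks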
683 | |||
684 | In the future, a C library interface to cpusets will likely be | ||
685 | available. For now, the only way to query or modify cpusets is | ||
686 | via the cpuset file system, using the various cd, mkdir, echo, cat, | ||
687 | rmdir commands from the shell, or their equivalent from C. | ||
688 | |||
689 | The sched_setaffinity calls can also be done at the shell prompt using | ||
690 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | ||
691 | calls can be done at the shell prompt using the numactl command | ||
692 | (part of Andi Kleen's numa package). | ||
693 | |||
694 | 2. Usage Examples and Syntax | ||
695 | ============================ | ||
696 | |||
697 | 2.1 Basic Usage | ||
698 | --------------- | ||
699 | |||
700 | Creating, modifying, and using cpusets can be done through the cpuset | ||
701 | virtual filesystem. | ||
702 | |||
703 | To mount it, type: | ||
704 | # mount -t cgroup -o cpuset cpuset /dev/cpuset | ||
705 | |||
706 | Then under /dev/cpuset you can find a tree that corresponds to the | ||
707 | tree of the cpusets in the system. For instance, /dev/cpuset | ||
708 | is the cpuset that holds the whole system. | ||
709 | |||
710 | If you want to create a new cpuset under /dev/cpuset: | ||
711 | # cd /dev/cpuset | ||
712 | # mkdir my_cpuset | ||
713 | |||
714 | Now you want to do something with this cpuset. | ||
715 | # cd my_cpuset | ||
716 | |||
717 | In this directory you can find several files: | ||
718 | # ls | ||
719 | cpu_exclusive memory_migrate mems tasks | ||
720 | cpus memory_pressure notify_on_release | ||
721 | mem_exclusive memory_spread_page sched_load_balance | ||
722 | mem_hardwall memory_spread_slab sched_relax_domain_level | ||
723 | |||
724 | Reading them will give you information about the state of this cpuset: | ||
725 | the CPUs and Memory Nodes it can use, the processes that are using | ||
726 | it, its properties. By writing to these files you can manipulate | ||
727 | the cpuset. | ||
728 | |||
729 | Set some flags: | ||
730 | # /bin/echo 1 > cpu_exclusive | ||
731 | |||
732 | Add some cpus: | ||
733 | # /bin/echo 0-7 > cpus | ||
734 | |||
735 | Add some mems: | ||
736 | # /bin/echo 0-7 > mems | ||
737 | |||
738 | Now attach your shell to this cpuset: | ||
739 | # /bin/echo $$ > tasks | ||
740 | |||
741 | You can also create cpusets inside your cpuset by using mkdir in this | ||
742 | directory. | ||
743 | # mkdir my_sub_cs | ||
744 | |||
745 | To remove a cpuset, just use rmdir: | ||
746 | # rmdir my_sub_cs | ||
747 | This will fail if the cpuset is in use (has cpusets inside, or has | ||
748 | processes attached). | ||
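Whether rmdir is likely to fail can be checked beforehand with a sketch
like the following. The function name cpuset_in_use is hypothetical. It
reads the 'tasks' file rather than testing its size, on the assumption
that files in a virtual file system may report a size of 0 even when
they have content.

```shell
# cpuset_in_use DIR
# Hypothetical helper: succeed if the cpuset directory DIR has tasks
# attached or contains child cpusets, i.e. if rmdir would be refused.
cpuset_in_use() (
    dir=$1
    # Any output from 'tasks' means at least one process is attached.
    if [ -n "$(cat "$dir/tasks" 2>/dev/null)" ]; then
        return 0
    fi
    # Any subdirectory is a child cpuset.
    for child in "$dir"/*/; do
        if [ -d "$child" ]; then
            return 0
        fi
    done
    return 1
)
```

A possible use is "cpuset_in_use my_sub_cs || rmdir my_sub_cs". The
check and the rmdir are not atomic, so rmdir can still fail if a task is
attached in between.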
749 | |||
750 | Note that for legacy reasons, the "cpuset" filesystem exists as a | ||
751 | wrapper around the cgroup filesystem. | ||
752 | |||
753 | The command | ||
754 | |||
755 | mount -t cpuset X /dev/cpuset | ||
756 | |||
757 | is equivalent to | ||
758 | |||
759 | mount -t cgroup -ocpuset X /dev/cpuset | ||
760 | echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | ||
761 | |||
762 | 2.2 Adding/removing cpus | ||
763 | ------------------------ | ||
764 | |||
765 | This is the syntax to use when writing in the cpus or mems files | ||
766 | in cpuset directories: | ||
767 | |||
768 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | ||
769 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | ||
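To make the two notations concrete, here is a sketch that expands a
list into the individual CPU numbers it names. The function name
expand_cpulist is hypothetical, and the input is assumed to be well
formed (comma-separated numbers or ascending ranges); the kernel does
its own parsing when the file is written.

```shell
# expand_cpulist LIST
# Hypothetical helper: print the numbers named by LIST on one line,
# separated by spaces. LIST is a comma-separated list of numbers and
# ranges such as 1-4 or 1,2,3,4.
expand_cpulist() (
    out=
    IFS=,
    for part in $1; do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;
            *)   first=$part; last=$part ;;
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    printf '%s\n' "$out"
)
```

Both "expand_cpulist 1-4" and "expand_cpulist 1,2,3,4" print
"1 2 3 4", matching the two equivalent writes shown above.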
770 | |||
771 | 2.3 Setting flags | ||
772 | ----------------- | ||
773 | |||
774 | The syntax is very simple: | ||
775 | |||
776 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | ||
777 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | ||
778 | |||
779 | 2.4 Attaching processes | ||
780 | ----------------------- | ||
781 | |||
782 | # /bin/echo PID > tasks | ||
783 | |||
784 | Note that it is PID, not PIDs. You can only attach ONE task at a time. | ||
785 | If you have several tasks to attach, you have to do it one after another: | ||
786 | |||
787 | # /bin/echo PID1 > tasks | ||
788 | # /bin/echo PID2 > tasks | ||
789 | ... | ||
790 | # /bin/echo PIDn > tasks | ||
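The one-PID-per-write rule lends itself to a loop. The sketch below uses
a hypothetical function name, attach_pids, and takes the path of the
'tasks' file as its first argument. Each PID is written with its own
/bin/echo so that every attach gets its own error code.

```shell
# attach_pids TASKS_FILE PID...
# Hypothetical helper: write each PID to TASKS_FILE with a separate
# write, report the ones that fail, and succeed only if all succeeded.
attach_pids() (
    tasks=$1
    shift
    failed=0
    for pid in "$@"; do
        if ! /bin/echo "$pid" > "$tasks"; then
            echo "could not attach $pid" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
)
```

For example: attach_pids /dev/cpuset/my_cpuset/tasks PID1 PID2 PID3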
791 | |||
792 | |||
793 | 3. Questions | ||
794 | ============ | ||
795 | |||
796 | Q: What's up with this '/bin/echo' ? | ||
797 | A: bash's builtin 'echo' command does not check calls to write() for | ||
798 | errors. If you use it in the cpuset file system, you won't be | ||
799 | able to tell whether a command succeeded or failed. | ||
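One way to act on the result is to test the exit status of /bin/echo,
as in the sketch below. The function name checked_write is hypothetical.

```shell
# checked_write VALUE FILE
# Hypothetical helper: write VALUE to FILE with /bin/echo and report a
# failure on stderr, returning non-zero so that callers can react.
checked_write() {
    if /bin/echo "$1" > "$2"; then
        return 0
    fi
    echo "write of '$1' to $2 failed" >&2
    return 1
}
```

For example: checked_write 1 cpu_exclusive || echo "flag not set"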
800 | |||
801 | Q: When I attach processes, only the first one on the line really gets attached! | ||
802 | A: We can only return one error code per call to write(). So you should | ||
803 | put only ONE pid per write. | ||
804 | |||
805 | 4. Contact | ||
806 | ========== | ||
807 | |||
808 | Web: http://www.bullopensource.org/cpuset | ||