Diffstat (limited to 'Documentation/cpusets.txt')
-rw-r--r-- | Documentation/cpusets.txt | 415 |
1 files changed, 415 insertions, 0 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
new file mode 100644
index 000000000000..1ad26d2c20ae
--- /dev/null
+++ b/Documentation/cpusets.txt
@@ -0,0 +1,415 @@
1 | CPUSETS | ||
2 | ------- | ||
3 | |||
4 | Copyright (C) 2004 BULL SA. | ||
5 | Written by Simon.Derr@bull.net | ||
6 | |||
7 | Portions Copyright (c) 2004 Silicon Graphics, Inc. | ||
8 | Modified by Paul Jackson <pj@sgi.com> | ||
9 | |||
10 | CONTENTS: | ||
11 | ========= | ||
12 | |||
13 | 1. Cpusets | ||
14 | 1.1 What are cpusets ? | ||
15 | 1.2 Why are cpusets needed ? | ||
16 | 1.3 How are cpusets implemented ? | ||
17 | 1.4 How do I use cpusets ? | ||
18 | 2. Usage Examples and Syntax | ||
19 | 2.1 Basic Usage | ||
20 | 2.2 Adding/removing cpus | ||
21 | 2.3 Setting flags | ||
22 | 2.4 Attaching processes | ||
23 | 3. Questions | ||
24 | 4. Contact | ||
25 | |||
26 | 1. Cpusets | ||
27 | ========== | ||
28 | |||
29 | 1.1 What are cpusets ? | ||
30 | ---------------------- | ||
31 | |||
32 | Cpusets provide a mechanism for assigning a set of CPUs and Memory | ||
33 | Nodes to a set of tasks. | ||
34 | |||
35 | Cpusets constrain the CPU and Memory placement of tasks to only | ||
36 | the resources within a task's current cpuset. They form a nested | ||
37 | hierarchy visible in a virtual file system. These are the essential | ||
38 | hooks, beyond what is already present, required to manage dynamic | ||
39 | job placement on large systems. | ||
40 | |||
41 | Each task has a pointer to a cpuset. Multiple tasks may reference | ||
42 | the same cpuset. Requests by a task, using the sched_setaffinity(2) | ||
43 | system call to include CPUs in its CPU affinity mask, and using the | ||
44 | mbind(2) and set_mempolicy(2) system calls to include Memory Nodes | ||
45 | in its memory policy, are both filtered through that task's cpuset, | ||
46 | filtering out any CPUs or Memory Nodes not in that cpuset. The | ||
47 | scheduler will not schedule a task on a CPU that is not allowed in | ||
48 | its cpus_allowed vector, and the kernel page allocator will not | ||
49 | allocate a page on a node that is not allowed in the requesting task's | ||
50 | mems_allowed vector. | ||
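
The filtering described above amounts to a bitwise AND of the requested
mask with the mask of the task's cpuset. The following is only a minimal
shell sketch of that idea, assuming masks small enough for shell
arithmetic; the function name filter_mask is hypothetical and is not a
kernel or cpuset interface:

```shell
#!/bin/sh
# filter_mask REQUESTED ALLOWED
# Print the hex mask of CPUs that are both requested and allowed.
# Both arguments are hex strings without a 0x prefix (illustration only).
filter_mask() {
    printf '%x\n' $(( 0x$1 & 0x$2 ))
}

# A task asks for CPUs 0-7 (mask ff) while its cpuset allows only
# CPUs 2-3 (mask 0c); only the CPUs in both masks survive.
filter_mask ff 0c
```

Here the request for CPUs 0-7 is reduced to CPUs 2 and 3, the only ones
present in both masks.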
51 | |||
52 | If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct | ||
53 | ancestor or descendant, may share any of the same CPUs or Memory Nodes. | ||
54 | |||
55 | User level code may create and destroy cpusets by name in the cpuset | ||
56 | virtual file system, manage the attributes and permissions of these | ||
57 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | ||
58 | specify and query to which cpuset a task is assigned, and list the | ||
59 | task pids assigned to a cpuset. | ||
60 | |||
61 | |||
62 | 1.2 Why are cpusets needed ? | ||
63 | ---------------------------- | ||
64 | |||
65 | The management of large computer systems, with many processors (CPUs), | ||
66 | complex memory cache hierarchies and multiple Memory Nodes having | ||
67 | non-uniform access times (NUMA) presents additional challenges for | ||
68 | the efficient scheduling and memory placement of processes. | ||
69 | |||
70 | Frequently more modest sized systems can be operated with adequate | ||
71 | efficiency just by letting the operating system automatically share | ||
72 | the available CPU and Memory resources amongst the requesting tasks. | ||
73 | |||
74 | But larger systems, which benefit more from careful processor and | ||
75 | memory placement to reduce memory access times and contention, | ||
76 | and which typically represent a larger investment for the customer, | ||
77 | can benefit from explicitly placing jobs on properly sized subsets of | ||
78 | the system. | ||
79 | |||
80 | This can be especially valuable on: | ||
81 | |||
82 | * Web Servers running multiple instances of the same web application, | ||
83 | * Servers running different applications (for instance, a web server | ||
84 | and a database), or | ||
85 | * NUMA systems running large HPC applications with demanding | ||
86 | performance characteristics. | ||
87 | |||
88 | These subsets, or "soft partitions", must be able to be dynamically | ||
89 | adjusted, as the job mix changes, without impacting other concurrently | ||
90 | executing jobs. | ||
91 | |||
92 | The kernel cpuset patch provides the minimum essential kernel | ||
93 | mechanisms required to efficiently implement such subsets. It | ||
94 | leverages existing CPU and Memory Placement facilities in the Linux | ||
95 | kernel to avoid any additional impact on the critical scheduler or | ||
96 | memory allocator code. | ||
97 | |||
98 | |||
99 | 1.3 How are cpusets implemented ? | ||
100 | --------------------------------- | ||
101 | |||
102 | Cpusets provide a Linux kernel (2.6.7 and above) mechanism to constrain | ||
103 | which CPUs and Memory Nodes are used by a process or set of processes. | ||
104 | |||
105 | The Linux kernel already has a pair of mechanisms to specify on which | ||
106 | CPUs a task may be scheduled (sched_setaffinity) and on which Memory | ||
107 | Nodes it may obtain memory (mbind, set_mempolicy). | ||
108 | |||
109 | Cpusets extend these two mechanisms as follows: | ||
110 | |||
111 | - Cpusets are sets of allowed CPUs and Memory Nodes, known to the | ||
112 | kernel. | ||
113 | - Each task in the system is attached to a cpuset, via a pointer | ||
114 | in the task structure to a reference counted cpuset structure. | ||
115 | - Calls to sched_setaffinity are filtered to just those CPUs | ||
116 | allowed in that task's cpuset. | ||
117 | - Calls to mbind and set_mempolicy are filtered to just | ||
118 | those Memory Nodes allowed in that task's cpuset. | ||
119 | - The root cpuset contains all the system's CPUs and Memory | ||
120 | Nodes. | ||
121 | - For any cpuset, one can define child cpusets containing a subset | ||
122 | of the parent's CPU and Memory Node resources. | ||
123 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for | ||
124 | browsing and manipulation from user space. | ||
125 | - A cpuset may be marked exclusive, which ensures that no other | ||
126 | cpuset (except direct ancestors and descendants) may contain | ||
127 | any overlapping CPUs or Memory Nodes. | ||
128 | - You can list all the tasks (by pid) attached to any cpuset. | ||
129 | |||
130 | The implementation of cpusets requires a few, simple hooks | ||
131 | into the rest of the kernel, none in performance critical paths: | ||
132 | |||
133 | - in init/main.c, to initialize the root cpuset at system boot. | ||
134 | - in fork and exit, to attach and detach a task from its cpuset. | ||
135 | - in sched_setaffinity, to mask the requested CPUs by what's | ||
136 | allowed in that task's cpuset. | ||
137 | - in sched.c migrate_all_tasks(), to keep migrating tasks within | ||
138 | the CPUs allowed by their cpuset, if possible. | ||
139 | - in the mbind and set_mempolicy system calls, to mask the requested | ||
140 | Memory Nodes by what's allowed in that task's cpuset. | ||
141 | - in page_alloc, to restrict memory to allowed nodes. | ||
142 | - in vmscan.c, to restrict page recovery to the current cpuset. | ||
143 | |||
144 | In addition a new file system, of type "cpuset" may be mounted, | ||
145 | typically at /dev/cpuset, to enable browsing and modifying the cpusets | ||
146 | presently known to the kernel. No new system calls are added for | ||
147 | cpusets - all support for querying and modifying cpusets is via | ||
148 | this cpuset file system. | ||
149 | |||
150 | Each task under /proc has an added file named 'cpuset', displaying | ||
151 | the cpuset name, as the path relative to the root of the cpuset file | ||
152 | system. | ||
153 | |||
154 | The /proc/<pid>/status file for each task has two added lines, | ||
155 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) | ||
156 | and mems_allowed (on which Memory Nodes it may obtain memory), | ||
157 | in the format seen in the following example: | ||
158 | |||
159 | Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff | ||
160 | Mems_allowed: ffffffff,ffffffff | ||
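
To look at just these two lines for a given task, the status text can be
filtered from the shell. This is a minimal sketch; the helper name
allowed_masks is hypothetical, and the pattern assumes the field names
appear exactly as in the example above:

```shell
#!/bin/sh
# allowed_masks: read the text of a /proc/<pid>/status file on stdin
# and print only its Cpus_allowed and Mems_allowed lines.
allowed_masks() {
    grep -E '^(Cpus|Mems)_allowed:'
}

# Show the masks of the current shell, if this kernel provides them.
if [ -r "/proc/$$/status" ]; then
    allowed_masks < "/proc/$$/status" || true
fi
```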
161 | |||
162 | Each cpuset is represented by a directory in the cpuset file system | ||
163 | containing the following files describing that cpuset: | ||
164 | |||
165 | - cpus: list of CPUs in that cpuset | ||
166 | - mems: list of Memory Nodes in that cpuset | ||
167 | - cpu_exclusive flag: is cpu placement exclusive? | ||
168 | - mem_exclusive flag: is memory placement exclusive? | ||
169 | - tasks: list of tasks (by pid) attached to that cpuset | ||
170 | |||
171 | New cpusets are created using the mkdir system call or shell | ||
172 | command. The properties of a cpuset, such as its flags, allowed | ||
173 | CPUs and Memory Nodes, and attached tasks, are modified by writing | ||
174 | to the appropriate file in that cpuset's directory, as listed above. | ||
175 | |||
176 | The named hierarchical structure of nested cpusets allows partitioning | ||
177 | a large system into nested, dynamically changeable, "soft-partitions". | ||
178 | |||
179 | The attachment of each task, automatically inherited at fork by any | ||
180 | children of that task, to a cpuset allows organizing the work load | ||
181 | on a system into related sets of tasks such that each set is constrained | ||
182 | to using the CPUs and Memory Nodes of a particular cpuset. A task | ||
183 | may be re-attached to any other cpuset, if allowed by the permissions | ||
184 | on the necessary cpuset file system directories. | ||
185 | |||
186 | Such management of a system "in the large" integrates smoothly with | ||
187 | the detailed placement done on individual tasks and memory regions | ||
188 | using the sched_setaffinity, mbind and set_mempolicy system calls. | ||
189 | |||
190 | The following rules apply to each cpuset: | ||
191 | |||
192 | - Its CPUs and Memory Nodes must be a subset of its parent's. | ||
193 | - It can only be marked exclusive if its parent is. | ||
194 | - If its cpu or memory is exclusive, they may not overlap any sibling. | ||
195 | |||
196 | These rules, and the natural hierarchy of cpusets, enable efficient | ||
197 | enforcement of the exclusive guarantee, without having to scan all | ||
198 | cpusets every time any of them change to ensure nothing overlaps an | ||
199 | exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | ||
200 | to represent the cpuset hierarchy provides for a familiar permission | ||
201 | and name space for cpusets, with a minimum of additional kernel code. | ||
202 | |||
203 | 1.4 How do I use cpusets ? | ||
204 | -------------------------- | ||
205 | |||
206 | In order to minimize the impact of cpusets on critical kernel | ||
207 | code, such as the scheduler, and due to the fact that the kernel | ||
208 | does not support one task updating the memory placement of another | ||
209 | task directly, the impact on a task of changing its cpuset CPU | ||
210 | or Memory Node placement, or of changing to which cpuset a task | ||
211 | is attached, is subtle. | ||
212 | |||
213 | If a cpuset has its Memory Nodes modified, then for each task attached | ||
214 | to that cpuset, the next time that the kernel attempts to allocate | ||
215 | a page of memory for that task, the kernel will notice the change | ||
216 | in the task's cpuset, and update its per-task memory placement to | ||
217 | remain within the new cpuset's memory placement. If the task was using | ||
218 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with | ||
219 | its new cpuset, then the task will continue to use whatever subset | ||
220 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task | ||
221 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | ||
222 | in the new cpuset, then the task will be essentially treated as if it | ||
223 | was MPOL_BIND bound to the new cpuset (even though its numa placement, | ||
224 | as queried by get_mempolicy(), doesn't change). If a task is moved | ||
225 | from one cpuset to another, then the kernel will adjust the task's | ||
226 | memory placement, as above, the next time that the kernel attempts | ||
227 | to allocate a page of memory for that task. | ||
228 | |||
229 | If a cpuset has its CPUs modified, then each task using that | ||
230 | cpuset does _not_ change its behavior automatically. In order to | ||
231 | minimize the impact on the critical scheduling code in the kernel, | ||
232 | tasks will continue to use their prior CPU placement until they | ||
233 | are rebound to their cpuset, by rewriting their pid to the 'tasks' | ||
234 | file of their cpuset. If a task had been bound to some subset of its | ||
235 | cpuset using the sched_setaffinity() call, and if any of that subset | ||
236 | is still allowed in its new cpuset settings, then the task will be | ||
237 | restricted to the intersection of the CPUs it was allowed on before, | ||
238 | and its new cpuset CPU placement. If, on the other hand, there is | ||
239 | no overlap between a task's prior placement and its new cpuset CPU | ||
240 | placement, then the task will be allowed to run on any CPU allowed | ||
241 | in its new cpuset. If a task is moved from one cpuset to another, | ||
242 | its CPU placement is updated in the same way as if the task's pid were | ||
243 | rewritten to the 'tasks' file of its current cpuset. | ||
244 | |||
245 | In summary, the memory placement of a task whose cpuset is changed is | ||
246 | updated by the kernel, on the next allocation of a page for that task, | ||
247 | but the processor placement is not updated, until that task's pid is | ||
248 | rewritten to the 'tasks' file of its cpuset. This is done to avoid | ||
249 | impacting the scheduler code in the kernel with a check for changes | ||
250 | in a task's processor placement. | ||
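
After changing the 'cpus' file of a cpuset, the rebinding described above
can be applied to every attached task with a small loop. This is only a
sketch; the function name rebind_tasks is hypothetical, and the commented
example assumes the cpuset file system is mounted at /dev/cpuset:

```shell
#!/bin/sh
# rebind_tasks CPUSET_DIR
# Rewrite every pid listed in CPUSET_DIR/tasks back to that same file,
# one pid per write, so that each task is rebound to its cpuset.
rebind_tasks() {
    dir=$1
    # Snapshot the list first, since the file is written in the loop.
    pids=$(cat "$dir/tasks") || return 1
    for pid in $pids; do
        # A task may exit between the read and the write; report and go on.
        /bin/echo "$pid" > "$dir/tasks" || echo "could not rebind $pid" >&2
    done
}

# Example:
#   rebind_tasks /dev/cpuset/Charlie
```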
251 | |||
252 | There is an exception to the above. If hotplug functionality is used | ||
253 | to remove all the CPUs that are currently assigned to a cpuset, | ||
254 | then the kernel will automatically update the cpus_allowed of all | ||
255 | tasks attached to CPUs in that cpuset with the online CPUs of the | ||
256 | nearest parent cpuset that still has some CPUs online. When memory | ||
257 | hotplug functionality for removing Memory Nodes is available, a | ||
258 | similar exception is expected to apply there as well. In general, | ||
259 | the kernel prefers to violate cpuset placement, over starving a task | ||
260 | that has had all its allowed CPUs or Memory Nodes taken offline. User | ||
261 | code should reconfigure cpusets to only refer to online CPUs and Memory | ||
262 | Nodes when using hotplug to add or remove such resources. | ||
263 | |||
264 | There is a second exception to the above. GFP_ATOMIC requests are | ||
265 | kernel internal allocations that must be satisfied, immediately. | ||
266 | The kernel may drop some request, in rare cases even panic, if a | ||
267 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within | ||
268 | the current task's cpuset, then we relax the cpuset, and look for | ||
269 | memory anywhere we can find it. It's better to violate the cpuset | ||
270 | than stress the kernel. | ||
271 | |||
272 | To start a new job that is to be contained within a cpuset, the steps are: | ||
273 | |||
274 | 1) mkdir /dev/cpuset | ||
275 | 2) mount -t cpuset none /dev/cpuset | ||
276 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in | ||
277 | the /dev/cpuset virtual file system. | ||
278 | 4) Start a task that will be the "founding father" of the new job. | ||
279 | 5) Attach that task to the new cpuset by writing its pid to the | ||
280 | /dev/cpuset tasks file for that cpuset. | ||
281 | 6) fork, exec or clone the job tasks from this founding father task. | ||
282 | |||
283 | For example, the following sequence of commands will set up a cpuset | ||
284 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | ||
285 | and then start a subshell 'sh' in that cpuset: | ||
286 | |||
287 | mount -t cpuset none /dev/cpuset | ||
288 | cd /dev/cpuset | ||
289 | mkdir Charlie | ||
290 | cd Charlie | ||
291 | /bin/echo 2-3 > cpus | ||
292 | /bin/echo 1 > mems | ||
293 | /bin/echo $$ > tasks | ||
294 | sh | ||
295 | # The subshell 'sh' is now running in cpuset Charlie | ||
296 | # The next line should display '/Charlie' | ||
297 | cat /proc/self/cpuset | ||
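
The creation steps above can also be wrapped in a function that stops at
the first step that fails. This is a minimal sketch; the name make_cpuset
and its argument order are hypothetical, and the commented example
assumes the cpuset file system is mounted at /dev/cpuset:

```shell
#!/bin/sh
# make_cpuset ROOT NAME CPUS MEMS
# Create cpuset NAME under ROOT and assign it the given CPU and
# Memory Node lists. Returns non-zero at the first failing step.
make_cpuset() {
    root=$1 name=$2 cpus=$3 mems=$4
    mkdir "$root/$name" &&
    /bin/echo "$cpus" > "$root/$name/cpus" &&
    /bin/echo "$mems" > "$root/$name/mems"
}

# Example, equivalent to the "Charlie" sequence:
#   make_cpuset /dev/cpuset Charlie 2-3 1 &&
#       /bin/echo $$ > /dev/cpuset/Charlie/tasks
```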
298 | |||
299 | In the case that a change of cpuset includes wanting to move already | ||
300 | allocated memory pages, consider further the work of IWAMOTO | ||
301 | Toshihiro <iwamoto@valinux.co.jp> for page remapping and memory | ||
302 | hotremoval, which can be found at: | ||
303 | |||
304 | http://people.valinux.co.jp/~iwamoto/mh.html | ||
305 | |||
306 | The integration of cpusets with such memory migration is not yet | ||
307 | available. | ||
308 | |||
309 | In the future, a C library interface to cpusets will likely be | ||
310 | available. For now, the only way to query or modify cpusets is | ||
311 | via the cpuset file system, using the various cd, mkdir, echo, cat, | ||
312 | rmdir commands from the shell, or their equivalent from C. | ||
313 | |||
314 | The sched_setaffinity calls can also be done at the shell prompt using | ||
315 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | ||
316 | calls can be done at the shell prompt using the numactl command | ||
317 | (part of Andi Kleen's numa package). | ||
318 | |||
319 | 2. Usage Examples and Syntax | ||
320 | ============================ | ||
321 | |||
322 | 2.1 Basic Usage | ||
323 | --------------- | ||
324 | |||
325 | Creating, modifying, and using cpusets can be done through the cpuset | ||
326 | virtual filesystem. | ||
327 | |||
328 | To mount it, type: | ||
329 | # mount -t cpuset none /dev/cpuset | ||
330 | |||
331 | Then under /dev/cpuset you can find a tree that corresponds to the | ||
332 | tree of the cpusets in the system. For instance, /dev/cpuset | ||
333 | is the cpuset that holds the whole system. | ||
334 | |||
335 | If you want to create a new cpuset under /dev/cpuset: | ||
336 | # cd /dev/cpuset | ||
337 | # mkdir my_cpuset | ||
338 | |||
339 | Now you want to do something with this cpuset. | ||
340 | # cd my_cpuset | ||
341 | |||
342 | In this directory you can find several files: | ||
343 | # ls | ||
344 | cpus cpu_exclusive mems mem_exclusive tasks | ||
345 | |||
346 | Reading them will give you information about the state of this cpuset: | ||
347 | the CPUs and Memory Nodes it can use, the processes that are using | ||
348 | it, its properties. By writing to these files you can manipulate | ||
349 | the cpuset. | ||
350 | |||
351 | Set some flags: | ||
352 | # /bin/echo 1 > cpu_exclusive | ||
353 | |||
354 | Add some cpus: | ||
355 | # /bin/echo 0-7 > cpus | ||
356 | |||
357 | Now attach your shell to this cpuset: | ||
358 | # /bin/echo $$ > tasks | ||
359 | |||
360 | You can also create cpusets inside your cpuset by using mkdir in this | ||
361 | directory. | ||
362 | # mkdir my_sub_cs | ||
363 | |||
364 | To remove a cpuset, just use rmdir: | ||
365 | # rmdir my_sub_cs | ||
366 | This will fail if the cpuset is in use (has cpusets inside, or has | ||
367 | processes attached). | ||
368 | |||
369 | 2.2 Adding/removing cpus | ||
370 | ------------------------ | ||
371 | |||
372 | This is the syntax to use when writing in the cpus or mems files | ||
373 | in cpuset directories: | ||
374 | |||
375 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | ||
376 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | ||
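
To see that the two forms name the same CPUs, a list can be expanded into
individual numbers in the shell. This is a sketch only; expand_list is a
hypothetical helper, it does no validation, and its handling of lists
that mix commas and ranges is an assumption not stated in this document:

```shell
#!/bin/sh
# expand_list LIST
# Print each number named by a list such as "1-4" or "1,2,3,4",
# one per line. Assumes well-formed input.
expand_list() {
    old_ifs=$IFS
    IFS=,
    set -- $1            # split the list on commas
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-}
                 while [ "$lo" -le "$hi" ]; do
                     echo "$lo"; lo=$((lo + 1))
                 done ;;
            *)   echo "$part" ;;
        esac
    done
}

expand_list 1-4      # prints 1, 2, 3 and 4, one per line
expand_list 1,2,3,4  # prints the same four numbers
```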
377 | |||
378 | 2.3 Setting flags | ||
379 | ----------------- | ||
380 | |||
381 | The syntax is very simple: | ||
382 | |||
383 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | ||
384 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | ||
385 | |||
386 | 2.4 Attaching processes | ||
387 | ----------------------- | ||
388 | |||
389 | # /bin/echo PID > tasks | ||
390 | |||
391 | Note that it is PID, not PIDs. You can only attach ONE task at a time. | ||
392 | If you have several tasks to attach, you have to do it one after another: | ||
393 | |||
394 | # /bin/echo PID1 > tasks | ||
395 | # /bin/echo PID2 > tasks | ||
396 | ... | ||
397 | # /bin/echo PIDn > tasks | ||
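
When the pids come from a script, the one-pid-per-write rule can be kept
by looping over them and checking each write separately. This is a
minimal sketch; the function name attach_all is hypothetical:

```shell
#!/bin/sh
# attach_all TASKS_FILE PID...
# Write each pid to TASKS_FILE in its own write, and return non-zero
# if any single attach failed.
attach_all() {
    tasks=$1; shift
    status=0
    for pid in "$@"; do
        if ! /bin/echo "$pid" > "$tasks"; then
            echo "failed to attach $pid" >&2
            status=1
        fi
    done
    return $status
}

# Example, run from inside a cpuset directory:
#   attach_all tasks PID1 PID2 PID3
```

Because each pid gets its own write, a failure can be reported for the
specific pid that caused it.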
398 | |||
399 | |||
400 | 3. Questions | ||
401 | ============ | ||
402 | |||
403 | Q: what's up with this '/bin/echo' ? | ||
404 | A: bash's builtin 'echo' command does not check calls to write() against | ||
405 | errors. If you use it in the cpuset file system, you won't be | ||
406 | able to tell whether a command succeeded or failed. | ||
407 | |||
408 | Q: When I attach processes, only the first pid on the line really gets attached ! | ||
409 | A: We can only return one error code per call to write(). So you should also | ||
410 | put only ONE pid. | ||
411 | |||
412 | 4. Contact | ||
413 | ========== | ||
414 | |||
415 | Web: http://www.bullopensource.org/cpuset | ||