Diffstat (limited to 'Documentation/cpusets.txt')
-rw-r--r-- | Documentation/cpusets.txt | 165 |
1 files changed, 138 insertions, 27 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index a09a8eb80665..990998ee10b6 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -14,7 +14,10 @@ CONTENTS: | |||
14 | 1.1 What are cpusets ? | 14 | 1.1 What are cpusets ? |
15 | 1.2 Why are cpusets needed ? | 15 | 1.2 Why are cpusets needed ? |
16 | 1.3 How are cpusets implemented ? | 16 | 1.3 How are cpusets implemented ? |
17 | 1.4 How do I use cpusets ? | 17 | 1.4 What are exclusive cpusets ? |
18 | 1.5 What does notify_on_release do ? | ||
19 | 1.6 What is memory_pressure ? | ||
20 | 1.7 How do I use cpusets ? | ||
18 | 2. Usage Examples and Syntax | 21 | 2. Usage Examples and Syntax |
19 | 2.1 Basic Usage | 22 | 2.1 Basic Usage |
20 | 2.2 Adding/removing cpus | 23 | 2.2 Adding/removing cpus |
@@ -49,29 +52,6 @@ its cpus_allowed vector, and the kernel page allocator will not | |||
49 | allocate a page on a node that is not allowed in the requesting tasks | 52 | allocate a page on a node that is not allowed in the requesting tasks |
50 | mems_allowed vector. | 53 | mems_allowed vector. |
51 | 54 | ||
52 | If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct | ||
53 | ancestor or descendent, may share any of the same CPUs or Memory Nodes. | ||
54 | A cpuset that is cpu exclusive has a sched domain associated with it. | ||
55 | The sched domain consists of all cpus in the current cpuset that are not | ||
56 | part of any exclusive child cpusets. | ||
57 | This ensures that the scheduler load balacing code only balances | ||
58 | against the cpus that are in the sched domain as defined above and not | ||
59 | all of the cpus in the system. This removes any overhead due to | ||
60 | load balancing code trying to pull tasks outside of the cpu exclusive | ||
61 | cpuset only to be prevented by the tasks' cpus_allowed mask. | ||
62 | |||
63 | A cpuset that is mem_exclusive restricts kernel allocations for | ||
64 | page, buffer and other data commonly shared by the kernel across | ||
65 | multiple users. All cpusets, whether mem_exclusive or not, restrict | ||
66 | allocations of memory for user space. This enables configuring a | ||
67 | system so that several independent jobs can share common kernel | ||
68 | data, such as file system pages, while isolating each jobs user | ||
69 | allocation in its own cpuset. To do this, construct a large | ||
70 | mem_exclusive cpuset to hold all the jobs, and construct child, | ||
71 | non-mem_exclusive cpusets for each individual job. Only a small | ||
72 | amount of typical kernel memory, such as requests from interrupt | ||
73 | handlers, is allowed to be taken outside even a mem_exclusive cpuset. | ||
74 | |||
75 | User level code may create and destroy cpusets by name in the cpuset | 55 | User level code may create and destroy cpusets by name in the cpuset |
76 | virtual file system, manage the attributes and permissions of these | 56 | virtual file system, manage the attributes and permissions of these |
77 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | 57 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, |
@@ -155,7 +135,7 @@ Cpusets extends these two mechanisms as follows: | |||
155 | The implementation of cpusets requires a few, simple hooks | 135 | The implementation of cpusets requires a few, simple hooks |
156 | into the rest of the kernel, none in performance critical paths: | 136 | into the rest of the kernel, none in performance critical paths: |
157 | 137 | ||
158 | - in main/init.c, to initialize the root cpuset at system boot. | 138 | - in init/main.c, to initialize the root cpuset at system boot. |
159 | - in fork and exit, to attach and detach a task from its cpuset. | 139 | - in fork and exit, to attach and detach a task from its cpuset. |
160 | - in sched_setaffinity, to mask the requested CPUs by what's | 140 | - in sched_setaffinity, to mask the requested CPUs by what's |
161 | allowed in that tasks cpuset. | 141 | allowed in that tasks cpuset. |
@@ -166,7 +146,7 @@ into the rest of the kernel, none in performance critical paths: | |||
166 | and related changes in both sched.c and arch/ia64/kernel/domain.c | 146 | and related changes in both sched.c and arch/ia64/kernel/domain.c |
167 | - in the mbind and set_mempolicy system calls, to mask the requested | 147 | - in the mbind and set_mempolicy system calls, to mask the requested |
168 | Memory Nodes by what's allowed in that tasks cpuset. | 148 | Memory Nodes by what's allowed in that tasks cpuset. |
169 | - in page_alloc, to restrict memory to allowed nodes. | 149 | - in page_alloc.c, to restrict memory to allowed nodes. |
170 | - in vmscan.c, to restrict page recovery to the current cpuset. | 150 | - in vmscan.c, to restrict page recovery to the current cpuset. |
171 | 151 | ||
172 | In addition a new file system, of type "cpuset" may be mounted, | 152 | In addition a new file system, of type "cpuset" may be mounted, |
@@ -192,9 +172,15 @@ containing the following files describing that cpuset: | |||
192 | 172 | ||
193 | - cpus: list of CPUs in that cpuset | 173 | - cpus: list of CPUs in that cpuset |
194 | - mems: list of Memory Nodes in that cpuset | 174 | - mems: list of Memory Nodes in that cpuset |
175 | - memory_migrate flag: if set, move pages to cpusets nodes | ||
195 | - cpu_exclusive flag: is cpu placement exclusive? | 176 | - cpu_exclusive flag: is cpu placement exclusive? |
196 | - mem_exclusive flag: is memory placement exclusive? | 177 | - mem_exclusive flag: is memory placement exclusive? |
197 | - tasks: list of tasks (by pid) attached to that cpuset | 178 | - tasks: list of tasks (by pid) attached to that cpuset |
179 | - notify_on_release flag: run /sbin/cpuset_release_agent on exit? | ||
180 | - memory_pressure: measure of how much paging pressure in cpuset | ||
181 | |||
182 | In addition, the root cpuset only has the following file: | ||
183 | - memory_pressure_enabled flag: compute memory_pressure? | ||
198 | 184 | ||
199 | New cpusets are created using the mkdir system call or shell | 185 | New cpusets are created using the mkdir system call or shell |
200 | command. The properties of a cpuset, such as its flags, allowed | 186 | command. The properties of a cpuset, such as its flags, allowed |
@@ -228,7 +214,108 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | |||
228 | to represent the cpuset hierarchy provides for a familiar permission | 214 | to represent the cpuset hierarchy provides for a familiar permission |
229 | and name space for cpusets, with a minimum of additional kernel code. | 215 | and name space for cpusets, with a minimum of additional kernel code. |
230 | 216 | ||
231 | 1.4 How do I use cpusets ? | 217 | |
218 | 1.4 What are exclusive cpusets ? | ||
219 | -------------------------------- | ||
220 | |||
221 | If a cpuset is cpu or mem exclusive, no other cpuset, other than | ||
222 | a direct ancestor or descendent, may share any of the same CPUs or | ||
223 | Memory Nodes. | ||
224 | |||
225 | A cpuset that is cpu_exclusive has a scheduler (sched) domain | ||
226 | associated with it. The sched domain consists of all CPUs in the | ||
227 | current cpuset that are not part of any exclusive child cpusets. | ||
228 | This ensures that the scheduler load balancing code only balances | ||
229 | against the CPUs that are in the sched domain as defined above and | ||
230 | not all of the CPUs in the system. This removes any overhead due to | ||
231 | load balancing code trying to pull tasks outside of the cpu_exclusive | ||
232 | cpuset only to be prevented by the tasks' cpus_allowed mask. | ||
233 | |||
234 | A cpuset that is mem_exclusive restricts kernel allocations for | ||
235 | page, buffer and other data commonly shared by the kernel across | ||
236 | multiple users. All cpusets, whether mem_exclusive or not, restrict | ||
237 | allocations of memory for user space. This enables configuring a | ||
238 | system so that several independent jobs can share common kernel data, | ||
239 | such as file system pages, while isolating each jobs user allocation in | ||
240 | its own cpuset. To do this, construct a large mem_exclusive cpuset to | ||
241 | hold all the jobs, and construct child, non-mem_exclusive cpusets for | ||
242 | each individual job. Only a small amount of typical kernel memory, | ||
243 | such as requests from interrupt handlers, is allowed to be taken | ||
244 | outside even a mem_exclusive cpuset. | ||
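The exclusivity rule and the sched domain definition in section 1.4 can be illustrated with a small user-space sketch. It is not kernel code. The function names are hypothetical, and the "0-3,8" list syntax is assumed to be the format of the 'cpus' file.

```python
def parse_cpu_list(text):
    """Parse a list such as "0-3,8" (assumed 'cpus' format) into a set."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def siblings_conflict(cpus_a, cpus_b, exclusive_a, exclusive_b):
    """True if two sibling cpusets overlap while either one is exclusive."""
    return (exclusive_a or exclusive_b) and bool(cpus_a & cpus_b)


def sched_domain(cpus, exclusive_children):
    """CPUs of a cpu_exclusive cpuset not claimed by exclusive children."""
    remaining = set(cpus)
    for child_cpus in exclusive_children:
        remaining -= child_cpus
    return remaining
```

For example, a cpu_exclusive cpuset holding CPUs 0-3 with one exclusive child holding CPUs 2-3 would, under this reading, have a sched domain of CPUs 0 and 1.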
245 | |||
246 | |||
247 | 1.5 What does notify_on_release do ? | ||
248 | ------------------------------------ | ||
249 | |||
250 | If the notify_on_release flag is enabled (1) in a cpuset, then whenever | ||
251 | the last task in the cpuset leaves (exits or attaches to some other | ||
252 | cpuset) and the last child cpuset of that cpuset is removed, then | ||
253 | the kernel runs the command /sbin/cpuset_release_agent, supplying the | ||
254 | pathname (relative to the mount point of the cpuset file system) of the | ||
255 | abandoned cpuset. This enables automatic removal of abandoned cpusets. | ||
256 | The default value of notify_on_release in the root cpuset at system | ||
257 | boot is disabled (0). The default value of other cpusets at creation | ||
258 | is the current value of their parents notify_on_release setting. | ||
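A release agent could be as small as the sketch below, which removes the abandoned cpuset named by the pathname it is given. The function name is hypothetical, and the default mount point /dev/cpuset is an assumption. The text above only says that the pathname is relative to wherever the cpuset file system is mounted.

```python
import os


def release_cpuset(rel_path, mount_point="/dev/cpuset"):
    """Remove the abandoned cpuset named by rel_path under mount_point.

    rel_path is the pathname supplied by the kernel, relative to the
    mount point. A leading slash is tolerated. mount_point is an
    assumption and should match the actual mount location.
    """
    target = os.path.join(mount_point, rel_path.lstrip("/"))
    # rmdir fails if the directory is not empty or is still in use,
    # so a cpuset that has been repopulated is left in place.
    os.rmdir(target)
    return target
```

A real agent would probably also want to catch OSError, since a task may be attached to the cpuset again between the notification and the removal.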
259 | |||
260 | |||
261 | 1.6 What is memory_pressure ? | ||
262 | ----------------------------- | ||
263 | The memory_pressure of a cpuset provides a simple per-cpuset metric | ||
264 | of the rate that the tasks in a cpuset are attempting to free up in | ||
265 | use memory on the nodes of the cpuset to satisfy additional memory | ||
266 | requests. | ||
267 | |||
268 | This enables batch managers monitoring jobs running in dedicated | ||
269 | cpusets to efficiently detect what level of memory pressure that job | ||
270 | is causing. | ||
271 | |||
272 | This is useful both on tightly managed systems running a wide mix of | ||
273 | submitted jobs, which may choose to terminate or re-prioritize jobs that | ||
274 | are trying to use more memory than allowed on the nodes assigned them, | ||
275 | and with tightly coupled, long running, massively parallel scientific | ||
276 | computing jobs that will dramatically fail to meet required performance | ||
277 | goals if they start to use more memory than allowed to them. | ||
278 | |||
279 | This mechanism provides a very economical way for the batch manager | ||
280 | to monitor a cpuset for signs of memory pressure. It's up to the | ||
281 | batch manager or other user code to decide what to do about it and | ||
282 | take action. | ||
283 | |||
284 | ==> Unless this feature is enabled by writing "1" to the special file | ||
285 | /dev/cpuset/memory_pressure_enabled, the hook in the rebalance | ||
286 | code of __alloc_pages() for this metric reduces to simply noticing | ||
287 | that the cpuset_memory_pressure_enabled flag is zero. So only | ||
288 | systems that enable this feature will compute the metric. | ||
289 | |||
290 | Why a per-cpuset, running average: | ||
291 | |||
292 | Because this meter is per-cpuset, rather than per-task or mm, | ||
293 | the system load imposed by a batch scheduler monitoring this | ||
294 | metric is sharply reduced on large systems, because a scan of | ||
295 | the tasklist can be avoided on each set of queries. | ||
296 | |||
297 | Because this meter is a running average, instead of an accumulating | ||
298 | counter, a batch scheduler can detect memory pressure with a | ||
299 | single read, instead of having to read and accumulate results | ||
300 | for a period of time. | ||
301 | |||
302 | Because this meter is per-cpuset rather than per-task or mm, | ||
303 | the batch scheduler can obtain the key information, memory | ||
304 | pressure in a cpuset, with a single read, rather than having to | ||
305 | query and accumulate results over all the (dynamically changing) | ||
306 | set of tasks in the cpuset. | ||
307 | |||
308 | A per-cpuset simple digital filter (requires a spinlock and 3 words | ||
309 | of data per-cpuset) is kept, and updated by any task attached to that | ||
310 | cpuset, if it enters the synchronous (direct) page reclaim code. | ||
311 | |||
312 | A per-cpuset file provides an integer number representing the recent | ||
313 | (half-life of 10 seconds) rate of direct page reclaims caused by | ||
314 | the tasks in the cpuset, in units of reclaims attempted per second, | ||
315 | times 1000. | ||
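A batch manager reading this file might interpret the value roughly as follows. This is only a sketch with hypothetical function names. It does not reproduce the kernel's filter. It shows the unit conversion, and what a 10 second half-life implies for a reading if no further direct reclaims occur.

```python
def reclaims_per_second(raw_value):
    """Convert the memory_pressure file contents to reclaims per second.

    The file reports the rate multiplied by 1000.
    """
    return int(raw_value) / 1000.0


def decayed_reading(value, idle_seconds, half_life=10.0):
    """Estimate a reading after idle_seconds with no new reclaims.

    Assumes a plain exponential decay with the stated half-life.
    """
    return value * 0.5 ** (idle_seconds / half_life)
```

Under this assumption a reading of 1000 (one reclaim attempt per second) would fall to about 500 after 10 idle seconds and about 250 after 20.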
316 | |||
317 | |||
318 | 1.7 How do I use cpusets ? | ||
232 | -------------------------- | 319 | -------------------------- |
233 | 320 | ||
234 | In order to minimize the impact of cpusets on critical kernel | 321 | In order to minimize the impact of cpusets on critical kernel |
@@ -277,6 +364,30 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid | |||
277 | impacting the scheduler code in the kernel with a check for changes | 364 | impacting the scheduler code in the kernel with a check for changes |
278 | in a tasks processor placement. | 365 | in a tasks processor placement. |
279 | 366 | ||
367 | Normally, once a page is allocated (given a physical page | ||
368 | of main memory) then that page stays on whatever node it | ||
369 | was allocated, so long as it remains allocated, even if the | ||
370 | cpusets memory placement policy 'mems' subsequently changes. | ||
371 | If the cpuset flag file 'memory_migrate' is set true, then when | ||
372 | tasks are attached to that cpuset, any pages that task had | ||
373 | allocated to it on nodes in its previous cpuset are migrated | ||
374 | to the tasks new cpuset. Depending on the implementation, | ||
375 | this migration may either be done by swapping the page out, | ||
376 | so that the next time the page is referenced, it will be paged | ||
377 | into the tasks new cpuset, usually on the node where it was | ||
378 | referenced, or this migration may be done by directly copying | ||
379 | the pages from the tasks previous cpuset to the new cpuset, | ||
380 | where possible to the same node, relative to the new cpuset, | ||
381 | as the node that held the page, relative to the old cpuset. | ||
382 | Also if 'memory_migrate' is set true, then if that cpusets | ||
383 | 'mems' file is modified, pages allocated to tasks in that | ||
384 | cpuset, that were on nodes in the previous setting of 'mems', | ||
385 | will be moved to nodes in the new setting of 'mems.' Again, | ||
386 | depending on the implementation, this might be done by swapping, | ||
387 | or by direct copying. In either case, pages that were not in | ||
388 | the tasks prior cpuset, or in the cpusets prior 'mems' setting, | ||
389 | will not be moved. | ||
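The "same node, relative to the new cpuset" placement can be pictured as a position-based mapping between the old and new 'mems' lists. The sketch below is an illustration, not the kernel's implementation. The function name is hypothetical, and wrapping around when the new list is shorter is an assumption about what "where possible" means.

```python
def remap_node(old_node, old_mems, new_mems):
    """Map a node to the node at the same relative position in new_mems.

    Returns None when old_node was not in old_mems, mirroring the rule
    that pages outside the prior setting are not moved.
    """
    old_sorted = sorted(old_mems)
    new_sorted = sorted(new_mems)
    if old_node not in old_sorted or not new_sorted:
        return None
    position = old_sorted.index(old_node)
    # Assumption: wrap around if the new cpuset has fewer nodes.
    return new_sorted[position % len(new_sorted)]
```

With old 'mems' of 0-1 and new 'mems' of 4-5, a page on node 1 (the second node of the old cpuset) would be placed on node 5 (the second node of the new one).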
390 | |||
280 | There is an exception to the above. If hotplug functionality is used | 391 | There is an exception to the above. If hotplug functionality is used |
281 | to remove all the CPUs that are currently assigned to a cpuset, | 392 | to remove all the CPUs that are currently assigned to a cpuset, |
282 | then the kernel will automatically update the cpus_allowed of all | 393 | then the kernel will automatically update the cpus_allowed of all |