diff options
Diffstat (limited to 'Documentation/cgroups/cpusets.txt')
-rw-r--r-- | Documentation/cgroups/cpusets.txt | 161 |
1 files changed, 82 insertions, 79 deletions
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index 1d7e9784439a..51682ab2dd1a 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt | |||
@@ -42,7 +42,7 @@ Nodes to a set of tasks. In this document "Memory Node" refers to | |||
42 | an on-line node that contains memory. | 42 | an on-line node that contains memory. |
43 | 43 | ||
44 | Cpusets constrain the CPU and Memory placement of tasks to only | 44 | Cpusets constrain the CPU and Memory placement of tasks to only |
45 | the resources within a tasks current cpuset. They form a nested | 45 | the resources within a task's current cpuset. They form a nested |
46 | hierarchy visible in a virtual file system. These are the essential | 46 | hierarchy visible in a virtual file system. These are the essential |
47 | hooks, beyond what is already present, required to manage dynamic | 47 | hooks, beyond what is already present, required to manage dynamic |
48 | job placement on large systems. | 48 | job placement on large systems. |
@@ -53,11 +53,11 @@ Documentation/cgroups/cgroups.txt. | |||
53 | Requests by a task, using the sched_setaffinity(2) system call to | 53 | Requests by a task, using the sched_setaffinity(2) system call to |
54 | include CPUs in its CPU affinity mask, and using the mbind(2) and | 54 | include CPUs in its CPU affinity mask, and using the mbind(2) and |
55 | set_mempolicy(2) system calls to include Memory Nodes in its memory | 55 | set_mempolicy(2) system calls to include Memory Nodes in its memory |
56 | policy, are both filtered through that tasks cpuset, filtering out any | 56 | policy, are both filtered through that task's cpuset, filtering out any |
57 | CPUs or Memory Nodes not in that cpuset. The scheduler will not | 57 | CPUs or Memory Nodes not in that cpuset. The scheduler will not |
58 | schedule a task on a CPU that is not allowed in its cpus_allowed | 58 | schedule a task on a CPU that is not allowed in its cpus_allowed |
59 | vector, and the kernel page allocator will not allocate a page on a | 59 | vector, and the kernel page allocator will not allocate a page on a |
60 | node that is not allowed in the requesting tasks mems_allowed vector. | 60 | node that is not allowed in the requesting task's mems_allowed vector. |
61 | 61 | ||
62 | User level code may create and destroy cpusets by name in the cgroup | 62 | User level code may create and destroy cpusets by name in the cgroup |
63 | virtual file system, manage the attributes and permissions of these | 63 | virtual file system, manage the attributes and permissions of these |
@@ -121,9 +121,9 @@ Cpusets extends these two mechanisms as follows: | |||
121 | - Each task in the system is attached to a cpuset, via a pointer | 121 | - Each task in the system is attached to a cpuset, via a pointer |
122 | in the task structure to a reference counted cgroup structure. | 122 | in the task structure to a reference counted cgroup structure. |
123 | - Calls to sched_setaffinity are filtered to just those CPUs | 123 | - Calls to sched_setaffinity are filtered to just those CPUs |
124 | allowed in that tasks cpuset. | 124 | allowed in that task's cpuset. |
125 | - Calls to mbind and set_mempolicy are filtered to just | 125 | - Calls to mbind and set_mempolicy are filtered to just |
126 | those Memory Nodes allowed in that tasks cpuset. | 126 | those Memory Nodes allowed in that task's cpuset. |
127 | - The root cpuset contains all the systems CPUs and Memory | 127 | - The root cpuset contains all the systems CPUs and Memory |
128 | Nodes. | 128 | Nodes. |
129 | - For any cpuset, one can define child cpusets containing a subset | 129 | - For any cpuset, one can define child cpusets containing a subset |
@@ -141,11 +141,11 @@ into the rest of the kernel, none in performance critical paths: | |||
141 | - in init/main.c, to initialize the root cpuset at system boot. | 141 | - in init/main.c, to initialize the root cpuset at system boot. |
142 | - in fork and exit, to attach and detach a task from its cpuset. | 142 | - in fork and exit, to attach and detach a task from its cpuset. |
143 | - in sched_setaffinity, to mask the requested CPUs by what's | 143 | - in sched_setaffinity, to mask the requested CPUs by what's |
144 | allowed in that tasks cpuset. | 144 | allowed in that task's cpuset. |
145 | - in sched.c migrate_live_tasks(), to keep migrating tasks within | 145 | - in sched.c migrate_live_tasks(), to keep migrating tasks within |
146 | the CPUs allowed by their cpuset, if possible. | 146 | the CPUs allowed by their cpuset, if possible. |
147 | - in the mbind and set_mempolicy system calls, to mask the requested | 147 | - in the mbind and set_mempolicy system calls, to mask the requested |
148 | Memory Nodes by what's allowed in that tasks cpuset. | 148 | Memory Nodes by what's allowed in that task's cpuset. |
149 | - in page_alloc.c, to restrict memory to allowed nodes. | 149 | - in page_alloc.c, to restrict memory to allowed nodes. |
150 | - in vmscan.c, to restrict page recovery to the current cpuset. | 150 | - in vmscan.c, to restrict page recovery to the current cpuset. |
151 | 151 | ||
@@ -155,7 +155,7 @@ new system calls are added for cpusets - all support for querying and | |||
155 | modifying cpusets is via this cpuset file system. | 155 | modifying cpusets is via this cpuset file system. |
156 | 156 | ||
157 | The /proc/<pid>/status file for each task has four added lines, | 157 | The /proc/<pid>/status file for each task has four added lines, |
158 | displaying the tasks cpus_allowed (on which CPUs it may be scheduled) | 158 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) |
159 | and mems_allowed (on which Memory Nodes it may obtain memory), | 159 | and mems_allowed (on which Memory Nodes it may obtain memory), |
160 | in the two formats seen in the following example: | 160 | in the two formats seen in the following example: |
161 | 161 | ||
@@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system | |||
168 | containing (on top of the standard cgroup files) the following | 168 | containing (on top of the standard cgroup files) the following |
169 | files describing that cpuset: | 169 | files describing that cpuset: |
170 | 170 | ||
171 | - cpus: list of CPUs in that cpuset | 171 | - cpuset.cpus: list of CPUs in that cpuset |
172 | - mems: list of Memory Nodes in that cpuset | 172 | - cpuset.mems: list of Memory Nodes in that cpuset |
173 | - memory_migrate flag: if set, move pages to cpusets nodes | 173 | - cpuset.memory_migrate flag: if set, move pages to cpusets nodes |
174 | - cpu_exclusive flag: is cpu placement exclusive? | 174 | - cpuset.cpu_exclusive flag: is cpu placement exclusive? |
175 | - mem_exclusive flag: is memory placement exclusive? | 175 | - cpuset.mem_exclusive flag: is memory placement exclusive? |
176 | - mem_hardwall flag: is memory allocation hardwalled | 176 | - cpuset.mem_hardwall flag: is memory allocation hardwalled |
177 | - memory_pressure: measure of how much paging pressure in cpuset | 177 | - cpuset.memory_pressure: measure of how much paging pressure in cpuset |
178 | - memory_spread_page flag: if set, spread page cache evenly on allowed nodes | 178 | - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes |
179 | - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes | 179 | - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes |
180 | - sched_load_balance flag: if set, load balance within CPUs on that cpuset | 180 | - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset |
181 | - sched_relax_domain_level: the searching range when migrating tasks | 181 | - cpuset.sched_relax_domain_level: the searching range when migrating tasks |
182 | 182 | ||
183 | In addition, the root cpuset only has the following file: | 183 | In addition, the root cpuset only has the following file: |
184 | - memory_pressure_enabled flag: compute memory_pressure? | 184 | - cpuset.memory_pressure_enabled flag: compute memory_pressure? |
185 | 185 | ||
186 | New cpusets are created using the mkdir system call or shell | 186 | New cpusets are created using the mkdir system call or shell |
187 | command. The properties of a cpuset, such as its flags, allowed | 187 | command. The properties of a cpuset, such as its flags, allowed |
@@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than | |||
229 | a direct ancestor or descendant, may share any of the same CPUs or | 229 | a direct ancestor or descendant, may share any of the same CPUs or |
230 | Memory Nodes. | 230 | Memory Nodes. |
231 | 231 | ||
232 | A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", | 232 | A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled", |
233 | i.e. it restricts kernel allocations for page, buffer and other data | 233 | i.e. it restricts kernel allocations for page, buffer and other data |
234 | commonly shared by the kernel across multiple users. All cpusets, | 234 | commonly shared by the kernel across multiple users. All cpusets, |
235 | whether hardwalled or not, restrict allocations of memory for user | 235 | whether hardwalled or not, restrict allocations of memory for user |
@@ -304,15 +304,15 @@ times 1000. | |||
304 | --------------------------- | 304 | --------------------------- |
305 | There are two boolean flag files per cpuset that control where the | 305 | There are two boolean flag files per cpuset that control where the |
306 | kernel allocates pages for the file system buffers and related in | 306 | kernel allocates pages for the file system buffers and related in |
307 | kernel data structures. They are called 'memory_spread_page' and | 307 | kernel data structures. They are called 'cpuset.memory_spread_page' and |
308 | 'memory_spread_slab'. | 308 | 'cpuset.memory_spread_slab'. |
309 | 309 | ||
310 | If the per-cpuset boolean flag file 'memory_spread_page' is set, then | 310 | If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then |
311 | the kernel will spread the file system buffers (page cache) evenly | 311 | the kernel will spread the file system buffers (page cache) evenly |
312 | over all the nodes that the faulting task is allowed to use, instead | 312 | over all the nodes that the faulting task is allowed to use, instead |
313 | of preferring to put those pages on the node where the task is running. | 313 | of preferring to put those pages on the node where the task is running. |
314 | 314 | ||
315 | If the per-cpuset boolean flag file 'memory_spread_slab' is set, | 315 | If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set, |
316 | then the kernel will spread some file system related slab caches, | 316 | then the kernel will spread some file system related slab caches, |
317 | such as for inodes and dentries evenly over all the nodes that the | 317 | such as for inodes and dentries evenly over all the nodes that the |
318 | faulting task is allowed to use, instead of preferring to put those | 318 | faulting task is allowed to use, instead of preferring to put those |
@@ -323,41 +323,41 @@ stack segment pages of a task. | |||
323 | 323 | ||
324 | By default, both kinds of memory spreading are off, and memory | 324 | By default, both kinds of memory spreading are off, and memory |
325 | pages are allocated on the node local to where the task is running, | 325 | pages are allocated on the node local to where the task is running, |
326 | except perhaps as modified by the tasks NUMA mempolicy or cpuset | 326 | except perhaps as modified by the task's NUMA mempolicy or cpuset |
327 | configuration, so long as sufficient free memory pages are available. | 327 | configuration, so long as sufficient free memory pages are available. |
328 | 328 | ||
329 | When new cpusets are created, they inherit the memory spread settings | 329 | When new cpusets are created, they inherit the memory spread settings |
330 | of their parent. | 330 | of their parent. |
331 | 331 | ||
332 | Setting memory spreading causes allocations for the affected page | 332 | Setting memory spreading causes allocations for the affected page |
333 | or slab caches to ignore the tasks NUMA mempolicy and be spread | 333 | or slab caches to ignore the task's NUMA mempolicy and be spread |
334 | instead. Tasks using mbind() or set_mempolicy() calls to set NUMA | 334 | instead. Tasks using mbind() or set_mempolicy() calls to set NUMA |
335 | mempolicies will not notice any change in these calls as a result of | 335 | mempolicies will not notice any change in these calls as a result of |
336 | their containing tasks memory spread settings. If memory spreading | 336 | their containing task's memory spread settings. If memory spreading |
337 | is turned off, then the currently specified NUMA mempolicy once again | 337 | is turned off, then the currently specified NUMA mempolicy once again |
338 | applies to memory page allocations. | 338 | applies to memory page allocations. |
339 | 339 | ||
340 | Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | 340 | Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag |
341 | files. By default they contain "0", meaning that the feature is off | 341 | files. By default they contain "0", meaning that the feature is off |
342 | for that cpuset. If a "1" is written to that file, then that turns | 342 | for that cpuset. If a "1" is written to that file, then that turns |
343 | the named feature on. | 343 | the named feature on. |
344 | 344 | ||
345 | The implementation is simple. | 345 | The implementation is simple. |
346 | 346 | ||
347 | Setting the flag 'memory_spread_page' turns on a per-process flag | 347 | Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag |
348 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | 348 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently |
349 | joins that cpuset. The page allocation calls for the page cache | 349 | joins that cpuset. The page allocation calls for the page cache |
350 | is modified to perform an inline check for this PF_SPREAD_PAGE task | 350 | is modified to perform an inline check for this PF_SPREAD_PAGE task |
351 | flag, and if set, a call to a new routine cpuset_mem_spread_node() | 351 | flag, and if set, a call to a new routine cpuset_mem_spread_node() |
352 | returns the node to prefer for the allocation. | 352 | returns the node to prefer for the allocation. |
353 | 353 | ||
354 | Similarly, setting 'memory_spread_slab' turns on the flag | 354 | Similarly, setting 'cpuset.memory_spread_slab' turns on the flag |
355 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate | 355 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate |
356 | pages from the node returned by cpuset_mem_spread_node(). | 356 | pages from the node returned by cpuset_mem_spread_node(). |
357 | 357 | ||
358 | The cpuset_mem_spread_node() routine is also simple. It uses the | 358 | The cpuset_mem_spread_node() routine is also simple. It uses the |
359 | value of a per-task rotor cpuset_mem_spread_rotor to select the next | 359 | value of a per-task rotor cpuset_mem_spread_rotor to select the next |
360 | node in the current tasks mems_allowed to prefer for the allocation. | 360 | node in the current task's mems_allowed to prefer for the allocation. |
361 | 361 | ||
362 | This memory placement policy is also known (in other contexts) as | 362 | This memory placement policy is also known (in other contexts) as |
363 | round-robin or interleave. | 363 | round-robin or interleave. |
@@ -404,24 +404,24 @@ the following two situations: | |||
404 | system overhead on those CPUs, including avoiding task load | 404 | system overhead on those CPUs, including avoiding task load |
405 | balancing if that is not needed. | 405 | balancing if that is not needed. |
406 | 406 | ||
407 | When the per-cpuset flag "sched_load_balance" is enabled (the default | 407 | When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default |
408 | setting), it requests that all the CPUs in that cpusets allowed 'cpus' | 408 | setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus' |
409 | be contained in a single sched domain, ensuring that load balancing | 409 | be contained in a single sched domain, ensuring that load balancing |
410 | can move a task (not otherwised pinned, as by sched_setaffinity) | 410 | can move a task (not otherwised pinned, as by sched_setaffinity) |
411 | from any CPU in that cpuset to any other. | 411 | from any CPU in that cpuset to any other. |
412 | 412 | ||
413 | When the per-cpuset flag "sched_load_balance" is disabled, then the | 413 | When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the |
414 | scheduler will avoid load balancing across the CPUs in that cpuset, | 414 | scheduler will avoid load balancing across the CPUs in that cpuset, |
415 | --except-- in so far as is necessary because some overlapping cpuset | 415 | --except-- in so far as is necessary because some overlapping cpuset |
416 | has "sched_load_balance" enabled. | 416 | has "sched_load_balance" enabled. |
417 | 417 | ||
418 | So, for example, if the top cpuset has the flag "sched_load_balance" | 418 | So, for example, if the top cpuset has the flag "cpuset.sched_load_balance" |
419 | enabled, then the scheduler will have one sched domain covering all | 419 | enabled, then the scheduler will have one sched domain covering all |
420 | CPUs, and the setting of the "sched_load_balance" flag in any other | 420 | CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other |
421 | cpusets won't matter, as we're already fully load balancing. | 421 | cpusets won't matter, as we're already fully load balancing. |
422 | 422 | ||
423 | Therefore in the above two situations, the top cpuset flag | 423 | Therefore in the above two situations, the top cpuset flag |
424 | "sched_load_balance" should be disabled, and only some of the smaller, | 424 | "cpuset.sched_load_balance" should be disabled, and only some of the smaller, |
425 | child cpusets have this flag enabled. | 425 | child cpusets have this flag enabled. |
426 | 426 | ||
427 | When doing this, you don't usually want to leave any unpinned tasks in | 427 | When doing this, you don't usually want to leave any unpinned tasks in |
@@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that | |||
433 | task to that underused CPU. | 433 | task to that underused CPU. |
434 | 434 | ||
435 | Of course, tasks pinned to a particular CPU can be left in a cpuset | 435 | Of course, tasks pinned to a particular CPU can be left in a cpuset |
436 | that disables "sched_load_balance" as those tasks aren't going anywhere | 436 | that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere |
437 | else anyway. | 437 | else anyway. |
438 | 438 | ||
439 | There is an impedance mismatch here, between cpusets and sched domains. | 439 | There is an impedance mismatch here, between cpusets and sched domains. |
@@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain. | |||
443 | It is necessary for sched domains to be flat because load balancing | 443 | It is necessary for sched domains to be flat because load balancing |
444 | across partially overlapping sets of CPUs would risk unstable dynamics | 444 | across partially overlapping sets of CPUs would risk unstable dynamics |
445 | that would be beyond our understanding. So if each of two partially | 445 | that would be beyond our understanding. So if each of two partially |
446 | overlapping cpusets enables the flag 'sched_load_balance', then we | 446 | overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we |
447 | form a single sched domain that is a superset of both. We won't move | 447 | form a single sched domain that is a superset of both. We won't move |
448 | a task to a CPU outside it cpuset, but the scheduler load balancing | 448 | a task to a CPU outside it cpuset, but the scheduler load balancing |
449 | code might waste some compute cycles considering that possibility. | 449 | code might waste some compute cycles considering that possibility. |
450 | 450 | ||
451 | This mismatch is why there is not a simple one-to-one relation | 451 | This mismatch is why there is not a simple one-to-one relation |
452 | between which cpusets have the flag "sched_load_balance" enabled, | 452 | between which cpusets have the flag "cpuset.sched_load_balance" enabled, |
453 | and the sched domain configuration. If a cpuset enables the flag, it | 453 | and the sched domain configuration. If a cpuset enables the flag, it |
454 | will get balancing across all its CPUs, but if it disables the flag, | 454 | will get balancing across all its CPUs, but if it disables the flag, |
455 | it will only be assured of no load balancing if no other overlapping | 455 | it will only be assured of no load balancing if no other overlapping |
456 | cpuset enables the flag. | 456 | cpuset enables the flag. |
457 | 457 | ||
458 | If two cpusets have partially overlapping 'cpus' allowed, and only | 458 | If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only |
459 | one of them has this flag enabled, then the other may find its | 459 | one of them has this flag enabled, then the other may find its |
460 | tasks only partially load balanced, just on the overlapping CPUs. | 460 | tasks only partially load balanced, just on the overlapping CPUs. |
461 | This is just the general case of the top_cpuset example given a few | 461 | This is just the general case of the top_cpuset example given a few |
@@ -468,23 +468,23 @@ load balancing to the other CPUs. | |||
468 | 1.7.1 sched_load_balance implementation details. | 468 | 1.7.1 sched_load_balance implementation details. |
469 | ------------------------------------------------ | 469 | ------------------------------------------------ |
470 | 470 | ||
471 | The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary | 471 | The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary |
472 | to most cpuset flags.) When enabled for a cpuset, the kernel will | 472 | to most cpuset flags.) When enabled for a cpuset, the kernel will |
473 | ensure that it can load balance across all the CPUs in that cpuset | 473 | ensure that it can load balance across all the CPUs in that cpuset |
474 | (makes sure that all the CPUs in the cpus_allowed of that cpuset are | 474 | (makes sure that all the CPUs in the cpus_allowed of that cpuset are |
475 | in the same sched domain.) | 475 | in the same sched domain.) |
476 | 476 | ||
477 | If two overlapping cpusets both have 'sched_load_balance' enabled, | 477 | If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled, |
478 | then they will be (must be) both in the same sched domain. | 478 | then they will be (must be) both in the same sched domain. |
479 | 479 | ||
480 | If, as is the default, the top cpuset has 'sched_load_balance' enabled, | 480 | If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled, |
481 | then by the above that means there is a single sched domain covering | 481 | then by the above that means there is a single sched domain covering |
482 | the whole system, regardless of any other cpuset settings. | 482 | the whole system, regardless of any other cpuset settings. |
483 | 483 | ||
484 | The kernel commits to user space that it will avoid load balancing | 484 | The kernel commits to user space that it will avoid load balancing |
485 | where it can. It will pick as fine a granularity partition of sched | 485 | where it can. It will pick as fine a granularity partition of sched |
486 | domains as it can while still providing load balancing for any set | 486 | domains as it can while still providing load balancing for any set |
487 | of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | 487 | of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled. |
488 | 488 | ||
489 | The internal kernel cpuset to scheduler interface passes from the | 489 | The internal kernel cpuset to scheduler interface passes from the |
490 | cpuset code to the scheduler code a partition of the load balanced | 490 | cpuset code to the scheduler code a partition of the load balanced |
@@ -495,9 +495,9 @@ all the CPUs that must be load balanced. | |||
495 | The cpuset code builds a new such partition and passes it to the | 495 | The cpuset code builds a new such partition and passes it to the |
496 | scheduler sched domain setup code, to have the sched domains rebuilt | 496 | scheduler sched domain setup code, to have the sched domains rebuilt |
497 | as necessary, whenever: | 497 | as necessary, whenever: |
498 | - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes, | 498 | - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes, |
499 | - or CPUs come or go from a cpuset with this flag enabled, | 499 | - or CPUs come or go from a cpuset with this flag enabled, |
500 | - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs | 500 | - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs |
501 | and with this flag enabled changes, | 501 | and with this flag enabled changes, |
502 | - or a cpuset with non-empty CPUs and with this flag enabled is removed, | 502 | - or a cpuset with non-empty CPUs and with this flag enabled is removed, |
503 | - or a cpu is offlined/onlined. | 503 | - or a cpu is offlined/onlined. |
@@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance | |||
542 | on the next tick. For some applications in special situation, waiting | 542 | on the next tick. For some applications in special situation, waiting |
543 | 1 tick may be too long. | 543 | 1 tick may be too long. |
544 | 544 | ||
545 | The 'sched_relax_domain_level' file allows you to request changing | 545 | The 'cpuset.sched_relax_domain_level' file allows you to request changing |
546 | this searching range as you like. This file takes int value which | 546 | this searching range as you like. This file takes int value which |
547 | indicates size of searching range in levels ideally as follows, | 547 | indicates size of searching range in levels ideally as follows, |
548 | otherwise initial value -1 that indicates the cpuset has no request. | 548 | otherwise initial value -1 that indicates the cpuset has no request. |
@@ -559,8 +559,8 @@ The system default is architecture dependent. The system default | |||
559 | can be changed using the relax_domain_level= boot parameter. | 559 | can be changed using the relax_domain_level= boot parameter. |
560 | 560 | ||
561 | This file is per-cpuset and affect the sched domain where the cpuset | 561 | This file is per-cpuset and affect the sched domain where the cpuset |
562 | belongs to. Therefore if the flag 'sched_load_balance' of a cpuset | 562 | belongs to. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset |
563 | is disabled, then 'sched_relax_domain_level' have no effect since | 563 | is disabled, then 'cpuset.sched_relax_domain_level' have no effect since |
564 | there is no sched domain belonging the cpuset. | 564 | there is no sched domain belonging the cpuset. |
565 | 565 | ||
566 | If multiple cpusets are overlapping and hence they form a single sched | 566 | If multiple cpusets are overlapping and hence they form a single sched |
@@ -594,7 +594,7 @@ is attached, is subtle. | |||
594 | If a cpuset has its Memory Nodes modified, then for each task attached | 594 | If a cpuset has its Memory Nodes modified, then for each task attached |
595 | to that cpuset, the next time that the kernel attempts to allocate | 595 | to that cpuset, the next time that the kernel attempts to allocate |
596 | a page of memory for that task, the kernel will notice the change | 596 | a page of memory for that task, the kernel will notice the change |
597 | in the tasks cpuset, and update its per-task memory placement to | 597 | in the task's cpuset, and update its per-task memory placement to |
598 | remain within the new cpusets memory placement. If the task was using | 598 | remain within the new cpusets memory placement. If the task was using |
599 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with | 599 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with |
600 | its new cpuset, then the task will continue to use whatever subset | 600 | its new cpuset, then the task will continue to use whatever subset |
@@ -603,13 +603,13 @@ was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | |||
603 | in the new cpuset, then the task will be essentially treated as if it | 603 | in the new cpuset, then the task will be essentially treated as if it |
604 | was MPOL_BIND bound to the new cpuset (even though its NUMA placement, | 604 | was MPOL_BIND bound to the new cpuset (even though its NUMA placement, |
605 | as queried by get_mempolicy(), doesn't change). If a task is moved | 605 | as queried by get_mempolicy(), doesn't change). If a task is moved |
606 | from one cpuset to another, then the kernel will adjust the tasks | 606 | from one cpuset to another, then the kernel will adjust the task's |
607 | memory placement, as above, the next time that the kernel attempts | 607 | memory placement, as above, the next time that the kernel attempts |
608 | to allocate a page of memory for that task. | 608 | to allocate a page of memory for that task. |
609 | 609 | ||
610 | If a cpuset has its 'cpus' modified, then each task in that cpuset | 610 | If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset |
611 | will have its allowed CPU placement changed immediately. Similarly, | 611 | will have its allowed CPU placement changed immediately. Similarly, |
612 | if a tasks pid is written to another cpusets 'tasks' file, then its | 612 | if a task's pid is written to another cpusets 'cpuset.tasks' file, then its |
613 | allowed CPU placement is changed immediately. If such a task had been | 613 | allowed CPU placement is changed immediately. If such a task had been |
614 | bound to some subset of its cpuset using the sched_setaffinity() call, | 614 | bound to some subset of its cpuset using the sched_setaffinity() call, |
615 | the task will be allowed to run on any CPU allowed in its new cpuset, | 615 | the task will be allowed to run on any CPU allowed in its new cpuset, |
@@ -622,21 +622,21 @@ and the processor placement is updated immediately. | |||
622 | Normally, once a page is allocated (given a physical page | 622 | Normally, once a page is allocated (given a physical page |
623 | of main memory) then that page stays on whatever node it | 623 | of main memory) then that page stays on whatever node it |
624 | was allocated, so long as it remains allocated, even if the | 624 | was allocated, so long as it remains allocated, even if the |
625 | cpusets memory placement policy 'mems' subsequently changes. | 625 | cpusets memory placement policy 'cpuset.mems' subsequently changes. |
626 | If the cpuset flag file 'memory_migrate' is set true, then when | 626 | If the cpuset flag file 'cpuset.memory_migrate' is set true, then when |
627 | tasks are attached to that cpuset, any pages that task had | 627 | tasks are attached to that cpuset, any pages that task had |
628 | allocated to it on nodes in its previous cpuset are migrated | 628 | allocated to it on nodes in its previous cpuset are migrated |
629 | to the tasks new cpuset. The relative placement of the page within | 629 | to the task's new cpuset. The relative placement of the page within |
630 | the cpuset is preserved during these migration operations if possible. | 630 | the cpuset is preserved during these migration operations if possible. |
631 | For example if the page was on the second valid node of the prior cpuset | 631 | For example if the page was on the second valid node of the prior cpuset |
632 | then the page will be placed on the second valid node of the new cpuset. | 632 | then the page will be placed on the second valid node of the new cpuset. |
633 | 633 | ||
634 | Also if 'memory_migrate' is set true, then if that cpusets | 634 | Also if 'cpuset.memory_migrate' is set true, then if that cpuset's |
635 | 'mems' file is modified, pages allocated to tasks in that | 635 | 'cpuset.mems' file is modified, pages allocated to tasks in that |
636 | cpuset, that were on nodes in the previous setting of 'mems', | 636 | cpuset, that were on nodes in the previous setting of 'cpuset.mems', |
637 | will be moved to nodes in the new setting of 'mems.' | 637 | will be moved to nodes in the new setting of 'mems.' |
638 | Pages that were not in the tasks prior cpuset, or in the cpusets | 638 | Pages that were not in the task's prior cpuset, or in the cpuset's |
639 | prior 'mems' setting, will not be moved. | 639 | prior 'cpuset.mems' setting, will not be moved. |
640 | 640 | ||
641 | There is an exception to the above. If hotplug functionality is used | 641 | There is an exception to the above. If hotplug functionality is used |
642 | to remove all the CPUs that are currently assigned to a cpuset, | 642 | to remove all the CPUs that are currently assigned to a cpuset, |
@@ -655,7 +655,7 @@ There is a second exception to the above. GFP_ATOMIC requests are | |||
655 | kernel internal allocations that must be satisfied, immediately. | 655 | kernel internal allocations that must be satisfied, immediately. |
656 | The kernel may drop some request, in rare cases even panic, if a | 656 | The kernel may drop some request, in rare cases even panic, if a |
657 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within | 657 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within |
658 | the current tasks cpuset, then we relax the cpuset, and look for | 658 | the current task's cpuset, then we relax the cpuset, and look for |
659 | memory anywhere we can find it. It's better to violate the cpuset | 659 | memory anywhere we can find it. It's better to violate the cpuset |
660 | than stress the kernel. | 660 | than stress the kernel. |
661 | 661 | ||
@@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset: | |||
678 | cd /dev/cpuset | 678 | cd /dev/cpuset |
679 | mkdir Charlie | 679 | mkdir Charlie |
680 | cd Charlie | 680 | cd Charlie |
681 | /bin/echo 2-3 > cpus | 681 | /bin/echo 2-3 > cpuset.cpus |
682 | /bin/echo 1 > mems | 682 | /bin/echo 1 > cpuset.mems |
683 | /bin/echo $$ > tasks | 683 | /bin/echo $$ > tasks |
684 | sh | 684 | sh |
685 | # The subshell 'sh' is now running in cpuset Charlie | 685 | # The subshell 'sh' is now running in cpuset Charlie |
@@ -725,10 +725,13 @@ Now you want to do something with this cpuset. | |||
725 | 725 | ||
726 | In this directory you can find several files: | 726 | In this directory you can find several files: |
727 | # ls | 727 | # ls |
728 | cpu_exclusive memory_migrate mems tasks | 728 | cpuset.cpu_exclusive cpuset.memory_spread_slab |
729 | cpus memory_pressure notify_on_release | 729 | cpuset.cpus cpuset.mems |
730 | mem_exclusive memory_spread_page sched_load_balance | 730 | cpuset.mem_exclusive cpuset.sched_load_balance |
731 | mem_hardwall memory_spread_slab sched_relax_domain_level | 731 | cpuset.mem_hardwall cpuset.sched_relax_domain_level |
732 | cpuset.memory_migrate notify_on_release | ||
733 | cpuset.memory_pressure tasks | ||
734 | cpuset.memory_spread_page | ||
732 | 735 | ||
733 | Reading them will give you information about the state of this cpuset: | 736 | Reading them will give you information about the state of this cpuset: |
734 | the CPUs and Memory Nodes it can use, the processes that are using | 737 | the CPUs and Memory Nodes it can use, the processes that are using |
@@ -736,13 +739,13 @@ it, its properties. By writing to these files you can manipulate | |||
736 | the cpuset. | 739 | the cpuset. |
737 | 740 | ||
738 | Set some flags: | 741 | Set some flags: |
739 | # /bin/echo 1 > cpu_exclusive | 742 | # /bin/echo 1 > cpuset.cpu_exclusive |
740 | 743 | ||
741 | Add some cpus: | 744 | Add some cpus: |
742 | # /bin/echo 0-7 > cpus | 745 | # /bin/echo 0-7 > cpuset.cpus |
743 | 746 | ||
744 | Add some mems: | 747 | Add some mems: |
745 | # /bin/echo 0-7 > mems | 748 | # /bin/echo 0-7 > cpuset.mems |
746 | 749 | ||
747 | Now attach your shell to this cpuset: | 750 | Now attach your shell to this cpuset: |
748 | # /bin/echo $$ > tasks | 751 | # /bin/echo $$ > tasks |
@@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | |||
774 | This is the syntax to use when writing in the cpus or mems files | 777 | This is the syntax to use when writing in the cpus or mems files |
775 | in cpuset directories: | 778 | in cpuset directories: |
776 | 779 | ||
777 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | 780 | # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 |
778 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | 781 | # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 |
779 | 782 | ||
780 | To add a CPU to a cpuset, write the new list of CPUs including the | 783 | To add a CPU to a cpuset, write the new list of CPUs including the |
781 | CPU to be added. To add 6 to the above cpuset: | 784 | CPU to be added. To add 6 to the above cpuset: |
782 | 785 | ||
783 | # /bin/echo 1-4,6 > cpus -> set cpus list to cpus 1,2,3,4,6 | 786 | # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 |
784 | 787 | ||
785 | Similarly to remove a CPU from a cpuset, write the new list of CPUs | 788 | Similarly to remove a CPU from a cpuset, write the new list of CPUs |
786 | without the CPU to be removed. | 789 | without the CPU to be removed. |
787 | 790 | ||
788 | To remove all the CPUs: | 791 | To remove all the CPUs: |
789 | 792 | ||
790 | # /bin/echo "" > cpus -> clear cpus list | 793 | # /bin/echo "" > cpuset.cpus -> clear cpus list |
791 | 794 | ||
792 | 2.3 Setting flags | 795 | 2.3 Setting flags |
793 | ----------------- | 796 | ----------------- |
794 | 797 | ||
795 | The syntax is very simple: | 798 | The syntax is very simple: |
796 | 799 | ||
797 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | 800 | # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' |
798 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | 801 | # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' |
799 | 802 | ||
800 | 2.4 Attaching processes | 803 | 2.4 Attaching processes |
801 | ----------------------- | 804 | ----------------------- |