Diffstat (limited to 'Documentation/cgroups')
-rw-r--r--  Documentation/cgroups/00-INDEX       | 18 ++++++++++++++++++
-rw-r--r--  Documentation/cgroups/cgroups.txt    | 36 +++++++++++++++++++++++++++---------
-rw-r--r--  Documentation/cgroups/cpusets.txt    | 12 ++++++------
-rw-r--r--  Documentation/cgroups/devices.txt    |  2 +-
-rw-r--r--  Documentation/cgroups/memcg_test.txt | 22 ++++++++++++++++++++--
-rw-r--r--  Documentation/cgroups/memory.txt     |  2 +-
6 files changed, 73 insertions, 19 deletions
diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
new file mode 100644
index 000000000000..3f58fa3d6d00
--- /dev/null
+++ b/Documentation/cgroups/00-INDEX
@@ -0,0 +1,18 @@
+00-INDEX
+	- this file
+cgroups.txt
+	- Control Groups definition, implementation details, examples and API.
+cpuacct.txt
+	- CPU Accounting Controller; account CPU usage for groups of tasks.
+cpusets.txt
+	- documents the cpusets feature; assign CPUs and Mem to a set of tasks.
+devices.txt
+	- Device Whitelist Controller; description, interface and security.
+freezer-subsystem.txt
+	- checkpointing; rationale to not use signals, interface.
+memcg_test.txt
+	- Memory Resource Controller; implementation details.
+memory.txt
+	- Memory Resource Controller; design, accounting, interface, testing.
+resource_counter.txt
+	- Resource Counter API.
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 93feb8444489..6eb1a97e88ce 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -56,7 +56,7 @@ hierarchy, and a set of subsystems; each subsystem has system-specific
 state attached to each cgroup in the hierarchy. Each hierarchy has
 an instance of the cgroup virtual filesystem associated with it.
 
-At any one time there may be multiple active hierachies of task
+At any one time there may be multiple active hierarchies of task
 cgroups. Each hierarchy is a partition of all tasks in the system.
 
 User level code may create and destroy cgroups by name in an
@@ -124,10 +124,10 @@ following lines:
 / \
 Prof (15%) students (5%)
 
-Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
 into NFS network class.
 
-At the same time firefox/lynx will share an appropriate CPU/Memory class
+At the same time Firefox/Lynx will share an appropriate CPU/Memory class
 depending on who launched it (prof/student).
 
 With the ability to classify tasks differently for different resources
@@ -325,7 +325,7 @@ and then start a subshell 'sh' in that cgroup:
 Creating, modifying, using the cgroups can be done through the cgroup
 virtual filesystem.
 
-To mount a cgroup hierarchy will all available subsystems, type:
+To mount a cgroup hierarchy with all available subsystems, type:
 # mount -t cgroup xxx /dev/cgroup
 
 The "xxx" is not interpreted by the cgroup code, but will appear in
@@ -333,12 +333,23 @@ The "xxx" is not interpreted by the cgroup code, but will appear in
 
 To mount a cgroup hierarchy with just the cpuset and numtasks
 subsystems, type:
-# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
+# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
 
 To change the set of subsystems bound to a mounted hierarchy, just
 remount with different options:
+# mount -o remount,cpuset,ns hier1 /dev/cgroup
 
-# mount -o remount,cpuset,ns /dev/cgroup
+Now memory is removed from the hierarchy and ns is added.
+
+Note this will add ns to the hierarchy but won't remove memory or
+cpuset, because the new options are appended to the old ones:
+# mount -o remount,ns /dev/cgroup
+
+To Specify a hierarchy's release_agent:
+# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+xxx /dev/cgroup
+
+Note that specifying 'release_agent' more than once will return failure.
 
 Note that changing the set of subsystems is currently only supported
 when the hierarchy consists of a single (root) cgroup. Supporting
@@ -349,6 +360,11 @@ Then under /dev/cgroup you can find a tree that corresponds to the
 tree of the cgroups in the system. For instance, /dev/cgroup
 is the cgroup that holds the whole system.
 
+If you want to change the value of release_agent:
+# echo "/sbin/new_release_agent" > /dev/cgroup/release_agent
+
+It can also be changed via remount.
+
 If you want to create a new cgroup under /dev/cgroup:
 # cd /dev/cgroup
 # mkdir my_cgroup
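The release_agent handling described in these hunks can also be driven from C. The sketch below is illustrative only and assumes a Linux system with the (v1) cgroup filesystem and CAP_SYS_ADMIN for the mount(2) call; the helper names build_cgroup_opts(), mount_cgroup() and set_release_agent() are hypothetical. It builds the same -o string as the shell example (with release_agent given only once, since the text says repeating it fails), mounts with it, and rewrites an already-mounted hierarchy's release_agent file the way the echo example does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build "SUBSYS,release_agent=AGENT", the string the shell example passes
 * with -o (the shell strips the quotes around the agent path).
 * release_agent appears exactly once. Returns 0, or -1 if buf is too small. */
int build_cgroup_opts(char *buf, size_t len, const char *subsys, const char *agent)
{
    int n;

    assert(buf != NULL && subsys != NULL && agent != NULL);
    n = snprintf(buf, len, "%s,release_agent=%s", subsys, agent);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount a cgroup hierarchy at target (needs CAP_SYS_ADMIN). "xxx" is the
 * source name, which the cgroup code does not interpret. */
int mount_cgroup(const char *target, const char *opts)
{
    return mount("xxx", target, "cgroup", 0, opts);
}

/* Change release_agent of a mounted hierarchy by writing the
 * release_agent file at its root, like the "echo ... >" example. */
int set_release_agent(const char *hier_root, const char *agent)
{
    char path[4096];
    FILE *f;

    if (snprintf(path, sizeof(path), "%s/release_agent", hier_root) >= (int)sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    if (fputs(agent, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

For example, build_cgroup_opts(buf, sizeof buf, "cpuset", "/sbin/cpuset_release_agent") followed by mount_cgroup("/dev/cgroup", buf) corresponds to the `mount -t cgroup -o cpuset,release_agent=...` line, and set_release_agent("/dev/cgroup", "/sbin/new_release_agent") to the echo line.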
@@ -476,11 +492,13 @@ cgroup->parent is still valid. (Note - can also be called for a
 newly-created cgroup if an error occurs after this subsystem's
 create() method has been called for the new cgroup).
 
-void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
 
 Called before checking the reference count on each subsystem. This may
 be useful for subsystems which have some extra references even if
-there are not tasks in the cgroup.
+there are not tasks in the cgroup. If pre_destroy() returns error code,
+rmdir() will fail with it. From this behavior, pre_destroy() can be
+called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 struct task_struct *task)
@@ -521,7 +539,7 @@ always handled well.
 void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
 (cgroup_mutex held by caller)
 
-Called at the end of cgroup_clone() to do any paramater
+Called at the end of cgroup_clone() to do any parameter
 initialization which might be required before a task could attach. For
 example in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
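With pre_destroy() now returning int, rmdir() can fail and later be retried, so a subsystem's pre_destroy() may run more than once for the same cgroup. The userspace model below illustrates that contract. The structures are hypothetical simplifications, not the kernel's struct cgroup or struct cgroup_subsys, and the model assumes the core calls each subsystem's pre_destroy() in turn and stops at the first one that returns an error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct cgroup {
    int extra_refs;        /* references held even with no tasks */
    int pre_destroy_calls; /* how many times pre_destroy() has run */
};

struct cgroup_subsys {
    const char *name;
    int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
};

/* A subsystem that refuses removal while it still holds extra refs. */
static int demo_pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
    (void)ss;
    cgrp->pre_destroy_calls++;
    return cgrp->extra_refs ? -EBUSY : 0;
}

/* Model of the rmdir() path: call each subsystem's pre_destroy() and
 * fail with the first error code returned. */
int model_rmdir(struct cgroup_subsys **subsys, size_t n, struct cgroup *cgrp)
{
    assert(cgrp != NULL);
    for (size_t i = 0; i < n; i++) {
        if (subsys[i]->pre_destroy) {
            int ret = subsys[i]->pre_destroy(subsys[i], cgrp);
            if (ret)
                return ret; /* rmdir() fails with this error */
        }
    }
    return 0; /* real code would go on to check refcounts and remove */
}
```

A first model_rmdir() while extra_refs is nonzero fails with -EBUSY; once the references are dropped, a second attempt succeeds, and pre_destroy() has then been called twice for the same cgroup, which is why an implementation should tolerate repeated calls.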
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 0611e9528c7c..f9ca389dddf4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -131,7 +131,7 @@ Cpusets extends these two mechanisms as follows:
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
 browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
-cpuset (except direct ancestors and descendents) may contain
+cpuset (except direct ancestors and descendants) may contain
 any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.
 
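The exclusivity rule in this hunk can be illustrated with plain bitmasks. This is a simplified model under stated assumptions, not the kernel's validation code: CPUs are represented as bits of a 64-bit mask, the struct and function names are hypothetical, and only siblings under a common parent are compared (siblings are never ancestors or descendants of one another, so if either of two siblings is exclusive they must not overlap).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cpuset: CPU n is bit n of the mask. */
struct cpuset_model {
    uint64_t cpus;
    bool cpu_exclusive;
};

/* Two siblings may share CPUs only if neither of them is exclusive. */
static bool siblings_ok(const struct cpuset_model *a, const struct cpuset_model *b)
{
    if (a->cpu_exclusive || b->cpu_exclusive)
        return (a->cpus & b->cpus) == 0;
    return true;
}

/* Check a candidate cpuset against every sibling under the same parent. */
bool sibling_masks_ok(const struct cpuset_model *cand,
                      const struct cpuset_model *siblings, size_t n)
{
    assert(cand != NULL);
    for (size_t i = 0; i < n; i++)
        if (!siblings_ok(cand, &siblings[i]))
            return false;
    return true;
}
```

With an exclusive sibling on CPUs 0-1 and a non-exclusive one on CPUs 2-3, a candidate on CPUs 1-2 is rejected because it overlaps the exclusive set, while a non-exclusive candidate on CPU 3 is accepted.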
@@ -226,7 +226,7 @@ nodes with memory--using the cpuset_track_online_nodes() hook.
 --------------------------------
 
 If a cpuset is cpu or mem exclusive, no other cpuset, other than
-a direct ancestor or descendent, may share any of the same CPUs or
+a direct ancestor or descendant, may share any of the same CPUs or
 Memory Nodes.
 
 A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
@@ -427,7 +427,7 @@ child cpusets have this flag enabled.
 When doing this, you don't usually want to leave any unpinned tasks in
 the top cpuset that might use non-trivial amounts of CPU, as such tasks
 may be artificially constrained to some subset of CPUs, depending on
-the particulars of this flag setting in descendent cpusets. Even if
+the particulars of this flag setting in descendant cpusets. Even if
 such a task could use spare CPU cycles in some other CPUs, the kernel
 scheduler might not consider the possibility of load balancing that
 task to that underused CPU.
@@ -531,9 +531,9 @@ be idle.
 
 Of course it takes some searching cost to find movable tasks and/or
 idle CPUs, the scheduler might not search all CPUs in the domain
-everytime. In fact, in some architectures, the searching ranges on
+every time. In fact, in some architectures, the searching ranges on
 events are limited in the same socket or node where the CPU locates,
-while the load balance on tick searchs all.
+while the load balance on tick searches all.
 
 For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
 is idle while CPU X and the siblings are busy, scheduler can't migrate
@@ -601,7 +601,7 @@ its new cpuset, then the task will continue to use whatever subset
 of MPOL_BIND nodes are still allowed in the new cpuset. If the task
 was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
 in the new cpuset, then the task will be essentially treated as if it
-was MPOL_BIND bound to the new cpuset (even though its numa placement,
+was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
 as queried by get_mempolicy(), doesn't change). If a task is moved
 from one cpuset to another, then the kernel will adjust the tasks
 memory placement, as above, the next time that the kernel attempts
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 7cc6e6a60672..57ca4c89fe5c 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -42,7 +42,7 @@ suffice, but we can decide the best way to adequately restrict
 movement as people get some experience with this. We may just want
 to require CAP_SYS_ADMIN, which at least is a separate bit from
 CAP_MKNOD. We may want to just refuse moving to a cgroup which
-isn't a descendent of the current one. Or we may want to use
+isn't a descendant of the current one. Or we may want to use
 CAP_MAC_ADMIN, since we really are trying to lock down root.
 
 CAP_SYS_ADMIN is needed to modify the whitelist or move another
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 523a9c16c400..72db89ed0609 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -1,5 +1,5 @@
 Memory Resource Controller(Memcg) Implementation Memo.
-Last Updated: 2009/1/19
+Last Updated: 2009/1/20
 Base Kernel Version: based on 2.6.29-rc2.
 
 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
@@ -356,7 +356,25 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 (Shell-B)
 # move all tasks in /cgroup/test to /cgroup
 # /sbin/swapoff -a
-# rmdir /test/cgroup
+# rmdir /cgroup/test
 # kill malloc task.
 
 Of course, tmpfs v.s. swapoff test should be tested, too.
+
+9.8 OOM-Killer
+Out-of-memory caused by memcg's limit will kill tasks under
+the memcg. When hierarchy is used, a task under hierarchy
+will be killed by the kernel.
+In this case, panic_on_oom shouldn't be invoked and tasks
+in other groups shouldn't be killed.
+
+It's not difficult to cause OOM under memcg as following.
+Case A) when you can swapoff
+#swapoff -a
+#echo 50M > /memory.limit_in_bytes
+run 51M of malloc
+
+Case B) when you use mem+swap limitation.
+#echo 50M > memory.limit_in_bytes
+#echo 50M > memory.memsw.limit_in_bytes
+run 51M of malloc
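The memo's "run 51M of malloc" step leaves the test program unspecified. Below is a minimal sketch in standard C; the function names are hypothetical. It allocates 51 MiB and writes every byte, because pages are normally faulted in, and therefore charged to the memcg, only when first touched, not when malloc() returns.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocate `bytes` and write every byte so that all pages are actually
 * faulted in (and charged). Returns the buffer, or NULL if malloc fails. */
char *alloc_and_touch(size_t bytes)
{
    char *p;

    assert(bytes > 0);
    p = malloc(bytes);
    if (p == NULL)
        return NULL;
    memset(p, 0xA5, bytes); /* under a 50M limit, OOM is expected here */
    return p;
}

/* Allocate and touch 51 MiB (the memcg "50M" limit is 50 * 1024 * 1024
 * bytes). Outside a limited group this simply succeeds. */
int run_51m(void)
{
    const size_t size = 51UL * 1024 * 1024;
    char *p = alloc_and_touch(size);

    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    printf("touched %zu bytes\n", size);
    free(p);
    return 0;
}
```

Call run_51m() from main() in a process whose PID has been written to the limited group's tasks file (for example, echo $$ > /cgroup/test/tasks from the launching shell, assuming the /cgroup mount point used elsewhere in the memo). With Case A or Case B configured, only this task should be OOM-killed.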
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index e1501964df1e..a98a7fe7aabb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -302,7 +302,7 @@ will be charged as a new owner of it.
 unevictable - # of pages cannot be reclaimed.(mlocked etc)
 
 Below is depend on CONFIG_DEBUG_VM.
-inactive_ratio - VM inernal parameter. (see mm/page_alloc.c)
+inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
 recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
 recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
 recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
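The fields listed in this hunk are reported as one name/value pair per line in the memcg statistics file (memory.stat, to our understanding). Below is a small lookup helper that assumes each line has the form "<name> <value>". The helper name is hypothetical, and debug-only fields such as inactive_ratio will be absent unless the kernel was built with CONFIG_DEBUG_VM, in which case the lookup reports "not found".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find `name` in memory.stat-style text ("<name> <value>" per line).
 * Returns 0 and stores the value in *out, or -1 if the name is absent. */
int memstat_lookup(const char *text, const char *name, unsigned long long *out)
{
    const char *line = text;

    assert(text != NULL && name != NULL && out != NULL);
    while (line != NULL && *line != '\0') {
        char key[64];
        unsigned long long val;

        if (sscanf(line, "%63s %llu", key, &val) == 2 && strcmp(key, name) == 0) {
            *out = val;
            return 0;
        }
        line = strchr(line, '\n'); /* advance to the next line, if any */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

In practice the text would come from reading the group's statistics file into a buffer; the sample strings in a quick check need not reflect real counter values.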