| author | Andrea Bastoni <bastoni@cs.unc.edu> | 2010-05-30 19:16:45 -0400 |
|---|---|---|
| committer | Andrea Bastoni <bastoni@cs.unc.edu> | 2010-05-30 19:16:45 -0400 |
| commit | ada47b5fe13d89735805b566185f4885f5a3f750 (patch) | |
| tree | 644b88f8a71896307d71438e9b3af49126ffb22b /Documentation/cgroups | |
| parent | 43e98717ad40a4ae64545b5ba047c7b86aa44f4f (diff) | |
| parent | 3280f21d43ee541f97f8cda5792150d2dbec20d5 (diff) | |
Merge branch 'wip-2.6.34' into old-private-masterarchived-private-master
Diffstat (limited to 'Documentation/cgroups')
| -rw-r--r-- | Documentation/cgroups/blkio-controller.txt | 135 | ||||
| -rw-r--r-- | Documentation/cgroups/cgroup_event_listener.c | 110 | ||||
| -rw-r--r-- | Documentation/cgroups/cgroups.txt | 42 | ||||
| -rw-r--r-- | Documentation/cgroups/cpusets.txt | 127 | ||||
| -rw-r--r-- | Documentation/cgroups/memcg_test.txt | 47 | ||||
| -rw-r--r-- | Documentation/cgroups/memory.txt | 82 |
6 files changed, 470 insertions, 73 deletions
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
new file mode 100644
index 000000000000..630879cd9a42
--- /dev/null
+++ b/Documentation/cgroups/blkio-controller.txt
| @@ -0,0 +1,135 @@ | |||
| 1 | Block IO Controller | ||
| 2 | =================== | ||
| 3 | Overview | ||
| 4 | ======== | ||
| 5 | The cgroup subsystem "blkio" implements the block IO controller. There | ||
| 6 | appears to be a need for various IO control policies (like proportional BW, | ||
| 7 | max BW) both at leaf nodes and at intermediate nodes in a storage hierarchy. | ||
| 8 | The plan is to use the same cgroup based management interface for the blkio | ||
| 9 | controller and to switch IO policies in the background based on user options. | ||
| 10 | |||
| 11 | In the first phase, this patchset implements a policy that divides disk time | ||
| 12 | in proportion to group weights. It is implemented in CFQ. Hence this policy | ||
| 13 | takes effect only on leaf nodes and only when CFQ is being used. | ||
| 14 | |||
| 15 | HOWTO | ||
| 16 | ===== | ||
| 17 | A very simple test is to run two dd threads in two different cgroups. | ||
| 18 | The steps are as follows. | ||
| 19 | |||
| 20 | - Enable group scheduling in CFQ | ||
| 21 | CONFIG_CFQ_GROUP_IOSCHED=y | ||
| 22 | |||
| 23 | - Compile the kernel, boot into it, and mount the IO controller (blkio). | ||
| 24 | |||
| 25 | mount -t cgroup -o blkio none /cgroup | ||
| 26 | |||
| 27 | - Create two cgroups | ||
| 28 | mkdir -p /cgroup/test1/ /cgroup/test2 | ||
| 29 | |||
| 30 | - Set weights of group test1 and test2 | ||
| 31 | echo 1000 > /cgroup/test1/blkio.weight | ||
| 32 | echo 500 > /cgroup/test2/blkio.weight | ||
| 33 | |||
| 34 | - Create two files of the same size (say 512MB each) on the same disk (named | ||
| 35 | zerofile1 and zerofile2 below) and launch one dd thread per cgroup to read them. | ||
| 36 | |||
| 37 | sync | ||
| 38 | echo 3 > /proc/sys/vm/drop_caches | ||
| 39 | |||
| 40 | dd if=/mnt/sdb/zerofile1 of=/dev/null & | ||
| 41 | echo $! > /cgroup/test1/tasks | ||
| 42 | cat /cgroup/test1/tasks | ||
| 43 | |||
| 44 | dd if=/mnt/sdb/zerofile2 of=/dev/null & | ||
| 45 | echo $! > /cgroup/test2/tasks | ||
| 46 | cat /cgroup/test2/tasks | ||
| 47 | |||
| 48 | - At macro level, the first dd should finish first. To get more precise data, | ||
| 49 | poll (with the help of a script) the blkio.time and blkio.sectors files of | ||
| 50 | both the test1 and test2 groups. These show how much disk time (in | ||
| 51 | milliseconds) each group got and how many sectors each group dispatched to | ||
| 52 | the disk. Fairness is provided in terms of disk time, so ideally the | ||
| 53 | blkio.time values of the cgroups should be in proportion to their weights. | ||
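
As a rough sanity check for the measurement described above, the expected split can be worked out from the weights alone. The following is a minimal sketch, not kernel code: `expected_share_permille()` is a hypothetical helper, and it assumes disk time divides exactly in proportion to blkio.weight, which a real workload will only approximate.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the kernel): expected share of disk
 * time for group `idx`, in parts per thousand, assuming disk time is
 * divided exactly in proportion to the blkio.weight values.
 * Returns -1 on invalid input.
 */
int expected_share_permille(const unsigned int *weights, size_t n, size_t idx)
{
	unsigned long total = 0;
	size_t i;

	if (weights == NULL || idx >= n)
		return -1;

	/* Sum all weights so each group can be expressed as a fraction. */
	for (i = 0; i < n; i++)
		total += weights[i];
	if (total == 0)
		return -1;

	/* Integer division truncates, so shares may not add up to 1000. */
	return (int)((weights[idx] * 1000UL) / total);
}
```

For the weights used in this HOWTO (1000 for test1, 500 for test2), the helper yields roughly two thirds of the disk time for test1 and one third for test2.
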
| 54 | |||
| 55 | Various user visible config options | ||
| 56 | =================================== | ||
| 57 | CONFIG_CFQ_GROUP_IOSCHED | ||
| 58 | - Enables group scheduling in CFQ. Currently only 1 level of group | ||
| 59 | creation is allowed. | ||
| 60 | |||
| 61 | CONFIG_DEBUG_CFQ_IOSCHED | ||
| 62 | - Enables some debugging messages in blktrace. Also creates extra | ||
| 63 | cgroup file blkio.dequeue. | ||
| 64 | |||
| 65 | Config options selected automatically | ||
| 66 | ===================================== | ||
| 67 | These config options are not user visible and are selected/deselected | ||
| 68 | automatically based on IO scheduler configuration. | ||
| 69 | |||
| 70 | CONFIG_BLK_CGROUP | ||
| 71 | - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED. | ||
| 72 | |||
| 73 | CONFIG_DEBUG_BLK_CGROUP | ||
| 74 | - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED. | ||
| 75 | |||
| 76 | Details of cgroup files | ||
| 77 | ======================= | ||
| 78 | - blkio.weight | ||
| 79 | - Specifies per cgroup weight. | ||
| 80 | |||
| 81 | Currently allowed range of weights is from 100 to 1000. | ||
| 82 | |||
| 83 | - blkio.time | ||
| 84 | - disk time allocated to cgroup per device in milliseconds. First | ||
| 85 | two fields specify the major and minor number of the device and | ||
| 86 | third field specifies the disk time allocated to group in | ||
| 87 | milliseconds. | ||
| 88 | |||
| 89 | - blkio.sectors | ||
| 90 | - number of sectors transferred to/from disk by the group. First | ||
| 91 | two fields specify the major and minor number of the device and | ||
| 92 | third field specifies the number of sectors transferred by the | ||
| 93 | group to/from the device. | ||
| 94 | |||
| 95 | - blkio.dequeue | ||
| 96 | - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This | ||
| 97 | gives the statistics about how many times a group was dequeued | ||
| 98 | from the service tree of the device. First two fields specify the major | ||
| 99 | and minor number of the device and third field specifies the number | ||
| 100 | of times a group was dequeued from a particular device. | ||
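
The per-device files described above share one layout: device major, device minor, then a value. A script or tool polling them might parse each line roughly as in the sketch below. The names `struct blkio_stat` and `parse_blkio_stat()` are hypothetical, and the separator between major and minor is an assumption, so both a colon and whitespace are accepted.

```c
#include <stdio.h>

/* One per-device record: device numbers plus a value (time, sectors, ...). */
struct blkio_stat {
	unsigned int major;
	unsigned int minor;
	unsigned long long value;
};

/*
 * Parse one line of a per-device blkio file. The separator between major
 * and minor is an assumption: both "8:16 1234" and "8 16 1234" are
 * accepted. Returns 0 on success, -1 if the line does not match.
 * On failure the contents of *out are unspecified.
 */
int parse_blkio_stat(const char *line, struct blkio_stat *out)
{
	if (sscanf(line, "%u:%u %llu",
		   &out->major, &out->minor, &out->value) == 3)
		return 0;
	if (sscanf(line, "%u %u %llu",
		   &out->major, &out->minor, &out->value) == 3)
		return 0;
	return -1;
}
```
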
| 101 | |||
| 102 | CFQ sysfs tunable | ||
| 103 | ================= | ||
| 104 | /sys/block/<disk>/queue/iosched/group_isolation | ||
| 105 | |||
| 106 | If group_isolation=1, it provides stronger isolation between groups at the | ||
| 107 | expense of throughput. By default group_isolation is 0. In general that | ||
| 108 | means that if group_isolation=0, expect fairness for sequential workload | ||
| 109 | only. Set group_isolation=1 to see fairness for random IO workload also. | ||
| 110 | |||
| 111 | Generally CFQ puts random seeky workloads in the sync-noidle category. CFQ | ||
| 112 | disables idling on these queues and instead idles collectively on the group | ||
| 113 | of such queues. Generally these are slow moving queues, and if there is a | ||
| 114 | sync-noidle service tree in each group, that group gets exclusive access to | ||
| 115 | the disk for a certain period. This brings throughput down if the | ||
| 116 | group does not have enough IO to drive deeper queue depths and utilize disk | ||
| 117 | capacity to the fullest in the slice allocated to it. The flip side is | ||
| 118 | that even a random reader should get better latencies and overall throughput | ||
| 119 | if there are lots of sequential readers/sync-idle workloads running in the | ||
| 120 | system. | ||
| 121 | |||
| 122 | If group_isolation=0, then CFQ automatically moves all the random seeky queues | ||
| 123 | into the root group. That means there will be no service differentiation for | ||
| 124 | that kind of workload. This leads to better throughput as we do collective | ||
| 125 | idling on the root sync-noidle tree. | ||
| 126 | |||
| 127 | By default one should run with group_isolation=0. If that is not sufficient | ||
| 128 | and one wants stronger isolation between groups, then set group_isolation=1, | ||
| 129 | but this will come at the cost of reduced throughput. | ||
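
A tool that toggles this tunable could be sketched as follows. This is only an illustration: `group_isolation_path()` and `write_group_isolation()` are hypothetical helper names, the disk name is supplied by the caller, and writing to sysfs normally requires root privileges.

```c
#include <stdio.h>

/*
 * Build the tunable path for a disk name such as "sdb".
 * Returns 0 on success, -1 if the buffer is too small.
 */
int group_isolation_path(char *buf, size_t len, const char *disk)
{
	int n = snprintf(buf, len,
			 "/sys/block/%s/queue/iosched/group_isolation", disk);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write 0 or 1 to the file at `path`. Returns 0 on success, -1 on error.
 */
int write_group_isolation(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (f == NULL)
		return -1;
	ok = fprintf(f, "%d\n", enable ? 1 : 0) > 0;
	/* fclose() flushes, so a failed write can also show up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Keeping path construction separate from the write makes it possible to exercise the helpers against an ordinary file before pointing them at sysfs.
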
| 130 | |||
| 131 | What works | ||
| 132 | ========== | ||
| 133 | - Currently only sync IO queues are supported. All the buffered writes are | ||
| 134 | still system wide and not per group. Hence we will not see service | ||
| 135 | differentiation for buffered writes between groups. | ||
diff --git a/Documentation/cgroups/cgroup_event_listener.c b/Documentation/cgroups/cgroup_event_listener.c
new file mode 100644
index 000000000000..8c2bfc4a6358
--- /dev/null
+++ b/Documentation/cgroups/cgroup_event_listener.c
| @@ -0,0 +1,110 @@ | |||
| 1 | /* | ||
| 2 | * cgroup_event_listener.c - Simple listener of cgroup events | ||
| 3 | * | ||
| 4 | * Copyright (C) Kirill A. Shutemov <kirill@shutemov.name> | ||
| 5 | */ | ||
| 6 | |||
| 7 | #include <assert.h> | ||
| 8 | #include <errno.h> | ||
| 9 | #include <fcntl.h> | ||
| 10 | #include <libgen.h> | ||
| 11 | #include <limits.h> | ||
| 12 | #include <stdio.h> | ||
| 13 | #include <string.h> | ||
| 14 | #include <unistd.h> | ||
| 15 | |||
| 16 | #include <sys/eventfd.h> | ||
| 17 | |||
| 18 | #define USAGE_STR "Usage: cgroup_event_listener <path-to-control-file> <args>\n" | ||
| 19 | |||
| 20 | int main(int argc, char **argv) | ||
| 21 | { | ||
| 22 | int efd = -1; | ||
| 23 | int cfd = -1; | ||
| 24 | int event_control = -1; | ||
| 25 | char event_control_path[PATH_MAX]; | ||
| 26 | char line[LINE_MAX]; | ||
| 27 | int ret = -1; /* stays an error code if we bail out before ret is assigned */ | ||
| 28 | |||
| 29 | if (argc != 3) { | ||
| 30 | fputs(USAGE_STR, stderr); | ||
| 31 | return 1; | ||
| 32 | } | ||
| 33 | |||
| 34 | cfd = open(argv[1], O_RDONLY); | ||
| 35 | if (cfd == -1) { | ||
| 36 | fprintf(stderr, "Cannot open %s: %s\n", argv[1], | ||
| 37 | strerror(errno)); | ||
| 38 | goto out; | ||
| 39 | } | ||
| 40 | |||
| 41 | ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control", | ||
| 42 | dirname(argv[1])); | ||
| 43 | if (ret >= PATH_MAX) { | ||
| 44 | fputs("Path to cgroup.event_control is too long\n", stderr); | ||
| 45 | goto out; | ||
| 46 | } | ||
| 47 | |||
| 48 | event_control = open(event_control_path, O_WRONLY); | ||
| 49 | if (event_control == -1) { | ||
| 50 | fprintf(stderr, "Cannot open %s: %s\n", event_control_path, | ||
| 51 | strerror(errno)); | ||
| 52 | goto out; | ||
| 53 | } | ||
| 54 | |||
| 55 | efd = eventfd(0, 0); | ||
| 56 | if (efd == -1) { | ||
| 57 | perror("eventfd() failed"); | ||
| 58 | goto out; | ||
| 59 | } | ||
| 60 | |||
| 61 | ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]); | ||
| 62 | if (ret >= LINE_MAX) { | ||
| 63 | fputs("Arguments string is too long\n", stderr); | ||
| 64 | goto out; | ||
| 65 | } | ||
| 66 | |||
| 67 | ret = write(event_control, line, strlen(line) + 1); | ||
| 68 | if (ret == -1) { | ||
| 69 | perror("Cannot write to cgroup.event_control"); | ||
| 70 | goto out; | ||
| 71 | } | ||
| 72 | |||
| 73 | while (1) { | ||
| 74 | uint64_t result; | ||
| 75 | |||
| 76 | ret = read(efd, &result, sizeof(result)); | ||
| 77 | if (ret == -1) { | ||
| 78 | if (errno == EINTR) | ||
| 79 | continue; | ||
| 80 | perror("Cannot read from eventfd"); | ||
| 81 | break; | ||
| 82 | } | ||
| 83 | assert(ret == sizeof(result)); | ||
| 84 | |||
| 85 | ret = access(event_control_path, W_OK); | ||
| 86 | if ((ret == -1) && (errno == ENOENT)) { | ||
| 87 | puts("The cgroup seems to have been removed."); | ||
| 88 | ret = 0; | ||
| 89 | break; | ||
| 90 | } | ||
| 91 | |||
| 92 | if (ret == -1) { | ||
| 93 | perror("cgroup.event_control " | ||
| 94 | "is not accessible any more"); | ||
| 95 | break; | ||
| 96 | } | ||
| 97 | |||
| 98 | printf("%s %s: crossed\n", argv[1], argv[2]); | ||
| 99 | } | ||
| 100 | |||
| 101 | out: | ||
| 102 | if (efd >= 0) | ||
| 103 | close(efd); | ||
| 104 | if (event_control >= 0) | ||
| 105 | close(event_control); | ||
| 106 | if (cfd >= 0) | ||
| 107 | close(cfd); | ||
| 108 | |||
| 109 | return (ret != 0); | ||
| 110 | } | ||
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0b33bfe7dde9..a1ca5924faff 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
| @@ -22,6 +22,8 @@ CONTENTS: | |||
| 22 | 2. Usage Examples and Syntax | 22 | 2. Usage Examples and Syntax |
| 23 | 2.1 Basic Usage | 23 | 2.1 Basic Usage |
| 24 | 2.2 Attaching processes | 24 | 2.2 Attaching processes |
| 25 | 2.3 Mounting hierarchies by name | ||
| 26 | 2.4 Notification API | ||
| 25 | 3. Kernel API | 27 | 3. Kernel API |
| 26 | 3.1 Overview | 28 | 3.1 Overview |
| 27 | 3.2 Synchronization | 29 | 3.2 Synchronization |
| @@ -233,8 +235,7 @@ containing the following files describing that cgroup: | |||
| 233 | - cgroup.procs: list of tgids in the cgroup. This list is not | 235 | - cgroup.procs: list of tgids in the cgroup. This list is not |
| 234 | guaranteed to be sorted or free of duplicate tgids, and userspace | 236 | guaranteed to be sorted or free of duplicate tgids, and userspace |
| 235 | should sort/uniquify the list if this property is required. | 237 | should sort/uniquify the list if this property is required. |
| 236 | Writing a tgid into this file moves all threads with that tgid into | 238 | This is a read-only file, for now. |
| 237 | this cgroup. | ||
| 238 | - notify_on_release flag: run the release agent on exit? | 239 | - notify_on_release flag: run the release agent on exit? |
| 239 | - release_agent: the path to use for release notifications (this file | 240 | - release_agent: the path to use for release notifications (this file |
| 240 | exists in the top cgroup only) | 241 | exists in the top cgroup only) |
| @@ -434,6 +435,25 @@ you give a subsystem a name. | |||
| 434 | The name of the subsystem appears as part of the hierarchy description | 435 | The name of the subsystem appears as part of the hierarchy description |
| 435 | in /proc/mounts and /proc/<pid>/cgroups. | 436 | in /proc/mounts and /proc/<pid>/cgroups. |
| 436 | 437 | ||
| 438 | 2.4 Notification API | ||
| 439 | -------------------- | ||
| 440 | |||
| 441 | There is a mechanism for getting notifications about changes in the | ||
| 442 | status of a cgroup. | ||
| 443 | |||
| 444 | To register a new notification handler you need to: | ||
| 445 | - create a file descriptor for event notification using eventfd(2); | ||
| 446 | - open a control file to be monitored (e.g. memory.usage_in_bytes); | ||
| 447 | - write "<event_fd> <control_fd> <args>" to cgroup.event_control. | ||
| 448 | Interpretation of args is defined by the control file implementation. | ||
| 449 | |||
| 450 | The eventfd will be woken up by the control file implementation or when | ||
| 451 | the cgroup is removed. | ||
| 452 | |||
| 453 | To unregister a notification handler, just close the eventfd. | ||
| 454 | |||
| 455 | NOTE: Support for notifications has to be implemented by the control | ||
| 456 | file. See the documentation for the subsystem. | ||
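
The registration string and the wait step from this section can be sketched in a few lines of C. The helper names `format_event_registration()` and `wait_for_event()` are hypothetical; the sketch assumes the caller has already obtained the eventfd and the control file descriptor, and it leaves the write to cgroup.event_control to the caller.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Format the registration string "<event_fd> <control_fd> <args>" that is
 * written to cgroup.event_control. Returns the string length, or -1 if it
 * does not fit in the buffer.
 */
int format_event_registration(char *buf, size_t len, int event_fd,
			      int control_fd, const char *args)
{
	int n = snprintf(buf, len, "%d %d %s", event_fd, control_fd, args);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Block until the descriptor is signalled, retrying on EINTR. On success,
 * stores the 8-byte counter read from it in *count and returns 0;
 * returns -1 on error or short read.
 */
int wait_for_event(int event_fd, uint64_t *count)
{
	ssize_t ret;

	do {
		ret = read(event_fd, count, sizeof(*count));
	} while (ret == -1 && errno == EINTR);

	return ret == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

A caller would create the descriptor with eventfd(2), write the formatted string to cgroup.event_control, loop on `wait_for_event()`, and close the eventfd to unregister.
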
| 437 | 457 | ||
| 438 | 3. Kernel API | 458 | 3. Kernel API |
| 439 | ============= | 459 | ============= |
| @@ -488,6 +508,11 @@ Each subsystem should: | |||
| 488 | - add an entry in linux/cgroup_subsys.h | 508 | - add an entry in linux/cgroup_subsys.h |
| 489 | - define a cgroup_subsys object called <name>_subsys | 509 | - define a cgroup_subsys object called <name>_subsys |
| 490 | 510 | ||
| 511 | If a subsystem can be compiled as a module, it should also have in its | ||
| 512 | module initcall a call to cgroup_load_subsys(), and in its exitcall a | ||
| 513 | call to cgroup_unload_subsys(). It should also set <name>_subsys.module = | ||
| 514 | THIS_MODULE in its .c file. | ||
| 515 | |||
| 491 | Each subsystem may export the following methods. The only mandatory | 516 | Each subsystem may export the following methods. The only mandatory |
| 492 | methods are create/destroy. Any others that are null are presumed to | 517 | methods are create/destroy. Any others that are null are presumed to |
| 493 | be successful no-ops. | 518 | be successful no-ops. |
| @@ -536,10 +561,21 @@ returns an error, this will abort the attach operation. If a NULL | |||
| 536 | task is passed, then a successful result indicates that *any* | 561 | task is passed, then a successful result indicates that *any* |
| 537 | unspecified task can be moved into the cgroup. Note that this isn't | 562 | unspecified task can be moved into the cgroup. Note that this isn't |
| 538 | called on a fork. If this method returns 0 (success) then this should | 563 | called on a fork. If this method returns 0 (success) then this should |
| 539 | remain valid while the caller holds cgroup_mutex. If threadgroup is | 564 | remain valid while the caller holds cgroup_mutex and it is ensured that either |
| 565 | attach() or cancel_attach() will be called in the future. If threadgroup is | ||
| 540 | true, then a successful result indicates that all threads in the given | 566 | true, then a successful result indicates that all threads in the given |
| 541 | thread's threadgroup can be moved together. | 567 | thread's threadgroup can be moved together. |
| 542 | 568 | ||
| 569 | void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | ||
| 570 | struct task_struct *task, bool threadgroup) | ||
| 571 | (cgroup_mutex held by caller) | ||
| 572 | |||
| 573 | Called when a task attach operation has failed after can_attach() has succeeded. | ||
| 574 | A subsystem whose can_attach() has side effects should provide this function | ||
| 575 | so that the subsystem can implement a rollback; otherwise it is not needed. | ||
| 576 | This is called only for subsystems whose can_attach() operation has | ||
| 577 | succeeded. | ||
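
The rollback contract described above can be illustrated with a small userspace model. This is a simplified sketch, not the kernel's implementation: `struct model_subsys`, `model_attach_task()`, and the example callbacks are all hypothetical stand-ins, and locking (cgroup_mutex) is omitted entirely.

```c
#include <stddef.h>

/* Simplified userspace stand-in for a subsystem; not the kernel struct. */
struct model_subsys {
	int (*can_attach)(struct model_subsys *ss);	/* 0 on success */
	void (*cancel_attach)(struct model_subsys *ss);	/* may be NULL */
	void (*attach)(struct model_subsys *ss);	/* may be NULL */
	int reserved;	/* side effect set up by can_attach() */
	int attached;
};

/* Example callbacks: can_attach() with a side effect, and its rollback. */
int reserve_ok(struct model_subsys *ss) { ss->reserved = 1; return 0; }
int refuse(struct model_subsys *ss) { (void)ss; return -1; }
void undo_reserve(struct model_subsys *ss) { ss->reserved = 0; }
void mark_attached(struct model_subsys *ss)
{
	ss->reserved = 0;
	ss->attached = 1;
}

/*
 * Model of the attach sequence: ask every subsystem, and if one refuses,
 * call cancel_attach() only on those whose can_attach() already succeeded.
 * Returns 0 on success or the failing can_attach() result.
 */
int model_attach_task(struct model_subsys **subsys, size_t n)
{
	size_t i, failed;
	int ret = 0;

	for (i = 0; i < n; i++) {
		if (subsys[i]->can_attach) {
			ret = subsys[i]->can_attach(subsys[i]);
			if (ret)
				break;
		}
	}
	if (ret) {
		/* Roll back everything before the subsystem that refused. */
		failed = i;
		for (i = 0; i < failed; i++)
			if (subsys[i]->cancel_attach)
				subsys[i]->cancel_attach(subsys[i]);
		return ret;
	}
	for (i = 0; i < n; i++)
		if (subsys[i]->attach)
			subsys[i]->attach(subsys[i]);
	return 0;
}
```

In this model a subsystem that reserved something in can_attach() gets exactly one of attach() or cancel_attach() afterwards, which is the guarantee the text describes.
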
| 578 | |||
| 543 | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | 579 | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, |
| 544 | struct cgroup *old_cgrp, struct task_struct *task, | 580 | struct cgroup *old_cgrp, struct task_struct *task, |
| 545 | bool threadgroup) | 581 | bool threadgroup) |
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 1d7e9784439a..4160df82b3f5 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
| @@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system | |||
| 168 | containing (on top of the standard cgroup files) the following | 168 | containing (on top of the standard cgroup files) the following |
| 169 | files describing that cpuset: | 169 | files describing that cpuset: |
| 170 | 170 | ||
| 171 | - cpus: list of CPUs in that cpuset | 171 | - cpuset.cpus: list of CPUs in that cpuset |
| 172 | - mems: list of Memory Nodes in that cpuset | 172 | - cpuset.mems: list of Memory Nodes in that cpuset |
| 173 | - memory_migrate flag: if set, move pages to cpusets nodes | 173 | - cpuset.memory_migrate flag: if set, move pages to cpusets nodes |
| 174 | - cpu_exclusive flag: is cpu placement exclusive? | 174 | - cpuset.cpu_exclusive flag: is cpu placement exclusive? |
| 175 | - mem_exclusive flag: is memory placement exclusive? | 175 | - cpuset.mem_exclusive flag: is memory placement exclusive? |
| 176 | - mem_hardwall flag: is memory allocation hardwalled | 176 | - cpuset.mem_hardwall flag: is memory allocation hardwalled |
| 177 | - memory_pressure: measure of how much paging pressure in cpuset | 177 | - cpuset.memory_pressure: measure of how much paging pressure in cpuset |
| 178 | - memory_spread_page flag: if set, spread page cache evenly on allowed nodes | 178 | - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes |
| 179 | - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes | 179 | - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes |
| 180 | - sched_load_balance flag: if set, load balance within CPUs on that cpuset | 180 | - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset |
| 181 | - sched_relax_domain_level: the searching range when migrating tasks | 181 | - cpuset.sched_relax_domain_level: the searching range when migrating tasks |
| 182 | 182 | ||
| 183 | In addition, the root cpuset only has the following file: | 183 | In addition, the root cpuset only has the following file: |
| 184 | - memory_pressure_enabled flag: compute memory_pressure? | 184 | - cpuset.memory_pressure_enabled flag: compute memory_pressure? |
| 185 | 185 | ||
| 186 | New cpusets are created using the mkdir system call or shell | 186 | New cpusets are created using the mkdir system call or shell |
| 187 | command. The properties of a cpuset, such as its flags, allowed | 187 | command. The properties of a cpuset, such as its flags, allowed |
| @@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than | |||
| 229 | a direct ancestor or descendant, may share any of the same CPUs or | 229 | a direct ancestor or descendant, may share any of the same CPUs or |
| 230 | Memory Nodes. | 230 | Memory Nodes. |
| 231 | 231 | ||
| 232 | A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", | 232 | A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled", |
| 233 | i.e. it restricts kernel allocations for page, buffer and other data | 233 | i.e. it restricts kernel allocations for page, buffer and other data |
| 234 | commonly shared by the kernel across multiple users. All cpusets, | 234 | commonly shared by the kernel across multiple users. All cpusets, |
| 235 | whether hardwalled or not, restrict allocations of memory for user | 235 | whether hardwalled or not, restrict allocations of memory for user |
| @@ -304,15 +304,15 @@ times 1000. | |||
| 304 | --------------------------- | 304 | --------------------------- |
| 305 | There are two boolean flag files per cpuset that control where the | 305 | There are two boolean flag files per cpuset that control where the |
| 306 | kernel allocates pages for the file system buffers and related in | 306 | kernel allocates pages for the file system buffers and related in |
| 307 | kernel data structures. They are called 'memory_spread_page' and | 307 | kernel data structures. They are called 'cpuset.memory_spread_page' and |
| 308 | 'memory_spread_slab'. | 308 | 'cpuset.memory_spread_slab'. |
| 309 | 309 | ||
| 310 | If the per-cpuset boolean flag file 'memory_spread_page' is set, then | 310 | If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then |
| 311 | the kernel will spread the file system buffers (page cache) evenly | 311 | the kernel will spread the file system buffers (page cache) evenly |
| 312 | over all the nodes that the faulting task is allowed to use, instead | 312 | over all the nodes that the faulting task is allowed to use, instead |
| 313 | of preferring to put those pages on the node where the task is running. | 313 | of preferring to put those pages on the node where the task is running. |
| 314 | 314 | ||
| 315 | If the per-cpuset boolean flag file 'memory_spread_slab' is set, | 315 | If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set, |
| 316 | then the kernel will spread some file system related slab caches, | 316 | then the kernel will spread some file system related slab caches, |
| 317 | such as for inodes and dentries evenly over all the nodes that the | 317 | such as for inodes and dentries evenly over all the nodes that the |
| 318 | faulting task is allowed to use, instead of preferring to put those | 318 | faulting task is allowed to use, instead of preferring to put those |
| @@ -337,21 +337,21 @@ their containing tasks memory spread settings. If memory spreading | |||
| 337 | is turned off, then the currently specified NUMA mempolicy once again | 337 | is turned off, then the currently specified NUMA mempolicy once again |
| 338 | applies to memory page allocations. | 338 | applies to memory page allocations. |
| 339 | 339 | ||
| 340 | Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | 340 | Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag |
| 341 | files. By default they contain "0", meaning that the feature is off | 341 | files. By default they contain "0", meaning that the feature is off |
| 342 | for that cpuset. If a "1" is written to that file, then that turns | 342 | for that cpuset. If a "1" is written to that file, then that turns |
| 343 | the named feature on. | 343 | the named feature on. |
| 344 | 344 | ||
| 345 | The implementation is simple. | 345 | The implementation is simple. |
| 346 | 346 | ||
| 347 | Setting the flag 'memory_spread_page' turns on a per-process flag | 347 | Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag |
| 348 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | 348 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently |
| 349 | joins that cpuset. The page allocation calls for the page cache | 349 | joins that cpuset. The page allocation calls for the page cache |
| 350 | is modified to perform an inline check for this PF_SPREAD_PAGE task | 350 | is modified to perform an inline check for this PF_SPREAD_PAGE task |
| 351 | flag, and if set, a call to a new routine cpuset_mem_spread_node() | 351 | flag, and if set, a call to a new routine cpuset_mem_spread_node() |
| 352 | returns the node to prefer for the allocation. | 352 | returns the node to prefer for the allocation. |
| 353 | 353 | ||
| 354 | Similarly, setting 'memory_spread_slab' turns on the flag | 354 | Similarly, setting 'cpuset.memory_spread_slab' turns on the flag |
| 355 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate | 355 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate |
| 356 | pages from the node returned by cpuset_mem_spread_node(). | 356 | pages from the node returned by cpuset_mem_spread_node(). |
| 357 | 357 | ||
| @@ -404,24 +404,24 @@ the following two situations: | |||
| 404 | system overhead on those CPUs, including avoiding task load | 404 | system overhead on those CPUs, including avoiding task load |
| 405 | balancing if that is not needed. | 405 | balancing if that is not needed. |
| 406 | 406 | ||
| 407 | When the per-cpuset flag "sched_load_balance" is enabled (the default | 407 | When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default |
| 408 | setting), it requests that all the CPUs in that cpusets allowed 'cpus' | 408 | setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus' |
| 409 | be contained in a single sched domain, ensuring that load balancing | 409 | be contained in a single sched domain, ensuring that load balancing |
| 410 | can move a task (not otherwise pinned, as by sched_setaffinity) | 410 | can move a task (not otherwise pinned, as by sched_setaffinity) |
| 411 | from any CPU in that cpuset to any other. | 411 | from any CPU in that cpuset to any other. |
| 412 | 412 | ||
| 413 | When the per-cpuset flag "sched_load_balance" is disabled, then the | 413 | When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the |
| 414 | scheduler will avoid load balancing across the CPUs in that cpuset, | 414 | scheduler will avoid load balancing across the CPUs in that cpuset, |
| 415 | --except-- in so far as is necessary because some overlapping cpuset | 415 | --except-- in so far as is necessary because some overlapping cpuset |
| 416 | has "sched_load_balance" enabled. | 416 | has "sched_load_balance" enabled. |
| 417 | 417 | ||
| 418 | So, for example, if the top cpuset has the flag "sched_load_balance" | 418 | So, for example, if the top cpuset has the flag "cpuset.sched_load_balance" |
| 419 | enabled, then the scheduler will have one sched domain covering all | 419 | enabled, then the scheduler will have one sched domain covering all |
| 420 | CPUs, and the setting of the "sched_load_balance" flag in any other | 420 | CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other |
| 421 | cpusets won't matter, as we're already fully load balancing. | 421 | cpusets won't matter, as we're already fully load balancing. |
| 422 | 422 | ||
| 423 | Therefore in the above two situations, the top cpuset flag | 423 | Therefore in the above two situations, the top cpuset flag |
| 424 | "sched_load_balance" should be disabled, and only some of the smaller, | 424 | "cpuset.sched_load_balance" should be disabled, and only some of the smaller, |
| 425 | child cpusets have this flag enabled. | 425 | child cpusets have this flag enabled. |
| 426 | 426 | ||
| 427 | When doing this, you don't usually want to leave any unpinned tasks in | 427 | When doing this, you don't usually want to leave any unpinned tasks in |
| @@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that | |||
| 433 | task to that underused CPU. | 433 | task to that underused CPU. |
| 434 | 434 | ||
| 435 | Of course, tasks pinned to a particular CPU can be left in a cpuset | 435 | Of course, tasks pinned to a particular CPU can be left in a cpuset |
| 436 | that disables "sched_load_balance" as those tasks aren't going anywhere | 436 | that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere |
| 437 | else anyway. | 437 | else anyway. |
| 438 | 438 | ||
| 439 | There is an impedance mismatch here, between cpusets and sched domains. | 439 | There is an impedance mismatch here, between cpusets and sched domains. |
| @@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain. | |||
| 443 | It is necessary for sched domains to be flat because load balancing | 443 | It is necessary for sched domains to be flat because load balancing |
| 444 | across partially overlapping sets of CPUs would risk unstable dynamics | 444 | across partially overlapping sets of CPUs would risk unstable dynamics |
| 445 | that would be beyond our understanding. So if each of two partially | 445 | that would be beyond our understanding. So if each of two partially |
| 446 | overlapping cpusets enables the flag 'sched_load_balance', then we | 446 | overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we |
| 447 | form a single sched domain that is a superset of both. We won't move | 447 | form a single sched domain that is a superset of both. We won't move |
| 448 | a task to a CPU outside its cpuset, but the scheduler load balancing | 448 | a task to a CPU outside its cpuset, but the scheduler load balancing |
| 449 | code might waste some compute cycles considering that possibility. | 449 | code might waste some compute cycles considering that possibility. |
| 450 | 450 | ||
| 451 | This mismatch is why there is not a simple one-to-one relation | 451 | This mismatch is why there is not a simple one-to-one relation |
| 452 | between which cpusets have the flag "sched_load_balance" enabled, | 452 | between which cpusets have the flag "cpuset.sched_load_balance" enabled, |
| 453 | and the sched domain configuration. If a cpuset enables the flag, it | 453 | and the sched domain configuration. If a cpuset enables the flag, it |
| 454 | will get balancing across all its CPUs, but if it disables the flag, | 454 | will get balancing across all its CPUs, but if it disables the flag, |
| 455 | it will only be assured of no load balancing if no other overlapping | 455 | it will only be assured of no load balancing if no other overlapping |
| 456 | cpuset enables the flag. | 456 | cpuset enables the flag. |
| 457 | 457 | ||
| 458 | If two cpusets have partially overlapping 'cpus' allowed, and only | 458 | If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only |
| 459 | one of them has this flag enabled, then the other may find its | 459 | one of them has this flag enabled, then the other may find its |
| 460 | tasks only partially load balanced, just on the overlapping CPUs. | 460 | tasks only partially load balanced, just on the overlapping CPUs. |
| 461 | This is just the general case of the top_cpuset example given a few | 461 | This is just the general case of the top_cpuset example given a few |
| @@ -468,23 +468,23 @@ load balancing to the other CPUs. | |||
| 468 | 1.7.1 sched_load_balance implementation details. | 468 | 1.7.1 sched_load_balance implementation details. |
| 469 | ------------------------------------------------ | 469 | ------------------------------------------------ |
| 470 | 470 | ||
| 471 | The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary | 471 | The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary |
| 472 | to most cpuset flags.) When enabled for a cpuset, the kernel will | 472 | to most cpuset flags.) When enabled for a cpuset, the kernel will |
| 473 | ensure that it can load balance across all the CPUs in that cpuset | 473 | ensure that it can load balance across all the CPUs in that cpuset |
| 474 | (makes sure that all the CPUs in the cpus_allowed of that cpuset are | 474 | (makes sure that all the CPUs in the cpus_allowed of that cpuset are |
| 475 | in the same sched domain.) | 475 | in the same sched domain.) |
| 476 | 476 | ||
| 477 | If two overlapping cpusets both have 'sched_load_balance' enabled, | 477 | If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled, |
| 478 | then they will be (must be) both in the same sched domain. | 478 | then they will be (must be) both in the same sched domain. |
| 479 | 479 | ||
| 480 | If, as is the default, the top cpuset has 'sched_load_balance' enabled, | 480 | If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled, |
| 481 | then by the above that means there is a single sched domain covering | 481 | then by the above that means there is a single sched domain covering |
| 482 | the whole system, regardless of any other cpuset settings. | 482 | the whole system, regardless of any other cpuset settings. |
| 483 | 483 | ||
| 484 | The kernel commits to user space that it will avoid load balancing | 484 | The kernel commits to user space that it will avoid load balancing |
| 485 | where it can. It will pick as fine a granularity partition of sched | 485 | where it can. It will pick as fine a granularity partition of sched |
| 486 | domains as it can while still providing load balancing for any set | 486 | domains as it can while still providing load balancing for any set |
| 487 | of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | 487 | of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled. |
| 488 | 488 | ||
| 489 | The internal kernel cpuset to scheduler interface passes from the | 489 | The internal kernel cpuset to scheduler interface passes from the |
| 490 | cpuset code to the scheduler code a partition of the load balanced | 490 | cpuset code to the scheduler code a partition of the load balanced |
| @@ -495,9 +495,9 @@ all the CPUs that must be load balanced. | |||
| 495 | The cpuset code builds a new such partition and passes it to the | 495 | The cpuset code builds a new such partition and passes it to the |
| 496 | scheduler sched domain setup code, to have the sched domains rebuilt | 496 | scheduler sched domain setup code, to have the sched domains rebuilt |
| 497 | as necessary, whenever: | 497 | as necessary, whenever: |
| 498 | - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes, | 498 | - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes, |
| 499 | - or CPUs come or go from a cpuset with this flag enabled, | 499 | - or CPUs come or go from a cpuset with this flag enabled, |
| 500 | - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs | 500 | - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs |
| 501 | and with this flag enabled changes, | 501 | and with this flag enabled changes, |
| 502 | - or a cpuset with non-empty CPUs and with this flag enabled is removed, | 502 | - or a cpuset with non-empty CPUs and with this flag enabled is removed, |
| 503 | - or a cpu is offlined/onlined. | 503 | - or a cpu is offlined/onlined. |
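The partition rule described in this section can be sketched in a few lines of Python. This is only an illustration, not the kernel's implementation: the function name `sched_domain_partition` and the representation of each cpuset as a Python set of CPU numbers are assumptions made for the example. It takes the CPU sets of all cpusets that have 'cpuset.sched_load_balance' enabled and merges any that overlap, which yields the finest partition that still keeps each such cpuset inside a single sched domain.

```python
def sched_domain_partition(balanced_cpusets):
    """Merge overlapping CPU sets into pairwise disjoint sched domains.

    balanced_cpusets: iterable of sets of CPU numbers, one per cpuset that
    has load balancing enabled (hypothetical input format).
    """
    partition = []  # invariant: the sets in here are pairwise disjoint
    for cpus in balanced_cpusets:
        merged = set(cpus)
        if not merged:
            continue  # cpusets with no CPUs do not contribute a domain
        disjoint = []
        for domain in partition:
            if domain & merged:
                merged |= domain  # overlapping cpusets must share a domain
            else:
                disjoint.append(domain)
        disjoint.append(merged)
        partition = disjoint
    return sorted(partition, key=min)
```

With this sketch, two disjoint cpusets such as {0,1} and {2,3} stay in separate domains, while adding a third cpuset {1,2} that overlaps both collapses everything into one domain {0,1,2,3}. The same reasoning shows why a top cpuset with the flag enabled produces a single system-wide domain.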
| @@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance | |||
| 542 | on the next tick. For some applications in special situations, waiting | 542 | on the next tick. For some applications in special situations, waiting |
| 543 | 1 tick may be too long. | 543 | 1 tick may be too long. |
| 544 | 544 | ||
| 545 | The 'sched_relax_domain_level' file allows you to request changing | 545 | The 'cpuset.sched_relax_domain_level' file allows you to request changing |
| 546 | this searching range as you like. This file takes an int value which | 546 | this searching range as you like. This file takes an int value which |
| 547 | indicates the size of the searching range in levels, ideally as follows; | 547 | indicates the size of the searching range in levels, ideally as follows; |
| 548 | otherwise the initial value -1 indicates that the cpuset has no request. | 548 | otherwise the initial value -1 indicates that the cpuset has no request. |
| @@ -559,8 +559,8 @@ The system default is architecture dependent. The system default | |||
| 559 | can be changed using the relax_domain_level= boot parameter. | 559 | can be changed using the relax_domain_level= boot parameter. |
| 560 | 560 | ||
| 561 | This file is per-cpuset and affects the sched domain that the cpuset | 561 | This file is per-cpuset and affects the sched domain that the cpuset |
| 562 | belongs to. Therefore if the flag 'sched_load_balance' of a cpuset | 562 | belongs to. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset |
| 563 | is disabled, then 'sched_relax_domain_level' has no effect since | 563 | is disabled, then 'cpuset.sched_relax_domain_level' has no effect since |
| 564 | there is no sched domain belonging to the cpuset. | 564 | there is no sched domain belonging to the cpuset. |
| 565 | 565 | ||
| 566 | If multiple cpusets are overlapping and hence they form a single sched | 566 | If multiple cpusets are overlapping and hence they form a single sched |
| @@ -607,9 +607,9 @@ from one cpuset to another, then the kernel will adjust the tasks | |||
| 607 | memory placement, as above, the next time that the kernel attempts | 607 | memory placement, as above, the next time that the kernel attempts |
| 608 | to allocate a page of memory for that task. | 608 | to allocate a page of memory for that task. |
| 609 | 609 | ||
| 610 | If a cpuset has its 'cpus' modified, then each task in that cpuset | 610 | If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset |
| 611 | will have its allowed CPU placement changed immediately. Similarly, | 611 | will have its allowed CPU placement changed immediately. Similarly, |
| 612 | if a tasks pid is written to another cpusets 'tasks' file, then its | 612 | if a tasks pid is written to another cpusets 'cpuset.tasks' file, then its |
| 613 | allowed CPU placement is changed immediately. If such a task had been | 613 | allowed CPU placement is changed immediately. If such a task had been |
| 614 | bound to some subset of its cpuset using the sched_setaffinity() call, | 614 | bound to some subset of its cpuset using the sched_setaffinity() call, |
| 615 | the task will be allowed to run on any CPU allowed in its new cpuset, | 615 | the task will be allowed to run on any CPU allowed in its new cpuset, |
| @@ -622,8 +622,8 @@ and the processor placement is updated immediately. | |||
| 622 | Normally, once a page is allocated (given a physical page | 622 | Normally, once a page is allocated (given a physical page |
| 623 | of main memory) then that page stays on whatever node it | 623 | of main memory) then that page stays on whatever node it |
| 624 | was allocated, so long as it remains allocated, even if the | 624 | was allocated, so long as it remains allocated, even if the |
| 625 | cpusets memory placement policy 'mems' subsequently changes. | 625 | cpusets memory placement policy 'cpuset.mems' subsequently changes. |
| 626 | If the cpuset flag file 'memory_migrate' is set true, then when | 626 | If the cpuset flag file 'cpuset.memory_migrate' is set true, then when |
| 627 | tasks are attached to that cpuset, any pages that task had | 627 | tasks are attached to that cpuset, any pages that task had |
| 628 | allocated to it on nodes in its previous cpuset are migrated | 628 | allocated to it on nodes in its previous cpuset are migrated |
| 629 | to the tasks new cpuset. The relative placement of the page within | 629 | to the tasks new cpuset. The relative placement of the page within |
| @@ -631,12 +631,12 @@ the cpuset is preserved during these migration operations if possible. | |||
| 631 | For example if the page was on the second valid node of the prior cpuset | 631 | For example if the page was on the second valid node of the prior cpuset |
| 632 | then the page will be placed on the second valid node of the new cpuset. | 632 | then the page will be placed on the second valid node of the new cpuset. |
| 633 | 633 | ||
| 634 | Also if 'memory_migrate' is set true, then if that cpusets | 634 | Also if 'cpuset.memory_migrate' is set true, then if that cpusets |
| 635 | 'mems' file is modified, pages allocated to tasks in that | 635 | 'cpuset.mems' file is modified, pages allocated to tasks in that |
| 636 | cpuset, that were on nodes in the previous setting of 'mems', | 636 | cpuset, that were on nodes in the previous setting of 'cpuset.mems', |
| 637 | will be moved to nodes in the new setting of 'mems.' | 637 | will be moved to nodes in the new setting of 'mems.' |
| 638 | Pages that were not in the tasks prior cpuset, or in the cpusets | 638 | Pages that were not in the tasks prior cpuset, or in the cpusets |
| 639 | prior 'mems' setting, will not be moved. | 639 | prior 'cpuset.mems' setting, will not be moved. |
| 640 | 640 | ||
| 641 | There is an exception to the above. If hotplug functionality is used | 641 | There is an exception to the above. If hotplug functionality is used |
| 642 | to remove all the CPUs that are currently assigned to a cpuset, | 642 | to remove all the CPUs that are currently assigned to a cpuset, |
| @@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset: | |||
| 678 | cd /dev/cpuset | 678 | cd /dev/cpuset |
| 679 | mkdir Charlie | 679 | mkdir Charlie |
| 680 | cd Charlie | 680 | cd Charlie |
| 681 | /bin/echo 2-3 > cpus | 681 | /bin/echo 2-3 > cpuset.cpus |
| 682 | /bin/echo 1 > mems | 682 | /bin/echo 1 > cpuset.mems |
| 683 | /bin/echo $$ > tasks | 683 | /bin/echo $$ > tasks |
| 684 | sh | 684 | sh |
| 685 | # The subshell 'sh' is now running in cpuset Charlie | 685 | # The subshell 'sh' is now running in cpuset Charlie |
| @@ -725,10 +725,13 @@ Now you want to do something with this cpuset. | |||
| 725 | 725 | ||
| 726 | In this directory you can find several files: | 726 | In this directory you can find several files: |
| 727 | # ls | 727 | # ls |
| 728 | cpu_exclusive memory_migrate mems tasks | 728 | cpuset.cpu_exclusive cpuset.memory_spread_slab |
| 729 | cpus memory_pressure notify_on_release | 729 | cpuset.cpus cpuset.mems |
| 730 | mem_exclusive memory_spread_page sched_load_balance | 730 | cpuset.mem_exclusive cpuset.sched_load_balance |
| 731 | mem_hardwall memory_spread_slab sched_relax_domain_level | 731 | cpuset.mem_hardwall cpuset.sched_relax_domain_level |
| 732 | cpuset.memory_migrate notify_on_release | ||
| 733 | cpuset.memory_pressure tasks | ||
| 734 | cpuset.memory_spread_page | ||
| 732 | 735 | ||
| 733 | Reading them will give you information about the state of this cpuset: | 736 | Reading them will give you information about the state of this cpuset: |
| 734 | the CPUs and Memory Nodes it can use, the processes that are using | 737 | the CPUs and Memory Nodes it can use, the processes that are using |
| @@ -736,13 +739,13 @@ it, its properties. By writing to these files you can manipulate | |||
| 736 | the cpuset. | 739 | the cpuset. |
| 737 | 740 | ||
| 738 | Set some flags: | 741 | Set some flags: |
| 739 | # /bin/echo 1 > cpu_exclusive | 742 | # /bin/echo 1 > cpuset.cpu_exclusive |
| 740 | 743 | ||
| 741 | Add some cpus: | 744 | Add some cpus: |
| 742 | # /bin/echo 0-7 > cpus | 745 | # /bin/echo 0-7 > cpuset.cpus |
| 743 | 746 | ||
| 744 | Add some mems: | 747 | Add some mems: |
| 745 | # /bin/echo 0-7 > mems | 748 | # /bin/echo 0-7 > cpuset.mems |
| 746 | 749 | ||
| 747 | Now attach your shell to this cpuset: | 750 | Now attach your shell to this cpuset: |
| 748 | # /bin/echo $$ > tasks | 751 | # /bin/echo $$ > tasks |
| @@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | |||
| 774 | This is the syntax to use when writing in the cpus or mems files | 777 | This is the syntax to use when writing in the cpus or mems files |
| 775 | in cpuset directories: | 778 | in cpuset directories: |
| 776 | 779 | ||
| 777 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | 780 | # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 |
| 778 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | 781 | # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 |
| 779 | 782 | ||
| 780 | To add a CPU to a cpuset, write the new list of CPUs including the | 783 | To add a CPU to a cpuset, write the new list of CPUs including the |
| 781 | CPU to be added. To add 6 to the above cpuset: | 784 | CPU to be added. To add 6 to the above cpuset: |
| 782 | 785 | ||
| 783 | # /bin/echo 1-4,6 > cpus -> set cpus list to cpus 1,2,3,4,6 | 786 | # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 |
| 784 | 787 | ||
| 785 | Similarly to remove a CPU from a cpuset, write the new list of CPUs | 788 | Similarly to remove a CPU from a cpuset, write the new list of CPUs |
| 786 | without the CPU to be removed. | 789 | without the CPU to be removed. |
| 787 | 790 | ||
| 788 | To remove all the CPUs: | 791 | To remove all the CPUs: |
| 789 | 792 | ||
| 790 | # /bin/echo "" > cpus -> clear cpus list | 793 | # /bin/echo "" > cpuset.cpus -> clear cpus list |
| 791 | 794 | ||
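Because adding or removing a CPU means rewriting the whole list, a small helper can make the list syntax easier to handle from a script. The following Python sketch is an illustration only; the names `parse_cpu_list` and `format_cpu_list` are hypothetical and it handles just the comma and range forms shown above.

```python
def parse_cpu_list(text):
    """Expand a list such as "1-4,6" into sorted CPU numbers [1, 2, 3, 4, 6]."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means an empty list
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def format_cpu_list(cpus):
    """Collapse CPU numbers back into the compact "1-4,6" form."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(
        str(first) if first == last else f"{first}-{last}"
        for first, last in ranges
    )
```

To add CPU 6 to a cpuset that currently holds 1-4, a script could compute `format_cpu_list(parse_cpu_list("1-4") + [6])` and write the result to cpuset.cpus.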
| 792 | 2.3 Setting flags | 795 | 2.3 Setting flags |
| 793 | ----------------- | 796 | ----------------- |
| 794 | 797 | ||
| 795 | The syntax is very simple: | 798 | The syntax is very simple: |
| 796 | 799 | ||
| 797 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | 800 | # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' |
| 798 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | 801 | # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' |
| 799 | 802 | ||
| 800 | 2.4 Attaching processes | 803 | 2.4 Attaching processes |
| 801 | ----------------------- | 804 | ----------------------- |
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index 72db89ed0609..f7f68b2ac199 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt | |||
| @@ -1,6 +1,6 @@ | |||
| 1 | Memory Resource Controller(Memcg) Implementation Memo. | 1 | Memory Resource Controller(Memcg) Implementation Memo. |
| 2 | Last Updated: 2009/1/20 | 2 | Last Updated: 2010/2 |
| 3 | Base Kernel Version: based on 2.6.29-rc2. | 3 | Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34). |
| 4 | 4 | ||
| 5 | Because VM is getting complex (one of reasons is memcg...), memcg's behavior | 5 | Because VM is getting complex (one of reasons is memcg...), memcg's behavior |
| 6 | is complex. This is a document for memcg's internal behavior. | 6 | is complex. This is a document for memcg's internal behavior. |
| @@ -337,7 +337,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
| 337 | race and lock dependency with other cgroup subsystems. | 337 | race and lock dependency with other cgroup subsystems. |
| 338 | 338 | ||
| 339 | example) | 339 | example) |
| 340 | # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices | 340 | # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices |
| 341 | 341 | ||
| 342 | and do task move, mkdir, rmdir etc...under this. | 342 | and do task move, mkdir, rmdir etc...under this. |
| 343 | 343 | ||
| @@ -348,7 +348,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
| 348 | 348 | ||
| 349 | For example, test like following is good. | 349 | For example, test like following is good. |
| 350 | (Shell-A) | 350 | (Shell-A) |
| 351 | # mount -t cgroup none /cgroup -t memory | 351 | # mount -t cgroup none /cgroup -o memory |
| 352 | # mkdir /cgroup/test | 352 | # mkdir /cgroup/test |
| 353 | # echo 40M > /cgroup/test/memory.limit_in_bytes | 353 | # echo 40M > /cgroup/test/memory.limit_in_bytes |
| 354 | # echo 0 > /cgroup/test/tasks | 354 | # echo 0 > /cgroup/test/tasks |
| @@ -378,3 +378,42 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
| 378 | #echo 50M > memory.limit_in_bytes | 378 | #echo 50M > memory.limit_in_bytes |
| 379 | #echo 50M > memory.memsw.limit_in_bytes | 379 | #echo 50M > memory.memsw.limit_in_bytes |
| 380 | run 51M of malloc | 380 | run 51M of malloc |
| 381 | |||
| 382 | 9.9 Move charges at task migration | ||
| 383 | Charges associated with a task can be moved along with task migration. | ||
| 384 | |||
| 385 | (Shell-A) | ||
| 386 | #mkdir /cgroup/A | ||
| 387 | #echo $$ >/cgroup/A/tasks | ||
| 388 | run some programs which use some amount of memory in /cgroup/A. | ||
| 389 | |||
| 390 | (Shell-B) | ||
| 391 | #mkdir /cgroup/B | ||
| 392 | #echo 1 >/cgroup/B/memory.move_charge_at_immigrate | ||
| 393 | #echo "pid of the program running in group A" >/cgroup/B/tasks | ||
| 394 | |||
| 395 | You can see charges have been moved by reading *.usage_in_bytes or | ||
| 396 | memory.stat of both A and B. | ||
| 397 | See 8.2 of Documentation/cgroups/memory.txt to see what value should be | ||
| 398 | written to move_charge_at_immigrate. | ||
| 399 | |||
| 400 | 9.10 Memory thresholds | ||
| 401 | Memory controller implements memory thresholds using cgroups notification | ||
| 402 | API. You can use Documentation/cgroups/cgroup_event_listener.c to test | ||
| 403 | it. | ||
| 404 | |||
| 405 | (Shell-A) Create cgroup and run event listener | ||
| 406 | # mkdir /cgroup/A | ||
| 407 | # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M | ||
| 408 | |||
| 409 | (Shell-B) Add task to cgroup and try to allocate and free memory | ||
| 410 | # echo $$ >/cgroup/A/tasks | ||
| 411 | # a="$(dd if=/dev/zero bs=1M count=10)" | ||
| 412 | # a= | ||
| 413 | |||
| 414 | You will see a message from cgroup_event_listener every time you cross | ||
| 415 | the thresholds. | ||
| 416 | |||
| 417 | Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds. | ||
| 418 | |||
| 419 | It's a good idea to test the root cgroup as well. | ||
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index b871f2552b45..3a6aecd078ba 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt | |||
| @@ -182,6 +182,8 @@ list. | |||
| 182 | NOTE: Reclaim does not work for the root cgroup, since we cannot set any | 182 | NOTE: Reclaim does not work for the root cgroup, since we cannot set any |
| 183 | limits on the root cgroup. | 183 | limits on the root cgroup. |
| 184 | 184 | ||
| 185 | Note2: When panic_on_oom is set to "2", the whole system will panic. | ||
| 186 | |||
| 185 | 2. Locking | 187 | 2. Locking |
| 186 | 188 | ||
| 187 | The memory controller uses the following hierarchy | 189 | The memory controller uses the following hierarchy |
| @@ -262,10 +264,12 @@ some of the pages cached in the cgroup (page cache pages). | |||
| 262 | 4.2 Task migration | 264 | 4.2 Task migration |
| 263 | 265 | ||
| 264 | When a task migrates from one cgroup to another, its charge is not | 266 | When a task migrates from one cgroup to another, its charge is not |
| 265 | carried forward. The pages allocated from the original cgroup still | 267 | carried forward by default. The pages allocated from the original cgroup still |
| 266 | remain charged to it; the charge is dropped when the page is freed or | 268 | remain charged to it; the charge is dropped when the page is freed or |
| 267 | reclaimed. | 269 | reclaimed. |
| 268 | 270 | ||
| 271 | Note: You can move charges of a task along with task migration. See 8. | ||
| 272 | |||
| 269 | 4.3 Removing a cgroup | 273 | 4.3 Removing a cgroup |
| 270 | 274 | ||
| 271 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | 275 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a |
| @@ -336,7 +340,7 @@ Note: | |||
| 336 | 5.3 swappiness | 340 | 5.3 swappiness |
| 337 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | 341 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. |
| 338 | 342 | ||
| 339 | Following cgroups' swapiness can't be changed. | 343 | Following cgroups' swappiness can't be changed. |
| 340 | - root cgroup (uses /proc/sys/vm/swappiness). | 344 | - root cgroup (uses /proc/sys/vm/swappiness). |
| 341 | - a cgroup which uses hierarchy and it has child cgroup. | 345 | - a cgroup which uses hierarchy and it has child cgroup. |
| 342 | - a cgroup which uses hierarchy and not the root of hierarchy. | 346 | - a cgroup which uses hierarchy and not the root of hierarchy. |
| @@ -377,7 +381,8 @@ The feature can be disabled by | |||
| 377 | NOTE1: Enabling/disabling will fail if the cgroup already has other | 381 | NOTE1: Enabling/disabling will fail if the cgroup already has other |
| 378 | cgroups created below it. | 382 | cgroups created below it. |
| 379 | 383 | ||
| 380 | NOTE2: This feature can be enabled/disabled per subtree. | 384 | NOTE2: When panic_on_oom is set to "2", the whole system will panic in |
| 385 | case of an oom event in any cgroup. | ||
| 381 | 386 | ||
| 382 | 7. Soft limits | 387 | 7. Soft limits |
| 383 | 388 | ||
| @@ -414,7 +419,76 @@ NOTE1: Soft limits take effect over a long period of time, since they involve | |||
| 414 | NOTE2: It is recommended to set the soft limit always below the hard limit, | 419 | NOTE2: It is recommended to set the soft limit always below the hard limit, |
| 415 | otherwise the hard limit will take precedence. | 420 | otherwise the hard limit will take precedence. |
| 416 | 421 | ||
| 417 | 8. TODO | 422 | 8. Move charges at task migration |
| 423 | |||
| 424 | Users can move charges associated with a task along with task migration, that | ||
| 425 | is, uncharge task's pages from the old cgroup and charge them to the new cgroup. | ||
| 426 | This feature is not supported in !CONFIG_MMU environments because of lack of | ||
| 427 | page tables. | ||
| 428 | |||
| 429 | 8.1 Interface | ||
| 430 | |||
| 431 | This feature is disabled by default. It can be enabled(and disabled again) by | ||
| 432 | writing to memory.move_charge_at_immigrate of the destination cgroup. | ||
| 433 | |||
| 434 | If you want to enable it: | ||
| 435 | |||
| 436 | # echo (some positive value) > memory.move_charge_at_immigrate | ||
| 437 | |||
| 438 | Note: Each bit of move_charge_at_immigrate has its own meaning about what type | ||
| 439 | of charges should be moved. See 8.2 for details. | ||
| 440 | Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread | ||
| 441 | group. | ||
| 442 | Note: If we cannot find enough space for the task in the destination cgroup, we | ||
| 443 | try to make space by reclaiming memory. Task migration may fail if we | ||
| 444 | cannot make enough space. | ||
| 445 | Note: Moving charges on the order of gigabytes can take several seconds. | ||
| 446 | |||
| 447 | And if you want to disable it again: | ||
| 448 | |||
| 449 | # echo 0 > memory.move_charge_at_immigrate | ||
| 450 | |||
| 451 | 8.2 Types of charges which can be moved | ||
| 452 | |||
| 453 | Each bit of move_charge_at_immigrate has its own meaning about what type of | ||
| 454 | charges should be moved. | ||
| 455 | |||
| 456 | bit | what type of charges would be moved ? | ||
| 457 | -----+------------------------------------------------------------------------ | ||
| 458 | 0 | A charge of an anonymous page(or swap of it) used by the target task. | ||
| 459 | | Those pages and swaps must be used only by the target task. You must | ||
| 460 | | enable Swap Extension(see 2.4) to enable move of swap charges. | ||
| 461 | |||
| 462 | Note: Those pages and swaps must be charged to the old cgroup. | ||
| 463 | Note: More types of pages (e.g. file cache, shmem) will be supported by other | ||
| 464 | bits in the future. | ||
| 465 | |||
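Since the value written to memory.move_charge_at_immigrate is a bitmask, a script can test individual bits instead of comparing whole numbers. The Python sketch below is an illustration under stated assumptions: the constant name `MOVE_CHARGE_ANON` and the function `describe_move_charge` are hypothetical, and only bit 0 is interpreted because it is the only bit defined in the table above.

```python
# Hypothetical name for bit 0 of memory.move_charge_at_immigrate.
MOVE_CHARGE_ANON = 1 << 0


def describe_move_charge(value):
    """Return a short description of a move_charge_at_immigrate value."""
    if value == 0:
        return "disabled"
    kinds = []
    if value & MOVE_CHARGE_ANON:
        kinds.append("anonymous pages and their swap")
    if value & ~MOVE_CHARGE_ANON:
        # Any other bit has no meaning in the table above.
        kinds.append("bits not defined in this document")
    return "move: " + ", ".join(kinds)
```

Writing the value 1 therefore enables moving of anonymous page charges, and writing 0 disables the feature.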
| 466 | 8.3 TODO | ||
| 467 | |||
| 468 | - Add support for other types of pages(e.g. file cache, shmem, etc.). | ||
| 469 | - Implement madvise(2) to let users decide the vma to be moved or not to be | ||
| 470 | moved. | ||
| 471 | - All of moving charge operations are done under cgroup_mutex. It's not good | ||
| 472 | behavior to hold the mutex too long, so we may need some trick. | ||
| 473 | |||
| 474 | 9. Memory thresholds | ||
| 475 | |||
| 476 | The memory controller implements memory thresholds using the cgroups | ||
| 477 | notification API (see cgroups.txt). It allows an application to register | ||
| 478 | multiple memory and memsw thresholds and get notified when usage crosses them. | ||
| 479 | |||
| 480 | To register a threshold, an application needs to: | ||
| 481 | - create an eventfd using eventfd(2); | ||
| 482 | - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; | ||
| 483 | - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to | ||
| 484 | cgroup.event_control. | ||
| 485 | |||
| 486 | The application will be notified through eventfd when memory usage crosses | ||
| 487 | a threshold in either direction. | ||
| 488 | |||
| 489 | This applies to both root and non-root cgroups. | ||
| 490 | |||
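The three registration steps can be sketched in Python as follows. This is a minimal illustration, not a replacement for cgroup_event_listener.c: the function names are hypothetical, the threshold is passed as a plain byte count, `os.eventfd` requires Python 3.10 or later on Linux, and the code assumes a memory cgroup is mounted at the directory you pass in.

```python
import os


def event_control_line(event_fd, usage_fd, threshold_bytes):
    """Build the string written to cgroup.event_control."""
    return f"{event_fd} {usage_fd} {threshold_bytes}"


def register_threshold(cgroup_dir, threshold_bytes):
    """Register one usage threshold; returns (event_fd, usage_fd).

    Both descriptors are returned so the caller can keep them open while
    waiting, as the C listener does (an assumption carried over from it).
    """
    event_fd = os.eventfd(0)  # step 1: create an eventfd
    usage_fd = os.open(       # step 2: open the usage file
        os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY
    )
    # step 3: tell the cgroup which eventfd to signal for which threshold
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as control:
        control.write(event_control_line(event_fd, usage_fd, threshold_bytes))
    return event_fd, usage_fd


def wait_for_crossing(event_fd):
    """Block until the threshold is crossed; returns the eventfd counter."""
    return os.eventfd_read(event_fd)
```

A caller would run something like `efd, ufd = register_threshold("/cgroup/A", 5 * 1024 * 1024)` and then loop on `wait_for_crossing(efd)`. To watch memsw usage instead, open memory.memsw.usage_in_bytes in step 2.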
| 491 | 10. TODO | ||
| 418 | 492 | ||
| 419 | 1. Add support for accounting huge pages (as a separate controller) | 493 | 1. Add support for accounting huge pages (as a separate controller) |
| 420 | 2. Make per-cgroup scanner reclaim not-shared pages first | 494 | 2. Make per-cgroup scanner reclaim not-shared pages first |
