Diffstat (limited to 'Documentation/cgroups')
 Documentation/cgroups/blkio-controller.txt    | 151
 Documentation/cgroups/cgroup_event_listener.c | 110
 Documentation/cgroups/cgroups.txt             |  44
 Documentation/cgroups/cpusets.txt             | 161
 Documentation/cgroups/memcg_test.txt          |  49
 Documentation/cgroups/memory.txt              | 378
 6 files changed, 711 insertions(+), 182 deletions(-)
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 630879cd9a42..48e0b21b0059 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -17,6 +17,9 @@ HOWTO
 You can do a very simple testing of running two dd threads in two different
 cgroups. Here is what you can do.
 
+- Enable Block IO controller
+	CONFIG_BLK_CGROUP=y
+
 - Enable group scheduling in CFQ
 	CONFIG_CFQ_GROUP_IOSCHED=y
 
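A minimal sketch of that two-dd test; the mount point, file paths and
weights below are illustrative rather than prescribed by the controller:

	# mount -t cgroup -o blkio none /cgroup
	# mkdir /cgroup/test1 /cgroup/test2
	# echo 1000 > /cgroup/test1/blkio.weight
	# echo 500 > /cgroup/test2/blkio.weight
	# dd if=/mnt/sdb/zerofile1 of=/dev/null &
	# echo $! > /cgroup/test1/tasks
	# dd if=/mnt/sdb/zerofile2 of=/dev/null &
	# echo $! > /cgroup/test2/tasks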
@@ -54,32 +57,52 @@ cgroups. Here is what you can do.
 
 Various user visible config options
 ===================================
-CONFIG_CFQ_GROUP_IOSCHED
-	- Enables group scheduling in CFQ. Currently only 1 level of group
-	  creation is allowed.
-
-CONFIG_DEBUG_CFQ_IOSCHED
-	- Enables some debugging messages in blktrace. Also creates extra
-	  cgroup file blkio.dequeue.
-
-Config options selected automatically
-=====================================
-These config options are not user visible and are selected/deselected
-automatically based on IO scheduler configuration.
-
 CONFIG_BLK_CGROUP
-	- Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
+	- Block IO controller.
 
 CONFIG_DEBUG_BLK_CGROUP
-	- Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
+	- Debug help. Right now some additional stats files show up in the
+	  cgroup if this option is enabled.
+
+CONFIG_CFQ_GROUP_IOSCHED
+	- Enables group scheduling in CFQ. Currently only 1 level of group
+	  creation is allowed.
 
 Details of cgroup files
 =======================
 - blkio.weight
-	- Specifies per cgroup weight.
-
+	- Specifies per cgroup weight.  This is the default weight of the
+	  group on all the devices until and unless overridden by a per
+	  device rule (see blkio.weight_device).
 	  Currently allowed range of weights is from 100 to 1000.
 
+- blkio.weight_device
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight as specified
+	  by blkio.weight.
+
+	  Following is the format.
+
+	  # echo dev_maj:dev_minor weight > /path/to/cgroup/blkio.weight_device
+	  Configure weight=300 on /dev/sdb (8:16) in this cgroup
+	  # echo 8:16 300 > blkio.weight_device
+	  # cat blkio.weight_device
+	  dev     weight
+	  8:16    300
+
+	  Configure weight=500 on /dev/sda (8:0) in this cgroup
+	  # echo 8:0 500 > blkio.weight_device
+	  # cat blkio.weight_device
+	  dev     weight
+	  8:0     500
+	  8:16    300
+
+	  Remove specific weight for /dev/sda in this cgroup
+	  # echo 8:0 0 > blkio.weight_device
+	  # cat blkio.weight_device
+	  dev     weight
+	  8:16    300
+
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
 	  two fields specify the major and minor number of the device and
@@ -92,13 +115,105 @@ Details of cgroup files
 	  third field specifies the number of sectors transferred by the
 	  group to/from the device.
 
+- blkio.io_service_bytes
+	- Number of bytes transferred to/from the disk by the group. These
+	  are further divided by the type of operation - read or write, sync
+	  or async. First two fields specify the major and minor number of the
+	  device, third field specifies the operation type and the fourth field
+	  specifies the number of bytes.
+
+- blkio.io_serviced
+	- Number of IOs completed to/from the disk by the group. These
+	  are further divided by the type of operation - read or write, sync
+	  or async. First two fields specify the major and minor number of the
+	  device, third field specifies the operation type and the fourth field
+	  specifies the number of IOs.
+
+- blkio.io_service_time
+	- Total amount of time between request dispatch and request completion
+	  for the IOs done by this cgroup. This is in nanoseconds to make it
+	  meaningful for flash devices too. For devices with queue depth of 1,
+	  this time represents the actual service time. When queue_depth > 1,
+	  that is no longer true as requests may be served out of order. This
+	  may cause the service time for a given IO to include the service time
+	  of multiple IOs when served out of order, which may result in total
+	  io_service_time > actual time elapsed. This time is further divided by
+	  the type of operation - read or write, sync or async. First two fields
+	  specify the major and minor number of the device, third field
+	  specifies the operation type and the fourth field specifies the
+	  io_service_time in ns.
+
+- blkio.io_wait_time
+	- Total amount of time the IOs for this cgroup spent waiting in the
+	  scheduler queues for service. This can be greater than the total time
+	  elapsed since it is cumulative io_wait_time for all IOs. It is not a
+	  measure of total time the cgroup spent waiting but rather a measure of
+	  the wait_time for its individual IOs. For devices with queue_depth > 1
+	  this metric does not include the time spent waiting for service once
+	  the IO is dispatched to the device but before it actually gets serviced
+	  (there might be a time lag here due to re-ordering of requests by the
+	  device). This is in nanoseconds to make it meaningful for flash
+	  devices too. This time is further divided by the type of operation -
+	  read or write, sync or async. First two fields specify the major and
+	  minor number of the device, third field specifies the operation type
+	  and the fourth field specifies the io_wait_time in ns.
+
+- blkio.io_merged
+	- Total number of bios/requests merged into requests belonging to this
+	  cgroup. This is further divided by the type of operation - read or
+	  write, sync or async.
+
+- blkio.io_queued
+	- Total number of requests queued up at any given instant for this
+	  cgroup. This is further divided by the type of operation - read or
+	  write, sync or async.
+
+- blkio.avg_queue_size
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  The average queue size for this cgroup over the entire time of this
+	  cgroup's existence. Queue size samples are taken each time one of the
+	  queues of this cgroup gets a timeslice.
+
+- blkio.group_wait_time
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  This is the amount of time the cgroup had to wait since it became busy
+	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
+	  its queues. This is different from the io_wait_time which is the
+	  cumulative total of the amount of time spent by each IO in that cgroup
+	  waiting in the scheduler queue. This is in nanoseconds. If this is
+	  read when the cgroup is in a waiting (for timeslice) state, the stat
+	  will only report the group_wait_time accumulated till the last time it
+	  got a timeslice and will not include the current delta.
+
+- blkio.empty_time
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  This is the amount of time a cgroup spends without any pending
+	  requests when not being served, i.e., it does not include any time
+	  spent idling for one of the queues of the cgroup. This is in
+	  nanoseconds. If this is read when the cgroup is in an empty state,
+	  the stat will only report the empty_time accumulated till the last
+	  time it had a pending request and will not include the current delta.
+
+- blkio.idle_time
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  This is the amount of time spent by the IO scheduler idling for a
+	  given cgroup in anticipation of a better request than the existing
+	  ones from other queues/cgroups. This is in nanoseconds. If this is
+	  read when the cgroup is in an idling state, the stat will only report
+	  the idle_time accumulated till the last idle period and will not
+	  include the current delta.
+
 - blkio.dequeue
-	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
 	  gives the statistics about how many times a group was dequeued
 	  from the service tree of the device. First two fields specify the
 	  major and minor number of the device and third field specifies the
 	  number of times a group was dequeued from a particular device.
 
+- blkio.reset_stats
+	- Writing an int to this file will result in resetting all the stats
+	  for that cgroup.
+
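For example, a group's counters can be inspected and then cleared like
this; the device numbers and counts shown are illustrative:

	# cat blkio.io_serviced
	8:16 Read 8732
	8:16 Write 0
	8:16 Sync 8732
	8:16 Async 0
	8:16 Total 8732
	# echo 1 > blkio.reset_stats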
 CFQ sysfs tunable
 =================
 /sys/block/<disk>/queue/iosched/group_isolation
diff --git a/Documentation/cgroups/cgroup_event_listener.c b/Documentation/cgroups/cgroup_event_listener.c
new file mode 100644
index 000000000000..8c2bfc4a6358
--- /dev/null
+++ b/Documentation/cgroups/cgroup_event_listener.c
@@ -0,0 +1,110 @@
+/*
+ * cgroup_event_listener.c - Simple listener of cgroup events
+ *
+ * Copyright (C) Kirill A. Shutemov <kirill@shutemov.name>
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <libgen.h>
+#include <limits.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/eventfd.h>
+
+#define USAGE_STR "Usage: cgroup_event_listener <path-to-control-file> <args>\n"
+
+int main(int argc, char **argv)
+{
+	int efd = -1;
+	int cfd = -1;
+	int event_control = -1;
+	char event_control_path[PATH_MAX];
+	char line[LINE_MAX];
+	int ret = -1;	/* early "goto out" paths must exit non-zero */
+
+	if (argc != 3) {
+		fputs(USAGE_STR, stderr);
+		return 1;
+	}
+
+	cfd = open(argv[1], O_RDONLY);
+	if (cfd == -1) {
+		fprintf(stderr, "Cannot open %s: %s\n", argv[1],
+				strerror(errno));
+		goto out;
+	}
+
+	ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control",
+			dirname(argv[1]));
+	if (ret >= PATH_MAX) {
+		fputs("Path to cgroup.event_control is too long\n", stderr);
+		goto out;
+	}
+
+	event_control = open(event_control_path, O_WRONLY);
+	if (event_control == -1) {
+		fprintf(stderr, "Cannot open %s: %s\n", event_control_path,
+				strerror(errno));
+		goto out;
+	}
+
+	efd = eventfd(0, 0);
+	if (efd == -1) {
+		perror("eventfd() failed");
+		goto out;
+	}
+
+	ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]);
+	if (ret >= LINE_MAX) {
+		fputs("Arguments string is too long\n", stderr);
+		goto out;
+	}
+
+	ret = write(event_control, line, strlen(line) + 1);
+	if (ret == -1) {
+		perror("Cannot write to cgroup.event_control");
+		goto out;
+	}
+
+	while (1) {
+		uint64_t result;
+
+		ret = read(efd, &result, sizeof(result));
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
+			perror("Cannot read from eventfd");
+			break;
+		}
+		assert(ret == sizeof(result));
+
+		ret = access(event_control_path, W_OK);
+		if ((ret == -1) && (errno == ENOENT)) {
+			puts("The cgroup seems to have been removed.");
+			ret = 0;
+			break;
+		}
+
+		if (ret == -1) {
+			perror("cgroup.event_control "
+					"is not accessible any more");
+			break;
+		}
+
+		printf("%s %s: crossed\n", argv[1], argv[2]);
+	}
+
+out:
+	if (efd >= 0)
+		close(efd);
+	if (event_control >= 0)
+		close(event_control);
+	if (cfd >= 0)
+		close(cfd);
+
+	return (ret != 0);
+}
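One possible way to exercise the listener is against a memory cgroup's
usage threshold; the cgroup path and the "5M" argument below are
illustrative and assume the memory controller is mounted:

	# gcc -o cgroup_event_listener cgroup_event_listener.c
	# ./cgroup_event_listener /cgroup/memory/0/memory.usage_in_bytes 5M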
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0b33bfe7dde9..b34823ff1646 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -22,6 +22,8 @@ CONTENTS:
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Attaching processes
+  2.3 Mounting hierarchies by name
+  2.4 Notification API
 3. Kernel API
   3.1 Overview
   3.2 Synchronization
@@ -233,8 +235,7 @@ containing the following files describing that cgroup:
 - cgroup.procs: list of tgids in the cgroup. This list is not
   guaranteed to be sorted or free of duplicate tgids, and userspace
   should sort/uniquify the list if this property is required.
-  Writing a tgid into this file moves all threads with that tgid into
-  this cgroup.
+  This is a read-only file, for now.
 - notify_on_release flag: run the release agent on exit?
 - release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)
@@ -338,7 +339,7 @@ To mount a cgroup hierarchy with all available subsystems, type:
 The "xxx" is not interpreted by the cgroup code, but will appear in
 /proc/mounts so may be any useful identifying string that you like.
 
-To mount a cgroup hierarchy with just the cpuset and numtasks
+To mount a cgroup hierarchy with just the cpuset and memory
 subsystems, type:
 # mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
 
@@ -434,6 +435,25 @@ you give a subsystem a name.
 The name of the subsystem appears as part of the hierarchy description
 in /proc/mounts and /proc/<pid>/cgroups.
 
+2.4 Notification API
+--------------------
+
+There is a mechanism which allows you to get notifications about the
+changing status of a cgroup.
+
+To register a new notification handler you need to:
+- create a file descriptor for event notification using eventfd(2);
+- open a control file to be monitored (e.g. memory.usage_in_bytes);
+- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
+  Interpretation of args is defined by the control file implementation;
+
+The eventfd will be woken up by the control file implementation or when
+the cgroup is removed.
+
+To unregister a notification handler just close the eventfd.
+
+NOTE: Notifications are supported only for control files that actually
+implement them; see the documentation for each subsystem.
+
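A condensed C sketch of the registration step, with all error handling
omitted; see Documentation/cgroups/cgroup_event_listener.c (added by this
patch) for a complete program. The control file name and the "5M"
argument are illustrative:

	uint64_t counter;
	char buf[LINE_MAX];
	int efd = eventfd(0, 0);
	int cfd = open("memory.usage_in_bytes", O_RDONLY);
	int ecfd = open("cgroup.event_control", O_WRONLY);

	snprintf(buf, sizeof(buf), "%d %d %s", efd, cfd, "5M");
	write(ecfd, buf, strlen(buf) + 1);	/* register the handler */
	read(efd, &counter, sizeof(counter));	/* blocks until an event fires */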
 3. Kernel API
 =============
@@ -488,6 +508,11 @@ Each subsystem should:
 - add an entry in linux/cgroup_subsys.h
 - define a cgroup_subsys object called <name>_subsys
 
+If a subsystem can be compiled as a module, it should also have in its
+module initcall a call to cgroup_load_subsys(), and in its exitcall a
+call to cgroup_unload_subsys(). It should also set its_subsys.module =
+THIS_MODULE in its .c file.
+
 Each subsystem may export the following methods. The only mandatory
 methods are create/destroy. Any others that are null are presumed to
 be successful no-ops.
@@ -536,10 +561,21 @@ returns an error, this will abort the attach operation. If a NULL
 task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex. If threadgroup is
+remain valid while the caller holds cgroup_mutex, and it is guaranteed
+that either attach() or cancel_attach() will be called in the future.
+If threadgroup is
 true, then a successful result indicates that all threads in the given
 thread's threadgroup can be moved together.
 
+void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+		   struct task_struct *task, bool threadgroup)
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has
+succeeded. A subsystem whose can_attach() has some side-effects should
+provide this function, so that the subsystem can implement a rollback;
+otherwise it is not necessary. This will be called only for subsystems
+whose can_attach() operation has succeeded.
+
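A sketch of how a subsystem might pair the two callbacks; foo_reserve()
and foo_unreserve() are hypothetical helpers standing in for whatever
side-effect a real subsystem's can_attach() has:

	static int foo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
				  struct task_struct *task, bool threadgroup)
	{
		/* hypothetical: reserve whatever the later attach will need */
		if (!foo_reserve(cgrp, task))
			return -ENOMEM;
		return 0;
	}

	static void foo_cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
				      struct task_struct *task, bool threadgroup)
	{
		/* roll back the reservation made in foo_can_attach() */
		foo_unreserve(cgrp, task);
	}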
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	    struct cgroup *old_cgrp, struct task_struct *task,
 	    bool threadgroup)
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 1d7e9784439a..51682ab2dd1a 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -42,7 +42,7 @@ Nodes to a set of tasks. In this document "Memory Node" refers to
 an on-line node that contains memory.
 
 Cpusets constrain the CPU and Memory placement of tasks to only
-the resources within a tasks current cpuset.  They form a nested
+the resources within a task's current cpuset.  They form a nested
 hierarchy visible in a virtual file system.  These are the essential
 hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.
@@ -53,11 +53,11 @@ Documentation/cgroups/cgroups.txt.
 Requests by a task, using the sched_setaffinity(2) system call to
 include CPUs in its CPU affinity mask, and using the mbind(2) and
 set_mempolicy(2) system calls to include Memory Nodes in its memory
-policy, are both filtered through that tasks cpuset, filtering out any
+policy, are both filtered through that task's cpuset, filtering out any
 CPUs or Memory Nodes not in that cpuset.  The scheduler will not
 schedule a task on a CPU that is not allowed in its cpus_allowed
 vector, and the kernel page allocator will not allocate a page on a
-node that is not allowed in the requesting tasks mems_allowed vector.
+node that is not allowed in the requesting task's mems_allowed vector.
 
 User level code may create and destroy cpusets by name in the cgroup
 virtual file system, manage the attributes and permissions of these
@@ -121,9 +121,9 @@ Cpusets extends these two mechanisms as follows:
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
-  allowed in that tasks cpuset.
+  allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
-  those Memory Nodes allowed in that tasks cpuset.
+  those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
@@ -141,11 +141,11 @@ into the rest of the kernel, none in performance critical paths:
 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
-  allowed in that tasks cpuset.
+  allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
-  Memory Nodes by what's allowed in that tasks cpuset.
+  Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.
151 | 151 | ||
@@ -155,7 +155,7 @@ new system calls are added for cpusets - all support for querying and | |||
155 | modifying cpusets is via this cpuset file system. | 155 | modifying cpusets is via this cpuset file system. |
156 | 156 | ||
157 | The /proc/<pid>/status file for each task has four added lines, | 157 | The /proc/<pid>/status file for each task has four added lines, |
158 | displaying the tasks cpus_allowed (on which CPUs it may be scheduled) | 158 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) |
159 | and mems_allowed (on which Memory Nodes it may obtain memory), | 159 | and mems_allowed (on which Memory Nodes it may obtain memory), |
160 | in the two formats seen in the following example: | 160 | in the two formats seen in the following example: |
161 | 161 | ||
@@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system
 containing (on top of the standard cgroup files) the following
 files describing that cpuset:
 
- - cpus: list of CPUs in that cpuset
- - mems: list of Memory Nodes in that cpuset
- - memory_migrate flag: if set, move pages to cpusets nodes
- - cpu_exclusive flag: is cpu placement exclusive?
- - mem_exclusive flag: is memory placement exclusive?
- - mem_hardwall flag:  is memory allocation hardwalled
- - memory_pressure: measure of how much paging pressure in cpuset
- - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- - sched_load_balance flag: if set, load balance within CPUs on that cpuset
- - sched_relax_domain_level: the searching range when migrating tasks
+ - cpuset.cpus: list of CPUs in that cpuset
+ - cpuset.mems: list of Memory Nodes in that cpuset
+ - cpuset.memory_migrate flag: if set, move pages to cpusets nodes
+ - cpuset.cpu_exclusive flag: is cpu placement exclusive?
+ - cpuset.mem_exclusive flag: is memory placement exclusive?
+ - cpuset.mem_hardwall flag:  is memory allocation hardwalled
+ - cpuset.memory_pressure: measure of how much paging pressure in cpuset
+ - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - cpuset.sched_relax_domain_level: the searching range when migrating tasks
 
 In addition, the root cpuset only has the following file:
- - memory_pressure_enabled flag: compute memory_pressure?
+ - cpuset.memory_pressure_enabled flag: compute memory_pressure?
 
 New cpusets are created using the mkdir system call or shell
 command.  The properties of a cpuset, such as its flags, allowed
@@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than
 a direct ancestor or descendant, may share any of the same CPUs or
 Memory Nodes.
 
-A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
+A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
 i.e. it restricts kernel allocations for page, buffer and other data
 commonly shared by the kernel across multiple users.  All cpusets,
 whether hardwalled or not, restrict allocations of memory for user
@@ -304,15 +304,15 @@ times 1000.
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
-kernel data structures.  They are called 'memory_spread_page' and
-'memory_spread_slab'.
+kernel data structures.  They are called 'cpuset.memory_spread_page' and
+'cpuset.memory_spread_slab'.
 
-If the per-cpuset boolean flag file 'memory_spread_page' is set, then
+If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
 the kernel will spread the file system buffers (page cache) evenly
 over all the nodes that the faulting task is allowed to use, instead
 of preferring to put those pages on the node where the task is running.
 
-If the per-cpuset boolean flag file 'memory_spread_slab' is set,
+If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
 then the kernel will spread some file system related slab caches,
 such as for inodes and dentries, evenly over all the nodes that the
 faulting task is allowed to use, instead of preferring to put those
@@ -323,41 +323,41 @@ stack segment pages of a task.
 
 By default, both kinds of memory spreading are off, and memory
 pages are allocated on the node local to where the task is running,
-except perhaps as modified by the tasks NUMA mempolicy or cpuset
+except perhaps as modified by the task's NUMA mempolicy or cpuset
 configuration, so long as sufficient free memory pages are available.
 
 When new cpusets are created, they inherit the memory spread settings
 of their parent.
 
 Setting memory spreading causes allocations for the affected page
-or slab caches to ignore the tasks NUMA mempolicy and be spread
+or slab caches to ignore the task's NUMA mempolicy and be spread
 instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
 mempolicies will not notice any change in these calls as a result of
-their containing tasks memory spread settings.  If memory spreading
+their containing task's memory spread settings.  If memory spreading
 is turned off, then the currently specified NUMA mempolicy once again
 applies to memory page allocations.
 
-Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
+Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
 files.  By default they contain "0", meaning that the feature is off
 for that cpuset.  If a "1" is written to that file, then that turns
 the named feature on.
 
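For example, to turn page-cache spreading on from within a cpuset
directory:

	# /bin/echo 1 > cpuset.memory_spread_page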
 The implementation is simple.
 
-Setting the flag 'memory_spread_page' turns on a per-process flag
+Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
 PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
 joins that cpuset.  The page allocation calls for the page cache
 are modified to perform an inline check for this PF_SPREAD_PAGE task
 flag, and if set, a call to a new routine cpuset_mem_spread_node()
 returns the node to prefer for the allocation.
 
-Similarly, setting 'memory_spread_slab' turns on the flag
+Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
 PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
 pages from the node returned by cpuset_mem_spread_node().
 
 The cpuset_mem_spread_node() routine is also simple.  It uses the
 value of a per-task rotor cpuset_mem_spread_rotor to select the next
-node in the current tasks mems_allowed to prefer for the allocation.
+node in the current task's mems_allowed to prefer for the allocation.
 
 This memory placement policy is also known (in other contexts) as
 round-robin or interleave.
@@ -404,24 +404,24 @@ the following two situations:
     system overhead on those CPUs, including avoiding task load
     balancing if that is not needed.
 
-When the per-cpuset flag "sched_load_balance" is enabled (the default
-setting), it requests that all the CPUs in that cpusets allowed 'cpus'
+When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
 be contained in a single sched domain, ensuring that load balancing
 can move a task (not otherwise pinned, as by sched_setaffinity)
 from any CPU in that cpuset to any other.
 
-When the per-cpuset flag "sched_load_balance" is disabled, then the
+When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
 scheduler will avoid load balancing across the CPUs in that cpuset,
 --except-- in so far as is necessary because some overlapping cpuset
 has "sched_load_balance" enabled.
 
-So, for example, if the top cpuset has the flag "sched_load_balance"
+So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
 enabled, then the scheduler will have one sched domain covering all
-CPUs, and the setting of the "sched_load_balance" flag in any other
+CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
 cpusets won't matter, as we're already fully load balancing.
 
 Therefore in the above two situations, the top cpuset flag
-"sched_load_balance" should be disabled, and only some of the smaller,
+"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
 child cpusets have this flag enabled.
 
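For example, to keep load balancing only inside a child cpuset such as
the Charlie cpuset used in the examples later in this document:

	# /bin/echo 0 > /dev/cpuset/cpuset.sched_load_balance
	# /bin/echo 1 > /dev/cpuset/Charlie/cpuset.sched_load_balance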
 When doing this, you don't usually want to leave any unpinned tasks in
@@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that
 task to that underused CPU.
 
 Of course, tasks pinned to a particular CPU can be left in a cpuset
-that disables "sched_load_balance" as those tasks aren't going anywhere
+that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
 else anyway.
 
 There is an impedance mismatch here, between cpusets and sched domains.
@@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain.
 It is necessary for sched domains to be flat because load balancing
 across partially overlapping sets of CPUs would risk unstable dynamics
 that would be beyond our understanding.  So if each of two partially
-overlapping cpusets enables the flag 'sched_load_balance', then we
+overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
 form a single sched domain that is a superset of both.  We won't move
 a task to a CPU outside its cpuset, but the scheduler load balancing
 code might waste some compute cycles considering that possibility.
 
 This mismatch is why there is not a simple one-to-one relation
-between which cpusets have the flag "sched_load_balance" enabled,
+between which cpusets have the flag "cpuset.sched_load_balance" enabled,
 and the sched domain configuration.  If a cpuset enables the flag, it
 will get balancing across all its CPUs, but if it disables the flag,
 it will only be assured of no load balancing if no other overlapping
 cpuset enables the flag.
 
-If two cpusets have partially overlapping 'cpus' allowed, and only
+If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
 one of them has this flag enabled, then the other may find its
 tasks only partially load balanced, just on the overlapping CPUs.
 This is just the general case of the top_cpuset example given a few
@@ -468,23 +468,23 @@ load balancing to the other CPUs.
 1.7.1 sched_load_balance implementation details.
 ------------------------------------------------
 
-The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
 to most cpuset flags.)  When enabled for a cpuset, the kernel will
 ensure that it can load balance across all the CPUs in that cpuset
 (makes sure that all the CPUs in the cpus_allowed of that cpuset are
 in the same sched domain.)
 
-If two overlapping cpusets both have 'sched_load_balance' enabled,
+If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
 then they will be (must be) both in the same sched domain.
 
-If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
 then by the above that means there is a single sched domain covering
 the whole system, regardless of any other cpuset settings.
 
 The kernel commits to user space that it will avoid load balancing
 where it can.  It will pick as fine a granularity partition of sched
 domains as it can while still providing load balancing for any set
-of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
 
 The internal kernel cpuset to scheduler interface passes from the
 cpuset code to the scheduler code a partition of the load balanced
@@ -495,9 +495,9 @@ all the CPUs that must be load balanced.
 The cpuset code builds a new such partition and passes it to the
 scheduler sched domain setup code, to have the sched domains rebuilt
 as necessary, whenever:
- - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
  - or CPUs come or go from a cpuset with this flag enabled,
- - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
+ - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
    and with this flag enabled changes,
  - or a cpuset with non-empty CPUs and with this flag enabled is removed,
  - or a cpu is offlined/onlined.
@@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance
 on the next tick.  For some applications in special situation, waiting
 1 tick may be too long.
 
-The 'sched_relax_domain_level' file allows you to request changing
+The 'cpuset.sched_relax_domain_level' file allows you to request changing
 this searching range as you like.  This file takes an int value which
 indicates the size of the searching range in levels, ideally as follows;
 otherwise the initial value -1 indicates the cpuset has no request.
@@ -559,8 +559,8 @@ The system default is architecture dependent.  The system default
 can be changed using the relax_domain_level= boot parameter.
 
 This file is per-cpuset and affects the sched domain where the cpuset
-belongs.  Therefore if the flag 'sched_load_balance' of a cpuset
-is disabled, then 'sched_relax_domain_level' have no effect since
+belongs.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
+is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
 there is no sched domain belonging to the cpuset.
 
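For example, to request a wider searching range in a cpuset; the value 2
is illustrative, and the meaning of each level is given by the table in
this file just above this paragraph:

	# /bin/echo 2 > cpuset.sched_relax_domain_level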
 If multiple cpusets are overlapping and hence they form a single sched
@@ -594,7 +594,7 @@ is attached, is subtle.
 If a cpuset has its Memory Nodes modified, then for each task attached
 to that cpuset, the next time that the kernel attempts to allocate
 a page of memory for that task, the kernel will notice the change
-in the tasks cpuset, and update its per-task memory placement to
+in the task's cpuset, and update its per-task memory placement to
 remain within the new cpuset's memory placement.  If the task was using
 mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
 its new cpuset, then the task will continue to use whatever subset
@@ -603,13 +603,13 @@ was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
 in the new cpuset, then the task will be essentially treated as if it
 was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
 as queried by get_mempolicy(), doesn't change).  If a task is moved
-from one cpuset to another, then the kernel will adjust the tasks
+from one cpuset to another, then the kernel will adjust the task's
 memory placement, as above, the next time that the kernel attempts
 to allocate a page of memory for that task.
 
-If a cpuset has its 'cpus' modified, then each task in that cpuset
+If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
 will have its allowed CPU placement changed immediately.  Similarly,
-if a tasks pid is written to another cpusets 'tasks' file, then its
+if a task's pid is written to another cpuset's 'tasks' file, then its
 allowed CPU placement is changed immediately.  If such a task had been
 bound to some subset of its cpuset using the sched_setaffinity() call,
 the task will be allowed to run on any CPU allowed in its new cpuset,
@@ -622,21 +622,21 @@ and the processor placement is updated immediately.
 Normally, once a page is allocated (given a physical page
 of main memory) then that page stays on whatever node it
 was allocated, so long as it remains allocated, even if the
-cpusets memory placement policy 'mems' subsequently changes.
-If the cpuset flag file 'memory_migrate' is set true, then when
+cpuset's memory placement policy 'cpuset.mems' subsequently changes.
+If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
 tasks are attached to that cpuset, any pages that task had
 allocated to it on nodes in its previous cpuset are migrated
-to the tasks new cpuset. The relative placement of the page within
+to the task's new cpuset.  The relative placement of the page within
 the cpuset is preserved during these migration operations if possible.
 For example if the page was on the second valid node of the prior cpuset
 then the page will be placed on the second valid node of the new cpuset.
 
-Also if 'memory_migrate' is set true, then if that cpusets
-'mems' file is modified, pages allocated to tasks in that
-cpuset, that were on nodes in the previous setting of 'mems',
+Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
+'cpuset.mems' file is modified, pages allocated to tasks in that
+cpuset, that were on nodes in the previous setting of 'cpuset.mems',
 will be moved to nodes in the new setting of 'cpuset.mems'.
-Pages that were not in the tasks prior cpuset, or in the cpusets
-prior 'mems' setting, will not be moved.
+Pages that were not in the task's prior cpuset, or in the cpuset's
+prior 'cpuset.mems' setting, will not be moved.
 
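For example, to have a task's pages follow it into this cpuset, a
sketch, where <pid> stands for the task being moved:

	# /bin/echo 1 > cpuset.memory_migrate
	# /bin/echo <pid> > tasks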
 There is an exception to the above.  If hotplug functionality is used
 to remove all the CPUs that are currently assigned to a cpuset,
@@ -655,7 +655,7 @@ There is a second exception to the above. GFP_ATOMIC requests are
 kernel internal allocations that must be satisfied, immediately.
 The kernel may drop some request, in rare cases even panic, if a
 GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
-the current tasks cpuset, then we relax the cpuset, and look for
+the current task's cpuset, then we relax the cpuset, and look for
 memory anywhere we can find it.  It's better to violate the cpuset
 than stress the kernel.
 
@@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset:
   cd /dev/cpuset
   mkdir Charlie
   cd Charlie
-  /bin/echo 2-3 > cpus
-  /bin/echo 1 > mems
+  /bin/echo 2-3 > cpuset.cpus
+  /bin/echo 1 > cpuset.mems
   /bin/echo $$ > tasks
   sh
   # The subshell 'sh' is now running in cpuset Charlie
@@ -725,10 +725,13 @@ Now you want to do something with this cpuset.
 
 In this directory you can find several files:
 # ls
-cpu_exclusive  memory_migrate      mems                      tasks
-cpus           memory_pressure     notify_on_release
-mem_exclusive  memory_spread_page  sched_load_balance
-mem_hardwall   memory_spread_slab  sched_relax_domain_level
+cpuset.cpu_exclusive  cpuset.memory_spread_slab
+cpuset.cpus           cpuset.mems
+cpuset.mem_exclusive  cpuset.sched_load_balance
+cpuset.mem_hardwall   cpuset.sched_relax_domain_level
+cpuset.memory_migrate notify_on_release
+cpuset.memory_pressure tasks
+cpuset.memory_spread_page
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
@@ -736,13 +739,13 @@ it, its properties. By writing to these files you can manipulate | |||
736 | the cpuset. | 739 | the cpuset. |
737 | 740 | ||
738 | Set some flags: | 741 | Set some flags: |
739 | # /bin/echo 1 > cpu_exclusive | 742 | # /bin/echo 1 > cpuset.cpu_exclusive |
740 | 743 | ||
741 | Add some cpus: | 744 | Add some cpus: |
742 | # /bin/echo 0-7 > cpus | 745 | # /bin/echo 0-7 > cpuset.cpus |
743 | 746 | ||
744 | Add some mems: | 747 | Add some mems: |
745 | # /bin/echo 0-7 > mems | 748 | # /bin/echo 0-7 > cpuset.mems |
746 | 749 | ||
747 | Now attach your shell to this cpuset: | 750 | Now attach your shell to this cpuset: |
748 | # /bin/echo $$ > tasks | 751 | # /bin/echo $$ > tasks |
@@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | |||
774 | This is the syntax to use when writing in the cpus or mems files | 777 | This is the syntax to use when writing in the cpus or mems files |
775 | in cpuset directories: | 778 | in cpuset directories: |
776 | 779 | ||
777 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | 780 | # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 |
778 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | 781 | # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 |
779 | 782 | ||
780 | To add a CPU to a cpuset, write the new list of CPUs including the | 783 | To add a CPU to a cpuset, write the new list of CPUs including the |
781 | CPU to be added. To add 6 to the above cpuset: | 784 | CPU to be added. To add 6 to the above cpuset: |
782 | 785 | ||
783 | # /bin/echo 1-4,6 > cpus -> set cpus list to cpus 1,2,3,4,6 | 786 | # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 |
784 | 787 | ||
785 | Similarly to remove a CPU from a cpuset, write the new list of CPUs | 788 | Similarly to remove a CPU from a cpuset, write the new list of CPUs |
786 | without the CPU to be removed. | 789 | without the CPU to be removed. |
787 | 790 | ||
788 | To remove all the CPUs: | 791 | To remove all the CPUs: |
789 | 792 | ||
790 | # /bin/echo "" > cpus -> clear cpus list | 793 | # /bin/echo "" > cpuset.cpus -> clear cpus list |
791 | 794 | ||
792 | 2.3 Setting flags | 795 | 2.3 Setting flags |
793 | ----------------- | 796 | ----------------- |
794 | 797 | ||
795 | The syntax is very simple: | 798 | The syntax is very simple: |
796 | 799 | ||
797 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | 800 | # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' |
798 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | 801 | # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' |
799 | 802 | ||
800 | 2.4 Attaching processes | 803 | 2.4 Attaching processes |
801 | ----------------------- | 804 | ----------------------- |
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index 72db89ed0609..b7eececfb195 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt | |||
@@ -1,6 +1,6 @@ | |||
1 | Memory Resource Controller(Memcg) Implementation Memo. | 1 | Memory Resource Controller(Memcg) Implementation Memo. |
2 | Last Updated: 2009/1/20 | 2 | Last Updated: 2010/2 |
3 | Base Kernel Version: based on 2.6.29-rc2. | 3 | Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34). |
4 | 4 | ||
5 | Because VM is getting complex (one of reasons is memcg...), memcg's behavior | 5 | Because VM is getting complex (one of reasons is memcg...), memcg's behavior |
6 | is complex. This is a document for memcg's internal behavior. | 6 | is complex. This is a document for memcg's internal behavior. |
@@ -244,7 +244,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
244 | we have to check if OLDPAGE/NEWPAGE is a valid page after commit(). | 244 | we have to check if OLDPAGE/NEWPAGE is a valid page after commit(). |
245 | 245 | ||
246 | 8. LRU | 246 | 8. LRU |
247 | Each memcg has its own private LRU. Now, it's handling is under global | 247 | Each memcg has its own private LRU. Now, its handling is under global |
248 | VM's control (meaning it's handled under the global zone->lru_lock). | 248 | VM's control (meaning it's handled under the global zone->lru_lock). |
249 | Almost all routines around memcg's LRU are called by global LRU's | 249 | Almost all routines around memcg's LRU are called by global LRU's |
250 | list management functions under zone->lru_lock. | 250 | list management functions under zone->lru_lock. |
@@ -337,7 +337,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
337 | race and lock dependency with other cgroup subsystems. | 337 | race and lock dependency with other cgroup subsystems. |
338 | 338 | ||
339 | example) | 339 | example) |
340 | # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices | 340 | # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices |
341 | 341 | ||
342 | and do task move, mkdir, rmdir etc...under this. | 342 | and do task move, mkdir, rmdir etc...under this. |
343 | 343 | ||
@@ -348,7 +348,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
348 | 348 | ||
349 | For example, test like following is good. | 349 | For example, test like following is good. |
350 | (Shell-A) | 350 | (Shell-A) |
351 | # mount -t cgroup none /cgroup -t memory | 351 | # mount -t cgroup none /cgroup -o memory |
352 | # mkdir /cgroup/test | 352 | # mkdir /cgroup/test |
353 | # echo 40M > /cgroup/test/memory.limit_in_bytes | 353 | # echo 40M > /cgroup/test/memory.limit_in_bytes |
354 | # echo 0 > /cgroup/test/tasks | 354 | # echo 0 > /cgroup/test/tasks |
@@ -378,3 +378,42 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
378 | #echo 50M > memory.limit_in_bytes | 378 | #echo 50M > memory.limit_in_bytes |
379 | #echo 50M > memory.memsw.limit_in_bytes | 379 | #echo 50M > memory.memsw.limit_in_bytes |
380 | run 51M of malloc | 380 | run 51M of malloc |
381 | |||
382 | 9.9 Move charges at task migration | ||
383 | Charges associated with a task can be moved along with task migration. | ||
384 | |||
385 | (Shell-A) | ||
386 | #mkdir /cgroup/A | ||
387 | #echo $$ >/cgroup/A/tasks | ||
388 | run some programs which use some amount of memory in /cgroup/A. | ||
389 | |||
390 | (Shell-B) | ||
391 | #mkdir /cgroup/B | ||
392 | #echo 1 >/cgroup/B/memory.move_charge_at_immigrate | ||
393 | #echo "pid of the program running in group A" >/cgroup/B/tasks | ||
394 | |||
395 | You can see charges have been moved by reading *.usage_in_bytes or | ||
396 | memory.stat of both A and B. | ||
397 | See 8.2 of Documentation/cgroups/memory.txt for the value that should be | ||
398 | written to move_charge_at_immigrate. | ||
399 | |||
400 | 9.10 Memory thresholds | ||
401 | The memory controller implements memory thresholds using the cgroups | ||
402 | notification API. You can use Documentation/cgroups/cgroup_event_listener.c | ||
403 | to test it. | ||
404 | |||
405 | (Shell-A) Create cgroup and run event listener | ||
406 | # mkdir /cgroup/A | ||
407 | # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M | ||
408 | |||
409 | (Shell-B) Add task to cgroup and try to allocate and free memory | ||
410 | # echo $$ >/cgroup/A/tasks | ||
411 | # a="$(dd if=/dev/zero bs=1M count=10)" | ||
412 | # a= | ||
413 | |||
414 | You will see a message from cgroup_event_listener every time you cross | ||
415 | a threshold. | ||
416 | |||
417 | Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds. | ||
418 | |||
419 | It's a good idea to test the root cgroup as well. | ||
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index b871f2552b45..7781857dc940 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt | |||
@@ -1,18 +1,15 @@ | |||
1 | Memory Resource Controller | 1 | Memory Resource Controller |
2 | 2 | ||
3 | NOTE: The Memory Resource Controller has generically been referred | 3 | NOTE: The Memory Resource Controller has generically been referred |
4 | to as the memory controller in this document. Do not confuse memory controller | 4 | to as the memory controller in this document. Do not confuse memory |
5 | used here with the memory controller that is used in hardware. | 5 | controller used here with the memory controller that is used in hardware. |
6 | 6 | ||
7 | Salient features | 7 | (For editors) |
8 | 8 | In this document: | |
9 | a. Enable control of Anonymous, Page Cache (mapped and unmapped) and | 9 | When we mention a cgroup (cgroupfs's directory) with memory controller, |
10 | Swap Cache memory pages. | 10 | we call it "memory cgroup". When you see git-log and source code, you'll |
11 | b. The infrastructure allows easy addition of other types of memory to control | 11 | see patch's title and function names tend to use "memcg". |
12 | c. Provides *zero overhead* for non memory controller users | 12 | In this document, we avoid using it. |
13 | d. Provides a double LRU: global memory pressure causes reclaim from the | ||
14 | global LRU; a cgroup on hitting a limit, reclaims from the per | ||
15 | cgroup LRU | ||
16 | 13 | ||
17 | Benefits and Purpose of the memory controller | 14 | Benefits and Purpose of the memory controller |
18 | 15 | ||
@@ -33,6 +30,45 @@ d. A CD/DVD burner could control the amount of memory used by the | |||
33 | e. There are several other use cases, find one or use the controller just | 30 | e. There are several other use cases, find one or use the controller just |
34 | for fun (to learn and hack on the VM subsystem). | 31 | for fun (to learn and hack on the VM subsystem). |
35 | 32 | ||
33 | Current Status: linux-2.6.34-mmotm (development version of 2010/April) | ||
34 | |||
35 | Features: | ||
36 | - accounting anonymous page, file cache, and swap cache usage, and limiting them. | ||
37 | - private LRU and reclaim routine. (system's global LRU and private LRU | ||
38 | work independently from each other) | ||
39 | - optionally, memory+swap usage can be accounted and limited. | ||
40 | - hierarchical accounting | ||
41 | - soft limit | ||
42 | - moving (recharging) charges along with task migration is selectable. | ||
43 | - usage threshold notifier | ||
44 | - oom-killer disable knob and oom-notifier | ||
45 | - Root cgroup has no limit controls. | ||
46 | |||
47 | Kernel memory and Hugepages are not under control yet. We just manage | ||
48 | pages on LRU. To add more controls, we have to take care of performance. | ||
49 | |||
50 | Brief summary of control files. | ||
51 | |||
52 | tasks # attach a task(thread) and show list of threads | ||
53 | cgroup.procs # show list of processes | ||
54 | cgroup.event_control # an interface for event_fd() | ||
55 | memory.usage_in_bytes # show current memory(RSS+Cache) usage. | ||
56 | memory.memsw.usage_in_bytes # show current memory+Swap usage | ||
57 | memory.limit_in_bytes # set/show limit of memory usage | ||
58 | memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage | ||
59 | memory.failcnt # show the number of times memory usage hit limits | ||
60 | memory.memsw.failcnt # show the number of times memory+Swap usage hit limits | ||
61 | memory.max_usage_in_bytes # show max memory usage recorded | ||
62 | memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded | ||
63 | memory.soft_limit_in_bytes # set/show soft limit of memory usage | ||
64 | memory.stat # show various statistics | ||
65 | memory.use_hierarchy # set/show hierarchical account enabled | ||
66 | memory.force_empty # trigger forced move charge to parent | ||
67 | memory.swappiness # set/show swappiness parameter of vmscan | ||
68 | (See sysctl's vm.swappiness) | ||
69 | memory.move_charge_at_immigrate # set/show controls of moving charges | ||
70 | memory.oom_control # set/show oom controls. | ||
71 | |||
36 | 1. History | 72 | 1. History |
37 | 73 | ||
38 | The memory controller has a long history. A request for comments for the memory | 74 | The memory controller has a long history. A request for comments for the memory |
@@ -106,14 +142,14 @@ the necessary data structures and check if the cgroup that is being charged | |||
106 | is over its limit. If it is then reclaim is invoked on the cgroup. | 142 | is over its limit. If it is then reclaim is invoked on the cgroup. |
107 | More details can be found in the reclaim section of this document. | 143 | More details can be found in the reclaim section of this document. |
108 | If everything goes well, a page meta-data-structure called page_cgroup is | 144 | If everything goes well, a page meta-data-structure called page_cgroup is |
109 | allocated and associated with the page. This routine also adds the page to | 145 | updated. page_cgroup has its own LRU on the cgroup. |
110 | the per cgroup LRU. | 146 | (*) page_cgroup structure is allocated at boot/memory-hotplug time. |
111 | 147 | ||
112 | 2.2.1 Accounting details | 148 | 2.2.1 Accounting details |
113 | 149 | ||
114 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. | 150 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. |
115 | (some pages which never be reclaimable and will not be on global LRU | 151 | Some pages which are never reclaimable and will not be on the global LRU |
116 | are not accounted. we just accounts pages under usual vm management.) | 152 | are not accounted. We just account pages under usual VM management. |
117 | 153 | ||
118 | RSS pages are accounted at page_fault unless they've already been accounted | 154 | RSS pages are accounted at page_fault unless they've already been accounted |
119 | for earlier. A file page will be accounted for as Page Cache when it's | 155 | for earlier. A file page will be accounted for as Page Cache when it's |
@@ -121,12 +157,19 @@ inserted into inode (radix-tree). While it's mapped into the page tables of | |||
121 | processes, duplicate accounting is carefully avoided. | 157 | processes, duplicate accounting is carefully avoided. |
122 | 158 | ||
123 | A RSS page is unaccounted when it's fully unmapped. A PageCache page is | 159 | A RSS page is unaccounted when it's fully unmapped. A PageCache page is |
124 | unaccounted when it's removed from radix-tree. | 160 | unaccounted when it's removed from radix-tree. Even if RSS pages are fully |
161 | unmapped (by kswapd), they may exist as SwapCache in the system until they | ||
162 | are really freed. Such SwapCaches are also accounted. | ||
163 | A swapped-in page is not accounted until it's mapped. | ||
164 | |||
165 | Note: The kernel does swapin-readahead and reads multiple swap entries at once. | ||
166 | This means swapped-in pages may belong to tasks other than the one causing | ||
167 | the page fault. So, we avoid accounting at swap-in I/O. | ||
125 | 168 | ||
126 | At page migration, accounting information is kept. | 169 | At page migration, accounting information is kept. |
127 | 170 | ||
128 | Note: we just account pages-on-lru because our purpose is to control amount | 171 | Note: we just account pages-on-LRU because our purpose is to control amount |
129 | of used pages. not-on-lru pages are tend to be out-of-control from vm view. | 172 | of used pages; not-on-LRU pages tend to be out-of-control from VM view. |
130 | 173 | ||
131 | 2.3 Shared Page Accounting | 174 | 2.3 Shared Page Accounting |
132 | 175 | ||
@@ -143,6 +186,7 @@ caller of swapoff rather than the users of shmem. | |||
143 | 186 | ||
144 | 187 | ||
145 | 2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) | 188 | 2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) |
189 | |||
146 | Swap Extension allows you to record charge for swap. A swapped-in page is | 190 | Swap Extension allows you to record charge for swap. A swapped-in page is |
147 | charged back to original page allocator if possible. | 191 | charged back to original page allocator if possible. |
148 | 192 | ||
@@ -150,13 +194,20 @@ When swap is accounted, following files are added. | |||
150 | - memory.memsw.usage_in_bytes. | 194 | - memory.memsw.usage_in_bytes. |
151 | - memory.memsw.limit_in_bytes. | 195 | - memory.memsw.limit_in_bytes. |
152 | 196 | ||
153 | usage of mem+swap is limited by memsw.limit_in_bytes. | 197 | memsw means memory+swap. Usage of memory+swap is limited by |
198 | memsw.limit_in_bytes. | ||
154 | 199 | ||
155 | * why 'mem+swap' rather than swap. | 200 | Example: Assume a system with 4G of swap. A task which allocates 6G of memory |
201 | (by mistake) under a 2G memory limit will use up all of the swap. | ||
202 | In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. | ||
203 | By using the memsw limit, you can avoid a system OOM caused by swap | ||
204 | shortage. | ||
205 | |||
206 | * why 'memory+swap' rather than swap. | ||
156 | The global LRU(kswapd) can swap out arbitrary pages. Swap-out means | 207 | The global LRU(kswapd) can swap out arbitrary pages. Swap-out means |
157 | to move account from memory to swap...there is no change in usage of | 208 | to move account from memory to swap...there is no change in usage of |
158 | mem+swap. In other words, when we want to limit the usage of swap without | 209 | memory+swap. In other words, when we want to limit the usage of swap without |
159 | affecting global LRU, mem+swap limit is better than just limiting swap from | 210 | affecting global LRU, memory+swap limit is better than just limiting swap from |
160 | OS point of view. | 211 | OS point of view. |
161 | 212 | ||
162 | * What happens when a cgroup hits memory.memsw.limit_in_bytes | 213 | * What happens when a cgroup hits memory.memsw.limit_in_bytes |
@@ -168,12 +219,12 @@ it by cgroup. | |||
168 | 219 | ||
169 | 2.5 Reclaim | 220 | 2.5 Reclaim |
170 | 221 | ||
171 | Each cgroup maintains a per cgroup LRU that consists of an active | 222 | Each cgroup maintains a per cgroup LRU which has the same structure as |
172 | and inactive list. When a cgroup goes over its limit, we first try | 223 | global VM. When a cgroup goes over its limit, we first try |
173 | to reclaim memory from the cgroup so as to make space for the new | 224 | to reclaim memory from the cgroup so as to make space for the new |
174 | pages that the cgroup has touched. If the reclaim is unsuccessful, | 225 | pages that the cgroup has touched. If the reclaim is unsuccessful, |
175 | an OOM routine is invoked to select and kill the bulkiest task in the | 226 | an OOM routine is invoked to select and kill the bulkiest task in the |
176 | cgroup. | 227 | cgroup. (See 10. OOM Control below.) |
177 | 228 | ||
178 | The reclaim algorithm has not been modified for cgroups, except that | 229 | The reclaim algorithm has not been modified for cgroups, except that |
179 | pages that are selected for reclaiming come from the per cgroup LRU | 230 | pages that are selected for reclaiming come from the per cgroup LRU |
@@ -182,13 +233,24 @@ list. | |||
182 | NOTE: Reclaim does not work for the root cgroup, since we cannot set any | 233 | NOTE: Reclaim does not work for the root cgroup, since we cannot set any |
183 | limits on the root cgroup. | 234 | limits on the root cgroup. |
184 | 235 | ||
185 | 2. Locking | 236 | Note2: When panic_on_oom is set to "2", the whole system will panic. |
237 | |||
238 | When an oom event notifier is registered, the event will be delivered. | ||
239 | (See oom_control section) | ||
240 | |||
241 | 2.6 Locking | ||
186 | 242 | ||
187 | The memory controller uses the following hierarchy | 243 | lock_page_cgroup()/unlock_page_cgroup() should not be called under |
244 | mapping->tree_lock. | ||
188 | 245 | ||
189 | 1. zone->lru_lock is used for selecting pages to be isolated | 246 | The other lock ordering is as follows: |
190 | 2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) | 247 | PG_locked. |
191 | 3. lock_page_cgroup() is used to protect page->page_cgroup | 248 | mm->page_table_lock |
249 | zone->lru_lock | ||
250 | lock_page_cgroup. | ||
251 | In many cases, just lock_page_cgroup() is called. | ||
252 | The per-zone-per-cgroup LRU (the cgroup's private LRU) is guarded only by | ||
253 | zone->lru_lock; it has no lock of its own. | ||
192 | 254 | ||
193 | 3. User Interface | 255 | 3. User Interface |
194 | 256 | ||
@@ -197,6 +259,7 @@ The memory controller uses the following hierarchy | |||
197 | a. Enable CONFIG_CGROUPS | 259 | a. Enable CONFIG_CGROUPS |
198 | b. Enable CONFIG_RESOURCE_COUNTERS | 260 | b. Enable CONFIG_RESOURCE_COUNTERS |
199 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR | 261 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR |
262 | d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension) | ||
200 | 263 | ||
201 | 1. Prepare the cgroups | 264 | 1. Prepare the cgroups |
202 | # mkdir -p /cgroups | 265 | # mkdir -p /cgroups |
@@ -204,31 +267,28 @@ c. Enable CONFIG_CGROUP_MEM_RES_CTLR | |||
204 | 267 | ||
205 | 2. Make the new group and move bash into it | 268 | 2. Make the new group and move bash into it |
206 | # mkdir /cgroups/0 | 269 | # mkdir /cgroups/0 |
207 | # echo $$ > /cgroups/0/tasks | 270 | # echo $$ > /cgroups/0/tasks |
208 | 271 | ||
209 | Since now we're in the 0 cgroup, | 272 | Since now we're in the 0 cgroup, we can alter the memory limit: |
210 | We can alter the memory limit: | ||
211 | # echo 4M > /cgroups/0/memory.limit_in_bytes | 273 | # echo 4M > /cgroups/0/memory.limit_in_bytes |
212 | 274 | ||
213 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, | 275 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, |
214 | mega or gigabytes. | 276 | mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.) |
277 | |||
215 | NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). | 278 | NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). |
216 | NOTE: We cannot set limits on the root cgroup any more. | 279 | NOTE: We cannot set limits on the root cgroup any more. |
217 | 280 | ||
218 | # cat /cgroups/0/memory.limit_in_bytes | 281 | # cat /cgroups/0/memory.limit_in_bytes |
219 | 4194304 | 282 | 4194304 |
220 | 283 | ||
221 | NOTE: The interface has now changed to display the usage in bytes | ||
222 | instead of pages | ||
223 | |||
224 | We can check the usage: | 284 | We can check the usage: |
225 | # cat /cgroups/0/memory.usage_in_bytes | 285 | # cat /cgroups/0/memory.usage_in_bytes |
226 | 1216512 | 286 | 1216512 |
227 | 287 | ||
228 | A successful write to this file does not guarantee that the limit is set | 288 | A successful write to this file does not guarantee that the limit is set |
229 | to the value written into the file. This can be due to a | 289 | to the value written into the file. This can be due to a |
230 | number of factors, such as rounding up to page boundaries or the total | 290 | number of factors, such as rounding up to page boundaries or the total |
231 | availability of memory on the system. The user is required to re-read | 291 | availability of memory on the system. The user is required to re-read |
232 | this file after a write to guarantee the value committed by the kernel. | 292 | this file after a write to guarantee the value committed by the kernel. |
233 | 293 | ||
234 | # echo 1 > memory.limit_in_bytes | 294 | # echo 1 > memory.limit_in_bytes |
@@ -243,15 +303,23 @@ caches, RSS and Active pages/Inactive pages are shown. | |||
243 | 303 | ||
244 | 4. Testing | 304 | 4. Testing |
245 | 305 | ||
246 | Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | 306 | For testing features and implementation, see memcg_test.txt. |
247 | Apart from that v6 has been tested with several applications and regular | 307 | |
248 | daily use. The controller has also been tested on the PPC64, x86_64 and | 308 | Performance testing is also important. To see the pure memory controller's |
249 | UML platforms. | 309 | overhead, testing on tmpfs will give you good numbers for its small overhead. |
310 | Example: do a kernel make on tmpfs. | ||
311 | |||
312 | Page-fault scalability is also important. When measuring parallel | ||
313 | page faults, a multi-process test may be better than a multi-thread | ||
314 | test because the latter adds noise from shared objects/status. | ||
315 | |||
316 | But the above two test extreme situations. | ||
317 | Running your usual tests under the memory controller is always helpful. | ||
250 | 318 | ||
251 | 4.1 Troubleshooting | 319 | 4.1 Troubleshooting |
252 | 320 | ||
253 | Sometimes a user might find that the application under a cgroup is | 321 | Sometimes a user might find that the application under a cgroup is |
254 | terminated. There are several causes for this: | 322 | terminated by the OOM killer. There are several causes for this: |
255 | 323 | ||
256 | 1. The cgroup limit is too low (just too low to do anything useful) | 324 | 1. The cgroup limit is too low (just too low to do anything useful) |
257 | 2. The user is using anonymous memory and swap is turned off or too low | 325 | 2. The user is using anonymous memory and swap is turned off or too low |
@@ -259,21 +327,29 @@ terminated. There are several causes for this: | |||
259 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of | 327 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of |
260 | some of the pages cached in the cgroup (page cache pages). | 328 | some of the pages cached in the cgroup (page cache pages). |
261 | 329 | ||
330 | To know what happens, disabling the OOM killer (see 10. OOM Control below) | ||
331 | and observing what happens will be helpful. | ||
332 | |||
262 | 4.2 Task migration | 333 | 4.2 Task migration |
263 | 334 | ||
264 | When a task migrates from one cgroup to another, it's charge is not | 335 | When a task migrates from one cgroup to another, its charge is not |
265 | carried forward. The pages allocated from the original cgroup still | 336 | carried forward by default. The pages allocated from the original cgroup still |
266 | remain charged to it, the charge is dropped when the page is freed or | 337 | remain charged to it, the charge is dropped when the page is freed or |
267 | reclaimed. | 338 | reclaimed. |
268 | 339 | ||
340 | You can move charges of a task along with task migration. | ||
341 | See 8. "Move charges at task migration" | ||
342 | |||
269 | 4.3 Removing a cgroup | 343 | 4.3 Removing a cgroup |
270 | 344 | ||
271 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | 345 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a |
272 | cgroup might have some charge associated with it, even though all | 346 | cgroup might have some charge associated with it, even though all |
273 | tasks have migrated away from it. | 347 | tasks have migrated away from it. (This is because we charge against pages, |
274 | Such charges are freed(at default) or moved to its parent. When moved, | 348 | not against tasks.) |
275 | both of RSS and CACHES are moved to parent. | 349 | |
276 | If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also. | 350 | Such charges are freed or moved to the parent. When moved, both RSS |
351 | and CACHES are moved to the parent. | ||
352 | rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also. | ||
277 | 353 | ||
278 | Charges recorded in swap information are not updated at removal of a cgroup. | 354 | Charges recorded in swap information are not updated at removal of a cgroup. |
279 | Recorded information is discarded and a cgroup which uses swap (swapcache) | 355 | Recorded information is discarded and a cgroup which uses swap (swapcache) |
@@ -289,10 +365,10 @@ will be charged as a new owner of it. | |||
289 | 365 | ||
290 | # echo 0 > memory.force_empty | 366 | # echo 0 > memory.force_empty |
291 | 367 | ||
292 | Almost all pages tracked by this memcg will be unmapped and freed. Some of | 368 | Almost all pages tracked by this memory cgroup will be unmapped and freed. |
293 | pages cannot be freed because it's locked or in-use. Such pages are moved | 369 | Some pages cannot be freed because they are locked or in-use. Such pages are |
294 | to parent and this cgroup will be empty. But this may return -EBUSY in | 370 | moved to parent and this cgroup will be empty. This may return -EBUSY if |
295 | some too busy case. | 371 | VM is too busy to free/move all pages immediately. |
296 | 372 | ||
297 | The typical use case of this interface is to call it before rmdir(). | 373 | The typical use case of this interface is to call it before rmdir(). |
298 | Because rmdir() moves all pages to parent, some out-of-use page caches can be | 374 | Because rmdir() moves all pages to parent, some out-of-use page caches can be |
@@ -302,19 +378,41 @@ will be charged as a new owner of it. | |||
302 | 378 | ||
303 | memory.stat file includes following statistics | 379 | memory.stat file includes following statistics |
304 | 380 | ||
381 | # per-memory cgroup local status | ||
305 | cache - # of bytes of page cache memory. | 382 | cache - # of bytes of page cache memory. |
306 | rss - # of bytes of anonymous and swap cache memory. | 383 | rss - # of bytes of anonymous and swap cache memory. |
384 | mapped_file - # of bytes of mapped file (includes tmpfs/shmem) | ||
307 | pgpgin - # of pages paged in (equivalent to # of charging events). | 385 | pgpgin - # of pages paged in (equivalent to # of charging events). |
308 | pgpgout - # of pages paged out (equivalent to # of uncharging events). | 386 | pgpgout - # of pages paged out (equivalent to # of uncharging events). |
309 | active_anon - # of bytes of anonymous and swap cache memory on active | 387 | swap - # of bytes of swap usage |
310 | lru list. | ||
311 | inactive_anon - # of bytes of anonymous memory and swap cache memory on | 388 | inactive_anon - # of bytes of anonymous memory and swap cache memory on |
312 | inactive lru list. | 389 | inactive LRU list. |
313 | active_file - # of bytes of file-backed memory on active lru list. | 390 | active_anon - # of bytes of anonymous and swap cache memory on active |
314 | inactive_file - # of bytes of file-backed memory on inactive lru list. | 391 | LRU list. |
392 | inactive_file - # of bytes of file-backed memory on inactive LRU list. | ||
393 | active_file - # of bytes of file-backed memory on active LRU list. | ||
315 | unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). | 394 | unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). |
316 | 395 | ||
317 | The following additional stats are dependent on CONFIG_DEBUG_VM. | 396 | # status considering hierarchy (see memory.use_hierarchy settings) |
397 | |||
398 | hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy | ||
399 | under which the memory cgroup is. | ||
400 | hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to | ||
401 | hierarchy under which the memory cgroup is. | ||
402 | |||
403 | total_cache - sum of all children's "cache" | ||
404 | total_rss - sum of all children's "rss" | ||
405 | total_mapped_file - sum of all children's "mapped_file" | ||
406 | total_pgpgin - sum of all children's "pgpgin" | ||
407 | total_pgpgout - sum of all children's "pgpgout" | ||
408 | total_swap - sum of all children's "swap" | ||
409 | total_inactive_anon - sum of all children's "inactive_anon" | ||
410 | total_active_anon - sum of all children's "active_anon" | ||
411 | total_inactive_file - sum of all children's "inactive_file" | ||
412 | total_active_file - sum of all children's "active_file" | ||
413 | total_unevictable - sum of all children's "unevictable" | ||
414 | |||
415 | # The following additional stats are dependent on CONFIG_DEBUG_VM. | ||
318 | 416 | ||
319 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) | 417 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) |
320 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | 418 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) |
@@ -323,24 +421,37 @@ recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | |||
323 | recent_scanned_file - VM internal parameter. (see mm/vmscan.c) | 421 | recent_scanned_file - VM internal parameter. (see mm/vmscan.c) |
324 | 422 | ||
325 | Memo: | 423 | Memo: |
326 | recent_rotated means recent frequency of lru rotation. | 424 | recent_rotated means recent frequency of LRU rotation. |
327 | recent_scanned means recent # of scans to lru. | 425 | recent_scanned means recent # of scans to LRU. |
328 | These are shown for better debugging; please see the code for their meanings. | 426 | These are shown for better debugging; please see the code for their meanings. |
329 | 427 | ||
330 | Note: | 428 | Note: |
331 | Only anonymous and swap cache memory is listed as part of 'rss' stat. | 429 | Only anonymous and swap cache memory is listed as part of 'rss' stat. |
332 | This should not be confused with the true 'resident set size' or the | 430 | This should not be confused with the true 'resident set size' or the |
333 | amount of physical memory used by the cgroup. Per-cgroup rss | 431 | amount of physical memory used by the cgroup. |
334 | accounting is not done yet. | 432 | 'rss + mapped_file' will give you the resident set size of the cgroup. |
433 | (Note: file and shmem may be shared among other cgroups. In that case, | ||
434 | mapped_file is accounted only when the memory cgroup is the owner of the page | ||
435 | cache.) | ||
335 | 436 | ||
336 | 5.3 swappiness | 437 | 5.3 swappiness |
337 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | ||
338 | 438 | ||
339 | Following cgroups' swapiness can't be changed. | 439 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. |
340 | - root cgroup (uses /proc/sys/vm/swappiness). | 440 | |
341 | - a cgroup which uses hierarchy and it has child cgroup. | 441 | The following cgroups' swappiness can't be changed. |
342 | - a cgroup which uses hierarchy and not the root of hierarchy. | 442 | - root cgroup (uses /proc/sys/vm/swappiness). |
443 | - a cgroup which uses hierarchy and has other cgroup(s) below it. | ||
444 | - a cgroup which uses hierarchy and not the root of hierarchy. | ||
445 | |||
446 | 5.4 failcnt | ||
343 | 447 | ||
448 | A memory cgroup provides memory.failcnt and memory.memsw.failcnt files. | ||
449 | This failcnt(== failure count) shows the number of times that a usage counter | ||
450 | hit its limit. When a memory cgroup hits a limit, failcnt increases and | ||
451 | memory under it will be reclaimed. | ||
452 | |||
453 | You can reset failcnt by writing 0 to the failcnt file. | ||
454 | # echo 0 > .../memory.failcnt | ||
344 | 455 | ||
345 | 6. Hierarchy support | 456 | 6. Hierarchy support |
346 | 457 | ||
@@ -359,13 +470,13 @@ hierarchy | |||
359 | 470 | ||
360 | In the diagram above, with hierarchical accounting enabled, all memory | 471 | In the diagram above, with hierarchical accounting enabled, all memory |
361 | usage of e is accounted to its ancestors up until the root (i.e., c and root) | 472 | usage of e is accounted to its ancestors up until the root (i.e., c and root) |
362 | that have memory.use_hierarchy enabled. If one of the ancestors goes over its | 473 | that have memory.use_hierarchy enabled. If one of the ancestors goes over its |
363 | limit, the reclaim algorithm reclaims from the tasks in the ancestor and the | 474 | limit, the reclaim algorithm reclaims from the tasks in the ancestor and the |
364 | children of the ancestor. | 475 | children of the ancestor. |
365 | 476 | ||
366 | 6.1 Enabling hierarchical accounting and reclaim | 477 | 6.1 Enabling hierarchical accounting and reclaim |
367 | 478 | ||
368 | The memory controller by default disables the hierarchy feature. Support | 479 | A memory cgroup by default disables the hierarchy feature. Support |
369 | can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup | 480 | can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup |
370 | 481 | ||
371 | # echo 1 > memory.use_hierarchy | 482 | # echo 1 > memory.use_hierarchy |
@@ -375,9 +486,10 @@ The feature can be disabled by | |||
375 | # echo 0 > memory.use_hierarchy | 486 | # echo 0 > memory.use_hierarchy |
376 | 487 | ||
377 | NOTE1: Enabling/disabling will fail if the cgroup already has other | 488 | NOTE1: Enabling/disabling will fail if the cgroup already has other |
378 | cgroups created below it. | 489 | cgroups created below it. |
379 | 490 | ||
380 | NOTE2: This feature can be enabled/disabled per subtree. | 491 | NOTE2: When panic_on_oom is set to "2", the whole system will panic in |
492 | case of an OOM event in any cgroup. | ||
381 | 493 | ||
382 | 7. Soft limits | 494 | 7. Soft limits |
383 | 495 | ||
@@ -387,7 +499,7 @@ is to allow control groups to use as much of the memory as needed, provided | |||
387 | a. There is no memory contention | 499 | a. There is no memory contention |
388 | b. They do not exceed their hard limit | 500 | b. They do not exceed their hard limit |
389 | 501 | ||
390 | When the system detects memory contention or low memory control groups | 502 | When the system detects memory contention or low memory, control groups |
391 | are pushed back to their soft limits. If the soft limit of each control | 503 | are pushed back to their soft limits. If the soft limit of each control |
392 | group is very high, they are pushed back as much as possible to make | 504 | group is very high, they are pushed back as much as possible to make |
393 | sure that one control group does not starve the others of memory. | 505 | sure that one control group does not starve the others of memory. |
@@ -401,7 +513,7 @@ it gets invoked from balance_pgdat (kswapd). | |||
401 | 7.1 Interface | 513 | 7.1 Interface |
402 | 514 | ||
403 | Soft limits can be setup by using the following commands (in this example we | 515 | Soft limits can be setup by using the following commands (in this example we |
404 | assume a soft limit of 256 megabytes) | 516 | assume a soft limit of 256 MiB) |
405 | 517 | ||
406 | # echo 256M > memory.soft_limit_in_bytes | 518 | # echo 256M > memory.soft_limit_in_bytes |
407 | 519 | ||
@@ -414,7 +526,121 @@ NOTE1: Soft limits take effect over a long period of time, since they involve | |||
414 | NOTE2: It is recommended to set the soft limit always below the hard limit, | 526 | NOTE2: It is recommended to set the soft limit always below the hard limit, |
415 | otherwise the hard limit will take precedence. | 527 | otherwise the hard limit will take precedence. |
416 | 528 | ||
417 | 8. TODO | 529 | 8. Move charges at task migration |
530 | |||
531 | Users can move charges associated with a task along with task migration, that | ||
532 | is, uncharge the task's pages from the old cgroup and charge them to the new cgroup. | ||
533 | This feature is not supported in !CONFIG_MMU environments because of lack of | ||
534 | page tables. | ||
535 | |||
536 | 8.1 Interface | ||
537 | |||
538 | This feature is disabled by default. It can be enabled (and disabled again) by | ||
539 | writing to memory.move_charge_at_immigrate of the destination cgroup. | ||
540 | |||
541 | If you want to enable it: | ||
542 | |||
543 | # echo (some positive value) > memory.move_charge_at_immigrate | ||
544 | |||
545 | Note: Each bit of move_charge_at_immigrate has its own meaning about what type | ||
546 | of charges should be moved. See 8.2 for details. | ||
547 | Note: Charges are moved only when you move mm->owner, in other words, the leader | ||
548 | of a thread group. | ||
549 | Note: If we cannot find enough space for the task in the destination cgroup, we | ||
550 | try to make space by reclaiming memory. Task migration may fail if we | ||
551 | cannot make enough space. | ||
552 | Note: It can take several seconds if you move many charges. | ||
553 | |||
554 | And if you want to disable it again: | ||
555 | |||
556 | # echo 0 > memory.move_charge_at_immigrate | ||
557 | |||
558 | 8.2 Types of charges which can be moved | ||
559 | |||
560 | Each bit of move_charge_at_immigrate has its own meaning about what type of | ||
561 | charges should be moved. In any case, it must be noted that the charge of | ||
562 | a page or a swap can be moved only when it is charged to the task's current (old) | ||
563 | memory cgroup. | ||
564 | |||
565 | bit | what type of charges would be moved ? | ||
566 | -----+------------------------------------------------------------------------ | ||
567 | 0 | A charge of an anonymous page(or swap of it) used by the target task. | ||
568 | | Those pages and swaps must be used only by the target task. You must | ||
569 | | enable Swap Extension(see 2.4) to enable move of swap charges. | ||
570 | -----+------------------------------------------------------------------------ | ||
571 | 1 | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory) | ||
572 | | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | ||
573 | | anonymous pages, file pages(and swaps) in the range mmapped by the task | ||
574 | | will be moved even if the task hasn't done page fault, i.e. they might | ||
575 | | not be the task's "RSS", but other task's "RSS" that maps the same file. | ||
576 | | And mapcount of the page is ignored(the page can be moved even if | ||
577 | | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to | ||
578 | | enable move of swap charges. | ||
579 | |||
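
A minimal C sketch of composing and writing this bitmask (not from the
kernel tree; the destination path /cgroup/B is an assumption for
illustration, and only the bit meanings come from the table above):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4];
		int val = (1 << 0) | (1 << 1);	/* anon + file charges => "3" */
		int fd = open("/cgroup/B/memory.move_charge_at_immigrate",
			      O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		snprintf(buf, sizeof(buf), "%d", val);
		if (write(fd, buf, strlen(buf)) < 0)
			perror("write");
		close(fd);
		return 0;
	}

Writing 3 this way moves both anonymous and file charges at the next
task migration into /cgroup/B.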
580 | 8.3 TODO | ||
581 | |||
582 | - Implement madvise(2) to let users decide the vma to be moved or not to be | ||
583 | moved. | ||
584 | - All moving-charge operations are done under cgroup_mutex. It's not good | ||
585 | to hold the mutex for too long, so we may need some trick. | ||
586 | |||
587 | 9. Memory thresholds | ||
588 | |||
589 | The memory cgroup implements memory thresholds using the cgroups notification | ||
590 | API (see cgroups.txt). It allows you to register multiple memory and memsw | ||
591 | thresholds and get notifications when a threshold is crossed. | ||
592 | |||
593 | To register a threshold, an application needs to: | ||
594 | - create an eventfd using eventfd(2); | ||
595 | - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; | ||
596 | - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to | ||
597 | cgroup.event_control. | ||
598 | |||
599 | The application will be notified through the eventfd when memory usage crosses | ||
600 | the threshold in either direction. | ||
601 | |||
602 | This is applicable to both root and non-root cgroups. | ||
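
A minimal C sketch of this registration (not from the kernel tree; the
cgroup path /cgroup/A and the 5M threshold are assumptions for
illustration, and Documentation/cgroups/cgroup_event_listener.c remains
the reference tool):

	#include <sys/eventfd.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[64];
		uint64_t cnt;
		int efd = eventfd(0, 0);
		int ufd = open("/cgroup/A/memory.usage_in_bytes", O_RDONLY);
		int cfd = open("/cgroup/A/cgroup.event_control", O_WRONLY);

		if (efd < 0 || ufd < 0 || cfd < 0) {
			perror("setup");
			return 1;
		}
		/* "<event_fd> <fd of memory.usage_in_bytes> <threshold>" */
		snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd,
			 (unsigned long long)(5 * 1024 * 1024));
		if (write(cfd, buf, strlen(buf)) < 0) {
			perror("cgroup.event_control");
			return 1;
		}
		/* Blocks until usage crosses 5M in either direction. */
		if (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
			printf("threshold crossed %llu time(s)\n",
			       (unsigned long long)cnt);
		return 0;
	}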
603 | |||
604 | 10. OOM Control | ||
605 | |||
606 | The memory.oom_control file is for OOM notification and other controls. | ||
607 | |||
608 | The memory cgroup implements an OOM notifier using the cgroup notification | ||
609 | API (see cgroups.txt). It allows you to register multiple OOM notification | ||
610 | deliveries and get a notification when an OOM happens. | ||
611 | |||
612 | To register a notifier, an application needs to: | ||
613 | - create an eventfd using eventfd(2) | ||
614 | - open memory.oom_control file | ||
615 | - write string like "<event_fd> <fd of memory.oom_control>" to | ||
616 | cgroup.event_control | ||
617 | |||
618 | The application will be notified through the eventfd when an OOM happens. | ||
619 | OOM notification doesn't work for the root cgroup. | ||
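
A minimal C sketch of the OOM registration (again, not from the kernel
tree; the path /cgroup/A is an assumption, and the flow parallels the
threshold example above, minus the threshold argument):

	#include <sys/eventfd.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[32];
		uint64_t cnt;
		int efd = eventfd(0, 0);
		int ofd = open("/cgroup/A/memory.oom_control", O_RDONLY);
		int cfd = open("/cgroup/A/cgroup.event_control", O_WRONLY);

		if (efd < 0 || ofd < 0 || cfd < 0) {
			perror("setup");
			return 1;
		}
		/* "<event_fd> <fd of memory.oom_control>" */
		snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
		if (write(cfd, buf, strlen(buf)) < 0) {
			perror("cgroup.event_control");
			return 1;
		}
		/* Blocks until an OOM happens in the group. */
		if (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
			printf("OOM happened in the group\n");
		return 0;
	}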
620 | |||
621 | You can disable the OOM-killer by writing "1" to the memory.oom_control file, as: | ||
622 | |||
623 | #echo 1 > memory.oom_control | ||
624 | |||
625 | This operation is only allowed for the top cgroup of a sub-hierarchy. | ||
626 | If the OOM-killer is disabled, tasks under the cgroup will hang/sleep | ||
627 | in the memory cgroup's OOM-waitqueue when they request accountable memory. | ||
628 | |||
629 | To let them run again, you have to relax the memory cgroup's OOM status by | ||
630 | * enlarge the limit or reduce usage. | ||
631 | To reduce usage, | ||
632 | * kill some tasks. | ||
633 | * move some tasks to another group with charge migration. | ||
634 | * remove some files (on tmpfs?) | ||
635 | |||
636 | Then, the stopped tasks will run again. | ||
637 | |||
638 | Reading the file shows the current OOM status: | ||
639 | oom_kill_disable 0 or 1 (if 1, oom-killer is disabled) | ||
640 | under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may | ||
641 | be stopped.) | ||
642 | |||
643 | 11. TODO | ||
418 | 644 | ||
419 | 1. Add support for accounting huge pages (as a separate controller) | 645 | 1. Add support for accounting huge pages (as a separate controller) |
420 | 2. Make per-cgroup scanner reclaim not-shared pages first | 646 | 2. Make per-cgroup scanner reclaim not-shared pages first |