Diffstat (limited to 'Documentation/cgroups')
-rw-r--r--  Documentation/cgroups/blkio-controller.txt     | 151
-rw-r--r--  Documentation/cgroups/cgroup_event_listener.c  | 110
-rw-r--r--  Documentation/cgroups/cgroups.txt              |  44
-rw-r--r--  Documentation/cgroups/cpusets.txt              | 161
-rw-r--r--  Documentation/cgroups/memcg_test.txt           |  49
-rw-r--r--  Documentation/cgroups/memory.txt               | 378
6 files changed, 711 insertions, 182 deletions
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 630879cd9a42..48e0b21b0059 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -17,6 +17,9 @@ HOWTO
 You can do a very simple testing of running two dd threads in two different
 cgroups. Here is what you can do.
 
+- Enable Block IO controller
+	CONFIG_BLK_CGROUP=y
+
 - Enable group scheduling in CFQ
 	CONFIG_CFQ_GROUP_IOSCHED=y
 
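A minimal sketch of that two-dd test (the mount point, file names and
weights below are illustrative assumptions; blkio.weight is described in
detail further down):

	# mount -t cgroup -o blkio none /cgroup
	# mkdir /cgroup/test1 /cgroup/test2
	# echo 1000 > /cgroup/test1/blkio.weight
	# echo 500 > /cgroup/test2/blkio.weight
	# echo $$ > /cgroup/test1/tasks; dd if=/mnt/file1 of=/dev/null &
	# echo $$ > /cgroup/test2/tasks; dd if=/mnt/file2 of=/dev/null &

The cgroup with the larger weight should receive proportionally more disk
time, which can be checked through the blkio.time file described below.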
@@ -54,32 +57,52 @@ cgroups. Here is what you can do.
 
 Various user visible config options
 ===================================
-CONFIG_CFQ_GROUP_IOSCHED
-	- Enables group scheduling in CFQ. Currently only 1 level of group
-	  creation is allowed.
-
-CONFIG_DEBUG_CFQ_IOSCHED
-	- Enables some debugging messages in blktrace. Also creates extra
-	  cgroup file blkio.dequeue.
-
-Config options selected automatically
-=====================================
-These config options are not user visible and are selected/deselected
-automatically based on IO scheduler configuration.
-
 CONFIG_BLK_CGROUP
-	- Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
+	- Block IO controller.
 
 CONFIG_DEBUG_BLK_CGROUP
-	- Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
+	- Debug help. Right now some additional stats files show up in the
+	  cgroup if this option is enabled.
+
+CONFIG_CFQ_GROUP_IOSCHED
+	- Enables group scheduling in CFQ. Currently only 1 level of group
+	  creation is allowed.
 
 Details of cgroup files
 =======================
 - blkio.weight
-	- Specifies per cgroup weight.
-
+	- Specifies the per-cgroup weight. This is the default weight of the
+	  group on all devices until and unless overridden by a per-device
+	  rule (see blkio.weight_device).
 	  Currently allowed range of weights is from 100 to 1000.
 
+- blkio.weight_device
+	- One can specify per-cgroup, per-device rules using this interface.
+	  These rules override the default value of the group weight as
+	  specified by blkio.weight.
+
+	  The format is as follows:
+
+	  # echo dev_maj:dev_minor weight > /path/to/cgroup/blkio.weight_device
+
+	  Configure weight=300 on /dev/sdb (8:16) in this cgroup
+	  # echo 8:16 300 > blkio.weight_device
+	  # cat blkio.weight_device
+	  dev     weight
+	  8:16    300
+
+	  Configure weight=500 on /dev/sda (8:0) in this cgroup
+	  # echo 8:0 500 > blkio.weight_device
+	  # cat blkio.weight_device
+	  dev     weight
+	  8:0     500
+	  8:16    300
+
+	  Remove the specific weight for /dev/sda in this cgroup
+	  # echo 8:0 0 > blkio.weight_device
+	  # cat blkio.weight_device
+	  dev     weight
+	  8:16    300
+
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
 	  two fields specify the major and minor number of the device and
@@ -92,13 +115,105 @@ Details of cgroup files
 	  third field specifies the number of sectors transferred by the
 	  group to/from the device.
 
+- blkio.io_service_bytes
+	- Number of bytes transferred to/from the disk by the group. These
+	  are further divided by the type of operation - read or write, sync
+	  or async. First two fields specify the major and minor number of the
+	  device, third field specifies the operation type and the fourth field
+	  specifies the number of bytes.
+
+- blkio.io_serviced
+	- Number of IOs completed to/from the disk by the group. These
+	  are further divided by the type of operation - read or write, sync
+	  or async. First two fields specify the major and minor number of the
+	  device, third field specifies the operation type and the fourth field
+	  specifies the number of IOs.
+
+- blkio.io_service_time
+	- Total amount of time between request dispatch and request completion
+	  for the IOs done by this cgroup. This is in nanoseconds to make it
+	  meaningful for flash devices too. For devices with queue depth of 1,
+	  this time represents the actual service time. When queue_depth > 1,
+	  that is no longer true as requests may be served out of order. This
+	  may cause the service time for a given IO to include the service time
+	  of multiple IOs when served out of order, which may result in total
+	  io_service_time > actual time elapsed. This time is further divided by
+	  the type of operation - read or write, sync or async. First two fields
+	  specify the major and minor number of the device, third field
+	  specifies the operation type and the fourth field specifies the
+	  io_service_time in ns.
+
+- blkio.io_wait_time
+	- Total amount of time the IOs for this cgroup spent waiting in the
+	  scheduler queues for service. This can be greater than the total time
+	  elapsed since it is the cumulative io_wait_time for all IOs. It is not
+	  a measure of total time the cgroup spent waiting but rather a measure
+	  of the wait_time of its individual IOs. For devices with queue_depth > 1
+	  this metric does not include the time spent waiting for service once
+	  the IO is dispatched to the device but till it actually gets serviced
+	  (there might be a time lag here due to re-ordering of requests by the
+	  device). This is in nanoseconds to make it meaningful for flash
+	  devices too. This time is further divided by the type of operation -
+	  read or write, sync or async. First two fields specify the major and
+	  minor number of the device, third field specifies the operation type
+	  and the fourth field specifies the io_wait_time in ns.
+
+- blkio.io_merged
+	- Total number of bios/requests merged into requests belonging to this
+	  cgroup. This is further divided by the type of operation - read or
+	  write, sync or async.
+
+- blkio.io_queued
+	- Total number of requests queued up at any given instant for this
+	  cgroup. This is further divided by the type of operation - read or
+	  write, sync or async.
+
+- blkio.avg_queue_size
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  The average queue size for this cgroup over the entire time of this
+	  cgroup's existence. Queue size samples are taken each time one of the
+	  queues of this cgroup gets a timeslice.
+
+- blkio.group_wait_time
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  This is the amount of time the cgroup had to wait since it became busy
+	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
+	  its queues. This is different from the io_wait_time which is the
+	  cumulative total of the amount of time spent by each IO in that cgroup
+	  waiting in the scheduler queue. This is in nanoseconds. If this is
+	  read when the cgroup is in a waiting (for timeslice) state, the stat
+	  will only report the group_wait_time accumulated till the last time it
+	  got a timeslice and will not include the current delta.
+
+- blkio.empty_time
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  This is the amount of time a cgroup spends without any pending
+	  requests when not being served, i.e., it does not include any time
+	  spent idling for one of the queues of the cgroup. This is in
+	  nanoseconds. If this is read when the cgroup is in an empty state,
+	  the stat will only report the empty_time accumulated till the last
+	  time it had a pending request and will not include the current delta.
+
+- blkio.idle_time
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	  This is the amount of time spent by the IO scheduler idling for a
+	  given cgroup in anticipation of a better request than the existing
+	  ones from other queues/cgroups. This is in nanoseconds. If this is
+	  read when the cgroup is in an idling state, the stat will only report
+	  the idle_time accumulated till the last idle period and will not
+	  include the current delta.
+
 - blkio.dequeue
-	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
+	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
 	  gives the statistics about how many times a group was dequeued
 	  from the service tree of the device. First two fields specify the
 	  major and minor number of the device and third field specifies the
 	  number of times a group was dequeued from a particular device.
 
+- blkio.reset_stats
+	- Writing an int to this file will result in resetting all the stats
+	  for that cgroup.
+
 CFQ sysfs tunable
 =================
 /sys/block/<disk>/queue/iosched/group_isolation
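As a sketch of how these statistics can be inspected and reset from a
shell (the device number, operation-type labels and byte counts shown are
hypothetical):

	# cat blkio.io_service_bytes
	8:16 Read 122880
	8:16 Write 0
	8:16 Sync 122880
	8:16 Async 0
	# echo 1 > blkio.reset_stats

After the echo, re-reading any of the stat files shows the counters back
at zero.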
diff --git a/Documentation/cgroups/cgroup_event_listener.c b/Documentation/cgroups/cgroup_event_listener.c
new file mode 100644
index 000000000000..8c2bfc4a6358
--- /dev/null
+++ b/Documentation/cgroups/cgroup_event_listener.c
@@ -0,0 +1,110 @@
+/*
+ * cgroup_event_listener.c - Simple listener of cgroup events
+ *
+ * Copyright (C) Kirill A. Shutemov <kirill@shutemov.name>
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <libgen.h>
+#include <limits.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/eventfd.h>
+
+#define USAGE_STR "Usage: cgroup_event_listener <path-to-control-file> <args>\n"
+
+int main(int argc, char **argv)
+{
+	int efd = -1;
+	int cfd = -1;
+	int event_control = -1;
+	char event_control_path[PATH_MAX];
+	char line[LINE_MAX];
+	int ret = -1;	/* initialized so early "goto out" paths return failure */
+
+	if (argc != 3) {
+		fputs(USAGE_STR, stderr);
+		return 1;
+	}
+
+	cfd = open(argv[1], O_RDONLY);
+	if (cfd == -1) {
+		fprintf(stderr, "Cannot open %s: %s\n", argv[1],
+				strerror(errno));
+		goto out;
+	}
+
+	ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control",
+			dirname(argv[1]));
+	if (ret >= PATH_MAX) {
+		fputs("Path to cgroup.event_control is too long\n", stderr);
+		goto out;
+	}
+
+	event_control = open(event_control_path, O_WRONLY);
+	if (event_control == -1) {
+		fprintf(stderr, "Cannot open %s: %s\n", event_control_path,
+				strerror(errno));
+		goto out;
+	}
+
+	efd = eventfd(0, 0);
+	if (efd == -1) {
+		perror("eventfd() failed");
+		goto out;
+	}
+
+	/* Register: write "<event_fd> <control_fd> <args>" to event_control */
+	ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]);
+	if (ret >= LINE_MAX) {
+		fputs("Arguments string is too long\n", stderr);
+		goto out;
+	}
+
+	ret = write(event_control, line, strlen(line) + 1);
+	if (ret == -1) {
+		perror("Cannot write to cgroup.event_control");
+		goto out;
+	}
+
+	while (1) {
+		uint64_t result;
+
+		ret = read(efd, &result, sizeof(result));
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
+			perror("Cannot read from eventfd");
+			break;
+		}
+		assert(ret == sizeof(result));
+
+		/* cgroup.event_control disappears when the cgroup is removed */
+		ret = access(event_control_path, W_OK);
+		if ((ret == -1) && (errno == ENOENT)) {
+			puts("The cgroup seems to have been removed.");
+			ret = 0;
+			break;
+		}
+
+		if (ret == -1) {
+			perror("cgroup.event_control "
+					"is not accessible any more");
+			break;
+		}
+
+		printf("%s %s: crossed\n", argv[1], argv[2]);
+	}
+
+out:
+	if (efd >= 0)
+		close(efd);
+	if (event_control >= 0)
+		close(event_control);
+	if (cfd >= 0)
+		close(cfd);
+
+	return (ret != 0);
+}
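A sketch of building and running this listener (the cgroup path and the
5M threshold argument mirror the memcg threshold test in memcg_test.txt
from this same series):

	# gcc -o cgroup_event_listener cgroup_event_listener.c
	# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
	/cgroup/A/memory.usage_in_bytes 5M: crossed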
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0b33bfe7dde9..b34823ff1646 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -22,6 +22,8 @@ CONTENTS:
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Attaching processes
+  2.3 Mounting hierarchies by name
+  2.4 Notification API
 3. Kernel API
   3.1 Overview
   3.2 Synchronization
@@ -233,8 +235,7 @@ containing the following files describing that cgroup:
 - cgroup.procs: list of tgids in the cgroup. This list is not
   guaranteed to be sorted or free of duplicate tgids, and userspace
   should sort/uniquify the list if this property is required.
-  Writing a tgid into this file moves all threads with that tgid into
-  this cgroup.
+  This is a read-only file, for now.
 - notify_on_release flag: run the release agent on exit?
 - release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)
@@ -338,7 +339,7 @@ To mount a cgroup hierarchy with all available subsystems, type:
338The "xxx" is not interpreted by the cgroup code, but will appear in 339The "xxx" is not interpreted by the cgroup code, but will appear in
339/proc/mounts so may be any useful identifying string that you like. 340/proc/mounts so may be any useful identifying string that you like.
340 341
341To mount a cgroup hierarchy with just the cpuset and numtasks 342To mount a cgroup hierarchy with just the cpuset and memory
342subsystems, type: 343subsystems, type:
343# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup 344# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
344 345
@@ -434,6 +435,25 @@ you give a subsystem a name.
 The name of the subsystem appears as part of the hierarchy description
 in /proc/mounts and /proc/<pid>/cgroups.
 
+2.4 Notification API
+--------------------
+
+There is a mechanism which allows getting notifications about the
+changing status of a cgroup.
+
+To register a new notification handler you need to:
+ - create a file descriptor for event notification using eventfd(2);
+ - open a control file to be monitored (e.g. memory.usage_in_bytes);
+ - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
+   Interpretation of args is defined by the control file implementation;
+
+The eventfd will be woken up by the control file implementation or when
+the cgroup is removed.
+
+To unregister a notification handler just close the eventfd.
+
+NOTE: Support of notifications should be implemented for the control
+file. See the documentation for the subsystem.
 
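Condensed into code, the registration sequence looks roughly like the
sketch below (error handling is omitted; the cgroup path and the "5M"
argument are assumptions, and a complete program is provided as
Documentation/cgroups/cgroup_event_listener.c in this same series):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/eventfd.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[64];
		uint64_t counter;

		int efd = eventfd(0, 0);	/* step 1: the eventfd */
		/* step 2: control file to monitor (path is illustrative) */
		int cfd = open("/cgroup/A/memory.usage_in_bytes", O_RDONLY);
		int ecfd = open("/cgroup/A/cgroup.event_control", O_WRONLY);

		/* step 3: "<event_fd> <control_fd> <args>" */
		int len = snprintf(buf, sizeof(buf), "%d %d 5M", efd, cfd);
		write(ecfd, buf, len + 1);

		/* blocks until the control file implementation fires */
		read(efd, &counter, sizeof(counter));
		printf("event fired %llu time(s)\n",
		       (unsigned long long)counter);

		close(efd);	/* closing the eventfd unregisters it */
		return 0;
	}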
 3. Kernel API
 =============
@@ -488,6 +508,11 @@ Each subsystem should:
 - add an entry in linux/cgroup_subsys.h
 - define a cgroup_subsys object called <name>_subsys
 
+If a subsystem can be compiled as a module, it should also have in its
+module initcall a call to cgroup_load_subsys(), and in its exitcall a
+call to cgroup_unload_subsys(). It should also set its
+<name>_subsys.module = THIS_MODULE in its .c file.
+
 Each subsystem may export the following methods. The only mandatory
 methods are create/destroy. Any others that are null are presumed to
 be successful no-ops.
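To make the modular-subsystem rule above concrete, here is a minimal
sketch for a hypothetical "foo" subsystem; foo_create(), foo_destroy()
and foo_subsys_id are assumed to be defined elsewhere:

	#include <linux/cgroup.h>
	#include <linux/module.h>

	struct cgroup_subsys foo_subsys = {
		.name		= "foo",
		.create		= foo_create,	/* assumed callbacks */
		.destroy	= foo_destroy,
		.subsys_id	= foo_subsys_id,
		.module		= THIS_MODULE,	/* required for modules */
	};

	static int __init foo_init(void)
	{
		/* registers the subsystem with the cgroup core */
		return cgroup_load_subsys(&foo_subsys);
	}

	static void __exit foo_exit(void)
	{
		cgroup_unload_subsys(&foo_subsys);
	}

	module_init(foo_init);
	module_exit(foo_exit);
	MODULE_LICENSE("GPL");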
@@ -536,10 +561,21 @@ returns an error, this will abort the attach operation. If a NULL
 task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex. If threadgroup is
+remain valid while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called in the future. If threadgroup is
 true, then a successful result indicates that all threads in the given
 thread's threadgroup can be moved together.
 
+void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+		   struct task_struct *task, bool threadgroup)
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has
+succeeded. A subsystem whose can_attach() has side effects should
+provide this function, so that the subsystem can implement a rollback;
+otherwise it is not necessary. This will be called only for those
+subsystems whose can_attach() operation has succeeded.
+
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	    struct cgroup *old_cgrp, struct task_struct *task,
 	    bool threadgroup)
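As an illustration of the can_attach()/cancel_attach() pairing described
above, a hypothetical subsystem that reserves a resource in can_attach()
might roll it back like this (all foo_* names and helpers are
assumptions, not kernel API):

	static int foo_can_attach(struct cgroup_subsys *ss,
				  struct cgroup *cgrp,
				  struct task_struct *task, bool threadgroup)
	{
		/* side effect that needs undoing if the attach fails */
		if (!foo_reserve_slot(cgroup_to_foo(cgrp)))
			return -EBUSY;
		return 0;
	}

	static void foo_cancel_attach(struct cgroup_subsys *ss,
				      struct cgroup *cgrp,
				      struct task_struct *task,
				      bool threadgroup)
	{
		/* another subsystem vetoed the attach: undo our reservation */
		foo_unreserve_slot(cgroup_to_foo(cgrp));
	}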
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 1d7e9784439a..51682ab2dd1a 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -42,7 +42,7 @@ Nodes to a set of tasks. In this document "Memory Node" refers to
 an on-line node that contains memory.
 
 Cpusets constrain the CPU and Memory placement of tasks to only
-the resources within a tasks current cpuset. They form a nested
+the resources within a task's current cpuset. They form a nested
 hierarchy visible in a virtual file system. These are the essential
 hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.
@@ -53,11 +53,11 @@ Documentation/cgroups/cgroups.txt.
 Requests by a task, using the sched_setaffinity(2) system call to
 include CPUs in its CPU affinity mask, and using the mbind(2) and
 set_mempolicy(2) system calls to include Memory Nodes in its memory
-policy, are both filtered through that tasks cpuset, filtering out any
+policy, are both filtered through that task's cpuset, filtering out any
 CPUs or Memory Nodes not in that cpuset. The scheduler will not
 schedule a task on a CPU that is not allowed in its cpus_allowed
 vector, and the kernel page allocator will not allocate a page on a
-node that is not allowed in the requesting tasks mems_allowed vector.
+node that is not allowed in the requesting task's mems_allowed vector.
 
 User level code may create and destroy cpusets by name in the cgroup
 virtual file system, manage the attributes and permissions of these
@@ -121,9 +121,9 @@ Cpusets extends these two mechanisms as follows:
  - Each task in the system is attached to a cpuset, via a pointer
    in the task structure to a reference counted cgroup structure.
  - Calls to sched_setaffinity are filtered to just those CPUs
-   allowed in that tasks cpuset.
+   allowed in that task's cpuset.
  - Calls to mbind and set_mempolicy are filtered to just
-   those Memory Nodes allowed in that tasks cpuset.
+   those Memory Nodes allowed in that task's cpuset.
  - The root cpuset contains all the system's CPUs and Memory
    Nodes.
  - For any cpuset, one can define child cpusets containing a subset
@@ -141,11 +141,11 @@ into the rest of the kernel, none in performance critical paths:
  - in init/main.c, to initialize the root cpuset at system boot.
  - in fork and exit, to attach and detach a task from its cpuset.
  - in sched_setaffinity, to mask the requested CPUs by what's
-   allowed in that tasks cpuset.
+   allowed in that task's cpuset.
  - in sched.c migrate_live_tasks(), to keep migrating tasks within
    the CPUs allowed by their cpuset, if possible.
  - in the mbind and set_mempolicy system calls, to mask the requested
-   Memory Nodes by what's allowed in that tasks cpuset.
+   Memory Nodes by what's allowed in that task's cpuset.
  - in page_alloc.c, to restrict memory to allowed nodes.
  - in vmscan.c, to restrict page recovery to the current cpuset.
 
@@ -155,7 +155,7 @@ new system calls are added for cpusets - all support for querying and
 modifying cpusets is via this cpuset file system.
 
 The /proc/<pid>/status file for each task has four added lines,
-displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
+displaying the task's cpus_allowed (on which CPUs it may be scheduled)
 and mems_allowed (on which Memory Nodes it may obtain memory),
 in the two formats seen in the following example:
 
@@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system
 containing (on top of the standard cgroup files) the following
 files describing that cpuset:
 
- - cpus: list of CPUs in that cpuset
- - mems: list of Memory Nodes in that cpuset
- - memory_migrate flag: if set, move pages to cpusets nodes
- - cpu_exclusive flag: is cpu placement exclusive?
- - mem_exclusive flag: is memory placement exclusive?
- - mem_hardwall flag: is memory allocation hardwalled
- - memory_pressure: measure of how much paging pressure in cpuset
- - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- - sched_load_balance flag: if set, load balance within CPUs on that cpuset
- - sched_relax_domain_level: the searching range when migrating tasks
+ - cpuset.cpus: list of CPUs in that cpuset
+ - cpuset.mems: list of Memory Nodes in that cpuset
+ - cpuset.memory_migrate flag: if set, move pages to cpusets nodes
+ - cpuset.cpu_exclusive flag: is cpu placement exclusive?
+ - cpuset.mem_exclusive flag: is memory placement exclusive?
+ - cpuset.mem_hardwall flag: is memory allocation hardwalled
+ - cpuset.memory_pressure: measure of how much paging pressure in cpuset
+ - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - cpuset.sched_relax_domain_level: the searching range when migrating tasks
 
 In addition, the root cpuset only has the following file:
- - memory_pressure_enabled flag: compute memory_pressure?
+ - cpuset.memory_pressure_enabled flag: compute memory_pressure?
 
 New cpusets are created using the mkdir system call or shell
 command. The properties of a cpuset, such as its flags, allowed
@@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than
 a direct ancestor or descendant, may share any of the same CPUs or
 Memory Nodes.
 
-A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
+A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
 i.e. it restricts kernel allocations for page, buffer and other data
 commonly shared by the kernel across multiple users. All cpusets,
 whether hardwalled or not, restrict allocations of memory for user
@@ -304,15 +304,15 @@ times 1000.
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
-kernel data structures. They are called 'memory_spread_page' and
-'memory_spread_slab'.
+kernel data structures. They are called 'cpuset.memory_spread_page' and
+'cpuset.memory_spread_slab'.
 
-If the per-cpuset boolean flag file 'memory_spread_page' is set, then
+If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
 the kernel will spread the file system buffers (page cache) evenly
 over all the nodes that the faulting task is allowed to use, instead
 of preferring to put those pages on the node where the task is running.
 
-If the per-cpuset boolean flag file 'memory_spread_slab' is set,
+If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
 then the kernel will spread some file system related slab caches,
 such as for inodes and dentries evenly over all the nodes that the
 faulting task is allowed to use, instead of preferring to put those
@@ -323,41 +323,41 @@ stack segment pages of a task.
 
 By default, both kinds of memory spreading are off, and memory
 pages are allocated on the node local to where the task is running,
-except perhaps as modified by the tasks NUMA mempolicy or cpuset
+except perhaps as modified by the task's NUMA mempolicy or cpuset
 configuration, so long as sufficient free memory pages are available.
 
 When new cpusets are created, they inherit the memory spread settings
 of their parent.
 
 Setting memory spreading causes allocations for the affected page
-or slab caches to ignore the tasks NUMA mempolicy and be spread
+or slab caches to ignore the task's NUMA mempolicy and be spread
 instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
 mempolicies will not notice any change in these calls as a result of
-their containing tasks memory spread settings. If memory spreading
+their containing task's memory spread settings. If memory spreading
 is turned off, then the currently specified NUMA mempolicy once again
 applies to memory page allocations.
 
-Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
+Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
 files. By default they contain "0", meaning that the feature is off
 for that cpuset. If a "1" is written to that file, then that turns
 the named feature on.
 
 The implementation is simple.
 
-Setting the flag 'memory_spread_page' turns on a per-process flag
+Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
 PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
 joins that cpuset. The page allocation calls for the page cache
 are modified to perform an inline check for this PF_SPREAD_PAGE task
 flag, and if set, a call to a new routine cpuset_mem_spread_node()
 returns the node to prefer for the allocation.
 
-Similarly, setting 'memory_spread_slab' turns on the flag
+Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
 PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
 pages from the node returned by cpuset_mem_spread_node().
 
 The cpuset_mem_spread_node() routine is also simple. It uses the
 value of a per-task rotor cpuset_mem_spread_rotor to select the next
-node in the current tasks mems_allowed to prefer for the allocation.
+node in the current task's mems_allowed to prefer for the allocation.
 
 This memory placement policy is also known (in other contexts) as
 round-robin or interleave.
@@ -404,24 +404,24 @@ the following two situations:
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.
 
-When the per-cpuset flag "sched_load_balance" is enabled (the default
-setting), it requests that all the CPUs in that cpusets allowed 'cpus'
+When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus'
 be contained in a single sched domain, ensuring that load balancing
 can move a task (not otherwise pinned, as by sched_setaffinity)
 from any CPU in that cpuset to any other.
 
-When the per-cpuset flag "sched_load_balance" is disabled, then the
+When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
 scheduler will avoid load balancing across the CPUs in that cpuset,
 --except-- in so far as is necessary because some overlapping cpuset
 has "sched_load_balance" enabled.
 
-So, for example, if the top cpuset has the flag "sched_load_balance"
+So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
 enabled, then the scheduler will have one sched domain covering all
-CPUs, and the setting of the "sched_load_balance" flag in any other
+CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
 cpusets won't matter, as we're already fully load balancing.
 
 Therefore in the above two situations, the top cpuset flag
-"sched_load_balance" should be disabled, and only some of the smaller,
+"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
 child cpusets have this flag enabled.
 
 When doing this, you don't usually want to leave any unpinned tasks in
@@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that
 task to that underused CPU.
 
 Of course, tasks pinned to a particular CPU can be left in a cpuset
-that disables "sched_load_balance" as those tasks aren't going anywhere
+that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
 else anyway.
 
 There is an impedance mismatch here, between cpusets and sched domains.
@@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain.
 It is necessary for sched domains to be flat because load balancing
 across partially overlapping sets of CPUs would risk unstable dynamics
 that would be beyond our understanding. So if each of two partially
-overlapping cpusets enables the flag 'sched_load_balance', then we
+overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
 form a single sched domain that is a superset of both. We won't move
 a task to a CPU outside its cpuset, but the scheduler load balancing
 code might waste some compute cycles considering that possibility.
 
 This mismatch is why there is not a simple one-to-one relation
-between which cpusets have the flag "sched_load_balance" enabled,
+between which cpusets have the flag "cpuset.sched_load_balance" enabled,
 and the sched domain configuration. If a cpuset enables the flag, it
 will get balancing across all its CPUs, but if it disables the flag,
 it will only be assured of no load balancing if no other overlapping
 cpuset enables the flag.
 
-If two cpusets have partially overlapping 'cpus' allowed, and only
+If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
 one of them has this flag enabled, then the other may find its
 tasks only partially load balanced, just on the overlapping CPUs.
 This is just the general case of the top_cpuset example given a few
@@ -468,23 +468,23 @@ load balancing to the other CPUs.
 1.7.1 sched_load_balance implementation details.
 ------------------------------------------------
 
-The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
 to most cpuset flags.) When enabled for a cpuset, the kernel will
 ensure that it can load balance across all the CPUs in that cpuset
 (makes sure that all the CPUs in the cpus_allowed of that cpuset are
 in the same sched domain.)
 
-If two overlapping cpusets both have 'sched_load_balance' enabled,
+If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
 then they will be (must be) both in the same sched domain.
 
-If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
 then by the above that means there is a single sched domain covering
 the whole system, regardless of any other cpuset settings.
 
 The kernel commits to user space that it will avoid load balancing
 where it can. It will pick as fine a granularity partition of sched
 domains as it can while still providing load balancing for any set
-of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
 
 The internal kernel cpuset to scheduler interface passes from the
 cpuset code to the scheduler code a partition of the load balanced
@@ -495,9 +495,9 @@ all the CPUs that must be load balanced.
 The cpuset code builds a new such partition and passes it to the
 scheduler sched domain setup code, to have the sched domains rebuilt
 as necessary, whenever:
- - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
  - or CPUs come or go from a cpuset with this flag enabled,
- - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
+ - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
    and with this flag enabled changes,
  - or a cpuset with non-empty CPUs and with this flag enabled is removed,
  - or a cpu is offlined/onlined.
@@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance
 on the next tick. For some applications in special situations, waiting
 1 tick may be too long.
 
-The 'sched_relax_domain_level' file allows you to request changing
+The 'cpuset.sched_relax_domain_level' file allows you to request changing
 this searching range as you like. This file takes an int value which
 indicates the size of the searching range in levels, ideally as follows,
 otherwise the initial value -1 indicates the cpuset has no request.
@@ -559,8 +559,8 @@ The system default is architecture dependent. The system default
 can be changed using the relax_domain_level= boot parameter.
 
 This file is per-cpuset and affects the sched domain where the cpuset
-belongs. Therefore if the flag 'sched_load_balance' of a cpuset
-is disabled, then 'sched_relax_domain_level' has no effect since
+belongs. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
+is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
 there is no sched domain belonging to the cpuset.
 
 If multiple cpusets are overlapping and hence they form a single sched
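An illustrative use of this knob, borrowing the Charlie cpuset from the
example later in this document (the level chosen is arbitrary; writing
-1 restores the default no-request state):

	# echo 1 > /dev/cpuset/Charlie/cpuset.sched_relax_domain_level
	# echo -1 > /dev/cpuset/Charlie/cpuset.sched_relax_domain_level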
@@ -594,7 +594,7 @@ is attached, is subtle.
 If a cpuset has its Memory Nodes modified, then for each task attached
 to that cpuset, the next time that the kernel attempts to allocate
 a page of memory for that task, the kernel will notice the change
-in the tasks cpuset, and update its per-task memory placement to
+in the task's cpuset, and update its per-task memory placement to
 remain within the new cpusets memory placement. If the task was using
 mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
 its new cpuset, then the task will continue to use whatever subset
@@ -603,13 +603,13 @@ was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
 in the new cpuset, then the task will be essentially treated as if it
 was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
 as queried by get_mempolicy(), doesn't change). If a task is moved
-from one cpuset to another, then the kernel will adjust the tasks
+from one cpuset to another, then the kernel will adjust the task's
 memory placement, as above, the next time that the kernel attempts
 to allocate a page of memory for that task.
 
-If a cpuset has its 'cpus' modified, then each task in that cpuset
+If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
 will have its allowed CPU placement changed immediately. Similarly,
-if a tasks pid is written to another cpusets 'tasks' file, then its
+if a task's pid is written to another cpusets 'cpuset.tasks' file, then its
 allowed CPU placement is changed immediately. If such a task had been
 bound to some subset of its cpuset using the sched_setaffinity() call,
 the task will be allowed to run on any CPU allowed in its new cpuset,
@@ -622,21 +622,21 @@ and the processor placement is updated immediately.
 Normally, once a page is allocated (given a physical page
 of main memory) then that page stays on whatever node it
 was allocated, so long as it remains allocated, even if the
-cpusets memory placement policy 'mems' subsequently changes.
-If the cpuset flag file 'memory_migrate' is set true, then when
+cpusets memory placement policy 'cpuset.mems' subsequently changes.
+If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
 tasks are attached to that cpuset, any pages that task had
 allocated to it on nodes in its previous cpuset are migrated
-to the tasks new cpuset. The relative placement of the page within
+to the task's new cpuset. The relative placement of the page within
 the cpuset is preserved during these migration operations if possible.
 For example if the page was on the second valid node of the prior cpuset
 then the page will be placed on the second valid node of the new cpuset.
 
-Also if 'memory_migrate' is set true, then if that cpusets
-'mems' file is modified, pages allocated to tasks in that
-cpuset, that were on nodes in the previous setting of 'mems',
+Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
+'cpuset.mems' file is modified, pages allocated to tasks in that
+cpuset, that were on nodes in the previous setting of 'cpuset.mems',
 will be moved to nodes in the new setting of 'mems.'
-Pages that were not in the tasks prior cpuset, or in the cpusets
-prior 'mems' setting, will not be moved.
+Pages that were not in the task's prior cpuset, or in the cpuset's
+prior 'cpuset.mems' setting, will not be moved.
 
 There is an exception to the above. If hotplug functionality is used
 to remove all the CPUs that are currently assigned to a cpuset,
@@ -655,7 +655,7 @@ There is a second exception to the above. GFP_ATOMIC requests are
 kernel internal allocations that must be satisfied, immediately.
 The kernel may drop some request, in rare cases even panic, if a
 GFP_ATOMIC alloc fails. If the request cannot be satisfied within
-the current tasks cpuset, then we relax the cpuset, and look for
+the current task's cpuset, then we relax the cpuset, and look for
 memory anywhere we can find it. It's better to violate the cpuset
 than stress the kernel.
 
@@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset:
   cd /dev/cpuset
   mkdir Charlie
   cd Charlie
-  /bin/echo 2-3 > cpus
-  /bin/echo 1 > mems
+  /bin/echo 2-3 > cpuset.cpus
+  /bin/echo 1 > cpuset.mems
   /bin/echo $$ > tasks
   sh
   # The subshell 'sh' is now running in cpuset Charlie
@@ -725,10 +725,13 @@ Now you want to do something with this cpuset.
 
 In this directory you can find several files:
 # ls
-cpu_exclusive  memory_migrate      mems                      tasks
-cpus           memory_pressure     notify_on_release
-mem_exclusive  memory_spread_page  sched_load_balance
-mem_hardwall   memory_spread_slab  sched_relax_domain_level
+cpuset.cpu_exclusive   cpuset.memory_spread_slab
+cpuset.cpus            cpuset.mems
+cpuset.mem_exclusive   cpuset.sched_load_balance
+cpuset.mem_hardwall    cpuset.sched_relax_domain_level
+cpuset.memory_migrate  notify_on_release
+cpuset.memory_pressure tasks
+cpuset.memory_spread_page
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
@@ -736,13 +739,13 @@ it, its properties. By writing to these files you can manipulate
 the cpuset.
 
 Set some flags:
-# /bin/echo 1 > cpu_exclusive
+# /bin/echo 1 > cpuset.cpu_exclusive
 
 Add some cpus:
-# /bin/echo 0-7 > cpus
+# /bin/echo 0-7 > cpuset.cpus
 
 Add some mems:
-# /bin/echo 0-7 > mems
+# /bin/echo 0-7 > cpuset.mems
 
 Now attach your shell to this cpuset:
 # /bin/echo $$ > tasks
@@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
 This is the syntax to use when writing in the cpus or mems files
 in cpuset directories:
 
-# /bin/echo 1-4 > cpus		-> set cpus list to cpus 1,2,3,4
-# /bin/echo 1,2,3,4 > cpus	-> set cpus list to cpus 1,2,3,4
+# /bin/echo 1-4 > cpuset.cpus		-> set cpus list to cpus 1,2,3,4
+# /bin/echo 1,2,3,4 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4
 
 To add a CPU to a cpuset, write the new list of CPUs including the
 CPU to be added. To add 6 to the above cpuset:
 
-# /bin/echo 1-4,6 > cpus	-> set cpus list to cpus 1,2,3,4,6
+# /bin/echo 1-4,6 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4,6
 
 Similarly to remove a CPU from a cpuset, write the new list of CPUs
 without the CPU to be removed.
 
 To remove all the CPUs:
 
-# /bin/echo "" > cpus		-> clear cpus list
+# /bin/echo "" > cpuset.cpus		-> clear cpus list
 
 2.3 Setting flags
 -----------------
 
 The syntax is very simple:
 
-# /bin/echo 1 > cpu_exclusive	-> set flag 'cpu_exclusive'
-# /bin/echo 0 > cpu_exclusive	-> unset flag 'cpu_exclusive'
+# /bin/echo 1 > cpuset.cpu_exclusive	-> set flag 'cpuset.cpu_exclusive'
+# /bin/echo 0 > cpuset.cpu_exclusive	-> unset flag 'cpuset.cpu_exclusive'
 
 2.4 Attaching processes
 -----------------------
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 72db89ed0609..b7eececfb195 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -1,6 +1,6 @@
 Memory Resource Controller(Memcg) Implementation Memo.
-Last Updated: 2009/1/20
-Base Kernel Version: based on 2.6.29-rc2.
+Last Updated: 2010/2
+Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
 
 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
 is complex. This is a document for memcg's internal behavior.
@@ -244,7 +244,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
    we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
 
 8. LRU
-   Each memcg has its own private LRU. Now, it's handling is under global
+   Each memcg has its own private LRU. Now, its handling is under global
    VM's control (means that it's handled under global zone->lru_lock).
    Almost all routines around memcg's LRU are called by global LRU's
    list management functions under zone->lru_lock().
@@ -337,7 +337,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
    race and lock dependency with other cgroup subsystems.
 
    example)
-   # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices
+   # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
 
    and do task move, mkdir, rmdir etc...under this.
 
@@ -348,7 +348,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
    For example, a test like the following is good.
    (Shell-A)
-   # mount -t cgroup none /cgroup -t memory
+   # mount -t cgroup none /cgroup -o memory
    # mkdir /cgroup/test
    # echo 40M > /cgroup/test/memory.limit_in_bytes
    # echo 0 > /cgroup/test/tasks
@@ -378,3 +378,42 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
378 #echo 50M > memory.limit_in_bytes 378 #echo 50M > memory.limit_in_bytes
379 #echo 50M > memory.memsw.limit_in_bytes 379 #echo 50M > memory.memsw.limit_in_bytes
380 run 51M of malloc 380 run 51M of malloc
381
382 9.9 Move charges at task migration
383 Charges associated with a task can be moved along with task migration.
384
385 (Shell-A)
386 #mkdir /cgroup/A
387 #echo $$ >/cgroup/A/tasks
388 run some programs which use some amount of memory in /cgroup/A.
389
390 (Shell-B)
391 #mkdir /cgroup/B
392 #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
393 #echo "pid of the program running in group A" >/cgroup/B/tasks
394
395 You can see charges have been moved by reading *.usage_in_bytes or
396 memory.stat of both A and B.
397 See 8.2 of Documentation/cgroups/memory.txt to see what value should be
398 written to move_charge_at_immigrate.
399
400 9.10 Memory thresholds
401 Memory controller implements memory thresholds using cgroups notification
402 API. You can use Documentation/cgroups/cgroup_event_listener.c to test
403 it.
404
405 (Shell-A) Create cgroup and run event listener
406 # mkdir /cgroup/A
407 # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
408
409 (Shell-B) Add task to cgroup and try to allocate and free memory
410 # echo $$ >/cgroup/A/tasks
411 # a="$(dd if=/dev/zero bs=1M count=10)"
412 # a=
413
414 You will see a message from cgroup_event_listener every time you cross
415 the threshold.
416
417 Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
418
419 It's a good idea to test the root cgroup as well.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f2552b45..7781857dc940 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -1,18 +1,15 @@
1Memory Resource Controller 1Memory Resource Controller
2 2
3NOTE: The Memory Resource Controller has generically been referred 3NOTE: The Memory Resource Controller has generically been referred
4to as the memory controller in this document. Do not confuse memory controller 4 to as the memory controller in this document. Do not confuse memory
5used here with the memory controller that is used in hardware. 5 controller used here with the memory controller that is used in hardware.
6 6
7Salient features 7(For editors)
8 8In this document:
9a. Enable control of Anonymous, Page Cache (mapped and unmapped) and 9 When we mention a cgroup (cgroupfs's directory) with memory controller,
10 Swap Cache memory pages. 10 we call it "memory cgroup". When you see git-log and source code, you'll
11b. The infrastructure allows easy addition of other types of memory to control 11 see patch titles and function names tend to use "memcg".
12c. Provides *zero overhead* for non memory controller users 12 In this document, we avoid using it.
13d. Provides a double LRU: global memory pressure causes reclaim from the
14 global LRU; a cgroup on hitting a limit, reclaims from the per
15 cgroup LRU
16 13
17Benefits and Purpose of the memory controller 14Benefits and Purpose of the memory controller
18 15
@@ -33,6 +30,45 @@ d. A CD/DVD burner could control the amount of memory used by the
33e. There are several other use cases, find one or use the controller just 30e. There are several other use cases, find one or use the controller just
34 for fun (to learn and hack on the VM subsystem). 31 for fun (to learn and hack on the VM subsystem).
35 32
33Current Status: linux-2.6.34-mmotm (development version of 2010/April)
34
35Features:
36 - accounting anonymous pages, file caches, swap caches usage and limiting them.
37 - private LRU and reclaim routine. (system's global LRU and private LRU
38 work independently from each other)
39 - optionally, memory+swap usage can be accounted and limited.
40 - hierarchical accounting
41 - soft limit
42 - moving (recharging) account at task move is selectable.
43 - usage threshold notifier
44 - oom-killer disable knob and oom-notifier
45 - Root cgroup has no limit controls.
46
47 Kernel memory and Hugepages are not under control yet. We just manage
48 pages on LRU. To add more controls, we have to take care of performance.
49
50Brief summary of control files.
51
52 tasks # attach a task(thread) and show list of threads
53 cgroup.procs # show list of processes
54 cgroup.event_control # an interface for event_fd()
55 memory.usage_in_bytes # show current memory(RSS+Cache) usage.
56 memory.memsw.usage_in_bytes # show current memory+Swap usage
57 memory.limit_in_bytes # set/show limit of memory usage
58 memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
59 memory.failcnt # show the number of times memory usage hit limits
60 memory.memsw.failcnt # show the number of times memory+Swap usage hit limits
61 memory.max_usage_in_bytes # show max memory usage recorded
62 memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded
63 memory.soft_limit_in_bytes # set/show soft limit of memory usage
64 memory.stat # show various statistics
65 memory.use_hierarchy # set/show hierarchical account enabled
66 memory.force_empty # trigger forced move charge to parent
67 memory.swappiness # set/show swappiness parameter of vmscan
68 (See sysctl's vm.swappiness)
69 memory.move_charge_at_immigrate # set/show controls of moving charges
70 memory.oom_control # set/show oom controls.
71
361. History 721. History
37 73
38The memory controller has a long history. A request for comments for the memory 74The memory controller has a long history. A request for comments for the memory
@@ -106,14 +142,14 @@ the necessary data structures and check if the cgroup that is being charged
106is over its limit. If it is then reclaim is invoked on the cgroup. 142is over its limit. If it is then reclaim is invoked on the cgroup.
107More details can be found in the reclaim section of this document. 143More details can be found in the reclaim section of this document.
108If everything goes well, a page meta-data-structure called page_cgroup is 144If everything goes well, a page meta-data-structure called page_cgroup is
109allocated and associated with the page. This routine also adds the page to 145updated. page_cgroup has its own LRU on cgroup.
110the per cgroup LRU. 146(*) page_cgroup structure is allocated at boot/memory-hotplug time.
111 147
1122.2.1 Accounting details 1482.2.1 Accounting details
113 149
114All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. 150All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
115(some pages which never be reclaimable and will not be on global LRU 151Some pages which are never reclaimable and will not be on the global LRU
116 are not accounted. we just accounts pages under usual vm management.) 152are not accounted. We just account pages under usual VM management.
117 153
118RSS pages are accounted at page_fault unless they've already been accounted 154RSS pages are accounted at page_fault unless they've already been accounted
119for earlier. A file page will be accounted for as Page Cache when it's 155for earlier. A file page will be accounted for as Page Cache when it's
@@ -121,12 +157,19 @@ inserted into inode (radix-tree). While it's mapped into the page tables of
121processes, duplicate accounting is carefully avoided. 157processes, duplicate accounting is carefully avoided.
122 158
123A RSS page is unaccounted when it's fully unmapped. A PageCache page is 159A RSS page is unaccounted when it's fully unmapped. A PageCache page is
124unaccounted when it's removed from radix-tree. 160unaccounted when it's removed from radix-tree. Even if RSS pages are fully
161unmapped (by kswapd), they may exist as SwapCache in the system until they
162are really freed. Such SwapCaches are also accounted.
163A swapped-in page is not accounted until it's mapped.
164
165Note: The kernel does swapin-readahead and reads multiple swaps at once.
166This means swapped-in pages may contain pages for tasks other than the task
167causing the page fault. So, we avoid accounting at swap-in I/O.
125 168
126At page migration, accounting information is kept. 169At page migration, accounting information is kept.
127 170
128Note: we just account pages-on-lru because our purpose is to control amount 171Note: we just account pages-on-LRU because our purpose is to control amount
129of used pages. not-on-lru pages are tend to be out-of-control from vm view. 172of used pages; not-on-LRU pages tend to be out-of-control from VM view.
130 173
1312.3 Shared Page Accounting 1742.3 Shared Page Accounting
132 175
@@ -143,6 +186,7 @@ caller of swapoff rather than the users of shmem.
143 186
144 187
1452.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) 1882.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
189
146Swap Extension allows you to record charge for swap. A swapped-in page is 190Swap Extension allows you to record charge for swap. A swapped-in page is
147charged back to original page allocator if possible. 191charged back to original page allocator if possible.
148 192
@@ -150,13 +194,20 @@ When swap is accounted, following files are added.
150 - memory.memsw.usage_in_bytes. 194 - memory.memsw.usage_in_bytes.
151 - memory.memsw.limit_in_bytes. 195 - memory.memsw.limit_in_bytes.
152 196
153usage of mem+swap is limited by memsw.limit_in_bytes. 197memsw means memory+swap. Usage of memory+swap is limited by
198memsw.limit_in_bytes.
154 199
155* why 'mem+swap' rather than swap. 200Example: Assume a system with 4G of swap. A task which allocates 6G of memory
201(by mistake) under 2G memory limitation will use all swap.
202In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
203By using memsw limit, you can avoid system OOM which can be caused by swap
204shortage.
205
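The commands for the example above would look like this (a minimal
sketch; the cgroup path is illustrative):

# echo 2G > /cgroups/0/memory.limit_in_bytes
# echo 3G > /cgroups/0/memory.memsw.limit_in_bytes
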
206* why 'memory+swap' rather than swap.
156The global LRU(kswapd) can swap out arbitrary pages. Swap-out means 207The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
157to move account from memory to swap...there is no change in usage of 208to move account from memory to swap...there is no change in usage of
158mem+swap. In other words, when we want to limit the usage of swap without 209memory+swap. In other words, when we want to limit the usage of swap without
159affecting global LRU, mem+swap limit is better than just limiting swap from 210affecting global LRU, memory+swap limit is better than just limiting swap from
160OS point of view. 211OS point of view.
161 212
162* What happens when a cgroup hits memory.memsw.limit_in_bytes 213* What happens when a cgroup hits memory.memsw.limit_in_bytes
@@ -168,12 +219,12 @@ it by cgroup.
168 219
1692.5 Reclaim 2202.5 Reclaim
170 221
171Each cgroup maintains a per cgroup LRU that consists of an active 222Each cgroup maintains a per cgroup LRU which has the same structure as
172and inactive list. When a cgroup goes over its limit, we first try 223the global VM. When a cgroup goes over its limit, we first try
173to reclaim memory from the cgroup so as to make space for the new 224to reclaim memory from the cgroup so as to make space for the new
174pages that the cgroup has touched. If the reclaim is unsuccessful, 225pages that the cgroup has touched. If the reclaim is unsuccessful,
175an OOM routine is invoked to select and kill the bulkiest task in the 226an OOM routine is invoked to select and kill the bulkiest task in the
176cgroup. 227cgroup. (See 10. OOM Control below.)
177 228
178The reclaim algorithm has not been modified for cgroups, except that 229The reclaim algorithm has not been modified for cgroups, except that
179pages that are selected for reclaiming come from the per cgroup LRU 230pages that are selected for reclaiming come from the per cgroup LRU
@@ -182,13 +233,24 @@ list.
182NOTE: Reclaim does not work for the root cgroup, since we cannot set any 233NOTE: Reclaim does not work for the root cgroup, since we cannot set any
183limits on the root cgroup. 234limits on the root cgroup.
184 235
1852. Locking 236Note2: When panic_on_oom is set to "2", the whole system will panic.
237
238When the oom event notifier is registered, an event will be delivered.
239(See oom_control section)
240
2412.6 Locking
186 242
187The memory controller uses the following hierarchy 243 lock_page_cgroup()/unlock_page_cgroup() should not be called under
244 mapping->tree_lock.
188 245
1891. zone->lru_lock is used for selecting pages to be isolated 246 The other lock order is as follows:
1902. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) 247 PG_locked.
1913. lock_page_cgroup() is used to protect page->page_cgroup 248 mm->page_table_lock
249 zone->lru_lock
250 lock_page_cgroup.
251 In many cases, just lock_page_cgroup() is called.
252 per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
253 zone->lru_lock; it has no lock of its own.
192 254
1933. User Interface 2553. User Interface
194 256
@@ -197,6 +259,7 @@ The memory controller uses the following hierarchy
197a. Enable CONFIG_CGROUPS 259a. Enable CONFIG_CGROUPS
198b. Enable CONFIG_RESOURCE_COUNTERS 260b. Enable CONFIG_RESOURCE_COUNTERS
199c. Enable CONFIG_CGROUP_MEM_RES_CTLR 261c. Enable CONFIG_CGROUP_MEM_RES_CTLR
262d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
200 263
2011. Prepare the cgroups 2641. Prepare the cgroups
202# mkdir -p /cgroups 265# mkdir -p /cgroups
@@ -204,31 +267,28 @@ c. Enable CONFIG_CGROUP_MEM_RES_CTLR
204 267
2052. Make the new group and move bash into it 2682. Make the new group and move bash into it
206# mkdir /cgroups/0 269# mkdir /cgroups/0
207# echo $$ > /cgroups/0/tasks 270# echo $$ > /cgroups/0/tasks
208 271
209Since now we're in the 0 cgroup, 272Now that we're in the 0 cgroup, we can alter the memory limit:
210We can alter the memory limit:
211# echo 4M > /cgroups/0/memory.limit_in_bytes 273# echo 4M > /cgroups/0/memory.limit_in_bytes
212 274
213NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, 275NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
214mega or gigabytes. 276mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
277
215NOTE: We can write "-1" to reset the *.limit_in_bytes (unlimited). 278NOTE: We can write "-1" to reset the *.limit_in_bytes (unlimited).
216NOTE: We cannot set limits on the root cgroup any more. 279NOTE: We cannot set limits on the root cgroup any more.
217 280
218# cat /cgroups/0/memory.limit_in_bytes 281# cat /cgroups/0/memory.limit_in_bytes
2194194304 2824194304
220 283
221NOTE: The interface has now changed to display the usage in bytes
222instead of pages
223
224We can check the usage: 284We can check the usage:
225# cat /cgroups/0/memory.usage_in_bytes 285# cat /cgroups/0/memory.usage_in_bytes
2261216512 2861216512
227 287
228A successful write to this file does not guarantee a successful setting of 288A successful write to this file does not guarantee a successful setting of
229this limit to the value written into the file. This can be due to a 289this limit to the value written into the file. This can be due to a
230number of factors, such as rounding up to page boundaries or the total 290number of factors, such as rounding up to page boundaries or the total
231availability of memory on the system. The user is required to re-read 291availability of memory on the system. The user is required to re-read
232this file after a write to see the value committed by the kernel. 292this file after a write to see the value committed by the kernel.
233 293
234# echo 1 > memory.limit_in_bytes 294# echo 1 > memory.limit_in_bytes
@@ -243,15 +303,23 @@ caches, RSS and Active pages/Inactive pages are shown.
243 303
2444. Testing 3044. Testing
245 305
246Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. 306For testing features and implementation, see memcg_test.txt.
247Apart from that v6 has been tested with several applications and regular 307
248daily use. The controller has also been tested on the PPC64, x86_64 and 308Performance testing is also important. To see the pure memory controller's overhead,
249UML platforms. 309testing on tmpfs will give you good numbers for its small overheads.
310Example: do a kernel make on tmpfs.
311
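A rough sketch of such a run (paths, cgroup name and job count are
illustrative):

# mount -t tmpfs none /mnt/tmpfs
# cp -r /path/to/linux /mnt/tmpfs
# echo $$ > /cgroups/0/tasks
# cd /mnt/tmpfs/linux; make -j4
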
312Page-fault scalability is also important. When measuring parallel
313page faults, a multi-process test may be better than a multi-thread
314test because the latter has noise from shared objects/status.
315
316But the above two test extreme situations.
317Running your usual workload under the memory controller is always helpful.
250 318
2514.1 Troubleshooting 3194.1 Troubleshooting
252 320
253Sometimes a user might find that the application under a cgroup is 321Sometimes a user might find that the application under a cgroup is
254terminated. There are several causes for this: 322terminated by the OOM killer. There are several causes for this:
255 323
2561. The cgroup limit is too low (just too low to do anything useful) 3241. The cgroup limit is too low (just too low to do anything useful)
2572. The user is using anonymous memory and swap is turned off or too low 3252. The user is using anonymous memory and swap is turned off or too low
@@ -259,21 +327,29 @@ terminated. There are several causes for this:
259A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of 327A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
260some of the pages cached in the cgroup (page cache pages). 328some of the pages cached in the cgroup (page cache pages).
261 329
330To know what happens, disabling the OOM killer (see 10. OOM Control below)
331and observing the cgroup's behavior will be helpful.
332
2624.2 Task migration 3334.2 Task migration
263 334
264When a task migrates from one cgroup to another, it's charge is not 335When a task migrates from one cgroup to another, its charge is not
265carried forward. The pages allocated from the original cgroup still 336carried forward by default. The pages allocated from the original cgroup still
266remain charged to it, the charge is dropped when the page is freed or 337remain charged to it, the charge is dropped when the page is freed or
267reclaimed. 338reclaimed.
268 339
340You can move charges of a task along with task migration.
341See 8. "Move charges at task migration"
342
2694.3 Removing a cgroup 3434.3 Removing a cgroup
270 344
271A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a 345A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
272cgroup might have some charge associated with it, even though all 346cgroup might have some charge associated with it, even though all
273tasks have migrated away from it. 347tasks have migrated away from it. (because we charge against pages, not
274Such charges are freed(at default) or moved to its parent. When moved, 348against tasks.)
275both of RSS and CACHES are moved to parent. 349
276If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also. 350Such charges are freed or moved to their parent. At moving, both of RSS
351and CACHES are moved to parent.
352rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
277 353
278Charges recorded in swap information are not updated at removal of a cgroup. 354Charges recorded in swap information are not updated at removal of a cgroup.
279Recorded information is discarded and a cgroup which uses swap (swapcache) 355Recorded information is discarded and a cgroup which uses swap (swapcache)
@@ -289,10 +365,10 @@ will be charged as a new owner of it.
289 365
290 # echo 0 > memory.force_empty 366 # echo 0 > memory.force_empty
291 367
292 Almost all pages tracked by this memcg will be unmapped and freed. Some of 368 Almost all pages tracked by this memory cgroup will be unmapped and freed.
293 pages cannot be freed because it's locked or in-use. Such pages are moved 369 Some pages cannot be freed because they are locked or in-use. Such pages are
294 to parent and this cgroup will be empty. But this may return -EBUSY in 370 moved to parent and this cgroup will be empty. This may return -EBUSY if
295 some too busy case. 371 VM is too busy to free/move all pages immediately.
296 372
297 The typical use case of this interface is calling it before rmdir(). 373 The typical use case of this interface is calling it before rmdir().
298 Because rmdir() moves all pages to parent, some out-of-use page caches can be 374 Because rmdir() moves all pages to parent, some out-of-use page caches can be
@@ -302,19 +378,41 @@ will be charged as a new owner of it.
302 378
303memory.stat file includes the following statistics 379memory.stat file includes the following statistics
304 380
381# per-memory cgroup local status
305cache - # of bytes of page cache memory. 382cache - # of bytes of page cache memory.
306rss - # of bytes of anonymous and swap cache memory. 383rss - # of bytes of anonymous and swap cache memory.
384mapped_file - # of bytes of mapped file (includes tmpfs/shmem)
307pgpgin - # of pages paged in (equivalent to # of charging events). 385pgpgin - # of pages paged in (equivalent to # of charging events).
308pgpgout - # of pages paged out (equivalent to # of uncharging events). 386pgpgout - # of pages paged out (equivalent to # of uncharging events).
309active_anon - # of bytes of anonymous and swap cache memory on active 387swap - # of bytes of swap usage
310 lru list.
311inactive_anon - # of bytes of anonymous memory and swap cache memory on 388inactive_anon - # of bytes of anonymous memory and swap cache memory on
312 inactive lru list. 389 inactive LRU list.
313active_file - # of bytes of file-backed memory on active lru list. 390active_anon - # of bytes of anonymous and swap cache memory on active
314inactive_file - # of bytes of file-backed memory on inactive lru list. 391 active LRU list.
392inactive_file - # of bytes of file-backed memory on inactive LRU list.
393active_file - # of bytes of file-backed memory on active LRU list.
315unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). 394unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc).
316 395
317The following additional stats are dependent on CONFIG_DEBUG_VM. 396# status considering hierarchy (see memory.use_hierarchy settings)
397
398hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy
399 under which the memory cgroup is
400hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to
401 hierarchy under which memory cgroup is.
402
403total_cache - sum of all children's "cache"
404total_rss - sum of all children's "rss"
405total_mapped_file - sum of all children's "mapped_file"
406total_pgpgin - sum of all children's "pgpgin"
407total_pgpgout - sum of all children's "pgpgout"
408total_swap - sum of all children's "swap"
409total_inactive_anon - sum of all children's "inactive_anon"
410total_active_anon - sum of all children's "active_anon"
411total_inactive_file - sum of all children's "inactive_file"
412total_active_file - sum of all children's "active_file"
413total_unevictable - sum of all children's "unevictable"
414
415# The following additional stats are dependent on CONFIG_DEBUG_VM.
318 416
319inactive_ratio - VM internal parameter. (see mm/page_alloc.c) 417inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
320recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) 418recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
@@ -323,24 +421,37 @@ recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
323recent_scanned_file - VM internal parameter. (see mm/vmscan.c) 421recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
324 422
325Memo: 423Memo:
326 recent_rotated means recent frequency of lru rotation. 424 recent_rotated means recent frequency of LRU rotation.
327 recent_scanned means recent # of scans to lru. 425 recent_scanned means recent # of scans to LRU.
328 shown for better debugging; please see the code for exact meanings. 426 shown for better debugging; please see the code for exact meanings.
329 427
330Note: 428Note:
331 Only anonymous and swap cache memory is listed as part of 'rss' stat. 429 Only anonymous and swap cache memory is listed as part of 'rss' stat.
332 This should not be confused with the true 'resident set size' or the 430 This should not be confused with the true 'resident set size' or the
333 amount of physical memory used by the cgroup. Per-cgroup rss 431 amount of physical memory used by the cgroup.
334 accounting is not done yet. 432 "rss + mapped_file" will give you the resident set size of the cgroup.
433 (Note: file and shmem may be shared among other cgroups. In that case,
434 mapped_file is accounted only when the memory cgroup is owner of the page
435 cache.)
335 436
3365.3 swappiness 4375.3 swappiness
337 Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
338 438
339 Following cgroups' swapiness can't be changed. 439Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
340 - root cgroup (uses /proc/sys/vm/swappiness). 440
341 - a cgroup which uses hierarchy and it has child cgroup. 441Following cgroups' swappiness can't be changed.
342 - a cgroup which uses hierarchy and not the root of hierarchy. 442- root cgroup (uses /proc/sys/vm/swappiness).
342 - a cgroup which uses hierarchy and not the root of hierarchy. 443- a cgroup which uses hierarchy and has other cgroup(s) below it.
444- a cgroup which uses hierarchy and not the root of hierarchy.
445
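For any other cgroup, it can be set like the sysctl (the value is
illustrative):

# echo 30 > memory.swappiness
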
4465.4 failcnt
343 447
448A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
449This failcnt(== failure count) shows the number of times that a usage counter
450hit its limit. When a memory cgroup hits a limit, failcnt increases and
451memory under it will be reclaimed.
452
453You can reset failcnt by writing 0 to failcnt file.
454# echo 0 > .../memory.failcnt
344 455
3456. Hierarchy support 4566. Hierarchy support
346 457
@@ -359,13 +470,13 @@ hierarchy
359 470
360In the diagram above, with hierarchical accounting enabled, all memory 471In the diagram above, with hierarchical accounting enabled, all memory
361usage of e is accounted to its ancestors up until the root (i.e., c and root) 472usage of e is accounted to its ancestors up until the root (i.e., c and root)
362that have memory.use_hierarchy enabled. If one of the ancestors goes over its 473that have memory.use_hierarchy enabled. If one of the ancestors goes over its
363limit, the reclaim algorithm reclaims from the tasks in the ancestor and the 474limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
364children of the ancestor. 475children of the ancestor.
365 476
3666.1 Enabling hierarchical accounting and reclaim 4776.1 Enabling hierarchical accounting and reclaim
367 478
368The memory controller by default disables the hierarchy feature. Support 479A memory cgroup by default disables the hierarchy feature. Support
369can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup 480can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup
370 481
371# echo 1 > memory.use_hierarchy 482# echo 1 > memory.use_hierarchy
@@ -375,9 +486,10 @@ The feature can be disabled by
375# echo 0 > memory.use_hierarchy 486# echo 0 > memory.use_hierarchy
376 487
377NOTE1: Enabling/disabling will fail if the cgroup already has other 488NOTE1: Enabling/disabling will fail if the cgroup already has other
378cgroups created below it. 489 cgroups created below it.
379 490
380NOTE2: This feature can be enabled/disabled per subtree. 491NOTE2: When panic_on_oom is set to "2", the whole system will panic in
492 case of an OOM event in any cgroup.
381 493
3827. Soft limits 4947. Soft limits
383 495
@@ -387,7 +499,7 @@ is to allow control groups to use as much of the memory as needed, provided
387a. There is no memory contention 499a. There is no memory contention
388b. They do not exceed their hard limit 500b. They do not exceed their hard limit
389 501
390When the system detects memory contention or low memory control groups 502When the system detects memory contention or low memory, control groups
391are pushed back to their soft limits. If the soft limit of each control 503are pushed back to their soft limits. If the soft limit of each control
392group is very high, they are pushed back as much as possible to make 504group is very high, they are pushed back as much as possible to make
393sure that one control group does not starve the others of memory. 505sure that one control group does not starve the others of memory.
@@ -401,7 +513,7 @@ it gets invoked from balance_pgdat (kswapd).
4017.1 Interface 5137.1 Interface
402 514
403Soft limits can be setup by using the following commands (in this example we 515Soft limits can be setup by using the following commands (in this example we
404assume a soft limit of 256 megabytes) 516assume a soft limit of 256 MiB)
405 517
406# echo 256M > memory.soft_limit_in_bytes 518# echo 256M > memory.soft_limit_in_bytes
407 519
@@ -414,7 +526,121 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
414NOTE2: It is recommended to always set the soft limit below the hard limit, 526NOTE2: It is recommended to always set the soft limit below the hard limit,
415 otherwise the hard limit will take precedence. 527 otherwise the hard limit will take precedence.
416 528
4178. TODO 5298. Move charges at task migration
530
531Users can move charges associated with a task along with task migration, that
532is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
533This feature is not supported in !CONFIG_MMU environments because of lack of
534page tables.
535
5368.1 Interface
537
538This feature is disabled by default. It can be enabled (and disabled again) by
539writing to memory.move_charge_at_immigrate of the destination cgroup.
540
541If you want to enable it:
542
543# echo (some positive value) > memory.move_charge_at_immigrate
544
545Note: Each bit of move_charge_at_immigrate has its own meaning about what type
546 of charges should be moved. See 8.2 for details.
547Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread
548 group.
549Note: If we cannot find enough space for the task in the destination cgroup, we
550 try to make space by reclaiming memory. Task migration may fail if we
551 cannot make enough space.
552Note: It can take several seconds if you move many charges.
553
554And if you want to disable it again:
555
556# echo 0 > memory.move_charge_at_immigrate
557
5588.2 Type of charges which can be moved
559
560Each bit of move_charge_at_immigrate has its own meaning about what type of
561charges should be moved. But in any case, it must be noted that an account of
562a page or a swap can be moved only when it is charged to the task's current (old)
563memory cgroup.
564
565 bit | what type of charges would be moved ?
566 -----+------------------------------------------------------------------------
567 0 | A charge of an anonymous page (or swap of it) used by the target task.
568 | Those pages and swaps must be used only by the target task. You must
569 | enable Swap Extension(see 2.4) to enable move of swap charges.
570 -----+------------------------------------------------------------------------
571 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
572 | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
573 | anonymous pages, file pages(and swaps) in the range mmapped by the task
574 | will be moved even if the task hasn't done page fault, i.e. they might
575 | not be the task's "RSS", but other task's "RSS" that maps the same file.
576 | And mapcount of the page is ignored(the page can be moved even if
577 | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to
578 | enable move of swap charges.
579
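For example, to move both types of charges above (bits 0 and 1), write 3
to the destination cgroup:

# echo 3 > memory.move_charge_at_immigrate
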
5808.3 TODO
581
582- Implement madvise(2) to let users decide the vma to be moved or not to be
583 moved.
584- All of moving charge operations are done under cgroup_mutex. It's not good
585 behavior to hold the mutex too long, so we may need some trick.
586
5879. Memory thresholds
588
589Memory cgroup implements memory thresholds using cgroups notification
590API (see cgroups.txt). It allows you to register multiple memory and memsw
591thresholds and get notifications when a threshold is crossed.
592
593To register a threshold, an application needs to:
594- create an eventfd using eventfd(2);
595- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
596- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
597 cgroup.event_control.
598
599The application will be notified through the eventfd when memory usage
600crosses a threshold in either direction.
601
602It's applicable to both root and non-root cgroups.
603
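A quick way to try this is the helper built from
Documentation/cgroups/cgroup_event_listener.c, which performs the three
steps above (the path and threshold are illustrative):

# ./cgroup_event_listener /cgroups/0/memory.usage_in_bytes 5M

It blocks and prints a message every time the threshold is crossed.
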
60410. OOM Control
605
606memory.oom_control file is for OOM notification and other controls.
607
608Memory cgroup implements OOM notifier using cgroup notification
609API (See cgroups.txt). It allows you to register multiple OOM notification
610listeners and get a notification when an OOM happens.
611
612To register a notifier, an application needs to:
613 - create an eventfd using eventfd(2)
614 - open memory.oom_control file
615 - write string like "<event_fd> <fd of memory.oom_control>" to
616 cgroup.event_control
617
618The application will be notified through the eventfd when an OOM happens.
619OOM notification doesn't work for the root cgroup.
620
621You can disable the OOM-killer by writing "1" to the memory.oom_control file, as:
622
623 #echo 1 > memory.oom_control
624
625This operation is only allowed to the top cgroup of a sub-hierarchy.
626If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
627in the memory cgroup's OOM-waitqueue when they request accountable memory.
628
629To make them run again, you have to relax the memory cgroup's OOM status by
630 * enlarging the limit or reducing usage.
631To reduce usage,
632 * kill some tasks.
633 * move some tasks to another group with account migration.
634 * remove some files (on tmpfs?)
635
636Then, stopped tasks will work again.
637
638At reading, current status of OOM is shown.
639 oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
640 under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
641 be stopped.)
642
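For example (the values shown are illustrative, for a cgroup with the
default settings):

# cat memory.oom_control
oom_kill_disable 0
under_oom 0
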
64311. TODO
418 644
4191. Add support for accounting huge pages (as a separate controller) 6451. Add support for accounting huge pages (as a separate controller)
4202. Make per-cgroup scanner reclaim not-shared pages first 6462. Make per-cgroup scanner reclaim not-shared pages first