author     Glenn Elliott <gelliott@cs.unc.edu>  2012-03-04 19:47:13 -0500
committer  Glenn Elliott <gelliott@cs.unc.edu>  2012-03-04 19:47:13 -0500
commit     c71c03bda1e86c9d5198c5d83f712e695c4f2a1e (patch)
tree       ecb166cb3e2b7e2adb3b5e292245fefd23381ac8 /Documentation/cgroups
parent     ea53c912f8a86a8567697115b6a0d8152beee5c8 (diff)
parent     6a00f206debf8a5c8899055726ad127dbeeed098 (diff)

Merge branch 'mpi-master' into wip-k-fmlp

Conflicts:
	litmus/sched_cedf.c
Diffstat (limited to 'Documentation/cgroups')
-rw-r--r--  Documentation/cgroups/blkio-controller.txt     186
-rw-r--r--  Documentation/cgroups/cgroup_event_listener.c    2
-rw-r--r--  Documentation/cgroups/cgroups.txt              143
-rw-r--r--  Documentation/cgroups/cpuacct.txt               21
-rw-r--r--  Documentation/cgroups/cpusets.txt               45
-rw-r--r--  Documentation/cgroups/devices.txt                6
-rw-r--r--  Documentation/cgroups/freezer-subsystem.txt     20
-rw-r--r--  Documentation/cgroups/memcg_test.txt             2
-rw-r--r--  Documentation/cgroups/memory.txt                78
9 files changed, 346 insertions(+), 157 deletions(-)
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 6919d62591d9..84f0a15fc210 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -8,12 +8,17 @@ both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
 Plan is to use the same cgroup based management interface for blkio controller
 and based on user options switch IO policies in the background.
 
-In the first phase, this patchset implements proportional weight time based
-division of disk policy. It is implemented in CFQ. Hence this policy takes
-effect only on leaf nodes when CFQ is being used.
+Currently two IO control policies are implemented. First one is proportional
+weight time based division of disk policy. It is implemented in CFQ. Hence
+this policy takes effect only on leaf nodes when CFQ is being used. The second
+one is throttling policy which can be used to specify upper IO rate limits
+on devices. This policy is implemented in generic block layer and can be
+used on leaf nodes as well as higher level logical devices like device mapper.
 
 HOWTO
 =====
+Proportional Weight division of bandwidth
+-----------------------------------------
 You can do a very simple testing of running two dd threads in two different
 cgroups. Here is what you can do.
 
@@ -23,16 +28,19 @@ cgroups. Here is what you can do.
 - Enable group scheduling in CFQ
         CONFIG_CFQ_GROUP_IOSCHED=y
 
-- Compile and boot into kernel and mount IO controller (blkio).
+- Compile and boot into kernel and mount IO controller (blkio); see
+  cgroups.txt, Why are cgroups needed?.
 
-        mount -t cgroup -o blkio none /cgroup
+        mount -t tmpfs cgroup_root /sys/fs/cgroup
+        mkdir /sys/fs/cgroup/blkio
+        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
 
 - Create two cgroups
-        mkdir -p /cgroup/test1/ /cgroup/test2
+        mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2
 
 - Set weights of group test1 and test2
-        echo 1000 > /cgroup/test1/blkio.weight
-        echo 500 > /cgroup/test2/blkio.weight
+        echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
+        echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight
 
 - Create two same size files (say 512MB each) on same disk (file1, file2) and
   launch two dd threads in different cgroup to read those files.
@@ -41,12 +49,12 @@ cgroups. Here is what you can do.
         echo 3 > /proc/sys/vm/drop_caches
 
         dd if=/mnt/sdb/zerofile1 of=/dev/null &
-        echo $! > /cgroup/test1/tasks
-        cat /cgroup/test1/tasks
+        echo $! > /sys/fs/cgroup/blkio/test1/tasks
+        cat /sys/fs/cgroup/blkio/test1/tasks
 
         dd if=/mnt/sdb/zerofile2 of=/dev/null &
-        echo $! > /cgroup/test2/tasks
-        cat /cgroup/test2/tasks
+        echo $! > /sys/fs/cgroup/blkio/test2/tasks
+        cat /sys/fs/cgroup/blkio/test2/tasks
 
 - At macro level, first dd should finish first. To get more precise data, keep
   on looking at (with the help of script), at blkio.disk_time and
@@ -55,6 +63,62 @@ cgroups. Here is what you can do.
   group dispatched to the disk. We provide fairness in terms of disk time, so
   ideally io.disk_time of cgroups should be in proportion to the weight.
 
+Throttling/Upper Limit policy
+-----------------------------
+- Enable Block IO controller
+        CONFIG_BLK_CGROUP=y
+
+- Enable throttling in block layer
+        CONFIG_BLK_DEV_THROTTLING=y
+
+- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
+        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
+
+- Specify a bandwidth rate on particular device for root group. The format
+  for policy is "<major>:<minor> <byes_per_second>".
+
+        echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
+
+  Above will put a limit of 1MB/second on reads happening for root group
+  on device having major/minor number 8:16.
+
+- Run dd to read a file and see if rate is throttled to 1MB/s or not.
+
+        # dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
+        # iflag=direct
+        1024+0 records in
+        1024+0 records out
+        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
+
+  Limits for writes can be put using blkio.throttle.write_bps_device file.
+
+Hierarchical Cgroups
+====================
+- Currently none of the IO control policy supports hierarhical groups. But
+  cgroup interface does allow creation of hierarhical cgroups and internally
+  IO policies treat them as flat hierarchy.
+
+  So this patch will allow creation of cgroup hierarhcy but at the backend
+  everything will be treated as flat. So if somebody created a hierarchy like
+  as follows.
+
+                        root
+                        /  \
+                    test1   test2
+                        |
+                    test3
+
+  CFQ and throttling will practically treat all groups at same level.
+
+                pivot
+             /  /   \  \
+        root  test1 test2  test3
+
+  Down the line we can implement hierarchical accounting/control support
+  and also introduce a new cgroup file "use_hierarchy" which will control
+  whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
+  This is how memory controller also has implemented the things.
+
 Various user visible config options
 ===================================
 CONFIG_BLK_CGROUP
@@ -68,13 +132,18 @@ CONFIG_CFQ_GROUP_IOSCHED
         - Enables group scheduling in CFQ. Currently only 1 level of group
           creation is allowed.
 
+CONFIG_BLK_DEV_THROTTLING
+        - Enable block device throttling support in block layer.
+
 Details of cgroup files
 =======================
+Proportional weight policy files
+--------------------------------
 - blkio.weight
         - Specifies per cgroup weight. This is default weight of the group
           on all the devices until and unless overridden by per device rule.
           (See blkio.weight_device).
-          Currently allowed range of weights is from 100 to 1000.
+          Currently allowed range of weights is from 10 to 1000.
 
 - blkio.weight_device
         - One can specify per cgroup per device rules using this interface.
@@ -83,7 +152,7 @@ Details of cgroup files
 
           Following is the format.
 
-          #echo dev_maj:dev_minor weight > /path/to/cgroup/blkio.weight_device
+          # echo dev_maj:dev_minor weight > blkio.weight_device
           Configure weight=300 on /dev/sdb (8:16) in this cgroup
           # echo 8:16 300 > blkio.weight_device
           # cat blkio.weight_device
@@ -210,40 +279,73 @@ Details of cgroup files
           and minor number of the device and third field specifies the number
           of times a group was dequeued from a particular device.
 
+Throttling/Upper limit policy files
+-----------------------------------
+- blkio.throttle.read_bps_device
+        - Specifies upper limit on READ rate from the device. IO rate is
+          specified in bytes per second. Rules are per deivce. Following is
+          the format.
+
+  echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
+
+- blkio.throttle.write_bps_device
+        - Specifies upper limit on WRITE rate to the device. IO rate is
+          specified in bytes per second. Rules are per deivce. Following is
+          the format.
+
+  echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
+
+- blkio.throttle.read_iops_device
+        - Specifies upper limit on READ rate from the device. IO rate is
+          specified in IO per second. Rules are per deivce. Following is
+          the format.
+
+  echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
+
+- blkio.throttle.write_iops_device
+        - Specifies upper limit on WRITE rate to the device. IO rate is
+          specified in io per second. Rules are per deivce. Following is
+          the format.
+
+  echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
+
+Note: If both BW and IOPS rules are specified for a device, then IO is
+      subjectd to both the constraints.
+
+- blkio.throttle.io_serviced
+        - Number of IOs (bio) completed to/from the disk by the group (as
+          seen by throttling policy). These are further divided by the type
+          of operation - read or write, sync or async. First two fields specify
+          the major and minor number of the device, third field specifies the
+          operation type and the fourth field specifies the number of IOs.
+
+          blkio.io_serviced does accounting as seen by CFQ and counts are in
+          number of requests (struct request). On the other hand,
+          blkio.throttle.io_serviced counts number of IO in terms of number
+          of bios as seen by throttling policy. These bios can later be
+          merged by elevator and total number of requests completed can be
+          lesser.
+
+- blkio.throttle.io_service_bytes
+        - Number of bytes transferred to/from the disk by the group. These
+          are further divided by the type of operation - read or write, sync
+          or async. First two fields specify the major and minor number of the
+          device, third field specifies the operation type and the fourth field
+          specifies the number of bytes.
+
+          These numbers should roughly be same as blkio.io_service_bytes as
+          updated by CFQ. The difference between two is that
+          blkio.io_service_bytes will not be updated if CFQ is not operating
+          on request queue.
+
+Common files among various policies
+-----------------------------------
 - blkio.reset_stats
         - Writing an int to this file will result in resetting all the stats
           for that cgroup.
 
 CFQ sysfs tunable
 =================
-/sys/block/<disk>/queue/iosched/group_isolation
------------------------------------------------
-
-If group_isolation=1, it provides stronger isolation between groups at the
-expense of throughput. By default group_isolation is 0. In general that
-means that if group_isolation=0, expect fairness for sequential workload
-only. Set group_isolation=1 to see fairness for random IO workload also.
-
-Generally CFQ will put random seeky workload in sync-noidle category. CFQ
-will disable idling on these queues and it does a collective idling on group
-of such queues. Generally these are slow moving queues and if there is a
-sync-noidle service tree in each group, that group gets exclusive access to
-disk for certain period. That means it will bring the throughput down if
-group does not have enough IO to drive deeper queue depths and utilize disk
-capacity to the fullest in the slice allocated to it. But the flip side is
-that even a random reader should get better latencies and overall throughput
-if there are lots of sequential readers/sync-idle workload running in the
-system.
-
-If group_isolation=0, then CFQ automatically moves all the random seeky queues
-in the root group. That means there will be no service differentiation for
-that kind of workload. This leads to better throughput as we do collective
-idling on root sync-noidle tree.
-
-By default one should run with group_isolation=0. If that is not sufficient
-and one wants stronger isolation between groups, then set group_isolation=1
-but this will come at cost of reduced throughput.
-
 /sys/block/<disk>/queue/iosched/slice_idle
 ------------------------------------------
 On a faster hardware CFQ can be slow, especially with sequential workload.
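For quick reference, the rule formats above can be exercised together as in the
following sketch. It mirrors the root-group example from the HOWTO section; the
8:16 major:minor pair (assumed here to be /dev/sdb) and the rate values are
purely illustrative.

  # cap reads from device 8:16 at 1MB/s for the root group
  echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
  # additionally cap writes to the same device at 100 IOs per second
  echo "8:16 100" > /sys/fs/cgroup/blkio/blkio.throttle.write_iops_device
  # read back the rules and the throttling statistics
  cat /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
  cat /sys/fs/cgroup/blkio/blkio.throttle.write_iops_device
  cat /sys/fs/cgroup/blkio/blkio.throttle.io_serviced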
diff --git a/Documentation/cgroups/cgroup_event_listener.c b/Documentation/cgroups/cgroup_event_listener.c
index 8c2bfc4a6358..3e082f96dc12 100644
--- a/Documentation/cgroups/cgroup_event_listener.c
+++ b/Documentation/cgroups/cgroup_event_listener.c
@@ -91,7 +91,7 @@ int main(int argc, char **argv)
 
 		if (ret == -1) {
 			perror("cgroup.event_control "
-				"is not accessable any more");
+				"is not accessible any more");
 			break;
 		}
 
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index b34823ff1646..cd67e90003c0 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -18,7 +18,8 @@ CONTENTS:
   1.2 Why are cgroups needed ?
   1.3 How are cgroups implemented ?
   1.4 What does notify_on_release do ?
-  1.5 How do I use cgroups ?
+  1.5 What does clone_children do ?
+  1.6 How do I use cgroups ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Attaching processes
@@ -109,22 +110,22 @@ university server with various users - students, professors, system
 tasks etc. The resource planning for this server could be along the
 following lines:
 
-       CPU : Top cpuset
+       CPU : "Top cpuset"
                        /       \
                CPUSet1         CPUSet2
                   |               |
-               (Profs)         (Students)
+               (Professors)    (Students)
 
        In addition (system tasks) are attached to topcpuset (so
        that they can run anywhere) with a limit of 20%
 
-       Memory : Professors (50%), students (30%), system (20%)
+       Memory : Professors (50%), Students (30%), system (20%)
 
-       Disk : Prof (50%), students (30%), system (20%)
+       Disk : Professors (50%), Students (30%), system (20%)
 
        Network : WWW browsing (20%), Network File System (60%), others (20%)
                                / \
-               Prof (15%)  students (5%)
+               Professors (15%)  students (5%)
 
 Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
 into NFS network class.
@@ -137,11 +138,11 @@ With the ability to classify tasks differently for different resources
 the admin can easily set up a script which receives exec notifications
 and depending on who is launching the browser he can
 
-    # echo browser_pid > /mnt/<restype>/<userclass>/tasks
+    # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
 
 With only a single hierarchy, he now would potentially have to create
 a separate cgroup for every browser launched and associate it with
-approp network and other resource class. This may lead to
+appropriate network and other resource class. This may lead to
 proliferation of such cgroups.
 
 Also lets say that the administrator would like to give enhanced network
@@ -152,9 +153,9 @@ apps enhanced CPU power,
 With ability to write pids directly to resource classes, it's just a
 matter of :
 
-       # echo pid > /mnt/network/<new_class>/tasks
+       # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
        (after some time)
-       # echo pid > /mnt/network/<orig_class>/tasks
+       # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
 
 Without this ability, he would have to split the cgroup into
 multiple separate ones and then associate the new cgroups with the
@@ -235,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup. This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -293,27 +295,39 @@ notify_on_release in the root cgroup at system boot is disabled
 value of their parents notify_on_release setting. The default value of
 a cgroup hierarchy's release_agent path is empty.
 
-1.5 How do I use cgroups ?
+1.5 What does clone_children do ?
+---------------------------------
+
+If the clone_children flag is enabled (1) in a cgroup, then all
+cgroups created beneath will call the post_clone callbacks for each
+subsystem of the newly created cgroup. Usually when this callback is
+implemented for a subsystem, it copies the values of the parent
+subsystem, this is the case for the cpuset.
+
+1.6 How do I use cgroups ?
 --------------------------
 
 To start a new job that is to be contained within a cgroup, using
 the "cpuset" cgroup subsystem, the steps are something like:
 
- 1) mkdir /dev/cgroup
- 2) mount -t cgroup -ocpuset cpuset /dev/cgroup
- 3) Create the new cgroup by doing mkdir's and write's (or echo's) in
-    the /dev/cgroup virtual file system.
- 4) Start a task that will be the "founding father" of the new job.
- 5) Attach that task to the new cgroup by writing its pid to the
-    /dev/cgroup tasks file for that cgroup.
- 6) fork, exec or clone the job tasks from this founding father task.
+ 1) mount -t tmpfs cgroup_root /sys/fs/cgroup
+ 2) mkdir /sys/fs/cgroup/cpuset
+ 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
+ 4) Create the new cgroup by doing mkdir's and write's (or echo's) in
+    the /sys/fs/cgroup virtual file system.
+ 5) Start a task that will be the "founding father" of the new job.
+ 6) Attach that task to the new cgroup by writing its pid to the
+    /sys/fs/cgroup/cpuset/tasks file for that cgroup.
+ 7) fork, exec or clone the job tasks from this founding father task.
 
 For example, the following sequence of commands will setup a cgroup
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
 and then start a subshell 'sh' in that cgroup:
 
-  mount -t cgroup cpuset -ocpuset /dev/cgroup
-  cd /dev/cgroup
+  mount -t tmpfs cgroup_root /sys/fs/cgroup
+  mkdir /sys/fs/cgroup/cpuset
+  mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
+  cd /sys/fs/cgroup/cpuset
   mkdir Charlie
   cd Charlie
   /bin/echo 2-3 > cpuset.cpus
@@ -334,28 +348,41 @@ Creating, modifying, using the cgroups can be done through the cgroup
 virtual filesystem.
 
 To mount a cgroup hierarchy with all available subsystems, type:
-# mount -t cgroup xxx /dev/cgroup
+# mount -t cgroup xxx /sys/fs/cgroup
 
 The "xxx" is not interpreted by the cgroup code, but will appear in
 /proc/mounts so may be any useful identifying string that you like.
 
+Note: Some subsystems do not work without some user input first. For instance,
+if cpusets are enabled the user will have to populate the cpus and mems files
+for each new cgroup created before that group can be used.
+
+As explained in section `1.2 Why are cgroups needed?' you should create
+different hierarchies of cgroups for each single resource or group of
+resources you want to control. Therefore, you should mount a tmpfs on
+/sys/fs/cgroup and create directories for each cgroup resource or resource
+group.
+
+# mount -t tmpfs cgroup_root /sys/fs/cgroup
+# mkdir /sys/fs/cgroup/rg1
+
 To mount a cgroup hierarchy with just the cpuset and memory
 subsystems, type:
-# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
+# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
 
 To change the set of subsystems bound to a mounted hierarchy, just
 remount with different options:
-# mount -o remount,cpuset,ns hier1 /dev/cgroup
+# mount -o remount,cpuset,blkio hier1 /sys/fs/cgroup/rg1
 
-Now memory is removed from the hierarchy and ns is added.
+Now memory is removed from the hierarchy and blkio is added.
 
-Note this will add ns to the hierarchy but won't remove memory or
+Note this will add blkio to the hierarchy but won't remove memory or
 cpuset, because the new options are appended to the old ones:
-# mount -o remount,ns /dev/cgroup
+# mount -o remount,blkio /sys/fs/cgroup/rg1
 
 To Specify a hierarchy's release_agent:
 # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
-  xxx /dev/cgroup
+  xxx /sys/fs/cgroup/rg1
 
 Note that specifying 'release_agent' more than once will return failure.
 
@@ -364,17 +391,17 @@ when the hierarchy consists of a single (root) cgroup. Supporting
 the ability to arbitrarily bind/unbind subsystems from an existing
 cgroup hierarchy is intended to be implemented in the future.
 
-Then under /dev/cgroup you can find a tree that corresponds to the
-tree of the cgroups in the system. For instance, /dev/cgroup
+Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
+tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
 is the cgroup that holds the whole system.
 
 If you want to change the value of release_agent:
-# echo "/sbin/new_release_agent" > /dev/cgroup/release_agent
+# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
 
 It can also be changed via remount.
 
-If you want to create a new cgroup under /dev/cgroup:
-# cd /dev/cgroup
+If you want to create a new cgroup under /sys/fs/cgroup/rg1:
+# cd /sys/fs/cgroup/rg1
 # mkdir my_cgroup
 
 Now you want to do something with this cgroup.
@@ -416,6 +443,20 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
+Note: Since every task is always a member of exactly one cgroup in each
+mounted hierarchy, to remove a task from its current cgroup you must
+move it into a new cgroup (possibly the root cgroup) by writing to the
+new cgroup's tasks file.
+
+Note: If the ns cgroup is active, moving a process to another cgroup can
+fail.
+
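A minimal illustration of the cgroup.procs behaviour just described (the pid
1234 and the my_cgroup directory are only placeholders):

  # move every thread of the process with pid 1234 in one step
  echo 1234 > /sys/fs/cgroup/rg1/my_cgroup/cgroup.procs
  # or move the current shell's entire threadgroup
  echo 0 > /sys/fs/cgroup/rg1/my_cgroup/cgroup.procs
  # by contrast, the tasks file moves only the single thread id written to it
  echo 1234 > /sys/fs/cgroup/rg1/my_cgroup/tasks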
 2.3 Mounting hierarchies by name
 --------------------------------
 
@@ -553,7 +594,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-               struct task_struct *task, bool threadgroup)
+               struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called prior to moving a task into a cgroup; if the subsystem
@@ -562,9 +603,14 @@ task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in future. If threadgroup is
-true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+attach() or cancel_attach() will be called in future.
+
+int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex held by caller)
+
+As can_attach, but for operations that must be run once per task to be
+attached (possibly many when using cgroup_attach_proc). Called after
+can_attach.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                    struct task_struct *task, bool threadgroup)
@@ -576,15 +622,24 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
+void pre_attach(struct cgroup *cgrp);
+(cgroup_mutex held by caller)
+
+For any non-per-thread attachment work that needs to happen before
+attach_task. Needed by cpuset.
+
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-            struct cgroup *old_cgrp, struct task_struct *task,
-            bool threadgroup)
+            struct cgroup *old_cgrp, struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
-If threadgroup is true, the subsystem should take care of all threads
-in the specified thread's threadgroup. Currently does not support any
+
+void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex held by caller)
+
+As attach, but for operations that must be run once per task to be attached,
+like can_attach_task. Called before attach. Currently does not support any
 subsystem that might need the old_cgrp for every thread in the group.
 
 void fork(struct cgroup_subsy *ss, struct task_struct *task)
@@ -608,7 +663,7 @@ always handled well.
 void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
 (cgroup_mutex held by caller)
 
-Called at the end of cgroup_clone() to do any parameter
+Called during cgroup_create() to do any parameter
 initialization which might be required before a task could attach. For
 example in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
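As a concrete sketch of the clone_children behaviour from section 1.5, using
the cpuset hierarchy mounted in the examples above (directory names are
arbitrary; the inheritance shown is the cpuset post_clone behaviour the text
describes):

  echo 1 > /sys/fs/cgroup/cpuset/cgroup.clone_children
  mkdir /sys/fs/cgroup/cpuset/child
  # with clone_children enabled, the child's cpus/mems start as a copy
  # of the parent's values instead of being empty
  cat /sys/fs/cgroup/cpuset/child/cpuset.cpus
  cat /sys/fs/cgroup/cpuset/child/cpuset.mems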
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 8b930946c52a..9ad85df4b983 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -10,26 +10,25 @@ directly present in its group.
 
 Accounting groups can be created by first mounting the cgroup filesystem.
 
-# mkdir /cgroups
-# mount -t cgroup -ocpuacct none /cgroups
-
-With the above step, the initial or the parent accounting group
-becomes visible at /cgroups. At bootup, this group includes all the
-tasks in the system. /cgroups/tasks lists the tasks in this cgroup.
-/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by
-this group which is essentially the CPU time obtained by all the tasks
+# mount -t cgroup -ocpuacct none /sys/fs/cgroup
+
+With the above step, the initial or the parent accounting group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
+by this group which is essentially the CPU time obtained by all the tasks
 in the system.
 
-New accounting groups can be created under the parent group /cgroups.
+New accounting groups can be created under the parent group /sys/fs/cgroup.
 
-# cd /cgroups
+# cd /sys/fs/cgroup
 # mkdir g1
 # echo $$ > g1
 
 The above steps create a new group g1 and move the current shell
 process (bash) into it. CPU time consumed by this bash and its children
 can be obtained from g1/cpuacct.usage and the same is accumulated in
-/cgroups/cpuacct.usage also.
+/sys/fs/cgroup/cpuacct.usage also.
 
 cpuacct.stat file lists a few statistics which further divide the
 CPU time obtained by the cgroup into user and system times. Currently
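As a quick illustration of the accounting files named above, reading them for
the g1 group created in this example (values are machine dependent):

  # cumulative CPU time, in nanoseconds, of all tasks in g1
  cat /sys/fs/cgroup/g1/cpuacct.usage
  # the same time split into user and system components
  cat /sys/fs/cgroup/g1/cpuacct.stat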
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5d0d5692a365..5b0d78e55ccc 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -661,21 +661,21 @@ than stress the kernel.
 
 To start a new job that is to be contained within a cpuset, the steps are:
 
- 1) mkdir /dev/cpuset
- 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
+ 1) mkdir /sys/fs/cgroup/cpuset
+ 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  3) Create the new cpuset by doing mkdir's and write's (or echo's) in
-    the /dev/cpuset virtual file system.
+    the /sys/fs/cgroup/cpuset virtual file system.
  4) Start a task that will be the "founding father" of the new job.
  5) Attach that task to the new cpuset by writing its pid to the
-    /dev/cpuset tasks file for that cpuset.
+    /sys/fs/cgroup/cpuset tasks file for that cpuset.
  6) fork, exec or clone the job tasks from this founding father task.
 
 For example, the following sequence of commands will setup a cpuset
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
 and then start a subshell 'sh' in that cpuset:
 
-  mount -t cgroup -ocpuset cpuset /dev/cpuset
-  cd /dev/cpuset
+  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
+  cd /sys/fs/cgroup/cpuset
   mkdir Charlie
   cd Charlie
   /bin/echo 2-3 > cpuset.cpus
@@ -693,7 +693,7 @@ There are ways to query or modify cpusets:
  - via the C library libcgroup.
    (http://sourceforge.net/projects/libcg/)
  - via the python application cset.
-   (http://developer.novell.com/wiki/index.php/Cpuset)
+   (http://code.google.com/p/cpuset/)
 
 The sched_setaffinity calls can also be done at the shell prompt using
 SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
@@ -710,14 +710,14 @@ Creating, modifying, using the cpusets can be done through the cpuset
 virtual filesystem.
 
 To mount it, type:
-# mount -t cgroup -o cpuset cpuset /dev/cpuset
+# mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
 
-Then under /dev/cpuset you can find a tree that corresponds to the
-tree of the cpusets in the system. For instance, /dev/cpuset
+Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
+tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
 is the cpuset that holds the whole system.
 
-If you want to create a new cpuset under /dev/cpuset:
-# cd /dev/cpuset
+If you want to create a new cpuset under /sys/fs/cgroup/cpuset:
+# cd /sys/fs/cgroup/cpuset
 # mkdir my_cpuset
 
 Now you want to do something with this cpuset.
@@ -725,13 +725,14 @@ Now you want to do something with this cpuset.
 
 In this directory you can find several files:
 # ls
-cpuset.cpu_exclusive  cpuset.memory_spread_slab
-cpuset.cpus  cpuset.mems
-cpuset.mem_exclusive  cpuset.sched_load_balance
-cpuset.mem_hardwall  cpuset.sched_relax_domain_level
-cpuset.memory_migrate  notify_on_release
-cpuset.memory_pressure  tasks
-cpuset.memory_spread_page
+cgroup.clone_children  cpuset.memory_pressure
+cgroup.event_control   cpuset.memory_spread_page
+cgroup.procs           cpuset.memory_spread_slab
+cpuset.cpu_exclusive   cpuset.mems
+cpuset.cpus            cpuset.sched_load_balance
+cpuset.mem_exclusive   cpuset.sched_relax_domain_level
+cpuset.mem_hardwall    notify_on_release
+cpuset.memory_migrate  tasks
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
@@ -764,12 +765,12 @@ wrapper around the cgroup filesystem.
 
 The command
 
-mount -t cpuset X /dev/cpuset
+mount -t cpuset X /sys/fs/cgroup/cpuset
 
 is equivalent to
 
-mount -t cgroup -ocpuset,noprefix X /dev/cpuset
-echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
+mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
+echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
 
 2.2 Adding/removing cpus
 ------------------------
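For convenience, the Charlie sequence started in the hunk above, carried
through the memory-node and task-attachment steps that the surrounding
document describes (the CPU and node numbers are just the example values):

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus     # CPUs 2 and 3
  /bin/echo 1 > cpuset.mems       # Memory Node 1
  /bin/echo $$ > tasks            # attach the current shell
  sh                              # this subshell now runs inside Charlie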
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 57ca4c89fe5c..16624a7f8222 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -22,16 +22,16 @@ removed from the child(ren).
 An entry is added using devices.allow, and removed using
 devices.deny. For instance
 
-	echo 'c 1:3 mr' > /cgroups/1/devices.allow
+	echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
 
 allows cgroup 1 to read and mknod the device usually known as
 /dev/null. Doing
 
-	echo a > /cgroups/1/devices.deny
+	echo a > /sys/fs/cgroup/1/devices.deny
 
 will remove the default 'a *:* rwm' entry. Doing
 
-	echo a > /cgroups/1/devices.allow
+	echo a > /sys/fs/cgroup/1/devices.allow
 
 will add the 'a *:* rwm' entry to the whitelist.
 
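A slightly fuller sketch of the whitelist syntax above; cgroup "1" and the
1:3 character device are the same examples the text uses, and devices.list
(the read-only view of the current whitelist) is assumed for the final check:

  # drop the default catch-all entry, then allow read/mknod of /dev/null (char 1:3)
  echo a > /sys/fs/cgroup/1/devices.deny
  echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
  # inspect the resulting whitelist
  cat /sys/fs/cgroup/1/devices.list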
diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
index 41f37fea1276..c21d77742a07 100644
--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -59,28 +59,28 @@ is non-freezable.
 
 * Examples of usage :
 
-   # mkdir /containers
-   # mount -t cgroup -ofreezer freezer /containers
-   # mkdir /containers/0
-   # echo $some_pid > /containers/0/tasks
+   # mkdir /sys/fs/cgroup/freezer
+   # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
+   # mkdir /sys/fs/cgroup/freezer/0
+   # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
 
 to get status of the freezer subsystem :
 
-   # cat /containers/0/freezer.state
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
    THAWED
 
 to freeze all tasks in the container :
 
-   # echo FROZEN > /containers/0/freezer.state
-   # cat /containers/0/freezer.state
+   # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
    FREEZING
-   # cat /containers/0/freezer.state
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
    FROZEN
 
 to unfreeze all tasks in the container :
 
-   # echo THAWED > /containers/0/freezer.state
-   # cat /containers/0/freezer.state
+   # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
    THAWED
 
 This is the basic mechanism which should do the right thing for user space task
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index b7eececfb195..fc8fa97a09ac 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -398,7 +398,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	written to move_charge_at_immigrate.
 
  9.10 Memory thresholds
-	Memory controler implements memory thresholds using cgroups notification
+	Memory controller implements memory thresholds using cgroups notification
 	API. You can use Documentation/cgroups/cgroup_event_listener.c to test
 	it.
 
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857dc940..06eb6d957c83 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -1,8 +1,8 @@
 Memory Resource Controller
 
-NOTE: The Memory Resource Controller has been generically been referred
-      to as the memory controller in this document. Do not confuse memory
-      controller used here with the memory controller that is used in hardware.
+NOTE: The Memory Resource Controller has generically been referred to as the
+      memory controller in this document. Do not confuse memory controller
+      used here with the memory controller that is used in hardware.
 
 (For editors)
 In this document:
@@ -52,8 +52,10 @@ Brief summary of control files.
  tasks                          # attach a task(thread) and show list of threads
  cgroup.procs                   # show list of processes
  cgroup.event_control           # an interface for event_fd()
- memory.usage_in_bytes          # show current memory(RSS+Cache) usage.
- memory.memsw.usage_in_bytes    # show current memory+Swap usage
+ memory.usage_in_bytes          # show current res_counter usage for memory
+                                 (See 5.5 for details)
+ memory.memsw.usage_in_bytes    # show current res_counter usage for memory+Swap
+                                 (See 5.5 for details)
  memory.limit_in_bytes          # set/show limit of memory usage
  memory.memsw.limit_in_bytes    # set/show limit of memory+Swap usage
  memory.failcnt                 # show the number of memory usage hits limits
@@ -68,6 +70,7 @@ Brief summary of control files.
                                  (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
  memory.oom_control             # set/show oom controls.
+ memory.numa_stat               # show the number of memory usage per numa node
 
 1. History
 
@@ -179,7 +182,7 @@ behind this approach is that a cgroup that aggressively uses a shared
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
 
-Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
+Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used.
 When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
 caller of swapoff rather than the users of shmem.
@@ -211,7 +214,7 @@ affecting global LRU, memory+swap limit is better than just limiting swap from
 OS point of view.
 
 * What happens when a cgroup hits memory.memsw.limit_in_bytes
-When a cgroup his memory.memsw.limit_in_bytes, it's useless to do swap-out
+When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
 in this cgroup. Then, swap-out will not be done by cgroup routine and file
 caches are dropped. But as mentioned above, global LRU can do swapout memory
 from it for sanity of the system's memory management state. You can't forbid
@@ -261,16 +264,17 @@ b. Enable CONFIG_RESOURCE_COUNTERS
 c. Enable CONFIG_CGROUP_MEM_RES_CTLR
 d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
 
-1. Prepare the cgroups
-# mkdir -p /cgroups
-# mount -t cgroup none /cgroups -o memory
+1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
+# mount -t tmpfs none /sys/fs/cgroup
+# mkdir /sys/fs/cgroup/memory
+# mount -t cgroup none /sys/fs/cgroup/memory -o memory
 
 2. Make the new group and move bash into it
-# mkdir /cgroups/0
-# echo $$ > /cgroups/0/tasks
+# mkdir /sys/fs/cgroup/memory/0
+# echo $$ > /sys/fs/cgroup/memory/0/tasks
 
 Since now we're in the 0 cgroup, we can alter the memory limit:
-# echo 4M > /cgroups/0/memory.limit_in_bytes
+# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
 
 NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
 mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
@@ -278,11 +282,11 @@ mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
 NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
 NOTE: We cannot set limits on the root cgroup any more.
 
-# cat /cgroups/0/memory.limit_in_bytes
+# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
 4194304
 
 We can check the usage:
-# cat /cgroups/0/memory.usage_in_bytes
+# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
 1216512
 
 A successful write to this file does not guarantee a successful set of
@@ -453,6 +457,33 @@ memory under it will be reclaimed.
 You can reset failcnt by writing 0 to failcnt file.
 # echo 0 > .../memory.failcnt
 
+5.5 usage_in_bytes
+
+For efficiency, as other kernel components, memory cgroup uses some optimization
+to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
+method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz
+value for efficient access. (Of course, when necessary, it's synchronized.)
+If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
+value in memory.stat(see 5.2).
+
+5.6 numa_stat
+
+This is similar to numa_maps but operates on a per-memcg basis. This is
+useful for providing visibility into the numa locality information within
+an memcg since the pages are allowed to be allocated from any physical
+node. One of the usecases is evaluating application performance by
+combining this information with the application's cpu allocation.
+
+We export "total", "file", "anon" and "unevictable" pages per-node for
+each memcg. The ouput format of memory.numa_stat is:
+
+total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
+file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
+anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
+unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
+
+And we have total = file + anon + unevictable.
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
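To make sections 5.5 and 5.6 concrete, a small sketch of reading these files
for the example group 0 created earlier (the memory.stat field names are the
ones section 5.2 describes; the node columns depend on the machine):

  # approximate usage, kept cheap to read (see 5.5)
  cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
  # more exact accounting from memory.stat (RSS+CACHE, plus SWAP if enabled)
  grep -E '^(rss|cache|swap) ' /sys/fs/cgroup/memory/0/memory.stat
  # per-node breakdown of the group's pages (see 5.6)
  cat /sys/fs/cgroup/memory/0/memory.numa_stat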
@@ -460,13 +491,13 @@ The hierarchy is created by creating the appropriate cgroups in the
460cgroup filesystem. Consider for example, the following cgroup filesystem 491cgroup filesystem. Consider for example, the following cgroup filesystem
461hierarchy 492hierarchy
462 493
463 root 494 root
464 / | \ 495 / | \
465 / | \ 496 / | \
466 a b c 497 a b c
467 | \ 498 | \
468 | \ 499 | \
469 d e 500 d e
470 501
471In the diagram above, with hierarchical accounting enabled, all memory 502In the diagram above, with hierarchical accounting enabled, all memory
472usage of e, is accounted to its ancestors up until the root (i.e, c and root), 503usage of e, is accounted to its ancestors up until the root (i.e, c and root),
@@ -485,8 +516,9 @@ The feature can be disabled by
 
 # echo 0 > memory.use_hierarchy
 
-NOTE1: Enabling/disabling will fail if the cgroup already has other
-       cgroups created below it.
+NOTE1: Enabling/disabling will fail if either the cgroup already has other
+       cgroups created below it, or if the parent cgroup has use_hierarchy
+       enabled.
 
 NOTE2: When panic_on_oom is set to "2", the whole system will panic in
        case of an OOM event in any cgroup.
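Finally, a short illustrative sequence for the hierarchy support described
here (group names are arbitrary; per NOTE1, use_hierarchy must be set before
any children are created):

  echo 1 > /sys/fs/cgroup/memory/0/memory.use_hierarchy
  mkdir /sys/fs/cgroup/memory/0/child
  echo 100M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
  # with hierarchical accounting, memory charged in the child also counts
  # against group 0's limit
  echo $$ > /sys/fs/cgroup/memory/0/child/tasks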