author | Glenn Elliott <gelliott@cs.unc.edu> | 2012-03-04 19:47:13 -0500 |
---|---|---|
committer | Glenn Elliott <gelliott@cs.unc.edu> | 2012-03-04 19:47:13 -0500 |
commit | c71c03bda1e86c9d5198c5d83f712e695c4f2a1e (patch) | |
tree | ecb166cb3e2b7e2adb3b5e292245fefd23381ac8 /Documentation/cgroups | |
parent | ea53c912f8a86a8567697115b6a0d8152beee5c8 (diff) | |
parent | 6a00f206debf8a5c8899055726ad127dbeeed098 (diff) |
Merge branch 'mpi-master' into wip-k-fmlp
Conflicts:
litmus/sched_cedf.c
Diffstat (limited to 'Documentation/cgroups')
-rw-r--r-- | Documentation/cgroups/blkio-controller.txt | 186
-rw-r--r-- | Documentation/cgroups/cgroup_event_listener.c | 2
-rw-r--r-- | Documentation/cgroups/cgroups.txt | 143
-rw-r--r-- | Documentation/cgroups/cpuacct.txt | 21
-rw-r--r-- | Documentation/cgroups/cpusets.txt | 45
-rw-r--r-- | Documentation/cgroups/devices.txt | 6
-rw-r--r-- | Documentation/cgroups/freezer-subsystem.txt | 20
-rw-r--r-- | Documentation/cgroups/memcg_test.txt | 2
-rw-r--r-- | Documentation/cgroups/memory.txt | 78
9 files changed, 346 insertions, 157 deletions
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 6919d62591d9..84f0a15fc210 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -8,12 +8,17 @@ both at leaf nodes as well as at intermediate nodes in a storage hierarchy. | |||
8 | Plan is to use the same cgroup based management interface for blkio controller | 8 | Plan is to use the same cgroup based management interface for blkio controller |
9 | and based on user options switch IO policies in the background. | 9 | and based on user options switch IO policies in the background. |
10 | 10 | ||
11 | In the first phase, this patchset implements proportional weight time based | 11 | Currently two IO control policies are implemented. The first one is proportional |
12 | division of disk policy. It is implemented in CFQ. Hence this policy takes | 12 | weight time based division of disk policy. It is implemented in CFQ. Hence |
13 | effect only on leaf nodes when CFQ is being used. | 13 | this policy takes effect only on leaf nodes when CFQ is being used. The second |
14 | one is a throttling policy which can be used to specify upper IO rate limits | ||
15 | on devices. This policy is implemented in the generic block layer and can be | ||
16 | used on leaf nodes as well as higher level logical devices like device mapper. | ||
14 | 17 | ||
15 | HOWTO | 18 | HOWTO |
16 | ===== | 19 | ===== |
20 | Proportional Weight division of bandwidth | ||
21 | ----------------------------------------- | ||
17 | You can do a very simple testing of running two dd threads in two different | 22 | You can do a very simple testing of running two dd threads in two different |
18 | cgroups. Here is what you can do. | 23 | cgroups. Here is what you can do. |
19 | 24 | ||
@@ -23,16 +28,19 @@ cgroups. Here is what you can do. | |||
23 | - Enable group scheduling in CFQ | 28 | - Enable group scheduling in CFQ |
24 | CONFIG_CFQ_GROUP_IOSCHED=y | 29 | CONFIG_CFQ_GROUP_IOSCHED=y |
25 | 30 | ||
26 | - Compile and boot into kernel and mount IO controller (blkio). | 31 | - Compile and boot into kernel and mount IO controller (blkio); see |
32 | cgroups.txt, Why are cgroups needed?. | ||
27 | 33 | ||
28 | mount -t cgroup -o blkio none /cgroup | 34 | mount -t tmpfs cgroup_root /sys/fs/cgroup |
35 | mkdir /sys/fs/cgroup/blkio | ||
36 | mount -t cgroup -o blkio none /sys/fs/cgroup/blkio | ||
29 | 37 | ||
30 | - Create two cgroups | 38 | - Create two cgroups |
31 | mkdir -p /cgroup/test1/ /cgroup/test2 | 39 | mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2 |
32 | 40 | ||
33 | - Set weights of group test1 and test2 | 41 | - Set weights of group test1 and test2 |
34 | echo 1000 > /cgroup/test1/blkio.weight | 42 | echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight |
35 | echo 500 > /cgroup/test2/blkio.weight | 43 | echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight |
36 | 44 | ||
37 | - Create two files of the same size (say 512MB each) on the same disk (file1, file2) and | 45 | - Create two files of the same size (say 512MB each) on the same disk (file1, file2) and |
38 | launch two dd threads in different cgroups to read those files. | 46 | launch two dd threads in different cgroups to read those files. |
@@ -41,12 +49,12 @@ cgroups. Here is what you can do. | |||
41 | echo 3 > /proc/sys/vm/drop_caches | 49 | echo 3 > /proc/sys/vm/drop_caches |
42 | 50 | ||
43 | dd if=/mnt/sdb/zerofile1 of=/dev/null & | 51 | dd if=/mnt/sdb/zerofile1 of=/dev/null & |
44 | echo $! > /cgroup/test1/tasks | 52 | echo $! > /sys/fs/cgroup/blkio/test1/tasks |
45 | cat /cgroup/test1/tasks | 53 | cat /sys/fs/cgroup/blkio/test1/tasks |
46 | 54 | ||
47 | dd if=/mnt/sdb/zerofile2 of=/dev/null & | 55 | dd if=/mnt/sdb/zerofile2 of=/dev/null & |
48 | echo $! > /cgroup/test2/tasks | 56 | echo $! > /sys/fs/cgroup/blkio/test2/tasks |
49 | cat /cgroup/test2/tasks | 57 | cat /sys/fs/cgroup/blkio/test2/tasks |
50 | 58 | ||
51 | - At the macro level, the first dd should finish first. To get more precise data, keep | 59 | - At the macro level, the first dd should finish first. To get more precise data, keep |
52 | on looking (with the help of a script, such as the sketch below) at blkio.disk_time and | 60 | on looking (with the help of a script, such as the sketch below) at blkio.disk_time and |
@@ -55,6 +63,62 @@ cgroups. Here is what you can do. | |||
55 | group dispatched to the disk. We provide fairness in terms of disk time, so | 63 | group dispatched to the disk. We provide fairness in terms of disk time, so |
56 | ideally io.disk_time of cgroups should be in proportion to the weight. | 64 | ideally io.disk_time of cgroups should be in proportion to the weight. |
57 | 65 | ||
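For example, a minimal watch loop over the two groups' stats; the group paths assume the test1/test2 setup above, and the one-second interval is arbitrary:

	while true; do
		cat /sys/fs/cgroup/blkio/test1/blkio.disk_time
		cat /sys/fs/cgroup/blkio/test2/blkio.disk_time
		sleep 1
	done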
66 | Throttling/Upper Limit policy | ||
67 | ----------------------------- | ||
68 | - Enable Block IO controller | ||
69 | CONFIG_BLK_CGROUP=y | ||
70 | |||
71 | - Enable throttling in block layer | ||
72 | CONFIG_BLK_DEV_THROTTLING=y | ||
73 | |||
74 | - Mount blkio controller (see cgroups.txt, Why are cgroups needed?) | ||
75 | mount -t cgroup -o blkio none /sys/fs/cgroup/blkio | ||
76 | |||
77 | - Specify a bandwidth rate on a particular device for the root group. The format | ||
78 | for the policy is "<major>:<minor> <bytes_per_second>". | ||
79 | |||
80 | echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device | ||
81 | |||
82 | The above will put a limit of 1MB/second on reads happening for the root group | ||
83 | on the device having major/minor number 8:16. | ||
84 | |||
85 | - Run dd to read a file and see if rate is throttled to 1MB/s or not. | ||
86 | |||
87 | # dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 \ | ||
88 |        iflag=direct | ||
89 | 1024+0 records in | ||
90 | 1024+0 records out | ||
91 | 4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s | ||
92 | |||
93 | Limits for writes can be set using the blkio.throttle.write_bps_device file. | ||
94 | |||
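For example, to also cap writes on the same device at 2MB/s (major/minor numbers as in the read example above):

	echo "8:16 2097152" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device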
95 | Hierarchical Cgroups | ||
96 | ==================== | ||
97 | - Currently none of the IO control policies supports hierarchical groups. But | ||
98 | the cgroup interface does allow creation of hierarchical cgroups and internally | ||
99 | IO policies treat them as a flat hierarchy. | ||
100 | |||
101 | So this patch will allow creation of a cgroup hierarchy but at the backend | ||
102 | everything will be treated as flat. So if somebody created a hierarchy | ||
103 | as follows: | ||
104 | |||
105 | root | ||
106 | / \ | ||
107 | test1 test2 | ||
108 | | | ||
109 | test3 | ||
110 | |||
111 | CFQ and throttling will practically treat all groups at the same level. | ||
112 | |||
113 | pivot | ||
114 | / / \ \ | ||
115 | root test1 test2 test3 | ||
116 | |||
117 | Down the line we can implement hierarchical accounting/control support | ||
118 | and also introduce a new cgroup file "use_hierarchy" which will control | ||
119 | whether the cgroup hierarchy is viewed as flat or hierarchical by the policy. | ||
120 | This is how the memory controller has implemented it as well. | ||
121 | |||
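For instance, with the groups from the first diagram (paths as in the HOWTO above), a nested group can be created and given a weight, but it competes for disk time with root, test1 and test2 as a peer rather than as a child of test1:

	mkdir /sys/fs/cgroup/blkio/test1/test3
	echo 250 > /sys/fs/cgroup/blkio/test1/test3/blkio.weight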
58 | Various user visible config options | 122 | Various user visible config options |
59 | =================================== | 123 | =================================== |
60 | CONFIG_BLK_CGROUP | 124 | CONFIG_BLK_CGROUP |
@@ -68,13 +132,18 @@ CONFIG_CFQ_GROUP_IOSCHED | |||
68 | - Enables group scheduling in CFQ. Currently only 1 level of group | 132 | - Enables group scheduling in CFQ. Currently only 1 level of group |
69 | creation is allowed. | 133 | creation is allowed. |
70 | 134 | ||
135 | CONFIG_BLK_DEV_THROTTLING | ||
136 | - Enable block device throttling support in block layer. | ||
137 | |||
71 | Details of cgroup files | 138 | Details of cgroup files |
72 | ======================= | 139 | ======================= |
140 | Proportional weight policy files | ||
141 | -------------------------------- | ||
73 | - blkio.weight | 142 | - blkio.weight |
74 | - Specifies per cgroup weight. This is the default weight of the group | 143 | - Specifies per cgroup weight. This is the default weight of the group |
75 | on all the devices until and unless overridden by a per device rule. | 144 | on all the devices until and unless overridden by a per device rule. |
76 | (See blkio.weight_device). | 145 | (See blkio.weight_device). |
77 | Currently allowed range of weights is from 100 to 1000. | 146 | Currently allowed range of weights is from 10 to 1000. |
78 | 147 | ||
79 | - blkio.weight_device | 148 | - blkio.weight_device |
80 | - One can specify per cgroup per device rules using this interface. | 149 | - One can specify per cgroup per device rules using this interface. |
@@ -83,7 +152,7 @@ Details of cgroup files | |||
83 | 152 | ||
84 | Following is the format. | 153 | Following is the format. |
85 | 154 | ||
86 | #echo dev_maj:dev_minor weight > /path/to/cgroup/blkio.weight_device | 155 | # echo dev_maj:dev_minor weight > blkio.weight_device |
87 | Configure weight=300 on /dev/sdb (8:16) in this cgroup | 156 | Configure weight=300 on /dev/sdb (8:16) in this cgroup |
88 | # echo 8:16 300 > blkio.weight_device | 157 | # echo 8:16 300 > blkio.weight_device |
89 | # cat blkio.weight_device | 158 | # cat blkio.weight_device |
@@ -210,40 +279,73 @@ Details of cgroup files | |||
210 | and minor number of the device and third field specifies the number | 279 | and minor number of the device and third field specifies the number |
211 | of times a group was dequeued from a particular device. | 280 | of times a group was dequeued from a particular device. |
212 | 281 | ||
282 | Throttling/Upper limit policy files | ||
283 | ----------------------------------- | ||
284 | - blkio.throttle.read_bps_device | ||
285 | - Specifies upper limit on READ rate from the device. IO rate is | ||
286 | specified in bytes per second. Rules are per device. Following is | ||
287 | the format. | ||
288 | |||
289 | echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device | ||
290 | |||
291 | - blkio.throttle.write_bps_device | ||
292 | - Specifies upper limit on WRITE rate to the device. IO rate is | ||
293 | specified in bytes per second. Rules are per device. Following is | ||
294 | the format. | ||
295 | |||
296 | echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device | ||
297 | |||
298 | - blkio.throttle.read_iops_device | ||
299 | - Specifies upper limit on READ rate from the device. IO rate is | ||
300 | specified in IO per second. Rules are per device. Following is | ||
301 | the format. | ||
302 | |||
303 | echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device | ||
304 | |||
305 | - blkio.throttle.write_iops_device | ||
306 | - Specifies upper limit on WRITE rate to the device. IO rate is | ||
307 | specified in IO per second. Rules are per device. Following is | ||
308 | the format. | ||
309 | |||
310 | echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device | ||
311 | |||
312 | Note: If both BW and IOPS rules are specified for a device, then IO is | ||
313 | subjected to both the constraints. | ||
314 | |||
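For example, combining a bandwidth and an IOPS rule on one device (numbers as in the HOWTO above) subjects reads to whichever limit is hit first:

	echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
	echo "8:16 100" > /sys/fs/cgroup/blkio/blkio.throttle.read_iops_device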
315 | - blkio.throttle.io_serviced | ||
316 | - Number of IOs (bio) completed to/from the disk by the group (as | ||
317 | seen by throttling policy). These are further divided by the type | ||
318 | of operation - read or write, sync or async. First two fields specify | ||
319 | the major and minor number of the device, third field specifies the | ||
320 | operation type and the fourth field specifies the number of IOs. | ||
321 | |||
322 | blkio.io_serviced does accounting as seen by CFQ and counts are in | ||
323 | number of requests (struct request). On the other hand, | ||
324 | blkio.throttle.io_serviced counts the number of IOs in terms of the | ||
325 | number of bios as seen by the throttling policy. These bios can later | ||
326 | be merged by the elevator and the total number of requests completed | ||
327 | can therefore be smaller. | ||
328 | |||
329 | - blkio.throttle.io_service_bytes | ||
330 | - Number of bytes transferred to/from the disk by the group. These | ||
331 | are further divided by the type of operation - read or write, sync | ||
332 | or async. First two fields specify the major and minor number of the | ||
333 | device, third field specifies the operation type and the fourth field | ||
334 | specifies the number of bytes. | ||
335 | |||
336 | These numbers should roughly be the same as blkio.io_service_bytes as | ||
337 | updated by CFQ. The difference between the two is that | ||
338 | blkio.io_service_bytes will not be updated if CFQ is not operating | ||
339 | on the request queue. | ||
340 | |||
341 | Common files among various policies | ||
342 | ----------------------------------- | ||
213 | - blkio.reset_stats | 343 | - blkio.reset_stats |
214 | - Writing an int to this file will result in resetting all the stats | 344 | - Writing an int to this file will result in resetting all the stats |
215 | for that cgroup. | 345 | for that cgroup. |
216 | 346 | ||
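For example, using one of the groups created in the HOWTO above:

	echo 1 > /sys/fs/cgroup/blkio/test1/blkio.reset_stats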
217 | CFQ sysfs tunable | 347 | CFQ sysfs tunable |
218 | ================= | 348 | ================= |
219 | /sys/block/<disk>/queue/iosched/group_isolation | ||
220 | ----------------------------------------------- | ||
221 | |||
222 | If group_isolation=1, it provides stronger isolation between groups at the | ||
223 | expense of throughput. By default group_isolation is 0. In general that | ||
224 | means that if group_isolation=0, expect fairness for sequential workload | ||
225 | only. Set group_isolation=1 to see fairness for random IO workload also. | ||
226 | |||
227 | Generally CFQ will put random seeky workload in sync-noidle category. CFQ | ||
228 | will disable idling on these queues and it does a collective idling on group | ||
229 | of such queues. Generally these are slow moving queues and if there is a | ||
230 | sync-noidle service tree in each group, that group gets exclusive access to | ||
231 | disk for certain period. That means it will bring the throughput down if | ||
232 | group does not have enough IO to drive deeper queue depths and utilize disk | ||
233 | capacity to the fullest in the slice allocated to it. But the flip side is | ||
234 | that even a random reader should get better latencies and overall throughput | ||
235 | if there are lots of sequential readers/sync-idle workload running in the | ||
236 | system. | ||
237 | |||
238 | If group_isolation=0, then CFQ automatically moves all the random seeky queues | ||
239 | in the root group. That means there will be no service differentiation for | ||
240 | that kind of workload. This leads to better throughput as we do collective | ||
241 | idling on root sync-noidle tree. | ||
242 | |||
243 | By default one should run with group_isolation=0. If that is not sufficient | ||
244 | and one wants stronger isolation between groups, then set group_isolation=1 | ||
245 | but this will come at cost of reduced throughput. | ||
246 | |||
247 | /sys/block/<disk>/queue/iosched/slice_idle | 349 | /sys/block/<disk>/queue/iosched/slice_idle |
248 | ------------------------------------------ | 350 | ------------------------------------------ |
249 | On faster hardware CFQ can be slow, especially with sequential workloads. | 351 |
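As a sketch of using this tunable (the disk name sdb is assumed; writing 0 disables idling entirely):

	echo 0 > /sys/block/sdb/queue/iosched/slice_idle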
diff --git a/Documentation/cgroups/cgroup_event_listener.c b/Documentation/cgroups/cgroup_event_listener.c
index 8c2bfc4a6358..3e082f96dc12 100644
--- a/Documentation/cgroups/cgroup_event_listener.c
+++ b/Documentation/cgroups/cgroup_event_listener.c
@@ -91,7 +91,7 @@ int main(int argc, char **argv) | |||
91 | 91 | ||
92 | if (ret == -1) { | 92 | if (ret == -1) { |
93 | perror("cgroup.event_control " | 93 | perror("cgroup.event_control " |
94 | "is not accessable any more"); | 94 | "is not accessible any more"); |
95 | break; | 95 | break; |
96 | } | 96 | } |
97 | 97 | ||
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index b34823ff1646..cd67e90003c0 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -18,7 +18,8 @@ CONTENTS: | |||
18 | 1.2 Why are cgroups needed ? | 18 | 1.2 Why are cgroups needed ? |
19 | 1.3 How are cgroups implemented ? | 19 | 1.3 How are cgroups implemented ? |
20 | 1.4 What does notify_on_release do ? | 20 | 1.4 What does notify_on_release do ? |
21 | 1.5 How do I use cgroups ? | 21 | 1.5 What does clone_children do ? |
22 | 1.6 How do I use cgroups ? | ||
22 | 2. Usage Examples and Syntax | 23 | 2. Usage Examples and Syntax |
23 | 2.1 Basic Usage | 24 | 2.1 Basic Usage |
24 | 2.2 Attaching processes | 25 | 2.2 Attaching processes |
@@ -109,22 +110,22 @@ university server with various users - students, professors, system | |||
109 | tasks etc. The resource planning for this server could be along the | 110 | tasks etc. The resource planning for this server could be along the |
110 | following lines: | 111 | following lines: |
111 | 112 | ||
112 | CPU : Top cpuset | 113 | CPU : "Top cpuset" |
113 | / \ | 114 | / \ |
114 | CPUSet1 CPUSet2 | 115 | CPUSet1 CPUSet2 |
115 | | | | 116 | | | |
116 | (Profs) (Students) | 117 | (Professors) (Students) |
117 | 118 | ||
118 | In addition (system tasks) are attached to topcpuset (so | 119 | In addition (system tasks) are attached to topcpuset (so |
119 | that they can run anywhere) with a limit of 20% | 120 | that they can run anywhere) with a limit of 20% |
120 | 121 | ||
121 | Memory : Professors (50%), students (30%), system (20%) | 122 | Memory : Professors (50%), Students (30%), system (20%) |
122 | 123 | ||
123 | Disk : Prof (50%), students (30%), system (20%) | 124 | Disk : Professors (50%), Students (30%), system (20%) |
124 | 125 | ||
125 | Network : WWW browsing (20%), Network File System (60%), others (20%) | 126 | Network : WWW browsing (20%), Network File System (60%), others (20%) |
126 | / \ | 127 | / \ |
127 | Prof (15%) students (5%) | 128 | Professors (15%) students (5%) |
128 | 129 | ||
129 | Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go | 130 | Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go |
130 | into NFS network class. | 131 | into NFS network class. |
@@ -137,11 +138,11 @@ With the ability to classify tasks differently for different resources | |||
137 | the admin can easily set up a script which receives exec notifications | 138 | the admin can easily set up a script which receives exec notifications |
138 | and depending on who is launching the browser he can | 139 | and depending on who is launching the browser he can |
139 | 140 | ||
140 | # echo browser_pid > /mnt/<restype>/<userclass>/tasks | 141 | # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks |
141 | 142 | ||
142 | With only a single hierarchy, he now would potentially have to create | 143 | With only a single hierarchy, he now would potentially have to create |
143 | a separate cgroup for every browser launched and associate it with | 144 | a separate cgroup for every browser launched and associate it with |
144 | approp network and other resource class. This may lead to | 145 | appropriate network and other resource class. This may lead to |
145 | proliferation of such cgroups. | 146 | proliferation of such cgroups. |
146 | 147 | ||
147 | Also let's say that the administrator would like to give enhanced network | 148 | Also let's say that the administrator would like to give enhanced network |
@@ -152,9 +153,9 @@ apps enhanced CPU power, | |||
152 | With ability to write pids directly to resource classes, it's just a | 153 | With ability to write pids directly to resource classes, it's just a |
153 | matter of : | 154 | matter of : |
154 | 155 | ||
155 | # echo pid > /mnt/network/<new_class>/tasks | 156 | # echo pid > /sys/fs/cgroup/network/<new_class>/tasks |
156 | (after some time) | 157 | (after some time) |
157 | # echo pid > /mnt/network/<orig_class>/tasks | 158 | # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks |
158 | 159 | ||
159 | Without this ability, he would have to split the cgroup into | 160 | Without this ability, he would have to split the cgroup into |
160 | multiple separate ones and then associate the new cgroups with the | 161 | multiple separate ones and then associate the new cgroups with the |
@@ -235,7 +236,8 @@ containing the following files describing that cgroup: | |||
235 | - cgroup.procs: list of tgids in the cgroup. This list is not | 236 | - cgroup.procs: list of tgids in the cgroup. This list is not |
236 | guaranteed to be sorted or free of duplicate tgids, and userspace | 237 | guaranteed to be sorted or free of duplicate tgids, and userspace |
237 | should sort/uniquify the list if this property is required. | 238 | should sort/uniquify the list if this property is required. |
238 | This is a read-only file, for now. | 239 | Writing a thread group id into this file moves all threads in that |
240 | group into this cgroup. | ||
239 | - notify_on_release flag: run the release agent on exit? | 241 | - notify_on_release flag: run the release agent on exit? |
240 | - release_agent: the path to use for release notifications (this file | 242 | - release_agent: the path to use for release notifications (this file |
241 | exists in the top cgroup only) | 243 | exists in the top cgroup only) |
@@ -293,27 +295,39 @@ notify_on_release in the root cgroup at system boot is disabled | |||
293 | value of their parents notify_on_release setting. The default value of | 295 | value of their parents notify_on_release setting. The default value of |
294 | a cgroup hierarchy's release_agent path is empty. | 296 | a cgroup hierarchy's release_agent path is empty. |
295 | 297 | ||
296 | 1.5 How do I use cgroups ? | 298 | 1.5 What does clone_children do ? |
299 | --------------------------------- | ||
300 | |||
301 | If the clone_children flag is enabled (1) in a cgroup, then all | ||
302 | cgroups created beneath will call the post_clone callbacks for each | ||
303 | subsystem of the newly created cgroup. Usually when this callback is | ||
304 | implemented for a subsystem, it copies the values of the parent | ||
305 | subsystem; this is the case for the cpuset. | ||
306 | |||
307 | 1.6 How do I use cgroups ? | ||
297 | -------------------------- | 308 | -------------------------- |
298 | 309 | ||
299 | To start a new job that is to be contained within a cgroup, using | 310 | To start a new job that is to be contained within a cgroup, using |
300 | the "cpuset" cgroup subsystem, the steps are something like: | 311 | the "cpuset" cgroup subsystem, the steps are something like: |
301 | 312 | ||
302 | 1) mkdir /dev/cgroup | 313 | 1) mount -t tmpfs cgroup_root /sys/fs/cgroup |
303 | 2) mount -t cgroup -ocpuset cpuset /dev/cgroup | 314 | 2) mkdir /sys/fs/cgroup/cpuset |
304 | 3) Create the new cgroup by doing mkdir's and write's (or echo's) in | 315 | 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset |
305 | the /dev/cgroup virtual file system. | 316 | 4) Create the new cgroup by doing mkdir's and write's (or echo's) in |
306 | 4) Start a task that will be the "founding father" of the new job. | 317 | the /sys/fs/cgroup virtual file system. |
307 | 5) Attach that task to the new cgroup by writing its pid to the | 318 | 5) Start a task that will be the "founding father" of the new job. |
308 | /dev/cgroup tasks file for that cgroup. | 319 | 6) Attach that task to the new cgroup by writing its pid to the |
309 | 6) fork, exec or clone the job tasks from this founding father task. | 320 | /sys/fs/cgroup/cpuset/tasks file for that cgroup. |
321 | 7) fork, exec or clone the job tasks from this founding father task. | ||
310 | 322 | ||
311 | For example, the following sequence of commands will set up a cgroup | 323 | For example, the following sequence of commands will set up a cgroup |
312 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | 324 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, |
313 | and then start a subshell 'sh' in that cgroup: | 325 | and then start a subshell 'sh' in that cgroup: |
314 | 326 | ||
315 | mount -t cgroup cpuset -ocpuset /dev/cgroup | 327 | mount -t tmpfs cgroup_root /sys/fs/cgroup |
316 | cd /dev/cgroup | 328 | mkdir /sys/fs/cgroup/cpuset |
329 | mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset | ||
330 | cd /sys/fs/cgroup/cpuset | ||
317 | mkdir Charlie | 331 | mkdir Charlie |
318 | cd Charlie | 332 | cd Charlie |
319 | /bin/echo 2-3 > cpuset.cpus | 333 | /bin/echo 2-3 > cpuset.cpus |
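The example is cut short by the hunk boundary; given the stated goal (CPUs 2-3, Memory Node 1, then a subshell), the remaining steps would presumably be:

	/bin/echo 1 > cpuset.mems
	/bin/echo $$ > tasks
	sh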
@@ -334,28 +348,41 @@ Creating, modifying, using the cgroups can be done through the cgroup | |||
334 | virtual filesystem. | 348 | virtual filesystem. |
335 | 349 | ||
336 | To mount a cgroup hierarchy with all available subsystems, type: | 350 | To mount a cgroup hierarchy with all available subsystems, type: |
337 | # mount -t cgroup xxx /dev/cgroup | 351 | # mount -t cgroup xxx /sys/fs/cgroup |
338 | 352 | ||
339 | The "xxx" is not interpreted by the cgroup code, but will appear in | 353 | The "xxx" is not interpreted by the cgroup code, but will appear in |
340 | /proc/mounts so may be any useful identifying string that you like. | 354 | /proc/mounts so may be any useful identifying string that you like. |
341 | 355 | ||
356 | Note: Some subsystems do not work without some user input first. For instance, | ||
357 | if cpusets are enabled the user will have to populate the cpus and mems files | ||
358 | for each new cgroup created before that group can be used. | ||
359 | |||
360 | As explained in section `1.2 Why are cgroups needed?' you should create | ||
361 | different hierarchies of cgroups for each single resource or group of | ||
362 | resources you want to control. Therefore, you should mount a tmpfs on | ||
363 | /sys/fs/cgroup and create directories for each cgroup resource or resource | ||
364 | group. | ||
365 | |||
366 | # mount -t tmpfs cgroup_root /sys/fs/cgroup | ||
367 | # mkdir /sys/fs/cgroup/rg1 | ||
368 | |||
342 | To mount a cgroup hierarchy with just the cpuset and memory | 369 | To mount a cgroup hierarchy with just the cpuset and memory |
343 | subsystems, type: | 370 | subsystems, type: |
344 | # mount -t cgroup -o cpuset,memory hier1 /dev/cgroup | 371 | # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1 |
345 | 372 | ||
346 | To change the set of subsystems bound to a mounted hierarchy, just | 373 | To change the set of subsystems bound to a mounted hierarchy, just |
347 | remount with different options: | 374 | remount with different options: |
348 | # mount -o remount,cpuset,ns hier1 /dev/cgroup | 375 | # mount -o remount,cpuset,blkio hier1 /sys/fs/cgroup/rg1 |
349 | 376 | ||
350 | Now memory is removed from the hierarchy and ns is added. | 377 | Now memory is removed from the hierarchy and blkio is added. |
351 | 378 | ||
352 | Note this will add ns to the hierarchy but won't remove memory or | 379 | Note this will add blkio to the hierarchy but won't remove memory or |
353 | cpuset, because the new options are appended to the old ones: | 380 | cpuset, because the new options are appended to the old ones: |
354 | # mount -o remount,ns /dev/cgroup | 381 | # mount -o remount,blkio /sys/fs/cgroup/rg1 |
355 | 382 | ||
356 | To specify a hierarchy's release_agent: | 383 | To specify a hierarchy's release_agent: |
357 | # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \ | 384 | # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \ |
358 | xxx /dev/cgroup | 385 | xxx /sys/fs/cgroup/rg1 |
359 | 386 | ||
360 | Note that specifying 'release_agent' more than once will return failure. | 387 | Note that specifying 'release_agent' more than once will return failure. |
361 | 388 | ||
@@ -364,17 +391,17 @@ when the hierarchy consists of a single (root) cgroup. Supporting | |||
364 | the ability to arbitrarily bind/unbind subsystems from an existing | 391 | the ability to arbitrarily bind/unbind subsystems from an existing |
365 | cgroup hierarchy is intended to be implemented in the future. | 392 | cgroup hierarchy is intended to be implemented in the future. |
366 | 393 | ||
367 | Then under /dev/cgroup you can find a tree that corresponds to the | 394 | Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the |
368 | tree of the cgroups in the system. For instance, /dev/cgroup | 395 | tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1 |
369 | is the cgroup that holds the whole system. | 396 | is the cgroup that holds the whole system. |
370 | 397 | ||
371 | If you want to change the value of release_agent: | 398 | If you want to change the value of release_agent: |
372 | # echo "/sbin/new_release_agent" > /dev/cgroup/release_agent | 399 | # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent |
373 | 400 | ||
374 | It can also be changed via remount. | 401 | It can also be changed via remount. |
375 | 402 | ||
376 | If you want to create a new cgroup under /dev/cgroup: | 403 | If you want to create a new cgroup under /sys/fs/cgroup/rg1: |
377 | # cd /dev/cgroup | 404 | # cd /sys/fs/cgroup/rg1 |
378 | # mkdir my_cgroup | 405 | # mkdir my_cgroup |
379 | 406 | ||
380 | Now you want to do something with this cgroup. | 407 | Now you want to do something with this cgroup. |
@@ -416,6 +443,20 @@ You can attach the current shell task by echoing 0: | |||
416 | 443 | ||
417 | # echo 0 > tasks | 444 | # echo 0 > tasks |
418 | 445 | ||
446 | You can use the cgroup.procs file instead of the tasks file to move all | ||
447 | threads in a threadgroup at once. Echoing the pid of any task in a | ||
448 | threadgroup to cgroup.procs causes all tasks in that threadgroup to be | ||
449 | be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks | ||
450 | in the writing task's threadgroup. | ||
451 | |||
452 | Note: Since every task is always a member of exactly one cgroup in each | ||
453 | mounted hierarchy, to remove a task from its current cgroup you must | ||
454 | move it into a new cgroup (possibly the root cgroup) by writing to the | ||
455 | new cgroup's tasks file. | ||
456 | |||
457 | Note: If the ns cgroup is active, moving a process to another cgroup can | ||
458 | fail. | ||
459 | |||
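For example, run from inside the destination cgroup's directory (<pid> is any one thread of the target group):

	# echo <pid> > cgroup.procs
	# echo 0 > cgroup.procs	(moves the writing task's own threadgroup)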
419 | 2.3 Mounting hierarchies by name | 460 | 2.3 Mounting hierarchies by name |
420 | -------------------------------- | 461 | -------------------------------- |
421 | 462 | ||
@@ -553,7 +594,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be | |||
553 | called multiple times against a cgroup. | 594 | called multiple times against a cgroup. |
554 | 595 | ||
555 | int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | 596 | int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, |
556 | struct task_struct *task, bool threadgroup) | 597 | struct task_struct *task) |
557 | (cgroup_mutex held by caller) | 598 | (cgroup_mutex held by caller) |
558 | 599 | ||
559 | Called prior to moving a task into a cgroup; if the subsystem | 600 | Called prior to moving a task into a cgroup; if the subsystem |
@@ -562,9 +603,14 @@ task is passed, then a successful result indicates that *any* | |||
562 | unspecified task can be moved into the cgroup. Note that this isn't | 603 | unspecified task can be moved into the cgroup. Note that this isn't |
563 | called on a fork. If this method returns 0 (success) then this should | 604 | called on a fork. If this method returns 0 (success) then this should |
564 | remain valid while the caller holds cgroup_mutex and it is ensured that either | 605 | remain valid while the caller holds cgroup_mutex and it is ensured that either |
565 | attach() or cancel_attach() will be called in future. If threadgroup is | 606 | attach() or cancel_attach() will be called in future. |
566 | true, then a successful result indicates that all threads in the given | 607 | |
567 | thread's threadgroup can be moved together. | 608 | int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk); |
609 | (cgroup_mutex held by caller) | ||
610 | |||
611 | As can_attach, but for operations that must be run once per task to be | ||
612 | attached (possibly many when using cgroup_attach_proc). Called after | ||
613 | can_attach. | ||
568 | 614 | ||
569 | void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | 615 | void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, |
570 | struct task_struct *task, bool threadgroup) | 616 | struct task_struct *task, bool threadgroup) |
@@ -576,15 +622,24 @@ function, so that the subsystem can implement a rollback. If not, not necessary. | |||
576 | This will be called only for subsystems whose can_attach() operation has | 622 | This will be called only for subsystems whose can_attach() operation has |
577 | succeeded. | 623 | succeeded. |
578 | 624 | ||
625 | void pre_attach(struct cgroup *cgrp); | ||
626 | (cgroup_mutex held by caller) | ||
627 | |||
628 | For any non-per-thread attachment work that needs to happen before | ||
629 | attach_task. Needed by cpuset. | ||
630 | |||
579 | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | 631 | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, |
580 | struct cgroup *old_cgrp, struct task_struct *task, | 632 | struct cgroup *old_cgrp, struct task_struct *task) |
581 | bool threadgroup) | ||
582 | (cgroup_mutex held by caller) | 633 | (cgroup_mutex held by caller) |
583 | 634 | ||
584 | Called after the task has been attached to the cgroup, to allow any | 635 | Called after the task has been attached to the cgroup, to allow any |
585 | post-attachment activity that requires memory allocations or blocking. | 636 | post-attachment activity that requires memory allocations or blocking. |
586 | If threadgroup is true, the subsystem should take care of all threads | 637 | |
587 | in the specified thread's threadgroup. Currently does not support any | 638 | void attach_task(struct cgroup *cgrp, struct task_struct *tsk); |
639 | (cgroup_mutex held by caller) | ||
640 | |||
641 | As attach, but for operations that must be run once per task to be attached, | ||
642 | like can_attach_task. Called before attach. Currently does not support any | ||
588 | subsystem that might need the old_cgrp for every thread in the group. | 643 | subsystem that might need the old_cgrp for every thread in the group. |
589 | 644 | ||
590 | void fork(struct cgroup_subsys *ss, struct task_struct *task) | 645 |
@@ -608,7 +663,7 @@ always handled well. | |||
608 | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) | 663 | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) |
609 | (cgroup_mutex held by caller) | 664 | (cgroup_mutex held by caller) |
610 | 665 | ||
611 | Called at the end of cgroup_clone() to do any parameter | 666 | Called during cgroup_create() to do any parameter |
612 | initialization which might be required before a task could attach. For | 667 | initialization which might be required before a task could attach. For |
613 | example in cpusets, no task may attach before 'cpus' and 'mems' are set | 668 | example in cpusets, no task may attach before 'cpus' and 'mems' are set |
614 | up. | 669 | up. |
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 8b930946c52a..9ad85df4b983 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -10,26 +10,25 @@ directly present in its group. | |||
10 | 10 | ||
11 | Accounting groups can be created by first mounting the cgroup filesystem. | 11 | Accounting groups can be created by first mounting the cgroup filesystem. |
12 | 12 | ||
13 | # mkdir /cgroups | 13 | # mount -t cgroup -ocpuacct none /sys/fs/cgroup |
14 | # mount -t cgroup -ocpuacct none /cgroups | 14 | |
15 | 15 | With the above step, the initial or the parent accounting group becomes | |
16 | With the above step, the initial or the parent accounting group | 16 | visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in |
17 | becomes visible at /cgroups. At bootup, this group includes all the | 17 | the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. |
18 | tasks in the system. /cgroups/tasks lists the tasks in this cgroup. | 18 | /sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained |
19 | /cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by | 19 | by this group which is essentially the CPU time obtained by all the tasks |
20 | this group which is essentially the CPU time obtained by all the tasks | ||
21 | in the system. | 20 | in the system. |
22 | 21 | ||
23 | New accounting groups can be created under the parent group /cgroups. | 22 | New accounting groups can be created under the parent group /sys/fs/cgroup. |
24 | 23 | ||
25 | # cd /cgroups | 24 | # cd /sys/fs/cgroup |
26 | # mkdir g1 | 25 | # mkdir g1 |
27 | # echo $$ > g1/tasks | 26 | # echo $$ > g1/tasks |
28 | 27 | ||
29 | The above steps create a new group g1 and move the current shell | 28 | The above steps create a new group g1 and move the current shell |
30 | process (bash) into it. CPU time consumed by this bash and its children | 29 | process (bash) into it. CPU time consumed by this bash and its children |
31 | can be obtained from g1/cpuacct.usage and the same is accumulated in | 30 | can be obtained from g1/cpuacct.usage and the same is accumulated in |
32 | /cgroups/cpuacct.usage also. | 31 | /sys/fs/cgroup/cpuacct.usage also. |
33 | 32 | ||
34 | cpuacct.stat file lists a few statistics which further divide the | 33 | cpuacct.stat file lists a few statistics which further divide the |
35 | CPU time obtained by the cgroup into user and system times. Currently | 34 | CPU time obtained by the cgroup into user and system times. Currently |
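For example, reading the per-group statistics for the g1 group created above (the numbers shown are invented, purely illustrative):

	# cat /sys/fs/cgroup/g1/cpuacct.stat
	user 102
	system 34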
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5d0d5692a365..5b0d78e55ccc 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -661,21 +661,21 @@ than stress the kernel. | |||
661 | 661 | ||
662 | To start a new job that is to be contained within a cpuset, the steps are: | 662 | To start a new job that is to be contained within a cpuset, the steps are: |
663 | 663 | ||
664 | 1) mkdir /dev/cpuset | 664 | 1) mkdir /sys/fs/cgroup/cpuset |
665 | 2) mount -t cgroup -ocpuset cpuset /dev/cpuset | 665 | 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset |
666 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in | 666 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in |
667 | the /dev/cpuset virtual file system. | 667 | the /sys/fs/cgroup/cpuset virtual file system. |
668 | 4) Start a task that will be the "founding father" of the new job. | 668 | 4) Start a task that will be the "founding father" of the new job. |
669 | 5) Attach that task to the new cpuset by writing its pid to the | 669 | 5) Attach that task to the new cpuset by writing its pid to the |
670 | /dev/cpuset tasks file for that cpuset. | 670 | /sys/fs/cgroup/cpuset tasks file for that cpuset. |
671 | 6) fork, exec or clone the job tasks from this founding father task. | 671 | 6) fork, exec or clone the job tasks from this founding father task. |
672 | 672 | ||
673 | For example, the following sequence of commands will set up a cpuset | 673 | For example, the following sequence of commands will set up a cpuset |
674 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | 674 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, |
675 | and then start a subshell 'sh' in that cpuset: | 675 | and then start a subshell 'sh' in that cpuset: |
676 | 676 | ||
677 | mount -t cgroup -ocpuset cpuset /dev/cpuset | 677 | mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset |
678 | cd /dev/cpuset | 678 | cd /sys/fs/cgroup/cpuset |
679 | mkdir Charlie | 679 | mkdir Charlie |
680 | cd Charlie | 680 | cd Charlie |
681 | /bin/echo 2-3 > cpuset.cpus | 681 | /bin/echo 2-3 > cpuset.cpus |
@@ -693,7 +693,7 @@ There are ways to query or modify cpusets: | |||
693 | - via the C library libcgroup. | 693 | - via the C library libcgroup. |
694 | (http://sourceforge.net/projects/libcg/) | 694 | (http://sourceforge.net/projects/libcg/) |
695 | - via the python application cset. | 695 | - via the python application cset. |
696 | (http://developer.novell.com/wiki/index.php/Cpuset) | 696 | (http://code.google.com/p/cpuset/) |
697 | 697 | ||
698 | The sched_setaffinity calls can also be done at the shell prompt using | 698 | The sched_setaffinity calls can also be done at the shell prompt using |
699 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | 699 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy |
@@ -710,14 +710,14 @@ Creating, modifying, using the cpusets can be done through the cpuset | |||
710 | virtual filesystem. | 710 | virtual filesystem. |
711 | 711 | ||
712 | To mount it, type: | 712 | To mount it, type: |
713 | # mount -t cgroup -o cpuset cpuset /dev/cpuset | 713 | # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset |
714 | 714 | ||
715 | Then under /dev/cpuset you can find a tree that corresponds to the | 715 | Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the |
716 | tree of the cpusets in the system. For instance, /dev/cpuset | 716 | tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset |
717 | is the cpuset that holds the whole system. | 717 | is the cpuset that holds the whole system. |
718 | 718 | ||
719 | If you want to create a new cpuset under /dev/cpuset: | 719 | If you want to create a new cpuset under /sys/fs/cgroup/cpuset: |
720 | # cd /dev/cpuset | 720 | # cd /sys/fs/cgroup/cpuset |
721 | # mkdir my_cpuset | 721 | # mkdir my_cpuset |
722 | 722 | ||
723 | Now you want to do something with this cpuset. | 723 | Now you want to do something with this cpuset. |
@@ -725,13 +725,14 @@ Now you want to do something with this cpuset. | |||
725 | 725 | ||
726 | In this directory you can find several files: | 726 | In this directory you can find several files: |
727 | # ls | 727 | # ls |
728 | cpuset.cpu_exclusive cpuset.memory_spread_slab | 728 | cgroup.clone_children cpuset.memory_pressure |
729 | cpuset.cpus cpuset.mems | 729 | cgroup.event_control cpuset.memory_spread_page |
730 | cpuset.mem_exclusive cpuset.sched_load_balance | 730 | cgroup.procs cpuset.memory_spread_slab |
731 | cpuset.mem_hardwall cpuset.sched_relax_domain_level | 731 | cpuset.cpu_exclusive cpuset.mems |
732 | cpuset.memory_migrate notify_on_release | 732 | cpuset.cpus cpuset.sched_load_balance |
733 | cpuset.memory_pressure tasks | 733 | cpuset.mem_exclusive cpuset.sched_relax_domain_level |
734 | cpuset.memory_spread_page | 734 | cpuset.mem_hardwall notify_on_release |
735 | cpuset.memory_migrate tasks | ||
735 | 736 | ||
736 | Reading them will give you information about the state of this cpuset: | 737 | Reading them will give you information about the state of this cpuset: |
737 | the CPUs and Memory Nodes it can use, the processes that are using | 738 | the CPUs and Memory Nodes it can use, the processes that are using |
@@ -764,12 +765,12 @@ wrapper around the cgroup filesystem. | |||
764 | 765 | ||
765 | The command | 766 | The command |
766 | 767 | ||
767 | mount -t cpuset X /dev/cpuset | 768 | mount -t cpuset X /sys/fs/cgroup/cpuset |
768 | 769 | ||
769 | is equivalent to | 770 | is equivalent to |
770 | 771 | ||
771 | mount -t cgroup -ocpuset,noprefix X /dev/cpuset | 772 | mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset |
772 | echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | 773 | echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent |
773 | 774 | ||
774 | 2.2 Adding/removing cpus | 775 | 2.2 Adding/removing cpus |
775 | ------------------------ | 776 | ------------------------ |
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 57ca4c89fe5c..16624a7f8222 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -22,16 +22,16 @@ removed from the child(ren). | |||
22 | An entry is added using devices.allow, and removed using | 22 | An entry is added using devices.allow, and removed using |
23 | devices.deny. For instance | 23 | devices.deny. For instance |
24 | 24 | ||
25 | echo 'c 1:3 mr' > /cgroups/1/devices.allow | 25 | echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow |
26 | 26 | ||
27 | allows cgroup 1 to read and mknod the device usually known as | 27 | allows cgroup 1 to read and mknod the device usually known as |
28 | /dev/null. Doing | 28 | /dev/null. Doing |
29 | 29 | ||
30 | echo a > /cgroups/1/devices.deny | 30 | echo a > /sys/fs/cgroup/1/devices.deny |
31 | 31 | ||
32 | will remove the default 'a *:* rwm' entry. Doing | 32 | will remove the default 'a *:* rwm' entry. Doing |
33 | 33 | ||
34 | echo a > /cgroups/1/devices.allow | 34 | echo a > /sys/fs/cgroup/1/devices.allow |
35 | 35 | ||
36 | will add the 'a *:* rwm' entry to the whitelist. | 36 | will add the 'a *:* rwm' entry to the whitelist. |
37 | 37 | ||
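For example, to additionally allow read/write (but not mknod) access to a block device, here assumed to be 8:16 ('b' selects block devices, by analogy with the 'c 1:3 mr' entry above):

	# echo 'b 8:16 rw' > /sys/fs/cgroup/1/devices.allow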
diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
index 41f37fea1276..c21d77742a07 100644
--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -59,28 +59,28 @@ is non-freezable. | |||
59 | 59 | ||
60 | * Examples of usage : | 60 | * Examples of usage : |
61 | 61 | ||
62 | # mkdir /containers | 62 | # mkdir /sys/fs/cgroup/freezer |
63 | # mount -t cgroup -ofreezer freezer /containers | 63 | # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer |
64 | # mkdir /containers/0 | 64 | # mkdir /sys/fs/cgroup/freezer/0 |
65 | # echo $some_pid > /containers/0/tasks | 65 | # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks |
66 | 66 | ||
67 | to get status of the freezer subsystem : | 67 | to get status of the freezer subsystem : |
68 | 68 | ||
69 | # cat /containers/0/freezer.state | 69 | # cat /sys/fs/cgroup/freezer/0/freezer.state |
70 | THAWED | 70 | THAWED |
71 | 71 | ||
72 | to freeze all tasks in the container : | 72 | to freeze all tasks in the container : |
73 | 73 | ||
74 | # echo FROZEN > /containers/0/freezer.state | 74 | # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state |
75 | # cat /containers/0/freezer.state | 75 | # cat /sys/fs/cgroup/freezer/0/freezer.state |
76 | FREEZING | 76 | FREEZING |
77 | # cat /containers/0/freezer.state | 77 | # cat /sys/fs/cgroup/freezer/0/freezer.state |
78 | FROZEN | 78 | FROZEN |
79 | 79 | ||
80 | to unfreeze all tasks in the container : | 80 | to unfreeze all tasks in the container : |
81 | 81 | ||
82 | # echo THAWED > /containers/0/freezer.state | 82 | # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state |
83 | # cat /containers/0/freezer.state | 83 | # cat /sys/fs/cgroup/freezer/0/freezer.state |
84 | THAWED | 84 | THAWED |
85 | 85 | ||
86 | This is the basic mechanism which should do the right thing for user space task | 86 | This is the basic mechanism which should do the right thing for user space task |
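One common pattern, sketched here with the same paths as the example above, is to freeze a group before signalling it so that no task can fork or otherwise race ahead of the signal:

	# echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
	# kill -9 $some_pid
	# echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state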
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index b7eececfb195..fc8fa97a09ac 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -398,7 +398,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
398 | written to move_charge_at_immigrate. | 398 | written to move_charge_at_immigrate. |
399 | 399 | ||
400 | 9.10 Memory thresholds | 400 | 9.10 Memory thresholds |
401 | Memory controler implements memory thresholds using cgroups notification | 401 | Memory controller implements memory thresholds using cgroups notification |
402 | API. You can use Documentation/cgroups/cgroup_event_listener.c to test | 402 | API. You can use Documentation/cgroups/cgroup_event_listener.c to test |
403 | it. | 403 | it. |
404 | 404 | ||
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857dc940..06eb6d957c83 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -1,8 +1,8 @@ | |||
1 | Memory Resource Controller | 1 | Memory Resource Controller |
2 | 2 | ||
3 | NOTE: The Memory Resource Controller has been generically been referred | 3 | NOTE: The Memory Resource Controller has generically been referred to as the |
4 | to as the memory controller in this document. Do not confuse memory | 4 | memory controller in this document. Do not confuse memory controller |
5 | controller used here with the memory controller that is used in hardware. | 5 | used here with the memory controller that is used in hardware. |
6 | 6 | ||
7 | (For editors) | 7 | (For editors) |
8 | In this document: | 8 | In this document: |
@@ -52,8 +52,10 @@ Brief summary of control files. | |||
52 | tasks # attach a task(thread) and show list of threads | 52 | tasks # attach a task(thread) and show list of threads |
53 | cgroup.procs # show list of processes | 53 | cgroup.procs # show list of processes |
54 | cgroup.event_control # an interface for event_fd() | 54 | cgroup.event_control # an interface for event_fd() |
55 | memory.usage_in_bytes # show current memory(RSS+Cache) usage. | 55 | memory.usage_in_bytes # show current res_counter usage for memory |
56 | memory.memsw.usage_in_bytes # show current memory+Swap usage | 56 | (See 5.5 for details) |
57 | memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap | ||
58 | (See 5.5 for details) | ||
57 | memory.limit_in_bytes # set/show limit of memory usage | 59 | memory.limit_in_bytes # set/show limit of memory usage |
58 | memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage | 60 | memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage |
59 | memory.failcnt # show the number of memory usage hits limits | 61 | memory.failcnt # show the number of memory usage hits limits |
@@ -68,6 +70,7 @@ Brief summary of control files. | |||
68 | (See sysctl's vm.swappiness) | 70 | (See sysctl's vm.swappiness) |
69 | memory.move_charge_at_immigrate # set/show controls of moving charges | 71 | memory.move_charge_at_immigrate # set/show controls of moving charges |
70 | memory.oom_control # set/show oom controls. | 72 | memory.oom_control # set/show oom controls. |
73 | memory.numa_stat # show memory usage per numa node | ||
71 | 74 | ||
72 | 1. History | 75 | 1. History |
73 | 76 | ||
@@ -179,7 +182,7 @@ behind this approach is that a cgroup that aggressively uses a shared | |||
179 | page will eventually get charged for it (once it is uncharged from | 182 | page will eventually get charged for it (once it is uncharged from |
180 | the cgroup that brought it in -- this will happen on memory pressure). | 183 | the cgroup that brought it in -- this will happen on memory pressure). |
181 | 184 | ||
182 | Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used.. | 185 | Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used. |
183 | When you do swapoff and force swapped-out pages of shmem(tmpfs) back | 186 | When you do swapoff and force swapped-out pages of shmem(tmpfs) back |
184 | into memory, charges for those pages are accounted against the | 187 | into memory, charges for those pages are accounted against the |
185 | caller of swapoff rather than the users of shmem. | 188 | caller of swapoff rather than the users of shmem. |
@@ -211,7 +214,7 @@ affecting global LRU, memory+swap limit is better than just limiting swap from | |||
211 | OS point of view. | 214 | OS point of view. |
212 | 215 | ||
213 | * What happens when a cgroup hits memory.memsw.limit_in_bytes | 216 | * What happens when a cgroup hits memory.memsw.limit_in_bytes |
214 | When a cgroup his memory.memsw.limit_in_bytes, it's useless to do swap-out | 217 | When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out |
215 | in this cgroup. Then, swap-out will not be done by cgroup routine and file | 218 | in this cgroup. Then, swap-out will not be done by cgroup routine and file |
216 | caches are dropped. But as mentioned above, global LRU can still swap out memory | 219 | caches are dropped. But as mentioned above, global LRU can still swap out memory |
217 | from it for sanity of the system's memory management state. You can't forbid | 220 | from it for sanity of the system's memory management state. You can't forbid |
@@ -261,16 +264,17 @@ b. Enable CONFIG_RESOURCE_COUNTERS | |||
261 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR | 264 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR |
262 | d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension) | 265 | d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension) |
263 | 266 | ||
264 | 1. Prepare the cgroups | 267 | 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) |
265 | # mkdir -p /cgroups | 268 | # mount -t tmpfs none /sys/fs/cgroup |
266 | # mount -t cgroup none /cgroups -o memory | 269 | # mkdir /sys/fs/cgroup/memory |
270 | # mount -t cgroup none /sys/fs/cgroup/memory -o memory | ||
267 | 271 | ||
268 | 2. Make the new group and move bash into it | 272 | 2. Make the new group and move bash into it |
269 | # mkdir /cgroups/0 | 273 | # mkdir /sys/fs/cgroup/memory/0 |
270 | # echo $$ > /cgroups/0/tasks | 274 | # echo $$ > /sys/fs/cgroup/memory/0/tasks |
271 | 275 | ||
272 | Since now we're in the 0 cgroup, we can alter the memory limit: | 276 | Since now we're in the 0 cgroup, we can alter the memory limit: |
273 | # echo 4M > /cgroups/0/memory.limit_in_bytes | 277 | # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes |
274 | 278 | ||
275 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, | 279 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, |
276 | mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.) | 280 | mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.) |
@@ -278,11 +282,11 @@ mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.) | |||
278 | NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). | 282 | NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). |
279 | NOTE: We cannot set limits on the root cgroup any more. | 283 | NOTE: We cannot set limits on the root cgroup any more. |
280 | 284 | ||
281 | # cat /cgroups/0/memory.limit_in_bytes | 285 | # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes |
282 | 4194304 | 286 | 4194304 |
283 | 287 | ||
284 | We can check the usage: | 288 | We can check the usage: |
285 | # cat /cgroups/0/memory.usage_in_bytes | 289 | # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes |
286 | 1216512 | 290 | 1216512 |
287 | 291 | ||
288 | A successful write to this file does not guarantee a successful set of | 292 | A successful write to this file does not guarantee a successful set of |
@@ -453,6 +457,33 @@ memory under it will be reclaimed. | |||
453 | You can reset failcnt by writing 0 to failcnt file. | 457 | You can reset failcnt by writing 0 to failcnt file. |
454 | # echo 0 > .../memory.failcnt | 458 | # echo 0 > .../memory.failcnt |
455 | 459 | ||
460 | 5.5 usage_in_bytes | ||
461 | |||
462 | For efficiency, as with other kernel components, the memory cgroup uses some | ||
463 | optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is | ||
464 | affected by the method and doesn't show the 'exact' value of memory (and swap) | ||
465 | usage; it's a fuzz value for efficient access. (Of course, when necessary, it's | ||
466 | synchronized.) If you want to know the more exact memory usage, you should use | ||
467 | the RSS+CACHE(+SWAP) value in memory.stat (see 5.2). | ||
468 | |||
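A quick way to compare the fuzzy counter against the stat-derived value, with the paths from the earlier setup example and assuming 'rss' and 'cache' as the relevant memory.stat field names:

	# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
	# grep -E '^(rss|cache) ' /sys/fs/cgroup/memory/0/memory.stat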
469 | 5.6 numa_stat | ||
470 | |||
471 | This is similar to numa_maps but operates on a per-memcg basis. This is | ||
472 | useful for providing visibility into the numa locality information within | ||
473 | a memcg since the pages are allowed to be allocated from any physical | ||
474 | node. One of the use cases is evaluating application performance by | ||
475 | combining this information with the application's cpu allocation. | ||
476 | |||
477 | We export "total", "file", "anon" and "unevictable" pages per-node for | ||
478 | each memcg. The output format of memory.numa_stat is: | ||
479 | |||
480 | total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... | ||
481 | file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... | ||
482 | anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... | ||
483 | unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ... | ||
484 | |||
485 | And we have total = file + anon + unevictable. | ||
486 | |||
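Illustrative output on a hypothetical two-node machine (the numbers are invented, but they satisfy total = file + anon + unevictable per node):

	# cat memory.numa_stat
	total=3650 N0=2021 N1=1629
	file=2410 N0=1305 N1=1105
	anon=1200 N0=700 N1=500
	unevictable=40 N0=16 N1=24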
456 | 6. Hierarchy support | 487 | 6. Hierarchy support |
457 | 488 | ||
458 | The memory controller supports a deep hierarchy and hierarchical accounting. | 489 | The memory controller supports a deep hierarchy and hierarchical accounting. |
@@ -460,13 +491,13 @@ The hierarchy is created by creating the appropriate cgroups in the | |||
460 | cgroup filesystem. Consider for example, the following cgroup filesystem | 491 | cgroup filesystem. Consider for example, the following cgroup filesystem |
461 | hierarchy | 492 | hierarchy |
462 | 493 | ||
463 | root | 494 | root |
464 | / | \ | 495 | / | \ |
465 | / | \ | 496 | / | \ |
466 | a b c | 497 | a b c |
467 | | \ | 498 | | \ |
468 | | \ | 499 | | \ |
469 | d e | 500 | d e |
470 | 501 | ||
471 | In the diagram above, with hierarchical accounting enabled, all memory | 502 | In the diagram above, with hierarchical accounting enabled, all memory |
472 | usage of e is accounted to its ancestors up until the root (i.e., c and root), | 503 | usage of e is accounted to its ancestors up until the root (i.e., c and root), |
@@ -485,8 +516,9 @@ The feature can be disabled by | |||
485 | 516 | ||
486 | # echo 0 > memory.use_hierarchy | 517 | # echo 0 > memory.use_hierarchy |
487 | 518 | ||
488 | NOTE1: Enabling/disabling will fail if the cgroup already has other | 519 | NOTE1: Enabling/disabling will fail if either the cgroup already has other |
489 | cgroups created below it. | 520 | cgroups created below it, or if the parent cgroup has use_hierarchy |
521 | enabled. | ||
490 | 522 | ||
491 | NOTE2: When panic_on_oom is set to "2", the whole system will panic in | 523 | NOTE2: When panic_on_oom is set to "2", the whole system will panic in |
492 | case of an OOM event in any cgroup. | 524 | case of an OOM event in any cgroup. |