Diffstat (limited to 'Documentation/cgroups/cgroups.txt')
-rw-r--r-- | Documentation/cgroups/cgroups.txt | 92 |
1 files changed, 56 insertions, 36 deletions
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 4a0b64c605fc..9e04196c4d78 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -29,7 +29,8 @@ CONTENTS: | |||
29 | 3.1 Overview | 29 | 3.1 Overview |
30 | 3.2 Synchronization | 30 | 3.2 Synchronization |
31 | 3.3 Subsystem API | 31 | 3.3 Subsystem API |
32 | 4. Questions | 32 | 4. Extended attributes usage |
33 | 5. Questions | ||
33 | 34 | ||
34 | 1. Control Groups | 35 | 1. Control Groups |
35 | ================= | 36 | ================= |
@@ -62,9 +63,9 @@ an instance of the cgroup virtual filesystem associated with it. | |||
62 | At any one time there may be multiple active hierarchies of task | 63 | At any one time there may be multiple active hierarchies of task |
63 | cgroups. Each hierarchy is a partition of all tasks in the system. | 64 | cgroups. Each hierarchy is a partition of all tasks in the system. |
64 | 65 | ||
65 | User level code may create and destroy cgroups by name in an | 66 | User-level code may create and destroy cgroups by name in an |
66 | instance of the cgroup virtual file system, specify and query to | 67 | instance of the cgroup virtual file system, specify and query to |
67 | which cgroup a task is assigned, and list the task pids assigned to | 68 | which cgroup a task is assigned, and list the task PIDs assigned to |
68 | a cgroup. Those creations and assignments only affect the hierarchy | 69 | a cgroup. Those creations and assignments only affect the hierarchy |
69 | associated with that instance of the cgroup file system. | 70 | associated with that instance of the cgroup file system. |
70 | 71 | ||
@@ -72,7 +73,7 @@ On their own, the only use for cgroups is for simple job | |||
72 | tracking. The intention is that other subsystems hook into the generic | 73 | tracking. The intention is that other subsystems hook into the generic |
73 | cgroup support to provide new attributes for cgroups, such as | 74 | cgroup support to provide new attributes for cgroups, such as |
74 | accounting/limiting the resources which processes in a cgroup can | 75 | accounting/limiting the resources which processes in a cgroup can |
75 | access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows | 76 | access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allow |
76 | you to associate a set of CPUs and a set of memory nodes with the | 77 | you to associate a set of CPUs and a set of memory nodes with the |
77 | tasks in each cgroup. | 78 | tasks in each cgroup. |
78 | 79 | ||
@@ -80,11 +81,11 @@ tasks in each cgroup. | |||
80 | ---------------------------- | 81 | ---------------------------- |
81 | 82 | ||
82 | There are multiple efforts to provide process aggregations in the | 83 | There are multiple efforts to provide process aggregations in the |
83 | Linux kernel, mainly for resource tracking purposes. Such efforts | 84 | Linux kernel, mainly for resource-tracking purposes. Such efforts |
84 | include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server | 85 | include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server |
85 | namespaces. These all require the basic notion of a | 86 | namespaces. These all require the basic notion of a |
86 | grouping/partitioning of processes, with newly forked processes ending | 87 | grouping/partitioning of processes, with newly forked processes ending |
87 | in the same group (cgroup) as their parent process. | 88 | up in the same group (cgroup) as their parent process. |
88 | 89 | ||
89 | The kernel cgroup patch provides the minimum essential kernel | 90 | The kernel cgroup patch provides the minimum essential kernel |
90 | mechanisms required to efficiently implement such groups. It has | 91 | mechanisms required to efficiently implement such groups. It has |
@@ -127,14 +128,14 @@ following lines: | |||
127 | / \ | 128 | / \ |
128 | Professors (15%) students (5%) | 129 | Professors (15%) students (5%) |
129 | 130 | ||
130 | Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go | 131 | Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes |
131 | into NFS network class. | 132 | into the NFS network class. |
132 | 133 | ||
133 | At the same time Firefox/Lynx will share an appropriate CPU/Memory class | 134 | At the same time Firefox/Lynx will share an appropriate CPU/Memory class |
134 | depending on who launched it (prof/student). | 135 | depending on who launched it (prof/student). |
135 | 136 | ||
136 | With the ability to classify tasks differently for different resources | 137 | With the ability to classify tasks differently for different resources |
137 | (by putting those resource subsystems in different hierarchies) then | 138 | (by putting those resource subsystems in different hierarchies), |
138 | the admin can easily set up a script which receives exec notifications | 139 | the admin can easily set up a script which receives exec notifications |
139 | and depending on who is launching the browser he can | 140 | and depending on who is launching the browser he can |
140 | 141 | ||
@@ -145,19 +146,19 @@ a separate cgroup for every browser launched and associate it with | |||
145 | appropriate network and other resource class. This may lead to | 146 | appropriate network and other resource class. This may lead to |
146 | proliferation of such cgroups. | 147 | proliferation of such cgroups. |
147 | 148 | ||
148 | Also lets say that the administrator would like to give enhanced network | 149 | Also let's say that the administrator would like to give enhanced network |
149 | access temporarily to a student's browser (since it is night and the user | 150 | access temporarily to a student's browser (since it is night and the user |
150 | wants to do online gaming :)) OR give one of the students simulation | 151 | wants to do online gaming :)) OR give one of the student's simulation |
151 | apps enhanced CPU power, | 152 | apps enhanced CPU power. |
152 | 153 | ||
153 | With ability to write pids directly to resource classes, it's just a | 154 | With ability to write PIDs directly to resource classes, it's just a |
154 | matter of : | 155 | matter of: |
155 | 156 | ||
156 | # echo pid > /sys/fs/cgroup/network/<new_class>/tasks | 157 | # echo pid > /sys/fs/cgroup/network/<new_class>/tasks |
157 | (after some time) | 158 | (after some time) |
158 | # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks | 159 | # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks |
159 | 160 | ||
160 | Without this ability, he would have to split the cgroup into | 161 | Without this ability, the administrator would have to split the cgroup into |
161 | multiple separate ones and then associate the new cgroups with the | 162 | multiple separate ones and then associate the new cgroups with the |
162 | new resource classes. | 163 | new resource classes. |
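Illustration (not part of the patch): the PID-write step described in this hunk could be scripted roughly as follows. This is a minimal Python sketch under stated assumptions: the helper name `move_task` is hypothetical, and `class_dir` is assumed to be a directory in a mounted cgroup hierarchy that contains a `tasks` file, such as /sys/fs/cgroup/network/<new_class>.

```python
import os

def move_task(class_dir, pid):
    """Write one PID to the `tasks` file of a cgroup directory.

    class_dir is assumed to be a directory in a mounted cgroup
    hierarchy (hypothetical example: /sys/fs/cgroup/network/<new_class>).
    Returns the path of the file that was written.
    """
    tasks_path = os.path.join(class_dir, "tasks")
    # Write a single PID per call; the kernel handles one PID per write().
    with open(tasks_path, "w") as f:
        f.write(f"{pid}\n")
    return tasks_path
```

Moving the task back later is the same call with the original class directory, which mirrors the two `echo pid > .../tasks` commands in the text.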
163 | 164 | ||
@@ -184,20 +185,20 @@ Control Groups extends the kernel as follows: | |||
184 | field of each task_struct using the css_set, anchored at | 185 | field of each task_struct using the css_set, anchored at |
185 | css_set->tasks. | 186 | css_set->tasks. |
186 | 187 | ||
187 | - A cgroup hierarchy filesystem can be mounted for browsing and | 188 | - A cgroup hierarchy filesystem can be mounted for browsing and |
188 | manipulation from user space. | 189 | manipulation from user space. |
189 | 190 | ||
190 | - You can list all the tasks (by pid) attached to any cgroup. | 191 | - You can list all the tasks (by PID) attached to any cgroup. |
191 | 192 | ||
192 | The implementation of cgroups requires a few, simple hooks | 193 | The implementation of cgroups requires a few, simple hooks |
193 | into the rest of the kernel, none in performance critical paths: | 194 | into the rest of the kernel, none in performance-critical paths: |
194 | 195 | ||
195 | - in init/main.c, to initialize the root cgroups and initial | 196 | - in init/main.c, to initialize the root cgroups and initial |
196 | css_set at system boot. | 197 | css_set at system boot. |
197 | 198 | ||
198 | - in fork and exit, to attach and detach a task from its css_set. | 199 | - in fork and exit, to attach and detach a task from its css_set. |
199 | 200 | ||
200 | In addition a new file system, of type "cgroup" may be mounted, to | 201 | In addition, a new file system of type "cgroup" may be mounted, to |
201 | enable browsing and modifying the cgroups presently known to the | 202 | enable browsing and modifying the cgroups presently known to the |
202 | kernel. When mounting a cgroup hierarchy, you may specify a | 203 | kernel. When mounting a cgroup hierarchy, you may specify a |
203 | comma-separated list of subsystems to mount as the filesystem mount | 204 | comma-separated list of subsystems to mount as the filesystem mount |
@@ -230,13 +231,13 @@ as the path relative to the root of the cgroup file system. | |||
230 | Each cgroup is represented by a directory in the cgroup file system | 231 | Each cgroup is represented by a directory in the cgroup file system |
231 | containing the following files describing that cgroup: | 232 | containing the following files describing that cgroup: |
232 | 233 | ||
233 | - tasks: list of tasks (by pid) attached to that cgroup. This list | 234 | - tasks: list of tasks (by PID) attached to that cgroup. This list |
234 | is not guaranteed to be sorted. Writing a thread id into this file | 235 | is not guaranteed to be sorted. Writing a thread ID into this file |
235 | moves the thread into this cgroup. | 236 | moves the thread into this cgroup. |
236 | - cgroup.procs: list of tgids in the cgroup. This list is not | 237 | - cgroup.procs: list of thread group IDs in the cgroup. This list is |
237 | guaranteed to be sorted or free of duplicate tgids, and userspace | 238 | not guaranteed to be sorted or free of duplicate TGIDs, and userspace |
238 | should sort/uniquify the list if this property is required. | 239 | should sort/uniquify the list if this property is required. |
239 | Writing a thread group id into this file moves all threads in that | 240 | Writing a thread group ID into this file moves all threads in that |
240 | group into this cgroup. | 241 | group into this cgroup. |
241 | - notify_on_release flag: run the release agent on exit? | 242 | - notify_on_release flag: run the release agent on exit? |
242 | - release_agent: the path to use for release notifications (this file | 243 | - release_agent: the path to use for release notifications (this file |
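Illustration (not part of the patch): since cgroup.procs is not guaranteed to be sorted or free of duplicates, userspace may want to normalize it after reading. A minimal Python sketch, assuming one numeric ID per line; the function name `read_tgids` is hypothetical.

```python
def read_tgids(procs_path):
    """Return the thread group IDs listed in a cgroup.procs file,
    sorted in ascending order with duplicates removed."""
    with open(procs_path) as f:
        # A set drops duplicate TGIDs; sorted() imposes a stable order.
        return sorted({int(line) for line in f if line.strip()})
```

The same helper would work on a `tasks` file, whose listing is likewise not guaranteed to be sorted.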
@@ -261,7 +262,7 @@ cgroup file system directories. | |||
261 | 262 | ||
262 | When a task is moved from one cgroup to another, it gets a new | 263 | When a task is moved from one cgroup to another, it gets a new |
263 | css_set pointer - if there's an already existing css_set with the | 264 | css_set pointer - if there's an already existing css_set with the |
264 | desired collection of cgroups then that group is reused, else a new | 265 | desired collection of cgroups then that group is reused, otherwise a new |
265 | css_set is allocated. The appropriate existing css_set is located by | 266 | css_set is allocated. The appropriate existing css_set is located by |
266 | looking into a hash table. | 267 | looking into a hash table. |
267 | 268 | ||
@@ -292,7 +293,7 @@ file system) of the abandoned cgroup. This enables automatic | |||
292 | removal of abandoned cgroups. The default value of | 293 | removal of abandoned cgroups. The default value of |
293 | notify_on_release in the root cgroup at system boot is disabled | 294 | notify_on_release in the root cgroup at system boot is disabled |
294 | (0). The default value of other cgroups at creation is the current | 295 | (0). The default value of other cgroups at creation is the current |
295 | value of their parents notify_on_release setting. The default value of | 296 | value of their parents' notify_on_release settings. The default value of |
296 | a cgroup hierarchy's release_agent path is empty. | 297 | a cgroup hierarchy's release_agent path is empty. |
297 | 298 | ||
298 | 1.5 What does clone_children do ? | 299 | 1.5 What does clone_children do ? |
@@ -316,7 +317,7 @@ the "cpuset" cgroup subsystem, the steps are something like: | |||
316 | 4) Create the new cgroup by doing mkdir's and write's (or echo's) in | 317 | 4) Create the new cgroup by doing mkdir's and write's (or echo's) in |
317 | the /sys/fs/cgroup virtual file system. | 318 | the /sys/fs/cgroup virtual file system. |
318 | 5) Start a task that will be the "founding father" of the new job. | 319 | 5) Start a task that will be the "founding father" of the new job. |
319 | 6) Attach that task to the new cgroup by writing its pid to the | 320 | 6) Attach that task to the new cgroup by writing its PID to the |
320 | /sys/fs/cgroup/cpuset/tasks file for that cgroup. | 321 | /sys/fs/cgroup/cpuset/tasks file for that cgroup. |
321 | 7) fork, exec or clone the job tasks from this founding father task. | 322 | 7) fork, exec or clone the job tasks from this founding father task. |
322 | 323 | ||
@@ -344,7 +345,7 @@ and then start a subshell 'sh' in that cgroup: | |||
344 | 2.1 Basic Usage | 345 | 2.1 Basic Usage |
345 | --------------- | 346 | --------------- |
346 | 347 | ||
347 | Creating, modifying, using the cgroups can be done through the cgroup | 348 | Creating, modifying, using cgroups can be done through the cgroup |
348 | virtual filesystem. | 349 | virtual filesystem. |
349 | 350 | ||
350 | To mount a cgroup hierarchy with all available subsystems, type: | 351 | To mount a cgroup hierarchy with all available subsystems, type: |
@@ -441,7 +442,7 @@ You can attach the current shell task by echoing 0: | |||
441 | # echo 0 > tasks | 442 | # echo 0 > tasks |
442 | 443 | ||
443 | You can use the cgroup.procs file instead of the tasks file to move all | 444 | You can use the cgroup.procs file instead of the tasks file to move all |
444 | threads in a threadgroup at once. Echoing the pid of any task in a | 445 | threads in a threadgroup at once. Echoing the PID of any task in a |
445 | threadgroup to cgroup.procs causes all tasks in that threadgroup to be | 446 | threadgroup to cgroup.procs causes all tasks in that threadgroup to be |
446 | be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks | 447 | be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks |
447 | in the writing task's threadgroup. | 448 | in the writing task's threadgroup. |
@@ -479,7 +480,7 @@ in /proc/mounts and /proc/<pid>/cgroups. | |||
479 | There is mechanism which allows to get notifications about changing | 480 | There is mechanism which allows to get notifications about changing |
480 | status of a cgroup. | 481 | status of a cgroup. |
481 | 482 | ||
482 | To register new notification handler you need: | 483 | To register a new notification handler you need to: |
483 | - create a file descriptor for event notification using eventfd(2); | 484 | - create a file descriptor for event notification using eventfd(2); |
484 | - open a control file to be monitored (e.g. memory.usage_in_bytes); | 485 | - open a control file to be monitored (e.g. memory.usage_in_bytes); |
485 | - write "<event_fd> <control_fd> <args>" to cgroup.event_control. | 486 | - write "<event_fd> <control_fd> <args>" to cgroup.event_control. |
@@ -488,7 +489,7 @@ To register new notification handler you need: | |||
488 | eventfd will be woken up by control file implementation or when the | 489 | eventfd will be woken up by control file implementation or when the |
489 | cgroup is removed. | 490 | cgroup is removed. |
490 | 491 | ||
491 | To unregister notification handler just close eventfd. | 492 | To unregister a notification handler just close eventfd. |
492 | 493 | ||
493 | NOTE: Support of notifications should be implemented for the control | 494 | NOTE: Support of notifications should be implemented for the control |
494 | file. See documentation for the subsystem. | 495 | file. See documentation for the subsystem. |
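Illustration (not part of the patch): the three registration steps listed in this hunk might look like the sketch below. It assumes Python 3.10 or later on Linux (for `os.eventfd`) and a cgroup directory that actually provides `cgroup.event_control`; the function names and the `args` value are hypothetical, and the meaning of `args` depends on the control file being monitored.

```python
import os

def registration_line(event_fd, control_fd, args=""):
    """Build the "<event_fd> <control_fd> <args>" string for cgroup.event_control."""
    return f"{event_fd} {control_fd} {args}".rstrip()

def register_notification(cgroup_dir, control_file, args=""):
    """Register an eventfd for one control file of a cgroup.

    Returns (event_fd, control_fd). Assumes cgroup_dir contains both
    control_file and cgroup.event_control.
    """
    event_fd = os.eventfd(0)  # step 1: eventfd(2)
    try:
        # step 2: open the control file to be monitored
        control_fd = os.open(os.path.join(cgroup_dir, control_file), os.O_RDONLY)
    except OSError:
        os.close(event_fd)
        raise
    try:
        # step 3: write the registration string
        ctl = os.open(os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            os.write(ctl, registration_line(event_fd, control_fd, args).encode())
        finally:
            os.close(ctl)
    except OSError:
        os.close(control_fd)
        os.close(event_fd)
        raise
    return event_fd, control_fd
```

A caller could then block in `os.eventfd_read(event_fd)` until the eventfd is signalled, and call `os.close(event_fd)` to unregister, as the text describes.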
@@ -502,7 +503,7 @@ file. See documentation for the subsystem. | |||
502 | Each kernel subsystem that wants to hook into the generic cgroup | 503 | Each kernel subsystem that wants to hook into the generic cgroup |
503 | system needs to create a cgroup_subsys object. This contains | 504 | system needs to create a cgroup_subsys object. This contains |
504 | various methods, which are callbacks from the cgroup system, along | 505 | various methods, which are callbacks from the cgroup system, along |
505 | with a subsystem id which will be assigned by the cgroup system. | 506 | with a subsystem ID which will be assigned by the cgroup system. |
506 | 507 | ||
507 | Other fields in the cgroup_subsys object include: | 508 | Other fields in the cgroup_subsys object include: |
508 | 509 | ||
@@ -516,7 +517,7 @@ Other fields in the cgroup_subsys object include: | |||
516 | at system boot. | 517 | at system boot. |
517 | 518 | ||
518 | Each cgroup object created by the system has an array of pointers, | 519 | Each cgroup object created by the system has an array of pointers, |
519 | indexed by subsystem id; this pointer is entirely managed by the | 520 | indexed by subsystem ID; this pointer is entirely managed by the |
520 | subsystem; the generic cgroup code will never touch this pointer. | 521 | subsystem; the generic cgroup code will never touch this pointer. |
521 | 522 | ||
522 | 3.2 Synchronization | 523 | 3.2 Synchronization |
@@ -639,7 +640,7 @@ void post_clone(struct cgroup *cgrp) | |||
639 | 640 | ||
640 | Called during cgroup_create() to do any parameter | 641 | Called during cgroup_create() to do any parameter |
641 | initialization which might be required before a task could attach. For | 642 | initialization which might be required before a task could attach. For |
642 | example in cpusets, no task may attach before 'cpus' and 'mems' are set | 643 | example, in cpusets, no task may attach before 'cpus' and 'mems' are set |
643 | up. | 644 | up. |
644 | 645 | ||
645 | void bind(struct cgroup *root) | 646 | void bind(struct cgroup *root) |
@@ -650,7 +651,26 @@ and root cgroup. Currently this will only involve movement between | |||
650 | the default hierarchy (which never has sub-cgroups) and a hierarchy | 651 | the default hierarchy (which never has sub-cgroups) and a hierarchy |
651 | that is being created/destroyed (and hence has no sub-cgroups). | 652 | that is being created/destroyed (and hence has no sub-cgroups). |
652 | 653 | ||
653 | 4. Questions | 654 | 4. Extended attribute usage |
655 | =========================== | ||
656 | |||
657 | cgroup filesystem supports certain types of extended attributes in its | ||
658 | directories and files. The current supported types are: | ||
659 | - Trusted (XATTR_TRUSTED) | ||
660 | - Security (XATTR_SECURITY) | ||
661 | |||
662 | Both require CAP_SYS_ADMIN capability to set. | ||
663 | |||
664 | Like in tmpfs, the extended attributes in cgroup filesystem are stored | ||
665 | using kernel memory and it's advised to keep the usage at minimum. This | ||
666 | is the reason why user defined extended attributes are not supported, since | ||
667 | any user can do it and there's no limit in the value size. | ||
668 | |||
669 | The current known users for this feature are SELinux to limit cgroup usage | ||
670 | in containers and systemd for assorted meta data like main PID in a cgroup | ||
671 | (systemd creates a cgroup per service). | ||
672 | |||
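Illustration (not part of the patch): a userspace tool could check an attribute name against the two supported namespaces before trying to set it. This is a minimal Python sketch; the helper names are hypothetical, the prefixes are the standard Linux `trusted.` and `security.` namespaces, and the actual `os.setxattr` call is expected to fail without CAP_SYS_ADMIN.

```python
import os

# Namespaces the section lists as supported on the cgroup filesystem.
SUPPORTED_PREFIXES = ("trusted.", "security.")

def is_supported_xattr(name):
    """True if the attribute name is in the trusted or security namespace."""
    return name.startswith(SUPPORTED_PREFIXES)

def set_cgroup_xattr(path, name, value):
    """Set an extended attribute on a cgroup file or directory.

    value must be bytes. Names outside the supported namespaces
    (for example user.*) are rejected before any system call is made.
    """
    if not is_supported_xattr(name):
        raise ValueError(f"unsupported xattr namespace: {name}")
    os.setxattr(path, name, value)  # requires CAP_SYS_ADMIN
```

Because the values are held in kernel memory, a caller following the advice above would keep them short.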
673 | 5. Questions | ||
654 | ============ | 674 | ============ |
655 | 675 | ||
656 | Q: what's up with this '/bin/echo' ? | 676 | Q: what's up with this '/bin/echo' ? |
@@ -660,5 +680,5 @@ A: bash's builtin 'echo' command does not check calls to write() against | |||
660 | 680 | ||
661 | Q: When I attach processes, only the first of the line gets really attached ! | 681 | Q: When I attach processes, only the first of the line gets really attached ! |
662 | A: We can only return one error code per call to write(). So you should also | 682 | A: We can only return one error code per call to write(). So you should also |
663 | put only ONE pid. | 683 | put only ONE PID. |
664 | 684 | ||
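Illustration (not part of the patch): the two answers above suggest issuing one write() per PID and checking each result. A minimal Python sketch under those assumptions; the function name `attach_all` is hypothetical, and `tasks_path` is assumed to point at an existing `tasks` file.

```python
import os

def attach_all(tasks_path, pids):
    """Attach several PIDs, one write() call per PID.

    Returns a dict mapping each PID that failed to the OSError raised,
    so that no error is silently dropped.
    """
    failures = {}
    fd = os.open(tasks_path, os.O_WRONLY)
    try:
        for pid in pids:
            try:
                # Exactly one PID per write(), so each error code maps to one PID.
                os.write(fd, f"{pid}\n".encode())
            except OSError as err:
                failures[pid] = err
    finally:
        os.close(fd)
    return failures
```

An empty result means every write succeeded; otherwise the caller can inspect `err.errno` for each PID that was rejected.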