Diffstat (limited to 'Documentation/cgroups/cgroups.txt')
-rw-r--r--  Documentation/cgroups/cgroups.txt  92
1 file changed, 56 insertions, 36 deletions
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 4a0b64c605fc..9e04196c4d78 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -29,7 +29,8 @@ CONTENTS:
   3.1 Overview
   3.2 Synchronization
   3.3 Subsystem API
-4. Questions
+4. Extended attributes usage
+5. Questions
 
 1. Control Groups
 =================
@@ -62,9 +63,9 @@ an instance of the cgroup virtual filesystem associated with it.
 At any one time there may be multiple active hierarchies of task
 cgroups. Each hierarchy is a partition of all tasks in the system.
 
-User level code may create and destroy cgroups by name in an
+User-level code may create and destroy cgroups by name in an
 instance of the cgroup virtual file system, specify and query to
-which cgroup a task is assigned, and list the task pids assigned to
+which cgroup a task is assigned, and list the task PIDs assigned to
 a cgroup. Those creations and assignments only affect the hierarchy
 associated with that instance of the cgroup file system.
 
@@ -72,7 +73,7 @@ On their own, the only use for cgroups is for simple job
 tracking. The intention is that other subsystems hook into the generic
 cgroup support to provide new attributes for cgroups, such as
 accounting/limiting the resources which processes in a cgroup can
-access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows
+access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allow
 you to associate a set of CPUs and a set of memory nodes with the
 tasks in each cgroup.
 
@@ -80,11 +81,11 @@ tasks in each cgroup.
 ----------------------------
 
 There are multiple efforts to provide process aggregations in the
-Linux kernel, mainly for resource tracking purposes. Such efforts
+Linux kernel, mainly for resource-tracking purposes. Such efforts
 include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
 namespaces. These all require the basic notion of a
 grouping/partitioning of processes, with newly forked processes ending
-in the same group (cgroup) as their parent process.
+up in the same group (cgroup) as their parent process.
 
 The kernel cgroup patch provides the minimum essential kernel
 mechanisms required to efficiently implement such groups. It has
@@ -127,14 +128,14 @@ following lines:
                                / \
                Professors (15%)  students (5%)
 
-Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
-into NFS network class.
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
+into the NFS network class.
 
 At the same time Firefox/Lynx will share an appropriate CPU/Memory class
 depending on who launched it (prof/student).
 
 With the ability to classify tasks differently for different resources
-(by putting those resource subsystems in different hierarchies) then
+(by putting those resource subsystems in different hierarchies),
 the admin can easily set up a script which receives exec notifications
 and depending on who is launching the browser he can
 
@@ -145,19 +146,19 @@ a separate cgroup for every browser launched and associate it with
 appropriate network and other resource class. This may lead to
 proliferation of such cgroups.
 
-Also lets say that the administrator would like to give enhanced network
+Also let's say that the administrator would like to give enhanced network
 access temporarily to a student's browser (since it is night and the user
-wants to do online gaming :)) OR give one of the students simulation
-apps enhanced CPU power,
+wants to do online gaming :)) OR give one of the student's simulation
+apps enhanced CPU power.
 
-With ability to write pids directly to resource classes, it's just a
-matter of :
+With ability to write PIDs directly to resource classes, it's just a
+matter of:
 
        # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
        (after some time)
        # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
 
-Without this ability, he would have to split the cgroup into
+Without this ability, the administrator would have to split the cgroup into
 multiple separate ones and then associate the new cgroups with the
 new resource classes.
 
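Illustration only, not part of the patch: the two echo commands in the hunk above could be scripted roughly as follows. This is a minimal Python sketch that assumes a cgroup v1 hierarchy is mounted as in the text; the helper name move_task is hypothetical.

```python
import os


def move_task(cgroup_dir, pid):
    """Attach one task to a cgroup by writing its PID to the 'tasks' file.

    Hypothetical helper: cgroup_dir is the directory of the target cgroup,
    for example /sys/fs/cgroup/network/<new_class> from the text above.
    """
    # Write exactly one PID per write() call, since the kernel can only
    # report one error code per call.
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(f"{pid}\n")
```

Calling move_task once with the <new_class> directory and later with the <orig_class> directory would mirror the temporary reassignment described above.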
@@ -184,20 +185,20 @@ Control Groups extends the kernel as follows:
    field of each task_struct using the css_set, anchored at
    css_set->tasks.
 
- - A cgroup hierarchy filesystem can be mounted for browsing and
+ - A cgroup hierarchy filesystem can be mounted for browsing and
    manipulation from user space.
 
- - You can list all the tasks (by pid) attached to any cgroup.
+ - You can list all the tasks (by PID) attached to any cgroup.
 
 The implementation of cgroups requires a few, simple hooks
-into the rest of the kernel, none in performance critical paths:
+into the rest of the kernel, none in performance-critical paths:
 
  - in init/main.c, to initialize the root cgroups and initial
    css_set at system boot.
 
  - in fork and exit, to attach and detach a task from its css_set.
 
-In addition a new file system, of type "cgroup" may be mounted, to
+In addition, a new file system of type "cgroup" may be mounted, to
 enable browsing and modifying the cgroups presently known to the
 kernel. When mounting a cgroup hierarchy, you may specify a
 comma-separated list of subsystems to mount as the filesystem mount
@@ -230,13 +231,13 @@ as the path relative to the root of the cgroup file system.
 Each cgroup is represented by a directory in the cgroup file system
 containing the following files describing that cgroup:
 
- - tasks: list of tasks (by pid) attached to that cgroup. This list
-   is not guaranteed to be sorted. Writing a thread id into this file
+ - tasks: list of tasks (by PID) attached to that cgroup. This list
+   is not guaranteed to be sorted. Writing a thread ID into this file
    moves the thread into this cgroup.
- - cgroup.procs: list of tgids in the cgroup. This list is not
-   guaranteed to be sorted or free of duplicate tgids, and userspace
+ - cgroup.procs: list of thread group IDs in the cgroup. This list is
+   not guaranteed to be sorted or free of duplicate TGIDs, and userspace
    should sort/uniquify the list if this property is required.
-   Writing a thread group id into this file moves all threads in that
+   Writing a thread group ID into this file moves all threads in that
    group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
@@ -261,7 +262,7 @@ cgroup file system directories.
 
 When a task is moved from one cgroup to another, it gets a new
 css_set pointer - if there's an already existing css_set with the
-desired collection of cgroups then that group is reused, else a new
+desired collection of cgroups then that group is reused, otherwise a new
 css_set is allocated. The appropriate existing css_set is located by
 looking into a hash table.
 
@@ -292,7 +293,7 @@ file system) of the abandoned cgroup. This enables automatic
 removal of abandoned cgroups. The default value of
 notify_on_release in the root cgroup at system boot is disabled
 (0). The default value of other cgroups at creation is the current
-value of their parents notify_on_release setting. The default value of
+value of their parents' notify_on_release settings. The default value of
 a cgroup hierarchy's release_agent path is empty.
 
 1.5 What does clone_children do ?
@@ -316,7 +317,7 @@ the "cpuset" cgroup subsystem, the steps are something like:
  4) Create the new cgroup by doing mkdir's and write's (or echo's) in
     the /sys/fs/cgroup virtual file system.
  5) Start a task that will be the "founding father" of the new job.
- 6) Attach that task to the new cgroup by writing its pid to the
+ 6) Attach that task to the new cgroup by writing its PID to the
     /sys/fs/cgroup/cpuset/tasks file for that cgroup.
  7) fork, exec or clone the job tasks from this founding father task.
 
@@ -344,7 +345,7 @@ and then start a subshell 'sh' in that cgroup:
 2.1 Basic Usage
 ---------------
 
-Creating, modifying, using the cgroups can be done through the cgroup
+Creating, modifying, using cgroups can be done through the cgroup
 virtual filesystem.
 
 To mount a cgroup hierarchy with all available subsystems, type:
@@ -441,7 +442,7 @@ You can attach the current shell task by echoing 0:
 # echo 0 > tasks
 
 You can use the cgroup.procs file instead of the tasks file to move all
-threads in a threadgroup at once. Echoing the pid of any task in a
+threads in a threadgroup at once. Echoing the PID of any task in a
 threadgroup to cgroup.procs causes all tasks in that threadgroup to be
 be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
 in the writing task's threadgroup.
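Illustration only, not part of the patch: the cgroup.procs description earlier in this diff says the list may be unsorted and may contain duplicate TGIDs, and that userspace should sort/uniquify it when needed. A minimal Python sketch of that step, with the hypothetical helper name read_tgids, might look like this:

```python
def read_tgids(procs_path):
    """Return the thread group IDs listed in a cgroup.procs file.

    Hypothetical helper: the kernel guarantees neither order nor
    uniqueness, so the result is de-duplicated and sorted here.
    """
    with open(procs_path) as f:
        # Skip blank lines; each remaining line holds one decimal TGID.
        return sorted({int(line) for line in f if line.strip()})
```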
@@ -479,7 +480,7 @@ in /proc/mounts and /proc/<pid>/cgroups.
 There is mechanism which allows to get notifications about changing
 status of a cgroup.
 
-To register new notification handler you need:
+To register a new notification handler you need to:
  - create a file descriptor for event notification using eventfd(2);
  - open a control file to be monitored (e.g. memory.usage_in_bytes);
  - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
@@ -488,7 +489,7 @@ To register new notification handler you need:
 eventfd will be woken up by control file implementation or when the
 cgroup is removed.
 
-To unregister notification handler just close eventfd.
+To unregister a notification handler just close eventfd.
 
 NOTE: Support of notifications should be implemented for the control
 file. See documentation for the subsystem.
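Illustration only, not part of the patch: the three registration steps listed in the notification hunks above can be sketched in Python. This assumes Linux, Python 3.10 or later (for os.eventfd), and a cgroup v1 hierarchy. The helper names are hypothetical, and the meaning of args depends on the control file, so see the subsystem documentation.

```python
import os


def event_control_line(event_fd, control_fd, args=""):
    """Build the "<event_fd> <control_fd> <args>" registration string."""
    return f"{event_fd} {control_fd} {args}".rstrip()


def register_notification(cgroup_dir, control_file, args=""):
    """Register an eventfd for a control file; return (event_fd, control_fd).

    Hypothetical helper. The caller owns both descriptors; closing the
    eventfd unregisters the handler, as described in the text above.
    """
    # Step 1: create the file descriptor used for event notification.
    efd = os.eventfd(0)
    try:
        # Step 2: open the control file to be monitored.
        cfd = os.open(os.path.join(cgroup_dir, control_file), os.O_RDONLY)
    except OSError:
        os.close(efd)
        raise
    try:
        # Step 3: write the registration string to cgroup.event_control.
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
            f.write(event_control_line(efd, cfd, args))
    except OSError:
        os.close(cfd)
        os.close(efd)
        raise
    return efd, cfd
```

A caller could then block in os.eventfd_read(efd) until the control file implementation wakes the eventfd or the cgroup is removed.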
@@ -502,7 +503,7 @@ file. See documentation for the subsystem.
 Each kernel subsystem that wants to hook into the generic cgroup
 system needs to create a cgroup_subsys object. This contains
 various methods, which are callbacks from the cgroup system, along
-with a subsystem id which will be assigned by the cgroup system.
+with a subsystem ID which will be assigned by the cgroup system.
 
 Other fields in the cgroup_subsys object include:
 
@@ -516,7 +517,7 @@ Other fields in the cgroup_subsys object include:
   at system boot.
 
 Each cgroup object created by the system has an array of pointers,
-indexed by subsystem id; this pointer is entirely managed by the
+indexed by subsystem ID; this pointer is entirely managed by the
 subsystem; the generic cgroup code will never touch this pointer.
 
 3.2 Synchronization
@@ -639,7 +640,7 @@ void post_clone(struct cgroup *cgrp)
 
 Called during cgroup_create() to do any parameter
 initialization which might be required before a task could attach. For
-example in cpusets, no task may attach before 'cpus' and 'mems' are set
+example, in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
 
 void bind(struct cgroup *root)
@@ -650,7 +651,26 @@ and root cgroup. Currently this will only involve movement between
 the default hierarchy (which never has sub-cgroups) and a hierarchy
 that is being created/destroyed (and hence has no sub-cgroups).
 
-4. Questions
+4. Extended attribute usage
+===========================
+
+cgroup filesystem supports certain types of extended attributes in its
+directories and files. The current supported types are:
+  - Trusted (XATTR_TRUSTED)
+  - Security (XATTR_SECURITY)
+
+Both require CAP_SYS_ADMIN capability to set.
+
+Like in tmpfs, the extended attributes in cgroup filesystem are stored
+using kernel memory and it's advised to keep the usage at minimum. This
+is the reason why user defined extended attributes are not supported, since
+any user can do it and there's no limit in the value size.
+
+The current known users for this feature are SELinux to limit cgroup usage
+in containers and systemd for assorted meta data like main PID in a cgroup
+(systemd creates a cgroup per service).
+
+5. Questions
 ============
 
 Q: what's up with this '/bin/echo' ?
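Illustration only, not part of the patch: the extended attribute section added in the hunk above could be exercised from userspace roughly as follows. This Python sketch assumes Linux (os.setxattr) and a caller holding CAP_SYS_ADMIN. The helper names and the attribute name used in the usage note are hypothetical.

```python
import os

# Name prefixes of the two supported types (Linux xattr namespaces).
SUPPORTED_PREFIXES = ("trusted.", "security.")


def is_supported_cgroup_xattr(name):
    """Return True if name falls in the trusted or security namespace."""
    return any(
        name.startswith(prefix) and len(name) > len(prefix)
        for prefix in SUPPORTED_PREFIXES
    )


def set_cgroup_xattr(path, name, value):
    """Set an extended attribute on a cgroup file or directory.

    Hypothetical helper: rejects unsupported namespaces (such as user.*)
    before calling into the kernel. value must be bytes and should stay
    small, since the attribute is stored in kernel memory.
    """
    if not is_supported_cgroup_xattr(name):
        raise ValueError(f"unsupported xattr namespace: {name!r}")
    os.setxattr(path, name, value)
```

For example, set_cgroup_xattr(cgroup_dir, "trusted.example_note", b"1") would attach a small trusted attribute to a cgroup directory, where cgroup_dir is whatever path the hierarchy is mounted under.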
@@ -660,5 +680,5 @@ A: bash's builtin 'echo' command does not check calls to write() against
 
 Q: When I attach processes, only the first of the line gets really attached !
 A: We can only return one error code per call to write(). So you should also
-   put only ONE pid.
+   put only ONE PID.
 