Diffstat (limited to 'Documentation/cgroups')
-rw-r--r-- | Documentation/cgroups/cgroups.txt | 548 | ||||
-rw-r--r-- | Documentation/cgroups/freezer-subsystem.txt | 99 |
2 files changed, 647 insertions, 0 deletions
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
new file mode 100644
index 000000000000..d9014aa0eb68
--- /dev/null
+++ b/Documentation/cgroups/cgroups.txt
@@ -0,0 +1,548 @@
1 | CGROUPS | ||
2 | ------- | ||
3 | |||
4 | Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt | ||
5 | |||
6 | Original copyright statements from cpusets.txt: | ||
7 | Portions Copyright (C) 2004 BULL SA. | ||
8 | Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | ||
9 | Modified by Paul Jackson <pj@sgi.com> | ||
10 | Modified by Christoph Lameter <clameter@sgi.com> | ||
11 | |||
12 | CONTENTS: | ||
13 | ========= | ||
14 | |||
15 | 1. Control Groups | ||
16 | 1.1 What are cgroups ? | ||
17 | 1.2 Why are cgroups needed ? | ||
18 | 1.3 How are cgroups implemented ? | ||
19 | 1.4 What does notify_on_release do ? | ||
20 | 1.5 How do I use cgroups ? | ||
21 | 2. Usage Examples and Syntax | ||
22 | 2.1 Basic Usage | ||
23 | 2.2 Attaching processes | ||
24 | 3. Kernel API | ||
25 | 3.1 Overview | ||
26 | 3.2 Synchronization | ||
27 | 3.3 Subsystem API | ||
28 | 4. Questions | ||
29 | |||
30 | 1. Control Groups | ||
31 | ================= | ||
32 | |||
33 | 1.1 What are cgroups ? | ||
34 | ---------------------- | ||
35 | |||
36 | Control Groups provide a mechanism for aggregating/partitioning sets of | ||
37 | tasks, and all their future children, into hierarchical groups with | ||
38 | specialized behaviour. | ||
39 | |||
40 | Definitions: | ||
41 | |||
42 | A *cgroup* associates a set of tasks with a set of parameters for one | ||
43 | or more subsystems. | ||
44 | |||
45 | A *subsystem* is a module that makes use of the task grouping | ||
46 | facilities provided by cgroups to treat groups of tasks in | ||
47 | particular ways. A subsystem is typically a "resource controller" that | ||
48 | schedules a resource or applies per-cgroup limits, but it may be | ||
49 | anything that wants to act on a group of processes, e.g. a | ||
50 | virtualization subsystem. | ||
51 | |||
52 | A *hierarchy* is a set of cgroups arranged in a tree, such that | ||
53 | every task in the system is in exactly one of the cgroups in the | ||
54 | hierarchy, and a set of subsystems; each subsystem has system-specific | ||
55 | state attached to each cgroup in the hierarchy. Each hierarchy has | ||
56 | an instance of the cgroup virtual filesystem associated with it. | ||
57 | |||
58 | At any one time there may be multiple active hierarchies of task | ||
59 | cgroups. Each hierarchy is a partition of all tasks in the system. | ||
60 | |||
61 | User level code may create and destroy cgroups by name in an | ||
62 | instance of the cgroup virtual file system, specify and query to | ||
63 | which cgroup a task is assigned, and list the task pids assigned to | ||
64 | a cgroup. Those creations and assignments only affect the hierarchy | ||
65 | associated with that instance of the cgroup file system. | ||
66 | |||
67 | On their own, the only use for cgroups is for simple job | ||
68 | tracking. The intention is that other subsystems hook into the generic | ||
69 | cgroup support to provide new attributes for cgroups, such as | ||
70 | accounting/limiting the resources which processes in a cgroup can | ||
71 | access. For example, cpusets (see Documentation/cpusets.txt) allows | ||
72 | you to associate a set of CPUs and a set of memory nodes with the | ||
73 | tasks in each cgroup. | ||
74 | |||
75 | 1.2 Why are cgroups needed ? | ||
76 | ---------------------------- | ||
77 | |||
78 | There are multiple efforts to provide process aggregations in the | ||
79 | Linux kernel, mainly for resource tracking purposes. Such efforts | ||
80 | include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server | ||
81 | namespaces. These all require the basic notion of a | ||
82 | grouping/partitioning of processes, with newly forked processes ending | ||
83 | in the same group (cgroup) as their parent process. | ||
84 | |||
85 | The kernel cgroup patch provides the minimum essential kernel | ||
86 | mechanisms required to efficiently implement such groups. It has | ||
87 | minimal impact on the system fast paths, and provides hooks for | ||
88 | specific subsystems such as cpusets to provide additional behaviour as | ||
89 | desired. | ||
90 | |||
91 | Multiple hierarchy support is provided to allow for situations where | ||
92 | the division of tasks into cgroups is distinctly different for | ||
93 | different subsystems - having parallel hierarchies allows each | ||
94 | hierarchy to be a natural division of tasks, without having to handle | ||
95 | complex combinations of tasks that would be present if several | ||
96 | unrelated subsystems needed to be forced into the same tree of | ||
97 | cgroups. | ||
98 | |||
99 | At one extreme, each resource controller or subsystem could be in a | ||
100 | separate hierarchy; at the other extreme, all subsystems | ||
101 | would be attached to the same hierarchy. | ||
102 | |||
103 | As an example of a scenario (originally proposed by vatsa@in.ibm.com) | ||
104 | that can benefit from multiple hierarchies, consider a large | ||
105 | university server with various users - students, professors, system | ||
106 | tasks etc. The resource planning for this server could be along the | ||
107 | following lines: | ||
108 | |||
109 | CPU : Top cpuset | ||
110 | / \ | ||
111 | CPUSet1 CPUSet2 | ||
112 | | | | ||
113 | (Profs) (Students) | ||
114 | |||
115 | In addition, system tasks are attached to the top cpuset (so | ||
116 | that they can run anywhere) with a limit of 20%. | ||
117 | |||
118 | Memory : Professors (50%), students (30%), system (20%) | ||
119 | |||
120 | Disk : Prof (50%), students (30%), system (20%) | ||
121 | |||
122 | Network : WWW browsing (20%), Network File System (60%), others (20%) | ||
123 | / \ | ||
124 | Prof (15%) students (5%) | ||
125 | |||
126 | Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go | ||
127 | into NFS network class. | ||
128 | |||
129 | At the same time firefox/lynx will share an appropriate CPU/Memory class | ||
130 | depending on who launched it (prof/student). | ||
131 | |||
132 | With the ability to classify tasks differently for different resources | ||
133 | (by putting those resource subsystems in different hierarchies) then | ||
134 | the admin can easily set up a script which receives exec notifications | ||
135 | and depending on who is launching the browser he can | ||
136 | |||
137 | # echo browser_pid > /mnt/<restype>/<userclass>/tasks | ||
138 | |||
139 | With only a single hierarchy, he now would potentially have to create | ||
140 | a separate cgroup for every browser launched and associate it with | ||
141 | the appropriate network and other resource classes. This may lead to | ||
142 | a proliferation of such cgroups. | ||
143 | |||
144 | Also, let's say that the administrator would like to give enhanced network | ||
145 | access temporarily to a student's browser (since it is night and the user | ||
146 | wants to do online gaming :)) OR give one of the student's simulation | ||
147 | apps enhanced CPU power. | ||
148 | |||
149 | With the ability to write pids directly to resource classes, it's just a | ||
150 | matter of: | ||
151 | |||
152 | # echo pid > /mnt/network/<new_class>/tasks | ||
153 | (after some time) | ||
154 | # echo pid > /mnt/network/<orig_class>/tasks | ||
155 | |||
156 | Without this ability, he would have to split the cgroup into | ||
157 | multiple separate ones and then associate the new cgroups with the | ||
158 | new resource classes. | ||
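The temporary reassignment described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name with_temporary_class and its argument order are hypothetical, and the two directory arguments stand for the <new_class> and <orig_class> cgroup directories of one mounted hierarchy.

```shell
# with_temporary_class PID NEW_CLASS_DIR ORIG_CLASS_DIR COMMAND...
# (hypothetical helper) Move PID into NEW_CLASS_DIR, run COMMAND while the
# task stays there, then move PID back into ORIG_CLASS_DIR.
with_temporary_class() (
    pid=$1; new=$2; orig=$3
    shift 3
    # Writing a pid to a cgroup's tasks file attaches that task to the cgroup.
    printf '%s\n' "$pid" >> "$new/tasks" || exit 1
    # Whatever should happen while the task is boosted, e.g. "sleep 3600".
    "$@"
    # Put the task back in its original class.
    printf '%s\n' "$pid" >> "$orig/tasks"
)
```

For example, `with_temporary_class 1234 /mnt/network/boosted /mnt/network/students sleep 3600` would keep pid 1234 in the (hypothetical) "boosted" class for an hour. The append redirection is used so the sketch can also be tried against ordinary files.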
159 | |||
160 | |||
161 | |||
162 | 1.3 How are cgroups implemented ? | ||
163 | --------------------------------- | ||
164 | |||
165 | Control Groups extends the kernel as follows: | ||
166 | |||
167 | - Each task in the system has a reference-counted pointer to a | ||
168 | css_set. | ||
169 | |||
170 | - A css_set contains a set of reference-counted pointers to | ||
171 | cgroup_subsys_state objects, one for each cgroup subsystem | ||
172 | registered in the system. There is no direct link from a task to | ||
173 | the cgroup of which it's a member in each hierarchy, but this | ||
174 | can be determined by following pointers through the | ||
175 | cgroup_subsys_state objects. This is because accessing the | ||
176 | subsystem state is something that's expected to happen frequently | ||
177 | and in performance-critical code, whereas operations that require a | ||
178 | task's actual cgroup assignments (in particular, moving between | ||
179 | cgroups) are less common. A linked list runs through the cg_list | ||
180 | field of each task_struct using the css_set, anchored at | ||
181 | css_set->tasks. | ||
182 | |||
183 | - A cgroup hierarchy filesystem can be mounted for browsing and | ||
184 | manipulation from user space. | ||
185 | |||
186 | - You can list all the tasks (by pid) attached to any cgroup. | ||
187 | |||
188 | The implementation of cgroups requires a few, simple hooks | ||
189 | into the rest of the kernel, none in performance critical paths: | ||
190 | |||
191 | - in init/main.c, to initialize the root cgroups and initial | ||
192 | css_set at system boot. | ||
193 | |||
194 | - in fork and exit, to attach and detach a task from its css_set. | ||
195 | |||
196 | In addition a new file system, of type "cgroup" may be mounted, to | ||
197 | enable browsing and modifying the cgroups presently known to the | ||
198 | kernel. When mounting a cgroup hierarchy, you may specify a | ||
199 | comma-separated list of subsystems to mount as the filesystem mount | ||
200 | options. By default, mounting the cgroup filesystem attempts to | ||
201 | mount a hierarchy containing all registered subsystems. | ||
202 | |||
203 | If an active hierarchy with exactly the same set of subsystems already | ||
204 | exists, it will be reused for the new mount. If no existing hierarchy | ||
205 | matches, and any of the requested subsystems are in use in an existing | ||
206 | hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy | ||
207 | is activated, associated with the requested subsystems. | ||
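The three mount outcomes above can be modelled in user space. The sketch below is not kernel code; the function names norm_subsys and mount_decision are made up for illustration, and each hierarchy is represented simply as a comma-separated list of subsystem names.

```shell
# Normalize a comma-separated subsystem list: sort it and drop duplicates,
# so that "memory,cpuset" and "cpuset,memory" compare equal.
norm_subsys() {
    printf '%s\n' "$1" | tr ',' '\n' | sort -u | paste -s -d, -
}

# mount_decision REQUESTED [EXISTING...]
# Prints "reuse", "EBUSY" or "new", following the rules in the text.
mount_decision() (
    want=$(norm_subsys "$1")
    shift
    # Rule 1: an active hierarchy with exactly the same set is reused.
    for have in "$@"; do
        [ "$(norm_subsys "$have")" = "$want" ] && { echo reuse; exit 0; }
    done
    # Rule 2: any requested subsystem already bound elsewhere means EBUSY.
    for have in "$@"; do
        for s in $(printf '%s\n' "$want" | tr ',' ' '); do
            case ",$have," in
                *",$s,"*) echo EBUSY; exit 0 ;;
            esac
        done
    done
    # Rule 3: otherwise a new hierarchy is activated.
    echo new
)
```

Calling `mount_decision cpuset cpuset,memory` models requesting only cpuset while a cpuset,memory hierarchy is active.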
208 | |||
209 | It's not currently possible to bind a new subsystem to an active | ||
210 | cgroup hierarchy, or to unbind a subsystem from an active cgroup | ||
211 | hierarchy. This may be possible in future, but is fraught with nasty | ||
212 | error-recovery issues. | ||
213 | |||
214 | When a cgroup filesystem is unmounted, if there are any | ||
215 | child cgroups created below the top-level cgroup, that hierarchy | ||
216 | will remain active even though unmounted; if there are no | ||
217 | child cgroups then the hierarchy will be deactivated. | ||
218 | |||
219 | No new system calls are added for cgroups - all support for | ||
220 | querying and modifying cgroups is via this cgroup file system. | ||
221 | |||
222 | Each task under /proc has an added file named 'cgroup' displaying, | ||
223 | for each active hierarchy, the subsystem names and the cgroup name | ||
224 | as the path relative to the root of the cgroup file system. | ||
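A script can read this file to find a task's cgroup in a given hierarchy. The sketch below assumes each line has the colon-separated form <id>:<subsystem-list>:<path>; that layout is an assumption to check against your kernel, and the function name cgroup_of is hypothetical.

```shell
# cgroup_of FILE SUBSYS
# Print the cgroup path recorded in FILE (e.g. /proc/self/cgroup) for the
# hierarchy that has SUBSYS bound to it. Prints nothing if SUBSYS is absent.
cgroup_of() {
    awk -F: -v want="$2" '{
        n = split($2, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] == want) {
                # Keep everything after the second colon, so a path that
                # itself contains ":" is not truncated.
                sub(/^[^:]*:[^:]*:/, "")
                print
                exit
            }
    }' "$1"
}
```

Typical use would be `cgroup_of /proc/self/cgroup cpuset`.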
225 | |||
226 | Each cgroup is represented by a directory in the cgroup file system | ||
227 | containing the following files describing that cgroup: | ||
228 | |||
229 | - tasks: list of tasks (by pid) attached to that cgroup | ||
230 | - releasable flag: cgroup currently removable? | ||
231 | - notify_on_release flag: run the release agent on exit? | ||
232 | - release_agent: the path to use for release notifications (this file | ||
233 | exists in the top cgroup only) | ||
234 | |||
235 | Other subsystems such as cpusets may add additional files in each | ||
236 | cgroup dir. | ||
237 | |||
238 | New cgroups are created using the mkdir system call or shell | ||
239 | command. The properties of a cgroup, such as its flags, are | ||
240 | modified by writing to the appropriate file in that cgroup's | ||
241 | directory, as listed above. | ||
242 | |||
243 | The named hierarchical structure of nested cgroups allows partitioning | ||
244 | a large system into nested, dynamically changeable, "soft-partitions". | ||
245 | |||
246 | The attachment of each task, automatically inherited at fork by any | ||
247 | children of that task, to a cgroup allows organizing the work load | ||
248 | on a system into related sets of tasks. A task may be re-attached to | ||
249 | any other cgroup, if allowed by the permissions on the necessary | ||
250 | cgroup file system directories. | ||
251 | |||
252 | When a task is moved from one cgroup to another, it gets a new | ||
253 | css_set pointer - if there's an already existing css_set with the | ||
254 | desired collection of cgroups then that group is reused, else a new | ||
255 | css_set is allocated. Note that the current implementation uses a | ||
256 | linear search to locate an appropriate existing css_set, so isn't | ||
257 | very efficient. A future version will use a hash table for better | ||
258 | performance. | ||
259 | |||
260 | To allow access from a cgroup to the css_sets (and hence tasks) | ||
261 | that comprise it, a set of cg_cgroup_link objects form a lattice; | ||
262 | each cg_cgroup_link is linked into a list of cg_cgroup_links for | ||
263 | a single cgroup on its cgrp_link_list field, and a list of | ||
264 | cg_cgroup_links for a single css_set on its cg_link_list. | ||
265 | |||
266 | Thus the set of tasks in a cgroup can be listed by iterating over | ||
267 | each css_set that references the cgroup, and sub-iterating over | ||
268 | each css_set's task set. | ||
269 | |||
270 | The use of a Linux virtual file system (vfs) to represent the | ||
271 | cgroup hierarchy provides for a familiar permission and name space | ||
272 | for cgroups, with a minimum of additional kernel code. | ||
273 | |||
274 | 1.4 What does notify_on_release do ? | ||
275 | ------------------------------------ | ||
276 | |||
277 | If the notify_on_release flag is enabled (1) in a cgroup, then | ||
278 | whenever the last task in the cgroup leaves (exits or attaches to | ||
279 | some other cgroup) and the last child cgroup of that cgroup | ||
280 | is removed, then the kernel runs the command specified by the contents | ||
281 | of the "release_agent" file in that hierarchy's root directory, | ||
282 | supplying the pathname (relative to the mount point of the cgroup | ||
283 | file system) of the abandoned cgroup. This enables automatic | ||
284 | removal of abandoned cgroups. The default value of | ||
285 | notify_on_release in the root cgroup at system boot is disabled | ||
286 | (0). The default value of other cgroups at creation is the current | ||
287 | value of their parent's notify_on_release setting. The default value of | ||
288 | a cgroup hierarchy's release_agent path is empty. | ||
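A release agent that performs the automatic removal mentioned above could look roughly like this. It is a sketch only: the function name release_agent_cleanup is hypothetical, and a real agent would be a standalone program that knows its hierarchy's mount point (passed here as the first argument) and receives the relative pathname from the kernel.

```shell
# release_agent_cleanup MOUNTPOINT RELPATH
# MOUNTPOINT is where the hierarchy is mounted (e.g. /dev/cgroup).
# RELPATH is the abandoned cgroup's pathname relative to that mount point,
# as supplied to the release agent.
release_agent_cleanup() {
    # Strip one leading "/" so the two parts join cleanly, then remove the
    # now-empty cgroup. rmdir fails harmlessly if it became busy again.
    rmdir "$1/${2#/}"
}
```

Only the abandoned cgroup itself is removed; its parent is left in place.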
289 | |||
290 | 1.5 How do I use cgroups ? | ||
291 | -------------------------- | ||
292 | |||
293 | To start a new job that is to be contained within a cgroup, using | ||
294 | the "cpuset" cgroup subsystem, the steps are something like: | ||
295 | |||
296 | 1) mkdir /dev/cgroup | ||
297 | 2) mount -t cgroup -ocpuset cpuset /dev/cgroup | ||
298 | 3) Create the new cgroup by doing mkdir's and write's (or echo's) in | ||
299 | the /dev/cgroup virtual file system. | ||
300 | 4) Start a task that will be the "founding father" of the new job. | ||
301 | 5) Attach that task to the new cgroup by writing its pid to the | ||
302 | /dev/cgroup tasks file for that cgroup. | ||
303 | 6) fork, exec or clone the job tasks from this founding father task. | ||
304 | |||
305 | For example, the following sequence of commands will set up a cgroup | ||
306 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | ||
307 | and then start a subshell 'sh' in that cgroup: | ||
308 | |||
309 | mount -t cgroup cpuset -ocpuset /dev/cgroup | ||
310 | cd /dev/cgroup | ||
311 | mkdir Charlie | ||
312 | cd Charlie | ||
313 | /bin/echo 2-3 > cpuset.cpus | ||
314 | /bin/echo 1 > cpuset.mems | ||
315 | /bin/echo $$ > tasks | ||
316 | sh | ||
317 | # The subshell 'sh' is now running in cgroup Charlie | ||
318 | # The next line should display '/Charlie' | ||
319 | cat /proc/self/cgroup | ||
320 | |||
321 | 2. Usage Examples and Syntax | ||
322 | ============================ | ||
323 | |||
324 | 2.1 Basic Usage | ||
325 | --------------- | ||
326 | |||
327 | Creating, modifying, using the cgroups can be done through the cgroup | ||
328 | virtual filesystem. | ||
329 | |||
330 | To mount a cgroup hierarchy with all available subsystems, type: | ||
331 | # mount -t cgroup xxx /dev/cgroup | ||
332 | |||
333 | The "xxx" is not interpreted by the cgroup code, but will appear in | ||
334 | /proc/mounts so may be any useful identifying string that you like. | ||
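Because the identifying string shows up in /proc/mounts, mounted hierarchies can be listed with a short filter. The sketch below relies only on the standard /proc/mounts column order (device, mount point, filesystem type, options); the function name list_cgroup_mounts is hypothetical.

```shell
# list_cgroup_mounts [FILE]
# Print "name mountpoint options" for every mount of type "cgroup".
# FILE defaults to /proc/mounts; another file can be given for testing.
list_cgroup_mounts() {
    awk '$3 == "cgroup" { print $1, $2, $4 }' "${1:-/proc/mounts}"
}
```

The options column includes the names of the subsystems bound to each hierarchy.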
335 | |||
336 | To mount a cgroup hierarchy with just the cpuset and numtasks | ||
337 | subsystems, type: | ||
338 | # mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup | ||
339 | |||
340 | To change the set of subsystems bound to a mounted hierarchy, just | ||
341 | remount with different options: | ||
342 | |||
343 | # mount -o remount,cpuset,ns /dev/cgroup | ||
344 | |||
345 | Note that changing the set of subsystems is currently only supported | ||
346 | when the hierarchy consists of a single (root) cgroup. Supporting | ||
347 | the ability to arbitrarily bind/unbind subsystems from an existing | ||
348 | cgroup hierarchy is intended to be implemented in the future. | ||
349 | |||
350 | Then under /dev/cgroup you can find a tree that corresponds to the | ||
351 | tree of the cgroups in the system. For instance, /dev/cgroup | ||
352 | is the cgroup that holds the whole system. | ||
353 | |||
354 | If you want to create a new cgroup under /dev/cgroup: | ||
355 | # cd /dev/cgroup | ||
356 | # mkdir my_cgroup | ||
357 | |||
358 | Now you want to do something with this cgroup. | ||
359 | # cd my_cgroup | ||
360 | |||
361 | In this directory you can find several files: | ||
362 | # ls | ||
363 | notify_on_release releasable tasks | ||
364 | (plus whatever files are added by the attached subsystems) | ||
365 | |||
366 | Now attach your shell to this cgroup: | ||
367 | # /bin/echo $$ > tasks | ||
368 | |||
369 | You can also create cgroups inside your cgroup by using mkdir in this | ||
370 | directory. | ||
371 | # mkdir my_sub_cs | ||
372 | |||
373 | To remove a cgroup, just use rmdir: | ||
374 | # rmdir my_sub_cs | ||
375 | |||
376 | This will fail if the cgroup is in use (has cgroups inside, or | ||
377 | has processes attached, or is held alive by another subsystem-specific | ||
378 | reference). | ||
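Before calling rmdir, a script can check the first two conditions itself. This is a rough sketch with a hypothetical function name, cgroup_removable; it cannot see subsystem-specific references, so rmdir may still fail after the check succeeds.

```shell
# cgroup_removable DIR
# Succeed (exit status 0) if the cgroup directory DIR has no attached tasks
# and no child cgroups; fail otherwise.
cgroup_removable() (
    # Read the tasks file rather than testing its size, since virtual files
    # may report a size of zero even when they have content.
    [ -z "$(cat "$1/tasks" 2>/dev/null)" ] || exit 1
    # Any subdirectory is a child cgroup.
    for child in "$1"/*/; do
        [ -d "$child" ] && exit 1
    done
    exit 0
)
```

A caller might write `cgroup_removable my_sub_cs && rmdir my_sub_cs`.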
379 | |||
380 | 2.2 Attaching processes | ||
381 | ----------------------- | ||
382 | |||
383 | # /bin/echo PID > tasks | ||
384 | |||
385 | Note that it is PID, not PIDs. You can only attach ONE task at a time. | ||
386 | If you have several tasks to attach, you have to do it one after another: | ||
387 | |||
388 | # /bin/echo PID1 > tasks | ||
389 | # /bin/echo PID2 > tasks | ||
390 | ... | ||
391 | # /bin/echo PIDn > tasks | ||
392 | |||
393 | You can attach the current shell task by echoing 0: | ||
394 | |||
395 | # echo 0 > tasks | ||
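The one-pid-per-write rule lends itself to a small loop. The helper below is a sketch; the name attach_pids is hypothetical, and it stops at the first pid that cannot be attached so the failing pid is easy to identify.

```shell
# attach_pids TASKS_FILE PID...
# Attach each PID to the cgroup that owns TASKS_FILE, one write per pid.
attach_pids() (
    tasks=$1
    shift
    for pid in "$@"; do
        # Each redirection opens the file and issues a separate write,
        # so each pid gets its own success or failure result.
        printf '%s\n' "$pid" >> "$tasks" || {
            echo "attach_pids: failed to attach $pid" >&2
            exit 1
        }
    done
)
```

For instance, `attach_pids /dev/cgroup/my_cgroup/tasks 101 102 103`. The append redirection is used so the sketch can also be exercised against an ordinary file.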
396 | |||
397 | 3. Kernel API | ||
398 | ============= | ||
399 | |||
400 | 3.1 Overview | ||
401 | ------------ | ||
402 | |||
403 | Each kernel subsystem that wants to hook into the generic cgroup | ||
404 | system needs to create a cgroup_subsys object. This contains | ||
405 | various methods, which are callbacks from the cgroup system, along | ||
406 | with a subsystem id which will be assigned by the cgroup system. | ||
407 | |||
408 | Other fields in the cgroup_subsys object include: | ||
409 | |||
410 | - subsys_id: a unique array index for the subsystem, indicating which | ||
411 | entry in cgroup->subsys[] this subsystem should be managing. | ||
412 | |||
413 | - name: should be initialized to a unique subsystem name. Should be | ||
414 | no longer than MAX_CGROUP_TYPE_NAMELEN. | ||
415 | |||
416 | - early_init: indicate if the subsystem needs early initialization | ||
417 | at system boot. | ||
418 | |||
419 | Each cgroup object created by the system has an array of pointers, | ||
420 | indexed by subsystem id; this pointer is entirely managed by the | ||
421 | subsystem; the generic cgroup code will never touch this pointer. | ||
422 | |||
423 | 3.2 Synchronization | ||
424 | ------------------- | ||
425 | |||
426 | There is a global mutex, cgroup_mutex, used by the cgroup | ||
427 | system. This should be taken by anything that wants to modify a | ||
428 | cgroup. It may also be taken to prevent cgroups from being | ||
429 | modified, but more specific locks may be more appropriate in that | ||
430 | situation. | ||
431 | |||
432 | See kernel/cgroup.c for more details. | ||
433 | |||
434 | Subsystems can take/release the cgroup_mutex via the functions | ||
435 | cgroup_lock()/cgroup_unlock(). | ||
436 | |||
437 | Accessing a task's cgroup pointer may be done in the following ways: | ||
438 | - while holding cgroup_mutex | ||
439 | - while holding the task's alloc_lock (via task_lock()) | ||
440 | - inside an rcu_read_lock() section via rcu_dereference() | ||
441 | |||
442 | 3.3 Subsystem API | ||
443 | ----------------- | ||
444 | |||
445 | Each subsystem should: | ||
446 | |||
447 | - add an entry in linux/cgroup_subsys.h | ||
448 | - define a cgroup_subsys object called <name>_subsys | ||
449 | |||
450 | Each subsystem may export the following methods. The only mandatory | ||
451 | methods are create/destroy. Any others that are null are presumed to | ||
452 | be successful no-ops. | ||
453 | |||
454 | struct cgroup_subsys_state *create(struct cgroup_subsys *ss, | ||
455 | struct cgroup *cgrp) | ||
456 | (cgroup_mutex held by caller) | ||
457 | |||
458 | Called to create a subsystem state object for a cgroup. The | ||
459 | subsystem should allocate its subsystem state object for the passed | ||
460 | cgroup, returning a pointer to the new object on success or a | ||
461 | negative error code. On success, the subsystem pointer should point to | ||
462 | a structure of type cgroup_subsys_state (typically embedded in a | ||
463 | larger subsystem-specific object), which will be initialized by the | ||
464 | cgroup system. Note that this will be called at initialization to | ||
465 | create the root subsystem state for this subsystem; this case can be | ||
466 | identified by the passed cgroup object having a NULL parent (since | ||
467 | it's the root of the hierarchy) and may be an appropriate place for | ||
468 | initialization code. | ||
469 | |||
470 | void destroy(struct cgroup_subsys *ss, struct cgroup *cgrp) | ||
471 | (cgroup_mutex held by caller) | ||
472 | |||
473 | The cgroup system is about to destroy the passed cgroup; the subsystem | ||
474 | should do any necessary cleanup and free its subsystem state | ||
475 | object. By the time this method is called, the cgroup has already been | ||
476 | unlinked from the file system and from the child list of its parent; | ||
477 | cgroup->parent is still valid. (Note - can also be called for a | ||
478 | newly-created cgroup if an error occurs after this subsystem's | ||
479 | create() method has been called for the new cgroup). | ||
480 | |||
481 | void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp); | ||
482 | (cgroup_mutex held by caller) | ||
483 | |||
484 | Called before checking the reference count on each subsystem. This may | ||
485 | be useful for subsystems which have some extra references even if | ||
486 | there are no tasks in the cgroup. | ||
487 | |||
488 | int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | ||
489 | struct task_struct *task) | ||
490 | (cgroup_mutex held by caller) | ||
491 | |||
492 | Called prior to moving a task into a cgroup; if the subsystem | ||
493 | returns an error, this will abort the attach operation. If a NULL | ||
494 | task is passed, then a successful result indicates that *any* | ||
495 | unspecified task can be moved into the cgroup. Note that this isn't | ||
496 | called on a fork. If this method returns 0 (success) then this should | ||
497 | remain valid while the caller holds cgroup_mutex. | ||
498 | |||
499 | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | ||
500 | struct cgroup *old_cgrp, struct task_struct *task) | ||
501 | |||
502 | Called after the task has been attached to the cgroup, to allow any | ||
503 | post-attachment activity that requires memory allocations or blocking. | ||
504 | |||
505 | void fork(struct cgroup_subsys *ss, struct task_struct *task) | ||
506 | |||
507 | Called when a task is forked into a cgroup. | ||
508 | |||
509 | void exit(struct cgroup_subsys *ss, struct task_struct *task) | ||
510 | |||
511 | Called during task exit. | ||
512 | |||
513 | int populate(struct cgroup_subsys *ss, struct cgroup *cgrp) | ||
514 | |||
515 | Called after creation of a cgroup to allow a subsystem to populate | ||
516 | the cgroup directory with file entries. The subsystem should make | ||
517 | calls to cgroup_add_file() with objects of type cftype (see | ||
518 | include/linux/cgroup.h for details). Note that although this | ||
519 | method can return an error code, the error code is currently not | ||
520 | always handled well. | ||
521 | |||
522 | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) | ||
523 | |||
524 | Called at the end of cgroup_clone() to do any parameter | ||
525 | initialization which might be required before a task could attach. For | ||
526 | example in cpusets, no task may attach before 'cpus' and 'mems' are set | ||
527 | up. | ||
528 | |||
529 | void bind(struct cgroup_subsys *ss, struct cgroup *root) | ||
530 | (cgroup_mutex held by caller) | ||
531 | |||
532 | Called when a cgroup subsystem is rebound to a different hierarchy | ||
533 | and root cgroup. Currently this will only involve movement between | ||
534 | the default hierarchy (which never has sub-cgroups) and a hierarchy | ||
535 | that is being created/destroyed (and hence has no sub-cgroups). | ||
536 | |||
537 | 4. Questions | ||
538 | ============ | ||
539 | |||
540 | Q: what's up with this '/bin/echo' ? | ||
541 | A: bash's builtin 'echo' command does not check calls to write() against | ||
542 | errors. If you use it in the cgroup file system, you won't be | ||
543 | able to tell whether a command succeeded or failed. | ||
544 | |||
545 | Q: When I attach processes, only the first one on the line really gets attached! | ||
546 | A: We can only return one error code per call to write(). So you should also | ||
547 | put only ONE pid. | ||
548 | |||
diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
new file mode 100644
index 000000000000..c50ab58b72eb
--- /dev/null
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -0,0 +1,99 @@
1 | The cgroup freezer is useful to batch job management systems which start | ||
2 | and stop sets of tasks in order to schedule the resources of a machine | ||
3 | according to the desires of a system administrator. This sort of program | ||
4 | is often used on HPC clusters to schedule access to the cluster as a | ||
5 | whole. The cgroup freezer uses cgroups to describe the set of tasks to | ||
6 | be started/stopped by the batch job management system. It also provides | ||
7 | a means to start and stop the tasks composing the job. | ||
8 | |||
9 | The cgroup freezer will also be useful for checkpointing running groups | ||
10 | of tasks. The freezer allows the checkpoint code to obtain a consistent | ||
11 | image of the tasks by attempting to force the tasks in a cgroup into a | ||
12 | quiescent state. Once the tasks are quiescent another task can | ||
13 | walk /proc or invoke a kernel interface to gather information about the | ||
14 | quiesced tasks. Checkpointed tasks can be restarted later should a | ||
15 | recoverable error occur. This also allows the checkpointed tasks to be | ||
16 | migrated between nodes in a cluster by copying the gathered information | ||
17 | to another node and restarting the tasks there. | ||
18 | |||
19 | Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping | ||
20 | and resuming tasks in userspace. Both of these signals are observable | ||
21 | from within the tasks we wish to freeze. While SIGSTOP cannot be caught, | ||
22 | blocked, or ignored it can be seen by waiting or ptracing parent tasks. | ||
23 | SIGCONT is especially unsuitable since it can be caught by the task. Any | ||
24 | programs designed to watch for SIGSTOP and SIGCONT could be broken by | ||
25 | attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can | ||
26 | demonstrate this problem using nested bash shells: | ||
27 | |||
28 | $ echo $$ | ||
29 | 16644 | ||
30 | $ bash | ||
31 | $ echo $$ | ||
32 | 16690 | ||
33 | |||
34 | From a second, unrelated bash shell: | ||
35 | $ kill -SIGSTOP 16690 | ||
36 | $ kill -SIGCONT 16690 | ||
37 | |||
38 | <at this point 16690 exits and causes 16644 to exit too> | ||
39 | |||
40 | This happens because bash can observe both signals and choose how it | ||
41 | responds to them. | ||
42 | |||
43 | Another example of a program which catches and responds to these | ||
44 | signals is gdb. In fact any program designed to use ptrace is likely to | ||
45 | have a problem with this method of stopping and resuming tasks. | ||
46 | |||
47 | In contrast, the cgroup freezer uses the kernel freezer code to | ||
48 | prevent the freeze/unfreeze cycle from becoming visible to the tasks | ||
49 | being frozen. This allows the bash example above and gdb to run as | ||
50 | expected. | ||
51 | |||
52 | The freezer subsystem in the container filesystem defines a file named | ||
53 | freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the | ||
54 | cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup. | ||
55 | Reading will return the current state. | ||
56 | |||
57 | * Examples of usage : | ||
58 | |||
59 | # mkdir /containers | ||
60 | # mount -t cgroup -ofreezer freezer /containers | ||
61 | # mkdir /containers/0 | ||
62 | # echo $some_pid > /containers/0/tasks | ||
63 | |||
64 | to get status of the freezer subsystem : | ||
65 | |||
66 | # cat /containers/0/freezer.state | ||
67 | THAWED | ||
68 | |||
69 | to freeze all tasks in the container : | ||
70 | |||
71 | # echo FROZEN > /containers/0/freezer.state | ||
72 | # cat /containers/0/freezer.state | ||
73 | FREEZING | ||
74 | # cat /containers/0/freezer.state | ||
75 | FROZEN | ||
76 | |||
77 | to unfreeze all tasks in the container : | ||
78 | |||
79 | # echo THAWED > /containers/0/freezer.state | ||
80 | # cat /containers/0/freezer.state | ||
81 | THAWED | ||
82 | |||
83 | This is the basic mechanism which should do the right thing for user space tasks | ||
84 | in a simple scenario. | ||
85 | |||
86 | It's important to note that freezing can be incomplete. In that case we return | ||
87 | EBUSY. This means that some tasks in the cgroup are busy doing something that | ||
88 | prevents us from completely freezing the cgroup at this time. After EBUSY, | ||
89 | the cgroup will remain partially frozen -- reflected by freezer.state reporting | ||
90 | "FREEZING" when read. The state will remain "FREEZING" until one of these | ||
91 | things happens: | ||
92 | |||
93 | 1) Userspace cancels the freezing operation by writing "THAWED" to | ||
94 | the freezer.state file | ||
95 | 2) Userspace retries the freezing operation by writing "FROZEN" to | ||
96 | the freezer.state file (writing "FREEZING" is not legal | ||
97 | and returns EIO) | ||
98 | 3) The tasks that blocked the cgroup from entering the "FROZEN" | ||
99 | state disappear from the cgroup's set of tasks. | ||
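The retry and cancel options listed above can be combined into a simple wait loop. The following is a sketch under stated assumptions: the function name freeze_cgroup, the default of 10 tries and the one-second pause are all arbitrary choices for illustration, not part of the freezer interface.

```shell
# freeze_cgroup DIR [TRIES]
# Ask the freezer to freeze the cgroup at DIR and wait until freezer.state
# reads "FROZEN". Gives up after TRIES attempts (default 10) and thaws.
freeze_cgroup() (
    dir=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        # Request the freeze, or retry it; writing "FROZEN" is the
        # documented way to retry an incomplete freeze.
        printf 'FROZEN\n' > "$dir/freezer.state"
        [ "$(cat "$dir/freezer.state")" = FROZEN ] && exit 0
        tries=$((tries - 1))
        sleep 1
    done
    # Cancel the partial freeze so tasks are not left in "FREEZING".
    printf 'THAWED\n' > "$dir/freezer.state"
    exit 1
)
```

With the mount from the examples above, `freeze_cgroup /containers/0 5` would try for about five seconds before thawing the cgroup again and returning failure.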