author | Tejun Heo <tj@kernel.org> | 2014-04-25 18:28:02 -0400 |
---|---|---|
committer | Tejun Heo <tj@kernel.org> | 2014-04-25 18:28:02 -0400 |
commit | 657315780005a676d294c7edf7548650c7e57f76 (patch) | |
tree | c207cea07222964f78bd800620f23007581ef58d | |
parent | 842b597ee0a7e1aa5a3148164ffdba00ec17f614 (diff) |
cgroup: add documentation about unified hierarchy
Unified hierarchy will be the new version of cgroup interface. This
patch adds Documentation/cgroups/unified-hierarchy.txt which describes
the design and rationales of unified hierarchy.
v2: Grammatical updates as per Randy Dunlap's review.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
-rw-r--r-- | Documentation/cgroups/unified-hierarchy.txt | 359 |
1 files changed, 359 insertions, 0 deletions
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
new file mode 100644
index 000000000000..324b182e6000
--- /dev/null
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -0,0 +1,359 @@ | |||
1 | |||
2 | Cgroup unified hierarchy | ||
3 | |||
4 | April, 2014 Tejun Heo <tj@kernel.org> | ||
5 | |||
6 | This document describes the changes made by unified hierarchy and | ||
7 | their rationales. It will eventually be merged into the main cgroup | ||
8 | documentation. | ||
9 | |||
10 | CONTENTS | ||
11 | |||
12 | 1. Background | ||
13 | 2. Basic Operation | ||
14 | 2-1. Mounting | ||
15 | 2-2. cgroup.subtree_control | ||
16 | 2-3. cgroup.controllers | ||
17 | 3. Structural Constraints | ||
18 | 3-1. Top-down | ||
19 | 3-2. No internal tasks | ||
20 | 4. Other Changes | ||
21 | 4-1. [Un]populated Notification | ||
22 | 4-2. Other Core Changes | ||
23 | 4-3. Per-Controller Changes | ||
24 | 4-3-1. blkio | ||
25 | 4-3-2. cpuset | ||
26 | 4-3-3. memory | ||
27 | 5. Planned Changes | ||
28 | 5-1. CAP for resource control | ||
29 | |||
30 | |||
31 | 1. Background | ||
32 | |||
33 | cgroup allows an arbitrary number of hierarchies and each hierarchy | ||
34 | can host any number of controllers. While this seems to provide a | ||
35 | high level of flexibility, it isn't very useful in practice. | ||
36 | |||
37 | For example, as there is only one instance of each controller, utility | ||
38 | type controllers such as freezer which can be useful in all | ||
39 | hierarchies can only be used in one. The issue is exacerbated by the | ||
40 | fact that controllers can't be moved around once hierarchies are | ||
41 | populated. Another issue is that all controllers bound to a hierarchy | ||
42 | are forced to have exactly the same view of the hierarchy. It isn't | ||
43 | possible to vary the granularity depending on the specific controller. | ||
44 | |||
45 | In practice, these issues heavily limit which controllers can be put | ||
46 | on the same hierarchy and most configurations resort to putting each | ||
47 | controller on its own hierarchy. Only closely related ones, such as | ||
48 | the cpu and cpuacct controllers, make sense to put on the same | ||
49 | hierarchy. This often means that userland ends up managing multiple | ||
50 | similar hierarchies repeating the same steps on each hierarchy | ||
51 | whenever a hierarchy management operation is necessary. | ||
52 | |||
53 | Unfortunately, support for multiple hierarchies comes at a steep cost. | ||
54 | Internal implementation in cgroup core proper is dazzlingly | ||
55 | complicated but more importantly the support for multiple hierarchies | ||
56 | restricts how cgroup is used in general and what controllers can do. | ||
57 | |||
58 | There's no limit on how many hierarchies there may be, which means | ||
59 | that a task's cgroup membership can't be described in finite length. | ||
60 | The key may contain any number of entries and is unlimited in | ||
61 | length, which makes it highly awkward to handle and leads to the addition | ||
62 | of controllers which exist only to identify membership, which in turn | ||
63 | exacerbates the original problem. | ||
64 | |||
65 | Also, as a controller can't have any expectation regarding what shape | ||
66 | of hierarchies other controllers would be on, each controller has to | ||
67 | assume that all other controllers are operating on completely | ||
68 | orthogonal hierarchies. This makes it impossible, or at least very | ||
69 | cumbersome, for controllers to cooperate with each other. | ||
70 | |||
71 | In most use cases, putting controllers on hierarchies which are | ||
72 | completely orthogonal to each other isn't necessary. What usually is | ||
73 | called for is the ability to have differing levels of granularity | ||
74 | depending on the specific controller. In other words, hierarchy may | ||
75 | be collapsed from leaf towards root when viewed from specific | ||
76 | controllers. For example, a given configuration might not care about | ||
77 | how memory is distributed beyond a certain level while still wanting | ||
78 | to control how CPU cycles are distributed. | ||
79 | |||
80 | Unified hierarchy is the next version of cgroup interface. It aims to | ||
81 | address the aforementioned issues by having more structure while | ||
82 | retaining enough flexibility for most use cases. Various other | ||
83 | general and controller-specific interface issues are also addressed in | ||
84 | the process. | ||
85 | |||
86 | |||
87 | 2. Basic Operation | ||
88 | |||
89 | 2-1. Mounting | ||
90 | |||
91 | Currently, unified hierarchy can be mounted with the following mount | ||
92 | command. Note that this is still under development and scheduled to | ||
93 | change soon. | ||
94 | |||
95 | mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT | ||
96 | |||
97 | All controllers which are not bound to other hierarchies are | ||
98 | automatically bound to unified hierarchy and show up at the root of | ||
99 | it. Controllers which are enabled only in the root of unified | ||
100 | hierarchy can be bound to other hierarchies at any time. This allows | ||
101 | mixing unified hierarchy with the traditional multiple hierarchies in | ||
102 | a fully backward compatible way. | ||
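As a rough illustration of the mount step, the sketch below wraps the command shown above in a small shell function. The function name mount_unified, the DRY_RUN variable and the example mount point are made up for this illustration; only the mount command itself comes from the text. Running it for real needs root privileges and a kernel that still accepts the development-time option.

```shell
# Hypothetical helper: mount the unified hierarchy at the given directory.
# With DRY_RUN=1 it only prints the command instead of running it.
mount_unified() {
    mnt=$1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mount -t cgroup -o __DEVEL__sane_behavior cgroup $mnt"
        return 0
    fi
    mkdir -p "$mnt" &&
        mount -t cgroup -o __DEVEL__sane_behavior cgroup "$mnt"
}

# Example (as root; the path is an assumption, any directory works):
#   mount_unified /sys/fs/cgroup/unified
```

Controllers that are already bound to a traditional hierarchy are not expected to show up at the new mount point, in line with the paragraph above.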
103 | |||
104 | |||
105 | 2-2. cgroup.subtree_control | ||
106 | |||
107 | All cgroups on unified hierarchy have a "cgroup.subtree_control" file | ||
108 | which governs which controllers are enabled on the children of the | ||
109 | cgroup. Let's assume a hierarchy like the following. | ||
110 | |||
111 | root - A - B - C | ||
112 | \ D | ||
113 | |||
114 | root's "cgroup.subtree_control" file determines which controllers are | ||
115 | enabled on A, A's on B, and B's on C and D. This coincides with the | ||
116 | fact that controllers on the immediate sub-level are used to | ||
117 | distribute the resources of the parent. In fact, it's natural to | ||
118 | assume that resource control knobs of a child belong to its parent. | ||
119 | Enabling a controller in a "cgroup.subtree_control" file declares that | ||
120 | distribution of the respective resources of the cgroup will be | ||
121 | controlled. Note that this means that controller enable states are | ||
122 | shared among siblings. | ||
123 | |||
124 | When read, the file contains a space-separated list of currently | ||
125 | enabled controllers. A write to the file should contain a | ||
126 | space-separated list of controllers with '+' or '-' prefixed (without | ||
127 | the quotes). Controllers prefixed with '+' are enabled and '-' | ||
128 | disabled. If a controller is listed multiple times, the last entry | ||
129 | wins. The specific operations are executed atomically - either all | ||
130 | succeed or all fail. | ||
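The write format described above can be sketched as a small helper that validates the '+' / '-' prefixes and then issues a single write. The function name set_subtree_control is hypothetical, and the sketch assumes that the shell emits the whole line in one write, which is typical for short strings but not guaranteed.

```shell
# Hypothetical helper: apply several enable/disable operations to a
# cgroup's "cgroup.subtree_control" file in one go.
# usage: set_subtree_control CGROUP_DIR +ctrl|-ctrl ...
set_subtree_control() {
    cgdir=$1
    shift
    for op in "$@"; do
        case $op in
            +?*|-?*) ;;   # each entry must carry a '+' or '-' prefix
            *) echo "bad entry: $op" >&2; return 1 ;;
        esac
    done
    # "$*" joins the entries with spaces; they are sent together so that
    # the whole set can be applied or rejected as a unit.
    printf '%s\n' "$*" > "$cgdir/cgroup.subtree_control"
}
```

For example, `set_subtree_control "$MOUNT_POINT/A" +memory -blkio` would ask for memory to be enabled and blkio to be disabled on the children of A.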
131 | |||
132 | |||
133 | 2-3. cgroup.controllers | ||
134 | |||
135 | Read-only "cgroup.controllers" file contains a space-separated list of | ||
136 | controllers which can be enabled in the cgroup's | ||
137 | "cgroup.subtree_control" file. | ||
138 | |||
139 | In the root cgroup, this lists controllers which are not bound to | ||
140 | other hierarchies and the content changes as controllers are bound to | ||
141 | and unbound from other hierarchies. | ||
142 | |||
143 | In non-root cgroups, the content of this file equals that of the | ||
144 | parent's "cgroup.subtree_control" file as only controllers enabled | ||
145 | from the parent can be used in its children. | ||
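A script may want to check "cgroup.controllers" before attempting to enable something. The following is a minimal sketch of such a check; the function name controller_available is an assumption made for this example.

```shell
# Hypothetical helper: succeed if CONTROLLER is listed in the cgroup's
# read-only "cgroup.controllers" file.
# usage: controller_available CGROUP_DIR CONTROLLER
controller_available() {
    # The file holds a space-separated list, so word splitting is enough.
    for c in $(cat "$1/cgroup.controllers"); do
        [ "$c" = "$2" ] && return 0
    done
    return 1
}
```

Used as `controller_available "$MOUNT_POINT/A" memory && echo yes`, it compares whole names, so "mem" does not match "memory".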
146 | |||
147 | |||
148 | 3. Structural Constraints | ||
149 | |||
150 | 3-1. Top-down | ||
151 | |||
152 | As it doesn't make sense to nest control of an uncontrolled resource, | ||
153 | all non-root "cgroup.subtree_control" files can only contain | ||
154 | controllers which are enabled in the parent's "cgroup.subtree_control" | ||
155 | file. A controller can be enabled only if the parent has the | ||
156 | controller enabled and a controller can't be disabled if one or more | ||
157 | children have it enabled. | ||
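Because of the top-down rule, enabling a controller deep in the tree means enabling it at every level above first. The sketch below walks from the root towards a target cgroup and writes to each ancestor's "cgroup.subtree_control" in turn. The function name enable_top_down and its argument layout are hypothetical.

```shell
# Hypothetical helper: enable CONTROLLER on the cgroup at REL_PATH by
# writing "+CONTROLLER" to every ancestor's "cgroup.subtree_control",
# starting at the root and walking down.
# usage: enable_top_down ROOT_DIR REL_PATH CONTROLLER   (e.g. A/B)
enable_top_down() {
    dir=$1
    rest=$2
    ctrl=$3
    while [ -n "$rest" ]; do
        echo "+$ctrl" > "$dir/cgroup.subtree_control" || return 1
        dir=$dir/${rest%%/*}          # descend one level
        case $rest in
            */*) rest=${rest#*/} ;;   # more components remain
            *)   rest= ;;             # reached the target cgroup
        esac
    done
}
```

With the hierarchy from section 2-2, `enable_top_down "$MOUNT_POINT" A/B memory` writes to the root and to A, which makes memory enabled on B. B's own file is left alone, so C and D are unaffected.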
158 | |||
159 | |||
160 | 3-2. No internal tasks | ||
161 | |||
162 | One long-standing issue that cgroup faces is the competition between | ||
163 | tasks belonging to the parent cgroup and its children cgroups. This | ||
164 | is inherently nasty as two different types of entities compete and | ||
165 | there is no agreed-upon obvious way to handle it. Different | ||
166 | controllers are doing different things. | ||
167 | |||
168 | The cpu controller considers tasks and cgroups as equivalents and maps | ||
169 | nice levels to cgroup weights. This works for some cases but falls | ||
170 | flat when children should be allocated specific ratios of CPU cycles | ||
171 | and the number of internal tasks fluctuates - the ratios constantly | ||
172 | change as the number of competing entities fluctuates. There also are | ||
173 | other issues. The mapping from nice level to weight isn't obvious or | ||
174 | universal, and there are various other knobs which simply aren't | ||
175 | available for tasks. | ||
176 | |||
177 | The blkio controller implicitly creates a hidden leaf node for each | ||
178 | cgroup to host the tasks. The hidden leaf has its own copies of all | ||
179 | the knobs with "leaf_" prefixed. While this allows equivalent control | ||
180 | over internal tasks, it comes with serious drawbacks. It always adds an | ||
181 | extra layer of nesting which may not be necessary, makes the interface | ||
182 | messy and significantly complicates the implementation. | ||
183 | |||
184 | The memory controller currently doesn't have a way to control what | ||
185 | happens between internal tasks and child cgroups and the behavior is | ||
186 | not clearly defined. There have been attempts to add ad-hoc behaviors | ||
187 | and knobs to tailor the behavior to specific workloads. Continuing | ||
188 | this direction will lead to problems which will be extremely difficult | ||
189 | to resolve in the long term. | ||
190 | |||
191 | Multiple controllers struggle with internal tasks and have come up with | ||
192 | different ways to deal with it; unfortunately, all the approaches in | ||
193 | use now are severely flawed and, furthermore, the widely different | ||
194 | behaviors make cgroup as a whole highly inconsistent. | ||
195 | |||
196 | It is clear that this is something which needs to be addressed from | ||
197 | cgroup core proper in a uniform way so that controllers don't need to | ||
198 | worry about it and cgroup as a whole shows a consistent and logical | ||
199 | behavior. To achieve that, unified hierarchy enforces the following | ||
200 | structural constraint: | ||
201 | |||
202 | Except for the root, only cgroups which don't contain any task may | ||
203 | have controllers enabled in their "cgroup.subtree_control" files. | ||
204 | |||
205 | Combined with other properties, this guarantees that, when a | ||
206 | controller is looking at the part of the hierarchy which has it | ||
207 | enabled, tasks are always only on the leaves. This rules out | ||
208 | situations where child cgroups compete against internal tasks of the | ||
209 | parent. | ||
210 | |||
211 | There are two things to note. Firstly, the root cgroup is exempt from | ||
212 | the restriction. Root contains tasks and anonymous resource | ||
213 | consumption which can't be associated with any other cgroup and | ||
214 | requires special treatment from most controllers. How resource | ||
215 | consumption in the root cgroup is governed is up to each controller. | ||
216 | |||
217 | Secondly, the restriction doesn't take effect if there is no enabled | ||
218 | controller in the cgroup's "cgroup.subtree_control" file. This is | ||
219 | important as otherwise it wouldn't be possible to create children of a | ||
220 | populated cgroup. To control resource distribution of a cgroup, the | ||
221 | cgroup must create children and transfer all its tasks to the children | ||
222 | before enabling controllers in its "cgroup.subtree_control" file. | ||
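The sequence in the last sentence (create a child, move the tasks down, then enable controllers) could look roughly like the sketch below. The function name push_down_and_enable and the idea of using a single child are assumptions for illustration. The sketch is also racy: processes forked while the loop runs may stay behind, in which case the final write would be expected to fail on a real hierarchy.

```shell
# Hypothetical helper: push every process of CGROUP_DIR into a new
# child and only then enable CONTROLLER for the children.
# usage: push_down_and_enable CGROUP_DIR CHILD_NAME CONTROLLER
push_down_and_enable() {
    cg=$1
    child=$cg/$2
    mkdir -p "$child" || return 1
    # Snapshot the pid list first; the cgroup changes while we migrate.
    pids=$(cat "$cg/cgroup.procs")
    for pid in $pids; do
        # One pid per write; a process that has exited meanwhile makes
        # the write fail, which is ignored here.
        echo "$pid" > "$child/cgroup.procs" 2>/dev/null || true
    done
    # The parent should now hold no tasks, so this write is permitted.
    echo "+$3" > "$cg/cgroup.subtree_control"
}
```

A call such as `push_down_and_enable "$MOUNT_POINT/A" leaf memory` leaves A itself empty, with its former tasks in A/leaf.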
223 | |||
224 | |||
225 | 4. Other Changes | ||
226 | |||
227 | 4-1. [Un]populated Notification | ||
228 | |||
229 | cgroup users often need a way to determine when a cgroup's | ||
230 | subhierarchy becomes empty so that it can be cleaned up. cgroup | ||
231 | currently provides release_agent for it; unfortunately, this mechanism | ||
232 | is riddled with issues. | ||
233 | |||
234 | - It delivers events by forking and execing a userland binary | ||
235 | specified as the release_agent. This is a long deprecated method of | ||
236 | notification delivery. It's extremely heavy, slow and cumbersome to | ||
237 | integrate with larger infrastructure. | ||
238 | |||
239 | - There is a single monitoring point at the root. There's no way to | ||
240 | delegate management of a subtree. | ||
241 | |||
242 | - The event isn't recursive. It triggers when a cgroup doesn't have | ||
243 | any tasks or child cgroups. Events for internal nodes trigger only | ||
244 | after all children are removed. This again makes it impossible to | ||
245 | delegate management of a subtree. | ||
246 | |||
247 | - Events are filtered from the kernel side. A "notify_on_release" | ||
248 | file is used to subscribe to or suppress release events. This is | ||
249 | unnecessarily complicated and probably done this way because event | ||
250 | delivery itself was expensive. | ||
251 | |||
252 | Unified hierarchy implements an interface file "cgroup.populated" | ||
253 | which can be used to monitor whether the cgroup's subhierarchy has | ||
254 | tasks in it or not. Its value is 0 if there is no task in the cgroup | ||
255 | and its descendants; otherwise, 1. poll and [id]notify events are | ||
256 | triggered when the value changes. | ||
257 | |||
258 | This is significantly lighter and simpler and trivially allows | ||
259 | delegating management of subhierarchy - subhierarchy monitoring can | ||
260 | block further propagation simply by putting itself or another process | ||
261 | in the subhierarchy and monitor events that it's interested in from | ||
262 | there without interfering with monitoring higher in the tree. | ||
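To make the "cgroup.populated" semantics concrete, here is a minimal sketch that reads the file and waits for the subhierarchy to drain. Both function names are hypothetical. The wait loop simply re-reads the file once a second as a stand-in; a real monitor would block on the poll or inotify events mentioned above instead of sleeping.

```shell
# Hypothetical helper: succeed if the cgroup or any descendant has tasks.
is_populated() {
    [ "$(cat "$1/cgroup.populated")" = "1" ]
}

# Hypothetical helper: re-check once a second until the subhierarchy is
# empty or TRIES checks have been made (default 60).
# usage: wait_until_empty CGROUP_DIR [TRIES]
wait_until_empty() {
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        is_populated "$1" || return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1   # still populated after all checks
}
```

A cleanup step might then read `wait_until_empty "$cg" && rmdir "$cg"`. Note that an unreadable file is treated the same as a value of 0 in this sketch.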
263 | |||
264 | In unified hierarchy, the release_agent mechanism is no longer | ||
265 | supported and the interface files "release_agent" and | ||
266 | "notify_on_release" do not exist. | ||
267 | |||
268 | |||
269 | 4-2. Other Core Changes | ||
270 | |||
271 | - None of the mount options is allowed. | ||
272 | |||
273 | - remount is disallowed. | ||
274 | |||
275 | - rename(2) is disallowed. | ||
276 | |||
277 | - The "tasks" file is removed. Everything should be at process | ||
278 | granularity. Use the "cgroup.procs" file instead. | ||
279 | |||
280 | - The "cgroup.procs" file is not sorted. pids will be unique unless | ||
281 | they got recycled in-between reads. | ||
282 | |||
283 | - The "cgroup.clone_children" file is removed. | ||
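Since "cgroup.procs" is no longer sorted, tools that relied on ordered output have to sort it themselves. A minimal sketch, with the hypothetical function name list_procs:

```shell
# Hypothetical helper: print the pids of a cgroup in ascending order.
# "cgroup.procs" is unsorted and may repeat a pid that was recycled
# between reads, so sort numerically and drop duplicates.
# usage: list_procs CGROUP_DIR
list_procs() {
    sort -n -u "$1/cgroup.procs"
}
```

The output is one pid per line, which keeps it usable in a `for pid in $(list_procs "$cg")` loop.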
284 | |||
285 | |||
286 | 4-3. Per-Controller Changes | ||
287 | |||
288 | 4-3-1. blkio | ||
289 | |||
290 | - blk-throttle becomes properly hierarchical. | ||
291 | |||
292 | |||
293 | 4-3-2. cpuset | ||
294 | |||
295 | - Tasks are kept in empty cpusets after hotplug and take on the masks | ||
296 | of the nearest non-empty ancestor, instead of being moved to it. | ||
297 | |||
298 | - A task can be moved into an empty cpuset, and again it takes on the | ||
299 | masks of the nearest non-empty ancestor. | ||
300 | |||
301 | |||
302 | 4-3-3. memory | ||
303 | |||
304 | - use_hierarchy is on by default and the cgroup file for the flag is | ||
305 | not created. | ||
306 | |||
307 | |||
308 | 5. Planned Changes | ||
309 | |||
310 | 5-1. CAP for resource control | ||
311 | |||
312 | Unified hierarchy will require one of the capabilities(7), which is | ||
313 | yet to be decided, for all resource control related knobs. Process | ||
314 | organization operations - creation of sub-cgroups and migration of | ||
315 | processes in sub-hierarchies - may be delegated by changing the | ||
316 | ownership and/or permissions on the cgroup directory and | ||
317 | "cgroup.procs" interface file; however, all operations which affect | ||
318 | resource control - writes to a "cgroup.subtree_control" file or any | ||
319 | controller-specific knobs - will require an explicit CAP privilege. | ||
320 | |||
321 | This, in part, is to prevent the cgroup interface from being | ||
322 | inadvertently promoted to a programmable API used by non-privileged | ||
323 | binaries. cgroup exposes various aspects of the system in ways which | ||
324 | aren't properly abstracted for direct consumption by regular programs. | ||
325 | This is an administration interface much closer to sysctl knobs than | ||
326 | system calls. Even the basic access model, being filesystem path | ||
327 | based, isn't suitable for direct consumption. There's no way to | ||
328 | access "my cgroup" in a race-free way or make multiple operations | ||
329 | atomic against migration to another cgroup. | ||
330 | |||
331 | Another aspect is that, for better or for worse, the cgroup interface | ||
332 | goes through far less scrutiny than regular interfaces for | ||
333 | unprivileged userland. The upside is that cgroup is able to expose | ||
334 | useful features which may not be suitable for general consumption in a | ||
335 | reasonable time frame. It provides a relatively short path between | ||
336 | internal details and userland-visible interface. Of course, this | ||
337 | shortcut comes with high risk. We go through what we go through for | ||
338 | general kernel APIs for good reasons. It may end up leaking internal | ||
339 | details in a way which can exert significant pain by locking the | ||
340 | kernel into a contract that can't be maintained in a reasonable | ||
341 | manner. | ||
342 | |||
343 | Also, due to their specific nature, cgroup and its controllers don't | ||
344 | tend to attract attention from a wide scope of developers. cgroup's | ||
345 | short history is already fraught with severely mis-designed | ||
346 | interfaces, unnecessary commitments to and exposing of internal | ||
347 | details, and broken and dangerous implementations of various features. | ||
348 | |||
349 | Keeping cgroup as an administration interface is both advantageous for | ||
350 | its role and imperative given its nature. Some of the cgroup features | ||
351 | may make sense for unprivileged access. If deemed justified, those | ||
352 | must be further abstracted and implemented as a different interface, | ||
353 | be it a system call or process-private filesystem, and survive through | ||
354 | the scrutiny that any interface for general consumption is required to | ||
355 | go through. | ||
356 | |||
357 | Requiring CAP is not a complete solution but should serve as a | ||
358 | significant deterrent against spraying cgroup usages in non-privileged | ||
359 | programs. | ||