Diffstat (limited to 'Documentation/cgroups/unified-hierarchy.txt')
-rw-r--r-- | Documentation/cgroups/unified-hierarchy.txt | 79 |
1 files changed, 79 insertions, 0 deletions
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 4f4563277864..71daa35ec2d9 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -327,6 +327,85 @@ supported and the interface files "release_agent" and | |||
327 | - use_hierarchy is on by default and the cgroup file for the flag is | 327 | - use_hierarchy is on by default and the cgroup file for the flag is |
328 | not created. | 328 | not created. |
329 | 329 | ||
330 | - The original lower boundary, the soft limit, is defined as a limit | ||
331 | that is unset by default. As a result, the set of cgroups that | ||
332 | global reclaim prefers is opt-in, rather than opt-out. The costs | ||
333 | for optimizing these mostly negative lookups are so high that the | ||
334 | implementation, despite its enormous size, does not even provide the | ||
335 | basic desirable behavior. First off, the soft limit has no | ||
336 | hierarchical meaning. All configured groups are organized in a | ||
337 | global rbtree and treated like equal peers, regardless of where they | ||
338 | are located in the hierarchy. This makes subtree delegation | ||
339 | impossible. Second, the soft limit reclaim pass is so aggressive | ||
340 | that it not only introduces high allocation latencies into the | ||
341 | system, but also impacts system performance due to overreclaim, to | ||
342 | the point where the feature becomes self-defeating. | ||
343 | |||
344 | The memory.low boundary on the other hand is a top-down allocated | ||
345 | reserve. A cgroup enjoys reclaim protection when it and all its | ||
346 | ancestors are below their low boundaries, which makes delegation of | ||
347 | subtrees possible. Secondly, new cgroups have no reserve by | ||
348 | default and in the common case most cgroups are eligible for the | ||
349 | preferred reclaim pass. This allows the new low boundary to be | ||
350 | efficiently implemented with just a minor addition to the generic | ||
351 | reclaim code, without the need for out-of-band data structures and | ||
352 | reclaim passes. Because the generic reclaim code considers all | ||
353 | cgroups except for the ones running low in the preferred first | ||
354 | reclaim pass, overreclaim of individual groups is eliminated as | ||
355 | well, resulting in much better overall workload performance. | ||
356 | |||
357 | - The original high boundary, the hard limit, is defined as a strict | ||
358 | limit that cannot budge, even if the OOM killer has to be called. | ||
359 | But this generally goes against the goal of making the most out of | ||
360 | the available memory. The memory consumption of workloads varies | ||
361 | during runtime, and that requires users to overcommit. But doing | ||
362 | that with a strict upper limit requires either a fairly accurate | ||
363 | prediction of the working set size or adding slack to the limit. | ||
364 | Since working set size estimation is hard and error prone, and | ||
365 | getting it wrong results in OOM kills, most users tend to err on the | ||
366 | side of a looser limit and end up wasting precious resources. | ||
367 | |||
368 | The memory.high boundary on the other hand can be set much more | ||
369 | conservatively. When hit, it throttles allocations by forcing them | ||
370 | into direct reclaim to work off the excess, but it never invokes the | ||
371 | OOM killer. As a result, a high boundary that is chosen too | ||
372 | aggressively will not terminate the processes, but instead it will | ||
373 | lead to gradual performance degradation. The user can monitor this | ||
374 | and make corrections until the minimal memory footprint that still | ||
375 | gives acceptable performance is found. | ||
376 | |||
377 | In extreme cases, with many concurrent allocations and a complete | ||
378 | breakdown of reclaim progress within the group, the high boundary | ||
379 | can be exceeded. But even then it is usually better to satisfy the | ||
380 | allocation from the slack available in other groups or the rest of | ||
381 | the system than to kill the group. Otherwise, memory.max is there | ||
382 | to limit this type of spillover and ultimately contain buggy or even | ||
383 | malicious applications. | ||
384 | |||
385 | - The original control file names are unwieldy and inconsistent in | ||
386 | many different ways. For example, the upper boundary hit count is | ||
387 | exported in the memory.failcnt file, but an OOM event count has to | ||
388 | be manually counted by listening to memory.oom_control events, and | ||
389 | lower boundary / soft limit events have to be counted by first | ||
390 | setting a threshold for that value and then counting those events. | ||
391 | Also, usage and limit files encode their units in the filename. | ||
392 | That makes the filenames very long, even though this is not | ||
393 | information that a user needs to be reminded of every time they type | ||
394 | out those names. | ||
395 | |||
396 | To address these naming issues, as well as to signal clearly that | ||
397 | the new interface carries a new configuration model, the naming | ||
398 | conventions in it necessarily differ from the old interface. | ||
399 | |||
400 | - The original limit files indicate the state of an unset limit with a | ||
401 | Very High Number, and a configured limit can be unset by echoing -1 | ||
402 | into those files. But that very high number is implementation and | ||
403 | architecture dependent and not very descriptive. And while -1 can | ||
404 | be understood as an underflow into the highest possible value, -2 or | ||
405 | -10M etc. do not work, so it's not consistent. | ||
406 | |||
407 | memory.low, memory.high, and memory.max will use the string | ||
408 | "infinity" to indicate and set the highest possible value. | ||
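
As an illustration of the "infinity" convention described above, a userspace tool reading or writing memory.low, memory.high, or memory.max might handle the value roughly as in the following Python sketch. The helper names parse_limit and format_limit are hypothetical, the files are assumed to hold either a byte count or the literal string, and rejecting negative numbers is an assumption of this sketch rather than documented kernel behavior.

```python
import math

UNSET = "infinity"  # string the text says memory.low/high/max will use


def parse_limit(text: str) -> float:
    """Parse a limit file's contents into bytes, or math.inf if unset."""
    value = text.strip()
    if value == UNSET:
        return math.inf
    limit = int(value)
    if limit < 0:
        # Assumption of this sketch: no "-1 means unlimited" shortcut.
        raise ValueError(f"negative limit: {value!r}")
    return limit


def format_limit(limit: float) -> str:
    """Render a limit in the form it would be written back to the file."""
    return UNSET if math.isinf(limit) else str(int(limit))
```

With a single explicit string for "no limit", callers never have to compare against an architecture-dependent maximum value.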
330 | 409 | ||
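
The boundary semantics described in the bullets above can be modelled in a few lines of Python. This is a toy sketch, not kernel code: the Cgroup class, its fields, and the outcome strings are all hypothetical. It shows two things: a cgroup counts as protected only when it and every ancestor are below their low boundaries, and a charge that crosses memory.high is throttled while only memory.max acts as a hard cap.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cgroup:
    name: str
    usage: int                         # bytes currently charged (toy field)
    low: float = 0                     # memory.low; 0 means no reserve
    high: float = math.inf             # memory.high; unset by default
    max: float = math.inf              # memory.max; unset by default
    parent: Optional["Cgroup"] = None


def is_protected(cg: Cgroup) -> bool:
    """True if cg and all of its ancestors are below their low boundary."""
    node: Optional[Cgroup] = cg
    while node is not None:
        if node.usage >= node.low:
            return False
        node = node.parent
    return True


def first_pass_candidates(groups: List[Cgroup]) -> List[Cgroup]:
    """Groups the preferred first reclaim pass would consider."""
    return [g for g in groups if not is_protected(g)]


def charge_outcome(cg: Cgroup, nbytes: int) -> str:
    """Classify a new charge against the high and max boundaries."""
    new_usage = cg.usage + nbytes
    if new_usage > cg.max:
        return "hard-limit"    # memory.max contains the spillover
    if new_usage > cg.high:
        return "throttle"      # forced into direct reclaim, no OOM kill
    return "ok"
```

Because low defaults to 0, a freshly created group is never protected in this model, which matches the point that most cgroups are eligible for the first reclaim pass in the common case.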
331 | 5. Planned Changes | 410 | 5. Planned Changes |
332 | 411 | ||