Diffstat (limited to 'Documentation/cgroups/unified-hierarchy.txt')
-rw-r--r-- Documentation/cgroups/unified-hierarchy.txt 79
1 files changed, 79 insertions, 0 deletions
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 4f4563277864..71daa35ec2d9 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -327,6 +327,85 @@ supported and the interface files "release_agent" and
 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
 
+- The original lower boundary, the soft limit, is defined as a limit
+  that is unset by default. As a result, the set of cgroups that
+  global reclaim prefers is opt-in, rather than opt-out. The costs
+  for optimizing these mostly negative lookups are so high that the
+  implementation, despite its enormous size, does not even provide the
+  basic desirable behavior. First off, the soft limit has no
+  hierarchical meaning. All configured groups are organized in a
+  global rbtree and treated like equal peers, regardless of where they
+  are located in the hierarchy. This makes subtree delegation
+  impossible. Second, the soft limit reclaim pass is so aggressive
+  that it not only introduces high allocation latencies into the
+  system, but also impacts system performance due to overreclaim, to
+  the point where the feature becomes self-defeating.
+
+  The memory.low boundary, on the other hand, is a top-down allocated
+  reserve. A cgroup enjoys reclaim protection when it and all its
+  ancestors are below their low boundaries, which makes delegation of
+  subtrees possible. Secondly, new cgroups have no reserve by
+  default, and in the common case most cgroups are eligible for the
+  preferred reclaim pass. This allows the new low boundary to be
+  efficiently implemented with just a minor addition to the generic
+  reclaim code, without the need for out-of-band data structures and
+  reclaim passes. Because the generic reclaim code considers all
+  cgroups except for the ones running low in the preferred first
+  reclaim pass, overreclaim of individual groups is eliminated as
+  well, resulting in much better overall workload performance.
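
The protection rule above (a cgroup is protected only while it and
all of its ancestors are below their low boundaries) can be
illustrated with a small user-space model. This is only a sketch:
the Cgroup class, its fields, and is_protected() are hypothetical
names invented for illustration and do not correspond to kernel
code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Toy model of a cgroup; all field names are illustrative only."""
    name: str
    usage: int                        # memory charged to this group and its children
    low: int = 0                      # modeled memory.low; 0 means no reserve (default)
    parent: Optional["Cgroup"] = None


def is_protected(cg: Cgroup) -> bool:
    """Return True when cg and every modeled ancestor are below their low boundary."""
    node = cg
    while node is not None:
        if node.usage >= node.low:
            # This level has used up its reserve, so nothing below it is protected.
            return False
        node = node.parent
    return True


# A delegated subtree: the parent holds a reserve and hands part of it to "a".
parent = Cgroup("parent", usage=300, low=512)
a = Cgroup("a", usage=100, low=256, parent=parent)
b = Cgroup("b", usage=200, parent=parent)   # no reserve configured

protected_a = is_protected(a)   # a and parent are both below their boundaries
protected_b = is_protected(b)   # b has no reserve, so it is eligible for reclaim

parent.usage = 600              # the subtree as a whole now exceeds its reserve
protected_a_after = is_protected(a)
```

In this model a group without a configured reserve is never
protected, which mirrors the opt-in default described above, and a
parent that exceeds its own boundary removes protection from its
whole subtree.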
+
+- The original high boundary, the hard limit, is defined as a strict
+  limit that cannot budge, even if the OOM killer has to be called.
+  But this generally goes against the goal of making the most out of
+  the available memory. The memory consumption of workloads varies
+  during runtime, and that requires users to overcommit. But doing
+  that with a strict upper limit requires either a fairly accurate
+  prediction of the working set size or adding slack to the limit.
+  Since working set size estimation is hard and error-prone, and
+  getting it wrong results in OOM kills, most users tend to err on the
+  side of a looser limit and end up wasting precious resources.
+
+  The memory.high boundary, on the other hand, can be set much more
+  conservatively. When hit, it throttles allocations by forcing them
+  into direct reclaim to work off the excess, but it never invokes the
+  OOM killer. As a result, a high boundary that is chosen too
+  aggressively will not terminate the processes, but instead it will
+  lead to gradual performance degradation. The user can monitor this
+  and make corrections until the minimal memory footprint that still
+  gives acceptable performance is found.
+
+  In extreme cases, with many concurrent allocations and a complete
+  breakdown of reclaim progress within the group, the high boundary
+  can be exceeded. But even then it's mostly better to satisfy the
+  allocation from the slack available in other groups or the rest of
+  the system than to kill the group. Otherwise, memory.max is there
+  to limit this type of spillover and ultimately contain buggy or even
+  malicious applications.
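
The difference between the two upper boundaries can be summarized
in a short user-space model. This is a sketch under stated
assumptions, not the kernel's charge path: charge_outcome() and its
return strings are made-up names, and the model ignores the extreme
case above in which memory.high is overshot because reclaim makes
no progress.

```python
import math


def charge_outcome(usage, request, high=math.inf, max_limit=math.inf):
    """Classify a hypothetical allocation against modeled memory.high/memory.max.

    Returns one of three illustrative labels:
      "ok"         - the charge fits below the high boundary
      "throttle"   - above high: the allocator is pushed into direct reclaim,
                     but the OOM killer is never invoked on this path
      "hard-limit" - above max: the strict limit that contains the group
    """
    new_usage = usage + request
    if new_usage > max_limit:
        return "hard-limit"
    if new_usage > high:
        return "throttle"
    return "ok"


unlimited = charge_outcome(100, 10)                             # no boundaries set
throttled = charge_outcome(100, 10, high=105)                   # over high only
contained = charge_outcome(100, 10, high=105, max_limit=108)    # over max as well
```

Both boundaries default to "no limit" in this model, so a group
that sets only the high boundary degrades gradually and never
reaches the strict limit.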
+
+- The original control file names are unwieldy and inconsistent in
+  many different ways. For example, the upper boundary hit count is
+  exported in the memory.failcnt file, but an OOM event count has to
+  be manually counted by listening to memory.oom_control events, and
+  lower boundary / soft limit events have to be counted by first
+  setting a threshold for that value and then counting those events.
+  Also, usage and limit files encode their units in the filename.
+  That makes the filenames very long, even though this is not
+  information that a user needs to be reminded of every time they type
+  out those names.
+
+  To address these naming issues, as well as to signal clearly that
+  the new interface carries a new configuration model, the naming
+  conventions in it necessarily differ from the old interface.
+
+- The original limit files indicate the state of an unset limit with a
+  Very High Number, and a configured limit can be unset by echoing -1
+  into those files. But that very high number is implementation- and
+  architecture-dependent and not very descriptive. And while -1 can
+  be understood as an underflow into the highest possible value, -2 or
+  -10M etc. do not work, so it's not consistent.
+
+  memory.low, memory.high, and memory.max will use the string
+  "infinity" to indicate and set the highest possible value.
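
A user-space tool that reads or writes these files might handle
the "infinity" convention roughly as follows. This is a minimal
sketch: parse_limit() and format_limit() are hypothetical helpers,
and rejecting every negative number is an assumption made here to
avoid the -1 inconsistency described above, not documented kernel
behavior.

```python
INFINITY = float("inf")   # in-memory stand-in for "no limit"


def parse_limit(text):
    """Parse the contents of a modeled limit file into a byte count or INFINITY."""
    value = text.strip()
    if value == "infinity":
        return INFINITY
    number = int(value)   # raises ValueError on non-numeric input
    if number < 0:
        # No special negative values: -1, -2 and so on are all rejected alike.
        raise ValueError("limit must be non-negative or 'infinity'")
    return number


def format_limit(limit):
    """Render a limit the way it would be written back to the file."""
    return "infinity" if limit == INFINITY else str(int(limit))


unset = parse_limit("infinity\n")
one_mib = parse_limit("1048576")
round_trip = format_limit(unset)
```

Using a single descriptive string keeps the unset state independent
of word size, so the same value reads identically on every
architecture.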
 
 5. Planned Changes
 