-rw-r--r-- Documentation/cgroup-v2.txt | 460
1 files changed, 239 insertions, 221 deletions
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index e6101976e0f1..bde177103567 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1,7 +1,9 @@
1 1================
2Control Group v2 2Control Group v2
3================
3 4
4October, 2015 Tejun Heo <tj@kernel.org> 5:Date: October, 2015
6:Author: Tejun Heo <tj@kernel.org>
5 7
6This is the authoritative documentation on the design, interface and 8This is the authoritative documentation on the design, interface and
7conventions of cgroup v2. It describes all userland-visible aspects 9conventions of cgroup v2. It describes all userland-visible aspects
@@ -9,70 +11,72 @@ of cgroup including core and specific controller behaviors. All
9future changes must be reflected in this document. Documentation for 11future changes must be reflected in this document. Documentation for
10v1 is available under Documentation/cgroup-v1/. 12v1 is available under Documentation/cgroup-v1/.
11 13
12CONTENTS 14.. CONTENTS
13 15
141. Introduction 16 1. Introduction
15 1-1. Terminology 17 1-1. Terminology
16 1-2. What is cgroup? 18 1-2. What is cgroup?
172. Basic Operations 19 2. Basic Operations
18 2-1. Mounting 20 2-1. Mounting
19 2-2. Organizing Processes 21 2-2. Organizing Processes
20 2-3. [Un]populated Notification 22 2-3. [Un]populated Notification
21 2-4. Controlling Controllers 23 2-4. Controlling Controllers
22 2-4-1. Enabling and Disabling 24 2-4-1. Enabling and Disabling
23 2-4-2. Top-down Constraint 25 2-4-2. Top-down Constraint
24 2-4-3. No Internal Process Constraint 26 2-4-3. No Internal Process Constraint
25 2-5. Delegation 27 2-5. Delegation
26 2-5-1. Model of Delegation 28 2-5-1. Model of Delegation
27 2-5-2. Delegation Containment 29 2-5-2. Delegation Containment
28 2-6. Guidelines 30 2-6. Guidelines
29 2-6-1. Organize Once and Control 31 2-6-1. Organize Once and Control
30 2-6-2. Avoid Name Collisions 32 2-6-2. Avoid Name Collisions
313. Resource Distribution Models 33 3. Resource Distribution Models
32 3-1. Weights 34 3-1. Weights
33 3-2. Limits 35 3-2. Limits
34 3-3. Protections 36 3-3. Protections
35 3-4. Allocations 37 3-4. Allocations
364. Interface Files 38 4. Interface Files
37 4-1. Format 39 4-1. Format
38 4-2. Conventions 40 4-2. Conventions
39 4-3. Core Interface Files 41 4-3. Core Interface Files
405. Controllers 42 5. Controllers
41 5-1. CPU 43 5-1. CPU
42 5-1-1. CPU Interface Files 44 5-1-1. CPU Interface Files
43 5-2. Memory 45 5-2. Memory
44 5-2-1. Memory Interface Files 46 5-2-1. Memory Interface Files
45 5-2-2. Usage Guidelines 47 5-2-2. Usage Guidelines
46 5-2-3. Memory Ownership 48 5-2-3. Memory Ownership
47 5-3. IO 49 5-3. IO
48 5-3-1. IO Interface Files 50 5-3-1. IO Interface Files
49 5-3-2. Writeback 51 5-3-2. Writeback
50 5-4. PID 52 5-4. PID
51 5-4-1. PID Interface Files 53 5-4-1. PID Interface Files
52 5-5. RDMA 54 5-5. RDMA
53 5-5-1. RDMA Interface Files 55 5-5-1. RDMA Interface Files
54 5-6. Misc 56 5-6. Misc
55 5-6-1. perf_event 57 5-6-1. perf_event
566. Namespace 58 6. Namespace
57 6-1. Basics 59 6-1. Basics
58 6-2. The Root and Views 60 6-2. The Root and Views
59 6-3. Migration and setns(2) 61 6-3. Migration and setns(2)
60 6-4. Interaction with Other Namespaces 62 6-4. Interaction with Other Namespaces
61P. Information on Kernel Programming 63 P. Information on Kernel Programming
62 P-1. Filesystem Support for Writeback 64 P-1. Filesystem Support for Writeback
63D. Deprecated v1 Core Features 65 D. Deprecated v1 Core Features
64R. Issues with v1 and Rationales for v2 66 R. Issues with v1 and Rationales for v2
65 R-1. Multiple Hierarchies 67 R-1. Multiple Hierarchies
66 R-2. Thread Granularity 68 R-2. Thread Granularity
67 R-3. Competition Between Inner Nodes and Threads 69 R-3. Competition Between Inner Nodes and Threads
68 R-4. Other Interface Issues 70 R-4. Other Interface Issues
69 R-5. Controller Issues and Remedies 71 R-5. Controller Issues and Remedies
70 R-5-1. Memory 72 R-5-1. Memory
71 73
72 74
731. Introduction 75Introduction
74 76============
751-1. Terminology 77
78Terminology
79-----------
76 80
77"cgroup" stands for "control group" and is never capitalized. The 81"cgroup" stands for "control group" and is never capitalized. The
78singular form is used to designate the whole feature and also as a 82singular form is used to designate the whole feature and also as a
@@ -80,7 +84,8 @@ qualifier as in "cgroup controllers". When explicitly referring to
80multiple individual control groups, the plural form "cgroups" is used. 84multiple individual control groups, the plural form "cgroups" is used.
81 85
82 86
831-2. What is cgroup? 87What is cgroup?
88---------------
84 89
85cgroup is a mechanism to organize processes hierarchically and 90cgroup is a mechanism to organize processes hierarchically and
86distribute system resources along the hierarchy in a controlled and 91distribute system resources along the hierarchy in a controlled and
@@ -110,12 +115,14 @@ restrictions set closer to the root in the hierarchy can not be
110overridden from further away. 115overridden from further away.
111 116
112 117
1132. Basic Operations 118Basic Operations
119================
114 120
1152-1. Mounting 121Mounting
122--------
116 123
117Unlike v1, cgroup v2 has only single hierarchy. The cgroup v2 124Unlike v1, cgroup v2 has only single hierarchy. The cgroup v2
118hierarchy can be mounted with the following mount command. 125hierarchy can be mounted with the following mount command::
119 126
120 # mount -t cgroup2 none $MOUNT_POINT 127 # mount -t cgroup2 none $MOUNT_POINT
121 128
@@ -160,10 +167,11 @@ cgroup v2 currently supports the following mount options.
160 Delegation section for details. 167 Delegation section for details.
161 168
162 169
1632-2. Organizing Processes 170Organizing Processes
171--------------------
164 172
165Initially, only the root cgroup exists to which all processes belong. 173Initially, only the root cgroup exists to which all processes belong.
166A child cgroup can be created by creating a sub-directory. 174A child cgroup can be created by creating a sub-directory::
167 175
168 # mkdir $CGROUP_NAME 176 # mkdir $CGROUP_NAME
169 177
@@ -190,28 +198,29 @@ moved to another cgroup.
190A cgroup which doesn't have any children or live processes can be 198A cgroup which doesn't have any children or live processes can be
191destroyed by removing the directory. Note that a cgroup which doesn't 199destroyed by removing the directory. Note that a cgroup which doesn't
192have any children and is associated only with zombie processes is 200have any children and is associated only with zombie processes is
193considered empty and can be removed. 201considered empty and can be removed::
194 202
195 # rmdir $CGROUP_NAME 203 # rmdir $CGROUP_NAME
196 204
197"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy 205"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
198cgroup is in use in the system, this file may contain multiple lines, 206cgroup is in use in the system, this file may contain multiple lines,
199one for each hierarchy. The entry for cgroup v2 is always in the 207one for each hierarchy. The entry for cgroup v2 is always in the
200format "0::$PATH". 208format "0::$PATH"::
201 209
202 # cat /proc/842/cgroup 210 # cat /proc/842/cgroup
203 ... 211 ...
204 0::/test-cgroup/test-cgroup-nested 212 0::/test-cgroup/test-cgroup-nested
205 213
206If the process becomes a zombie and the cgroup it was associated with 214If the process becomes a zombie and the cgroup it was associated with
207is removed subsequently, " (deleted)" is appended to the path. 215is removed subsequently, " (deleted)" is appended to the path::
208 216
209 # cat /proc/842/cgroup 217 # cat /proc/842/cgroup
210 ... 218 ...
211 0::/test-cgroup/test-cgroup-nested (deleted) 219 0::/test-cgroup/test-cgroup-nested (deleted)
212 220
213 221
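As an illustration of the "0::$PATH" format described above, the following minimal sketch extracts the cgroup v2 path from text in the /proc/$PID/cgroup format. The function name cgroup2_path is hypothetical and not part of any tool. The sample text is modeled on the example output above, and the legacy (v1) line in it is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/$PID/cgroup style text on stdin and print
# the cgroup v2 path. The v2 entry is the line that starts with "0::".
cgroup2_path() {
    # Keep only the v2 line with its "0::" prefix stripped, then drop a
    # trailing " (deleted)" marker if one is present.
    sed -n 's/^0:://p' | sed 's/ (deleted)$//'
}

# Sample input. The first line stands in for a legacy hierarchy entry and
# is an assumption made for this example.
sample='5:cpu:/legacy-group
0::/test-cgroup/test-cgroup-nested (deleted)'

printf '%s\n' "$sample" | cgroup2_path
```

On a live system the same function could be fed from a real file, for example `cgroup2_path < /proc/self/cgroup`. Note that the sketch cannot distinguish the " (deleted)" marker from a cgroup whose name happens to end in that string.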
2142-3. [Un]populated Notification 222[Un]populated Notification
223--------------------------
215 224
216Each non-root cgroup has a "cgroup.events" file which contains 225Each non-root cgroup has a "cgroup.events" file which contains
217"populated" field indicating whether the cgroup's sub-hierarchy has 226"populated" field indicating whether the cgroup's sub-hierarchy has
@@ -222,7 +231,7 @@ example, to start a clean-up operation after all processes of a given
222sub-hierarchy have exited. The populated state updates and 231sub-hierarchy have exited. The populated state updates and
223notifications are recursive. Consider the following sub-hierarchy 232notifications are recursive. Consider the following sub-hierarchy
224where the numbers in the parentheses represent the numbers of processes 233where the numbers in the parentheses represent the numbers of processes
225in each cgroup. 234in each cgroup::
226 235
227 A(4) - B(0) - C(1) 236 A(4) - B(0) - C(1)
228 \ D(0) 237 \ D(0)
@@ -233,18 +242,20 @@ file modified events will be generated on the "cgroup.events" files of
233both cgroups. 242both cgroups.
234 243
235 244
2362-4. Controlling Controllers 245Controlling Controllers
246-----------------------
237 247
2382-4-1. Enabling and Disabling 248Enabling and Disabling
249~~~~~~~~~~~~~~~~~~~~~~
239 250
240Each cgroup has a "cgroup.controllers" file which lists all 251Each cgroup has a "cgroup.controllers" file which lists all
241controllers available for the cgroup to enable. 252controllers available for the cgroup to enable::
242 253
243 # cat cgroup.controllers 254 # cat cgroup.controllers
244 cpu io memory 255 cpu io memory
245 256
246No controller is enabled by default. Controllers can be enabled and 257No controller is enabled by default. Controllers can be enabled and
247disabled by writing to the "cgroup.subtree_control" file. 258disabled by writing to the "cgroup.subtree_control" file::
248 259
249 # echo "+cpu +memory -io" > cgroup.subtree_control 260 # echo "+cpu +memory -io" > cgroup.subtree_control
250 261
@@ -256,7 +267,7 @@ are specified, the last one is effective.
256Enabling a controller in a cgroup indicates that the distribution of 267Enabling a controller in a cgroup indicates that the distribution of
257the target resource across its immediate children will be controlled. 268the target resource across its immediate children will be controlled.
258Consider the following sub-hierarchy. The enabled controllers are 269Consider the following sub-hierarchy. The enabled controllers are
259listed in parentheses. 270listed in parentheses::
260 271
261 A(cpu,memory) - B(memory) - C() 272 A(cpu,memory) - B(memory) - C()
262 \ D() 273 \ D()
@@ -276,7 +287,8 @@ controller interface files - anything which doesn't start with
276"cgroup." are owned by the parent rather than the cgroup itself. 287"cgroup." are owned by the parent rather than the cgroup itself.
277 288
278 289
2792-4-2. Top-down Constraint 290Top-down Constraint
291~~~~~~~~~~~~~~~~~~~
280 292
281Resources are distributed top-down and a cgroup can further distribute 293Resources are distributed top-down and a cgroup can further distribute
282a resource only if the resource has been distributed to it from the 294a resource only if the resource has been distributed to it from the
@@ -287,7 +299,8 @@ the parent has the controller enabled and a controller can't be
287disabled if one or more children have it enabled. 299disabled if one or more children have it enabled.
288 300
289 301
2902-4-3. No Internal Process Constraint 302No Internal Process Constraint
303~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
291 304
292Non-root cgroups can only distribute resources to their children when 305Non-root cgroups can only distribute resources to their children when
293they don't have any processes of their own. In other words, only 306they don't have any processes of their own. In other words, only
@@ -314,9 +327,11 @@ children before enabling controllers in its "cgroup.subtree_control"
314file. 327file.
315 328
316 329
3172-5. Delegation 330Delegation
331----------
318 332
3192-5-1. Model of Delegation 333Model of Delegation
334~~~~~~~~~~~~~~~~~~~
320 335
321A cgroup can be delegated in two ways. First, to a less privileged 336A cgroup can be delegated in two ways. First, to a less privileged
322user by granting write access of the directory and its "cgroup.procs" 337user by granting write access of the directory and its "cgroup.procs"
@@ -345,7 +360,8 @@ cgroups in or nesting depth of a delegated sub-hierarchy; however,
345this may be limited explicitly in the future. 360this may be limited explicitly in the future.
346 361
347 362
3482-5-2. Delegation Containment 363Delegation Containment
364~~~~~~~~~~~~~~~~~~~~~~
349 365
350A delegated sub-hierarchy is contained in the sense that processes 366A delegated sub-hierarchy is contained in the sense that processes
351can't be moved into or out of the sub-hierarchy by the delegatee. 367can't be moved into or out of the sub-hierarchy by the delegatee.
@@ -366,7 +382,7 @@ in from or push out to outside the sub-hierarchy.
366 382
367For an example, let's assume cgroups C0 and C1 have been delegated to 383For an example, let's assume cgroups C0 and C1 have been delegated to
368user U0 who created C00, C01 under C0 and C10 under C1 as follows and 384user U0 who created C00, C01 under C0 and C10 under C1 as follows and
369all processes under C0 and C1 belong to U0. 385all processes under C0 and C1 belong to U0::
370 386
371 ~~~~~~~~~~~~~ - C0 - C00 387 ~~~~~~~~~~~~~ - C0 - C00
372 ~ cgroup ~ \ C01 388 ~ cgroup ~ \ C01
@@ -386,9 +402,11 @@ namespace of the process which is attempting the migration. If either
386is not reachable, the migration is rejected with -ENOENT. 402is not reachable, the migration is rejected with -ENOENT.
387 403
388 404
3892-6. Guidelines 405Guidelines
406----------
390 407
3912-6-1. Organize Once and Control 408Organize Once and Control
409~~~~~~~~~~~~~~~~~~~~~~~~~
392 410
393Migrating a process across cgroups is a relatively expensive operation 411Migrating a process across cgroups is a relatively expensive operation
394and stateful resources such as memory are not moved together with the 412and stateful resources such as memory are not moved together with the
@@ -404,7 +422,8 @@ distribution can be made by changing controller configuration through
404the interface files. 422the interface files.
405 423
406 424
4072-6-2. Avoid Name Collisions 425Avoid Name Collisions
426~~~~~~~~~~~~~~~~~~~~~
408 427
409Interface files for a cgroup and its children cgroups occupy the same 428Interface files for a cgroup and its children cgroups occupy the same
410directory and it is possible to create children cgroups which collide 429directory and it is possible to create children cgroups which collide
@@ -422,14 +441,16 @@ cgroup doesn't do anything to prevent name collisions and it's the
422user's responsibility to avoid them. 441user's responsibility to avoid them.
423 442
424 443
4253. Resource Distribution Models 444Resource Distribution Models
445============================
426 446
427cgroup controllers implement several resource distribution schemes 447cgroup controllers implement several resource distribution schemes
428depending on the resource type and expected use cases. This section 448depending on the resource type and expected use cases. This section
429describes major schemes in use along with their expected behaviors. 449describes major schemes in use along with their expected behaviors.
430 450
431 451
4323-1. Weights 452Weights
453-------
433 454
434A parent's resource is distributed by adding up the weights of all 455A parent's resource is distributed by adding up the weights of all
435active children and giving each the fraction matching the ratio of its 456active children and giving each the fraction matching the ratio of its
@@ -450,7 +471,8 @@ process migrations.
450and is an example of this type. 471and is an example of this type.
451 472
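The weight model above can be checked with a small worked example. This is a sketch only: the function name weight_share_percent and the sample weights are assumptions made for illustration, and integer arithmetic truncates the result.

```shell
#!/bin/sh
# Hypothetical helper: the first argument is one child's weight; the
# remaining arguments are the weights of all active children, including
# that child. Prints the child's share as a whole-number percentage.
weight_share_percent() {
    own=$1
    shift
    sum=0
    for w in "$@"; do
        sum=$(( sum + w ))
    done
    echo $(( own * 100 / sum ))
}

# Three active children with weights 100, 200 and 100 (sum 400).
weight_share_percent 100 100 200 100
weight_share_percent 200 100 200 100
```

Because only active children are counted, the shares of the remaining children grow when one sibling goes idle: with the weight-200 child inactive, each weight-100 child would be entitled to half of the parent's resource.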
452 473
4533-2. Limits 474Limits
475------
454 476
455A child can only consume upto the configured amount of the resource. 477A child can only consume upto the configured amount of the resource.
456Limits can be over-committed - the sum of the limits of children can 478Limits can be over-committed - the sum of the limits of children can
@@ -466,7 +488,8 @@ process migrations.
466on an IO device and is an example of this type. 488on an IO device and is an example of this type.
467 489
468 490
4693-3. Protections 491Protections
492-----------
470 493
471A cgroup is protected to be allocated upto the configured amount of 494A cgroup is protected to be allocated upto the configured amount of
472the resource if the usages of all its ancestors are under their 495the resource if the usages of all its ancestors are under their
@@ -486,7 +509,8 @@ process migrations.
486example of this type. 509example of this type.
487 510
488 511
4893-4. Allocations 512Allocations
513-----------
490 514
491A cgroup is exclusively allocated a certain amount of a finite 515A cgroup is exclusively allocated a certain amount of a finite
492resource. Allocations can't be over-committed - the sum of the 516resource. Allocations can't be over-committed - the sum of the
@@ -505,12 +529,14 @@ may be rejected.
505type. 529type.
506 530
507 531
5084. Interface Files 532Interface Files
533===============
509 534
5104-1. Format 535Format
536------
511 537
512All interface files should be in one of the following formats whenever 538All interface files should be in one of the following formats whenever
513possible. 539possible::
514 540
515 New-line separated values 541 New-line separated values
516 (when only one value can be written at once) 542 (when only one value can be written at once)
@@ -545,7 +571,8 @@ can be written at a time. For nested keyed files, the sub key pairs
545may be specified in any order and not all pairs have to be specified. 571may be specified in any order and not all pairs have to be specified.
546 572
547 573
5484-2. Conventions 574Conventions
575-----------
549 576
550- Settings for a single feature should be contained in a single file. 577- Settings for a single feature should be contained in a single file.
551 578
@@ -581,25 +608,25 @@ may be specified in any order and not all pairs have to be specified.
581 with "default" as the value must not appear when read. 608 with "default" as the value must not appear when read.
582 609
583 For example, a setting which is keyed by major:minor device numbers 610 For example, a setting which is keyed by major:minor device numbers
584 with integer values may look like the following. 611 with integer values may look like the following::
585 612
586 # cat cgroup-example-interface-file 613 # cat cgroup-example-interface-file
587 default 150 614 default 150
588 8:0 300 615 8:0 300
589 616
590 The default value can be updated by 617 The default value can be updated by::
591 618
592 # echo 125 > cgroup-example-interface-file 619 # echo 125 > cgroup-example-interface-file
593 620
594 or 621 or::
595 622
596 # echo "default 125" > cgroup-example-interface-file 623 # echo "default 125" > cgroup-example-interface-file
597 624
598 An override can be set by 625 An override can be set by::
599 626
600 # echo "8:16 170" > cgroup-example-interface-file 627 # echo "8:16 170" > cgroup-example-interface-file
601 628
602 and cleared by 629 and cleared by::
603 630
604 # echo "8:0 default" > cgroup-example-interface-file 631 # echo "8:0 default" > cgroup-example-interface-file
605 # cat cgroup-example-interface-file 632 # cat cgroup-example-interface-file
@@ -612,12 +639,12 @@ may be specified in any order and not all pairs have to be specified.
612 generated on the file. 639 generated on the file.
613 640
614 641
6154-3. Core Interface Files 642Core Interface Files
643--------------------
616 644
617All cgroup core files are prefixed with "cgroup." 645All cgroup core files are prefixed with "cgroup."
618 646
619 cgroup.procs 647 cgroup.procs
620
621 A read-write new-line separated values file which exists on 648 A read-write new-line separated values file which exists on
622 all cgroups. 649 all cgroups.
623 650
@@ -643,7 +670,6 @@ All cgroup core files are prefixed with "cgroup."
643 should be granted along with the containing directory. 670 should be granted along with the containing directory.
644 671
645 cgroup.controllers 672 cgroup.controllers
646
647 A read-only space separated values file which exists on all 673 A read-only space separated values file which exists on all
648 cgroups. 674 cgroups.
649 675
@@ -651,7 +677,6 @@ All cgroup core files are prefixed with "cgroup."
651 the cgroup. The controllers are not ordered. 677 the cgroup. The controllers are not ordered.
652 678
653 cgroup.subtree_control 679 cgroup.subtree_control
654
655 A read-write space separated values file which exists on all 680 A read-write space separated values file which exists on all
656 cgroups. Starts out empty. 681 cgroups. Starts out empty.
657 682
@@ -667,23 +692,25 @@ All cgroup core files are prefixed with "cgroup."
667 operations are specified, either all succeed or all fail. 692 operations are specified, either all succeed or all fail.
668 693
669 cgroup.events 694 cgroup.events
670
671 A read-only flat-keyed file which exists on non-root cgroups. 695 A read-only flat-keyed file which exists on non-root cgroups.
672 The following entries are defined. Unless specified 696 The following entries are defined. Unless specified
673 otherwise, a value change in this file generates a file 697 otherwise, a value change in this file generates a file
674 modified event. 698 modified event.
675 699
676 populated 700 populated
677
678 1 if the cgroup or its descendants contains any live 701 1 if the cgroup or its descendants contains any live
679 processes; otherwise, 0. 702 processes; otherwise, 0.
680 703
681 704
6825. Controllers 705Controllers
706===========
683 707
6845-1. CPU 708CPU
709---
685 710
686[NOTE: The interface for the cpu controller hasn't been merged yet] 711.. note::
712
713 The interface for the cpu controller hasn't been merged yet
687 714
688The "cpu" controllers regulates distribution of CPU cycles. This 715The "cpu" controllers regulates distribution of CPU cycles. This
689controller implements weight and absolute bandwidth limit models for 716controller implements weight and absolute bandwidth limit models for
@@ -691,36 +718,34 @@ normal scheduling policy and absolute bandwidth allocation model for
691realtime scheduling policy. 718realtime scheduling policy.
692 719
693 720
6945-1-1. CPU Interface Files 721CPU Interface Files
722~~~~~~~~~~~~~~~~~~~
695 723
696All time durations are in microseconds. 724All time durations are in microseconds.
697 725
698 cpu.stat 726 cpu.stat
699
700 A read-only flat-keyed file which exists on non-root cgroups. 727 A read-only flat-keyed file which exists on non-root cgroups.
701 728
702 It reports the following six stats. 729 It reports the following six stats:
703 730
704 usage_usec 731 - usage_usec
705 user_usec 732 - user_usec
706 system_usec 733 - system_usec
707 nr_periods 734 - nr_periods
708 nr_throttled 735 - nr_throttled
709 throttled_usec 736 - throttled_usec
710 737
711 cpu.weight 738 cpu.weight
712
713 A read-write single value file which exists on non-root 739 A read-write single value file which exists on non-root
714 cgroups. The default is "100". 740 cgroups. The default is "100".
715 741
716 The weight in the range [1, 10000]. 742 The weight in the range [1, 10000].
717 743
718 cpu.max 744 cpu.max
719
720 A read-write two value file which exists on non-root cgroups. 745 A read-write two value file which exists on non-root cgroups.
721 The default is "max 100000". 746 The default is "max 100000".
722 747
723 The maximum bandwidth limit. It's in the following format. 748 The maximum bandwidth limit. It's in the following format::
724 749
725 $MAX $PERIOD 750 $MAX $PERIOD
726 751
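To make the "$MAX $PERIOD" format concrete, the sketch below converts the contents of cpu.max into a percentage of one CPU. It assumes that $MAX is the runtime the group may consume in each $PERIOD, which follows the upstream description of this file rather than the lines shown in this hunk. The function name cpu_max_percent is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: given cpu.max contents ("$MAX $PERIOD", both in
# microseconds), print the limit as a percentage of one CPU, or
# "unlimited" when $MAX is the literal string "max".
cpu_max_percent() {
    # Split the single argument into $1 = MAX and $2 = PERIOD.
    set -- $1
    if [ "$1" = "max" ]; then
        echo unlimited
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

cpu_max_percent "max 100000"      # the documented default
cpu_max_percent "50000 100000"    # half of one CPU under the assumption above
cpu_max_percent "200000 100000"   # two CPUs' worth of runtime per period
```

Values above 100 are possible because a group can run on several CPUs at once within the same period.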
@@ -729,9 +754,10 @@ All time durations are in microseconds.
729 one number is written, $MAX is updated. 754 one number is written, $MAX is updated.
730 755
731 cpu.rt.max 756 cpu.rt.max
757 .. note::
732 758
733 [NOTE: The semantics of this file is still under discussion and the 759 The semantics of this file is still under discussion and the
734 interface hasn't been merged yet] 760 interface hasn't been merged yet
735 761
736 A read-write two value file which exists on all cgroups. 762 A read-write two value file which exists on all cgroups.
737 The default is "0 100000". 763 The default is "0 100000".
@@ -739,7 +765,7 @@ All time durations are in microseconds.
739 The maximum realtime runtime allocation. Over-committing 765 The maximum realtime runtime allocation. Over-committing
740 configurations are disallowed and process migrations are 766 configurations are disallowed and process migrations are
741 rejected if not enough bandwidth is available. It's in the 767 rejected if not enough bandwidth is available. It's in the
742 following format. 768 following format::
743 769
744 $MAX $PERIOD 770 $MAX $PERIOD
745 771
@@ -748,7 +774,8 @@ All time durations are in microseconds.
748 updated. 774 updated.
749 775
750 776
7515-2. Memory 777Memory
778------
752 779
753The "memory" controller regulates distribution of memory. Memory is 780The "memory" controller regulates distribution of memory. Memory is
754stateful and implements both limit and protection models. Due to the 781stateful and implements both limit and protection models. Due to the
@@ -770,14 +797,14 @@ following types of memory usages are tracked.
770The above list may expand in the future for better coverage. 797The above list may expand in the future for better coverage.
771 798
772 799
7735-2-1. Memory Interface Files 800Memory Interface Files
801~~~~~~~~~~~~~~~~~~~~~~
774 802
775All memory amounts are in bytes. If a value which is not aligned to 803All memory amounts are in bytes. If a value which is not aligned to
776PAGE_SIZE is written, the value may be rounded up to the closest 804PAGE_SIZE is written, the value may be rounded up to the closest
777PAGE_SIZE multiple when read back. 805PAGE_SIZE multiple when read back.
778 806
779 memory.current 807 memory.current
780
781 A read-only single value file which exists on non-root 808 A read-only single value file which exists on non-root
782 cgroups. 809 cgroups.
783 810
@@ -785,7 +812,6 @@ PAGE_SIZE multiple when read back.
785 and its descendants. 812 and its descendants.
786 813
787 memory.low 814 memory.low
788
789 A read-write single value file which exists on non-root 815 A read-write single value file which exists on non-root
790 cgroups. The default is "0". 816 cgroups. The default is "0".
791 817
@@ -798,7 +824,6 @@ PAGE_SIZE multiple when read back.
798 protection is discouraged. 824 protection is discouraged.
799 825
800 memory.high 826 memory.high
801
802 A read-write single value file which exists on non-root 827 A read-write single value file which exists on non-root
803 cgroups. The default is "max". 828 cgroups. The default is "max".
804 829
@@ -811,7 +836,6 @@ PAGE_SIZE multiple when read back.
811 under extreme conditions the limit may be breached. 836 under extreme conditions the limit may be breached.
812 837
813 memory.max 838 memory.max
814
815 A read-write single value file which exists on non-root 839 A read-write single value file which exists on non-root
816 cgroups. The default is "max". 840 cgroups. The default is "max".
817 841
@@ -826,21 +850,18 @@ PAGE_SIZE multiple when read back.
826 utility is limited to providing the final safety net. 850 utility is limited to providing the final safety net.
827 851
828 memory.events 852 memory.events
829
830 A read-only flat-keyed file which exists on non-root cgroups. 853 A read-only flat-keyed file which exists on non-root cgroups.
831 The following entries are defined. Unless specified 854 The following entries are defined. Unless specified
832 otherwise, a value change in this file generates a file 855 otherwise, a value change in this file generates a file
833 modified event. 856 modified event.
834 857
835 low 858 low
836
837 The number of times the cgroup is reclaimed due to 859 The number of times the cgroup is reclaimed due to
838 high memory pressure even though its usage is under 860 high memory pressure even though its usage is under
839 the low boundary. This usually indicates that the low 861 the low boundary. This usually indicates that the low
840 boundary is over-committed. 862 boundary is over-committed.
841 863
842 high 864 high
843
844 The number of times processes of the cgroup are 865 The number of times processes of the cgroup are
845 throttled and routed to perform direct memory reclaim 866 throttled and routed to perform direct memory reclaim
846 because the high memory boundary was exceeded. For a 867 because the high memory boundary was exceeded. For a
@@ -849,13 +870,11 @@ PAGE_SIZE multiple when read back.
849 occurrences are expected. 870 occurrences are expected.
850 871
851 max 872 max
852
853 The number of times the cgroup's memory usage was 873 The number of times the cgroup's memory usage was
854 about to go over the max boundary. If direct reclaim 874 about to go over the max boundary. If direct reclaim
855 fails to bring it down, the cgroup goes to OOM state. 875 fails to bring it down, the cgroup goes to OOM state.
856 876
857 oom 877 oom
858
859 The number of time the cgroup's memory usage was 878 The number of time the cgroup's memory usage was
860 reached the limit and allocation was about to fail. 879 reached the limit and allocation was about to fail.
861 880
@@ -864,16 +883,14 @@ PAGE_SIZE multiple when read back.
864 883
865 Failed allocation in its turn could be returned into 884 Failed allocation in its turn could be returned into
866 userspace as -ENOMEM or siletly ignored in cases like 885 userspace as -ENOMEM or siletly ignored in cases like
867 disk readahead. For now OOM in memory cgroup kills 886 disk readahead. For now OOM in memory cgroup kills
868 tasks iff shortage has happened inside page fault. 887 tasks iff shortage has happened inside page fault.
869 888
870 oom_kill 889 oom_kill
871
872 The number of processes belonging to this cgroup 890 The number of processes belonging to this cgroup
873 killed by any kind of OOM killer. 891 killed by any kind of OOM killer.
874 892
875 memory.stat 893 memory.stat
876
877 A read-only flat-keyed file which exists on non-root cgroups. 894 A read-only flat-keyed file which exists on non-root cgroups.
878 895
879 This breaks down the cgroup's memory footprint into different 896 This breaks down the cgroup's memory footprint into different
@@ -887,73 +904,55 @@ PAGE_SIZE multiple when read back.
887 fixed position; use the keys to look up specific values! 904 fixed position; use the keys to look up specific values!
888 905
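Since entries are not guaranteed to sit at a fixed position, a reader of a flat-keyed file should select values by key. The following sketch shows one way to do that. The function name flat_keyed_get is hypothetical and the sample values are made up; only the key names come from the list of entries in this file.

```shell
#!/bin/sh
# Hypothetical helper: read flat-keyed text ("KEY VALUE" per line) on
# stdin and print the value of the requested key. Prints nothing if the
# key is absent.
flat_keyed_get() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Sample data with invented values, in an arbitrary order.
sample='file 8192
anon 4096
slab 1024'

printf '%s\n' "$sample" | flat_keyed_get anon
printf '%s\n' "$sample" | flat_keyed_get slab
```

On a system with the memory controller enabled, the same lookup could be applied to a real file, for example `flat_keyed_get anon < memory.stat` from within a cgroup directory.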
889 anon 906 anon
890
891 Amount of memory used in anonymous mappings such as 907 Amount of memory used in anonymous mappings such as
892 brk(), sbrk(), and mmap(MAP_ANONYMOUS) 908 brk(), sbrk(), and mmap(MAP_ANONYMOUS)
893 909
894 file 910 file
895
896 Amount of memory used to cache filesystem data, 911 Amount of memory used to cache filesystem data,
897 including tmpfs and shared memory. 912 including tmpfs and shared memory.
898 913
899 kernel_stack 914 kernel_stack
900
901 Amount of memory allocated to kernel stacks. 915 Amount of memory allocated to kernel stacks.
902 916
903 slab 917 slab
904
905 Amount of memory used for storing in-kernel data 918 Amount of memory used for storing in-kernel data
906 structures. 919 structures.
907 920
908 sock 921 sock
909
910 Amount of memory used in network transmission buffers 922 Amount of memory used in network transmission buffers
911 923
912 shmem 924 shmem
913
914 Amount of cached filesystem data that is swap-backed, 925 Amount of cached filesystem data that is swap-backed,
915 such as tmpfs, shm segments, shared anonymous mmap()s 926 such as tmpfs, shm segments, shared anonymous mmap()s
916 927
917 file_mapped 928 file_mapped
918
919 Amount of cached filesystem data mapped with mmap() 929 Amount of cached filesystem data mapped with mmap()
920 930
921 file_dirty 931 file_dirty
922
923 Amount of cached filesystem data that was modified but 932 Amount of cached filesystem data that was modified but
924 not yet written back to disk 933 not yet written back to disk
925 934
926 file_writeback 935 file_writeback
927
928 Amount of cached filesystem data that was modified and 936 Amount of cached filesystem data that was modified and
929 is currently being written back to disk 937 is currently being written back to disk
930 938
931 inactive_anon 939 inactive_anon, active_anon, inactive_file, active_file, unevictable
932 active_anon
933 inactive_file
934 active_file
935 unevictable
936
937 Amount of memory, swap-backed and filesystem-backed, 940 Amount of memory, swap-backed and filesystem-backed,
938 on the internal memory management lists used by the 941 on the internal memory management lists used by the
939 page reclaim algorithm 942 page reclaim algorithm
940 943
941 slab_reclaimable 944 slab_reclaimable
942
943 Part of "slab" that might be reclaimed, such as 945 Part of "slab" that might be reclaimed, such as
944 dentries and inodes. 946 dentries and inodes.
945 947
946 slab_unreclaimable 948 slab_unreclaimable
947
948 Part of "slab" that cannot be reclaimed on memory 949 Part of "slab" that cannot be reclaimed on memory
949 pressure. 950 pressure.
950 951
951 pgfault 952 pgfault
952
953 Total number of page faults incurred 953 Total number of page faults incurred
954 954
955 pgmajfault 955 pgmajfault
956
957 Number of major page faults incurred 956 Number of major page faults incurred
958 957
959 workingset_refault 958 workingset_refault
@@ -997,7 +996,6 @@ PAGE_SIZE multiple when read back.
997 Amount of reclaimed lazyfree pages 996 Amount of reclaimed lazyfree pages
998 997
999 memory.swap.current 998 memory.swap.current
1000
1001 A read-only single value file which exists on non-root 999 A read-only single value file which exists on non-root
1002 cgroups. 1000 cgroups.
1003 1001
@@ -1005,7 +1003,6 @@ PAGE_SIZE multiple when read back.
1005 and its descendants. 1003 and its descendants.
1006 1004
1007 memory.swap.max 1005 memory.swap.max
1008
1009 A read-write single value file which exists on non-root 1006 A read-write single value file which exists on non-root
1010 cgroups. The default is "max". 1007 cgroups. The default is "max".
1011 1008
@@ -1013,7 +1010,8 @@ PAGE_SIZE multiple when read back.
1013 limit, anonymous memory of the cgroup will not be swapped out. 1010 limit, anonymous memory of the cgroup will not be swapped out.
1014 1011
1015 1012
10165-2-2. Usage Guidelines 1013Usage Guidelines
1014~~~~~~~~~~~~~~~~
1017 1015
1018"memory.high" is the main mechanism to control memory usage. 1016"memory.high" is the main mechanism to control memory usage.
1019Over-committing on high limit (sum of high limits > available memory) 1017Over-committing on high limit (sum of high limits > available memory)
@@ -1036,7 +1034,8 @@ memory; unfortunately, memory pressure monitoring mechanism isn't
1036implemented yet. 1034implemented yet.
1037 1035
1038 1036
10395-2-3. Memory Ownership 1037Memory Ownership
1038~~~~~~~~~~~~~~~~
1040 1039
1041A memory area is charged to the cgroup which instantiated it and stays 1040A memory area is charged to the cgroup which instantiated it and stays
1042charged to the cgroup until the area is released. Migrating a process 1041charged to the cgroup until the area is released. Migrating a process
@@ -1054,7 +1053,8 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
1054belonging to the affected files to ensure correct memory ownership. 1053belonging to the affected files to ensure correct memory ownership.
1055 1054
1056 1055
10575-3. IO 1056IO
1057--
1058 1058
1059The "io" controller regulates the distribution of IO resources. This 1059The "io" controller regulates the distribution of IO resources. This
1060controller implements both weight based and absolute bandwidth or IOPS 1060controller implements both weight based and absolute bandwidth or IOPS
@@ -1063,28 +1063,29 @@ only if cfq-iosched is in use and neither scheme is available for
1063blk-mq devices. 1063blk-mq devices.
1064 1064
1065 1065
10665-3-1. IO Interface Files 1066IO Interface Files
1067~~~~~~~~~~~~~~~~~~
1067 1068
1068 io.stat 1069 io.stat
1069
1070 A read-only nested-keyed file which exists on non-root 1070 A read-only nested-keyed file which exists on non-root
1071 cgroups. 1071 cgroups.
1072 1072
1073 Lines are keyed by $MAJ:$MIN device numbers and not ordered. 1073 Lines are keyed by $MAJ:$MIN device numbers and not ordered.
1074 The following nested keys are defined. 1074 The following nested keys are defined.
1075 1075
1076 ====== ===================
1076 rbytes Bytes read 1077 rbytes Bytes read
1077 wbytes Bytes written 1078 wbytes Bytes written
1078 rios Number of read IOs 1079 rios Number of read IOs
1079 wios Number of write IOs 1080 wios Number of write IOs
1081 ====== ===================
1080 1082
1081 An example read output follows. 1083 An example read output follows:
1082 1084
1083 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 1085 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
1084 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 1086 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
1085 1087
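The nested-keyed layout shown for io.stat can be read with a few lines of code. The following is a minimal sketch, not part of the patch: the helper name `parse_nested_keyed` is hypothetical, and it assumes every value is either an integer or the literal "max". Because lines are not ordered, the result is keyed by the $MAJ:$MIN device number rather than by position.

```python
def parse_nested_keyed(text):
    """Parse nested-keyed cgroup output into {device: {key: value}}.

    Hypothetical helper for illustration; values are converted to int
    unless they are the literal string "max".
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        device, pairs = fields[0], fields[1:]
        entry = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            entry[key] = value if value == "max" else int(value)
        result[device] = entry
    return result


# Sample text copied from the io.stat example above.
sample = """\
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
"""

stats = parse_nested_keyed(sample)
print(stats["8:16"]["rbytes"])
```

The same helper should also handle io.max and rdma.max style lines, since they share the nested-keyed format.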
1086 io.weight 1088 io.weight
1087
1088 A read-write flat-keyed file which exists on non-root cgroups. 1089 A read-write flat-keyed file which exists on non-root cgroups.
1089 The default is "default 100". 1090 The default is "default 100".
1090 1091
@@ -1098,14 +1099,13 @@ blk-mq devices.
1098 $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing 1099 $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
1099 "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default". 1100 "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
1100 1101
1101 An example read output follows. 1102 An example read output follows::
1102 1103
1103 default 100 1104 default 100
1104 8:16 200 1105 8:16 200
1105 8:0 50 1106 8:0 50
1106 1107
1107 io.max 1108 io.max
1108
1109 A read-write nested-keyed file which exists on non-root 1109 A read-write nested-keyed file which exists on non-root
1110 cgroups. 1110 cgroups.
1111 1111
@@ -1113,10 +1113,12 @@ blk-mq devices.
1113 device numbers and not ordered. The following nested keys are 1113 device numbers and not ordered. The following nested keys are
1114 defined. 1114 defined.
1115 1115
1116 ===== ==================================
1116 rbps Max read bytes per second 1117 rbps Max read bytes per second
1117 wbps Max write bytes per second 1118 wbps Max write bytes per second
1118 riops Max read IO operations per second 1119 riops Max read IO operations per second
1119 wiops Max write IO operations per second 1120 wiops Max write IO operations per second
1121 ===== ==================================
1120 1122
1121 When writing, any number of nested key-value pairs can be 1123 When writing, any number of nested key-value pairs can be
1122 specified in any order. "max" can be specified as the value 1124 specified in any order. "max" can be specified as the value
@@ -1126,24 +1128,25 @@ blk-mq devices.
1126 BPS and IOPS are measured in each IO direction and IOs are 1128 BPS and IOPS are measured in each IO direction and IOs are
1127 delayed if limit is reached. Temporary bursts are allowed. 1129 delayed if limit is reached. Temporary bursts are allowed.
1128 1130
1129 Setting read limit at 2M BPS and write at 120 IOPS for 8:16. 1131 Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
1130 1132
1131 echo "8:16 rbps=2097152 wiops=120" > io.max 1133 echo "8:16 rbps=2097152 wiops=120" > io.max
1132 1134
1133 Reading returns the following. 1135 Reading returns the following::
1134 1136
1135 8:16 rbps=2097152 wbps=max riops=max wiops=120 1137 8:16 rbps=2097152 wbps=max riops=max wiops=120
1136 1138
1137 Write IOPS limit can be removed by writing the following. 1139 Write IOPS limit can be removed by writing the following::
1138 1140
1139 echo "8:16 wiops=max" > io.max 1141 echo "8:16 wiops=max" > io.max
1140 1142
1141 Reading now returns the following. 1143 Reading now returns the following::
1142 1144
1143 8:16 rbps=2097152 wbps=max riops=max wiops=max 1145 8:16 rbps=2097152 wbps=max riops=max wiops=max
1144 1146
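The write semantics described for io.max (only the keys named in a write change; the others keep their values) can be modelled in user space. This is a sketch under stated assumptions: it is a pure Python model that never touches cgroupfs, the names `apply_io_max_write` and `render_io_max` are hypothetical, and it does not reproduce every kernel detail, such as how a device with all limits at "max" is displayed.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def apply_io_max_write(limits, line):
    """Return a copy of `limits` updated by one write to io.max.

    `limits` maps "$MAJ:$MIN" to a dict of the four io.max keys.
    Keys not mentioned in `line` keep their previous value.
    """
    device, *pairs = line.split()
    updated = {dev: dict(vals) for dev, vals in limits.items()}
    # A device seen for the first time starts with every limit at "max".
    entry = updated.setdefault(device, dict.fromkeys(IO_MAX_KEYS, "max"))
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key not in IO_MAX_KEYS:
            raise ValueError(f"unknown io.max key: {key}")
        entry[key] = value
    return updated


def render_io_max(limits):
    """Format the model state the way a read of io.max is shown above."""
    return "\n".join(
        f"{dev} " + " ".join(f"{key}={vals[key]}" for key in IO_MAX_KEYS)
        for dev, vals in limits.items()
    )


state = apply_io_max_write({}, "8:16 rbps=2097152 wiops=120")
first_read = render_io_max(state)
state = apply_io_max_write(state, "8:16 wiops=max")
second_read = render_io_max(state)
print(first_read)
print(second_read)
```

The two printed lines correspond to the two "Reading returns" outputs in the example above.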
1145 1147
11465-3-2. Writeback 1148Writeback
1149~~~~~~~~~
1147 1150
1148Page cache is dirtied through buffered writes and shared mmaps and 1151Page cache is dirtied through buffered writes and shared mmaps and
1149written asynchronously to the backing filesystem by the writeback 1152written asynchronously to the backing filesystem by the writeback
@@ -1191,22 +1194,19 @@ patterns.
1191The sysctl knobs which affect writeback behavior are applied to cgroup 1194The sysctl knobs which affect writeback behavior are applied to cgroup
1192writeback as follows. 1195writeback as follows.
1193 1196
1194 vm.dirty_background_ratio 1197 vm.dirty_background_ratio, vm.dirty_ratio
1195 vm.dirty_ratio
1196
1197 These ratios apply the same to cgroup writeback with the 1198 These ratios apply the same to cgroup writeback with the
1198 amount of available memory capped by limits imposed by the 1199 amount of available memory capped by limits imposed by the
1199 memory controller and system-wide clean memory. 1200 memory controller and system-wide clean memory.
1200 1201
1201 vm.dirty_background_bytes 1202 vm.dirty_background_bytes, vm.dirty_bytes
1202 vm.dirty_bytes
1203
1204 For cgroup writeback, this is calculated into ratio against 1203 For cgroup writeback, this is calculated into ratio against
1205 total available memory and applied the same way as 1204 total available memory and applied the same way as
1206 vm.dirty[_background]_ratio. 1205 vm.dirty[_background]_ratio.
1207 1206
1208 1207
12095-4. PID 1208PID
1209---
1210 1210
1211The process number controller is used to allow a cgroup to stop any 1211The process number controller is used to allow a cgroup to stop any
1212new tasks from being fork()'d or clone()'d after a specified limit is 1212new tasks from being fork()'d or clone()'d after a specified limit is
@@ -1221,17 +1221,16 @@ Note that PIDs used in this controller refer to TIDs, process IDs as
1221used by the kernel. 1221used by the kernel.
1222 1222
1223 1223
12245-4-1. PID Interface Files 1224PID Interface Files
1225~~~~~~~~~~~~~~~~~~~
1225 1226
1226 pids.max 1227 pids.max
1227
1228 A read-write single value file which exists on non-root 1228 A read-write single value file which exists on non-root
1229 cgroups. The default is "max". 1229 cgroups. The default is "max".
1230 1230
1231 Hard limit of number of processes. 1231 Hard limit of number of processes.
1232 1232
1233 pids.current 1233 pids.current
1234
1235 A read-only single value file which exists on all cgroups. 1234 A read-only single value file which exists on all cgroups.
1236 1235
1237 The number of processes currently in the cgroup and its 1236 The number of processes currently in the cgroup and its
@@ -1246,12 +1245,14 @@ through fork() or clone(). These will return -EAGAIN if the creation
1246of a new process would cause a cgroup policy to be violated. 1245of a new process would cause a cgroup policy to be violated.
1247 1246
1248 1247
12495-5. RDMA 1248RDMA
1249----
1250 1250
1251The "rdma" controller regulates the distribution and accounting of 1251The "rdma" controller regulates the distribution and accounting of
1252RDMA resources. 1252RDMA resources.
1253 1253
12545-5-1. RDMA Interface Files 1254RDMA Interface Files
1255~~~~~~~~~~~~~~~~~~~~
1255 1256
1256 rdma.max 1257 rdma.max
1257 A read-write nested-keyed file that exists for all the cgroups 1258 A read-write nested-keyed file that exists for all the cgroups
@@ -1264,10 +1265,12 @@ of RDMA resources.
1264 1265
1265 The following nested keys are defined. 1266 The following nested keys are defined.
1266 1267
1268 ========== =============================
1267 hca_handle Maximum number of HCA Handles 1269 hca_handle Maximum number of HCA Handles
1268 hca_object Maximum number of HCA Objects 1270 hca_object Maximum number of HCA Objects
1271 ========== =============================
1269 1272
1270 An example for mlx4 and ocrdma device follows. 1273 An example for mlx4 and ocrdma device follows::
1271 1274
1272 mlx4_0 hca_handle=2 hca_object=2000 1275 mlx4_0 hca_handle=2 hca_object=2000
1273 ocrdma1 hca_handle=3 hca_object=max 1276 ocrdma1 hca_handle=3 hca_object=max
@@ -1276,15 +1279,17 @@ of RDMA resources.
1276 A read-only file that describes current resource usage. 1279 A read-only file that describes current resource usage.
1277 It exists for all cgroups except root. 1280 It exists for all cgroups except root.
1278 1281
1279 An example for mlx4 and ocrdma device follows. 1282 An example for mlx4 and ocrdma device follows::
1280 1283
1281 mlx4_0 hca_handle=1 hca_object=20 1284 mlx4_0 hca_handle=1 hca_object=20
1282 ocrdma1 hca_handle=1 hca_object=23 1285 ocrdma1 hca_handle=1 hca_object=23
1283 1286
1284 1287
12855-6. Misc 1288Misc
1289----
1286 1290
12875-6-1. perf_event 1291perf_event
1292~~~~~~~~~~
1288 1293
1289perf_event controller, if not mounted on a legacy hierarchy, is 1294perf_event controller, if not mounted on a legacy hierarchy, is
1290automatically enabled on the v2 hierarchy so that perf events can 1295automatically enabled on the v2 hierarchy so that perf events can
@@ -1292,9 +1297,11 @@ always be filtered by cgroup v2 path. The controller can still be
1292moved to a legacy hierarchy after v2 hierarchy is populated. 1297moved to a legacy hierarchy after v2 hierarchy is populated.
1293 1298
1294 1299
12956. Namespace 1300Namespace
1301=========
1296 1302
12976-1. Basics 1303Basics
1304------
1298 1305
1299cgroup namespace provides a mechanism to virtualize the view of the 1306cgroup namespace provides a mechanism to virtualize the view of the
1300"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone 1307"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
@@ -1308,7 +1315,7 @@ Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
1308complete path of the cgroup of a process. In a container setup where 1315complete path of the cgroup of a process. In a container setup where
1309a set of cgroups and namespaces are intended to isolate processes the 1316a set of cgroups and namespaces are intended to isolate processes the
1310"/proc/$PID/cgroup" file may leak potential system level information 1317"/proc/$PID/cgroup" file may leak potential system level information
1311to the isolated processes. For example: 1318to the isolated processes. For example::
1312 1319
1313 # cat /proc/self/cgroup 1320 # cat /proc/self/cgroup
1314 0::/batchjobs/container_id1 1321 0::/batchjobs/container_id1
@@ -1316,14 +1323,14 @@ to the isolated processes. For Example:
1316The path '/batchjobs/container_id1' can be considered as system-data 1323The path '/batchjobs/container_id1' can be considered as system-data
1317and undesirable to expose to the isolated processes. cgroup namespace 1324and undesirable to expose to the isolated processes. cgroup namespace
1318can be used to restrict visibility of this path. For example, before 1325can be used to restrict visibility of this path. For example, before
1319creating a cgroup namespace, one would see: 1326creating a cgroup namespace, one would see::
1320 1327
1321 # ls -l /proc/self/ns/cgroup 1328 # ls -l /proc/self/ns/cgroup
1322 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835] 1329 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
1323 # cat /proc/self/cgroup 1330 # cat /proc/self/cgroup
1324 0::/batchjobs/container_id1 1331 0::/batchjobs/container_id1
1325 1332
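Each line of "/proc/$PID/cgroup" has three colon-separated fields: hierarchy ID, controller list, and cgroup path. For cgroup v2 the hierarchy ID is 0 and the controller list is empty, which gives the "0::" prefix seen above. A minimal sketch of splitting such a line follows; the function name `parse_proc_cgroup_line` is hypothetical.

```python
def parse_proc_cgroup_line(line):
    """Split one /proc/$PID/cgroup line into (hierarchy_id, controllers, path).

    The path is split off last so that any ':' inside it is preserved.
    """
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    controller_list = controllers.split(",") if controllers else []
    return int(hierarchy_id), controller_list, path


# Sample line copied from the example above.
entry = parse_proc_cgroup_line("0::/batchjobs/container_id1")
print(entry)
```

The path returned here is whatever the reading process is allowed to see, so the same PID can yield different paths from different cgroup namespaces.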
1326After unsharing a new namespace, the view changes. 1333After unsharing a new namespace, the view changes::
1327 1334
1328 # ls -l /proc/self/ns/cgroup 1335 # ls -l /proc/self/ns/cgroup
1329 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183] 1336 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
@@ -1341,7 +1348,8 @@ namespace is destroyed. The cgroupns root and the actual cgroups
1341remain. 1348remain.
1342 1349
1343 1350
13446-2. The Root and Views 1351The Root and Views
1352------------------
1345 1353
1346The 'cgroupns root' for a cgroup namespace is the cgroup in which the 1354The 'cgroupns root' for a cgroup namespace is the cgroup in which the
1347process calling unshare(2) is running. For example, if a process in 1355process calling unshare(2) is running. For example, if a process in
@@ -1350,7 +1358,7 @@ process calling unshare(2) is running. For example, if a process in
1350init_cgroup_ns, this is the real root ('/') cgroup. 1358init_cgroup_ns, this is the real root ('/') cgroup.
1351 1359
1352The cgroupns root cgroup does not change even if the namespace creator 1360The cgroupns root cgroup does not change even if the namespace creator
1353process later moves to a different cgroup. 1361process later moves to a different cgroup::
1354 1362
1355 # ~/unshare -c # unshare cgroupns in some cgroup 1363 # ~/unshare -c # unshare cgroupns in some cgroup
1356 # cat /proc/self/cgroup 1364 # cat /proc/self/cgroup
@@ -1364,7 +1372,7 @@ Each process gets its namespace-specific view of "/proc/$PID/cgroup"
1364 1372
1365Processes running inside the cgroup namespace will be able to see 1373Processes running inside the cgroup namespace will be able to see
1366cgroup paths (in /proc/self/cgroup) only inside their root cgroup. 1374cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
1367From within an unshared cgroupns: 1375From within an unshared cgroupns::
1368 1376
1369 # sleep 100000 & 1377 # sleep 100000 &
1370 [1] 7353 1378 [1] 7353
@@ -1373,7 +1381,7 @@ From within an unshared cgroupns:
1373 0::/sub_cgrp_1 1381 0::/sub_cgrp_1
1374 1382
1375From the initial cgroup namespace, the real cgroup path will be 1383From the initial cgroup namespace, the real cgroup path will be
1376visible: 1384visible::
1377 1385
1378 $ cat /proc/7353/cgroup 1386 $ cat /proc/7353/cgroup
1379 0::/batchjobs/container_id1/sub_cgrp_1 1387 0::/batchjobs/container_id1/sub_cgrp_1
@@ -1381,7 +1389,7 @@ visible:
1381From a sibling cgroup namespace (that is, a namespace rooted at a 1389From a sibling cgroup namespace (that is, a namespace rooted at a
1382different cgroup), the cgroup path relative to its own cgroup 1390different cgroup), the cgroup path relative to its own cgroup
1383namespace root will be shown. For instance, if PID 7353's cgroup 1391namespace root will be shown. For instance, if PID 7353's cgroup
1384namespace root is at '/batchjobs/container_id2', then it will see 1392namespace root is at '/batchjobs/container_id2', then it will see::
1385 1393
1386 # cat /proc/7353/cgroup 1394 # cat /proc/7353/cgroup
1387 0::/../container_id2/sub_cgrp_1 1395 0::/../container_id2/sub_cgrp_1
@@ -1390,13 +1398,14 @@ Note that the relative path always starts with '/' to indicate that
1390it's relative to the cgroup namespace root of the caller. 1398it's relative to the cgroup namespace root of the caller.
1391 1399
1392 1400
13936-3. Migration and setns(2) 1401Migration and setns(2)
1402----------------------
1394 1403
1395Processes inside a cgroup namespace can move into and out of the 1404Processes inside a cgroup namespace can move into and out of the
1396namespace root if they have proper access to external cgroups. For 1405namespace root if they have proper access to external cgroups. For
1397example, from inside a namespace with cgroupns root at 1406example, from inside a namespace with cgroupns root at
1398/batchjobs/container_id1, and assuming that the global hierarchy is 1407/batchjobs/container_id1, and assuming that the global hierarchy is
1399still accessible inside cgroupns: 1408still accessible inside cgroupns::
1400 1409
1401 # cat /proc/7353/cgroup 1410 # cat /proc/7353/cgroup
1402 0::/sub_cgrp_1 1411 0::/sub_cgrp_1
@@ -1418,10 +1427,11 @@ namespace. It is expected that the someone moves the attaching
1418process under the target cgroup namespace root. 1427process under the target cgroup namespace root.
1419 1428
1420 1429
14216-4. Interaction with Other Namespaces 1430Interaction with Other Namespaces
1431---------------------------------
1422 1432
1423Namespace specific cgroup hierarchy can be mounted by a process 1433Namespace specific cgroup hierarchy can be mounted by a process
1424running inside a non-init cgroup namespace. 1434running inside a non-init cgroup namespace::
1425 1435
1426 # mount -t cgroup2 none $MOUNT_POINT 1436 # mount -t cgroup2 none $MOUNT_POINT
1427 1437
@@ -1434,27 +1444,27 @@ the view of cgroup hierarchy by namespace-private cgroupfs mount
1434provides a properly isolated cgroup view inside the container. 1444provides a properly isolated cgroup view inside the container.
1435 1445
1436 1446
1437P. Information on Kernel Programming 1447Information on Kernel Programming
1448=================================
1438 1449
1439This section contains kernel programming information in the areas 1450This section contains kernel programming information in the areas
1440where interacting with cgroup is necessary. cgroup core and 1451where interacting with cgroup is necessary. cgroup core and
1441controllers are not covered. 1452controllers are not covered.
1442 1453
1443 1454
1444P-1. Filesystem Support for Writeback 1455Filesystem Support for Writeback
1456--------------------------------
1445 1457
1446A filesystem can support cgroup writeback by updating 1458A filesystem can support cgroup writeback by updating
1447address_space_operations->writepage[s]() to annotate bio's using the 1459address_space_operations->writepage[s]() to annotate bio's using the
1448following two functions. 1460following two functions.
1449 1461
1450 wbc_init_bio(@wbc, @bio) 1462 wbc_init_bio(@wbc, @bio)
1451
1452 Should be called for each bio carrying writeback data and 1463 Should be called for each bio carrying writeback data and
1453 associates the bio with the inode's owner cgroup. Can be 1464 associates the bio with the inode's owner cgroup. Can be
1454 called anytime between bio allocation and submission. 1465 called anytime between bio allocation and submission.
1455 1466
1456 wbc_account_io(@wbc, @page, @bytes) 1467 wbc_account_io(@wbc, @page, @bytes)
1457
1458 Should be called for each data segment being written out. 1468 Should be called for each data segment being written out.
1459 While this function doesn't care exactly when it's called 1469 While this function doesn't care exactly when it's called
1460 during the writeback session, it's the easiest and most 1470 during the writeback session, it's the easiest and most
@@ -1475,7 +1485,8 @@ cases by skipping wbc_init_bio() or using bio_associate_blkcg()
1475directly. 1485directly.
1476 1486
1477 1487
1478D. Deprecated v1 Core Features 1488Deprecated v1 Core Features
1489===========================
1479 1490
1480- Multiple hierarchies including named ones are not supported. 1491- Multiple hierarchies including named ones are not supported.
1481 1492
@@ -1489,9 +1500,11 @@ D. Deprecated v1 Core Features
1489 at the root instead. 1500 at the root instead.
1490 1501
1491 1502
1492R. Issues with v1 and Rationales for v2 1503Issues with v1 and Rationales for v2
1504====================================
1493 1505
1494R-1. Multiple Hierarchies 1506Multiple Hierarchies
1507--------------------
1495 1508
1496cgroup v1 allowed an arbitrary number of hierarchies and each 1509cgroup v1 allowed an arbitrary number of hierarchies and each
1497hierarchy could host any number of controllers. While this seemed to 1510hierarchy could host any number of controllers. While this seemed to
@@ -1543,7 +1556,8 @@ how memory is distributed beyond a certain level while still wanting
1543to control how CPU cycles are distributed. 1556to control how CPU cycles are distributed.
1544 1557
1545 1558
1546R-2. Thread Granularity 1559Thread Granularity
1560------------------
1547 1561
1548cgroup v1 allowed threads of a process to belong to different cgroups. 1562cgroup v1 allowed threads of a process to belong to different cgroups.
1549This didn't make sense for some controllers and those controllers 1563This didn't make sense for some controllers and those controllers
@@ -1586,7 +1600,8 @@ misbehaving and poorly abstracted interfaces and kernel exposing and
1586locked into constructs inadvertently. 1600locked into constructs inadvertently.
1587 1601
1588 1602
1589R-3. Competition Between Inner Nodes and Threads 1603Competition Between Inner Nodes and Threads
1604-------------------------------------------
1590 1605
1591cgroup v1 allowed threads to be in any cgroups which created an 1606cgroup v1 allowed threads to be in any cgroups which created an
1592interesting problem where threads belonging to a parent cgroup and its 1607interesting problem where threads belonging to a parent cgroup and its
@@ -1605,7 +1620,7 @@ simply weren't available for threads.
1605 1620
1606The io controller implicitly created a hidden leaf node for each 1621The io controller implicitly created a hidden leaf node for each
1607cgroup to host the threads. The hidden leaf had its own copies of all 1622cgroup to host the threads. The hidden leaf had its own copies of all
1608the knobs with "leaf_" prefixed. While this allowed equivalent 1623the knobs with ``leaf_`` prefixed. While this allowed equivalent
1609control over internal threads, it was with serious drawbacks. It 1624control over internal threads, it was with serious drawbacks. It
1610always added an extra layer of nesting which wouldn't be necessary 1625always added an extra layer of nesting which wouldn't be necessary
1611otherwise, made the interface messy and significantly complicated the 1626otherwise, made the interface messy and significantly complicated the
@@ -1626,7 +1641,8 @@ This clearly is a problem which needs to be addressed from cgroup core
1626in a uniform way. 1641in a uniform way.
1627 1642
1628 1643
1629R-4. Other Interface Issues 1644Other Interface Issues
1645----------------------
1630 1646
1631cgroup v1 grew without oversight and developed a large number of 1647cgroup v1 grew without oversight and developed a large number of
1632idiosyncrasies and inconsistencies. One issue on the cgroup core side 1648idiosyncrasies and inconsistencies. One issue on the cgroup core side
@@ -1654,9 +1670,11 @@ cgroup v2 establishes common conventions where appropriate and updates
1654controllers so that they expose minimal and consistent interfaces. 1670controllers so that they expose minimal and consistent interfaces.
1655 1671
1656 1672
1657R-5. Controller Issues and Remedies 1673Controller Issues and Remedies
1674------------------------------
1658 1675
1659R-5-1. Memory 1676Memory
1677~~~~~~
1660 1678
1661The original lower boundary, the soft limit, is defined as a limit 1679The original lower boundary, the soft limit, is defined as a limit
1662that is per default unset. As a result, the set of cgroups that 1680that is per default unset. As a result, the set of cgroups that