Diffstat (limited to 'Documentation/cgroups')
-rw-r--r--  Documentation/cgroups/cgroups.txt   3
-rw-r--r--  Documentation/cgroups/devices.txt  70
-rw-r--r--  Documentation/cgroups/memory.txt   72
3 files changed, 139 insertions, 6 deletions
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index bcf1a00b06a1..638bf17ff869 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -442,7 +442,7 @@ You can attach the current shell task by echoing 0:
 You can use the cgroup.procs file instead of the tasks file to move all
 threads in a threadgroup at once. Echoing the PID of any task in a
 threadgroup to cgroup.procs causes all tasks in that threadgroup to be
-be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
 in the writing task's threadgroup.
 
 Note: Since every task is always a member of exactly one cgroup in each
@@ -580,6 +580,7 @@ propagation along the hierarchy. See the comment on
 cgroup_for_each_descendant_pre() for details.
 
 void css_offline(struct cgroup *cgrp);
+(cgroup_mutex held by caller)
 
 This is the counterpart of css_online() and called iff css_online()
 has succeeded on @cgrp. This signifies the beginning of the end of
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 16624a7f8222..3c1095ca02ea 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -13,9 +13,7 @@ either an integer or * for all. Access is a composition of r
 The root device cgroup starts with rwm to 'all'. A child device
 cgroup gets a copy of the parent. Administrators can then remove
 devices from the whitelist or add new entries. A child cgroup can
-never receive a device access which is denied by its parent. However
-when a device access is removed from a parent it will not also be
-removed from the child(ren).
+never receive a device access which is denied by its parent.
 
 2. User Interface
 
@@ -50,3 +48,69 @@ task to a new cgroup. (Again we'll probably want to change that).
 
 A cgroup may not be granted more permissions than the cgroup's
 parent has.
+
+4. Hierarchy
+
+device cgroups maintain hierarchy by making sure a cgroup never has more
+access permissions than its parent. Every time an entry is written to
+a cgroup's devices.deny file, all its children will have that entry removed
+from their whitelist and all the locally set whitelist entries will be
+re-evaluated. In case one of the locally set whitelist entries would provide
+more access than the cgroup's parent, it'll be removed from the whitelist.
+
+Example:
+      A
+     / \
+        B
+
+    group        behavior        exceptions
+    A            allow           "b 8:* rwm", "c 116:1 rw"
+    B            deny            "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
+
+If a device is denied in group A:
+    # echo "c 116:* r" > A/devices.deny
+it'll propagate down and after revalidating B's entries, the whitelist entry
+"c 116:2 rwm" will be removed:
+
+    group        whitelist entries               denied devices
+    A            all                             "b 8:* rwm", "c 116:* rw"
+    B            "c 1:3 rwm", "b 3:* rwm"        all the rest
+
+In case parent's exceptions change and local exceptions are not allowed
+anymore, they'll be deleted.
+
+Notice that new whitelist entries will not be propagated:
+      A
+     / \
+        B
+
+    group        whitelist entries               denied devices
+    A            "c 1:3 rwm", "c 1:5 r"          all the rest
+    B            "c 1:3 rwm", "c 1:5 r"          all the rest
+
+when adding "c *:3 rwm":
+    # echo "c *:3 rwm" >A/devices.allow
+
+the result:
+    group        whitelist entries               denied devices
+    A            "c *:3 rwm", "c 1:5 r"          all the rest
+    B            "c 1:3 rwm", "c 1:5 r"          all the rest
+
+but now it'll be possible to add new entries to B:
+    # echo "c 2:3 rwm" >B/devices.allow
+    # echo "c 50:3 r" >B/devices.allow
+or even
+    # echo "c *:3 rwm" >B/devices.allow
+
+Allowing or denying all by writing 'a' to devices.allow or devices.deny will
+not be possible once the device cgroup has children.
+
+4.1 Hierarchy (internal implementation)
+
+device cgroups are implemented internally using a behavior (ALLOW, DENY) and a
+list of exceptions. The internal state is controlled using the same user
+interface to preserve compatibility with the previous whitelist-only
+implementation. Removal or addition of exceptions that will reduce the access
+to devices will be propagated down the hierarchy.
+For every propagated exception, the effective rules will be re-evaluated based
+on current parent's access rules.
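
The re-evaluation described in sections 4 and 4.1 can be sketched as a small
model. This is an illustration only, not the kernel implementation: the names
Rule, parse_rule, covers, overlaps, parent_permits and revalidate are
hypothetical, rules are assumed to have the form "<type> <major>:<minor>
<access>", and a child entry is assumed to need coverage by a single parent
exception rather than by a combination of several.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    dev_type: str        # 'b' (block) or 'c' (char)
    major: str           # a number as text, or '*'
    minor: str           # a number as text, or '*'
    access: frozenset    # subset of {'r', 'w', 'm'}


def parse_rule(text):
    """Parse an entry such as 'c 116:* rw' (hypothetical helper)."""
    dev_type, devno, access = text.split()
    major, minor = devno.split(":")
    return Rule(dev_type, major, minor, frozenset(access))


def covers(outer, inner):
    """True if `outer` grants everything that `inner` asks for."""
    return (outer.dev_type == inner.dev_type
            and outer.major in ("*", inner.major)
            and outer.minor in ("*", inner.minor)
            and inner.access <= outer.access)


def overlaps(a, b):
    """True if `a` and `b` share at least one device and access bit."""
    def num_overlap(x, y):
        return x == "*" or y == "*" or x == y
    return (a.dev_type == b.dev_type
            and num_overlap(a.major, b.major)
            and num_overlap(a.minor, b.minor)
            and bool(a.access & b.access))


def parent_permits(behavior, exceptions, entry):
    """Check one child whitelist entry against the parent's state.

    behavior 'allow': exceptions are denied devices, so the entry must
    not overlap any of them.
    behavior 'deny': exceptions are the whitelist, so one of them must
    cover the entry (single-rule coverage is an assumption of this model).
    """
    rule = parse_rule(entry)
    parsed = [parse_rule(e) for e in exceptions]
    if behavior == "allow":
        return not any(overlaps(e, rule) for e in parsed)
    return any(covers(e, rule) for e in parsed)


def revalidate(behavior, exceptions, child_whitelist):
    """Drop child entries that the parent no longer permits."""
    return [entry for entry in child_whitelist
            if parent_permits(behavior, exceptions, entry)]


# First example from section 4: A's state after the deny, applied to B.
print(revalidate("allow",
                 ["b 8:* rwm", "c 116:* rw"],
                 ["c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"]))
# prints ['c 1:3 rwm', 'b 3:* rwm']
```

With a whitelist-style parent, parent_permits("deny", ["c *:3 rwm", "c 1:5 r"],
"c 2:3 rwm") returns True in this model, which mirrors the second example
where B may gain entries only after A's whitelist has been widened.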
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 8b8c28b9864c..09027a9fece5 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
  - soft limit
  - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
+ - memory pressure notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
@@ -65,6 +66,7 @@ Brief summary of control files.
  memory.stat                     # show various statistics
  memory.use_hierarchy            # set/show hierarchical account enabled
  memory.force_empty              # trigger forced move charge to parent
+ memory.pressure_level           # set memory pressure notifications
  memory.swappiness               # set/show swappiness parameter of vmscan
                                  (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -194,7 +196,7 @@ the cgroup that brought it in -- this will happen on memory pressure).
 But see section 8.2: when moving a task to another cgroup, its pages may
 be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
 
-Exception: If CONFIG_CGROUP_CGROUP_MEMCG_SWAP is not used.
+Exception: If CONFIG_MEMCG_SWAP is not used.
 When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
 caller of swapoff rather than the users of shmem.
@@ -762,7 +764,73 @@ At reading, current status of OOM is shown.
         under_oom        0 or 1 (if 1, the memory cgroup is under OOM, tasks may
                          be stopped.)
 
-11. TODO
+11. Memory Pressure
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies of managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (i.e.
+prematurely shut down unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure; the system might be swapping, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing; it is
+about to run out of memory (OOM) or even the in-kernel OOM killer is on its
+way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take an immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: for example you have
+three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
+and C, and suppose group C experiences some pressure. In this situation,
+only group C will receive the notification, i.e. groups A and B will not
+receive it. This is done to avoid excessive "broadcasting" of messages,
+which disturbs the system and which is especially bad if we are low on
+memory or thrashing. So, organize the cgroups wisely, or propagate the
+events manually (or, ask us to implement the pass-through events,
+explaining why you would need them.)
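
The "propagated upward until handled" rule can be sketched as a short model
of the A->B->C example. This is not kernel code: deliver_pressure_event,
parent_of and listeners are hypothetical names, and the sketch assumes that a
listener counts as handling an event only when the event's level is at or
above the level the listener registered for.

```python
# Pressure levels in increasing order of severity.
LEVELS = ("low", "medium", "critical")


def deliver_pressure_event(origin, level, parent_of, listeners):
    """Return the cgroup whose listener receives the event, or None.

    origin    -- name of the cgroup that experiences the pressure
    level     -- one of LEVELS
    parent_of -- dict mapping a cgroup name to its parent (None for the root)
    listeners -- dict mapping a cgroup name to the levels registered on it
    """
    rank = LEVELS.index(level)
    cgroup = origin
    while cgroup is not None:
        registered = listeners.get(cgroup, ())
        # Assumption: a listener fires for its own level or any higher one.
        if any(LEVELS.index(wanted) <= rank for wanted in registered):
            return cgroup          # handled here, so it goes no further up
        cgroup = parent_of.get(cgroup)
    return None                    # nobody along the path was listening


# The hierarchy from the text: A is the parent of B, B is the parent of C.
parent_of = {"A": None, "B": "A", "C": "B"}

everyone = {"A": ["low"], "B": ["low"], "C": ["low"]}
print(deliver_pressure_event("C", "medium", parent_of, everyone))
# prints C

only_a = {"A": ["low"]}
print(deliver_pressure_event("C", "medium", parent_of, only_a))
# prints A
```

In the first call all three groups listen and only C is notified. In the
second call nothing is registered on B or C, so the event climbs to A.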
+
+The file memory.pressure_level is only used to set up an eventfd. To
+register a notification, an application must:
+
+- create an eventfd using eventfd(2);
+- open memory.pressure_level;
+- write a string like "<event_fd> <fd of memory.pressure_level> <level>"
+  to cgroup.event_control.
+
+Application will be notified through eventfd when memory pressure is at
+the specific level (or higher). Read/write operations to
+memory.pressure_level are not implemented.
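
The three registration steps could look roughly like the following sketch.
It assumes Linux and Python 3.10 or later (for os.eventfd) and a cgroup
directory that provides both memory.pressure_level and cgroup.event_control;
event_control_line and register_pressure_listener are hypothetical helper
names, not part of any kernel or library API.

```python
import os

LEVELS = ("low", "medium", "critical")


def event_control_line(event_fd, pressure_fd, level):
    """Build the "<event_fd> <fd of memory.pressure_level> <level>" string."""
    if level not in LEVELS:
        raise ValueError(f"unknown pressure level: {level!r}")
    return f"{event_fd} {pressure_fd} {level}"


def register_pressure_listener(cgroup_dir, level):
    """Register for pressure notifications on one cgroup.

    Returns (event_fd, pressure_fd); the caller owns both descriptors and
    should close them once notifications are no longer wanted.
    """
    line = event_control_line(0, 0, level)   # validate the level early
    event_fd = os.eventfd(0)                 # step 1: eventfd(2)
    pressure_fd = -1
    try:
        # Step 2: open memory.pressure_level (it is only opened, never read).
        pressure_fd = os.open(
            os.path.join(cgroup_dir, "memory.pressure_level"), os.O_RDONLY)
        # Step 3: write the registration string to cgroup.event_control.
        line = event_control_line(event_fd, pressure_fd, level)
        control_fd = os.open(
            os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            os.write(control_fd, line.encode())
        finally:
            os.close(control_fd)
    except BaseException:
        # Do not leak descriptors if any step fails.
        if pressure_fd >= 0:
            os.close(pressure_fd)
        os.close(event_fd)
        raise
    return event_fd, pressure_fd


# Possible usage (the path is an example; needs a mounted memory cgroup):
#   efd, pfd = register_pressure_listener("/sys/fs/cgroup/memory/foo", "low")
#   os.eventfd_read(efd)   # blocks until the kernel signals the eventfd
```

Keeping the string formatting in its own function makes the level check easy
to exercise without touching a real cgroup hierarchy.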
+
+Test:
+
+   Here is a small script example that makes a new cgroup, sets up a
+   memory limit, sets up a notification in the cgroup and then makes child
+   cgroup experience a critical pressure:
+
+   # cd /sys/fs/cgroup/memory/
+   # mkdir foo
+   # cd foo
+   # cgroup_event_listener memory.pressure_level low &
+   # echo 8000000 > memory.limit_in_bytes
+   # echo 8000000 > memory.memsw.limit_in_bytes
+   # echo $$ > tasks
+   # dd if=/dev/zero | read x
+
+   (Expect a bunch of notifications, and eventually, the oom-killer will
+   trigger.)
+
+12. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first