3 files changed, 139 insertions, 6 deletions
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index bcf1a00b06a1..638bf17ff869 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -442,7 +442,7 @@ You can attach the current shell task by echoing 0:
 You can use the cgroup.procs file instead of the tasks file to move all
 threads in a threadgroup at once. Echoing the PID of any task in a
 threadgroup to cgroup.procs causes all tasks in that threadgroup to be
-be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
 in the writing task's threadgroup.
 Note: Since every task is always a member of exactly one cgroup in each
@@ -580,6 +580,7 @@ propagation along the hierarchy. See the comment on
 cgroup_for_each_descendant_pre() for details.
 void css_offline(struct cgroup *cgrp);
+(cgroup_mutex held by caller)
 This is the counterpart of css_online() and called iff css_online()
 has succeeded on @cgrp. This signifies the beginning of the end of
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 16624a7f8222..3c1095ca02ea 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -13,9 +13,7 @@ either an integer or * for all.  Access is a composition of r
 The root device cgroup starts with rwm to 'all'.  A child device
 cgroup gets a copy of the parent.  Administrators can then remove
 devices from the whitelist or add new entries.  A child cgroup can
-never receive a device access which is denied by its parent.  However
+never receive a device access which is denied by its parent.
-when a device access is removed from a parent it will not also be
-removed from the child(ren).
 2. User Interface
@@ -50,3 +48,69 @@ task to a new cgroup.  (Again we'll probably want to change that).
 A cgroup may not be granted more permissions than the cgroup's
 parent has.
+4. Hierarchy
+device cgroups maintain hierarchy by making sure a cgroup never has more
+access permissions than its parent.  Every time an entry is written to
+a cgroup's devices.deny file, all its children will have that entry removed
+from their whitelist and all the locally set whitelist entries will be
+re-evaluated.  In case one of the locally set whitelist entries would provide
+more access than the cgroup's parent, it'll be removed from the whitelist.
+Example:
+      A
+     / \
+        B
+    group        behavior       exceptions
+    A            allow          "b 8:* rwm", "c 116:1 rw"
+    B            deny           "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
+If a device is denied in group A:
+        # echo "c 116:* r" > A/devices.deny
+it'll propagate down and after revalidating B's entries, the whitelist entry
+"c 116:2 rwm" will be removed:
+    group        whitelist entries                        denied devices
+    A            all                                      "b 8:* rwm", "c 116:* rw"
+    B            "c 1:3 rwm", "b 3:* rwm"                 all the rest
+If the parent's exceptions change so that local exceptions are no longer
+allowed, they will be deleted.
+Notice that new whitelist entries will not be propagated:
+      A
+     / \
+        B
+    group        whitelist entries                        denied devices
+    A            "c 1:3 rwm", "c 1:5 r"                   all the rest
+    B            "c 1:3 rwm", "c 1:5 r"                   all the rest
+when adding "c *:3 rwm":
+        # echo "c *:3 rwm" >A/devices.allow
+the result:
+    group        whitelist entries                        denied devices
+    A            "c *:3 rwm", "c 1:5 r"                   all the rest
+    B            "c 1:3 rwm", "c 1:5 r"                   all the rest
+but now it'll be possible to add new entries to B:
+        # echo "c 2:3 rwm" >B/devices.allow
+        # echo "c 50:3 r" >B/devices.allow
+or even
+        # echo "c *:3 rwm" >B/devices.allow
+Allowing or denying all by writing 'a' to devices.allow or devices.deny will
+not be possible once the device cgroup has children.
+4.1 Hierarchy (internal implementation)
+Device cgroups are implemented internally using a behavior (ALLOW, DENY) and a
+list of exceptions.  The internal state is controlled using the same user
+interface to preserve compatibility with the previous whitelist-only
+implementation.  Removal or addition of exceptions that will reduce the access
+to devices will be propagated down the hierarchy.
+For every propagated exception, the effective rules will be re-evaluated based
+on the parent's current access rules.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 8b8c28b9864c..09027a9fece5 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
 - soft limit
 - moving (recharging) account at moving a task is selectable.
 - usage threshold notifier
+ - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.
@@ -65,6 +66,7 @@ Brief summary of control files.
 memory.stat                     # show various statistics
 memory.use_hierarchy            # set/show hierarchical account enabled
 memory.force_empty              # trigger forced move charge to parent
+ memory.pressure_level           # set memory pressure notifications
 memory.swappiness               # set/show swappiness parameter of vmscan
                                 (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -194,7 +196,7 @@ the cgroup that brought it in -- this will happen on memory pressure).
 But see section 8.2: when moving a task to another cgroup, its pages may
 be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
-Exception: If CONFIG_CGROUP_CGROUP_MEMCG_SWAP is not used.
+Exception: If CONFIG_MEMCG_SWAP is not used.
 When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
 caller of swapoff rather than the users of shmem.
@@ -762,7 +764,73 @@ At reading, current status of OOM is shown.
        under_oom        0 or 1 (if 1, the memory cgroup is under OOM, tasks may
                                 be stopped.)
-11. TODO
+11. Memory Pressure
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies of managing their memory resources. The pressure
+levels are defined as follows:
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (e.g.
+shut down unimportant services early).
+The "medium" level means that the system is experiencing medium memory
+pressure; the system might be swapping, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+The "critical" level means that the system is actively thrashing; it is
+about to run out of memory (OOM), or the in-kernel OOM killer is even on
+its way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take immediate action.
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: for example you have
+three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
+and C, and suppose group C experiences some pressure. In this situation,
+only group C will receive the notification, i.e. groups A and B will not
+receive it. This is done to avoid excessive "broadcasting" of messages,
+which disturbs the system and which is especially bad if we are low on
+memory or thrashing. So, organize the cgroups wisely, or propagate the
+events manually (or, ask us to implement the pass-through events,
+explaining why you would need them.)
+The file memory.pressure_level is only used to set up an eventfd. To
+register a notification, an application must:
+- create an eventfd using eventfd(2);
+- open memory.pressure_level;
+- write a string like "<event_fd> <fd of memory.pressure_level> <level>"
+  to cgroup.event_control.
+The application will be notified through eventfd when memory pressure is at
+the specified level (or higher). Read/write operations on
+memory.pressure_level are not implemented.
+Test:
+   Here is a small script example that makes a new cgroup, sets up a
+   memory limit, sets up a notification in the cgroup and then makes the child
+   cgroup experience critical pressure:
+   # cd /sys/fs/cgroup/memory/
+   # mkdir foo
+   # cd foo
+   # cgroup_event_listener memory.pressure_level low &
+   # echo 8000000 > memory.limit_in_bytes
+   # echo 8000000 > memory.memsw.limit_in_bytes
+   # echo $$ > tasks
+   # dd if=/dev/zero | read x
+   (Expect a bunch of notifications, and eventually, the oom-killer will
+   trigger.)
+12. TODO
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
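
The three registration steps described in the new "Memory Pressure" section (create an eventfd, open memory.pressure_level, write the registration string to cgroup.event_control) could be sketched roughly as follows. This is only an illustration and not part of the patch: the helper names build_registration() and wait_for_pressure() are hypothetical, and the sketch assumes Python 3.10 or later on Linux (for os.eventfd) and a mounted memory cgroup directory.

```python
import os

# Levels named in the "Memory Pressure" section of the patch.
LEVELS = ("low", "medium", "critical")


def build_registration(event_fd: int, pressure_fd: int, level: str) -> str:
    """Format the "<event_fd> <fd of memory.pressure_level> <level>" string."""
    if level not in LEVELS:
        raise ValueError(f"unknown pressure level: {level!r}")
    return f"{event_fd} {pressure_fd} {level}"


def wait_for_pressure(cgroup_dir: str, level: str = "low") -> int:
    """Register for one level and block until a notification arrives.

    Hypothetical helper: returns the eventfd counter value that was read.
    """
    efd = os.eventfd(0)  # step 1: create an eventfd, see eventfd(2)
    # Step 2: open memory.pressure_level; it is only used for registration.
    pfd = os.open(os.path.join(cgroup_dir, "memory.pressure_level"), os.O_RDONLY)
    try:
        # Step 3: write the registration string to cgroup.event_control.
        cfd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            os.write(cfd, build_registration(efd, pfd, level).encode())
        finally:
            os.close(cfd)
        # Blocks until the kernel signals the eventfd.
        return os.eventfd_read(efd)
    finally:
        os.close(pfd)
        os.close(efd)
```

With the cgroup from the test script in the patch, a caller might run wait_for_pressure("/sys/fs/cgroup/memory/foo", "low") as root. Keeping the string formatting in its own function makes the registration format easy to check without a cgroup mount.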
