Diffstat (limited to 'Documentation/cgroups')
-rw-r--r-- | Documentation/cgroups/cgroups.txt | 3
-rw-r--r-- | Documentation/cgroups/devices.txt | 70
-rw-r--r-- | Documentation/cgroups/memory.txt | 72
3 files changed, 139 insertions, 6 deletions
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index bcf1a00b06a1..638bf17ff869 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt | |||
@@ -442,7 +442,7 @@ You can attach the current shell task by echoing 0: | |||
442 | You can use the cgroup.procs file instead of the tasks file to move all | 442 | You can use the cgroup.procs file instead of the tasks file to move all |
443 | threads in a threadgroup at once. Echoing the PID of any task in a | 443 | threads in a threadgroup at once. Echoing the PID of any task in a |
444 | threadgroup to cgroup.procs causes all tasks in that threadgroup to be | 444 | threadgroup to cgroup.procs causes all tasks in that threadgroup to be |
445 | be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks | 445 | attached to the cgroup. Writing 0 to cgroup.procs moves all tasks |
446 | in the writing task's threadgroup. | 446 | in the writing task's threadgroup. |
447 | 447 | ||
448 | Note: Since every task is always a member of exactly one cgroup in each | 448 | Note: Since every task is always a member of exactly one cgroup in each |
@@ -580,6 +580,7 @@ propagation along the hierarchy. See the comment on | |||
580 | cgroup_for_each_descendant_pre() for details. | 580 | cgroup_for_each_descendant_pre() for details. |
581 | 581 | ||
582 | void css_offline(struct cgroup *cgrp); | 582 | void css_offline(struct cgroup *cgrp); |
583 | (cgroup_mutex held by caller) | ||
583 | 584 | ||
584 | This is the counterpart of css_online() and called iff css_online() | 585 | This is the counterpart of css_online() and called iff css_online() |
585 | has succeeded on @cgrp. This signifies the beginning of the end of | 586 | has succeeded on @cgrp. This signifies the beginning of the end of |
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt index 16624a7f8222..3c1095ca02ea 100644 --- a/Documentation/cgroups/devices.txt +++ b/Documentation/cgroups/devices.txt | |||
@@ -13,9 +13,7 @@ either an integer or * for all. Access is a composition of r | |||
13 | The root device cgroup starts with rwm to 'all'. A child device | 13 | The root device cgroup starts with rwm to 'all'. A child device |
14 | cgroup gets a copy of the parent. Administrators can then remove | 14 | cgroup gets a copy of the parent. Administrators can then remove |
15 | devices from the whitelist or add new entries. A child cgroup can | 15 | devices from the whitelist or add new entries. A child cgroup can |
16 | never receive a device access which is denied by its parent. However | 16 | never receive a device access which is denied by its parent. |
17 | when a device access is removed from a parent it will not also be | ||
18 | removed from the child(ren). | ||
19 | 17 | ||
20 | 2. User Interface | 18 | 2. User Interface |
21 | 19 | ||
@@ -50,3 +48,69 @@ task to a new cgroup. (Again we'll probably want to change that). | |||
50 | 48 | ||
51 | A cgroup may not be granted more permissions than the cgroup's | 49 | A cgroup may not be granted more permissions than the cgroup's |
52 | parent has. | 50 | parent has. |
51 | |||
52 | 4. Hierarchy | ||
53 | |||
54 | device cgroups maintain hierarchy by making sure a cgroup never has more | ||
55 | access permissions than its parent. Every time an entry is written to | ||
56 | a cgroup's devices.deny file, all its children will have that entry removed | ||
57 | from their whitelist and all the locally set whitelist entries will be | ||
58 | re-evaluated. In case one of the locally set whitelist entries would provide | ||
59 | more access than the cgroup's parent, it'll be removed from the whitelist. | ||
60 | |||
61 | Example: | ||
62 | A | ||
63 | / \ | ||
64 | B | ||
65 | |||
66 | group behavior exceptions | ||
67 | A allow "b 8:* rwm", "c 116:1 rw" | ||
68 | B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" | ||
69 | |||
70 | If a device is denied in group A: | ||
71 | # echo "c 116:* r" > A/devices.deny | ||
72 | it'll propagate down and after revalidating B's entries, the whitelist entry | ||
73 | "c 116:2 rwm" will be removed: | ||
74 | |||
75 | group whitelist entries denied devices | ||
76 | A all "b 8:* rwm", "c 116:* rw" | ||
77 | B "c 1:3 rwm", "b 3:* rwm" all the rest | ||
78 | |||
79 | In case parent's exceptions change and local exceptions are not allowed | ||
80 | anymore, they'll be deleted. | ||
81 | |||
82 | Notice that new whitelist entries will not be propagated: | ||
83 | A | ||
84 | / \ | ||
85 | B | ||
86 | |||
87 | group whitelist entries denied devices | ||
88 | A "c 1:3 rwm", "c 1:5 r" all the rest | ||
89 | B "c 1:3 rwm", "c 1:5 r" all the rest | ||
90 | |||
91 | when adding "c *:3 rwm": | ||
92 | # echo "c *:3 rwm" >A/devices.allow | ||
93 | |||
94 | the result: | ||
95 | group whitelist entries denied devices | ||
96 | A "c *:3 rwm", "c 1:5 r" all the rest | ||
97 | B "c 1:3 rwm", "c 1:5 r" all the rest | ||
98 | |||
99 | but now it'll be possible to add new entries to B: | ||
100 | # echo "c 2:3 rwm" >B/devices.allow | ||
101 | # echo "c 50:3 r" >B/devices.allow | ||
102 | or even | ||
103 | # echo "c *:3 rwm" >B/devices.allow | ||
104 | |||
105 | Allowing or denying all by writing 'a' to devices.allow or devices.deny will | ||
106 | not be possible once the device cgroup has children. | ||
107 | |||
108 | 4.1 Hierarchy (internal implementation) | ||
109 | |||
110 | device cgroups is implemented internally using a behavior (ALLOW, DENY) and a | ||
111 | list of exceptions. The internal state is controlled using the same user | ||
112 | interface to preserve compatibility with the previous whitelist-only | ||
113 | implementation. Removal or addition of exceptions that will reduce the access | ||
114 | to devices will be propagated down the hierarchy. | ||
115 | For every propagated exception, the effective rules will be re-evaluated based | ||
116 | on current parent's access rules. | ||
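The revalidation step in the devices.txt example above can be sketched as a small model. This is only an illustration under simplifying assumptions, not the kernel's algorithm: the parent is treated as "allow all" with a list of denied exceptions, the child as "deny all" with a list of allowed exceptions, and a child entry is dropped as soon as it overlaps a denied parent entry in device and in at least one access letter. The helper names (parse, overlaps, conflicts, revalidate) are hypothetical, and the 'a' (all) device type is not handled.

```python
def parse(entry):
    """Split an entry such as 'c 116:2 rwm' into (type, major, minor, access set)."""
    dev_type, numbers, access = entry.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def overlaps(a, b):
    """True if two major (or minor) fields can name the same number."""
    return a == "*" or b == "*" or a == b


def conflicts(denied, allowed):
    """True if an allowed entry touches a device and access denied by the parent."""
    d_type, d_major, d_minor, d_access = parse(denied)
    a_type, a_major, a_minor, a_access = parse(allowed)
    return (
        d_type == a_type
        and overlaps(d_major, a_major)
        and overlaps(d_minor, a_minor)
        and bool(d_access & a_access)
    )


def revalidate(parent_denied, child_allowed):
    """Keep only the child entries that no denied parent entry conflicts with."""
    return [
        entry
        for entry in child_allowed
        if not any(conflicts(denied, entry) for denied in parent_denied)
    ]


# State of group A after: echo "c 116:* r" > A/devices.deny
parent_denied = ["b 8:* rwm", "c 116:* rw"]
# Locally set whitelist entries of group B before revalidation
child_allowed = ["c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"]

print(revalidate(parent_denied, child_allowed))
```

With the entries from the example, only "c 116:2 rwm" is dropped, which matches the table shown for group B after the deny.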
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 8b8c28b9864c..09027a9fece5 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt | |||
@@ -40,6 +40,7 @@ Features: | |||
40 | - soft limit | 40 | - soft limit |
41 | - moving (recharging) account at moving a task is selectable. | 41 | - moving (recharging) account at moving a task is selectable. |
42 | - usage threshold notifier | 42 | - usage threshold notifier |
43 | - memory pressure notifier | ||
43 | - oom-killer disable knob and oom-notifier | 44 | - oom-killer disable knob and oom-notifier |
44 | - Root cgroup has no limit controls. | 45 | - Root cgroup has no limit controls. |
45 | 46 | ||
@@ -65,6 +66,7 @@ Brief summary of control files. | |||
65 | memory.stat # show various statistics | 66 | memory.stat # show various statistics |
66 | memory.use_hierarchy # set/show hierarchical account enabled | 67 | memory.use_hierarchy # set/show hierarchical account enabled |
67 | memory.force_empty # trigger forced move charge to parent | 68 | memory.force_empty # trigger forced move charge to parent |
69 | memory.pressure_level # set memory pressure notifications | ||
68 | memory.swappiness # set/show swappiness parameter of vmscan | 70 | memory.swappiness # set/show swappiness parameter of vmscan |
69 | (See sysctl's vm.swappiness) | 71 | (See sysctl's vm.swappiness) |
70 | memory.move_charge_at_immigrate # set/show controls of moving charges | 72 | memory.move_charge_at_immigrate # set/show controls of moving charges |
@@ -194,7 +196,7 @@ the cgroup that brought it in -- this will happen on memory pressure). | |||
194 | But see section 8.2: when moving a task to another cgroup, its pages may | 196 | But see section 8.2: when moving a task to another cgroup, its pages may |
195 | be recharged to the new cgroup, if move_charge_at_immigrate has been chosen. | 197 | be recharged to the new cgroup, if move_charge_at_immigrate has been chosen. |
196 | 198 | ||
197 | Exception: If CONFIG_CGROUP_CGROUP_MEMCG_SWAP is not used. | 199 | Exception: If CONFIG_MEMCG_SWAP is not used. |
198 | When you do swapoff and make swapped-out pages of shmem(tmpfs) to | 200 | When you do swapoff and make swapped-out pages of shmem(tmpfs) to |
199 | be backed into memory in force, charges for pages are accounted against the | 201 | be backed into memory in force, charges for pages are accounted against the |
200 | caller of swapoff rather than the users of shmem. | 202 | caller of swapoff rather than the users of shmem. |
@@ -762,7 +764,73 @@ At reading, current status of OOM is shown. | |||
762 | under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may | 764 | under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may |
763 | be stopped.) | 765 | be stopped.) |
764 | 766 | ||
765 | 11. TODO | 767 | 11. Memory Pressure |
768 | |||
769 | The pressure level notifications can be used to monitor the memory | ||
770 | allocation cost; based on the pressure, applications can implement | ||
771 | different strategies of managing their memory resources. The pressure | ||
772 | levels are defined as follows: | ||
773 | |||
774 | The "low" level means that the system is reclaiming memory for new | ||
775 | allocations. Monitoring this reclaiming activity might be useful for | ||
776 | maintaining cache level. Upon notification, the program (typically | ||
777 | "Activity Manager") might analyze vmstat and act in advance (i.e. | ||
778 | prematurely shutdown unimportant services). | ||
779 | |||
780 | The "medium" level means that the system is experiencing medium memory | ||
781 | pressure, the system might be making swap, paging out active file caches, | ||
782 | etc. Upon this event applications may decide to further analyze | ||
783 | vmstat/zoneinfo/memcg or internal memory usage statistics and free any | ||
784 | resources that can be easily reconstructed or re-read from a disk. | ||
785 | |||
786 | The "critical" level means that the system is actively thrashing, it is | ||
787 | about to run out of memory (OOM) or even the in-kernel OOM killer is on its | ||
788 | way to trigger. Applications should do whatever they can to help the | ||
789 | system. It might be too late to consult with vmstat or any other | ||
790 | statistics, so it's advisable to take an immediate action. | ||
791 | |||
792 | The events are propagated upward until the event is handled, i.e. the | ||
793 | events are not pass-through. Here is what this means: for example you have | ||
794 | three cgroups: A->B->C. Now you set up an event listener on cgroups A, B | ||
795 | and C, and suppose group C experiences some pressure. In this situation, | ||
796 | only group C will receive the notification, i.e. groups A and B will not | ||
797 | receive it. This is done to avoid excessive "broadcasting" of messages, | ||
798 | which disturbs the system and which is especially bad if we are low on | ||
799 | memory or thrashing. So, organize the cgroups wisely, or propagate the | ||
800 | events manually (or, ask us to implement the pass-through events, | ||
801 | explaining why you would need them.) | ||
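The "propagated upward until handled" rule can be sketched as a walk from the cgroup under pressure toward the root that stops at the first cgroup with a registered listener. This is a toy model of the rule as described in the text, not kernel code; notified_cgroup, parent_of and listeners are hypothetical names.

```python
def notified_cgroup(origin, parent_of, listeners):
    """Return the first cgroup at or above origin that has a listener, else None."""
    node = origin
    while node is not None:
        if node in listeners:
            return node  # the event is handled here and goes no further
        node = parent_of.get(node)
    return None


# Hierarchy A->B->C from the text, expressed as child -> parent.
parent_of = {"C": "B", "B": "A", "A": None}

# Listeners on A, B and C: only C is notified when C is under pressure.
print(notified_cgroup("C", parent_of, {"A", "B", "C"}))

# Listener on A only: the event travels upward until A handles it.
print(notified_cgroup("C", parent_of, {"A"}))
```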
802 | |||
803 | The file memory.pressure_level is only used to set up an eventfd. To | ||
804 | register a notification, an application must: | ||
805 | |||
806 | - create an eventfd using eventfd(2); | ||
807 | - open memory.pressure_level; | ||
808 | - write string like "<event_fd> <fd of memory.pressure_level> <level>" | ||
809 | to cgroup.event_control. | ||
810 | |||
811 | Application will be notified through eventfd when memory pressure is at | ||
812 | the specific level (or higher). Read/write operations to | ||
813 | memory.pressure_level are not implemented. | ||
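The three registration steps above might look like this from user space. It is a minimal sketch, assuming Python 3.10 or newer on Linux (for os.eventfd) and a cgroup directory that exposes memory.pressure_level and cgroup.event_control; the function names and the choice to keep both descriptors open are assumptions of this example, not part of the interface description.

```python
import os

LEVELS = ("low", "medium", "critical")


def build_event_control_line(event_fd, pressure_fd, level):
    """Format "<event_fd> <fd of memory.pressure_level> <level>"."""
    if level not in LEVELS:
        raise ValueError(f"unknown pressure level: {level!r}")
    return f"{event_fd} {pressure_fd} {level}"


def register_pressure_listener(cgroup_dir, level):
    """Register an eventfd for the given level; return (event_fd, pressure_fd).

    The caller owns both descriptors and should close them when done.
    """
    event_fd = os.eventfd(0)  # step 1: create an eventfd
    pressure_fd = os.open(    # step 2: open memory.pressure_level
        os.path.join(cgroup_dir, "memory.pressure_level"), os.O_RDONLY
    )
    # step 3: write the registration string to cgroup.event_control
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as control:
        control.write(build_event_control_line(event_fd, pressure_fd, level))
    return event_fd, pressure_fd


def wait_for_event(event_fd):
    """Block until the kernel signals the eventfd; return the 64-bit counter."""
    return int.from_bytes(os.read(event_fd, 8), "little")
```

A caller would run register_pressure_listener("/sys/fs/cgroup/memory/foo", "low") and then loop on wait_for_event(event_fd); the path is the one used in the test script of this section.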
814 | |||
815 | Test: | ||
816 | |||
817 | Here is a small script example that makes a new cgroup, sets up a | ||
818 | memory limit, sets up a notification in the cgroup and then makes child | ||
819 | cgroup experience a critical pressure: | ||
820 | |||
821 | # cd /sys/fs/cgroup/memory/ | ||
822 | # mkdir foo | ||
823 | # cd foo | ||
824 | # cgroup_event_listener memory.pressure_level low & | ||
825 | # echo 8000000 > memory.limit_in_bytes | ||
826 | # echo 8000000 > memory.memsw.limit_in_bytes | ||
827 | # echo $$ > tasks | ||
828 | # dd if=/dev/zero | read x | ||
829 | |||
830 | (Expect a bunch of notifications, and eventually, the oom-killer will | ||
831 | trigger.) | ||
832 | |||
833 | 12. TODO | ||
766 | 834 | ||
767 | 1. Add support for accounting huge pages (as a separate controller) | 835 | 1. Add support for accounting huge pages (as a separate controller) |
768 | 2. Make per-cgroup scanner reclaim not-shared pages first | 836 | 2. Make per-cgroup scanner reclaim not-shared pages first |