Diffstat (limited to 'Documentation/cgroups')
-rw-r--r-- | Documentation/cgroups/cgroups.txt | 11 | ||||
-rw-r--r-- | Documentation/cgroups/cpuacct.txt | 32 | ||||
-rw-r--r-- | Documentation/cgroups/cpusets.txt | 817 | ||||
-rw-r--r-- | Documentation/cgroups/devices.txt | 52 | ||||
-rw-r--r-- | Documentation/cgroups/memcg_test.txt | 362 | ||||
-rw-r--r-- | Documentation/cgroups/memory.txt | 399 | ||||
-rw-r--r-- | Documentation/cgroups/resource_counter.txt | 181 |
7 files changed, 1848 insertions, 6 deletions
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index e33ee74eee77..93feb8444489 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -1,7 +1,8 @@ | |||
1 | CGROUPS | 1 | CGROUPS |
2 | ------- | 2 | ------- |
3 | 3 | ||
4 | Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt | 4 | Written by Paul Menage <menage@google.com> based on |
5 | Documentation/cgroups/cpusets.txt | ||
5 | 6 | ||
6 | Original copyright statements from cpusets.txt: | 7 | Original copyright statements from cpusets.txt: |
7 | Portions Copyright (C) 2004 BULL SA. | 8 | Portions Copyright (C) 2004 BULL SA. |
@@ -68,7 +69,7 @@ On their own, the only use for cgroups is for simple job | |||
68 | tracking. The intention is that other subsystems hook into the generic | 69 | tracking. The intention is that other subsystems hook into the generic |
69 | cgroup support to provide new attributes for cgroups, such as | 70 | cgroup support to provide new attributes for cgroups, such as |
70 | accounting/limiting the resources which processes in a cgroup can | 71 | accounting/limiting the resources which processes in a cgroup can |
71 | access. For example, cpusets (see Documentation/cpusets.txt) allows | 72 | access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows |
72 | you to associate a set of CPUs and a set of memory nodes with the | 73 | you to associate a set of CPUs and a set of memory nodes with the |
73 | tasks in each cgroup. | 74 | tasks in each cgroup. |
74 | 75 | ||
@@ -251,10 +252,8 @@ cgroup file system directories. | |||
251 | When a task is moved from one cgroup to another, it gets a new | 252 | When a task is moved from one cgroup to another, it gets a new |
252 | css_set pointer - if there's an already existing css_set with the | 253 | css_set pointer - if there's an already existing css_set with the |
253 | desired collection of cgroups then that group is reused, else a new | 254 | desired collection of cgroups then that group is reused, else a new |
254 | css_set is allocated. Note that the current implementation uses a | 255 | css_set is allocated. The appropriate existing css_set is located by |
255 | linear search to locate an appropriate existing css_set, so isn't | 256 | looking into a hash table. |
256 | very efficient. A future version will use a hash table for better | ||
257 | performance. | ||
258 | 257 | ||
259 | To allow access from a cgroup to the css_sets (and hence tasks) | 258 | To allow access from a cgroup to the css_sets (and hence tasks) |
260 | that comprise it, a set of cg_cgroup_link objects form a lattice; | 259 | that comprise it, a set of cg_cgroup_link objects form a lattice; |
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
new file mode 100644
index 000000000000..bb775fbe43d7
--- /dev/null
+++ b/Documentation/cgroups/cpuacct.txt
@@ -0,0 +1,32 @@ | |||
1 | CPU Accounting Controller | ||
2 | ------------------------- | ||
3 | |||
4 | The CPU accounting controller is used to group tasks using cgroups and | ||
5 | account the CPU usage of these groups of tasks. | ||
6 | |||
7 | The CPU accounting controller supports multi-hierarchy groups. An accounting | ||
8 | group accumulates the CPU usage of all of its child groups and the tasks | ||
9 | directly present in its group. | ||
10 | |||
11 | Accounting groups can be created by first mounting the cgroup filesystem. | ||
12 | |||
13 | # mkdir /cgroups | ||
14 | # mount -t cgroup -ocpuacct none /cgroups | ||
15 | |||
16 | With the above step, the initial or the parent accounting group | ||
17 | becomes visible at /cgroups. At bootup, this group includes all the | ||
18 | tasks in the system. /cgroups/tasks lists the tasks in this cgroup. | ||
19 | /cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by | ||
20 | this group which is essentially the CPU time obtained by all the tasks | ||
21 | in the system. | ||
22 | |||
23 | New accounting groups can be created under the parent group /cgroups. | ||
24 | |||
25 | # cd /cgroups | ||
26 | # mkdir g1 | ||
27 | # echo $$ > g1/tasks | ||
28 | |||
29 | The above steps create a new group g1 and move the current shell | ||
30 | process (bash) into it. CPU time consumed by this bash and its children | ||
31 | can be obtained from g1/cpuacct.usage and the same is accumulated in | ||
32 | /cgroups/cpuacct.usage also. | ||
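
As a minimal sketch of consuming the value described above, the following Python helper converts the nanosecond count read from a cpuacct.usage file into seconds. The function names are hypothetical, and the path passed to `read_usage_seconds` is assumed to follow the /cgroups mount point used in this example.

```python
def usage_seconds(raw: str) -> float:
    """Convert the text of a cpuacct.usage file (nanoseconds) to seconds."""
    return int(raw.strip()) / 1_000_000_000


def read_usage_seconds(path: str) -> float:
    """Read a cpuacct.usage file, e.g. /cgroups/g1/cpuacct.usage (assumed path)."""
    with open(path) as f:
        return usage_seconds(f.read())
```

Sampling the value twice and dividing the difference by the elapsed wall-clock time gives an approximate CPU utilization for the group.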
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
new file mode 100644
index 000000000000..0611e9528c7c
--- /dev/null
+++ b/Documentation/cgroups/cpusets.txt
@@ -0,0 +1,817 @@ | |||
1 | CPUSETS | ||
2 | ------- | ||
3 | |||
4 | Copyright (C) 2004 BULL SA. | ||
5 | Written by Simon.Derr@bull.net | ||
6 | |||
7 | Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | ||
8 | Modified by Paul Jackson <pj@sgi.com> | ||
9 | Modified by Christoph Lameter <clameter@sgi.com> | ||
10 | Modified by Paul Menage <menage@google.com> | ||
11 | Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> | ||
12 | |||
13 | CONTENTS: | ||
14 | ========= | ||
15 | |||
16 | 1. Cpusets | ||
17 | 1.1 What are cpusets ? | ||
18 | 1.2 Why are cpusets needed ? | ||
19 | 1.3 How are cpusets implemented ? | ||
20 | 1.4 What are exclusive cpusets ? | ||
21 | 1.5 What is memory_pressure ? | ||
22 | 1.6 What is memory spread ? | ||
23 | 1.7 What is sched_load_balance ? | ||
24 | 1.8 What is sched_relax_domain_level ? | ||
25 | 1.9 How do I use cpusets ? | ||
26 | 2. Usage Examples and Syntax | ||
27 | 2.1 Basic Usage | ||
28 | 2.2 Adding/removing cpus | ||
29 | 2.3 Setting flags | ||
30 | 2.4 Attaching processes | ||
31 | 3. Questions | ||
32 | 4. Contact | ||
33 | |||
34 | 1. Cpusets | ||
35 | ========== | ||
36 | |||
37 | 1.1 What are cpusets ? | ||
38 | ---------------------- | ||
39 | |||
40 | Cpusets provide a mechanism for assigning a set of CPUs and Memory | ||
41 | Nodes to a set of tasks. In this document "Memory Node" refers to | ||
42 | an on-line node that contains memory. | ||
43 | |||
44 | Cpusets constrain the CPU and Memory placement of tasks to only | ||
45 | the resources within a task's current cpuset. They form a nested | ||
46 | hierarchy visible in a virtual file system. These are the essential | ||
47 | hooks, beyond what is already present, required to manage dynamic | ||
48 | job placement on large systems. | ||
49 | |||
50 | Cpusets use the generic cgroup subsystem described in | ||
51 | Documentation/cgroups/cgroups.txt. | ||
52 | |||
53 | Requests by a task, using the sched_setaffinity(2) system call to | ||
54 | include CPUs in its CPU affinity mask, and using the mbind(2) and | ||
55 | set_mempolicy(2) system calls to include Memory Nodes in its memory | ||
56 | policy, are both filtered through that task's cpuset, filtering out any | ||
57 | CPUs or Memory Nodes not in that cpuset. The scheduler will not | ||
58 | schedule a task on a CPU that is not allowed in its cpus_allowed | ||
59 | vector, and the kernel page allocator will not allocate a page on a | ||
60 | node that is not allowed in the requesting task's mems_allowed vector. | ||
61 | |||
62 | User level code may create and destroy cpusets by name in the cgroup | ||
63 | virtual file system, manage the attributes and permissions of these | ||
64 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | ||
65 | specify and query to which cpuset a task is assigned, and list the | ||
66 | task pids assigned to a cpuset. | ||
67 | |||
68 | |||
69 | 1.2 Why are cpusets needed ? | ||
70 | ---------------------------- | ||
71 | |||
72 | The management of large computer systems, with many processors (CPUs), | ||
73 | complex memory cache hierarchies and multiple Memory Nodes having | ||
74 | non-uniform access times (NUMA) presents additional challenges for | ||
75 | the efficient scheduling and memory placement of processes. | ||
76 | |||
77 | Frequently more modest sized systems can be operated with adequate | ||
78 | efficiency just by letting the operating system automatically share | ||
79 | the available CPU and Memory resources amongst the requesting tasks. | ||
80 | |||
81 | But larger systems, which benefit more from careful processor and | ||
82 | memory placement to reduce memory access times and contention, | ||
83 | and which typically represent a larger investment for the customer, | ||
84 | can benefit from explicitly placing jobs on properly sized subsets of | ||
85 | the system. | ||
86 | |||
87 | This can be especially valuable on: | ||
88 | |||
89 | * Web Servers running multiple instances of the same web application, | ||
90 | * Servers running different applications (for instance, a web server | ||
91 | and a database), or | ||
92 | * NUMA systems running large HPC applications with demanding | ||
93 | performance characteristics. | ||
94 | |||
95 | These subsets, or "soft partitions" must be able to be dynamically | ||
96 | adjusted, as the job mix changes, without impacting other concurrently | ||
97 | executing jobs. The location of the running job's pages may also be moved | ||
98 | when the memory locations are changed. | ||
99 | |||
100 | The kernel cpuset patch provides the minimum essential kernel | ||
101 | mechanisms required to efficiently implement such subsets. It | ||
102 | leverages existing CPU and Memory Placement facilities in the Linux | ||
103 | kernel to avoid any additional impact on the critical scheduler or | ||
104 | memory allocator code. | ||
105 | |||
106 | |||
107 | 1.3 How are cpusets implemented ? | ||
108 | --------------------------------- | ||
109 | |||
110 | Cpusets provide a Linux kernel mechanism to constrain which CPUs and | ||
111 | Memory Nodes are used by a process or set of processes. | ||
112 | |||
113 | The Linux kernel already has a pair of mechanisms to specify on which | ||
114 | CPUs a task may be scheduled (sched_setaffinity) and on which Memory | ||
115 | Nodes it may obtain memory (mbind, set_mempolicy). | ||
116 | |||
117 | Cpusets extend these two mechanisms as follows: | ||
118 | |||
119 | - Cpusets are sets of allowed CPUs and Memory Nodes, known to the | ||
120 | kernel. | ||
121 | - Each task in the system is attached to a cpuset, via a pointer | ||
122 | in the task structure to a reference counted cgroup structure. | ||
123 | - Calls to sched_setaffinity are filtered to just those CPUs | ||
124 | allowed in that task's cpuset. | ||
125 | - Calls to mbind and set_mempolicy are filtered to just | ||
126 | those Memory Nodes allowed in that task's cpuset. | ||
127 | - The root cpuset contains all the system's CPUs and Memory | ||
128 | Nodes. | ||
129 | - For any cpuset, one can define child cpusets containing a subset | ||
130 | of the parent's CPU and Memory Node resources. | ||
131 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for | ||
132 | browsing and manipulation from user space. | ||
133 | - A cpuset may be marked exclusive, which ensures that no other | ||
134 | cpuset (except direct ancestors and descendants) may contain | ||
135 | any overlapping CPUs or Memory Nodes. | ||
136 | - You can list all the tasks (by pid) attached to any cpuset. | ||
137 | |||
138 | The implementation of cpusets requires a few, simple hooks | ||
139 | into the rest of the kernel, none in performance critical paths: | ||
140 | |||
141 | - in init/main.c, to initialize the root cpuset at system boot. | ||
142 | - in fork and exit, to attach and detach a task from its cpuset. | ||
143 | - in sched_setaffinity, to mask the requested CPUs by what's | ||
144 | allowed in that task's cpuset. | ||
145 | - in sched.c migrate_live_tasks(), to keep migrating tasks within | ||
146 | the CPUs allowed by their cpuset, if possible. | ||
147 | - in the mbind and set_mempolicy system calls, to mask the requested | ||
148 | Memory Nodes by what's allowed in that task's cpuset. | ||
149 | - in page_alloc.c, to restrict memory to allowed nodes. | ||
150 | - in vmscan.c, to restrict page recovery to the current cpuset. | ||
151 | |||
152 | You should mount the "cgroup" filesystem type in order to enable | ||
153 | browsing and modifying the cpusets presently known to the kernel. No | ||
154 | new system calls are added for cpusets - all support for querying and | ||
155 | modifying cpusets is via this cpuset file system. | ||
156 | |||
157 | The /proc/<pid>/status file for each task has four added lines, | ||
158 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) | ||
159 | and mems_allowed (on which Memory Nodes it may obtain memory), | ||
160 | in the two formats seen in the following example: | ||
161 | |||
162 | Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff | ||
163 | Cpus_allowed_list: 0-127 | ||
164 | Mems_allowed: ffffffff,ffffffff | ||
165 | Mems_allowed_list: 0-63 | ||
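
The two formats carry the same information. The Python sketch below illustrates the relationship by decoding a mask string into ids and re-encoding them in list format. It assumes the mask is a comma-separated sequence of zero-padded 32-bit hex words with the most significant word first; the function names are hypothetical and are not part of any kernel interface.

```python
def mask_to_ids(mask: str) -> list[int]:
    """Decode a comma-separated hex mask (most significant word first)."""
    value = int(mask.replace(",", ""), 16)
    # Bit N set in the mask means CPU (or Memory Node) N is allowed.
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def to_list_format(ids: list[int]) -> str:
    """Collapse ascending ids into range-list form, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ranges = []
    for i in ids:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1][1] = i          # extend the current run
        else:
            ranges.append([i, i])      # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
```

Applied to the example above, the 128-bit Cpus_allowed mask decodes to CPUs 0 through 127, matching the Cpus_allowed_list line.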
166 | |||
167 | Each cpuset is represented by a directory in the cgroup file system | ||
168 | containing (on top of the standard cgroup files) the following | ||
169 | files describing that cpuset: | ||
170 | |||
171 | - cpus: list of CPUs in that cpuset | ||
172 | - mems: list of Memory Nodes in that cpuset | ||
173 | - memory_migrate flag: if set, move pages to cpuset's nodes | ||
174 | - cpu_exclusive flag: is cpu placement exclusive? | ||
175 | - mem_exclusive flag: is memory placement exclusive? | ||
176 | - mem_hardwall flag: is memory allocation hardwalled | ||
177 | - memory_pressure: measure of how much paging pressure in cpuset | ||
178 | - memory_spread_page flag: if set, spread page cache evenly on allowed nodes | ||
179 | - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes | ||
180 | - sched_load_balance flag: if set, load balance within CPUs on that cpuset | ||
181 | - sched_relax_domain_level: the searching range when migrating tasks | ||
182 | |||
183 | In addition, only the root cpuset has the following file: | ||
184 | - memory_pressure_enabled flag: compute memory_pressure? | ||
185 | |||
186 | New cpusets are created using the mkdir system call or shell | ||
187 | command. The properties of a cpuset, such as its flags, allowed | ||
188 | CPUs and Memory Nodes, and attached tasks, are modified by writing | ||
189 | to the appropriate file in that cpuset's directory, as listed above. | ||
190 | |||
191 | The named hierarchical structure of nested cpusets allows partitioning | ||
192 | a large system into nested, dynamically changeable, "soft-partitions". | ||
193 | |||
194 | The attachment of each task, automatically inherited at fork by any | ||
195 | children of that task, to a cpuset allows organizing the work load | ||
196 | on a system into related sets of tasks such that each set is constrained | ||
197 | to using the CPUs and Memory Nodes of a particular cpuset. A task | ||
198 | may be re-attached to any other cpuset, if allowed by the permissions | ||
199 | on the necessary cpuset file system directories. | ||
200 | |||
201 | Such management of a system "in the large" integrates smoothly with | ||
202 | the detailed placement done on individual tasks and memory regions | ||
203 | using the sched_setaffinity, mbind and set_mempolicy system calls. | ||
204 | |||
205 | The following rules apply to each cpuset: | ||
206 | |||
207 | - Its CPUs and Memory Nodes must be a subset of its parent's. | ||
208 | - It can't be marked exclusive unless its parent is. | ||
209 | - If its cpu or memory is exclusive, they may not overlap any sibling. | ||
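
The three rules can be sketched as a check over a toy hierarchy. This Python model is illustrative only: the `Cpuset` class and `violations` function are hypothetical, only CPUs are modeled (Memory Nodes would follow the same pattern), and it does not reflect how the kernel validates changes.

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    name: str
    cpus: set
    cpu_exclusive: bool = False
    children: list = field(default_factory=list)


def violations(parent):
    """Return a list of rule violations found below `parent`."""
    problems = []
    for child in parent.children:
        # Rule 1: a child's CPUs must be a subset of its parent's.
        if not child.cpus <= parent.cpus:
            problems.append(f"{child.name}: cpus not a subset of {parent.name}")
        # Rule 2: a child cannot be exclusive unless its parent is.
        if child.cpu_exclusive and not parent.cpu_exclusive:
            problems.append(f"{child.name}: exclusive but parent {parent.name} is not")
        # Rule 3: an exclusive child may not overlap any sibling.
        for other in parent.children:
            if other is not child and child.cpu_exclusive and child.cpus & other.cpus:
                problems.append(f"{child.name}: overlaps sibling {other.name}")
        problems.extend(violations(child))
    return problems
```

Because each rule only compares a cpuset with its parent and siblings, the check never needs to look at unrelated branches of the hierarchy.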
210 | |||
211 | These rules, and the natural hierarchy of cpusets, enable efficient | ||
212 | enforcement of the exclusive guarantee, without having to scan all | ||
213 | cpusets every time any of them change to ensure nothing overlaps an | ||
214 | exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | ||
215 | to represent the cpuset hierarchy provides for a familiar permission | ||
216 | and name space for cpusets, with a minimum of additional kernel code. | ||
217 | |||
218 | The cpus and mems files in the root (top_cpuset) cpuset are | ||
219 | read-only. The cpus file automatically tracks the value of | ||
220 | cpu_online_map using a CPU hotplug notifier, and the mems file | ||
221 | automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e., | ||
222 | nodes with memory--using the cpuset_track_online_nodes() hook. | ||
223 | |||
224 | |||
225 | 1.4 What are exclusive cpusets ? | ||
226 | -------------------------------- | ||
227 | |||
228 | If a cpuset is cpu or mem exclusive, no other cpuset, other than | ||
229 | a direct ancestor or descendant, may share any of the same CPUs or | ||
230 | Memory Nodes. | ||
231 | |||
232 | A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", | ||
233 | i.e. it restricts kernel allocations for page, buffer and other data | ||
234 | commonly shared by the kernel across multiple users. All cpusets, | ||
235 | whether hardwalled or not, restrict allocations of memory for user | ||
236 | space. This enables configuring a system so that several independent | ||
237 | jobs can share common kernel data, such as file system pages, while | ||
238 | isolating each job's user allocation in its own cpuset. To do this, | ||
239 | construct a large mem_exclusive cpuset to hold all the jobs, and | ||
240 | construct child, non-mem_exclusive cpusets for each individual job. | ||
241 | Only a small amount of typical kernel memory, such as requests from | ||
242 | interrupt handlers, is allowed to be taken outside even a | ||
243 | mem_exclusive cpuset. | ||
244 | |||
245 | |||
246 | 1.5 What is memory_pressure ? | ||
247 | ----------------------------- | ||
248 | The memory_pressure of a cpuset provides a simple per-cpuset metric | ||
249 | of the rate at which the tasks in a cpuset are attempting to free up | ||
250 | in-use memory on the nodes of the cpuset to satisfy additional memory | ||
251 | requests. | ||
252 | |||
253 | This enables batch managers monitoring jobs running in dedicated | ||
254 | cpusets to efficiently detect what level of memory pressure that job | ||
255 | is causing. | ||
256 | |||
257 | This is useful both on tightly managed systems running a wide mix of | ||
258 | submitted jobs, which may choose to terminate or re-prioritize jobs that | ||
259 | are trying to use more memory than allowed on the nodes assigned to them, | ||
260 | and with tightly coupled, long running, massively parallel scientific | ||
261 | computing jobs that will dramatically fail to meet required performance | ||
262 | goals if they start to use more memory than allowed to them. | ||
263 | |||
264 | This mechanism provides a very economical way for the batch manager | ||
265 | to monitor a cpuset for signs of memory pressure. It's up to the | ||
266 | batch manager or other user code to decide what to do about it and | ||
267 | take action. | ||
268 | |||
269 | ==> Unless this feature is enabled by writing "1" to the special file | ||
270 | /dev/cpuset/memory_pressure_enabled, the hook in the rebalance | ||
271 | code of __alloc_pages() for this metric reduces to simply noticing | ||
272 | that the cpuset_memory_pressure_enabled flag is zero. So only | ||
273 | systems that enable this feature will compute the metric. | ||
274 | |||
275 | Why a per-cpuset, running average: | ||
276 | |||
277 | Because this meter is per-cpuset, rather than per-task or mm, | ||
278 | the system load imposed by a batch scheduler monitoring this | ||
279 | metric is sharply reduced on large systems, because a scan of | ||
280 | the tasklist can be avoided on each set of queries. | ||
281 | |||
282 | Because this meter is a running average, instead of an accumulating | ||
283 | counter, a batch scheduler can detect memory pressure with a | ||
284 | single read, instead of having to read and accumulate results | ||
285 | for a period of time. | ||
286 | |||
287 | Because this meter is per-cpuset rather than per-task or mm, | ||
288 | the batch scheduler can obtain the key information, memory | ||
289 | pressure in a cpuset, with a single read, rather than having to | ||
290 | query and accumulate results over all the (dynamically changing) | ||
291 | set of tasks in the cpuset. | ||
292 | |||
293 | A per-cpuset simple digital filter (requires a spinlock and 3 words | ||
294 | of data per-cpuset) is kept, and updated by any task attached to that | ||
295 | cpuset, if it enters the synchronous (direct) page reclaim code. | ||
296 | |||
297 | A per-cpuset file provides an integer number representing the recent | ||
298 | (half-life of 10 seconds) rate of direct page reclaims caused by | ||
299 | the tasks in the cpuset, in units of reclaims attempted per second, | ||
300 | times 1000. | ||
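
To illustrate the idea of a running average with a 10 second half-life, here is a floating-point Python sketch. It is not the kernel's filter: the `PressureMeter` class, its method names, and the use of explicit timestamps are assumptions made for the example, and the kernel's integer arithmetic will produce different numbers.

```python
import math

HALF_LIFE = 10.0  # seconds, as stated in the text


class PressureMeter:
    """Exponentially decayed event counter, read as events/second times 1000."""

    def __init__(self):
        self.value = 0.0  # decayed count of direct-reclaim events
        self.last = 0.0   # timestamp of the last update, in seconds

    def _decay(self, now):
        # Halve the stored count for every HALF_LIFE seconds elapsed.
        self.value *= 0.5 ** ((now - self.last) / HALF_LIFE)
        self.last = now

    def mark(self, now):
        """Record one direct page reclaim at time `now`."""
        self._decay(now)
        self.value += 1.0

    def read(self, now):
        """Return the estimated recent rate, scaled by 1000."""
        self._decay(now)
        # At a steady rate r the decayed count settles at r * HALF_LIFE / ln 2.
        rate = self.value * math.log(2) / HALF_LIFE
        return int(rate * 1000)
```

A single read gives the recent rate directly, which is the property the text relies on: the monitor does not have to sample twice and subtract.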
301 | |||
302 | |||
303 | 1.6 What is memory spread ? | ||
304 | --------------------------- | ||
305 | There are two boolean flag files per cpuset that control where the | ||
306 | kernel allocates pages for the file system buffers and related | ||
307 | in-kernel data structures. They are called 'memory_spread_page' and | ||
308 | 'memory_spread_slab'. | ||
309 | |||
310 | If the per-cpuset boolean flag file 'memory_spread_page' is set, then | ||
311 | the kernel will spread the file system buffers (page cache) evenly | ||
312 | over all the nodes that the faulting task is allowed to use, instead | ||
313 | of preferring to put those pages on the node where the task is running. | ||
314 | |||
315 | If the per-cpuset boolean flag file 'memory_spread_slab' is set, | ||
316 | then the kernel will spread some file system related slab caches, | ||
317 | such as for inodes and dentries evenly over all the nodes that the | ||
318 | faulting task is allowed to use, instead of preferring to put those | ||
319 | pages on the node where the task is running. | ||
320 | |||
321 | The setting of these flags does not affect anonymous data segment or | ||
322 | stack segment pages of a task. | ||
323 | |||
324 | By default, both kinds of memory spreading are off, and memory | ||
325 | pages are allocated on the node local to where the task is running, | ||
326 | except perhaps as modified by the task's NUMA mempolicy or cpuset | ||
327 | configuration, so long as sufficient free memory pages are available. | ||
328 | |||
329 | When new cpusets are created, they inherit the memory spread settings | ||
330 | of their parent. | ||
331 | |||
332 | Setting memory spreading causes allocations for the affected page | ||
333 | or slab caches to ignore the task's NUMA mempolicy and be spread | ||
334 | instead. Tasks using mbind() or set_mempolicy() calls to set NUMA | ||
335 | mempolicies will not notice any change in these calls as a result of | ||
336 | their containing task's memory spread settings. If memory spreading | ||
337 | is turned off, then the currently specified NUMA mempolicy once again | ||
338 | applies to memory page allocations. | ||
339 | |||
340 | Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | ||
341 | files. By default they contain "0", meaning that the feature is off | ||
342 | for that cpuset. If a "1" is written to that file, then that turns | ||
343 | the named feature on. | ||
344 | |||
345 | The implementation is simple. | ||
346 | |||
347 | Setting the flag 'memory_spread_page' turns on a per-process flag | ||
348 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | ||
349 | joins that cpuset. The page allocation calls for the page cache | ||
350 | are modified to perform an inline check for this PF_SPREAD_PAGE task | ||
351 | flag, and if set, a call to a new routine cpuset_mem_spread_node() | ||
352 | returns the node to prefer for the allocation. | ||
353 | |||
354 | Similarly, setting 'memory_spread_slab' turns on the flag | ||
355 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate | ||
356 | pages from the node returned by cpuset_mem_spread_node(). | ||
357 | |||
358 | The cpuset_mem_spread_node() routine is also simple. It uses the | ||
359 | value of a per-task rotor cpuset_mem_spread_rotor to select the next | ||
360 | node in the current task's mems_allowed to prefer for the allocation. | ||
361 | |||
362 | This memory placement policy is also known (in other contexts) as | ||
363 | round-robin or interleave. | ||
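
The rotor behaviour can be sketched in a few lines of Python. This is a simplified model, not the kernel routine: the `Task` class is hypothetical, the `rotor` attribute stands in for cpuset_mem_spread_rotor, and the starting value of -1 is an assumption chosen so that the first call returns the lowest allowed node.

```python
class Task:
    def __init__(self, mems_allowed):
        self.mems_allowed = sorted(mems_allowed)
        self.rotor = -1  # hypothetical stand-in for cpuset_mem_spread_rotor


def mem_spread_node(task):
    """Return the next allowed node after the rotor, wrapping around."""
    later = [n for n in task.mems_allowed if n > task.rotor]
    # Advance to the next higher allowed node, or wrap to the lowest one.
    task.rotor = later[0] if later else task.mems_allowed[0]
    return task.rotor
```

For a task allowed on nodes 0, 2 and 3, successive calls cycle through 0, 2, 3, 0, 2 and so on, which is the round-robin placement described above.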
364 | |||
365 | This policy can provide substantial improvements for jobs that need | ||
366 | to place thread local data on the corresponding node, but that need | ||
367 | to access large file system data sets that need to be spread across | ||
368 | the several nodes in the job's cpuset in order to fit. Without this | ||
369 | policy, especially for jobs that might have one thread reading in the | ||
370 | data set, the memory allocation across the nodes in the job's cpuset | ||
371 | can become very uneven. | ||
372 | |||
373 | 1.7 What is sched_load_balance ? | ||
374 | -------------------------------- | ||
375 | |||
376 | The kernel scheduler (kernel/sched.c) automatically load balances | ||
377 | tasks. If one CPU is underutilized, kernel code running on that | ||
378 | CPU will look for tasks on other more overloaded CPUs and move those | ||
379 | tasks to itself, within the constraints of such placement mechanisms | ||
380 | as cpusets and sched_setaffinity. | ||
381 | |||
382 | The algorithmic cost of load balancing and its impact on key shared | ||
383 | kernel data structures such as the task list increases more than | ||
384 | linearly with the number of CPUs being balanced. So the scheduler | ||
385 | has support to partition the system's CPUs into a number of sched | ||
386 | domains such that it only load balances within each sched domain. | ||
387 | Each sched domain covers some subset of the CPUs in the system; | ||
388 | no two sched domains overlap; some CPUs might not be in any sched | ||
389 | domain and hence won't be load balanced. | ||
390 | |||
391 | Put simply, it costs less to balance between two smaller sched domains | ||
392 | than one big one, but doing so means that overloads in one of the | ||
393 | two domains won't be load balanced to the other one. | ||
394 | |||
395 | By default, there is one sched domain covering all CPUs, except those | ||
396 | marked isolated using the kernel boot time "isolcpus=" argument. | ||
397 | |||
398 | This default load balancing across all CPUs is not well suited for | ||
399 | the following two situations: | ||
400 | 1) On large systems, load balancing across many CPUs is expensive. | ||
401 | If the system is managed using cpusets to place independent jobs | ||
402 | on separate sets of CPUs, full load balancing is unnecessary. | ||
403 | 2) Systems supporting realtime on some CPUs need to minimize | ||
404 | system overhead on those CPUs, including avoiding task load | ||
405 | balancing if that is not needed. | ||
406 | |||
407 | When the per-cpuset flag "sched_load_balance" is enabled (the default | ||
408 | setting), it requests that all the CPUs in that cpuset's allowed 'cpus' | ||
409 | be contained in a single sched domain, ensuring that load balancing | ||
410 | can move a task (not otherwise pinned, as by sched_setaffinity) | ||
411 | from any CPU in that cpuset to any other. | ||
412 | |||
413 | When the per-cpuset flag "sched_load_balance" is disabled, then the | ||
414 | scheduler will avoid load balancing across the CPUs in that cpuset, | ||
415 | --except-- in so far as is necessary because some overlapping cpuset | ||
416 | has "sched_load_balance" enabled. | ||
417 | |||
418 | So, for example, if the top cpuset has the flag "sched_load_balance" | ||
419 | enabled, then the scheduler will have one sched domain covering all | ||
420 | CPUs, and the setting of the "sched_load_balance" flag in any other | ||
421 | cpusets won't matter, as we're already fully load balancing. | ||
422 | |||
423 | Therefore in the above two situations, the top cpuset flag | ||
424 | "sched_load_balance" should be disabled, and only some of the smaller, | ||
425 | child cpusets have this flag enabled. | ||
426 | |||
427 | When doing this, you don't usually want to leave any unpinned tasks in | ||
428 | the top cpuset that might use non-trivial amounts of CPU, as such tasks | ||
429 | may be artificially constrained to some subset of CPUs, depending on | ||
430 | the particulars of this flag setting in descendant cpusets. Even if | ||
431 | such a task could use spare CPU cycles in some other CPUs, the kernel | ||
432 | scheduler might not consider the possibility of load balancing that | ||
433 | task to that underused CPU. | ||
434 | |||
435 | Of course, tasks pinned to a particular CPU can be left in a cpuset | ||
436 | that disables "sched_load_balance" as those tasks aren't going anywhere | ||
437 | else anyway. | ||
438 | |||
439 | There is an impedance mismatch here, between cpusets and sched domains. | ||
440 | Cpusets are hierarchical and nest. Sched domains are flat; they don't | ||
441 | overlap and each CPU is in at most one sched domain. | ||
442 | |||
443 | It is necessary for sched domains to be flat because load balancing | ||
444 | across partially overlapping sets of CPUs would risk unstable dynamics | ||
445 | that would be beyond our understanding. So if each of two partially | ||
446 | overlapping cpusets enables the flag 'sched_load_balance', then we | ||
447 | form a single sched domain that is a superset of both. We won't move | ||
448 | a task to a CPU outside its cpuset, but the scheduler load balancing | ||
449 | code might waste some compute cycles considering that possibility. | ||
450 | |||
451 | This mismatch is why there is not a simple one-to-one relation | ||
452 | between which cpusets have the flag "sched_load_balance" enabled, | ||
453 | and the sched domain configuration. If a cpuset enables the flag, it | ||
454 | will get balancing across all its CPUs, but if it disables the flag, | ||
455 | it will only be assured of no load balancing if no other overlapping | ||
456 | cpuset enables the flag. | ||
457 | |||
458 | If two cpusets have partially overlapping 'cpus' allowed, and only | ||
459 | one of them has this flag enabled, then the other may find its | ||
460 | tasks only partially load balanced, just on the overlapping CPUs. | ||
461 | This is just the general case of the top_cpuset example given a few | ||
462 | paragraphs above. In the general case, as in the top cpuset case, | ||
463 | don't leave tasks that might use non-trivial amounts of CPU in | ||
464 | such partially load balanced cpusets, as they may be artificially | ||
465 | constrained to some subset of the CPUs allowed to them, for lack of | ||
466 | load balancing to the other CPUs. | ||
467 | |||
468 | 1.7.1 sched_load_balance implementation details. | ||
469 | ------------------------------------------------ | ||
470 | |||
471 | The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary | ||
472 | to most cpuset flags.) When enabled for a cpuset, the kernel will | ||
473 | ensure that it can load balance across all the CPUs in that cpuset | ||
474 | (makes sure that all the CPUs in the cpus_allowed of that cpuset are | ||
475 | in the same sched domain.) | ||
476 | |||
477 | If two overlapping cpusets both have 'sched_load_balance' enabled, | ||
478 | then they will be (must be) both in the same sched domain. | ||
479 | |||
480 | If, as is the default, the top cpuset has 'sched_load_balance' enabled, | ||
481 | then by the above that means there is a single sched domain covering | ||
482 | the whole system, regardless of any other cpuset settings. | ||
483 | |||
484 | The kernel commits to user space that it will avoid load balancing | ||
485 | where it can. It will pick as fine a granularity partition of sched | ||
486 | domains as it can while still providing load balancing for any set | ||
487 | of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | ||
488 | |||
489 | The internal kernel cpuset to scheduler interface passes from the | ||
490 | cpuset code to the scheduler code a partition of the load balanced | ||
491 | CPUs in the system. This partition is a set of subsets (represented | ||
492 | as an array of struct cpumask) of CPUs, pairwise disjoint, that cover | ||
493 | all the CPUs that must be load balanced. | ||
494 | |||
495 | The cpuset code builds a new such partition and passes it to the | ||
496 | scheduler sched domain setup code, to have the sched domains rebuilt | ||
497 | as necessary, whenever: | ||
498 | - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes, | ||
499 | - or CPUs come or go from a cpuset with this flag enabled, | ||
500 | - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs | ||
501 | and with this flag enabled changes, | ||
502 | - or a cpuset with non-empty CPUs and with this flag enabled is removed, | ||
503 | - or a cpu is offlined/onlined. | ||
504 | |||
505 | This partition exactly defines what sched domains the scheduler should | ||
506 | setup - one sched domain for each element (struct cpumask) in the | ||
507 | partition. | ||
508 | |||
509 | The scheduler remembers the currently active sched domain partitions. | ||
510 | When the scheduler routine partition_sched_domains() is invoked from | ||
511 | the cpuset code to update these sched domains, it compares the new | ||
512 | partition requested with the current, and updates its sched domains, | ||
513 | removing the old and adding the new, for each change. | ||
514 | |||
515 | |||
516 | 1.8 What is sched_relax_domain_level ? | ||
517 | -------------------------------------- | ||
518 | |||
519 | Within a sched domain, the scheduler migrates tasks in 2 ways: periodic | ||
520 | load balancing on tick, and at the time of some scheduling events. | ||
521 | |||
522 | When a task is woken up, the scheduler tries to move it to an idle CPU. | ||
523 | For example, if a task A running on CPU X activates another task B | ||
524 | on the same CPU X, and if CPU Y is X's sibling and is idle, | ||
525 | then the scheduler migrates task B to CPU Y so that task B can start on | ||
526 | CPU Y without waiting for task A on CPU X. | ||
527 | |||
528 | And if a CPU runs out of tasks in its runqueue, the CPU tries to pull | ||
529 | extra tasks from other busy CPUs to help them before it goes | ||
530 | idle. | ||
531 | |||
532 | Of course it takes some searching cost to find movable tasks and/or | ||
533 | idle CPUs, so the scheduler might not search all CPUs in the domain | ||
534 | every time. In fact, on some architectures, the searching range on | ||
535 | events is limited to the same socket or node where the CPU is located, | ||
536 | while the load balance on tick searches all. | ||
537 | |||
538 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | ||
539 | is idle while CPU X and its siblings are busy, the scheduler can't migrate | ||
540 | woken task B from X to Z since Z is out of its searching range. | ||
541 | As a result, task B on CPU X needs to wait for task A or for the load | ||
542 | balance on the next tick. For some applications in special situations, | ||
543 | waiting 1 tick may be too long. | ||
544 | |||
545 | The 'sched_relax_domain_level' file allows you to request a change to | ||
546 | this searching range. The file takes an int value which ideally | ||
547 | indicates the size of the searching range in levels, as follows; | ||
548 | the initial value -1 indicates that the cpuset has no request. | ||
549 | |||
550 | -1 : no request. use system default or follow request of others. | ||
551 | 0 : no search. | ||
552 | 1 : search siblings (hyperthreads in a core). | ||
553 | 2 : search cores in a package. | ||
554 | 3 : search cpus in a node [= system wide on non-NUMA system] | ||
555 | ( 4 : search nodes in a chunk of node [on NUMA system] ) | ||
556 | ( 5 : search system wide [on NUMA system] ) | ||
557 | |||
558 | The system default is architecture dependent. The system default | ||
559 | can be changed using the relax_domain_level= boot parameter. | ||
560 | |||
561 | This file is per-cpuset and affects the sched domain to which the cpuset | ||
562 | belongs. Therefore if the flag 'sched_load_balance' of a cpuset | ||
563 | is disabled, then 'sched_relax_domain_level' has no effect since | ||
564 | there is no sched domain belonging to the cpuset. | ||
565 | |||
566 | If multiple cpusets overlap and hence form a single sched | ||
567 | domain, the largest value among them is used. Be careful: if one | ||
568 | requests 0 and the others are -1, then 0 is used. | ||
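
The "largest value wins" rule can be sketched in shell. This is only an
illustration of the rule stated above, not kernel code; the helper name
effective_relax_level is hypothetical, and its arguments are assumed to be
the 'sched_relax_domain_level' values of all cpusets forming one sched domain.

```shell
# Hypothetical helper: print the level that would apply to one sched domain,
# given the sched_relax_domain_level value of every cpuset forming it.
effective_relax_level() {
    max=-1                      # -1 means "no request"
    for level in "$@"; do
        if [ "$level" -gt "$max" ]; then
            max=$level
        fi
    done
    printf '%s\n' "$max"
}

effective_relax_level -1 0 -1   # a single request of 0 overrides "no request"
effective_relax_level 1 3 -1    # the largest request wins
```

Note that because -1 sorts below 0, a single cpuset requesting 0 is enough
to switch off event-time searching for the whole domain.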
569 | |||
570 | Note that modifying this file will have both good and bad effects, | ||
571 | and whether it is acceptable or not depends on your situation. | ||
572 | Don't modify this file if you are not sure. | ||
573 | |||
574 | If your situation is: | ||
575 | - The migration costs between CPUs can be assumed to be considerably | ||
576 | small (for you) due to your special application's behavior or | ||
577 | special hardware support for CPU cache etc. | ||
578 | - The searching cost doesn't have an impact (for you) or you can make | ||
579 | the searching cost small enough by managing cpusets to be compact etc. | ||
580 | - Low latency is required even if it sacrifices cache hit rate etc. | ||
581 | then increasing 'sched_relax_domain_level' would benefit you. | ||
582 | |||
583 | |||
584 | 1.9 How do I use cpusets ? | ||
585 | -------------------------- | ||
586 | |||
587 | In order to minimize the impact of cpusets on critical kernel | ||
588 | code, such as the scheduler, and due to the fact that the kernel | ||
589 | does not support one task updating the memory placement of another | ||
590 | task directly, the impact on a task of changing its cpuset CPU | ||
591 | or Memory Node placement, or of changing to which cpuset a task | ||
592 | is attached, is subtle. | ||
593 | |||
594 | If a cpuset has its Memory Nodes modified, then for each task attached | ||
595 | to that cpuset, the next time that the kernel attempts to allocate | ||
596 | a page of memory for that task, the kernel will notice the change | ||
597 | in the task's cpuset, and update its per-task memory placement to | ||
598 | remain within the new cpuset's memory placement. If the task was using | ||
599 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with | ||
600 | its new cpuset, then the task will continue to use whatever subset | ||
601 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task | ||
602 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | ||
603 | in the new cpuset, then the task will be essentially treated as if it | ||
604 | was MPOL_BIND bound to the new cpuset (even though its numa placement, | ||
605 | as queried by get_mempolicy(), doesn't change). If a task is moved | ||
606 | from one cpuset to another, then the kernel will adjust the task's | ||
607 | memory placement, as above, the next time that the kernel attempts | ||
608 | to allocate a page of memory for that task. | ||
609 | |||
610 | If a cpuset has its 'cpus' modified, then each task in that cpuset | ||
611 | will have its allowed CPU placement changed immediately. Similarly, | ||
612 | if a task's pid is written to another cpuset's 'tasks' file, then its | ||
613 | allowed CPU placement is changed immediately. If such a task had been | ||
614 | bound to some subset of its cpuset using the sched_setaffinity() call, | ||
615 | the task will be allowed to run on any CPU allowed in its new cpuset, | ||
616 | negating the effect of the prior sched_setaffinity() call. | ||
617 | |||
618 | In summary, the memory placement of a task whose cpuset is changed is | ||
619 | updated by the kernel, on the next allocation of a page for that task, | ||
620 | and the processor placement is updated immediately. | ||
621 | |||
622 | Normally, once a page is allocated (given a physical page | ||
623 | of main memory) then that page stays on whatever node it | ||
624 | was allocated, so long as it remains allocated, even if the | ||
625 | cpuset's memory placement policy 'mems' subsequently changes. | ||
626 | If the cpuset flag file 'memory_migrate' is set true, then when | ||
627 | a task is attached to that cpuset, any pages that task had | ||
628 | allocated to it on nodes in its previous cpuset are migrated | ||
629 | to the task's new cpuset. The relative placement of the page within | ||
630 | the cpuset is preserved during these migration operations if possible. | ||
631 | For example if the page was on the second valid node of the prior cpuset | ||
632 | then the page will be placed on the second valid node of the new cpuset. | ||
633 | |||
634 | Also if 'memory_migrate' is set true, then if that cpuset's | ||
635 | 'mems' file is modified, pages allocated to tasks in that | ||
636 | cpuset, that were on nodes in the previous setting of 'mems', | ||
637 | will be moved to nodes in the new setting of 'mems'. | ||
638 | Pages that were not in the task's prior cpuset, or in the cpuset's | ||
639 | prior 'mems' setting, will not be moved. | ||
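
The relative-placement rule for 'memory_migrate' can be illustrated with a
small shell sketch. It is not the kernel's implementation: remap_node is a
hypothetical helper, node lists are passed as space-separated numbers, and
wrapping around when the new cpuset has fewer nodes is an assumption made
here to cover the "if possible" case.

```shell
# Hypothetical helper: given the old 'mems', the new 'mems' and the node a
# page currently sits on, print the node the page would be migrated to.
remap_node() {
    old=$1 new=$2 node=$3

    # Find the position of the page's node within the old node list.
    pos=0 found=
    for n in $old; do
        if [ "$n" = "$node" ]; then found=1; break; fi
        pos=$((pos + 1))
    done
    # Pages that were not on a node of the prior setting are not moved.
    [ -n "$found" ] || { echo "$node"; return 0; }

    count=0
    for n in $new; do count=$((count + 1)); done
    [ "$count" -gt 0 ] || return 1

    # Assumption: wrap around when the new list is shorter than the old one.
    pos=$((pos % count))
    i=0
    for n in $new; do
        if [ "$i" -eq "$pos" ]; then echo "$n"; return 0; fi
        i=$((i + 1))
    done
}

remap_node "0 2 4" "1 3 5" 2    # second node of the old set -> second of the new
```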
640 | |||
641 | There is an exception to the above. If hotplug functionality is used | ||
642 | to remove all the CPUs that are currently assigned to a cpuset, | ||
643 | then all the tasks in that cpuset will be moved to the nearest ancestor | ||
644 | with non-empty cpus. But the moving of some (or all) tasks might fail if | ||
645 | cpuset is bound with another cgroup subsystem which has some restrictions | ||
646 | on task attaching. In this failing case, those tasks will stay | ||
647 | in the original cpuset, and the kernel will automatically update | ||
648 | their cpus_allowed to allow all online CPUs. When memory hotplug | ||
649 | functionality for removing Memory Nodes is available, a similar exception | ||
650 | is expected to apply there as well. In general, the kernel prefers to | ||
651 | violate cpuset placement, over starving a task that has had all | ||
652 | its allowed CPUs or Memory Nodes taken offline. | ||
653 | |||
654 | There is a second exception to the above. GFP_ATOMIC requests are | ||
655 | kernel internal allocations that must be satisfied immediately. | ||
656 | The kernel may drop some requests, or in rare cases even panic, if a | ||
657 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within | ||
658 | the current task's cpuset, then we relax the cpuset, and look for | ||
659 | memory anywhere we can find it. It's better to violate the cpuset | ||
660 | than stress the kernel. | ||
661 | |||
662 | To start a new job that is to be contained within a cpuset, the steps are: | ||
663 | |||
664 | 1) mkdir /dev/cpuset | ||
665 | 2) mount -t cgroup -ocpuset cpuset /dev/cpuset | ||
666 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in | ||
667 | the /dev/cpuset virtual file system. | ||
668 | 4) Start a task that will be the "founding father" of the new job. | ||
669 | 5) Attach that task to the new cpuset by writing its pid to the | ||
670 | /dev/cpuset tasks file for that cpuset. | ||
671 | 6) fork, exec or clone the job tasks from this founding father task. | ||
672 | |||
673 | For example, the following sequence of commands will setup a cpuset | ||
674 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | ||
675 | and then start a subshell 'sh' in that cpuset: | ||
676 | |||
677 | mount -t cgroup -ocpuset cpuset /dev/cpuset | ||
678 | cd /dev/cpuset | ||
679 | mkdir Charlie | ||
680 | cd Charlie | ||
681 | /bin/echo 2-3 > cpus | ||
682 | /bin/echo 1 > mems | ||
683 | /bin/echo $$ > tasks | ||
684 | sh | ||
685 | # The subshell 'sh' is now running in cpuset Charlie | ||
686 | # The next line should display '/Charlie' | ||
687 | cat /proc/self/cpuset | ||
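
The mkdir and write steps can also be wrapped in a function that stops at
the first failing write. This is a sketch only: make_cpuset and the
CPUSET_ROOT variable are hypothetical names, and the cpuset file system is
assumed to be mounted already at /dev/cpuset (or wherever CPUSET_ROOT points).

```shell
# Hypothetical wrapper: create cpuset $1 with CPUs $2 and Memory Nodes $3.
# CPUSET_ROOT defaults to /dev/cpuset, assumed to be mounted already.
make_cpuset() {
    name=$1 cpus=$2 mems=$3
    dir=${CPUSET_ROOT:-/dev/cpuset}/$name

    mkdir -p "$dir" || return 1
    # /bin/echo is used so that a rejected write shows up as a non-zero status.
    /bin/echo "$cpus" > "$dir/cpus" || return 1
    /bin/echo "$mems" > "$dir/mems" || return 1
    echo "$dir"
}
```

With this, the "Charlie" setup becomes 'make_cpuset Charlie 2-3 1', followed
by writing the shell's pid to the new cpuset's tasks file as shown above.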
688 | |||
689 | There are ways to query or modify cpusets: | ||
690 | - via the cpuset file system directly, using the various cd, mkdir, echo, | ||
691 | cat, rmdir commands from the shell, or their equivalent from C. | ||
692 | - via the C library libcpuset. | ||
693 | - via the C library libcgroup. | ||
694 | (http://sourceforge.net/projects/libcg/) | ||
695 | - via the python application cset. | ||
696 | (http://developer.novell.com/wiki/index.php/Cpuset) | ||
697 | |||
698 | The sched_setaffinity calls can also be done at the shell prompt using | ||
699 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | ||
700 | calls can be done at the shell prompt using the numactl command | ||
701 | (part of Andi Kleen's numa package). | ||
702 | |||
703 | 2. Usage Examples and Syntax | ||
704 | ============================ | ||
705 | |||
706 | 2.1 Basic Usage | ||
707 | --------------- | ||
708 | |||
709 | Creating, modifying, and using cpusets can be done through the cpuset | ||
710 | virtual filesystem. | ||
711 | |||
712 | To mount it, type: | ||
713 | # mount -t cgroup -o cpuset cpuset /dev/cpuset | ||
714 | |||
715 | Then under /dev/cpuset you can find a tree that corresponds to the | ||
716 | tree of the cpusets in the system. For instance, /dev/cpuset | ||
717 | is the cpuset that holds the whole system. | ||
718 | |||
719 | If you want to create a new cpuset under /dev/cpuset: | ||
720 | # cd /dev/cpuset | ||
721 | # mkdir my_cpuset | ||
722 | |||
723 | Now you want to do something with this cpuset. | ||
724 | # cd my_cpuset | ||
725 | |||
726 | In this directory you can find several files: | ||
727 | # ls | ||
728 | cpu_exclusive memory_migrate mems tasks | ||
729 | cpus memory_pressure notify_on_release | ||
730 | mem_exclusive memory_spread_page sched_load_balance | ||
731 | mem_hardwall memory_spread_slab sched_relax_domain_level | ||
732 | |||
733 | Reading them will give you information about the state of this cpuset: | ||
734 | the CPUs and Memory Nodes it can use, the processes that are using | ||
735 | it, its properties. By writing to these files you can manipulate | ||
736 | the cpuset. | ||
737 | |||
738 | Set some flags: | ||
739 | # /bin/echo 1 > cpu_exclusive | ||
740 | |||
741 | Add some cpus: | ||
742 | # /bin/echo 0-7 > cpus | ||
743 | |||
744 | Add some mems: | ||
745 | # /bin/echo 0-7 > mems | ||
746 | |||
747 | Now attach your shell to this cpuset: | ||
748 | # /bin/echo $$ > tasks | ||
749 | |||
750 | You can also create cpusets inside your cpuset by using mkdir in this | ||
751 | directory. | ||
752 | # mkdir my_sub_cs | ||
753 | |||
754 | To remove a cpuset, just use rmdir: | ||
755 | # rmdir my_sub_cs | ||
756 | This will fail if the cpuset is in use (has cpusets inside, or has | ||
757 | processes attached). | ||
758 | |||
759 | Note that for legacy reasons, the "cpuset" filesystem exists as a | ||
760 | wrapper around the cgroup filesystem. | ||
761 | |||
762 | The command | ||
763 | |||
764 | mount -t cpuset X /dev/cpuset | ||
765 | |||
766 | is equivalent to | ||
767 | |||
768 | mount -t cgroup -ocpuset,noprefix X /dev/cpuset | ||
769 | echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | ||
770 | |||
771 | 2.2 Adding/removing cpus | ||
772 | ------------------------ | ||
773 | |||
774 | This is the syntax to use when writing in the cpus or mems files | ||
775 | in cpuset directories: | ||
776 | |||
777 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | ||
778 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | ||
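
Both forms name the same CPUs, and ranges and single numbers can be mixed.
To make the list syntax concrete, here is a shell sketch that expands such
a list into individual CPU numbers. expand_cpulist is a hypothetical helper
written for this document; it does no validation of its input.

```shell
# Hypothetical helper: print the CPU numbers named by a list such as
# "0-2,7", separated by single spaces.
expand_cpulist() {
    list=$1
    out=""
    old_ifs=$IFS
    IFS=,                       # split the list on commas
    for part in $list; do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;   # a range "first-last"
            *)   first=$part; last=$part ;;             # a single number
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
    echo "$out"
}

expand_cpulist 1-4
expand_cpulist 1,2,3,4
```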
779 | |||
780 | 2.3 Setting flags | ||
781 | ----------------- | ||
782 | |||
783 | The syntax is very simple: | ||
784 | |||
785 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | ||
786 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | ||
787 | |||
788 | 2.4 Attaching processes | ||
789 | ----------------------- | ||
790 | |||
791 | # /bin/echo PID > tasks | ||
792 | |||
793 | Note that it is PID, not PIDs. You can only attach ONE task at a time. | ||
794 | If you have several tasks to attach, you have to do it one after another: | ||
795 | |||
796 | # /bin/echo PID1 > tasks | ||
797 | # /bin/echo PID2 > tasks | ||
798 | ... | ||
799 | # /bin/echo PIDn > tasks | ||
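
The one-PID-per-write rule is easy to wrap in a loop. The sketch below
assumes nothing beyond what is described above; attach_tasks is a
hypothetical helper name, and its first argument is the path of the
target cpuset's tasks file.

```shell
# Hypothetical helper: attach each PID to the cpuset whose tasks file is $1,
# issuing one write per PID, and report how many writes succeeded.
attach_tasks() {
    tasks_file=$1
    shift
    ok=0 failed=0
    for pid in "$@"; do
        # /bin/echo reports a failed write() through its exit status.
        if /bin/echo "$pid" > "$tasks_file"; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
        fi
    done
    echo "attached $ok, failed $failed"
}
```

For example, 'attach_tasks /dev/cpuset/my_cpuset/tasks PID1 PID2' attaches
two tasks with two separate writes.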
800 | |||
801 | |||
802 | 3. Questions | ||
803 | ============ | ||
804 | |||
805 | Q: what's up with this '/bin/echo' ? | ||
806 | A: bash's builtin 'echo' command does not check calls to write() against | ||
807 | errors. If you use it in the cpuset file system, you won't be | ||
808 | able to tell whether a command succeeded or failed. | ||
809 | |||
810 | Q: When I attach processes, only the first of the line gets really attached ! | ||
811 | A: We can only return one error code per call to write(). So you should also | ||
812 | put only ONE pid. | ||
813 | |||
814 | 4. Contact | ||
815 | ========== | ||
816 | |||
817 | Web: http://www.bullopensource.org/cpuset | ||
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt new file mode 100644 index 000000000000..7cc6e6a60672 --- /dev/null +++ b/Documentation/cgroups/devices.txt | |||
@@ -0,0 +1,52 @@ | |||
1 | Device Whitelist Controller | ||
2 | |||
3 | 1. Description: | ||
4 | |||
5 | Implement a cgroup to track and enforce open and mknod restrictions | ||
6 | on device files. A device cgroup associates a device access | ||
7 | whitelist with each cgroup. A whitelist entry has 4 fields. | ||
8 | 'type' is a (all), c (char), or b (block). 'all' means it applies | ||
9 | to all types and all major and minor numbers. Major and minor are | ||
10 | either an integer or * for all. Access is a composition of r | ||
11 | (read), w (write), and m (mknod). | ||
12 | |||
13 | The root device cgroup starts with rwm to 'all'. A child device | ||
14 | cgroup gets a copy of the parent. Administrators can then remove | ||
15 | devices from the whitelist or add new entries. A child cgroup can | ||
16 | never receive a device access which is denied by its parent. However | ||
17 | when a device access is removed from a parent it will not also be | ||
18 | removed from the child(ren). | ||
19 | |||
20 | 2. User Interface | ||
21 | |||
22 | An entry is added using devices.allow, and removed using | ||
23 | devices.deny. For instance | ||
24 | |||
25 | echo 'c 1:3 mr' > /cgroups/1/devices.allow | ||
26 | |||
27 | allows cgroup 1 to read and mknod the device usually known as | ||
28 | /dev/null. Doing | ||
29 | |||
30 | echo a > /cgroups/1/devices.deny | ||
31 | |||
32 | will remove the default 'a *:* rwm' entry. Doing | ||
33 | |||
34 | echo a > /cgroups/1/devices.allow | ||
35 | |||
36 | will add the 'a *:* rwm' entry to the whitelist. | ||
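
As an illustration of the entry format described in section 1, the shell
sketch below checks a string before it is written to devices.allow or
devices.deny. valid_devices_entry is a hypothetical helper; it only mirrors
the format given in this document and is not the kernel's parser.

```shell
# Hypothetical helper: succeed if $1 looks like "TYPE MAJOR:MINOR ACCESS".
valid_devices_entry() {
    entry=$1
    # The examples above also use a bare "a" as shorthand for 'a *:* rwm'.
    [ "$entry" = "a" ] && return 0

    # Split the entry on single spaces into its three fields.
    kind=${entry%% *}
    rest=${entry#* }
    dev=${rest%% *}
    access=${rest#* }
    [ "$rest" != "$entry" ] && [ "$access" != "$rest" ] || return 1

    case $kind in a|b|c) ;; *) return 1 ;; esac
    case $dev in *:*) ;; *) return 1 ;; esac
    major=${dev%%:*}
    minor=${dev#*:}
    for num in "$major" "$minor"; do
        case $num in
            '*') ;;                      # wildcard: all majors or minors
            ''|*[!0-9]*) return 1 ;;     # anything else must be an integer
        esac
    done
    # Access is a non-empty composition of r, w and m.
    case $access in ''|*[!rwm]*) return 1 ;; esac
    return 0
}

valid_devices_entry 'c 1:3 mr' && echo "entry looks well-formed"
```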
37 | |||
38 | 3. Security | ||
39 | |||
40 | Any task can move itself between cgroups. This clearly won't | ||
41 | suffice, but we can decide the best way to adequately restrict | ||
42 | movement as people get some experience with this. We may just want | ||
43 | to require CAP_SYS_ADMIN, which at least is a separate bit from | ||
44 | CAP_MKNOD. We may want to just refuse moving to a cgroup which | ||
45 | isn't a descendent of the current one. Or we may want to use | ||
46 | CAP_MAC_ADMIN, since we really are trying to lock down root. | ||
47 | |||
48 | CAP_SYS_ADMIN is needed to modify the whitelist or move another | ||
49 | task to a new cgroup. (Again we'll probably want to change that). | ||
50 | |||
51 | A cgroup may not be granted more permissions than the cgroup's | ||
52 | parent has. | ||
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt new file mode 100644 index 000000000000..523a9c16c400 --- /dev/null +++ b/Documentation/cgroups/memcg_test.txt | |||
@@ -0,0 +1,362 @@ | |||
1 | Memory Resource Controller(Memcg) Implementation Memo. | ||
2 | Last Updated: 2009/1/19 | ||
3 | Base Kernel Version: based on 2.6.29-rc2. | ||
4 | |||
5 | Because VM is getting complex (one of reasons is memcg...), memcg's behavior | ||
6 | is complex. This is a document for memcg's internal behavior. | ||
7 | Please note that implementation details can be changed. | ||
8 | |||
9 | (*) Topics on API should be in Documentation/cgroups/memory.txt | ||
10 | |||
11 | 0. How to record usage ? | ||
12 | 2 objects are used. | ||
13 | |||
14 | page_cgroup ....an object per page. | ||
15 | Allocated at boot or memory hotplug. Freed at memory hot removal. | ||
16 | |||
17 | swap_cgroup ... an entry per swp_entry. | ||
18 | Allocated at swapon(). Freed at swapoff(). | ||
19 | |||
20 | The page_cgroup has a USED bit, so double counting against a page_cgroup | ||
21 | never occurs. swap_cgroup is used only when a charged page is swapped-out. | ||
22 | |||
23 | 1. Charge | ||
24 | |||
25 | a page/swp_entry may be charged (usage += PAGE_SIZE) at | ||
26 | |||
27 | mem_cgroup_newpage_charge() | ||
28 | Called at new page fault and Copy-On-Write. | ||
29 | |||
30 | mem_cgroup_try_charge_swapin() | ||
31 | Called at do_swap_page() (page fault on swap entry) and swapoff. | ||
32 | Followed by charge-commit-cancel protocol. (With swap accounting) | ||
33 | At commit, a charge recorded in swap_cgroup is removed. | ||
34 | |||
35 | mem_cgroup_cache_charge() | ||
36 | Called at add_to_page_cache() | ||
37 | |||
38 | mem_cgroup_cache_charge_swapin() | ||
39 | Called at shmem's swapin. | ||
40 | |||
41 | mem_cgroup_prepare_migration() | ||
42 | Called before migration. "extra" charge is done and followed by | ||
43 | charge-commit-cancel protocol. | ||
44 | At commit, charge against oldpage or newpage will be committed. | ||
45 | |||
46 | 2. Uncharge | ||
47 | a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by | ||
48 | |||
49 | mem_cgroup_uncharge_page() | ||
50 | Called when an anonymous page is fully unmapped. I.e., mapcount goes | ||
51 | to 0. If the page is SwapCache, uncharge is delayed until | ||
52 | mem_cgroup_uncharge_swapcache(). | ||
53 | |||
54 | mem_cgroup_uncharge_cache_page() | ||
55 | Called when a page-cache is deleted from radix-tree. If the page is | ||
56 | SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache(). | ||
57 | |||
58 | mem_cgroup_uncharge_swapcache() | ||
59 | Called when SwapCache is removed from radix-tree. The charge itself | ||
60 | is moved to swap_cgroup. (If mem+swap controller is disabled, no | ||
61 | charge to swap occurs.) | ||
62 | |||
63 | mem_cgroup_uncharge_swap() | ||
64 | Called when swp_entry's refcnt goes down to 0. A charge against swap | ||
65 | disappears. | ||
66 | |||
67 | mem_cgroup_end_migration(old, new) | ||
68 | At success of migration old is uncharged (if necessary), a charge | ||
69 | to new page is committed. At failure, charge to old page is committed. | ||
70 | |||
71 | 3. charge-commit-cancel | ||
72 | In some cases, we can't know whether this "charge" is valid or not at | ||
73 | charging (because of races). | ||
74 | To handle such cases, there are charge-commit-cancel functions. | ||
75 | mem_cgroup_try_charge_XXX | ||
76 | mem_cgroup_commit_charge_XXX | ||
77 | mem_cgroup_cancel_charge_XXX | ||
78 | These are used in swap-in and migration. | ||
79 | |||
80 | At try_charge(), there are no flags to say "this page is charged". | ||
81 | At this point, usage += PAGE_SIZE. | ||
82 | |||
83 | At commit(), the function checks whether the page should be charged or | ||
84 | not, and either sets flags or avoids charging (usage -= PAGE_SIZE). | ||
85 | |||
86 | At cancel(), simply usage -= PAGE_SIZE. | ||
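
The accounting effect of the three steps can be modelled with a toy shell
sketch. It tracks a single page only; the variable 'charged' stands in for
the page_cgroup USED bit, PAGE_SIZE is assumed to be 4096, and the function
names are shortened stand-ins rather than the kernel's symbols.

```shell
PAGE_SIZE=4096
usage=0        # bytes charged to the (hypothetical) memcg
charged=0      # stands in for the USED bit of one page's page_cgroup

try_charge()    { usage=$((usage + PAGE_SIZE)); }
cancel_charge() { usage=$((usage - PAGE_SIZE)); }
commit_charge() {
    if [ "$charged" -eq 1 ]; then
        # Page was charged already: drop the extra charge taken by try_charge.
        usage=$((usage - PAGE_SIZE))
    else
        charged=1
    fi
}

try_charge; commit_charge     # first commit keeps the charge
try_charge; commit_charge     # second commit finds the page charged, backs out
try_charge; cancel_charge     # cancel always backs out
echo "usage=$usage charged=$charged"
```

Whatever the order of events, the page ends up accounted exactly once.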
87 | |||
88 | In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | ||
89 | |||
90 | 4. Anonymous | ||
91 | Anonymous page is newly allocated at | ||
92 | - page fault into MAP_ANONYMOUS mapping. | ||
93 | - Copy-On-Write. | ||
94 | It is charged right after it's allocated before doing any page table | ||
95 | related operations. Of course, it's uncharged when another page is used | ||
96 | for the fault address. | ||
97 | |||
98 | At freeing anonymous page (by exit() or munmap()), zap_pte() is called | ||
99 | and pages for ptes are freed one by one.(see mm/memory.c). Uncharges | ||
100 | are done at page_remove_rmap() when page_mapcount() goes down to 0. | ||
101 | |||
102 | Another way pages are freed is page-reclaim (vmscan.c), where anonymous | ||
103 | pages are swapped out. In this case, the page is marked as | ||
104 | PageSwapCache(). The uncharge() routine doesn't uncharge a page marked | ||
105 | as SwapCache(). It's delayed until __delete_from_swap_cache(). | ||
106 | |||
107 | 4.1 Swap-in. | ||
108 | At swap-in, the page is taken from swap-cache. There are 2 cases. | ||
109 | |||
110 | (a) If the SwapCache is newly allocated and read, it has no charges. | ||
111 | (b) If the SwapCache has been mapped by processes, it has been | ||
112 | charged already. | ||
113 | |||
114 | This swap-in is one of the most complicated operations. In do_swap_page(), | ||
115 | the following events occur when the pte is unchanged. | ||
116 | |||
117 | (1) the page (SwapCache) is looked up. | ||
118 | (2) lock_page() | ||
119 | (3) try_charge_swapin() | ||
120 | (4) reuse_swap_page() (may call delete_from_swap_cache()) | ||
121 | (5) commit_charge_swapin() | ||
122 | (6) swap_free(). | ||
123 | |||
124 | Consider the following situations, for example. | ||
125 | |||
126 | (A) The page has not been charged before (2) and reuse_swap_page() | ||
127 | doesn't call delete_from_swap_cache(). | ||
128 | (B) The page has not been charged before (2) and reuse_swap_page() | ||
129 | calls delete_from_swap_cache(). | ||
130 | (C) The page has been charged before (2) and reuse_swap_page() doesn't | ||
131 | call delete_from_swap_cache(). | ||
132 | (D) The page has been charged before (2) and reuse_swap_page() calls | ||
133 | delete_from_swap_cache(). | ||
134 | |||
135 | memory.usage/memsw.usage changes to this page/swp_entry will be | ||
136 | Case (A) (B) (C) (D) | ||
137 | Event | ||
138 | Before (2) 0/ 1 0/ 1 1/ 1 1/ 1 | ||
139 | =========================================== | ||
140 | (3) +1/+1 +1/+1 +1/+1 +1/+1 | ||
141 | (4) - 0/ 0 - -1/ 0 | ||
142 | (5) 0/-1 0/ 0 -1/-1 0/ 0 | ||
143 | (6) - 0/-1 - 0/-1 | ||
144 | =========================================== | ||
145 | Result 1/ 1 1/ 1 1/ 1 1/ 1 | ||
146 | |||
147 | In all cases, charges to this page should be 1/ 1. | ||
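
The table can be checked mechanically by adding up each column. The shell
sketch below does that; sum_charges is a hypothetical helper that takes the
"Before (2)" value followed by the entries for events (3) to (6), written
as memory/memsw pairs, with "-" for an event that does not occur.

```shell
# Hypothetical helper: add up memory/memsw entries of one column of the table.
sum_charges() {
    mem=0 memsw=0
    for delta in "$@"; do
        [ "$delta" = "-" ] && continue     # "-" in the table: no such event
        mem=$((mem + ${delta%/*}))
        memsw=$((memsw + ${delta#*/}))
    done
    echo "$mem/$memsw"
}

sum_charges 0/1 +1/+1 -    0/-1  -       # case (A)
sum_charges 0/1 +1/+1 0/0  0/0   0/-1    # case (B)
sum_charges 1/1 +1/+1 -    -1/-1 -       # case (C)
sum_charges 1/1 +1/+1 -1/0 0/0   0/-1    # case (D)
```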
148 | |||
149 | 4.2 Swap-out. | ||
150 | At swap-out, typical state transition is below. | ||
151 | |||
152 | (a) add to swap cache. (marked as SwapCache) | ||
153 | swp_entry's refcnt += 1. | ||
154 | (b) fully unmapped. | ||
155 | swp_entry's refcnt += # of ptes. | ||
156 | (c) write back to swap. | ||
157 | (d) delete from swap cache. (remove from SwapCache) | ||
158 | swp_entry's refcnt -= 1. | ||
159 | |||
160 | |||
161 | At (b), the page is marked as SwapCache and not uncharged. | ||
162 | At (d), the page is removed from SwapCache and a charge in page_cgroup | ||
163 | is moved to swap_cgroup. | ||
164 | |||
165 | Finally, at task exit, | ||
166 | (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. | ||
167 | Here, a charge in swap_cgroup disappears. | ||
168 | |||
169 | 5. Page Cache | ||
170 | Page Cache is charged at | ||
171 | - add_to_page_cache_locked(). | ||
172 | |||
173 | uncharged at | ||
174 | - __remove_from_page_cache(). | ||
175 | |||
176 | The logic is very clear. (About migration, see below) | ||
177 | Note: __remove_from_page_cache() is called by remove_from_page_cache() | ||
178 | and __remove_mapping(). | ||
179 | |||
180 | 6. Shmem(tmpfs) Page Cache | ||
181 | Memcg's charge/uncharge have special handlers of shmem. The best way | ||
182 | to understand shmem's page state transition is to read mm/shmem.c. | ||
183 | But brief explanation of the behavior of memcg around shmem will be | ||
184 | helpful to understand the logic. | ||
185 | |||
186 | Shmem's page (just leaf page, not direct/indirect block) can be on | ||
187 | - radix-tree of shmem's inode. | ||
188 | - SwapCache. | ||
189 | - Both on radix-tree and SwapCache. This happens at swap-in | ||
190 | and swap-out, | ||
191 | |||
192 | It's charged when... | ||
193 | - A new page is added to shmem's radix-tree. | ||
194 | - A swp page is read. (move a charge from swap_cgroup to page_cgroup) | ||
195 | It's uncharged when | ||
196 | - A page is removed from radix-tree and not SwapCache. | ||
197 | - When SwapCache is removed, a charge is moved to swap_cgroup. | ||
198 | - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup | ||
199 | disappears. | ||
200 | |||
201 | 7. Page Migration | ||
202 | One of the most complicated functions is page-migration-handler. | ||
203 | Memcg has 2 routines. Assume that we are migrating a page's contents | ||
204 | from OLDPAGE to NEWPAGE. | ||
205 | |||
206 | Usual migration logic is.. | ||
207 | (a) remove the page from LRU. | ||
208 | (b) allocate NEWPAGE (migration target) | ||
209 | (c) lock by lock_page(). | ||
210 | (d) unmap all mappings. | ||
211 | (e-1) If necessary, replace entry in radix-tree. | ||
212 | (e-2) move contents of a page. | ||
213 | (f) map all mappings again. | ||
214 | (g) pushback the page to LRU. | ||
215 | (-) OLDPAGE will be freed. | ||
216 | |||
217 | Before (g), memcg should complete all necessary charge/uncharge to | ||
218 | NEWPAGE/OLDPAGE. | ||
219 | |||
220 | The point is.... | ||
221 | - If OLDPAGE is anonymous, all charges will be dropped at (d) because | ||
222 | try_to_unmap() drops all mapcount and the page will not be | ||
223 | SwapCache. | ||
224 | |||
225 | - If OLDPAGE is SwapCache, charges will be kept at (g) because | ||
226 | __delete_from_swap_cache() isn't called at (e-1) | ||
227 | |||
228 | - If OLDPAGE is page-cache, charges will be kept at (g) because | ||
229 | remove_from_page_cache() isn't called at (e-1) | ||
230 | |||
231 | memcg provides following hooks. | ||
232 | |||
233 | - mem_cgroup_prepare_migration(OLDPAGE) | ||
234 | Called after (b) to account a charge (usage += PAGE_SIZE) against | ||
235 | memcg which OLDPAGE belongs to. | ||
236 | |||
237 | - mem_cgroup_end_migration(OLDPAGE, NEWPAGE) | ||
238 | Called after (f) before (g). | ||
239 | If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already | ||
240 | charged, a charge by prepare_migration() is automatically canceled. | ||
241 | If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE. | ||
242 | |||
243 | But because zap_pte() (by exit or munmap) can be called during migration, | ||
244 | we have to check if OLDPAGE/NEWPAGE is a valid page after commit(). | ||
245 | |||
246 | 8. LRU | ||
247 | Each memcg has its own private LRU. Now, its handling is under global | ||
248 | VM's control (meaning that it's handled under global zone->lru_lock). | ||
249 | Almost all routines around memcg's LRU are called by global LRU's | ||
250 | list management functions under zone->lru_lock. | ||
251 | |||
252 | A special function is mem_cgroup_isolate_pages(). This scans | ||
253 | memcg's private LRU and calls __isolate_lru_page() to extract a page | ||
254 | from LRU. | ||
255 | (By __isolate_lru_page(), the page is removed from both the global and | ||
256 | private LRU.) | ||
257 | |||
258 | |||
259 | 9. Typical Tests. | ||
260 | |||
261 | Tests for racy cases. | ||
262 | |||
263 | 9.1 Small limit to memcg. | ||
264 | When you test racy cases, it's a good test to set memcg's limit | ||
265 | to be very small rather than GB. Many races were found in tests under | ||
266 | xKB or xxMB limits. | ||
267 | (Memory behavior under a GB limit and under a MB limit shows very | ||
268 | different situations.) | ||
269 | |||
270 | 9.2 Shmem | ||
271 | Historically, memcg's shmem handling was poor and we saw some amount | ||
272 | of trouble here. This is because shmem is page-cache but can be | ||
273 | SwapCache. Testing with shmem/tmpfs is always a good test. | ||
274 | |||
275 | 9.3 Migration | ||
276 | For NUMA, migration is another special case. For easy testing, cpuset | ||
277 | is useful. The following is a sample script to do migration. | ||
278 | |||
279 | mount -t cgroup -o cpuset none /opt/cpuset | ||
280 | |||
281 | mkdir /opt/cpuset/01 | ||
282 | echo 1 > /opt/cpuset/01/cpuset.cpus | ||
283 | echo 0 > /opt/cpuset/01/cpuset.mems | ||
284 | echo 1 > /opt/cpuset/01/cpuset.memory_migrate | ||
285 | mkdir /opt/cpuset/02 | ||
286 | echo 1 > /opt/cpuset/02/cpuset.cpus | ||
287 | echo 1 > /opt/cpuset/02/cpuset.mems | ||
288 | echo 1 > /opt/cpuset/02/cpuset.memory_migrate | ||
289 | |||
290 | In the above setup, when you move a task from 01 to 02, page migration from | ||
291 | node 0 to node 1 will occur. The following is a script to migrate all tasks | ||
292 | under a cpuset. | ||
293 | -- | ||
294 | move_task() | ||
295 | { | ||
296 | for pid in $1 | ||
297 | do | ||
298 | /bin/echo $pid >$2/tasks 2>/dev/null | ||
299 | echo -n $pid | ||
300 | echo -n " " | ||
301 | done | ||
302 | echo END | ||
303 | } | ||
304 | |||
305 | G1_TASK=`cat ${G1}/tasks` | ||
306 | G2_TASK=`cat ${G2}/tasks` | ||
307 | move_task "${G1_TASK}" ${G2} & | ||
308 | -- | ||
309 | 9.4 Memory hotplug. | ||
310 | Memory hotplug is another good test. | ||
311 | To offline memory, do the following. | ||
312 | # echo offline > /sys/devices/system/memory/memoryXXX/state | ||
313 | (XXX identifies the memory block to offline) | ||
314 | This is an easy way to test page migration, too. | ||
315 | |||
316 | 9.5 mkdir/rmdir | ||
317 | When using hierarchy, mkdir/rmdir test should be done. | ||
318 | Use tests like the following. | ||
319 | |||
320 | echo 1 >/opt/cgroup/01/memory.use_hierarchy | ||
321 | mkdir /opt/cgroup/01/child_a | ||
322 | mkdir /opt/cgroup/01/child_b | ||
323 | |||
324 | Set a limit on 01. | ||
325 | Add a limit to 01/child_b. | ||
326 | Run jobs under child_a and child_b. | ||
327 | |||
328 | Create/delete the following groups at random while jobs are running. | ||
329 | /opt/cgroup/01/child_a/child_aa | ||
330 | /opt/cgroup/01/child_b/child_bb | ||
331 | /opt/cgroup/01/child_c | ||
332 | |||
333 | Running new jobs in a new group is also good. | ||
334 | |||
335 | 9.6 Mount with other subsystems. | ||
336 | Mounting with other subsystems is a good test because there are | ||
337 | races and lock dependencies with other cgroup subsystems. | ||
338 | |||
339 | example) | ||
340 | # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices | ||
341 | |||
342 | and do task move, mkdir, rmdir etc...under this. | ||
343 | |||
344 | 9.7 swapoff. | ||
345 | Besides the management of swap being one of the complicated parts of memcg, | ||
346 | the call path of swap-in at swapoff is not the same as the usual swap-in path. | ||
347 | It is worth testing explicitly. | ||
348 | |||
349 | For example, a test like the following is good. | ||
350 | (Shell-A) | ||
351 | # mount -t cgroup none /cgroup -o memory | ||
352 | # mkdir /cgroup/test | ||
353 | # echo 40M > /cgroup/test/memory.limit_in_bytes | ||
354 | # echo 0 > /cgroup/test/tasks | ||
355 | Run malloc(100M) program under this. You'll see 60M of swaps. | ||
356 | (Shell-B) | ||
357 | # move all tasks in /cgroup/test to /cgroup | ||
358 | # /sbin/swapoff -a | ||
359 | # rmdir /cgroup/test | ||
360 | # kill malloc task. | ||
361 | |||
362 | Of course, tmpfs vs. swapoff should be tested, too. | ||
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt new file mode 100644 index 000000000000..e1501964df1e --- /dev/null +++ b/Documentation/cgroups/memory.txt | |||
@@ -0,0 +1,399 @@ | |||
1 | Memory Resource Controller | ||
2 | |||
3 | NOTE: The Memory Resource Controller has generically been referred | ||
4 | to as the memory controller in this document. Do not confuse the memory | ||
5 | controller used here with the memory controller that is used in hardware. | ||
6 | |||
7 | Salient features | ||
8 | |||
9 | a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages | ||
10 | b. The infrastructure allows easy addition of other types of memory to control | ||
11 | c. Provides *zero overhead* for non memory controller users | ||
12 | d. Provides a double LRU: global memory pressure causes reclaim from the | ||
13 | global LRU; a cgroup on hitting a limit, reclaims from the per | ||
14 | cgroup LRU | ||
15 | |||
16 | NOTE: Swap Cache (unmapped) is not accounted now. | ||
17 | |||
18 | Benefits and Purpose of the memory controller | ||
19 | |||
20 | The memory controller isolates the memory behaviour of a group of tasks | ||
21 | from the rest of the system. The article on LWN [12] mentions some probable | ||
22 | uses of the memory controller. The memory controller can be used to | ||
23 | |||
24 | a. Isolate an application or a group of applications | ||
25 | Memory hungry applications can be isolated and limited to a smaller | ||
26 | amount of memory. | ||
27 | b. Create a cgroup with a limited amount of memory; this can be used | ||
28 | as a good alternative to booting with mem=XXXX. | ||
29 | c. Virtualization solutions can control the amount of memory they want | ||
30 | to assign to a virtual machine instance. | ||
31 | d. A CD/DVD burner could control the amount of memory used by the | ||
32 | rest of the system to ensure that burning does not fail due to lack | ||
33 | of available memory. | ||
34 | e. There are several other use cases, find one or use the controller just | ||
35 | for fun (to learn and hack on the VM subsystem). | ||
36 | |||
37 | 1. History | ||
38 | |||
39 | The memory controller has a long history. A request for comments for the memory | ||
40 | controller was posted by Balbir Singh [1]. At the time the RFC was posted | ||
41 | there were several implementations for memory control. The goal of the | ||
42 | RFC was to build consensus and agreement for the minimal features required | ||
43 | for memory control. The first RSS controller was posted by Balbir Singh[2] | ||
44 | in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the | ||
45 | RSS controller. At OLS, at the resource management BoF, everyone suggested | ||
46 | that we handle both page cache and RSS together. Another request was raised | ||
47 | to allow user space handling of OOM. The current memory controller is | ||
48 | at version 6; it combines both mapped (RSS) and unmapped Page | ||
49 | Cache Control [11]. | ||
50 | |||
51 | 2. Memory Control | ||
52 | |||
53 | Memory is a unique resource in the sense that it is present in a limited | ||
54 | amount. If a task requires a lot of CPU processing, the task can spread | ||
55 | its processing over a period of hours, days, months or years, but with | ||
56 | memory, the same physical memory needs to be reused to accomplish the task. | ||
57 | |||
58 | The memory controller implementation has been divided into phases. These | ||
59 | are: | ||
60 | |||
61 | 1. Memory controller | ||
62 | 2. mlock(2) controller | ||
63 | 3. Kernel user memory accounting and slab control | ||
64 | 4. user mappings length controller | ||
65 | |||
66 | The memory controller is the first controller developed. | ||
67 | |||
68 | 2.1. Design | ||
69 | |||
70 | The core of the design is a counter called the res_counter. The res_counter | ||
71 | tracks the current memory usage and limit of the group of processes associated | ||
72 | with the controller. Each cgroup has a memory controller specific data | ||
73 | structure (mem_cgroup) associated with it. | ||
74 | |||
75 | 2.2. Accounting | ||
76 | |||
77 | +--------------------+ | ||
78 | | mem_cgroup | | ||
79 | | (res_counter) | | ||
80 | +--------------------+ | ||
81 | / ^ \ | ||
82 | / | \ | ||
83 | +---------------+ | +---------------+ | ||
84 | | mm_struct | |.... | mm_struct | | ||
85 | | | | | | | ||
86 | +---------------+ | +---------------+ | ||
87 | | | ||
88 | + --------------+ | ||
89 | | | ||
90 | +---------------+ +------+--------+ | ||
91 | | page +----------> page_cgroup| | ||
92 | | | | | | ||
93 | +---------------+ +---------------+ | ||
94 | |||
95 | (Figure 1: Hierarchy of Accounting) | ||
96 | |||
97 | |||
98 | Figure 1 shows the important aspects of the controller | ||
99 | |||
100 | 1. Accounting happens per cgroup | ||
101 | 2. Each mm_struct knows about which cgroup it belongs to | ||
102 | 3. Each page has a pointer to the page_cgroup, which in turn knows the | ||
103 | cgroup it belongs to | ||
104 | |||
105 | The accounting is done as follows: mem_cgroup_charge() is invoked to setup | ||
106 | the necessary data structures and check if the cgroup that is being charged | ||
107 | is over its limit. If it is then reclaim is invoked on the cgroup. | ||
108 | More details can be found in the reclaim section of this document. | ||
109 | If everything goes well, a page meta-data-structure called page_cgroup is | ||
110 | allocated and associated with the page. This routine also adds the page to | ||
111 | the per cgroup LRU. | ||
112 | |||
113 | 2.2.1 Accounting details | ||
114 | |||
115 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. | ||
116 | (Some pages which are never reclaimable and will not be on the global LRU | ||
117 | are not accounted. We account only pages under usual VM management.) | ||
118 | |||
119 | RSS pages are accounted at page_fault unless they've already been accounted | ||
120 | for earlier. A file page will be accounted for as Page Cache when it's | ||
121 | inserted into inode (radix-tree). While it's mapped into the page tables of | ||
122 | processes, duplicate accounting is carefully avoided. | ||
123 | |||
124 | An RSS page is unaccounted when it's fully unmapped. A PageCache page is | ||
125 | unaccounted when it's removed from radix-tree. | ||
126 | |||
127 | At page migration, accounting information is kept. | ||
128 | |||
129 | Note: we account only pages-on-lru because our purpose is to control the | ||
130 | amount of used pages. not-on-lru pages tend to be out of control from VM view. | ||
131 | |||
132 | 2.3 Shared Page Accounting | ||
133 | |||
134 | Shared pages are accounted on the basis of the first touch approach. The | ||
135 | cgroup that first touches a page is accounted for the page. The principle | ||
136 | behind this approach is that a cgroup that aggressively uses a shared | ||
137 | page will eventually get charged for it (once it is uncharged from | ||
138 | the cgroup that brought it in -- this will happen on memory pressure). | ||
139 | |||
140 | Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used. | ||
141 | When you do swapoff and force swapped-out pages of shmem(tmpfs) to | ||
142 | be brought back into memory, charges for pages are accounted against the | ||
143 | caller of swapoff rather than the users of shmem. | ||
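
The first touch rule described in 2.3 can be sketched with a small user-space
model. This is only an illustration: struct group, struct page_model and both
functions are hypothetical names, not kernel data structures or kernel APIs.

```c
#include <stddef.h>

/* Hypothetical model of a cgroup's memory usage, in bytes. */
struct group {
	unsigned long long usage;
};

/* Hypothetical model of a page: owner is NULL while the page is uncharged. */
struct page_model {
	struct group *owner;
};

/*
 * First touch: only the first group to touch an uncharged page is charged.
 * Returns 1 if this touch charged the page to g, 0 if it was already charged.
 */
int touch_page(struct page_model *pg, struct group *g,
	       unsigned long long page_size)
{
	if (pg->owner)
		return 0;
	pg->owner = g;
	g->usage += page_size;
	return 1;
}

/* Uncharge (e.g. under memory pressure); the next toucher becomes the owner. */
void uncharge_page(struct page_model *pg, unsigned long long page_size)
{
	if (!pg->owner)
		return;
	pg->owner->usage -= page_size;
	pg->owner = NULL;
}
```

In this model a second group that keeps using the page is charged only after
the page has been uncharged from the group that brought it in.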
144 | |||
145 | |||
146 | 2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) | ||
147 | Swap Extension allows you to record charge for swap. A swapped-in page is | ||
148 | charged back to original page allocator if possible. | ||
149 | |||
150 | When swap is accounted, the following files are added. | ||
151 | - memory.memsw.usage_in_bytes. | ||
152 | - memory.memsw.limit_in_bytes. | ||
153 | |||
154 | Usage of mem+swap is limited by memory.memsw.limit_in_bytes. | ||
155 | |||
156 | Note: why 'mem+swap' rather than swap. | ||
157 | The global LRU (kswapd) can swap out arbitrary pages. Swap-out means | ||
158 | moving the charge from memory to swap; there is no change in the usage of | ||
159 | mem+swap. | ||
160 | |||
161 | In other words, when we want to limit the usage of swap without affecting | ||
162 | global LRU, mem+swap limit is better than just limiting swap from OS point | ||
163 | of view. | ||
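
The reasoning behind the mem+swap counter can be sketched as a tiny user-space
model. The struct and function names below are hypothetical and exist only for
this illustration; the kernel implementation is not shown here.

```c
/* Hypothetical model of the two counters, in bytes. */
struct memsw_model {
	unsigned long long mem;		/* models memory.usage_in_bytes */
	unsigned long long memsw;	/* models memory.memsw.usage_in_bytes */
};

/* Charging newly used memory grows both counters. */
void model_charge(struct memsw_model *m, unsigned long long bytes)
{
	m->mem += bytes;
	m->memsw += bytes;
}

/* Swap-out moves the charge from memory to swap: mem+swap is unchanged. */
void model_swap_out(struct memsw_model *m, unsigned long long bytes)
{
	m->mem -= bytes;
}

/* Swap-in moves the charge back to memory: mem+swap is again unchanged. */
void model_swap_in(struct memsw_model *m, unsigned long long bytes)
{
	m->mem += bytes;
}
```

Because swap-out never changes memsw in this model, a limit on mem+swap does
not depend on which pages the global LRU decides to swap out.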
164 | |||
165 | 2.5 Reclaim | ||
166 | |||
167 | Each cgroup maintains a per cgroup LRU that consists of an active | ||
168 | and inactive list. When a cgroup goes over its limit, we first try | ||
169 | to reclaim memory from the cgroup so as to make space for the new | ||
170 | pages that the cgroup has touched. If the reclaim is unsuccessful, | ||
171 | an OOM routine is invoked to select and kill the bulkiest task in the | ||
172 | cgroup. | ||
173 | |||
174 | The reclaim algorithm has not been modified for cgroups, except that | ||
175 | pages that are selected for reclaiming come from the per cgroup LRU | ||
176 | list. | ||
177 | |||
178 | 2.6 Locking | ||
179 | |||
180 | The memory controller uses the following hierarchy | ||
181 | |||
182 | 1. zone->lru_lock is used for selecting pages to be isolated | ||
183 | 2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) | ||
184 | 3. lock_page_cgroup() is used to protect page->page_cgroup | ||
185 | |||
186 | 3. User Interface | ||
187 | |||
188 | 0. Configuration | ||
189 | |||
190 | a. Enable CONFIG_CGROUPS | ||
191 | b. Enable CONFIG_RESOURCE_COUNTERS | ||
192 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR | ||
193 | |||
194 | 1. Prepare the cgroups | ||
195 | # mkdir -p /cgroups | ||
196 | # mount -t cgroup none /cgroups -o memory | ||
197 | |||
198 | 2. Make the new group and move bash into it | ||
199 | # mkdir /cgroups/0 | ||
200 | # echo $$ > /cgroups/0/tasks | ||
201 | |||
202 | Since we're now in the 0 cgroup, | ||
203 | we can alter the memory limit: | ||
204 | # echo 4M > /cgroups/0/memory.limit_in_bytes | ||
205 | |||
206 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, | ||
207 | mega or gigabytes. | ||
208 | |||
209 | # cat /cgroups/0/memory.limit_in_bytes | ||
210 | 4194304 | ||
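
How a value such as 4M becomes 4194304 can be sketched in user space as
follows. parse_size() is a hypothetical helper written for this note, not the
kernel's own parser; it assumes a decimal number followed by at most one of
the suffixes listed above.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: convert a string such as "4M" into bytes.
 * Each suffix step multiplies by 1024 (k/K, then m/M, then g/G).
 */
unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long val = strtoull(s, &end, 10);

	switch (*end) {
	case 'g': case 'G':
		val <<= 10;
		/* fall through */
	case 'm': case 'M':
		val <<= 10;
		/* fall through */
	case 'k': case 'K':
		val <<= 10;
		break;
	default:
		/* no suffix: the value is already in bytes */
		break;
	}
	return val;
}
```

With this sketch, parse_size("4M") evaluates to 4 * 1024 * 1024 bytes, the
value read back from memory.limit_in_bytes above.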
211 | |||
212 | NOTE: The interface has now changed to display the usage in bytes | ||
213 | instead of pages | ||
214 | |||
215 | We can check the usage: | ||
216 | # cat /cgroups/0/memory.usage_in_bytes | ||
217 | 1216512 | ||
218 | |||
219 | A successful write to this file does not guarantee a successful set of | ||
220 | this limit to the value written into the file. This can be due to a | ||
221 | number of factors, such as rounding up to page boundaries or the total | ||
222 | availability of memory on the system. The user is required to re-read | ||
223 | this file after a write to guarantee the value committed by the kernel. | ||
224 | |||
225 | # echo 1 > memory.limit_in_bytes | ||
226 | # cat memory.limit_in_bytes | ||
227 | 4096 | ||
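
The value read back in the example above is consistent with rounding the
written value up to the next page boundary. A minimal sketch, assuming a page
size of 4096 bytes (the real page size depends on the architecture);
SKETCH_PAGE_SIZE and round_up_to_page() are names made up for this note:

```c
/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round a byte count up to the next multiple of the page size. */
unsigned long long round_up_to_page(unsigned long long bytes)
{
	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE
		* SKETCH_PAGE_SIZE;
}
```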
228 | |||
229 | The memory.failcnt field gives the number of times that the cgroup limit was | ||
230 | exceeded. | ||
231 | |||
232 | The memory.stat file gives accounting information. Now, the number of | ||
233 | caches, RSS and Active pages/Inactive pages are shown. | ||
234 | |||
235 | 4. Testing | ||
236 | |||
237 | Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | ||
238 | Apart from that v6 has been tested with several applications and regular | ||
239 | daily use. The controller has also been tested on the PPC64, x86_64 and | ||
240 | UML platforms. | ||
241 | |||
242 | 4.1 Troubleshooting | ||
243 | |||
244 | Sometimes a user might find that the application under a cgroup is | ||
245 | terminated. There are several causes for this: | ||
246 | |||
247 | 1. The cgroup limit is too low (just too low to do anything useful) | ||
248 | 2. The user is using anonymous memory and swap is turned off or too low | ||
249 | |||
250 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of | ||
251 | some of the pages cached in the cgroup (page cache pages). | ||
252 | |||
253 | 4.2 Task migration | ||
254 | |||
255 | When a task migrates from one cgroup to another, its charge is not | ||
256 | carried forward. The pages allocated from the original cgroup still | ||
257 | remain charged to it; the charge is dropped when the page is freed or | ||
258 | reclaimed. | ||
259 | |||
260 | 4.3 Removing a cgroup | ||
261 | |||
262 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | ||
263 | cgroup might have some charge associated with it, even though all | ||
264 | tasks have migrated away from it. | ||
265 | Such charges are freed (by default) or moved to its parent. When moved, | ||
266 | both RSS and CACHES are moved to the parent. | ||
267 | If both of them are busy, rmdir() returns -EBUSY. See also 5.1. | ||
268 | |||
269 | Charges recorded in swap information are not updated at removal of a cgroup. | ||
270 | The recorded information is discarded and a cgroup which uses swap (swapcache) | ||
271 | will be charged as the new owner of it. | ||
272 | |||
273 | |||
274 | 5. Misc. interfaces. | ||
275 | |||
276 | 5.1 force_empty | ||
277 | memory.force_empty interface is provided to make cgroup's memory usage empty. | ||
278 | You can use this interface only when the cgroup has no tasks. | ||
279 | When anything is written to this file, for example | ||
280 | |||
281 | # echo 0 > memory.force_empty | ||
282 | |||
283 | almost all pages tracked by this memcg will be unmapped and freed. Some | ||
284 | pages cannot be freed because they are locked or in use. Such pages are moved | ||
285 | to the parent and this cgroup will be empty. But this may return -EBUSY in | ||
286 | some too-busy cases. | ||
287 | |||
288 | A typical use case of this interface is calling it before rmdir(). | ||
289 | Because rmdir() moves all pages to parent, some out-of-use page caches can be | ||
290 | moved to the parent. If you want to avoid that, force_empty will be useful. | ||
291 | |||
292 | 5.2 stat file | ||
293 | memory.stat file includes the following statistics (now) | ||
294 | cache - # of pages from page-cache and shmem. | ||
295 | rss - # of pages from anonymous memory. | ||
296 | pgpgin - # of charging events | ||
297 | pgpgout - # of uncharging events | ||
298 | active_anon - # of pages on active lru of anon, shmem. | ||
299 | inactive_anon - # of pages on inactive lru of anon, shmem | ||
300 | active_file - # of pages on active lru of file-cache | ||
301 | inactive_file - # of pages on inactive lru of file cache | ||
302 | unevictable - # of pages that cannot be reclaimed (mlocked etc.) | ||
303 | |||
304 | The following depend on CONFIG_DEBUG_VM. | ||
305 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) | ||
306 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | ||
307 | recent_rotated_file - VM internal parameter. (see mm/vmscan.c) | ||
308 | recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | ||
309 | recent_scanned_file - VM internal parameter. (see mm/vmscan.c) | ||
310 | |||
311 | Memo: | ||
312 | recent_rotated means recent frequency of lru rotation. | ||
313 | recent_scanned means recent # of scans to lru. | ||
314 | These are shown to aid debugging; please see the code for their meanings. | ||
315 | |||
316 | |||
317 | 5.3 swappiness | ||
318 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | ||
319 | |||
320 | The swappiness of the following cgroups can't be changed. | ||
321 | - root cgroup (uses /proc/sys/vm/swappiness). | ||
322 | - a cgroup which uses hierarchy and has a child cgroup. | ||
323 | - a cgroup which uses hierarchy and is not the root of the hierarchy. | ||
324 | |||
325 | |||
326 | 6. Hierarchy support | ||
327 | |||
328 | The memory controller supports a deep hierarchy and hierarchical accounting. | ||
329 | The hierarchy is created by creating the appropriate cgroups in the | ||
330 | cgroup filesystem. Consider for example, the following cgroup filesystem | ||
331 | hierarchy | ||
332 | |||
333 | root | ||
334 | / | \ | ||
335 | / | \ | ||
336 | a b c | ||
337 | | \ | ||
338 | | \ | ||
339 | d e | ||
340 | |||
341 | In the diagram above, with hierarchical accounting enabled, all memory | ||
342 | usage of e is accounted to its ancestors up to the root (i.e., c and root) | ||
343 | that have memory.use_hierarchy enabled. If one of the ancestors goes over its | ||
344 | limit, the reclaim algorithm reclaims from the tasks in the ancestor and the | ||
345 | children of the ancestor. | ||
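
Hierarchical accounting as described above can be sketched with a user-space
model in which a charge walks up the parent chain. struct hgroup and
hier_charge() are hypothetical names for this note, and the sketch assumes that
every group in the chain has memory.use_hierarchy enabled.

```c
#include <stddef.h>

/* Hypothetical model of one cgroup in the hierarchy. */
struct hgroup {
	struct hgroup *parent;		/* NULL for the root */
	unsigned long long usage;
	unsigned long long limit;
};

/*
 * Charge val to g and to every ancestor of g.
 * Returns 0 on success. On failure returns -1, leaves all usage values
 * unchanged and stores in *over the group that would exceed its limit,
 * i.e. the subtree from which reclaim would be attempted.
 */
int hier_charge(struct hgroup *g, unsigned long long val,
		struct hgroup **over)
{
	struct hgroup *c, *u;

	for (c = g; c; c = c->parent) {
		if (c->usage + val > c->limit) {
			/* undo the charges already made below c */
			for (u = g; u != c; u = u->parent)
				u->usage -= val;
			*over = c;
			return -1;
		}
		c->usage += val;
	}
	*over = NULL;
	return 0;
}
```

For the diagram above, charging e also raises the usage of c and root, and a
charge that would push c over its limit fails even when e itself still has
room.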
346 | |||
347 | 6.1 Enabling hierarchical accounting and reclaim | ||
348 | |||
349 | The memory controller by default disables the hierarchy feature. Support | ||
350 | can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup | ||
351 | |||
352 | # echo 1 > memory.use_hierarchy | ||
353 | |||
354 | The feature can be disabled by | ||
355 | |||
356 | # echo 0 > memory.use_hierarchy | ||
357 | |||
358 | NOTE1: Enabling/disabling will fail if the cgroup already has other | ||
359 | cgroups created below it. | ||
360 | |||
361 | NOTE2: This feature can be enabled/disabled per subtree. | ||
362 | |||
363 | 7. TODO | ||
364 | |||
365 | 1. Add support for accounting huge pages (as a separate controller) | ||
366 | 2. Make per-cgroup scanner reclaim not-shared pages first | ||
367 | 3. Teach controller to account for shared-pages | ||
368 | 4. Start reclamation in the background when the limit is | ||
369 | not yet hit but the usage is getting closer | ||
370 | |||
371 | Summary | ||
372 | |||
373 | Overall, the memory controller has been a stable controller and has been | ||
374 | commented and discussed quite extensively in the community. | ||
375 | |||
376 | References | ||
377 | |||
378 | 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ | ||
379 | 2. Singh, Balbir. Memory Controller (RSS Control), | ||
380 | http://lwn.net/Articles/222762/ | ||
381 | 3. Emelianov, Pavel. Resource controllers based on process cgroups | ||
382 | http://lkml.org/lkml/2007/3/6/198 | ||
383 | 4. Emelianov, Pavel. RSS controller based on process cgroups (v2) | ||
384 | http://lkml.org/lkml/2007/4/9/78 | ||
385 | 5. Emelianov, Pavel. RSS controller based on process cgroups (v3) | ||
386 | http://lkml.org/lkml/2007/5/30/244 | ||
387 | 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ | ||
388 | 7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control | ||
389 | subsystem (v3), http://lwn.net/Articles/235534/ | ||
390 | 8. Singh, Balbir. RSS controller v2 test results (lmbench), | ||
391 | http://lkml.org/lkml/2007/5/17/232 | ||
392 | 9. Singh, Balbir. RSS controller v2 AIM9 results | ||
393 | http://lkml.org/lkml/2007/5/18/1 | ||
394 | 10. Singh, Balbir. Memory controller v6 test results, | ||
395 | http://lkml.org/lkml/2007/8/19/36 | ||
396 | 11. Singh, Balbir. Memory controller introduction (v6), | ||
397 | http://lkml.org/lkml/2007/8/17/69 | ||
398 | 12. Corbet, Jonathan, Controlling memory use in cgroups, | ||
399 | http://lwn.net/Articles/243795/ | ||
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt new file mode 100644 index 000000000000..f196ac1d7d25 --- /dev/null +++ b/Documentation/cgroups/resource_counter.txt | |||
@@ -0,0 +1,181 @@ | |||
1 | |||
2 | The Resource Counter | ||
3 | |||
4 | The resource counter, declared at include/linux/res_counter.h, | ||
5 | is supposed to facilitate the resource management by controllers | ||
6 | by providing common stuff for accounting. | ||
7 | |||
8 | This "stuff" includes the res_counter structure and routines | ||
9 | to work with it. | ||
10 | |||
11 | |||
12 | |||
13 | 1. Crucial parts of the res_counter structure | ||
14 | |||
15 | a. unsigned long long usage | ||
16 | |||
17 | The usage value shows the amount of a resource that is consumed | ||
18 | by a group at a given time. The units of measurement should be | ||
19 | determined by the controller that uses this counter. E.g. it can | ||
20 | be bytes, items or any other unit the controller operates on. | ||
21 | |||
22 | b. unsigned long long max_usage | ||
23 | |||
24 | The maximal value of the usage over time. | ||
25 | |||
26 | This value is useful when gathering statistical information about | ||
27 | the particular group, as it shows the actual resource requirements | ||
28 | for a particular group, not just some usage snapshot. | ||
29 | |||
30 | c. unsigned long long limit | ||
31 | |||
32 | The maximal allowed amount of resource to consume by the group. In | ||
33 | case the group requests for more resources, so that the usage value | ||
34 | would exceed the limit, the resource allocation is rejected (see | ||
35 | the next section). | ||
36 | |||
37 | d. unsigned long long failcnt | ||
38 | |||
39 | The failcnt stands for "failures counter". This is the number of | ||
40 | resource allocation attempts that failed. | ||
41 | |||
42 | e. spinlock_t lock | ||
43 | |||
44 | Protects changes of the above values. | ||
45 | |||
46 | |||
47 | |||
48 | 2. Basic accounting routines | ||
49 | |||
50 | a. void res_counter_init(struct res_counter *rc) | ||
51 | |||
52 | Initializes the resource counter. As usual, should be the first | ||
53 | routine called for a new counter. | ||
54 | |||
55 | b. int res_counter_charge[_locked] | ||
56 | (struct res_counter *rc, unsigned long val) | ||
57 | |||
58 | When a resource is about to be allocated it has to be accounted | ||
59 | with the appropriate resource counter (controller should determine | ||
60 | which one to use on its own). This operation is called "charging". | ||
61 | |||
62 | It is not very important which operation - resource allocation | ||
63 | or charging - is performed first, but | ||
64 | * if the allocation is performed first, this may create a | ||
65 | temporary resource over-usage by the time resource counter is | ||
66 | charged; | ||
67 | * if the charging is performed first, then it should be uncharged | ||
68 | on the error path (if one is taken). | ||
69 | |||
70 | c. void res_counter_uncharge[_locked] | ||
71 | (struct res_counter *rc, unsigned long val) | ||
72 | |||
73 | When a resource is released (freed) it should be de-accounted | ||
74 | from the resource counter it was accounted to. This is called | ||
75 | "uncharging". | ||
76 | |||
77 | The _locked variants assume that the caller already holds res_counter->lock. | ||
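
The charging rules above can be sketched with a user-space model of the
counter. struct rc_model and the rc_model_*() functions are hypothetical names
for this note and are not the kernel implementation; locking is left out, and
starting with the largest representable limit is an assumption of the sketch.

```c
/* Hypothetical model of the counter fields described in section 1. */
struct rc_model {
	unsigned long long usage;
	unsigned long long max_usage;
	unsigned long long limit;
	unsigned long long failcnt;
};

/* Start with no usage and (as an assumption) an effectively unlimited limit. */
void rc_model_init(struct rc_model *rc)
{
	rc->usage = 0;
	rc->max_usage = 0;
	rc->limit = ~0ULL;
	rc->failcnt = 0;
}

/* Charge val; reject the charge and count a failure if the limit is hit. */
int rc_model_charge(struct rc_model *rc, unsigned long long val)
{
	if (val > rc->limit || rc->usage > rc->limit - val) {
		rc->failcnt++;
		return -1;
	}
	rc->usage += val;
	if (rc->usage > rc->max_usage)
		rc->max_usage = rc->usage;
	return 0;
}

/* Uncharge val; never let usage drop below zero. */
void rc_model_uncharge(struct rc_model *rc, unsigned long long val)
{
	if (val > rc->usage)
		val = rc->usage;
	rc->usage -= val;
}
```

In this model a rejected charge changes only failcnt, and max_usage keeps the
highest usage seen so far even after an uncharge.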
78 | |||
79 | |||
80 | 2.1 Other accounting routines | ||
81 | |||
82 | There are more routines that may help you with common needs, like | ||
83 | checking whether the limit is reached or resetting the max_usage | ||
84 | value. They are all declared in include/linux/res_counter.h. | ||
85 | |||
86 | |||
87 | |||
88 | 3. Analyzing the resource counter registrations | ||
89 | |||
90 | a. If the failcnt value constantly grows, this means that the counter's | ||
91 | limit is too tight. Either the group is misbehaving and consumes too | ||
92 | many resources, or the configuration is not suitable for the group | ||
93 | and the limit should be increased. | ||
94 | |||
95 | b. The max_usage value can be used to quickly tune the group. One may | ||
96 | set the limits to maximal values and either load the container with | ||
97 | a common pattern or leave one for a while. After this the max_usage | ||
98 | value shows the amount of memory the container would require during | ||
99 | its common activity. | ||
100 | |||
101 | Setting the limit a bit above this value gives a pretty good | ||
102 | configuration that works in most of the cases. | ||
103 | |||
104 | c. If the max_usage is much less than the limit, but the failcnt value | ||
105 | is growing, then the group tries to allocate a big chunk of resource | ||
106 | at once. | ||
107 | |||
108 | d. If the max_usage is much less than the limit, but the failcnt value | ||
109 | is 0, then this group has been given a higher limit than it | ||
110 | requires. It is better to lower the limit a bit, leaving more resource | ||
111 | for other groups. | ||
112 | |||
113 | |||
114 | |||
115 | 4. Communication with the control groups subsystem (cgroups) | ||
116 | |||
117 | All the resource controllers that are using cgroups and resource counters | ||
118 | should provide files (in the cgroup filesystem) to work with the resource | ||
119 | counter fields. They are recommended to adhere to the following rules: | ||
120 | |||
121 | a. File names | ||
122 | |||
123 | Field name File name | ||
124 | --------------------------------------------------- | ||
125 | usage usage_in_<unit_of_measurement> | ||
126 | max_usage max_usage_in_<unit_of_measurement> | ||
127 | limit limit_in_<unit_of_measurement> | ||
128 | failcnt failcnt | ||
129 | lock no file :) | ||
130 | |||
131 | b. Reading from file should show the corresponding field value in the | ||
132 | appropriate format. | ||
133 | |||
134 | c. Writing to file | ||
135 | |||
136 | Field Expected behavior | ||
137 | ---------------------------------- | ||
138 | usage prohibited | ||
139 | max_usage reset to usage | ||
140 | limit set the limit | ||
141 | failcnt reset to zero | ||
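
The write rules in the table above can be sketched as a single dispatch
function over a user-space model of the counter. enum rc_field, struct
rc_fields and rc_write() are hypothetical names for this note; a real
controller would implement these rules in its cgroup file handlers.

```c
/* Hypothetical identifiers for the four files. */
enum rc_field { RC_USAGE, RC_MAX_USAGE, RC_LIMIT, RC_FAILCNT };

/* Hypothetical model of the counter fields. */
struct rc_fields {
	unsigned long long usage;
	unsigned long long max_usage;
	unsigned long long limit;
	unsigned long long failcnt;
};

/* Apply a write to one field. Returns 0 on success, -1 if prohibited. */
int rc_write(struct rc_fields *rc, enum rc_field field,
	     unsigned long long val)
{
	switch (field) {
	case RC_USAGE:
		return -1;			/* writing usage is prohibited */
	case RC_MAX_USAGE:
		rc->max_usage = rc->usage;	/* reset to current usage */
		return 0;
	case RC_LIMIT:
		rc->limit = val;		/* set the limit */
		return 0;
	case RC_FAILCNT:
		rc->failcnt = 0;		/* reset to zero */
		return 0;
	}
	return -1;
}
```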
142 | |||
143 | |||
144 | |||
145 | 5. Usage example | ||
146 | |||
147 | a. Declare a task group (take a look at cgroups subsystem for this) and | ||
148 | fold a res_counter into it | ||
149 | |||
150 | struct my_group { | ||
151 | struct res_counter res; | ||
152 | |||
153 | <other fields> | ||
154 | }; | ||
155 | |||
156 | b. Put hooks in resource allocation/release paths | ||
157 | |||
158 | int alloc_something(...) | ||
159 | { | ||
160 | if (res_counter_charge(res_counter_ptr, amount) < 0) | ||
161 | return -ENOMEM; | ||
162 | |||
163 | <allocate the resource and return to the caller> | ||
164 | } | ||
165 | |||
166 | void release_something(...) | ||
167 | { | ||
168 | res_counter_uncharge(res_counter_ptr, amount); | ||
169 | |||
170 | <release the resource> | ||
171 | } | ||
172 | |||
173 | In order to keep the usage value self-consistent, both the | ||
174 | "res_counter_ptr" and the "amount" in release_something() should be | ||
175 | the same as they were in alloc_something() when the resource being | ||
176 | released was allocated. | ||
177 | |||
178 | c. Provide the way to read res_counter values and set them (the cgroups | ||
179 | still can help with it). | ||
180 | |||
181 | d. Compile and run :) | ||