Diffstat (limited to 'Documentation/cgroups/memory.txt')
-rw-r--r-- | Documentation/cgroups/memory.txt | 399 |
1 files changed, 399 insertions, 0 deletions
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt new file mode 100644 index 000000000000..e1501964df1e --- /dev/null +++ b/Documentation/cgroups/memory.txt | |||
@@ -0,0 +1,399 @@ | |||
1 | Memory Resource Controller | ||
2 | |||
3 | NOTE: The Memory Resource Controller is referred to generically as the | ||
4 | memory controller in this document. Do not confuse the memory controller | ||
5 | described here with the memory controller that is used in hardware. | ||
6 | |||
7 | Salient features | ||
8 | |||
9 | a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages | ||
10 | b. The infrastructure allows easy addition of other types of memory to control | ||
11 | c. Provides *zero overhead* for users who do not use the memory controller | ||
12 | d. Provides a double LRU: global memory pressure causes reclaim from the | ||
13 | global LRU; a cgroup, on hitting its limit, reclaims from the per-cgroup | ||
14 | LRU | ||
15 | |||
16 | NOTE: Swap Cache (unmapped) is not currently accounted. | ||
17 | |||
18 | Benefits and Purpose of the memory controller | ||
19 | |||
20 | The memory controller isolates the memory behaviour of a group of tasks | ||
21 | from the rest of the system. The article on LWN [12] mentions some probable | ||
22 | uses of the memory controller. The memory controller can be used to | ||
23 | |||
24 | a. Isolate an application or a group of applications | ||
25 | Memory hungry applications can be isolated and limited to a smaller | ||
26 | amount of memory. | ||
27 | b. Create a cgroup with a limited amount of memory; this can be used | ||
28 | as a good alternative to booting with mem=XXXX. | ||
29 | c. Virtualization solutions can control the amount of memory they want | ||
30 | to assign to a virtual machine instance. | ||
31 | d. A CD/DVD burner could control the amount of memory used by the | ||
32 | rest of the system to ensure that burning does not fail due to lack | ||
33 | of available memory. | ||
34 | e. There are several other use cases; find one or use the controller just | ||
35 | for fun (to learn and hack on the VM subsystem). | ||
36 | |||
37 | 1. History | ||
38 | |||
39 | The memory controller has a long history. A request for comments for the memory | ||
40 | controller was posted by Balbir Singh [1]. At the time the RFC was posted | ||
41 | there were several implementations for memory control. The goal of the | ||
42 | RFC was to build consensus and agreement for the minimal features required | ||
43 | for memory control. The first RSS controller was posted by Balbir Singh [2] | ||
44 | in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the | ||
45 | RSS controller. At OLS, at the resource management BoF, everyone suggested | ||
46 | that we handle both page cache and RSS together. Another request was raised | ||
47 | to allow user space handling of OOM. The current memory controller is | ||
48 | at version 6; it combines both mapped (RSS) and unmapped Page | ||
49 | Cache Control [11]. | ||
50 | |||
51 | 2. Memory Control | ||
52 | |||
53 | Memory is a unique resource in the sense that it is present in a limited | ||
54 | amount. If a task requires a lot of CPU processing, the task can spread | ||
55 | its processing over a period of hours, days, months or years, but with | ||
56 | memory, the same physical memory needs to be reused to accomplish the task. | ||
57 | |||
58 | The memory controller implementation has been divided into phases. These | ||
59 | are: | ||
60 | |||
61 | 1. Memory controller | ||
62 | 2. mlock(2) controller | ||
63 | 3. Kernel user memory accounting and slab control | ||
64 | 4. user mappings length controller | ||
65 | |||
66 | The memory controller is the first controller developed. | ||
67 | |||
68 | 2.1. Design | ||
69 | |||
70 | The core of the design is a counter called the res_counter. The res_counter | ||
71 | tracks the current memory usage and limit of the group of processes associated | ||
72 | with the controller. Each cgroup has a memory controller specific data | ||
73 | structure (mem_cgroup) associated with it. | ||
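The counter described above can be pictured with a small toy model. This is a hedged sketch in Python, not the kernel's res_counter: the class name, field names and method names are assumptions chosen for illustration, and the only behaviour modelled is "track usage against a limit and count failed charges".

```python
class ResCounter:
    """Toy model of a resource counter: usage, limit and a failure count."""

    def __init__(self, limit):
        self.usage = 0        # bytes currently charged
        self.limit = limit    # maximum bytes that may be charged
        self.failcnt = 0      # number of charges refused because of the limit

    def charge(self, nbytes):
        """Add nbytes to usage, or refuse and record a failure if over limit."""
        if self.usage + nbytes > self.limit:
            self.failcnt += 1
            return False
        self.usage += nbytes
        return True

    def uncharge(self, nbytes):
        """Give back nbytes, never dropping usage below zero."""
        self.usage -= min(nbytes, self.usage)


# Two pages fit under a two-page limit; the third charge is refused.
counter = ResCounter(limit=2 * 4096)
results = [counter.charge(4096) for _ in range(3)]
```

In this model a refused charge is simply reported to the caller; in the controller described here, hitting the limit triggers reclaim on the cgroup first.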
74 | |||
75 | 2.2. Accounting | ||
76 | |||
77 | +--------------------+ | ||
78 | | mem_cgroup | | ||
79 | | (res_counter) | | ||
80 | +--------------------+ | ||
81 | / ^ \ | ||
82 | / | \ | ||
83 | +---------------+ | +---------------+ | ||
84 | | mm_struct | |.... | mm_struct | | ||
85 | | | | | | | ||
86 | +---------------+ | +---------------+ | ||
87 | | | ||
88 | + --------------+ | ||
89 | | | ||
90 | +---------------+ +------+--------+ | ||
91 | | page +----------> page_cgroup| | ||
92 | | | | | | ||
93 | +---------------+ +---------------+ | ||
94 | |||
95 | (Figure 1: Hierarchy of Accounting) | ||
96 | |||
97 | |||
98 | Figure 1 shows the important aspects of the controller | ||
99 | |||
100 | 1. Accounting happens per cgroup | ||
101 | 2. Each mm_struct knows about which cgroup it belongs to | ||
102 | 3. Each page has a pointer to the page_cgroup, which in turn knows the | ||
103 | cgroup it belongs to | ||
104 | |||
105 | The accounting is done as follows: mem_cgroup_charge() is invoked to set up | ||
106 | the necessary data structures and check if the cgroup that is being charged | ||
107 | is over its limit. If it is, then reclaim is invoked on the cgroup. | ||
108 | More details can be found in the reclaim section of this document. | ||
109 | If everything goes well, a page meta-data-structure called page_cgroup is | ||
110 | allocated and associated with the page. This routine also adds the page to | ||
111 | the per cgroup LRU. | ||
112 | |||
113 | 2.2.1 Accounting details | ||
114 | |||
115 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. | ||
116 | (Pages that are never reclaimable and will not be on the global LRU | ||
117 | are not accounted; only pages under the usual VM management are accounted.) | ||
118 | |||
119 | RSS pages are accounted at page_fault unless they've already been accounted | ||
120 | for earlier. A file page will be accounted for as Page Cache when it's | ||
121 | inserted into inode (radix-tree). While it's mapped into the page tables of | ||
122 | processes, duplicate accounting is carefully avoided. | ||
123 | |||
124 | An RSS page is unaccounted when it's fully unmapped. A PageCache page is | ||
125 | unaccounted when it's removed from radix-tree. | ||
126 | |||
127 | At page migration, accounting information is kept. | ||
128 | |||
129 | Note: only pages on the LRU are accounted, because the purpose is to control | ||
130 | the amount of used pages. Pages not on the LRU tend to be out of the VM's control. | ||
131 | |||
132 | 2.3 Shared Page Accounting | ||
133 | |||
134 | Shared pages are accounted on the basis of the first touch approach. The | ||
135 | cgroup that first touches a page is accounted for the page. The principle | ||
136 | behind this approach is that a cgroup that aggressively uses a shared | ||
137 | page will eventually get charged for it (once it is uncharged from | ||
138 | the cgroup that brought it in -- this will happen on memory pressure). | ||
139 | |||
140 | Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used. | ||
141 | When you do swapoff and force the swapped-out pages of shmem (tmpfs) | ||
142 | back into memory, charges for those pages are accounted against the | ||
143 | caller of swapoff rather than the users of shmem. | ||
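The first touch approach can be illustrated with a toy model. This is a sketch under assumptions, not kernel code: the class and method names are hypothetical, and a page is just a dictionary key. It shows that a second cgroup touching an already charged page is not charged, and that it picks up the charge only after the page has been uncharged from the first cgroup.

```python
class FirstTouchAccounting:
    """Toy model: each shared page is charged to whoever touched it first."""

    def __init__(self):
        self.owner = {}  # page -> cgroup currently charged for the page

    def touch(self, page, cgroup):
        """Charge page to cgroup only if no cgroup is charged for it yet.

        Returns the cgroup that is charged for the page after the touch.
        """
        return self.owner.setdefault(page, cgroup)

    def uncharge(self, page):
        """Drop the charge, for example after reclaim under memory pressure."""
        self.owner.pop(page, None)


acct = FirstTouchAccounting()
first = acct.touch("shared-page", "A")    # A touches first and is charged
second = acct.touch("shared-page", "B")   # B touches later, A stays charged
acct.uncharge("shared-page")              # page reclaimed from A
third = acct.touch("shared-page", "B")    # B touches again and is now charged
```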
144 | |||
145 | |||
146 | 2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) | ||
147 | Swap Extension allows you to record the charge for swap. A swapped-in page is | ||
148 | charged back to the original page allocator if possible. | ||
149 | |||
150 | When swap is accounted, the following files are added: | ||
151 | - memory.memsw.usage_in_bytes | ||
152 | - memory.memsw.limit_in_bytes | ||
153 | |||
154 | The usage of mem+swap is limited by memsw.limit_in_bytes. | ||
155 | |||
156 | Note: why 'mem+swap' rather than swap? | ||
157 | The global LRU (kswapd) can swap out arbitrary pages. Swap-out moves the | ||
158 | charge from memory to swap, so there is no change in the usage of | ||
159 | mem+swap. | ||
160 | |||
161 | In other words, when we want to limit the usage of swap without affecting | ||
162 | the global LRU, a mem+swap limit is better than just limiting swap from the | ||
163 | OS point of view. | ||
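The reasoning for a combined mem+swap limit can be checked with a few lines of arithmetic. The following is a toy sketch; the class and attribute names are assumptions made for illustration. It shows that swapping a page out moves its charge from the memory side to the swap side and leaves the mem+swap total unchanged.

```python
class MemSwCounters:
    """Toy model of separate memory and swap charges with a combined total."""

    def __init__(self, mem=0, swap=0):
        self.mem = mem    # bytes charged as memory
        self.swap = swap  # bytes charged as swap

    @property
    def memsw(self):
        """Combined mem+swap usage."""
        return self.mem + self.swap

    def swap_out(self, nbytes):
        """Move up to nbytes of charge from memory to swap."""
        moved = min(nbytes, self.mem)
        self.mem -= moved
        self.swap += moved
        return moved


counters = MemSwCounters(mem=2 * 4096)
before = counters.memsw
counters.swap_out(4096)
after = counters.memsw   # same as before: only the split changed
```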
164 | |||
165 | 2.5 Reclaim | ||
166 | |||
167 | Each cgroup maintains a per cgroup LRU that consists of an active | ||
168 | and inactive list. When a cgroup goes over its limit, we first try | ||
169 | to reclaim memory from the cgroup so as to make space for the new | ||
170 | pages that the cgroup has touched. If the reclaim is unsuccessful, | ||
171 | an OOM routine is invoked to select and kill the bulkiest task in the | ||
172 | cgroup. | ||
173 | |||
174 | The reclaim algorithm has not been modified for cgroups, except that | ||
175 | pages that are selected for reclaiming come from the per cgroup LRU | ||
176 | list. | ||
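The charge, reclaim and OOM sequence can be sketched as a toy model. This is a simplification under stated assumptions: a single LRU stands in for the active and inactive lists, limits are counted in pages, an unreclaimable page is marked with a flag, and the OOM step is reduced to a return value. All names are hypothetical.

```python
from collections import OrderedDict


class ToyCgroup:
    """Toy per-cgroup LRU: charge a page, reclaiming old pages when full."""

    def __init__(self, limit_pages):
        self.limit_pages = limit_pages
        self.lru = OrderedDict()  # page -> reclaimable flag, oldest first

    def _reclaim_one(self):
        """Evict the oldest reclaimable page; return False if there is none."""
        victim = next((p for p, ok in self.lru.items() if ok), None)
        if victim is None:
            return False
        del self.lru[victim]
        return True

    def charge(self, page, reclaimable=True):
        """Charge a page; reclaim from this cgroup's own LRU if at the limit."""
        while len(self.lru) >= self.limit_pages:
            if not self._reclaim_one():
                return "oom"  # reclaim failed; the real controller invokes OOM
        self.lru[page] = reclaimable
        return "charged"


group = ToyCgroup(limit_pages=2)
outcomes = [group.charge(p) for p in ("a", "b", "c")]  # "a" is evicted for "c"
```

Note that reclaim here only ever looks at the cgroup's own list, which mirrors the statement above that the pages selected for reclaim come from the per cgroup LRU.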
177 | |||
178 | 2.6 Locking | ||
179 | |||
180 | The memory controller uses the following locking hierarchy: | ||
181 | |||
182 | 1. zone->lru_lock is used for selecting pages to be isolated | ||
183 | 2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) | ||
184 | 3. lock_page_cgroup() is used to protect page->page_cgroup | ||
185 | |||
186 | 3. User Interface | ||
187 | |||
188 | 0. Configuration | ||
189 | |||
190 | a. Enable CONFIG_CGROUPS | ||
191 | b. Enable CONFIG_RESOURCE_COUNTERS | ||
192 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR | ||
193 | |||
194 | 1. Prepare the cgroups | ||
195 | # mkdir -p /cgroups | ||
196 | # mount -t cgroup none /cgroups -o memory | ||
197 | |||
198 | 2. Make the new group and move bash into it | ||
199 | # mkdir /cgroups/0 | ||
200 | # echo $$ > /cgroups/0/tasks | ||
201 | |||
202 | Since we're now in the 0 cgroup, | ||
203 | we can alter the memory limit: | ||
204 | # echo 4M > /cgroups/0/memory.limit_in_bytes | ||
205 | |||
206 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, | ||
207 | mega or gigabytes. | ||
208 | |||
209 | # cat /cgroups/0/memory.limit_in_bytes | ||
210 | 4194304 | ||
211 | |||
212 | NOTE: The interface has now changed to display the usage in bytes | ||
213 | instead of pages | ||
214 | |||
215 | We can check the usage: | ||
216 | # cat /cgroups/0/memory.usage_in_bytes | ||
217 | 1216512 | ||
218 | |||
219 | A successful write to this file does not guarantee that the limit is set | ||
220 | to the value written into the file. This can be due to a | ||
221 | number of factors, such as rounding up to page boundaries or the total | ||
222 | availability of memory on the system. The user is required to re-read | ||
223 | this file after a write to learn the value committed by the kernel. | ||
224 | |||
225 | # echo 1 > memory.limit_in_bytes | ||
226 | # cat memory.limit_in_bytes | ||
227 | 4096 | ||
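The two effects shown above, the size suffix and the rounding up to a page boundary, can be reproduced with a short sketch. This is an illustration, not the kernel's parser: the function name is hypothetical, a 4096-byte page is assumed, and only the k, m and g suffixes mentioned in this document are handled.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

_SUFFIX = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_limit(text, page_size=PAGE_SIZE):
    """Convert a string such as '4M' to bytes, rounded up to a whole page."""
    text = text.strip()
    scale = 1
    if text and text[-1].lower() in _SUFFIX:
        scale = _SUFFIX[text[-1].lower()]
        text = text[:-1]
    nbytes = int(text) * scale
    # Round up to the next multiple of page_size using ceiling division.
    return -(-nbytes // page_size) * page_size


four_meg = parse_limit("4M")   # 4194304, already page aligned
one_byte = parse_limit("1")    # 4096, rounded up to one page
```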
228 | |||
229 | The memory.failcnt field gives the number of times that the cgroup limit was | ||
230 | exceeded. | ||
231 | |||
232 | The memory.stat file gives accounting information. Currently it shows the | ||
233 | amount of cache, RSS, and active/inactive pages. | ||
234 | |||
235 | 4. Testing | ||
236 | |||
237 | Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | ||
238 | Apart from that v6 has been tested with several applications and regular | ||
239 | daily use. The controller has also been tested on the PPC64, x86_64 and | ||
240 | UML platforms. | ||
241 | |||
242 | 4.1 Troubleshooting | ||
243 | |||
244 | Sometimes a user might find that the application under a cgroup is | ||
245 | terminated. There are several causes for this: | ||
246 | |||
247 | 1. The cgroup limit is too low (just too low to do anything useful) | ||
248 | 2. The user is using anonymous memory and swap is turned off or too low | ||
249 | |||
250 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of | ||
251 | some of the pages cached in the cgroup (page cache pages). | ||
252 | |||
253 | 4.2 Task migration | ||
254 | |||
255 | When a task migrates from one cgroup to another, its charge is not | ||
256 | carried forward. The pages allocated from the original cgroup still | ||
257 | remain charged to it; the charge is dropped when the page is freed or | ||
258 | reclaimed. | ||
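The migration rule can be pictured with a toy model in which a charge belongs to the page, not to the task. This is a sketch under assumptions: the class and method names are hypothetical, and usage is counted in pages.

```python
class ToyAccounting:
    """Toy model: pages stay charged to the cgroup that allocated them."""

    def __init__(self):
        self.task_cgroup = {}  # task -> cgroup the task currently belongs to
        self.page_owner = {}   # page -> cgroup charged for the page

    def attach(self, task, cgroup):
        """Move a task into a cgroup; existing charges are left untouched."""
        self.task_cgroup[task] = cgroup

    def allocate(self, task, page):
        """Charge a new page to the task's current cgroup."""
        self.page_owner[page] = self.task_cgroup[task]

    def free(self, page):
        """Drop the charge when the page is freed or reclaimed."""
        self.page_owner.pop(page, None)

    def usage(self, cgroup):
        """Number of pages currently charged to cgroup."""
        return sum(1 for owner in self.page_owner.values() if owner == cgroup)


acct = ToyAccounting()
acct.attach("bash", "A")
acct.allocate("bash", "p1")   # charged to A
acct.attach("bash", "B")      # migration: p1 remains charged to A
acct.allocate("bash", "p2")   # new allocations are charged to B
```

This also shows why a cgroup can still hold charges after all of its tasks have moved away.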
259 | |||
260 | 4.3 Removing a cgroup | ||
261 | |||
262 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | ||
263 | cgroup might have some charge associated with it, even though all | ||
264 | tasks have migrated away from it. | ||
265 | Such charges are freed (by default) or moved to the parent. When moved, | ||
266 | both RSS and cache charges are moved to the parent. | ||
267 | If both of them are busy, rmdir() returns -EBUSY. See also 5.1. | ||
268 | |||
269 | Charges recorded in swap information are not updated at removal of a cgroup. | ||
270 | The recorded information is discarded, and a cgroup which uses the swap (swapcache) | ||
271 | will be charged as its new owner. | ||
272 | |||
273 | |||
274 | 5. Misc. interfaces. | ||
275 | |||
276 | 5.1 force_empty | ||
277 | The memory.force_empty interface is provided to make a cgroup's memory usage empty. | ||
278 | You can use this interface only when the cgroup has no tasks. | ||
279 | When anything is written to this file: | ||
280 | |||
281 | # echo 0 > memory.force_empty | ||
282 | |||
283 | Almost all pages tracked by this memcg will be unmapped and freed. Some | ||
284 | pages cannot be freed because they are locked or in use. Such pages are moved | ||
285 | to the parent, and this cgroup will be empty. However, this may return -EBUSY | ||
286 | in some overly busy cases. | ||
287 | |||
288 | A typical use case of this interface is calling it before rmdir(). | ||
289 | Because rmdir() moves all pages to the parent, some out-of-use page caches can be | ||
290 | moved to the parent. If you want to avoid that, force_empty will be useful. | ||
291 | |||
292 | 5.2 stat file | ||
293 | The memory.stat file currently includes the following statistics: | ||
294 | cache - # of pages from page-cache and shmem. | ||
295 | rss - # of pages from anonymous memory. | ||
296 | pgpgin - # of charging events | ||
297 | pgpgout - # of uncharging events | ||
298 | active_anon - # of pages on active lru of anon, shmem. | ||
299 | inactive_anon - # of pages on inactive lru of anon, shmem | ||
300 | active_file - # of pages on active lru of file-cache | ||
301 | inactive_file - # of pages on inactive lru of file cache | ||
302 | unevictable - # of pages that cannot be reclaimed (mlocked etc.) | ||
303 | |||
304 | The statistics below depend on CONFIG_DEBUG_VM. | ||
305 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) | ||
306 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | ||
307 | recent_rotated_file - VM internal parameter. (see mm/vmscan.c) | ||
308 | recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | ||
309 | recent_scanned_file - VM internal parameter. (see mm/vmscan.c) | ||
310 | |||
311 | Memo: | ||
312 | recent_rotated means the recent frequency of lru rotation. | ||
313 | recent_scanned means the recent # of scans of the lru. | ||
314 | These are shown to aid debugging; please see the code for their meanings. | ||
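A script that consumes memory.stat might parse it along these lines. This is a hedged sketch: it assumes the file holds one "name value" pair per line, the function name is hypothetical, and the sample text uses made-up numbers, not output captured from a real system.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines into a dict of integers, skipping other lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or unexpected lines
        name, value = parts
        stats[name] = int(value)
    return stats


# Illustrative sample only; the numbers are invented for this example.
SAMPLE = """\
cache 1048576
rss 1216512
pgpgin 900
pgpgout 347
"""

stats = parse_memory_stat(SAMPLE)
```

On a system set up as in section 3, the text would come from reading /cgroups/0/memory.stat instead of the SAMPLE string.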
315 | |||
316 | |||
317 | 5.3 swappiness | ||
318 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | ||
319 | |||
320 | The swappiness of the following cgroups can't be changed: | ||
321 | - root cgroup (uses /proc/sys/vm/swappiness). | ||
322 | - a cgroup which uses hierarchy and has a child cgroup. | ||
323 | - a cgroup which uses hierarchy and is not the root of the hierarchy. | ||
324 | |||
325 | |||
326 | 6. Hierarchy support | ||
327 | |||
328 | The memory controller supports a deep hierarchy and hierarchical accounting. | ||
329 | The hierarchy is created by creating the appropriate cgroups in the | ||
330 | cgroup filesystem. Consider, for example, the following cgroup filesystem | ||
331 | hierarchy: | ||
332 | |||
333 | root | ||
334 | / | \ | ||
335 | / | \ | ||
336 | a b c | ||
337 | | \ | ||
338 | | \ | ||
339 | d e | ||
340 | |||
341 | In the diagram above, with hierarchical accounting enabled, all memory | ||
342 | usage of e is accounted to its ancestors, up to the root (i.e., c and root), | ||
343 | that have memory.use_hierarchy enabled. If one of the ancestors goes over its | ||
344 | limit, the reclaim algorithm reclaims from the tasks in the ancestor and the | ||
345 | children of the ancestor. | ||
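Hierarchical accounting for the tree above can be sketched as a toy model. The sketch assumes that memory.use_hierarchy is enabled on every group, uses hypothetical class and function names, and reports the group whose limit would be exceeded instead of performing reclaim.

```python
class Group:
    """Toy cgroup node with an optional limit and a parent pointer."""

    def __init__(self, name, parent=None, limit=None):
        self.name = name
        self.parent = parent
        self.limit = limit  # None means no limit
        self.usage = 0

    def self_and_ancestors(self):
        """Yield this group, then each ancestor up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


def charge(group, nbytes):
    """Charge nbytes to group and all its ancestors.

    Returns None on success, or the first group (closest to the charged
    group) whose limit would be exceeded; nothing is charged in that case.
    """
    chain = list(group.self_and_ancestors())
    for node in chain:
        if node.limit is not None and node.usage + nbytes > node.limit:
            return node
    for node in chain:
        node.usage += nbytes
    return None


root = Group("root")
a = Group("a", parent=root)
c = Group("c", parent=root, limit=2 * 4096)
e = Group("e", parent=c)
blocked_by = charge(e, 4096)  # None: charged to e, c and root, but not to a
```

In this model, a further charge of two pages to e would be stopped at c, which is where the real controller would start reclaiming from c and its children.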
346 | |||
347 | 6.1 Enabling hierarchical accounting and reclaim | ||
348 | |||
349 | The memory controller by default disables the hierarchy feature. Support | ||
350 | can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup: | ||
351 | |||
352 | # echo 1 > memory.use_hierarchy | ||
353 | |||
354 | The feature can be disabled by: | ||
355 | |||
356 | # echo 0 > memory.use_hierarchy | ||
357 | |||
358 | NOTE1: Enabling/disabling will fail if the cgroup already has other | ||
359 | cgroups created below it. | ||
360 | |||
361 | NOTE2: This feature can be enabled/disabled per subtree. | ||
362 | |||
363 | 7. TODO | ||
364 | |||
365 | 1. Add support for accounting huge pages (as a separate controller) | ||
366 | 2. Make per-cgroup scanner reclaim not-shared pages first | ||
367 | 3. Teach controller to account for shared-pages | ||
368 | 4. Start reclamation in the background when the limit is | ||
369 | not yet hit but the usage is getting closer | ||
370 | |||
371 | Summary | ||
372 | |||
373 | Overall, the memory controller has been a stable controller and has been | ||
374 | commented on and discussed quite extensively in the community. | ||
375 | |||
376 | References | ||
377 | |||
378 | 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ | ||
379 | 2. Singh, Balbir. Memory Controller (RSS Control), | ||
380 | http://lwn.net/Articles/222762/ | ||
381 | 3. Emelianov, Pavel. Resource controllers based on process cgroups | ||
382 | http://lkml.org/lkml/2007/3/6/198 | ||
383 | 4. Emelianov, Pavel. RSS controller based on process cgroups (v2) | ||
384 | http://lkml.org/lkml/2007/4/9/78 | ||
385 | 5. Emelianov, Pavel. RSS controller based on process cgroups (v3) | ||
386 | http://lkml.org/lkml/2007/5/30/244 | ||
387 | 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ | ||
388 | 7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control | ||
389 | subsystem (v3), http://lwn.net/Articles/235534/ | ||
390 | 8. Singh, Balbir. RSS controller v2 test results (lmbench), | ||
391 | http://lkml.org/lkml/2007/5/17/232 | ||
392 | 9. Singh, Balbir. RSS controller v2 AIM9 results | ||
393 | http://lkml.org/lkml/2007/5/18/1 | ||
394 | 10. Singh, Balbir. Memory controller v6 test results, | ||
395 | http://lkml.org/lkml/2007/8/19/36 | ||
396 | 11. Singh, Balbir. Memory controller introduction (v6), | ||
397 | http://lkml.org/lkml/2007/8/17/69 | ||
398 | 12. Corbet, Jonathan, Controlling memory use in cgroups, | ||
399 | http://lwn.net/Articles/243795/ | ||