Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/cgroups/memory.txt 90
1 files changed, 45 insertions, 45 deletions
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4372e6b8a353..c07f7b4fb88d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -18,16 +18,16 @@ from the rest of the system. The article on LWN [12] mentions some probable
 uses of the memory controller. The memory controller can be used to
 
 a. Isolate an application or a group of applications
- Memory hungry applications can be isolated and limited to a smaller
+ Memory-hungry applications can be isolated and limited to a smaller
  amount of memory.
-b. Create a cgroup with limited amount of memory, this can be used
+b. Create a cgroup with a limited amount of memory; this can be used
  as a good alternative to booting with mem=XXXX.
 c. Virtualization solutions can control the amount of memory they want
  to assign to a virtual machine instance.
 d. A CD/DVD burner could control the amount of memory used by the
  rest of the system to ensure that burning does not fail due to lack
  of available memory.
-e. There are several other use cases, find one or use the controller just
+e. There are several other use cases; find one or use the controller just
  for fun (to learn and hack on the VM subsystem).
 
 Current Status: linux-2.6.34-mmotm(development version of 2010/April)
@@ -38,12 +38,12 @@ Features:
  - optionally, memory+swap usage can be accounted and limited.
  - hierarchical accounting
  - soft limit
- - moving(recharging) account at moving a task is selectable.
+ - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
- Kernel memory support is work in progress, and the current version provides
+ Kernel memory support is a work in progress, and the current version provides
  basically functionality. (See Section 2.7)
 
 Brief summary of control files.
@@ -144,9 +144,9 @@ Figure 1 shows the important aspects of the controller
 3. Each page has a pointer to the page_cgroup, which in turn knows the
  cgroup it belongs to
 
-The accounting is done as follows: mem_cgroup_charge() is invoked to setup
+The accounting is done as follows: mem_cgroup_charge() is invoked to set up
 the necessary data structures and check if the cgroup that is being charged
-is over its limit. If it is then reclaim is invoked on the cgroup.
+is over its limit. If it is, then reclaim is invoked on the cgroup.
 More details can be found in the reclaim section of this document.
 If everything goes well, a page meta-data-structure called page_cgroup is
 updated. page_cgroup has its own LRU on cgroup.
@@ -163,13 +163,13 @@ for earlier. A file page will be accounted for as Page Cache when it's
 inserted into inode (radix-tree). While it's mapped into the page tables of
 processes, duplicate accounting is carefully avoided.
 
-A RSS page is unaccounted when it's fully unmapped. A PageCache page is
+An RSS page is unaccounted when it's fully unmapped. A PageCache page is
 unaccounted when it's removed from radix-tree. Even if RSS pages are fully
 unmapped (by kswapd), they may exist as SwapCache in the system until they
-are really freed. Such SwapCaches also also accounted.
+are really freed. Such SwapCaches are also accounted.
 A swapped-in page is not accounted until it's mapped.
 
-Note: The kernel does swapin-readahead and read multiple swaps at once.
+Note: The kernel does swapin-readahead and reads multiple swaps at once.
 This means swapped-in pages may contain pages for other tasks than a task
 causing page fault. So, we avoid accounting at swap-in I/O.
 
@@ -209,7 +209,7 @@ memsw.limit_in_bytes.
 Example: Assume a system with 4G of swap. A task which allocates 6G of memory
 (by mistake) under 2G memory limitation will use all swap.
 In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
-By using memsw limit, you can avoid system OOM which can be caused by swap
+By using the memsw limit, you can avoid system OOM which can be caused by swap
 shortage.
 
 * why 'memory+swap' rather than swap.
@@ -217,7 +217,7 @@ The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
 to move account from memory to swap...there is no change in usage of
 memory+swap. In other words, when we want to limit the usage of swap without
 affecting global LRU, memory+swap limit is better than just limiting swap from
-OS point of view.
+an OS point of view.
 
 * What happens when a cgroup hits memory.memsw.limit_in_bytes
 When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
@@ -236,7 +236,7 @@ an OOM routine is invoked to select and kill the bulkiest task in the
 cgroup. (See 10. OOM Control below.)
 
 The reclaim algorithm has not been modified for cgroups, except that
-pages that are selected for reclaiming come from the per cgroup LRU
+pages that are selected for reclaiming come from the per-cgroup LRU
 list.
 
 NOTE: Reclaim does not work for the root cgroup, since we cannot set any
@@ -316,7 +316,7 @@ We can check the usage:
 # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
 1216512
 
-A successful write to this file does not guarantee a successful set of
+A successful write to this file does not guarantee a successful setting of
 this limit to the value written into the file. This can be due to a
 number of factors, such as rounding up to page boundaries or the total
 availability of memory on the system. The user is required to re-read
@@ -350,7 +350,7 @@ Trying usual test under memory controller is always helpful.
 4.1 Troubleshooting
 
 Sometimes a user might find that the application under a cgroup is
-terminated by OOM killer. There are several causes for this:
+terminated by the OOM killer. There are several causes for this:
 
 1. The cgroup limit is too low (just too low to do anything useful)
 2. The user is using anonymous memory and swap is turned off or too low
@@ -358,7 +358,7 @@ terminated by OOM killer. There are several causes for this:
 A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
 some of the pages cached in the cgroup (page cache pages).
 
-To know what happens, disable OOM_Kill by 10. OOM Control(see below) and
+To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
 seeing what happens will be helpful.
 
 4.2 Task migration
@@ -399,10 +399,10 @@ About use_hierarchy, see Section 6.
 
  Almost all pages tracked by this memory cgroup will be unmapped and freed.
  Some pages cannot be freed because they are locked or in-use. Such pages are
- moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
+ moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
  cgroup will be empty.
 
- Typical use case of this interface is that calling this before rmdir().
+ The typical use case for this interface is before calling rmdir().
  Because rmdir() moves all pages to parent, some out-of-use page caches can be
  moved to the parent. If you want to avoid that, force_empty will be useful.
 
@@ -486,7 +486,7 @@ You can reset failcnt by writing 0 to failcnt file.
 
 For efficiency, as other kernel components, memory cgroup uses some optimization
 to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
-method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz
+method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz
 value for efficient access. (Of course, when necessary, it's synchronized.)
 If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
 value in memory.stat(see 5.2).
@@ -496,8 +496,8 @@ value in memory.stat(see 5.2).
 This is similar to numa_maps but operates on a per-memcg basis. This is
 useful for providing visibility into the numa locality information within
 an memcg since the pages are allowed to be allocated from any physical
-node. One of the usecases is evaluating application performance by
-combining this information with the application's cpu allocation.
+node. One of the use cases is evaluating application performance by
+combining this information with the application's CPU allocation.
 
 We export "total", "file", "anon" and "unevictable" pages per-node for
 each memcg. The ouput format of memory.numa_stat is:
@@ -561,10 +561,10 @@ are pushed back to their soft limits. If the soft limit of each control
 group is very high, they are pushed back as much as possible to make
 sure that one control group does not starve the others of memory.
 
-Please note that soft limits is a best effort feature, it comes with
+Please note that soft limits is a best-effort feature; it comes with
 no guarantees, but it does its best to make sure that when memory is
 heavily contended for, memory is allocated based on the soft limit
-hints/setup. Currently soft limit based reclaim is setup such that
+hints/setup. Currently soft limit based reclaim is set up such that
 it gets invoked from balance_pgdat (kswapd).
 
 7.1 Interface
@@ -592,7 +592,7 @@ page tables.
 
 8.1 Interface
 
-This feature is disabled by default. It can be enabled(and disabled again) by
+This feature is disabled by default. It can be enabledi (and disabled again) by
 writing to memory.move_charge_at_immigrate of the destination cgroup.
 
 If you want to enable it:
@@ -601,8 +601,8 @@ If you want to enable it:
 
 Note: Each bits of move_charge_at_immigrate has its own meaning about what type
  of charges should be moved. See 8.2 for details.
-Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread
- group.
+Note: Charges are moved only when you move mm->owner, in other words,
+ a leader of a thread group.
 Note: If we cannot find enough space for the task in the destination cgroup, we
  try to make space by reclaiming memory. Task migration may fail if we
  cannot make enough space.
@@ -612,25 +612,25 @@ And if you want disable it again:
 
 # echo 0 > memory.move_charge_at_immigrate
 
-8.2 Type of charges which can be move
+8.2 Type of charges which can be moved
 
-Each bits of move_charge_at_immigrate has its own meaning about what type of
-charges should be moved. But in any cases, it must be noted that an account of
-a page or a swap can be moved only when it is charged to the task's current(old)
-memory cgroup.
+Each bit in move_charge_at_immigrate has its own meaning about what type of
+charges should be moved. But in any case, it must be noted that an account of
+a page or a swap can be moved only when it is charged to the task's current
+(old) memory cgroup.
 
   bit | what type of charges would be moved ?
  -----+------------------------------------------------------------------------
-   0  | A charge of an anonymous page(or swap of it) used by the target task.
-      | You must enable Swap Extension(see 2.4) to enable move of swap charges.
+   0  | A charge of an anonymous page (or swap of it) used by the target task.
+      | You must enable Swap Extension (see 2.4) to enable move of swap charges.
  -----+------------------------------------------------------------------------
-   1  | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory)
+   1  | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
       | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
-      | anonymous pages, file pages(and swaps) in the range mmapped by the task
+      | anonymous pages, file pages (and swaps) in the range mmapped by the task
       | will be moved even if the task hasn't done page fault, i.e. they might
       | not be the task's "RSS", but other task's "RSS" that maps the same file.
-      | And mapcount of the page is ignored(the page can be moved even if
-      | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to
+      | And mapcount of the page is ignored (the page can be moved even if
+      | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
       | enable move of swap charges.
 
 8.3 TODO
@@ -640,11 +640,11 @@ memory cgroup.
 
 9. Memory thresholds
 
-Memory cgroup implements memory thresholds using cgroups notification
+Memory cgroup implements memory thresholds using the cgroups notification
 API (see cgroups.txt). It allows to register multiple memory and memsw
 thresholds and gets notifications when it crosses.
 
-To register a threshold application need:
+To register a threshold, an application must:
 - create an eventfd using eventfd(2);
 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
@@ -659,24 +659,24 @@ It's applicable for root and non-root cgroup.
 
 memory.oom_control file is for OOM notification and other controls.
 
-Memory cgroup implements OOM notifier using cgroup notification
+Memory cgroup implements OOM notifier using the cgroup notification
 API (See cgroups.txt). It allows to register multiple OOM notification
 delivery and gets notification when OOM happens.
 
-To register a notifier, application need:
+To register a notifier, an application must:
  - create an eventfd using eventfd(2)
  - open memory.oom_control file
  - write string like "<event_fd> <fd of memory.oom_control>" to
  cgroup.event_control
 
-Application will be notified through eventfd when OOM happens.
-OOM notification doesn't work for root cgroup.
+The application will be notified through eventfd when OOM happens.
+OOM notification doesn't work for the root cgroup.
 
-You can disable OOM-killer by writing "1" to memory.oom_control file, as:
+You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
 
  #echo 1 > memory.oom_control
 
-This operation is only allowed to the top cgroup of sub-hierarchy.
+This operation is only allowed to the top cgroup of a sub-hierarchy.
 If OOM-killer is disabled, tasks under cgroup will hang/sleep
 in memory cgroup's OOM-waitqueue when they request accountable memory.
 
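
The registration steps that the patched text lists for memory.oom_control
(and, with a threshold appended, for memory.usage_in_bytes) can be sketched
roughly as below. This is an illustration only, not part of the patch: the
helper names event_control_line() and register_event() are hypothetical, the
cgroup directory is assumed to be a mounted cgroup-v1 memory hierarchy such as
the /sys/fs/cgroup/memory/0 path used elsewhere in memory.txt, and
os.eventfd() is assumed to be available (Python 3.10 or later on Linux).

```python
import os
from typing import Optional, Tuple


def event_control_line(event_fd: int, target_fd: int,
                       threshold: Optional[int] = None) -> str:
    """Build the string written to cgroup.event_control.

    "<event_fd> <fd of target file>" for OOM notification, with
    " <threshold>" appended when registering a memory threshold.
    """
    parts = [str(event_fd), str(target_fd)]
    if threshold is not None:
        parts.append(str(threshold))
    return " ".join(parts)


def register_event(cgroup_dir: str, target_file: str,
                   threshold: Optional[int] = None) -> Tuple[int, int]:
    """Hypothetical helper following the three documented steps.

    Returns (event_fd, target_fd); the caller owns and closes both.
    """
    # Step 1: create an eventfd (assumes Python 3.10+ on Linux).
    event_fd = os.eventfd(0)
    # Step 2: open memory.oom_control or memory.usage_in_bytes.
    target_fd = os.open(os.path.join(cgroup_dir, target_file), os.O_RDONLY)
    # Step 3: write the registration string to cgroup.event_control.
    control_fd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"),
                         os.O_WRONLY)
    try:
        os.write(control_fd,
                 event_control_line(event_fd, target_fd, threshold).encode())
    finally:
        os.close(control_fd)
    return event_fd, target_fd


# Assumed usage (needs sufficient privileges and a cgroup-v1 memory cgroup):
#   efd, tfd = register_event("/sys/fs/cgroup/memory/0", "memory.oom_control")
#   os.eventfd_read(efd)   # blocks until the kernel signals the event
```

The string formatting is kept in its own function so it can be checked
without a cgroup mount; only register_event() touches the filesystem.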