Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cgroups/memory.txt | 90 |
1 files changed, 45 insertions, 45 deletions
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4372e6b8a353..c07f7b4fb88d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -18,16 +18,16 @@ from the rest of the system. The article on LWN [12] mentions some probable | |||
18 | uses of the memory controller. The memory controller can be used to | 18 | uses of the memory controller. The memory controller can be used to |
19 | 19 | ||
20 | a. Isolate an application or a group of applications | 20 | a. Isolate an application or a group of applications |
21 | Memory hungry applications can be isolated and limited to a smaller | 21 | Memory-hungry applications can be isolated and limited to a smaller |
22 | amount of memory. | 22 | amount of memory. |
23 | b. Create a cgroup with limited amount of memory, this can be used | 23 | b. Create a cgroup with a limited amount of memory; this can be used |
24 | as a good alternative to booting with mem=XXXX. | 24 | as a good alternative to booting with mem=XXXX. |
25 | c. Virtualization solutions can control the amount of memory they want | 25 | c. Virtualization solutions can control the amount of memory they want |
26 | to assign to a virtual machine instance. | 26 | to assign to a virtual machine instance. |
27 | d. A CD/DVD burner could control the amount of memory used by the | 27 | d. A CD/DVD burner could control the amount of memory used by the |
28 | rest of the system to ensure that burning does not fail due to lack | 28 | rest of the system to ensure that burning does not fail due to lack |
29 | of available memory. | 29 | of available memory. |
30 | e. There are several other use cases, find one or use the controller just | 30 | e. There are several other use cases; find one or use the controller just |
31 | for fun (to learn and hack on the VM subsystem). | 31 | for fun (to learn and hack on the VM subsystem). |
32 | 32 | ||
33 | Current Status: linux-2.6.34-mmotm(development version of 2010/April) | 33 | Current Status: linux-2.6.34-mmotm(development version of 2010/April) |
@@ -38,12 +38,12 @@ Features: | |||
38 | - optionally, memory+swap usage can be accounted and limited. | 38 | - optionally, memory+swap usage can be accounted and limited. |
39 | - hierarchical accounting | 39 | - hierarchical accounting |
40 | - soft limit | 40 | - soft limit |
41 | - moving(recharging) account at moving a task is selectable. | 41 | - moving (recharging) account at moving a task is selectable. |
42 | - usage threshold notifier | 42 | - usage threshold notifier |
43 | - oom-killer disable knob and oom-notifier | 43 | - oom-killer disable knob and oom-notifier |
44 | - Root cgroup has no limit controls. | 44 | - Root cgroup has no limit controls. |
45 | 45 | ||
46 | Kernel memory support is work in progress, and the current version provides | 46 | Kernel memory support is a work in progress, and the current version provides |
47 | basically functionality. (See Section 2.7) | 47 | basically functionality. (See Section 2.7) |
48 | 48 | ||
49 | Brief summary of control files. | 49 | Brief summary of control files. |
@@ -144,9 +144,9 @@ Figure 1 shows the important aspects of the controller | |||
144 | 3. Each page has a pointer to the page_cgroup, which in turn knows the | 144 | 3. Each page has a pointer to the page_cgroup, which in turn knows the |
145 | cgroup it belongs to | 145 | cgroup it belongs to |
146 | 146 | ||
147 | The accounting is done as follows: mem_cgroup_charge() is invoked to setup | 147 | The accounting is done as follows: mem_cgroup_charge() is invoked to set up |
148 | the necessary data structures and check if the cgroup that is being charged | 148 | the necessary data structures and check if the cgroup that is being charged |
149 | is over its limit. If it is then reclaim is invoked on the cgroup. | 149 | is over its limit. If it is, then reclaim is invoked on the cgroup. |
150 | More details can be found in the reclaim section of this document. | 150 | More details can be found in the reclaim section of this document. |
151 | If everything goes well, a page meta-data-structure called page_cgroup is | 151 | If everything goes well, a page meta-data-structure called page_cgroup is |
152 | updated. page_cgroup has its own LRU on cgroup. | 152 | updated. page_cgroup has its own LRU on cgroup. |
@@ -163,13 +163,13 @@ for earlier. A file page will be accounted for as Page Cache when it's | |||
163 | inserted into inode (radix-tree). While it's mapped into the page tables of | 163 | inserted into inode (radix-tree). While it's mapped into the page tables of |
164 | processes, duplicate accounting is carefully avoided. | 164 | processes, duplicate accounting is carefully avoided. |
165 | 165 | ||
166 | A RSS page is unaccounted when it's fully unmapped. A PageCache page is | 166 | An RSS page is unaccounted when it's fully unmapped. A PageCache page is |
167 | unaccounted when it's removed from radix-tree. Even if RSS pages are fully | 167 | unaccounted when it's removed from radix-tree. Even if RSS pages are fully |
168 | unmapped (by kswapd), they may exist as SwapCache in the system until they | 168 | unmapped (by kswapd), they may exist as SwapCache in the system until they |
169 | are really freed. Such SwapCaches also also accounted. | 169 | are really freed. Such SwapCaches are also accounted. |
170 | A swapped-in page is not accounted until it's mapped. | 170 | A swapped-in page is not accounted until it's mapped. |
171 | 171 | ||
172 | Note: The kernel does swapin-readahead and read multiple swaps at once. | 172 | Note: The kernel does swapin-readahead and reads multiple swaps at once. |
173 | This means swapped-in pages may contain pages for other tasks than a task | 173 | This means swapped-in pages may contain pages for other tasks than a task |
174 | causing page fault. So, we avoid accounting at swap-in I/O. | 174 | causing page fault. So, we avoid accounting at swap-in I/O. |
175 | 175 | ||
@@ -209,7 +209,7 @@ memsw.limit_in_bytes. | |||
209 | Example: Assume a system with 4G of swap. A task which allocates 6G of memory | 209 | Example: Assume a system with 4G of swap. A task which allocates 6G of memory |
210 | (by mistake) under 2G memory limitation will use all swap. | 210 | (by mistake) under 2G memory limitation will use all swap. |
211 | In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. | 211 | In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. |
212 | By using memsw limit, you can avoid system OOM which can be caused by swap | 212 | By using the memsw limit, you can avoid system OOM which can be caused by swap |
213 | shortage. | 213 | shortage. |
214 | 214 | ||
215 | * why 'memory+swap' rather than swap. | 215 | * why 'memory+swap' rather than swap. |
@@ -217,7 +217,7 @@ The global LRU(kswapd) can swap out arbitrary pages. Swap-out means | |||
217 | to move account from memory to swap...there is no change in usage of | 217 | to move account from memory to swap...there is no change in usage of |
218 | memory+swap. In other words, when we want to limit the usage of swap without | 218 | memory+swap. In other words, when we want to limit the usage of swap without |
219 | affecting global LRU, memory+swap limit is better than just limiting swap from | 219 | affecting global LRU, memory+swap limit is better than just limiting swap from |
220 | OS point of view. | 220 | an OS point of view. |
221 | 221 | ||
222 | * What happens when a cgroup hits memory.memsw.limit_in_bytes | 222 | * What happens when a cgroup hits memory.memsw.limit_in_bytes |
223 | When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out | 223 | When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out |
@@ -236,7 +236,7 @@ an OOM routine is invoked to select and kill the bulkiest task in the | |||
236 | cgroup. (See 10. OOM Control below.) | 236 | cgroup. (See 10. OOM Control below.) |
237 | 237 | ||
238 | The reclaim algorithm has not been modified for cgroups, except that | 238 | The reclaim algorithm has not been modified for cgroups, except that |
239 | pages that are selected for reclaiming come from the per cgroup LRU | 239 | pages that are selected for reclaiming come from the per-cgroup LRU |
240 | list. | 240 | list. |
241 | 241 | ||
242 | NOTE: Reclaim does not work for the root cgroup, since we cannot set any | 242 | NOTE: Reclaim does not work for the root cgroup, since we cannot set any |
@@ -316,7 +316,7 @@ We can check the usage: | |||
316 | # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes | 316 | # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes |
317 | 1216512 | 317 | 1216512 |
318 | 318 | ||
319 | A successful write to this file does not guarantee a successful set of | 319 | A successful write to this file does not guarantee a successful setting of |
320 | this limit to the value written into the file. This can be due to a | 320 | this limit to the value written into the file. This can be due to a |
321 | number of factors, such as rounding up to page boundaries or the total | 321 | number of factors, such as rounding up to page boundaries or the total |
322 | availability of memory on the system. The user is required to re-read | 322 | availability of memory on the system. The user is required to re-read |
@@ -350,7 +350,7 @@ Trying usual test under memory controller is always helpful. | |||
350 | 4.1 Troubleshooting | 350 | 4.1 Troubleshooting |
351 | 351 | ||
352 | Sometimes a user might find that the application under a cgroup is | 352 | Sometimes a user might find that the application under a cgroup is |
353 | terminated by OOM killer. There are several causes for this: | 353 | terminated by the OOM killer. There are several causes for this: |
354 | 354 | ||
355 | 1. The cgroup limit is too low (just too low to do anything useful) | 355 | 1. The cgroup limit is too low (just too low to do anything useful) |
356 | 2. The user is using anonymous memory and swap is turned off or too low | 356 | 2. The user is using anonymous memory and swap is turned off or too low |
@@ -358,7 +358,7 @@ terminated by OOM killer. There are several causes for this: | |||
358 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of | 358 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of |
359 | some of the pages cached in the cgroup (page cache pages). | 359 | some of the pages cached in the cgroup (page cache pages). |
360 | 360 | ||
361 | To know what happens, disable OOM_Kill by 10. OOM Control(see below) and | 361 | To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and |
362 | seeing what happens will be helpful. | 362 | seeing what happens will be helpful. |
363 | 363 | ||
364 | 4.2 Task migration | 364 | 4.2 Task migration |
@@ -399,10 +399,10 @@ About use_hierarchy, see Section 6. | |||
399 | 399 | ||
400 | Almost all pages tracked by this memory cgroup will be unmapped and freed. | 400 | Almost all pages tracked by this memory cgroup will be unmapped and freed. |
401 | Some pages cannot be freed because they are locked or in-use. Such pages are | 401 | Some pages cannot be freed because they are locked or in-use. Such pages are |
402 | moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this | 402 | moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this |
403 | cgroup will be empty. | 403 | cgroup will be empty. |
404 | 404 | ||
405 | Typical use case of this interface is that calling this before rmdir(). | 405 | The typical use case for this interface is before calling rmdir(). |
406 | Because rmdir() moves all pages to parent, some out-of-use page caches can be | 406 | Because rmdir() moves all pages to parent, some out-of-use page caches can be |
407 | moved to the parent. If you want to avoid that, force_empty will be useful. | 407 | moved to the parent. If you want to avoid that, force_empty will be useful. |
408 | 408 | ||
@@ -486,7 +486,7 @@ You can reset failcnt by writing 0 to failcnt file. | |||
486 | 486 | ||
487 | For efficiency, as other kernel components, memory cgroup uses some optimization | 487 | For efficiency, as other kernel components, memory cgroup uses some optimization |
488 | to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the | 488 | to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the |
489 | method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz | 489 | method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz |
490 | value for efficient access. (Of course, when necessary, it's synchronized.) | 490 | value for efficient access. (Of course, when necessary, it's synchronized.) |
491 | If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) | 491 | If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) |
492 | value in memory.stat(see 5.2). | 492 | value in memory.stat(see 5.2). |
@@ -496,8 +496,8 @@ value in memory.stat(see 5.2). | |||
496 | This is similar to numa_maps but operates on a per-memcg basis. This is | 496 | This is similar to numa_maps but operates on a per-memcg basis. This is |
497 | useful for providing visibility into the numa locality information within | 497 | useful for providing visibility into the numa locality information within |
498 | an memcg since the pages are allowed to be allocated from any physical | 498 | an memcg since the pages are allowed to be allocated from any physical |
499 | node. One of the usecases is evaluating application performance by | 499 | node. One of the use cases is evaluating application performance by |
500 | combining this information with the application's cpu allocation. | 500 | combining this information with the application's CPU allocation. |
501 | 501 | ||
502 | We export "total", "file", "anon" and "unevictable" pages per-node for | 502 | We export "total", "file", "anon" and "unevictable" pages per-node for |
503 | each memcg. The ouput format of memory.numa_stat is: | 503 | each memcg. The ouput format of memory.numa_stat is: |
@@ -561,10 +561,10 @@ are pushed back to their soft limits. If the soft limit of each control | |||
561 | group is very high, they are pushed back as much as possible to make | 561 | group is very high, they are pushed back as much as possible to make |
562 | sure that one control group does not starve the others of memory. | 562 | sure that one control group does not starve the others of memory. |
563 | 563 | ||
564 | Please note that soft limits is a best effort feature, it comes with | 564 | Please note that soft limits is a best-effort feature; it comes with |
565 | no guarantees, but it does its best to make sure that when memory is | 565 | no guarantees, but it does its best to make sure that when memory is |
566 | heavily contended for, memory is allocated based on the soft limit | 566 | heavily contended for, memory is allocated based on the soft limit |
567 | hints/setup. Currently soft limit based reclaim is setup such that | 567 | hints/setup. Currently soft limit based reclaim is set up such that |
568 | it gets invoked from balance_pgdat (kswapd). | 568 | it gets invoked from balance_pgdat (kswapd). |
569 | 569 | ||
570 | 7.1 Interface | 570 | 7.1 Interface |
@@ -592,7 +592,7 @@ page tables. | |||
592 | 592 | ||
593 | 8.1 Interface | 593 | 8.1 Interface |
594 | 594 | ||
595 | This feature is disabled by default. It can be enabled(and disabled again) by | 595 | This feature is disabled by default. It can be enabled (and disabled again) by |
596 | writing to memory.move_charge_at_immigrate of the destination cgroup. | 596 | writing to memory.move_charge_at_immigrate of the destination cgroup. |
597 | 597 | ||
598 | If you want to enable it: | 598 | If you want to enable it: |
@@ -601,8 +601,8 @@ If you want to enable it: | |||
601 | 601 | ||
602 | Note: Each bits of move_charge_at_immigrate has its own meaning about what type | 602 | Note: Each bits of move_charge_at_immigrate has its own meaning about what type |
603 | of charges should be moved. See 8.2 for details. | 603 | of charges should be moved. See 8.2 for details. |
604 | Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread | 604 | Note: Charges are moved only when you move mm->owner, in other words, |
605 | group. | 605 | a leader of a thread group. |
606 | Note: If we cannot find enough space for the task in the destination cgroup, we | 606 | Note: If we cannot find enough space for the task in the destination cgroup, we |
607 | try to make space by reclaiming memory. Task migration may fail if we | 607 | try to make space by reclaiming memory. Task migration may fail if we |
608 | cannot make enough space. | 608 | cannot make enough space. |
@@ -612,25 +612,25 @@ And if you want disable it again: | |||
612 | 612 | ||
613 | # echo 0 > memory.move_charge_at_immigrate | 613 | # echo 0 > memory.move_charge_at_immigrate |
614 | 614 | ||
615 | 8.2 Type of charges which can be move | 615 | 8.2 Type of charges which can be moved |
616 | 616 | ||
617 | Each bits of move_charge_at_immigrate has its own meaning about what type of | 617 | Each bit in move_charge_at_immigrate has its own meaning about what type of |
618 | charges should be moved. But in any cases, it must be noted that an account of | 618 | charges should be moved. But in any case, it must be noted that an account of |
619 | a page or a swap can be moved only when it is charged to the task's current(old) | 619 | a page or a swap can be moved only when it is charged to the task's current |
620 | memory cgroup. | 620 | (old) memory cgroup. |
621 | 621 | ||
622 | bit | what type of charges would be moved ? | 622 | bit | what type of charges would be moved ? |
623 | -----+------------------------------------------------------------------------ | 623 | -----+------------------------------------------------------------------------ |
624 | 0 | A charge of an anonymous page(or swap of it) used by the target task. | 624 | 0 | A charge of an anonymous page (or swap of it) used by the target task. |
625 | | You must enable Swap Extension(see 2.4) to enable move of swap charges. | 625 | | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
626 | -----+------------------------------------------------------------------------ | 626 | -----+------------------------------------------------------------------------ |
627 | 1 | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory) | 627 | 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
628 | | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | 628 | | and swaps of tmpfs file) mmapped by the target task. Unlike the case of |
629 | | anonymous pages, file pages(and swaps) in the range mmapped by the task | 629 | | anonymous pages, file pages (and swaps) in the range mmapped by the task |
630 | | will be moved even if the task hasn't done page fault, i.e. they might | 630 | | will be moved even if the task hasn't done page fault, i.e. they might |
631 | | not be the task's "RSS", but other task's "RSS" that maps the same file. | 631 | | not be the task's "RSS", but other task's "RSS" that maps the same file. |
632 | | And mapcount of the page is ignored(the page can be moved even if | 632 | | And mapcount of the page is ignored (the page can be moved even if |
633 | | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to | 633 | | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to |
634 | | enable move of swap charges. | 634 | | enable move of swap charges. |
635 | 635 | ||
636 | 8.3 TODO | 636 | 8.3 TODO |
@@ -640,11 +640,11 @@ memory cgroup. | |||
640 | 640 | ||
641 | 9. Memory thresholds | 641 | 9. Memory thresholds |
642 | 642 | ||
643 | Memory cgroup implements memory thresholds using cgroups notification | 643 | Memory cgroup implements memory thresholds using the cgroups notification |
644 | API (see cgroups.txt). It allows to register multiple memory and memsw | 644 | API (see cgroups.txt). It allows to register multiple memory and memsw |
645 | thresholds and gets notifications when it crosses. | 645 | thresholds and gets notifications when it crosses. |
646 | 646 | ||
647 | To register a threshold application need: | 647 | To register a threshold, an application must: |
648 | - create an eventfd using eventfd(2); | 648 | - create an eventfd using eventfd(2); |
649 | - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; | 649 | - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; |
650 | - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to | 650 | - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to |
@@ -659,24 +659,24 @@ It's applicable for root and non-root cgroup. | |||
659 | 659 | ||
660 | memory.oom_control file is for OOM notification and other controls. | 660 | memory.oom_control file is for OOM notification and other controls. |
661 | 661 | ||
662 | Memory cgroup implements OOM notifier using cgroup notification | 662 | Memory cgroup implements OOM notifier using the cgroup notification |
663 | API (See cgroups.txt). It allows to register multiple OOM notification | 663 | API (See cgroups.txt). It allows to register multiple OOM notification |
664 | delivery and gets notification when OOM happens. | 664 | delivery and gets notification when OOM happens. |
665 | 665 | ||
666 | To register a notifier, application need: | 666 | To register a notifier, an application must: |
667 | - create an eventfd using eventfd(2) | 667 | - create an eventfd using eventfd(2) |
668 | - open memory.oom_control file | 668 | - open memory.oom_control file |
669 | - write string like "<event_fd> <fd of memory.oom_control>" to | 669 | - write string like "<event_fd> <fd of memory.oom_control>" to |
670 | cgroup.event_control | 670 | cgroup.event_control |
671 | 671 | ||
672 | Application will be notified through eventfd when OOM happens. | 672 | The application will be notified through eventfd when OOM happens. |
673 | OOM notification doesn't work for root cgroup. | 673 | OOM notification doesn't work for the root cgroup. |
674 | 674 | ||
675 | You can disable OOM-killer by writing "1" to memory.oom_control file, as: | 675 | You can disable the OOM-killer by writing "1" to memory.oom_control file, as: |
676 | 676 | ||
677 | #echo 1 > memory.oom_control | 677 | #echo 1 > memory.oom_control |
678 | 678 | ||
679 | This operation is only allowed to the top cgroup of sub-hierarchy. | 679 | This operation is only allowed to the top cgroup of a sub-hierarchy. |
680 | If OOM-killer is disabled, tasks under cgroup will hang/sleep | 680 | If OOM-killer is disabled, tasks under cgroup will hang/sleep |
681 | in memory cgroup's OOM-waitqueue when they request accountable memory. | 681 | in memory cgroup's OOM-waitqueue when they request accountable memory. |
682 | 682 | ||