author		Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 03:23:15 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 03:23:15 -0400
commit		9e2d8656f5e8aa214e66b462680cf86b210b74a8 (patch)
tree		f67d62e896cedf75599ea45f9ecf9999c6ad24cd /Documentation
parent		1ea4f4f8405cc1ceec23f2d261bc3775785e6712 (diff)
parent		9e695d2ecc8451cc2c1603d60b5c8e7f5581923a (diff)
Merge branch 'akpm' (Andrew's patch-bomb)
Merge patches from Andrew Morton:
 "A few misc things and very nearly all of the MM tree. A tremendous
  amount of stuff (again), including a significant rbtree library
  rework."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (160 commits)
  sparc64: Support transparent huge pages.
  mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd().
  mm: Add and use update_mmu_cache_pmd() in transparent huge page code.
  sparc64: Document PGD and PMD layout.
  sparc64: Eliminate PTE table memory wastage.
  sparc64: Halve the size of PTE tables
  sparc64: Only support 4MB huge pages and 8KB base pages.
  memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
  mm: memcg: clean up mm_match_cgroup() signature
  mm: document PageHuge somewhat
  mm: use %pK for /proc/vmallocinfo
  mm, thp: fix mlock statistics
  mm, thp: fix mapped pages avoiding unevictable list on mlock
  memory-hotplug: update memory block's state and notify userspace
  memory-hotplug: preparation to notify memory block's state at memory hot remove
  mm: avoid section mismatch warning for memblock_type_name
  make GFP_NOTRACK definition unconditional
  cma: decrease cc.nr_migratepages after reclaiming pagelist
  CMA: migrate mlocked pages
  kpageflags: fix wrong KPF_THP on non-huge compound pages
  ...
Diffstat (limited to 'Documentation')
-rw-r--r--	Documentation/00-INDEX				   2
-rw-r--r--	Documentation/ABI/obsolete/proc-pid-oom_adj	  22
-rw-r--r--	Documentation/cgroups/memory.txt		  90
-rw-r--r--	Documentation/filesystems/proc.txt		  22
-rw-r--r--	Documentation/memory.txt			  33
-rw-r--r--	Documentation/prio_tree.txt			 107
-rw-r--r--	Documentation/rbtree.txt			 209
-rw-r--r--	Documentation/vm/unevictable-lru.txt		  14
8 files changed, 227 insertions, 272 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index 49c051380daf..f54273e2ac97 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -270,8 +270,6 @@ preempt-locking.txt
 	- info on locking under a preemptive kernel.
 printk-formats.txt
 	- how to get printk format specifiers right
-prio_tree.txt
-	- info on radix-priority-search-tree use for indexing vmas.
 ramoops.txt
 	- documentation of the ramoops oops/panic logging module.
 rbtree.txt
diff --git a/Documentation/ABI/obsolete/proc-pid-oom_adj b/Documentation/ABI/obsolete/proc-pid-oom_adj
deleted file mode 100644
index 9a3cb88ade47..000000000000
--- a/Documentation/ABI/obsolete/proc-pid-oom_adj
+++ /dev/null
@@ -1,22 +0,0 @@
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated. The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure. With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness score linearly. This interface will replace
-	/proc/<pid>/oom_adj.
-
-	A warning will be emitted to the kernel log if an application uses this
-	deprecated interface. After it is printed once, future warnings will be
-	suppressed until the kernel is rebooted.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4372e6b8a353..c07f7b4fb88d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -18,16 +18,16 @@ from the rest of the system. The article on LWN [12] mentions some probable
 uses of the memory controller. The memory controller can be used to
 
 a. Isolate an application or a group of applications
-   Memory hungry applications can be isolated and limited to a smaller
+   Memory-hungry applications can be isolated and limited to a smaller
    amount of memory.
-b. Create a cgroup with limited amount of memory, this can be used
+b. Create a cgroup with a limited amount of memory; this can be used
    as a good alternative to booting with mem=XXXX.
 c. Virtualization solutions can control the amount of memory they want
    to assign to a virtual machine instance.
 d. A CD/DVD burner could control the amount of memory used by the
    rest of the system to ensure that burning does not fail due to lack
    of available memory.
-e. There are several other use cases, find one or use the controller just
+e. There are several other use cases; find one or use the controller just
    for fun (to learn and hack on the VM subsystem).
 
 Current Status: linux-2.6.34-mmotm(development version of 2010/April)
@@ -38,12 +38,12 @@ Features:
  - optionally, memory+swap usage can be accounted and limited.
  - hierarchical accounting
  - soft limit
- - moving(recharging) account at moving a task is selectable.
+ - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
- Kernel memory support is work in progress, and the current version provides
+ Kernel memory support is a work in progress, and the current version provides
  basically functionality. (See Section 2.7)
 
 Brief summary of control files.
@@ -144,9 +144,9 @@ Figure 1 shows the important aspects of the controller
 3. Each page has a pointer to the page_cgroup, which in turn knows the
    cgroup it belongs to
 
-The accounting is done as follows: mem_cgroup_charge() is invoked to setup
+The accounting is done as follows: mem_cgroup_charge() is invoked to set up
 the necessary data structures and check if the cgroup that is being charged
-is over its limit. If it is then reclaim is invoked on the cgroup.
+is over its limit. If it is, then reclaim is invoked on the cgroup.
 More details can be found in the reclaim section of this document.
 If everything goes well, a page meta-data-structure called page_cgroup is
 updated. page_cgroup has its own LRU on cgroup.
@@ -163,13 +163,13 @@ for earlier. A file page will be accounted for as Page Cache when it's
 inserted into inode (radix-tree). While it's mapped into the page tables of
 processes, duplicate accounting is carefully avoided.
 
-A RSS page is unaccounted when it's fully unmapped. A PageCache page is
+An RSS page is unaccounted when it's fully unmapped. A PageCache page is
 unaccounted when it's removed from radix-tree. Even if RSS pages are fully
 unmapped (by kswapd), they may exist as SwapCache in the system until they
-are really freed. Such SwapCaches also also accounted.
+are really freed. Such SwapCaches are also accounted.
 A swapped-in page is not accounted until it's mapped.
 
-Note: The kernel does swapin-readahead and read multiple swaps at once.
+Note: The kernel does swapin-readahead and reads multiple swaps at once.
 This means swapped-in pages may contain pages for other tasks than a task
 causing page fault. So, we avoid accounting at swap-in I/O.
 
@@ -209,7 +209,7 @@ memsw.limit_in_bytes.
 Example: Assume a system with 4G of swap. A task which allocates 6G of memory
 (by mistake) under 2G memory limitation will use all swap.
 In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
-By using memsw limit, you can avoid system OOM which can be caused by swap
+By using the memsw limit, you can avoid system OOM which can be caused by swap
 shortage.
 
 * why 'memory+swap' rather than swap.
@@ -217,7 +217,7 @@ The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
 to move account from memory to swap...there is no change in usage of
 memory+swap. In other words, when we want to limit the usage of swap without
 affecting global LRU, memory+swap limit is better than just limiting swap from
-OS point of view.
+an OS point of view.
 
 * What happens when a cgroup hits memory.memsw.limit_in_bytes
 When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
@@ -236,7 +236,7 @@ an OOM routine is invoked to select and kill the bulkiest task in the
 cgroup. (See 10. OOM Control below.)
 
 The reclaim algorithm has not been modified for cgroups, except that
-pages that are selected for reclaiming come from the per cgroup LRU
+pages that are selected for reclaiming come from the per-cgroup LRU
 list.
 
 NOTE: Reclaim does not work for the root cgroup, since we cannot set any
@@ -316,7 +316,7 @@ We can check the usage:
 # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
 1216512
 
-A successful write to this file does not guarantee a successful set of
+A successful write to this file does not guarantee a successful setting of
 this limit to the value written into the file. This can be due to a
 number of factors, such as rounding up to page boundaries or the total
 availability of memory on the system. The user is required to re-read
@@ -350,7 +350,7 @@ Trying usual test under memory controller is always helpful.
 4.1 Troubleshooting
 
 Sometimes a user might find that the application under a cgroup is
-terminated by OOM killer. There are several causes for this:
+terminated by the OOM killer. There are several causes for this:
 
 1. The cgroup limit is too low (just too low to do anything useful)
 2. The user is using anonymous memory and swap is turned off or too low
@@ -358,7 +358,7 @@ terminated by OOM killer. There are several causes for this:
 A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
 some of the pages cached in the cgroup (page cache pages).
 
-To know what happens, disable OOM_Kill by 10. OOM Control(see below) and
+To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
 seeing what happens will be helpful.
 
 4.2 Task migration
@@ -399,10 +399,10 @@ About use_hierarchy, see Section 6.
 
  Almost all pages tracked by this memory cgroup will be unmapped and freed.
  Some pages cannot be freed because they are locked or in-use. Such pages are
- moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
+ moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
  cgroup will be empty.
 
- Typical use case of this interface is that calling this before rmdir().
+ The typical use case for this interface is before calling rmdir().
  Because rmdir() moves all pages to parent, some out-of-use page caches can be
  moved to the parent. If you want to avoid that, force_empty will be useful.
 
@@ -486,7 +486,7 @@ You can reset failcnt by writing 0 to failcnt file.
 
 For efficiency, as other kernel components, memory cgroup uses some optimization
 to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
-method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz
+method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz
 value for efficient access. (Of course, when necessary, it's synchronized.)
 If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
 value in memory.stat(see 5.2).
@@ -496,8 +496,8 @@ value in memory.stat(see 5.2).
 This is similar to numa_maps but operates on a per-memcg basis. This is
 useful for providing visibility into the numa locality information within
 an memcg since the pages are allowed to be allocated from any physical
-node. One of the usecases is evaluating application performance by
-combining this information with the application's cpu allocation.
+node. One of the use cases is evaluating application performance by
+combining this information with the application's CPU allocation.
 
 We export "total", "file", "anon" and "unevictable" pages per-node for
 each memcg. The ouput format of memory.numa_stat is:
@@ -561,10 +561,10 @@ are pushed back to their soft limits. If the soft limit of each control
 group is very high, they are pushed back as much as possible to make
 sure that one control group does not starve the others of memory.
 
-Please note that soft limits is a best effort feature, it comes with
+Please note that soft limits is a best-effort feature; it comes with
 no guarantees, but it does its best to make sure that when memory is
 heavily contended for, memory is allocated based on the soft limit
-hints/setup. Currently soft limit based reclaim is setup such that
+hints/setup. Currently soft limit based reclaim is set up such that
 it gets invoked from balance_pgdat (kswapd).
 
 7.1 Interface
@@ -592,7 +592,7 @@ page tables.
 
 8.1 Interface
 
-This feature is disabled by default. It can be enabled(and disabled again) by
+This feature is disabled by default. It can be enabled (and disabled again) by
 writing to memory.move_charge_at_immigrate of the destination cgroup.
 
 If you want to enable it:
@@ -601,8 +601,8 @@ If you want to enable it:
 
 Note: Each bits of move_charge_at_immigrate has its own meaning about what type
       of charges should be moved. See 8.2 for details.
-Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread
-      group.
+Note: Charges are moved only when you move mm->owner, in other words,
+      a leader of a thread group.
 Note: If we cannot find enough space for the task in the destination cgroup, we
       try to make space by reclaiming memory. Task migration may fail if we
       cannot make enough space.
@@ -612,25 +612,25 @@ And if you want disable it again:
 
 # echo 0 > memory.move_charge_at_immigrate
 
-8.2 Type of charges which can be move
+8.2 Type of charges which can be moved
 
-Each bits of move_charge_at_immigrate has its own meaning about what type of
-charges should be moved. But in any cases, it must be noted that an account of
-a page or a swap can be moved only when it is charged to the task's current(old)
-memory cgroup.
+Each bit in move_charge_at_immigrate has its own meaning about what type of
+charges should be moved. But in any case, it must be noted that an account of
+a page or a swap can be moved only when it is charged to the task's current
+(old) memory cgroup.
 
   bit | what type of charges would be moved ?
  -----+------------------------------------------------------------------------
-  0   | A charge of an anonymous page(or swap of it) used by the target task.
-      | You must enable Swap Extension(see 2.4) to enable move of swap charges.
+  0   | A charge of an anonymous page (or swap of it) used by the target task.
+      | You must enable Swap Extension (see 2.4) to enable move of swap charges.
 -----+------------------------------------------------------------------------
-  1   | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory)
+  1   | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
       | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
-      | anonymous pages, file pages(and swaps) in the range mmapped by the task
+      | anonymous pages, file pages (and swaps) in the range mmapped by the task
       | will be moved even if the task hasn't done page fault, i.e. they might
      | not be the task's "RSS", but other task's "RSS" that maps the same file.
-      | And mapcount of the page is ignored(the page can be moved even if
-      | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to
+      | And mapcount of the page is ignored (the page can be moved even if
+      | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
      | enable move of swap charges.
 
 8.3 TODO
@@ -640,11 +640,11 @@ memory cgroup.
 
 9. Memory thresholds
 
-Memory cgroup implements memory thresholds using cgroups notification
+Memory cgroup implements memory thresholds using the cgroups notification
 API (see cgroups.txt). It allows to register multiple memory and memsw
 thresholds and gets notifications when it crosses.
 
-To register a threshold application need:
+To register a threshold, an application must:
 - create an eventfd using eventfd(2);
 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
@@ -659,24 +659,24 @@ It's applicable for root and non-root cgroup.
 
 memory.oom_control file is for OOM notification and other controls.
 
-Memory cgroup implements OOM notifier using cgroup notification
+Memory cgroup implements OOM notifier using the cgroup notification
 API (See cgroups.txt). It allows to register multiple OOM notification
 delivery and gets notification when OOM happens.
 
-To register a notifier, application need:
+To register a notifier, an application must:
  - create an eventfd using eventfd(2)
 - open memory.oom_control file
 - write string like "<event_fd> <fd of memory.oom_control>" to
    cgroup.event_control
 
-Application will be notified through eventfd when OOM happens.
-OOM notification doesn't work for root cgroup.
+The application will be notified through eventfd when OOM happens.
+OOM notification doesn't work for the root cgroup.
 
-You can disable OOM-killer by writing "1" to memory.oom_control file, as:
+You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
 
 	#echo 1 > memory.oom_control
 
-This operation is only allowed to the top cgroup of sub-hierarchy.
+This operation is only allowed to the top cgroup of a sub-hierarchy.
 If OOM-killer is disabled, tasks under cgroup will hang/sleep
 in memory cgroup's OOM-waitqueue when they request accountable memory.
 
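
Not part of the commit, but for context: sections 9 and 10 of the patched memory.txt
describe the same eventfd(2)-based registration protocol, so a minimal userspace
sketch may help make the three steps concrete. The cgroup path
/sys/fs/cgroup/memory/0 and the 64M threshold are assumptions carried over from the
document's earlier examples, not values taken from the patch.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	/* Step 1: create an eventfd to be notified through. */
	int efd = eventfd(0, 0);
	/* Step 2: open the file being watched (memory.usage_in_bytes here;
	 * memory.oom_control would be opened the same way for OOM events). */
	int ufd = open("/sys/fs/cgroup/memory/0/memory.usage_in_bytes", O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/0/cgroup.event_control", O_WRONLY);
	char buf[64];
	uint64_t ticks;

	if (efd < 0 || ufd < 0 || cfd < 0) {
		perror("setup");
		return 1;
	}

	/* Step 3: "<event_fd> <fd of memory.usage_in_bytes> <threshold>";
	 * for OOM notification the string is just "<event_fd> <fd>". */
	snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, 64ULL << 20);
	if (write(cfd, buf, strlen(buf)) < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	/* Blocks until usage crosses the threshold (in either direction). */
	if (read(efd, &ticks, sizeof(ticks)) == sizeof(ticks))
		printf("memory.usage_in_bytes crossed the threshold\n");
	return 0;
}
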
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fb0a6aeb936c..a1793d670cd0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,7 @@ Table of Contents
  2	Modifying System Parameters
 
  3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+  3.1	/proc/<pid>/oom_score_adj - Adjust the oom-killer
 	score
  3.2	/proc/<pid>/oom_score - Display current oom-killer score
  3.3	/proc/<pid>/io - Display the IO accounting fields
@@ -1320,10 +1320,10 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
+3.1 /proc/<pid>/oom_score_adj- Adjust the oom-killer score
 --------------------------------------------------------------------------------
 
-These file can be used to adjust the badness heuristic used to select which
+This file can be used to adjust the badness heuristic used to select which
 process gets killed in out of memory conditions.
 
 The badness heuristic assigns a value to each candidate task ranging from 0
@@ -1361,22 +1361,10 @@ same system, cpuset, mempolicy, or memory controller resources to use at least
 equivalent to discounting 50% of the task's allowed memory from being considered
 as scoring against the task.
 
-For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
-be used to tune the badness score. Its acceptable values range from -16
-(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
-(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
-scaled linearly with /proc/<pid>/oom_score_adj.
-
-Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
-other with its scaled value.
-
 The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
 value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
 requires CAP_SYS_RESOURCE.
 
-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
 Caveat: when a parent task is selected, the oom killer will sacrifice any first
 generation children with separate address spaces instead, if possible. This
 avoids servers and important system daemons from being killed and loses the
@@ -1387,9 +1375,7 @@ minimal amount of work.
 -------------------------------------------------------------
 
 This file can be used to check the current score used by the oom-killer is for
-any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
-process should be killed in an out-of-memory situation.
-
+any given <pid>.
 
 3.3 /proc/<pid>/io - Display the IO accounting fields
 -------------------------------------------------------
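
Not part of the commit, but for context: with oom_adj gone, the remaining tunable
is /proc/<pid>/oom_score_adj, which takes a value in [-1000, 1000]
(OOM_SCORE_ADJ_MIN..OOM_SCORE_ADJ_MAX). A hedged sketch of setting it from C; the
helper name is made up for illustration:

#include <stdio.h>
#include <sys/types.h>

/* Write an adjustment in [-1000, 1000] to /proc/<pid>/oom_score_adj.
 * Lowering it below a value previously set by a CAP_SYS_RESOURCE process
 * requires CAP_SYS_RESOURCE, as the text above notes. */
static int set_oom_score_adj(pid_t pid, int adj)
{
	char path[64];
	FILE *f;
	int ret;

	snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ret = fprintf(f, "%d\n", adj) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

For example, set_oom_score_adj(getpid(), -500) makes the calling task less
attractive to the oom killer.
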
diff --git a/Documentation/memory.txt b/Documentation/memory.txt
deleted file mode 100644
index 802efe58647c..000000000000
--- a/Documentation/memory.txt
+++ /dev/null
@@ -1,33 +0,0 @@
-There are several classic problems related to memory on Linux
-systems.
-
-	1) There are some motherboards that will not cache above
-	   a certain quantity of memory. If you have one of these
-	   motherboards, your system will be SLOWER, not faster
-	   as you add more memory. Consider exchanging your
-	   motherboard.
-
-All of these problems can be addressed with the "mem=XXXM" boot option
-(where XXX is the size of RAM to use in megabytes).
-It can also tell Linux to use less memory than is actually installed.
-If you use "mem=" on a machine with PCI, consider using "memmap=" to avoid
-physical address space collisions.
-
-See the documentation of your boot loader (LILO, grub, loadlin, etc.) about
-how to pass options to the kernel.
-
-There are other memory problems which Linux cannot deal with. Random
-corruption of memory is usually a sign of serious hardware trouble.
-Try:
-
-  * Reducing memory settings in the BIOS to the most conservative
-    timings.
-
-  * Adding a cooling fan.
-
-  * Not overclocking your CPU.
-
-  * Having the memory tested in a memory tester or exchanged
-    with the vendor. Consider testing it with memtest86 yourself.
-
-  * Exchanging your CPU, cache, or motherboard for one that works.
diff --git a/Documentation/prio_tree.txt b/Documentation/prio_tree.txt
deleted file mode 100644
index 3aa68f9a117b..000000000000
--- a/Documentation/prio_tree.txt
+++ /dev/null
@@ -1,107 +0,0 @@
-The prio_tree.c code indexes vmas using 3 different indexes:
-	* heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff
-	* radix_index = vm_pgoff : start_vm_pgoff
-	* size_index = vm_size_in_pages
-
-A regular radix-priority-search-tree indexes vmas using only heap_index and
-radix_index. The conditions for indexing are:
-	* ->heap_index >= ->left->heap_index &&
-		->heap_index >= ->right->heap_index
-	* if (->heap_index == ->left->heap_index)
-		then ->radix_index < ->left->radix_index;
-	* if (->heap_index == ->right->heap_index)
-		then ->radix_index < ->right->radix_index;
-	* nodes are hashed to left or right subtree using radix_index
-	  similar to a pure binary radix tree.
-
-A regular radix-priority-search-tree helps to store and query
-intervals (vmas). However, a regular radix-priority-search-tree is only
-suitable for storing vmas with different radix indices (vm_pgoff).
-
-Therefore, the prio_tree.c extends the regular radix-priority-search-tree
-to handle many vmas with the same vm_pgoff. Such vmas are handled in
-2 different ways: 1) All vmas with the same radix _and_ heap indices are
-linked using vm_set.list, 2) if there are many vmas with the same radix
-index, but different heap indices and if the regular radix-priority-search
-tree cannot index them all, we build an overflow-sub-tree that indexes such
-vmas using heap and size indices instead of heap and radix indices. For
-example, in the figure below some vmas with vm_pgoff = 0 (zero) are
-indexed by regular radix-priority-search-tree whereas others are pushed
-into an overflow-subtree. Note that all vmas in an overflow-sub-tree have
-the same vm_pgoff (radix_index) and if necessary we build different
-overflow-sub-trees to handle each possible radix_index. For example,
-in figure we have 3 overflow-sub-trees corresponding to radix indices
-0, 2, and 4.
-
-In the final tree the first few (prio_tree_root->index_bits) levels
-are indexed using heap and radix indices whereas the overflow-sub-trees below
-those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are
-indexed using heap and size indices. In overflow-sub-trees the size_index
-is used for hashing the nodes to appropriate places.
-
-Now, an example prio_tree:
-
-	vmas are represented [radix_index, size_index, heap_index]
-	i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff]
-
-level  prio_tree_root->index_bits = 3
------
-	_
- 0	[0,7,7]	|
-	/ \	|
-	------------------ ------------	| Regular
-	/ \	| radix priority
- 1	[1,6,7] [4,3,7]	| search tree
-	/ \ / \	|
-	------- ----- ------ -----	| heap-and-radix
-	/ \ / \	| indexed
- 2	[0,6,6] [2,5,7] [5,2,7] [6,1,7]	|
-	/ \ / \ / \ / \	|
- 3	[0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7]	|
-	/ / /	_
-	/ / /	_
- 4	[0,4,4] [2,3,5] [4,1,5]	|
-	/ / /	|
- 5	[0,3,3] [2,2,4] [4,0,4]	| Overflow-sub-trees
-	/ /	|
- 6	[0,2,2] [2,1,3]	| heap-and-size
-	/ /	| indexed
- 7	[0,1,1] [2,0,2]	|
-	/	|
- 8	[0,0,0]	|
-	_
-
-Note that we use prio_tree_root->index_bits to optimize the height
-of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is
-set according to the maximum end_vm_pgoff mapped, we are sure that all
-bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore,
-we only use the first prio_tree_root->index_bits as radix_index.
-Whenever index_bits is increased in prio_tree_expand, we shuffle the tree
-to make sure that the first prio_tree_root->index_bits levels of the tree
-is indexed properly using heap and radix indices.
-
-We do not optimize the height of overflow-sub-trees using index_bits.
-The reason is: there can be many such overflow-sub-trees and all of
-them have to be suffled whenever the index_bits increases. This may involve
-walking the whole prio_tree in prio_tree_insert->prio_tree_expand code
-path which is not desirable. Hence, we do not optimize the height of the
-heap-and-size indexed overflow-sub-trees using prio_tree->index_bits.
-Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits
-of size_index. This may lead to skewed sub-trees because most of the
-higher significant bits of the size_index are likely to be 0 (zero). In
-the example above, all 3 overflow-sub-trees are skewed. This may marginally
-affect the performance. However, processes rarely map many vmas with the
-same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally
-do not require overflow-sub-trees to index all vmas.
-
-From the above discussion it is clear that the maximum height of
-a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG.
-However, in most of the common cases we do not need overflow-sub-trees,
-so the tree height in the common cases will be prio_tree_root->index_bits.
-
-It is fair to mention here that the prio_tree_root->index_bits
-is increased on demand, however, the index_bits is not decreased when
-vmas are removed from the prio_tree. That's tricky to do. Hence, it's
-left as a home work problem.
-
-
diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt
index 8d32d85a5234..61b6c48871a0 100644
--- a/Documentation/rbtree.txt
+++ b/Documentation/rbtree.txt
@@ -193,24 +193,55 @@ Example:
 Support for Augmented rbtrees
 -----------------------------
 
-Augmented rbtree is an rbtree with "some" additional data stored in each node.
-This data can be used to augment some new functionality to rbtree.
-Augmented rbtree is an optional feature built on top of basic rbtree
-infrastructure. An rbtree user who wants this feature will have to call the
-augmentation functions with the user provided augmentation callback
-when inserting and erasing nodes.
-
-On insertion, the user must call rb_augment_insert() once the new node is in
-place. This will cause the augmentation function callback to be called for
-each node between the new node and the root which has been affected by the
-insertion.
-
-When erasing a node, the user must call rb_augment_erase_begin() first to
-retrieve the deepest node on the rebalance path. Then, after erasing the
-original node, the user must call rb_augment_erase_end() with the deepest
-node found earlier. This will cause the augmentation function to be called
-for each affected node between the deepest node and the root.
-
+Augmented rbtree is an rbtree with "some" additional data stored in
+each node, where the additional data for node N must be a function of
+the contents of all nodes in the subtree rooted at N. This data can
+be used to augment some new functionality to rbtree. Augmented rbtree
+is an optional feature built on top of basic rbtree infrastructure.
+An rbtree user who wants this feature will have to call the augmentation
+functions with the user provided augmentation callback when inserting
+and erasing nodes.
+
+C files implementing augmented rbtree manipulation must include
+<linux/rbtree_augmented.h> instead of <linus/rbtree.h>. Note that
+linux/rbtree_augmented.h exposes some rbtree implementations details
+you are not expected to rely on; please stick to the documented APIs
+there and do not include <linux/rbtree_augmented.h> from header files
+either so as to minimize chances of your users accidentally relying on
+such implementation details.
+
+On insertion, the user must update the augmented information on the path
+leading to the inserted node, then call rb_link_node() as usual and
+rb_augment_inserted() instead of the usual rb_insert_color() call.
+If rb_augment_inserted() rebalances the rbtree, it will callback into
+a user provided function to update the augmented information on the
+affected subtrees.
+
+When erasing a node, the user must call rb_erase_augmented() instead of
+rb_erase(). rb_erase_augmented() calls back into user provided functions
+to updated the augmented information on affected subtrees.
+
+In both cases, the callbacks are provided through struct rb_augment_callbacks.
+3 callbacks must be defined:
+
+- A propagation callback, which updates the augmented value for a given
+  node and its ancestors, up to a given stop point (or NULL to update
+  all the way to the root).
+
+- A copy callback, which copies the augmented value for a given subtree
+  to a newly assigned subtree root.
+
+- A tree rotation callback, which copies the augmented value for a given
+  subtree to a newly assigned subtree root AND recomputes the augmented
+  information for the former subtree root.
+
+The compiled code for rb_erase_augmented() may inline the propagation and
+copy callbacks, which results in a large function, so each augmented rbtree
+user should have a single rb_erase_augmented() call site in order to limit
+compiled code size.
+
+
+Sample usage:
 
 Interval tree is an example of augmented rb tree. Reference -
 "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein.
@@ -230,26 +261,132 @@ and its immediate children. And this will be used in O(log n) lookup
 for lowest match (lowest start address among all possible matches)
 with something like:
 
-find_lowest_match(lo, hi, node)
-{
-	lowest_match = NULL;
-	while (node) {
-		if (max_hi(node->left) > lo) {
-			// Lowest overlap if any must be on left side
-			node = node->left;
-		} else if (overlap(lo, hi, node)) {
-			lowest_match = node;
-			break;
-		} else if (lo > node->lo) {
-			// Lowest overlap if any must be on right side
-			node = node->right;
-		} else {
-			break;
-		}
-	}
-	return lowest_match;
-}
-
-Finding exact match will be to first find lowest match and then to follow
-successor nodes looking for exact match, until the start of a node is beyond
-the hi value we are looking for.
+struct interval_tree_node *
+interval_tree_first_match(struct rb_root *root,
+			  unsigned long start, unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	if (!root->rb_node)
+		return NULL;
+	node = rb_entry(root->rb_node, struct interval_tree_node, rb);
+
+	while (true) {
+		if (node->rb.rb_left) {
+			struct interval_tree_node *left =
+				rb_entry(node->rb.rb_left,
+					 struct interval_tree_node, rb);
+			if (left->__subtree_last >= start) {
+				/*
+				 * Some nodes in left subtree satisfy Cond2.
+				 * Iterate to find the leftmost such node N.
+				 * If it also satisfies Cond1, that's the match
+				 * we are looking for. Otherwise, there is no
+				 * matching interval as nodes to the right of N
+				 * can't satisfy Cond1 either.
+				 */
+				node = left;
+				continue;
+			}
+		}
+		if (node->start <= last) {		/* Cond1 */
+			if (node->last >= start)	/* Cond2 */
+				return node;	/* node is leftmost match */
+			if (node->rb.rb_right) {
+				node = rb_entry(node->rb.rb_right,
+						struct interval_tree_node, rb);
+				if (node->__subtree_last >= start)
+					continue;
+			}
+		}
+		return NULL;	/* No match */
+	}
+}
+
+Insertion/removal are defined using the following augmented callbacks:
+
+static inline unsigned long
+compute_subtree_last(struct interval_tree_node *node)
+{
+	unsigned long max = node->last, subtree_last;
+	if (node->rb.rb_left) {
+		subtree_last = rb_entry(node->rb.rb_left,
+			struct interval_tree_node, rb)->__subtree_last;
+		if (max < subtree_last)
+			max = subtree_last;
+	}
+	if (node->rb.rb_right) {
+		subtree_last = rb_entry(node->rb.rb_right,
+			struct interval_tree_node, rb)->__subtree_last;
+		if (max < subtree_last)
+			max = subtree_last;
+	}
+	return max;
+}
+
+static void augment_propagate(struct rb_node *rb, struct rb_node *stop)
+{
+	while (rb != stop) {
+		struct interval_tree_node *node =
+			rb_entry(rb, struct interval_tree_node, rb);
+		unsigned long subtree_last = compute_subtree_last(node);
+		if (node->__subtree_last == subtree_last)
+			break;
+		node->__subtree_last = subtree_last;
+		rb = rb_parent(&node->rb);
+	}
+}
+
+static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new)
+{
+	struct interval_tree_node *old =
+		rb_entry(rb_old, struct interval_tree_node, rb);
+	struct interval_tree_node *new =
+		rb_entry(rb_new, struct interval_tree_node, rb);
+
+	new->__subtree_last = old->__subtree_last;
+}
+
+static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
+{
+	struct interval_tree_node *old =
+		rb_entry(rb_old, struct interval_tree_node, rb);
+	struct interval_tree_node *new =
+		rb_entry(rb_new, struct interval_tree_node, rb);
+
+	new->__subtree_last = old->__subtree_last;
+	old->__subtree_last = compute_subtree_last(old);
+}
+
+static const struct rb_augment_callbacks augment_callbacks = {
+	augment_propagate, augment_copy, augment_rotate
+};
+
+void interval_tree_insert(struct interval_tree_node *node,
+			  struct rb_root *root)
+{
+	struct rb_node **link = &root->rb_node, *rb_parent = NULL;
+	unsigned long start = node->start, last = node->last;
+	struct interval_tree_node *parent;
+
+	while (*link) {
+		rb_parent = *link;
+		parent = rb_entry(rb_parent, struct interval_tree_node, rb);
+		if (parent->__subtree_last < last)
+			parent->__subtree_last = last;
+		if (start < parent->start)
+			link = &parent->rb.rb_left;
+		else
+			link = &parent->rb.rb_right;
+	}
+
+	node->__subtree_last = last;
+	rb_link_node(&node->rb, rb_parent, link);
+	rb_insert_augmented(&node->rb, root, &augment_callbacks);
+}
+
+void interval_tree_remove(struct interval_tree_node *node,
+			  struct rb_root *root)
+{
+	rb_erase_augmented(&node->rb, root, &augment_callbacks);
+}
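
Not part of the commit, but for context: a caller-side sketch of how the sample
just added to rbtree.txt would be used. The node layout below (rb, start, last,
__subtree_last) is the one the sample already assumes, and interval_tree_insert()
and interval_tree_first_match() refer to the functions defined in the text above;
treat it as an illustration rather than the kernel's own interval tree API.

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/rbtree.h>

struct interval_tree_node {
	struct rb_node rb;
	unsigned long start;		/* first page offset covered */
	unsigned long last;		/* last page offset covered, inclusive */
	unsigned long __subtree_last;	/* max 'last' in this node's subtree */
};

static struct interval_tree_node nodes[] = {
	{ .start = 10, .last = 19 },
	{ .start = 30, .last = 39 },
	{ .start = 15, .last = 54 },
};

static void interval_tree_example(void)
{
	struct rb_root root = RB_ROOT;
	struct interval_tree_node *hit;
	int i;

	for (i = 0; i < ARRAY_SIZE(nodes); i++)
		interval_tree_insert(&nodes[i], &root);

	/* Lowest-starting interval overlapping [35, 60]: here [15, 54]. */
	hit = interval_tree_first_match(&root, 35, 60);
	if (hit)
		pr_info("first match starts at %lu\n", hit->start);
}
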
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index fa206cccf89f..a68db7692ee8 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -197,12 +197,8 @@ the pages are also "rescued" from the unevictable list in the process of
 freeing them.
 
 page_evictable() also checks for mlocked pages by testing an additional page
-flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked,
-and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is
-VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
-update the appropriate statistics if the vma is VM_LOCKED. This method allows
-efficient "culling" of pages in the fault path that are being faulted in to
-VM_LOCKED VMAs.
+flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
+faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.
 
 
 VMSCAN'S HANDLING OF UNEVICTABLE PAGES
@@ -371,8 +367,8 @@ mlock_fixup() filters several classes of "special" VMAs:
    mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
    allocate the huge pages and populate the ptes.
 
-3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of
-   kernel pages, such as the VDSO page, relay channel pages, etc. These pages
+3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
+   such as the VDSO page, relay channel pages, etc. These pages
    are inherently unevictable and are not managed on the LRU lists.
    mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
    make_pages_present() to populate the ptes.
@@ -651,7 +647,7 @@ PAGE RECLAIM IN shrink_*_list()
 -------------------------------
 
 shrink_active_list() culls any obviously unevictable pages - i.e.
-!page_evictable(page, NULL) - diverting these to the unevictable list.
+!page_evictable(page) - diverting these to the unevictable list.
 However, shrink_active_list() only sees unevictable pages that made it onto the
 active/inactive lru lists. Note that these pages do not have PageUnevictable
 set - otherwise they would be on the unevictable list and shrink_active_list