author	Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 03:23:15 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 03:23:15 -0400
commit	9e2d8656f5e8aa214e66b462680cf86b210b74a8 (patch)
tree	f67d62e896cedf75599ea45f9ecf9999c6ad24cd /Documentation
parent	1ea4f4f8405cc1ceec23f2d261bc3775785e6712 (diff)
parent	9e695d2ecc8451cc2c1603d60b5c8e7f5581923a (diff)
Merge branch 'akpm' (Andrew's patch-bomb)
Merge patches from Andrew Morton:
"A few misc things and very nearly all of the MM tree. A tremendous
amount of stuff (again), including a significant rbtree library
rework."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (160 commits)
sparc64: Support transparent huge pages.
mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd().
mm: Add and use update_mmu_cache_pmd() in transparent huge page code.
sparc64: Document PGD and PMD layout.
sparc64: Eliminate PTE table memory wastage.
sparc64: Halve the size of PTE tables
sparc64: Only support 4MB huge pages and 8KB base pages.
memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
mm: memcg: clean up mm_match_cgroup() signature
mm: document PageHuge somewhat
mm: use %pK for /proc/vmallocinfo
mm, thp: fix mlock statistics
mm, thp: fix mapped pages avoiding unevictable list on mlock
memory-hotplug: update memory block's state and notify userspace
memory-hotplug: preparation to notify memory block's state at memory hot remove
mm: avoid section mismatch warning for memblock_type_name
make GFP_NOTRACK definition unconditional
cma: decrease cc.nr_migratepages after reclaiming pagelist
CMA: migrate mlocked pages
kpageflags: fix wrong KPF_THP on non-huge compound pages
...
Diffstat (limited to 'Documentation')
-rw-r--r--	Documentation/00-INDEX	2
-rw-r--r--	Documentation/ABI/obsolete/proc-pid-oom_adj	22
-rw-r--r--	Documentation/cgroups/memory.txt	90
-rw-r--r--	Documentation/filesystems/proc.txt	22
-rw-r--r--	Documentation/memory.txt	33
-rw-r--r--	Documentation/prio_tree.txt	107
-rw-r--r--	Documentation/rbtree.txt	209
-rw-r--r--	Documentation/vm/unevictable-lru.txt	14
8 files changed, 227 insertions, 272 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index 49c051380daf..f54273e2ac97 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -270,8 +270,6 @@ preempt-locking.txt
 	- info on locking under a preemptive kernel.
 printk-formats.txt
 	- how to get printk format specifiers right
-prio_tree.txt
-	- info on radix-priority-search-tree use for indexing vmas.
 ramoops.txt
 	- documentation of the ramoops oops/panic logging module.
 rbtree.txt
diff --git a/Documentation/ABI/obsolete/proc-pid-oom_adj b/Documentation/ABI/obsolete/proc-pid-oom_adj
deleted file mode 100644
index 9a3cb88ade47..000000000000
--- a/Documentation/ABI/obsolete/proc-pid-oom_adj
+++ /dev/null
@@ -1,22 +0,0 @@
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated.  The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure.  With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness score linearly.  This interface will replace
-	/proc/<pid>/oom_adj.
-
-	A warning will be emitted to the kernel log if an application uses this
-	deprecated interface.  After it is printed once, future warnings will be
-	suppressed until the kernel is rebooted.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4372e6b8a353..c07f7b4fb88d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -18,16 +18,16 @@ from the rest of the system. The article on LWN [12] mentions some probable
 uses of the memory controller. The memory controller can be used to

 a. Isolate an application or a group of applications
-   Memory hungry applications can be isolated and limited to a smaller
+   Memory-hungry applications can be isolated and limited to a smaller
    amount of memory.
-b. Create a cgroup with limited amount of memory, this can be used
+b. Create a cgroup with a limited amount of memory; this can be used
    as a good alternative to booting with mem=XXXX.
 c. Virtualization solutions can control the amount of memory they want
    to assign to a virtual machine instance.
 d. A CD/DVD burner could control the amount of memory used by the
    rest of the system to ensure that burning does not fail due to lack
    of available memory.
-e. There are several other use cases, find one or use the controller just
+e. There are several other use cases; find one or use the controller just
    for fun (to learn and hack on the VM subsystem).

 Current Status: linux-2.6.34-mmotm(development version of 2010/April)
@@ -38,12 +38,12 @@ Features:
  - optionally, memory+swap usage can be accounted and limited.
  - hierarchical accounting
  - soft limit
- - moving(recharging) account at moving a task is selectable.
+ - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.

-Kernel memory support is work in progress, and the current version provides
+Kernel memory support is a work in progress, and the current version provides
 basically functionality. (See Section 2.7)

 Brief summary of control files.
@@ -144,9 +144,9 @@ Figure 1 shows the important aspects of the controller
 3. Each page has a pointer to the page_cgroup, which in turn knows the
    cgroup it belongs to

-The accounting is done as follows: mem_cgroup_charge() is invoked to setup
+The accounting is done as follows: mem_cgroup_charge() is invoked to set up
 the necessary data structures and check if the cgroup that is being charged
-is over its limit. If it is then reclaim is invoked on the cgroup.
+is over its limit. If it is, then reclaim is invoked on the cgroup.
 More details can be found in the reclaim section of this document.
 If everything goes well, a page meta-data-structure called page_cgroup is
 updated. page_cgroup has its own LRU on cgroup.
@@ -163,13 +163,13 @@ for earlier. A file page will be accounted for as Page Cache when it's
 inserted into inode (radix-tree). While it's mapped into the page tables of
 processes, duplicate accounting is carefully avoided.

-A RSS page is unaccounted when it's fully unmapped. A PageCache page is
+An RSS page is unaccounted when it's fully unmapped. A PageCache page is
 unaccounted when it's removed from radix-tree. Even if RSS pages are fully
 unmapped (by kswapd), they may exist as SwapCache in the system until they
-are really freed. Such SwapCaches also also accounted.
+are really freed. Such SwapCaches are also accounted.
 A swapped-in page is not accounted until it's mapped.

-Note: The kernel does swapin-readahead and read multiple swaps at once.
+Note: The kernel does swapin-readahead and reads multiple swaps at once.
 This means swapped-in pages may contain pages for other tasks than a task
 causing page fault. So, we avoid accounting at swap-in I/O.

@@ -209,7 +209,7 @@ memsw.limit_in_bytes.
 Example: Assume a system with 4G of swap. A task which allocates 6G of memory
 (by mistake) under 2G memory limitation will use all swap.
 In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
-By using memsw limit, you can avoid system OOM which can be caused by swap
+By using the memsw limit, you can avoid system OOM which can be caused by swap
 shortage.

 * why 'memory+swap' rather than swap.
@@ -217,7 +217,7 @@ The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
 to move account from memory to swap...there is no change in usage of
 memory+swap. In other words, when we want to limit the usage of swap without
 affecting global LRU, memory+swap limit is better than just limiting swap from
-OS point of view.
+an OS point of view.

 * What happens when a cgroup hits memory.memsw.limit_in_bytes
 When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
@@ -236,7 +236,7 @@ an OOM routine is invoked to select and kill the bulkiest task in the
 cgroup. (See 10. OOM Control below.)

 The reclaim algorithm has not been modified for cgroups, except that
-pages that are selected for reclaiming come from the per cgroup LRU
+pages that are selected for reclaiming come from the per-cgroup LRU
 list.

 NOTE: Reclaim does not work for the root cgroup, since we cannot set any
@@ -316,7 +316,7 @@ We can check the usage:
 # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
 1216512

-A successful write to this file does not guarantee a successful set of
+A successful write to this file does not guarantee a successful setting of
 this limit to the value written into the file. This can be due to a
 number of factors, such as rounding up to page boundaries or the total
 availability of memory on the system. The user is required to re-read
@@ -350,7 +350,7 @@ Trying usual test under memory controller is always helpful.
 4.1 Troubleshooting

 Sometimes a user might find that the application under a cgroup is
-terminated by OOM killer. There are several causes for this:
+terminated by the OOM killer. There are several causes for this:

 1. The cgroup limit is too low (just too low to do anything useful)
 2. The user is using anonymous memory and swap is turned off or too low
@@ -358,7 +358,7 @@ terminated by OOM killer. There are several causes for this:
 A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
 some of the pages cached in the cgroup (page cache pages).

-To know what happens, disable OOM_Kill by 10. OOM Control(see below) and
+To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
 seeing what happens will be helpful.

 4.2 Task migration
@@ -399,10 +399,10 @@ About use_hierarchy, see Section 6.

 Almost all pages tracked by this memory cgroup will be unmapped and freed.
 Some pages cannot be freed because they are locked or in-use. Such pages are
-moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
+moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
 cgroup will be empty.

-Typical use case of this interface is that calling this before rmdir().
+The typical use case for this interface is before calling rmdir().
 Because rmdir() moves all pages to parent, some out-of-use page caches can be
 moved to the parent. If you want to avoid that, force_empty will be useful.

@@ -486,7 +486,7 @@ You can reset failcnt by writing 0 to failcnt file.

 For efficiency, as other kernel components, memory cgroup uses some optimization
 to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
-method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz
+method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz
 value for efficient access. (Of course, when necessary, it's synchronized.)
 If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
 value in memory.stat(see 5.2).
@@ -496,8 +496,8 @@ value in memory.stat(see 5.2).
 This is similar to numa_maps but operates on a per-memcg basis. This is
 useful for providing visibility into the numa locality information within
 an memcg since the pages are allowed to be allocated from any physical
-node. One of the usecases is evaluating application performance by
-combining this information with the application's cpu allocation.
+node. One of the use cases is evaluating application performance by
+combining this information with the application's CPU allocation.

 We export "total", "file", "anon" and "unevictable" pages per-node for
 each memcg. The ouput format of memory.numa_stat is:
@@ -561,10 +561,10 @@ are pushed back to their soft limits. If the soft limit of each control
 group is very high, they are pushed back as much as possible to make
 sure that one control group does not starve the others of memory.

-Please note that soft limits is a best effort feature, it comes with
+Please note that soft limits is a best-effort feature; it comes with
 no guarantees, but it does its best to make sure that when memory is
 heavily contended for, memory is allocated based on the soft limit
-hints/setup. Currently soft limit based reclaim is setup such that
+hints/setup. Currently soft limit based reclaim is set up such that
 it gets invoked from balance_pgdat (kswapd).

 7.1 Interface
@@ -592,7 +592,7 @@ page tables.

 8.1 Interface

-This feature is disabled by default. It can be enabled(and disabled again) by
+This feature is disabled by default. It can be enabled (and disabled again) by
 writing to memory.move_charge_at_immigrate of the destination cgroup.

 If you want to enable it:
@@ -601,8 +601,8 @@ If you want to enable it:

 Note: Each bits of move_charge_at_immigrate has its own meaning about what type
 of charges should be moved. See 8.2 for details.
-Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread
-group.
+Note: Charges are moved only when you move mm->owner, in other words,
+a leader of a thread group.
 Note: If we cannot find enough space for the task in the destination cgroup, we
 try to make space by reclaiming memory. Task migration may fail if we
 cannot make enough space.
@@ -612,25 +612,25 @@ And if you want disable it again:

 # echo 0 > memory.move_charge_at_immigrate

-8.2 Type of charges which can be move
+8.2 Type of charges which can be moved

-Each bits of move_charge_at_immigrate has its own meaning about what type of
-charges should be moved. But in any cases, it must be noted that an account of
-a page or a swap can be moved only when it is charged to the task's current(old)
-memory cgroup.
+Each bit in move_charge_at_immigrate has its own meaning about what type of
+charges should be moved. But in any case, it must be noted that an account of
+a page or a swap can be moved only when it is charged to the task's current
+(old) memory cgroup.

  bit | what type of charges would be moved ?
 -----+------------------------------------------------------------------------
-  0  | A charge of an anonymous page(or swap of it) used by the target task.
-     | You must enable Swap Extension(see 2.4) to enable move of swap charges.
+  0  | A charge of an anonymous page (or swap of it) used by the target task.
+     | You must enable Swap Extension (see 2.4) to enable move of swap charges.
 -----+------------------------------------------------------------------------
-  1  | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory)
+  1  | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
     | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
-    | anonymous pages, file pages(and swaps) in the range mmapped by the task
+    | anonymous pages, file pages (and swaps) in the range mmapped by the task
     | will be moved even if the task hasn't done page fault, i.e. they might
     | not be the task's "RSS", but other task's "RSS" that maps the same file.
-    | And mapcount of the page is ignored(the page can be moved even if
-    | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to
+    | And mapcount of the page is ignored (the page can be moved even if
+    | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
     | enable move of swap charges.

 8.3 TODO
@@ -640,11 +640,11 @@ memory cgroup.

 9. Memory thresholds

-Memory cgroup implements memory thresholds using cgroups notification
+Memory cgroup implements memory thresholds using the cgroups notification
 API (see cgroups.txt). It allows to register multiple memory and memsw
 thresholds and gets notifications when it crosses.

-To register a threshold application need:
+To register a threshold, an application must:
 - create an eventfd using eventfd(2);
 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
@@ -659,24 +659,24 @@ It's applicable for root and non-root cgroup.

 memory.oom_control file is for OOM notification and other controls.

-Memory cgroup implements OOM notifier using cgroup notification
+Memory cgroup implements OOM notifier using the cgroup notification
 API (See cgroups.txt). It allows to register multiple OOM notification
 delivery and gets notification when OOM happens.

-To register a notifier, application need:
+To register a notifier, an application must:
 - create an eventfd using eventfd(2)
 - open memory.oom_control file
 - write string like "<event_fd> <fd of memory.oom_control>" to
   cgroup.event_control

-Application will be notified through eventfd when OOM happens.
-OOM notification doesn't work for root cgroup.
+The application will be notified through eventfd when OOM happens.
+OOM notification doesn't work for the root cgroup.

-You can disable OOM-killer by writing "1" to memory.oom_control file, as:
+You can disable the OOM-killer by writing "1" to memory.oom_control file, as:

 	#echo 1 > memory.oom_control

-This operation is only allowed to the top cgroup of sub-hierarchy.
+This operation is only allowed to the top cgroup of a sub-hierarchy.
 If OOM-killer is disabled, tasks under cgroup will hang/sleep
 in memory cgroup's OOM-waitqueue when they request accountable memory.

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fb0a6aeb936c..a1793d670cd0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,7 @@ Table of Contents
  2	Modifying System Parameters

  3	Per-Process Parameters
- 3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+ 3.1	/proc/<pid>/oom_score_adj - Adjust the oom-killer
 	score
  3.2	/proc/<pid>/oom_score - Display current oom-killer score
  3.3	/proc/<pid>/io - Display the IO accounting fields
@@ -1320,10 +1320,10 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------

-3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
+3.1 /proc/<pid>/oom_score_adj- Adjust the oom-killer score
 --------------------------------------------------------------------------------

-These file can be used to adjust the badness heuristic used to select which
+This file can be used to adjust the badness heuristic used to select which
 process gets killed in out of memory conditions.

 The badness heuristic assigns a value to each candidate task ranging from 0
| @@ -1361,22 +1361,10 @@ same system, cpuset, mempolicy, or memory controller resources to use at least | |||
| 1361 | equivalent to discounting 50% of the task's allowed memory from being considered | 1361 | equivalent to discounting 50% of the task's allowed memory from being considered |
| 1362 | as scoring against the task. | 1362 | as scoring against the task. |
| 1363 | 1363 | ||
| 1364 | For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also | ||
| 1365 | be used to tune the badness score. Its acceptable values range from -16 | ||
| 1366 | (OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 | ||
| 1367 | (OOM_DISABLE) to disable oom killing entirely for that task. Its value is | ||
| 1368 | scaled linearly with /proc/<pid>/oom_score_adj. | ||
| 1369 | |||
| 1370 | Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the | ||
| 1371 | other with its scaled value. | ||
| 1372 | |||
| 1373 | The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last | 1364 | The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last |
| 1374 | value set by a CAP_SYS_RESOURCE process. To reduce the value any lower | 1365 | value set by a CAP_SYS_RESOURCE process. To reduce the value any lower |
| 1375 | requires CAP_SYS_RESOURCE. | 1366 | requires CAP_SYS_RESOURCE. |
| 1376 | 1367 | ||
| 1377 | NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see | ||
| 1378 | Documentation/feature-removal-schedule.txt. | ||
| 1379 | |||
| 1380 | Caveat: when a parent task is selected, the oom killer will sacrifice any first | 1368 | Caveat: when a parent task is selected, the oom killer will sacrifice any first |
| 1381 | generation children with separate address spaces instead, if possible. This | 1369 | generation children with separate address spaces instead, if possible. This |
| 1382 | prevents servers and important system daemons from being killed and loses the | 1370 | prevents servers and important system daemons from being killed and loses the |
| @@ -1387,9 +1375,7 @@ minimal amount of work. | |||
| 1387 | ------------------------------------------------------------- | 1375 | ------------------------------------------------------------- |
| 1388 | 1376 | ||
| 1389 | This file can be used to check the current score used by the oom-killer for | 1377 | This file can be used to check the current score used by the oom-killer for |
| 1390 | any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which | 1378 | any given <pid>. |
| 1391 | process should be killed in an out-of-memory situation. | ||
| 1392 | |||
| 1393 | 1379 | ||
| 1394 | 3.3 /proc/<pid>/io - Display the IO accounting fields | 1380 | 3.3 /proc/<pid>/io - Display the IO accounting fields |
| 1395 | ------------------------------------------------------- | 1381 | ------------------------------------------------------- |
diff --git a/Documentation/memory.txt b/Documentation/memory.txt deleted file mode 100644 index 802efe58647c..000000000000 --- a/Documentation/memory.txt +++ /dev/null | |||
| @@ -1,33 +0,0 @@ | |||
| 1 | There are several classic problems related to memory on Linux | ||
| 2 | systems. | ||
| 3 | |||
| 4 | 1) There are some motherboards that will not cache above | ||
| 5 | a certain quantity of memory. If you have one of these | ||
| 6 | motherboards, your system will be SLOWER, not faster | ||
| 7 | as you add more memory. Consider exchanging your | ||
| 8 | motherboard. | ||
| 9 | |||
| 10 | All of these problems can be addressed with the "mem=XXXM" boot option | ||
| 11 | (where XXX is the size of RAM to use in megabytes). | ||
| 12 | It can also tell Linux to use less memory than is actually installed. | ||
| 13 | If you use "mem=" on a machine with PCI, consider using "memmap=" to avoid | ||
| 14 | physical address space collisions. | ||
| 15 | |||
| 16 | See the documentation of your boot loader (LILO, grub, loadlin, etc.) about | ||
| 17 | how to pass options to the kernel. | ||
| 18 | |||
| 19 | There are other memory problems which Linux cannot deal with. Random | ||
| 20 | corruption of memory is usually a sign of serious hardware trouble. | ||
| 21 | Try: | ||
| 22 | |||
| 23 | * Reducing memory settings in the BIOS to the most conservative | ||
| 24 | timings. | ||
| 25 | |||
| 26 | * Adding a cooling fan. | ||
| 27 | |||
| 28 | * Not overclocking your CPU. | ||
| 29 | |||
| 30 | * Having the memory tested in a memory tester or exchanged | ||
| 31 | with the vendor. Consider testing it with memtest86 yourself. | ||
| 32 | |||
| 33 | * Exchanging your CPU, cache, or motherboard for one that works. | ||
diff --git a/Documentation/prio_tree.txt b/Documentation/prio_tree.txt deleted file mode 100644 index 3aa68f9a117b..000000000000 --- a/Documentation/prio_tree.txt +++ /dev/null | |||
| @@ -1,107 +0,0 @@ | |||
| 1 | The prio_tree.c code indexes vmas using 3 different indexes: | ||
| 2 | * heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff | ||
| 3 | * radix_index = vm_pgoff : start_vm_pgoff | ||
| 4 | * size_index = vm_size_in_pages | ||
| 5 | |||
| 6 | A regular radix-priority-search-tree indexes vmas using only heap_index and | ||
| 7 | radix_index. The conditions for indexing are: | ||
| 8 | * ->heap_index >= ->left->heap_index && | ||
| 9 | ->heap_index >= ->right->heap_index | ||
| 10 | * if (->heap_index == ->left->heap_index) | ||
| 11 | then ->radix_index < ->left->radix_index; | ||
| 12 | * if (->heap_index == ->right->heap_index) | ||
| 13 | then ->radix_index < ->right->radix_index; | ||
| 14 | * nodes are hashed to left or right subtree using radix_index | ||
| 15 | similar to a pure binary radix tree. | ||
| 16 | |||
| 17 | A regular radix-priority-search-tree helps to store and query | ||
| 18 | intervals (vmas). However, a regular radix-priority-search-tree is only | ||
| 19 | suitable for storing vmas with different radix indices (vm_pgoff). | ||
| 20 | |||
| 21 | Therefore, the prio_tree.c extends the regular radix-priority-search-tree | ||
| 22 | to handle many vmas with the same vm_pgoff. Such vmas are handled in | ||
| 23 | 2 different ways: 1) All vmas with the same radix _and_ heap indices are | ||
| 24 | linked using vm_set.list, 2) if there are many vmas with the same radix | ||
| 25 | index, but different heap indices and if the regular radix-priority-search | ||
| 26 | tree cannot index them all, we build an overflow-sub-tree that indexes such | ||
| 27 | vmas using heap and size indices instead of heap and radix indices. For | ||
| 28 | example, in the figure below some vmas with vm_pgoff = 0 (zero) are | ||
| 29 | indexed by regular radix-priority-search-tree whereas others are pushed | ||
| 30 | into an overflow-subtree. Note that all vmas in an overflow-sub-tree have | ||
| 31 | the same vm_pgoff (radix_index) and if necessary we build different | ||
| 32 | overflow-sub-trees to handle each possible radix_index. For example, | ||
| 33 | in figure we have 3 overflow-sub-trees corresponding to radix indices | ||
| 34 | 0, 2, and 4. | ||
| 35 | |||
| 36 | In the final tree the first few (prio_tree_root->index_bits) levels | ||
| 37 | are indexed using heap and radix indices whereas the overflow-sub-trees below | ||
| 38 | those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are | ||
| 39 | indexed using heap and size indices. In overflow-sub-trees the size_index | ||
| 40 | is used for hashing the nodes to appropriate places. | ||
| 41 | |||
| 42 | Now, an example prio_tree: | ||
| 43 | |||
| 44 | vmas are represented [radix_index, size_index, heap_index] | ||
| 45 | i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff] | ||
| 46 | |||
| 47 | level prio_tree_root->index_bits = 3 | ||
| 48 | ----- | ||
| 49 | _ | ||
| 50 | 0 [0,7,7] | | ||
| 51 | / \ | | ||
| 52 | ------------------ ------------ | Regular | ||
| 53 | / \ | radix priority | ||
| 54 | 1 [1,6,7] [4,3,7] | search tree | ||
| 55 | / \ / \ | | ||
| 56 | ------- ----- ------ ----- | heap-and-radix | ||
| 57 | / \ / \ | indexed | ||
| 58 | 2 [0,6,6] [2,5,7] [5,2,7] [6,1,7] | | ||
| 59 | / \ / \ / \ / \ | | ||
| 60 | 3 [0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7] | | ||
| 61 | / / / _ | ||
| 62 | / / / _ | ||
| 63 | 4 [0,4,4] [2,3,5] [4,1,5] | | ||
| 64 | / / / | | ||
| 65 | 5 [0,3,3] [2,2,4] [4,0,4] | Overflow-sub-trees | ||
| 66 | / / | | ||
| 67 | 6 [0,2,2] [2,1,3] | heap-and-size | ||
| 68 | / / | indexed | ||
| 69 | 7 [0,1,1] [2,0,2] | | ||
| 70 | / | | ||
| 71 | 8 [0,0,0] | | ||
| 72 | _ | ||
| 73 | |||
| 74 | Note that we use prio_tree_root->index_bits to optimize the height | ||
| 75 | of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is | ||
| 76 | set according to the maximum end_vm_pgoff mapped, we are sure that all | ||
| 77 | bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore, | ||
| 78 | we only use the first prio_tree_root->index_bits as radix_index. | ||
| 79 | Whenever index_bits is increased in prio_tree_expand, we shuffle the tree | ||
| 80 | to make sure that the first prio_tree_root->index_bits levels of the tree | ||
| 81 | are indexed properly using heap and radix indices. | ||
| 82 | |||
| 83 | We do not optimize the height of overflow-sub-trees using index_bits. | ||
| 84 | The reason is: there can be many such overflow-sub-trees and all of | ||
| 85 | them have to be shuffled whenever the index_bits increases. This may involve | ||
| 86 | walking the whole prio_tree in prio_tree_insert->prio_tree_expand code | ||
| 87 | path which is not desirable. Hence, we do not optimize the height of the | ||
| 88 | heap-and-size indexed overflow-sub-trees using prio_tree->index_bits. | ||
| 89 | Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits | ||
| 90 | of size_index. This may lead to skewed sub-trees because most of the | ||
| 91 | higher significant bits of the size_index are likely to be 0 (zero). In | ||
| 92 | the example above, all 3 overflow-sub-trees are skewed. This may marginally | ||
| 93 | affect the performance. However, processes rarely map many vmas with the | ||
| 94 | same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally | ||
| 95 | do not require overflow-sub-trees to index all vmas. | ||
| 96 | |||
| 97 | From the above discussion it is clear that the maximum height of | ||
| 98 | a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG. | ||
| 99 | However, in most of the common cases we do not need overflow-sub-trees, | ||
| 100 | so the tree height in the common cases will be prio_tree_root->index_bits. | ||
| 101 | |||
| 102 | It is fair to mention here that the prio_tree_root->index_bits | ||
| 103 | is increased on demand, however, the index_bits is not decreased when | ||
| 104 | vmas are removed from the prio_tree. That's tricky to do. Hence, it's | ||
| 105 | left as a homework problem. | ||
| 106 | |||
| 107 | |||
diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt index 8d32d85a5234..61b6c48871a0 100644 --- a/Documentation/rbtree.txt +++ b/Documentation/rbtree.txt | |||
| @@ -193,24 +193,55 @@ Example: | |||
| 193 | Support for Augmented rbtrees | 193 | Support for Augmented rbtrees |
| 194 | ----------------------------- | 194 | ----------------------------- |
| 195 | 195 | ||
| 196 | Augmented rbtree is an rbtree with "some" additional data stored in each node. | 196 | Augmented rbtree is an rbtree with "some" additional data stored in |
| 197 | This data can be used to add new functionality to the rbtree. | 197 | each node, where the additional data for node N must be a function of |
| 198 | Augmented rbtree is an optional feature built on top of basic rbtree | 198 | the contents of all nodes in the subtree rooted at N. This data can |
| 199 | infrastructure. An rbtree user who wants this feature will have to call the | 199 | be used to add new functionality to the rbtree. Augmented rbtree |
| 200 | augmentation functions with the user provided augmentation callback | 200 | is an optional feature built on top of basic rbtree infrastructure. |
| 201 | when inserting and erasing nodes. | 201 | An rbtree user who wants this feature will have to call the augmentation |
| 202 | 202 | functions with the user provided augmentation callback when inserting | |
| 203 | On insertion, the user must call rb_augment_insert() once the new node is in | 203 | and erasing nodes. |
| 204 | place. This will cause the augmentation function callback to be called for | 204 | |
| 205 | each node between the new node and the root which has been affected by the | 205 | C files implementing augmented rbtree manipulation must include |
| 206 | insertion. | 206 | <linux/rbtree_augmented.h> instead of <linux/rbtree.h>. Note that |
| 207 | 207 | linux/rbtree_augmented.h exposes some rbtree implementation details | |
| 208 | When erasing a node, the user must call rb_augment_erase_begin() first to | 208 | you are not expected to rely on; please stick to the documented APIs |
| 209 | retrieve the deepest node on the rebalance path. Then, after erasing the | 209 | there and do not include <linux/rbtree_augmented.h> from header files |
| 210 | original node, the user must call rb_augment_erase_end() with the deepest | 210 | either so as to minimize chances of your users accidentally relying on |
| 211 | node found earlier. This will cause the augmentation function to be called | 211 | such implementation details. |
| 212 | for each affected node between the deepest node and the root. | 212 | |
| 213 | 213 | On insertion, the user must update the augmented information on the path | |
| 214 | leading to the inserted node, then call rb_link_node() as usual and | ||
| 215 | rb_augment_inserted() instead of the usual rb_insert_color() call. | ||
| 216 | If rb_augment_inserted() rebalances the rbtree, it will callback into | ||
| 217 | a user provided function to update the augmented information on the | ||
| 218 | affected subtrees. | ||
| 219 | |||
| 220 | When erasing a node, the user must call rb_erase_augmented() instead of | ||
| 221 | rb_erase(). rb_erase_augmented() calls back into user provided functions | ||
| 222 | to update the augmented information on affected subtrees. | ||
| 223 | |||
| 224 | In both cases, the callbacks are provided through struct rb_augment_callbacks. | ||
| 225 | 3 callbacks must be defined: | ||
| 226 | |||
| 227 | - A propagation callback, which updates the augmented value for a given | ||
| 228 | node and its ancestors, up to a given stop point (or NULL to update | ||
| 229 | all the way to the root). | ||
| 230 | |||
| 231 | - A copy callback, which copies the augmented value for a given subtree | ||
| 232 | to a newly assigned subtree root. | ||
| 233 | |||
| 234 | - A tree rotation callback, which copies the augmented value for a given | ||
| 235 | subtree to a newly assigned subtree root AND recomputes the augmented | ||
| 236 | information for the former subtree root. | ||
| 237 | |||
| 238 | The compiled code for rb_erase_augmented() may inline the propagation and | ||
| 239 | copy callbacks, which results in a large function, so each augmented rbtree | ||
| 240 | user should have a single rb_erase_augmented() call site in order to limit | ||
| 241 | compiled code size. | ||
| 242 | |||
| 243 | |||
| 244 | Sample usage: | ||
| 214 | 245 | ||
| 215 | Interval tree is an example of augmented rb tree. Reference - | 246 | Interval tree is an example of augmented rb tree. Reference - |
| 216 | "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein. | 247 | "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein. |
| @@ -230,26 +261,132 @@ and its immediate children. And this will be used in O(log n) lookup | |||
| 230 | for lowest match (lowest start address among all possible matches) | 261 | for lowest match (lowest start address among all possible matches) |
| 231 | with something like: | 262 | with something like: |
| 232 | 263 | ||
| 233 | find_lowest_match(lo, hi, node) | 264 | struct interval_tree_node * |
| 265 | interval_tree_first_match(struct rb_root *root, | ||
| 266 | unsigned long start, unsigned long last) | ||
| 234 | { | 267 | { |
| 235 | lowest_match = NULL; | 268 | struct interval_tree_node *node; |
| 236 | while (node) { | 269 | |
| 237 | if (max_hi(node->left) > lo) { | 270 | if (!root->rb_node) |
| 238 | // Lowest overlap if any must be on left side | 271 | return NULL; |
| 239 | node = node->left; | 272 | node = rb_entry(root->rb_node, struct interval_tree_node, rb); |
| 240 | } else if (overlap(lo, hi, node)) { | 273 | |
| 241 | lowest_match = node; | 274 | while (true) { |
| 242 | break; | 275 | if (node->rb.rb_left) { |
| 243 | } else if (lo > node->lo) { | 276 | struct interval_tree_node *left = |
| 244 | // Lowest overlap if any must be on right side | 277 | rb_entry(node->rb.rb_left, |
| 245 | node = node->right; | 278 | struct interval_tree_node, rb); |
| 246 | } else { | 279 | if (left->__subtree_last >= start) { |
| 247 | break; | 280 | /* |
| 281 | * Some nodes in left subtree satisfy Cond2. | ||
| 282 | * Iterate to find the leftmost such node N. | ||
| 283 | * If it also satisfies Cond1, that's the match | ||
| 284 | * we are looking for. Otherwise, there is no | ||
| 285 | * matching interval as nodes to the right of N | ||
| 286 | * can't satisfy Cond1 either. | ||
| 287 | */ | ||
| 288 | node = left; | ||
| 289 | continue; | ||
| 290 | } | ||
| 248 | } | 291 | } |
| 292 | if (node->start <= last) { /* Cond1 */ | ||
| 293 | if (node->last >= start) /* Cond2 */ | ||
| 294 | return node; /* node is leftmost match */ | ||
| 295 | if (node->rb.rb_right) { | ||
| 296 | node = rb_entry(node->rb.rb_right, | ||
| 297 | struct interval_tree_node, rb); | ||
| 298 | if (node->__subtree_last >= start) | ||
| 299 | continue; | ||
| 300 | } | ||
| 301 | } | ||
| 302 | return NULL; /* No match */ | ||
| 303 | } | ||
| 304 | } | ||
| 305 | |||
| 306 | Insertion/removal are defined using the following augmented callbacks: | ||
| 307 | |||
| 308 | static inline unsigned long | ||
| 309 | compute_subtree_last(struct interval_tree_node *node) | ||
| 310 | { | ||
| 311 | unsigned long max = node->last, subtree_last; | ||
| 312 | if (node->rb.rb_left) { | ||
| 313 | subtree_last = rb_entry(node->rb.rb_left, | ||
| 314 | struct interval_tree_node, rb)->__subtree_last; | ||
| 315 | if (max < subtree_last) | ||
| 316 | max = subtree_last; | ||
| 317 | } | ||
| 318 | if (node->rb.rb_right) { | ||
| 319 | subtree_last = rb_entry(node->rb.rb_right, | ||
| 320 | struct interval_tree_node, rb)->__subtree_last; | ||
| 321 | if (max < subtree_last) | ||
| 322 | max = subtree_last; | ||
| 323 | } | ||
| 324 | return max; | ||
| 325 | } | ||
| 326 | |||
| 327 | static void augment_propagate(struct rb_node *rb, struct rb_node *stop) | ||
| 328 | { | ||
| 329 | while (rb != stop) { | ||
| 330 | struct interval_tree_node *node = | ||
| 331 | rb_entry(rb, struct interval_tree_node, rb); | ||
| 332 | unsigned long subtree_last = compute_subtree_last(node); | ||
| 333 | if (node->__subtree_last == subtree_last) | ||
| 334 | break; | ||
| 335 | node->__subtree_last = subtree_last; | ||
| 336 | rb = rb_parent(&node->rb); | ||
| 249 | } | 337 | } |
| 250 | return lowest_match; | ||
| 251 | } | 338 | } |
| 252 | 339 | ||
| 253 | Finding exact match will be to first find lowest match and then to follow | 340 | static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new) |
| 254 | successor nodes looking for exact match, until the start of a node is beyond | 341 | { |
| 255 | the hi value we are looking for. | 342 | struct interval_tree_node *old = |
| 343 | rb_entry(rb_old, struct interval_tree_node, rb); | ||
| 344 | struct interval_tree_node *new = | ||
| 345 | rb_entry(rb_new, struct interval_tree_node, rb); | ||
| 346 | |||
| 347 | new->__subtree_last = old->__subtree_last; | ||
| 348 | } | ||
| 349 | |||
| 350 | static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) | ||
| 351 | { | ||
| 352 | struct interval_tree_node *old = | ||
| 353 | rb_entry(rb_old, struct interval_tree_node, rb); | ||
| 354 | struct interval_tree_node *new = | ||
| 355 | rb_entry(rb_new, struct interval_tree_node, rb); | ||
| 356 | |||
| 357 | new->__subtree_last = old->__subtree_last; | ||
| 358 | old->__subtree_last = compute_subtree_last(old); | ||
| 359 | } | ||
| 360 | |||
| 361 | static const struct rb_augment_callbacks augment_callbacks = { | ||
| 362 | augment_propagate, augment_copy, augment_rotate | ||
| 363 | }; | ||
| 364 | |||
| 365 | void interval_tree_insert(struct interval_tree_node *node, | ||
| 366 | struct rb_root *root) | ||
| 367 | { | ||
| 368 | struct rb_node **link = &root->rb_node, *rb_parent = NULL; | ||
| 369 | unsigned long start = node->start, last = node->last; | ||
| 370 | struct interval_tree_node *parent; | ||
| 371 | |||
| 372 | while (*link) { | ||
| 373 | rb_parent = *link; | ||
| 374 | parent = rb_entry(rb_parent, struct interval_tree_node, rb); | ||
| 375 | if (parent->__subtree_last < last) | ||
| 376 | parent->__subtree_last = last; | ||
| 377 | if (start < parent->start) | ||
| 378 | link = &parent->rb.rb_left; | ||
| 379 | else | ||
| 380 | link = &parent->rb.rb_right; | ||
| 381 | } | ||
| 382 | |||
| 383 | node->__subtree_last = last; | ||
| 384 | rb_link_node(&node->rb, rb_parent, link); | ||
| 385 | rb_insert_augmented(&node->rb, root, &augment_callbacks); | ||
| 386 | } | ||
| 387 | |||
| 388 | void interval_tree_remove(struct interval_tree_node *node, | ||
| 389 | struct rb_root *root) | ||
| 390 | { | ||
| 391 | rb_erase_augmented(&node->rb, root, &augment_callbacks); | ||
| 392 | } | ||
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index fa206cccf89f..a68db7692ee8 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt | |||
| @@ -197,12 +197,8 @@ the pages are also "rescued" from the unevictable list in the process of | |||
| 197 | freeing them. | 197 | freeing them. |
| 198 | 198 | ||
| 199 | page_evictable() also checks for mlocked pages by testing an additional page | 199 | page_evictable() also checks for mlocked pages by testing an additional page |
| 200 | flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked, | 200 | flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is |
| 201 | and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is | 201 | faulted into a VM_LOCKED vma, or found in a vma that is being VM_LOCKED. |
| 202 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and | ||
| 203 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | ||
| 204 | efficient "culling" of pages in the fault path that are being faulted in to | ||
| 205 | VM_LOCKED VMAs. | ||
| 206 | 202 | ||
| 207 | 203 | ||
| 208 | VMSCAN'S HANDLING OF UNEVICTABLE PAGES | 204 | VMSCAN'S HANDLING OF UNEVICTABLE PAGES |
| @@ -371,8 +367,8 @@ mlock_fixup() filters several classes of "special" VMAs: | |||
| 371 | mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to | 367 | mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to |
| 372 | allocate the huge pages and populate the ptes. | 368 | allocate the huge pages and populate the ptes. |
| 373 | 369 | ||
| 374 | 3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of | 370 | 3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages, |
| 375 | kernel pages, such as the VDSO page, relay channel pages, etc. These pages | 371 | such as the VDSO page, relay channel pages, etc. These pages |
| 376 | are inherently unevictable and are not managed on the LRU lists. | 372 | are inherently unevictable and are not managed on the LRU lists. |
| 377 | mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls | 373 | mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls |
| 378 | make_pages_present() to populate the ptes. | 374 | make_pages_present() to populate the ptes. |
| @@ -651,7 +647,7 @@ PAGE RECLAIM IN shrink_*_list() | |||
| 651 | ------------------------------- | 647 | ------------------------------- |
| 652 | 648 | ||
| 653 | shrink_active_list() culls any obviously unevictable pages - i.e. | 649 | shrink_active_list() culls any obviously unevictable pages - i.e. |
| 654 | !page_evictable(page, NULL) - diverting these to the unevictable list. | 650 | !page_evictable(page) - diverting these to the unevictable list. |
| 655 | However, shrink_active_list() only sees unevictable pages that made it onto the | 651 | However, shrink_active_list() only sees unevictable pages that made it onto the |
| 656 | active/inactive lru lists. Note that these pages do not have PageUnevictable | 652 | active/inactive lru lists. Note that these pages do not have PageUnevictable |
| 657 | set - otherwise they would be on the unevictable list and shrink_active_list | 653 | set - otherwise they would be on the unevictable list and shrink_active_list |
