diff options
Diffstat (limited to 'Documentation/vm/unevictable-lru.txt')
-rw-r--r-- | Documentation/vm/unevictable-lru.txt | 1041 |
1 files changed, 572 insertions, 469 deletions
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 0706a7282a8c..2d70d0d95108 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt | |||
@@ -1,588 +1,691 @@ | |||
1 | 1 | ============================== | |
2 | This document describes the Linux memory management "Unevictable LRU" | 2 | UNEVICTABLE LRU INFRASTRUCTURE |
3 | infrastructure and the use of this infrastructure to manage several types | 3 | ============================== |
4 | of "unevictable" pages. The document attempts to provide the overall | 4 | |
5 | rationale behind this mechanism and the rationale for some of the design | 5 | ======== |
6 | decisions that drove the implementation. The latter design rationale is | 6 | CONTENTS |
7 | discussed in the context of an implementation description. Admittedly, one | 7 | ======== |
8 | can obtain the implementation details--the "what does it do?"--by reading the | 8 | |
9 | code. One hopes that the descriptions below add value by provide the answer | 9 | (*) The Unevictable LRU |
10 | to "why does it do that?". | 10 | |
11 | 11 | - The unevictable page list. | |
12 | Unevictable LRU Infrastructure: | 12 | - Memory control group interaction. |
13 | 13 | - Marking address spaces unevictable. | |
14 | The Unevictable LRU adds an additional LRU list to track unevictable pages | 14 | - Detecting Unevictable Pages. |
15 | and to hide these pages from vmscan. This mechanism is based on a patch by | 15 | - vmscan's handling of unevictable pages. |
16 | Larry Woodman of Red Hat to address several scalability problems with page | 16 | |
17 | (*) mlock()'d pages. | ||
18 | |||
19 | - History. | ||
20 | - Basic management. | ||
21 | - mlock()/mlockall() system call handling. | ||
22 | - Filtering special vmas. | ||
23 | - munlock()/munlockall() system call handling. | ||
24 | - Migrating mlocked pages. | ||
25 | - mmap(MAP_LOCKED) system call handling. | ||
26 | - munmap()/exit()/exec() system call handling. | ||
27 | - try_to_unmap(). | ||
28 | - try_to_munlock() reverse map scan. | ||
29 | - Page reclaim in shrink_*_list(). | ||
30 | |||
31 | |||
32 | ============ | ||
33 | INTRODUCTION | ||
34 | ============ | ||
35 | |||
36 | This document describes the Linux memory manager's "Unevictable LRU" | ||
37 | infrastructure and the use of this to manage several types of "unevictable" | ||
38 | pages. | ||
39 | |||
40 | The document attempts to provide the overall rationale behind this mechanism | ||
41 | and the rationale for some of the design decisions that drove the | ||
42 | implementation. The latter design rationale is discussed in the context of an | ||
43 | implementation description. Admittedly, one can obtain the implementation | ||
44 | details - the "what does it do?" - by reading the code. One hopes that the | ||
45 | descriptions below add value by provide the answer to "why does it do that?". | ||
46 | |||
47 | |||
48 | =================== | ||
49 | THE UNEVICTABLE LRU | ||
50 | =================== | ||
51 | |||
52 | The Unevictable LRU facility adds an additional LRU list to track unevictable | ||
53 | pages and to hide these pages from vmscan. This mechanism is based on a patch | ||
54 | by Larry Woodman of Red Hat to address several scalability problems with page | ||
17 | reclaim in Linux. The problems have been observed at customer sites on large | 55 | reclaim in Linux. The problems have been observed at customer sites on large |
18 | memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB | 56 | memory x86_64 systems. |
19 | of main memory will have over 32 million 4k pages in a single zone. When a | 57 | |
20 | large fraction of these pages are not evictable for any reason [see below], | 58 | To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of |
21 | vmscan will spend a lot of time scanning the LRU lists looking for the small | 59 | main memory will have over 32 million 4k pages in a single zone. When a large |
22 | fraction of pages that are evictable. This can result in a situation where | 60 | fraction of these pages are not evictable for any reason [see below], vmscan |
23 | all cpus are spending 100% of their time in vmscan for hours or days on end, | 61 | will spend a lot of time scanning the LRU lists looking for the small fraction |
24 | with the system completely unresponsive. | 62 | of pages that are evictable. This can result in a situation where all CPUs are |
25 | 63 | spending 100% of their time in vmscan for hours or days on end, with the system | |
26 | The Unevictable LRU infrastructure addresses the following classes of | 64 | completely unresponsive. |
27 | unevictable pages: | 65 | |
28 | 66 | The unevictable list addresses the following classes of unevictable pages: | |
29 | + page owned by ramfs | 67 | |
30 | + page mapped into SHM_LOCKed shared memory regions | 68 | (*) Those owned by ramfs. |
31 | + page mapped into VM_LOCKED [mlock()ed] vmas | 69 | |
32 | 70 | (*) Those mapped into SHM_LOCK'd shared memory regions. | |
33 | The infrastructure might be able to handle other conditions that make pages | 71 | |
72 | (*) Those mapped into VM_LOCKED [mlock()ed] VMAs. | ||
73 | |||
74 | The infrastructure may also be able to handle other conditions that make pages | ||
34 | unevictable, either by definition or by circumstance, in the future. | 75 | unevictable, either by definition or by circumstance, in the future. |
35 | 76 | ||
36 | 77 | ||
37 | The Unevictable LRU List | 78 | THE UNEVICTABLE PAGE LIST |
79 | ------------------------- | ||
38 | 80 | ||
39 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list | 81 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list |
40 | called the "unevictable" list and an associated page flag, PG_unevictable, to | 82 | called the "unevictable" list and an associated page flag, PG_unevictable, to |
41 | indicate that the page is being managed on the unevictable list. The | 83 | indicate that the page is being managed on the unevictable list. |
42 | PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active | 84 | |
43 | flag in that it indicates on which LRU list a page resides when PG_lru is set. | 85 | The PG_unevictable flag is analogous to, and mutually exclusive with, the |
44 | The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU | 86 | PG_active flag in that it indicates on which LRU list a page resides when |
45 | Kconfig option. | 87 | PG_lru is set. The unevictable list is compile-time configurable based on the |
88 | UNEVICTABLE_LRU Kconfig option. | ||
46 | 89 | ||
47 | The Unevictable LRU infrastructure maintains unevictable pages on an additional | 90 | The Unevictable LRU infrastructure maintains unevictable pages on an additional |
48 | LRU list for a few reasons: | 91 | LRU list for a few reasons: |
49 | 92 | ||
50 | 1) We get to "treat unevictable pages just like we treat other pages in the | 93 | (1) We get to "treat unevictable pages just like we treat other pages in the |
51 | system, which means we get to use the same code to manipulate them, the | 94 | system - which means we get to use the same code to manipulate them, the |
52 | same code to isolate them (for migrate, etc.), the same code to keep track | 95 | same code to isolate them (for migrate, etc.), the same code to keep track |
53 | of the statistics, etc..." [Rik van Riel] | 96 | of the statistics, etc..." [Rik van Riel] |
97 | |||
98 | (2) We want to be able to migrate unevictable pages between nodes for memory | ||
99 | defragmentation, workload management and memory hotplug. The linux kernel | ||
100 | can only migrate pages that it can successfully isolate from the LRU | ||
101 | lists. If we were to maintain pages elsewhere than on an LRU-like list, | ||
102 | where they can be found by isolate_lru_page(), we would prevent their | ||
103 | migration, unless we reworked migration code to find the unevictable pages | ||
104 | itself. | ||
54 | 105 | ||
55 | 2) We want to be able to migrate unevictable pages between nodes--for memory | ||
56 | defragmentation, workload management and memory hotplug. The linux kernel | ||
57 | can only migrate pages that it can successfully isolate from the lru lists. | ||
58 | If we were to maintain pages elsewise than on an lru-like list, where they | ||
59 | can be found by isolate_lru_page(), we would prevent their migration, unless | ||
60 | we reworked migration code to find the unevictable pages. | ||
61 | 106 | ||
107 | The unevictable list does not differentiate between file-backed and anonymous, | ||
108 | swap-backed pages. This differentiation is only important while the pages are, | ||
109 | in fact, evictable. | ||
62 | 110 | ||
63 | The unevictable LRU list does not differentiate between file backed and swap | 111 | The unevictable list benefits from the "arrayification" of the per-zone LRU |
64 | backed [anon] pages. This differentiation is only important while the pages | 112 | lists and statistics originally proposed and posted by Christoph Lameter. |
65 | are, in fact, evictable. | ||
66 | 113 | ||
67 | The unevictable LRU list benefits from the "arrayification" of the per-zone | 114 | The unevictable list does not use the LRU pagevec mechanism. Rather, |
68 | LRU lists and statistics originally proposed and posted by Christoph Lameter. | 115 | unevictable pages are placed directly on the page's zone's unevictable list |
116 | under the zone lru_lock. This allows us to prevent the stranding of pages on | ||
117 | the unevictable list when one task has the page isolated from the LRU and other | ||
118 | tasks are changing the "evictability" state of the page. | ||
69 | 119 | ||
70 | The unevictable list does not use the lru pagevec mechanism. Rather, | ||
71 | unevictable pages are placed directly on the page's zone's unevictable | ||
72 | list under the zone lru_lock. The reason for this is to prevent stranding | ||
73 | of pages on the unevictable list when one task has the page isolated from the | ||
74 | lru and other tasks are changing the "evictability" state of the page. | ||
75 | 120 | ||
121 | MEMORY CONTROL GROUP INTERACTION | ||
122 | -------------------------------- | ||
76 | 123 | ||
77 | Unevictable LRU and Memory Controller Interaction | 124 | The unevictable LRU facility interacts with the memory control group [aka |
125 | memory controller; see Documentation/cgroups/memory.txt] by extending the | ||
126 | lru_list enum. | ||
127 | |||
128 | The memory controller data structure automatically gets a per-zone unevictable | ||
129 | list as a result of the "arrayification" of the per-zone LRU lists (one per | ||
130 | lru_list enum element). The memory controller tracks the movement of pages to | ||
131 | and from the unevictable list. | ||
78 | 132 | ||
79 | The memory controller data structure automatically gets a per zone unevictable | ||
80 | lru list as a result of the "arrayification" of the per-zone LRU lists. The | ||
81 | memory controller tracks the movement of pages to and from the unevictable list. | ||
82 | When a memory control group comes under memory pressure, the controller will | 133 | When a memory control group comes under memory pressure, the controller will |
83 | not attempt to reclaim pages on the unevictable list. This has a couple of | 134 | not attempt to reclaim pages on the unevictable list. This has a couple of |
84 | effects. Because the pages are "hidden" from reclaim on the unevictable list, | 135 | effects: |
85 | the reclaim process can be more efficient, dealing only with pages that have | 136 | |
86 | a chance of being reclaimed. On the other hand, if too many of the pages | 137 | (1) Because the pages are "hidden" from reclaim on the unevictable list, the |
87 | charged to the control group are unevictable, the evictable portion of the | 138 | reclaim process can be more efficient, dealing only with pages that have a |
88 | working set of the tasks in the control group may not fit into the available | 139 | chance of being reclaimed. |
89 | memory. This can cause the control group to thrash or to oom-kill tasks. | 140 | |
90 | 141 | (2) On the other hand, if too many of the pages charged to the control group | |
91 | 142 | are unevictable, the evictable portion of the working set of the tasks in | |
92 | Unevictable LRU: Detecting Unevictable Pages | 143 | the control group may not fit into the available memory. This can cause |
93 | 144 | the control group to thrash or to OOM-kill tasks. | |
94 | The function page_evictable(page, vma) in vmscan.c determines whether a | 145 | |
95 | page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, | 146 | |
96 | page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's | 147 | MARKING ADDRESS SPACES UNEVICTABLE |
97 | address space using a wrapper function. Wrapper functions are used to set, | 148 | ---------------------------------- |
98 | clear and test the flag to reduce the requirement for #ifdef's throughout the | 149 | |
99 | source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. | 150 | For facilities such as ramfs none of the pages attached to the address space |
100 | This flag remains for the life of the inode. | 151 | may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE |
101 | 152 | address space flag is provided, and this can be manipulated by a filesystem | |
102 | For shared memory regions, AS_UNEVICTABLE is set when an application | 153 | using a number of wrapper functions: |
103 | successfully SHM_LOCKs the region and is removed when the region is | 154 | |
104 | SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page | 155 | (*) void mapping_set_unevictable(struct address_space *mapping); |
105 | tables for the region as does, for example, mlock(). So, we make no special | 156 | |
106 | effort to push any pages in the SHM_LOCKed region to the unevictable list. | 157 | Mark the address space as being completely unevictable. |
107 | Vmscan will do this when/if it encounters the pages during reclaim. On | 158 | |
108 | SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the | 159 | (*) void mapping_clear_unevictable(struct address_space *mapping); |
109 | unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed | 160 | |
110 | region is destroyed, the pages are also "rescued" from the unevictable list in | 161 | Mark the address space as being evictable. |
111 | the process of freeing them. | 162 | |
112 | 163 | (*) int mapping_unevictable(struct address_space *mapping); | |
113 | page_evictable() detects mlock()ed pages by testing an additional page flag, | 164 | |
114 | PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a | 165 | Query the address space, and return true if it is completely |
115 | non-NULL vma is supplied, page_evictable() will check whether the vma is | 166 | unevictable. |
167 | |||
168 | These are currently used in two places in the kernel: | ||
169 | |||
170 | (1) By ramfs to mark the address spaces of its inodes when they are created, | ||
171 | and this mark remains for the life of the inode. | ||
172 | |||
173 | (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called. | ||
174 | |||
175 | Note that SHM_LOCK is not required to page in the locked pages if they're | ||
176 | swapped out; the application must touch the pages manually if it wants to | ||
177 | ensure they're in memory. | ||
178 | |||
179 | |||
180 | DETECTING UNEVICTABLE PAGES | ||
181 | --------------------------- | ||
182 | |||
183 | The function page_evictable() in vmscan.c determines whether a page is | ||
184 | evictable or not using the query function outlined above [see section "Marking | ||
185 | address spaces unevictable"] to check the AS_UNEVICTABLE flag. | ||
186 | |||
187 | For address spaces that are so marked after being populated (as SHM regions | ||
188 | might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate | ||
189 | the page tables for the region as does, for example, mlock(), nor need it make | ||
190 | any special effort to push any pages in the SHM_LOCK'd area to the unevictable | ||
191 | list. Instead, vmscan will do this if and when it encounters the pages during | ||
192 | a reclamation scan. | ||
193 | |||
194 | On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan | ||
195 | the pages in the region and "rescue" them from the unevictable list if no other | ||
196 | condition is keeping them unevictable. If an unevictable region is destroyed, | ||
197 | the pages are also "rescued" from the unevictable list in the process of | ||
198 | freeing them. | ||
199 | |||
200 | page_evictable() also checks for mlocked pages by testing an additional page | ||
201 | flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked, | ||
202 | and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is | ||
116 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and | 203 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and |
117 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | 204 | update the appropriate statistics if the vma is VM_LOCKED. This method allows |
118 | efficient "culling" of pages in the fault path that are being faulted in to | 205 | efficient "culling" of pages in the fault path that are being faulted in to |
119 | VM_LOCKED vmas. | 206 | VM_LOCKED VMAs. |
120 | 207 | ||
121 | 208 | ||
122 | Unevictable Pages and Vmscan [shrink_*_list()] | 209 | VMSCAN'S HANDLING OF UNEVICTABLE PAGES |
210 | -------------------------------------- | ||
123 | 211 | ||
124 | If unevictable pages are culled in the fault path, or moved to the unevictable | 212 | If unevictable pages are culled in the fault path, or moved to the unevictable |
125 | list at mlock() or mmap() time, vmscan will never encounter the pages until | 213 | list at mlock() or mmap() time, vmscan will not encounter the pages until they |
126 | they have become evictable again, for example, via munlock() and have been | 214 | have become evictable again (via munlock() for example) and have been "rescued" |
127 | "rescued" from the unevictable list. However, there may be situations where we | 215 | from the unevictable list. However, there may be situations where we decide, |
128 | decide, for the sake of expediency, to leave a unevictable page on one of the | 216 | for the sake of expediency, to leave a unevictable page on one of the regular |
129 | regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for | 217 | active/inactive LRU lists for vmscan to deal with. vmscan checks for such |
130 | such pages in all of the shrink_{active|inactive|page}_list() functions and | 218 | pages in all of the shrink_{active|inactive|page}_list() functions and will |
131 | will "cull" such pages that it encounters--that is, it diverts those pages to | 219 | "cull" such pages that it encounters: that is, it diverts those pages to the |
132 | the unevictable list for the zone being scanned. | 220 | unevictable list for the zone being scanned. |
133 | 221 | ||
134 | There may be situations where a page is mapped into a VM_LOCKED vma, but the | 222 | There may be situations where a page is mapped into a VM_LOCKED VMA, but the |
135 | page is not marked as PageMlocked. Such pages will make it all the way to | 223 | page is not marked as PG_mlocked. Such pages will make it all the way to |
136 | shrink_page_list() where they will be detected when vmscan walks the reverse | 224 | shrink_page_list() where they will be detected when vmscan walks the reverse |
137 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() | 225 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, |
138 | will cull the page at that point. | 226 | shrink_page_list() will cull the page at that point. |
139 | 227 | ||
140 | To "cull" an unevictable page, vmscan simply puts the page back on the lru | 228 | To "cull" an unevictable page, vmscan simply puts the page back on the LRU list |
141 | list using putback_lru_page()--the inverse operation to isolate_lru_page()-- | 229 | using putback_lru_page() - the inverse operation to isolate_lru_page() - after |
142 | after dropping the page lock. Because the condition which makes the page | 230 | dropping the page lock. Because the condition which makes the page unevictable |
143 | unevictable may change once the page is unlocked, putback_lru_page() will | 231 | may change once the page is unlocked, putback_lru_page() will recheck the |
144 | recheck the unevictable state of a page that it places on the unevictable lru | 232 | unevictable state of a page that it places on the unevictable list. If the |
145 | list. If the page has become unevictable, putback_lru_page() removes it from | 233 | page has become unevictable, putback_lru_page() removes it from the list and |
146 | the list and retries, including the page_unevictable() test. Because such a | 234 | retries, including the page_unevictable() test. Because such a race is a rare |
147 | race is a rare event and movement of pages onto the unevictable list should be | 235 | event and movement of pages onto the unevictable list should be rare, these |
148 | rare, these extra evictabilty checks should not occur in the majority of calls | 236 | extra evictabilty checks should not occur in the majority of calls to |
149 | to putback_lru_page(). | 237 | putback_lru_page(). |
150 | 238 | ||
151 | 239 | ||
152 | Mlocked Page: Prior Work | 240 | ============= |
241 | MLOCKED PAGES | ||
242 | ============= | ||
153 | 243 | ||
154 | The "Unevictable Mlocked Pages" infrastructure is based on work originally | 244 | The unevictable page list is also useful for mlock(), in addition to ramfs and |
245 | SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in | ||
246 | NOMMU situations, all mappings are effectively mlocked. | ||
247 | |||
248 | |||
249 | HISTORY | ||
250 | ------- | ||
251 | |||
252 | The "Unevictable mlocked Pages" infrastructure is based on work originally | ||
155 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". | 253 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". |
156 | Nick posted his patch as an alternative to a patch posted by Christoph | 254 | Nick posted his patch as an alternative to a patch posted by Christoph Lameter |
157 | Lameter to achieve the same objective--hiding mlocked pages from vmscan. | 255 | to achieve the same objective: hiding mlocked pages from vmscan. |
158 | In Nick's patch, he used one of the struct page lru list link fields as a count | 256 | |
159 | of VM_LOCKED vmas that map the page. This use of the link field for a count | 257 | In Nick's patch, he used one of the struct page LRU list link fields as a count |
160 | prevented the management of the pages on an LRU list. Thus, mlocked pages were | 258 | of VM_LOCKED VMAs that map the page. This use of the link field for a count |
161 | not migratable as isolate_lru_page() could not find them and the lru list link | 259 | prevented the management of the pages on an LRU list, and thus mlocked pages |
162 | field was not available to the migration subsystem. Nick resolved this by | 260 | were not migratable as isolate_lru_page() could not find them, and the LRU list |
163 | putting mlocked pages back on the lru list before attempting to isolate them, | 261 | link field was not available to the migration subsystem. |
164 | thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated | 262 | |
165 | with the Unevictable LRU work, the count was replaced by walking the reverse | 263 | Nick resolved this by putting mlocked pages back on the lru list before |
166 | map to determine whether any VM_LOCKED vmas mapped the page. More on this | 264 | attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When |
167 | below. | 265 | Nick's patch was integrated with the Unevictable LRU work, the count was |
168 | 266 | replaced by walking the reverse map to determine whether any VM_LOCKED VMAs | |
169 | 267 | mapped the page. More on this below. | |
170 | Mlocked Pages: Basic Management | 268 | |
171 | 269 | ||
172 | Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of | 270 | BASIC MANAGEMENT |
173 | unevictable pages. When such a page has been "noticed" by the memory | 271 | ---------------- |
174 | management subsystem, the page is marked with the PG_mlocked [PageMlocked()] | 272 | |
175 | flag. A PageMlocked() page will be placed on the unevictable LRU list when | 273 | mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable |
176 | it is added to the LRU. Pages can be "noticed" by memory management in | 274 | pages. When such a page has been "noticed" by the memory management subsystem, |
177 | several places: | 275 | the page is marked with the PG_mlocked flag. This can be manipulated using the |
178 | 276 | PageMlocked() functions. | |
179 | 1) in the mlock()/mlockall() system call handlers. | 277 | |
180 | 2) in the mmap() system call handler when mmap()ing a region with the | 278 | A PG_mlocked page will be placed on the unevictable list when it is added to |
181 | MAP_LOCKED flag, or mmap()ing a region in a task that has called | 279 | the LRU. Such pages can be "noticed" by memory management in several places: |
182 | mlockall() with the MCL_FUTURE flag. Both of these conditions result | 280 | |
183 | in the VM_LOCKED flag being set for the vma. | 281 | (1) in the mlock()/mlockall() system call handlers; |
184 | 3) in the fault path, if mlocked pages are "culled" in the fault path, | 282 | |
185 | and when a VM_LOCKED stack segment is expanded. | 283 | (2) in the mmap() system call handler when mmapping a region with the |
186 | 4) as mentioned above, in vmscan:shrink_page_list() when attempting to | 284 | MAP_LOCKED flag; |
187 | reclaim a page in a VM_LOCKED vma via try_to_unmap(). | 285 | |
188 | 286 | (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE | |
189 | Mlocked pages become unlocked and rescued from the unevictable list when: | 287 | flag |
190 | 288 | ||
191 | 1) mapped in a range unlocked via the munlock()/munlockall() system calls. | 289 | (4) in the fault path, if mlocked pages are "culled" in the fault path, |
192 | 2) munmapped() out of the last VM_LOCKED vma that maps the page, including | 290 | and when a VM_LOCKED stack segment is expanded; or |
193 | unmapping at task exit. | 291 | |
194 | 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. | 292 | (5) as mentioned above, in vmscan:shrink_page_list() when attempting to |
195 | 4) before a page is COWed in a VM_LOCKED vma. | 293 | reclaim a page in a VM_LOCKED VMA via try_to_unmap() |
196 | 294 | ||
197 | 295 | all of which result in the VM_LOCKED flag being set for the VMA if it doesn't | |
198 | Mlocked Pages: mlock()/mlockall() System Call Handling | 296 | already have it set. |
297 | |||
298 | mlocked pages become unlocked and rescued from the unevictable list when: | ||
299 | |||
300 | (1) mapped in a range unlocked via the munlock()/munlockall() system calls; | ||
301 | |||
302 | (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including | ||
303 | unmapping at task exit; | ||
304 | |||
305 | (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file; | ||
306 | or | ||
307 | |||
308 | (4) before a page is COW'd in a VM_LOCKED VMA. | ||
309 | |||
310 | |||
311 | mlock()/mlockall() SYSTEM CALL HANDLING | ||
312 | --------------------------------------- | ||
199 | 313 | ||
200 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() | 314 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() |
201 | for each vma in the range specified by the call. In the case of mlockall(), | 315 | for each VMA in the range specified by the call. In the case of mlockall(), |
202 | this is the entire active address space of the task. Note that mlock_fixup() | 316 | this is the entire active address space of the task. Note that mlock_fixup() |
203 | is used for both mlock()ing and munlock()ing a range of memory. A call to | 317 | is used for both mlocking and munlocking a range of memory. A call to mlock() |
204 | mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED | 318 | an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is |
205 | is treated as a no-op--mlock_fixup() simply returns. | 319 | treated as a no-op, and mlock_fixup() simply returns. |
206 | 320 | ||
207 | If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas" | 321 | If the VMA passes some filtering as described in "Filtering Special Vmas" |
208 | below, mlock_fixup() will attempt to merge the vma with its neighbors or split | 322 | below, mlock_fixup() will attempt to merge the VMA with its neighbors or split |
209 | off a subset of the vma if the range does not cover the entire vma. Once the | 323 | off a subset of the VMA if the range does not cover the entire VMA. Once the |
210 | vma has been merged or split or neither, mlock_fixup() will call | 324 | VMA has been merged or split or neither, mlock_fixup() will call |
211 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and | 325 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and to |
212 | to mark the pages as mlocked via mlock_vma_page(). | 326 | mark the pages as mlocked via mlock_vma_page(). |
213 | 327 | ||
214 | Note that the vma being mlocked might be mapped with PROT_NONE. In this case, | 328 | Note that the VMA being mlocked might be mapped with PROT_NONE. In this case, |
215 | get_user_pages() will be unable to fault in the pages. That's OK. If pages | 329 | get_user_pages() will be unable to fault in the pages. That's okay. If pages |
216 | do end up getting faulted into this VM_LOCKED vma, we'll handle them in the | 330 | do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the |
217 | fault path or in vmscan. | 331 | fault path or in vmscan. |
218 | 332 | ||
219 | Also note that a page returned by get_user_pages() could be truncated or | 333 | Also note that a page returned by get_user_pages() could be truncated or |
220 | migrated out from under us, while we're trying to mlock it. To detect | 334 | migrated out from under us, while we're trying to mlock it. To detect this, |
221 | this, __mlock_vma_pages_range() tests the page_mapping after acquiring | 335 | __mlock_vma_pages_range() checks page_mapping() after acquiring the page lock. |
222 | the page lock. If the page is still associated with its mapping, we'll | 336 | If the page is still associated with its mapping, we'll go ahead and call |
223 | go ahead and call mlock_vma_page(). If the mapping is gone, we just | 337 | mlock_vma_page(). If the mapping is gone, we just unlock the page and move on. |
224 | unlock the page and move on. Worse case, this results in page mapped | 338 | In the worst case, this will result in a page mapped in a VM_LOCKED VMA |
225 | in a VM_LOCKED vma remaining on a normal LRU list without being | 339 | remaining on a normal LRU list without being PageMlocked(). Again, vmscan will |
226 | PageMlocked(). Again, vmscan will detect and cull such pages. | 340 | detect and cull such pages. |
227 | 341 | ||
228 | mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will | 342 | mlock_vma_page() will call TestSetPageMlocked() for each page returned by |
229 | TestSetPageMlocked() for each page returned by get_user_pages(). We use | 343 | get_user_pages(). We use TestSetPageMlocked() because the page might already |
230 | TestSetPageMlocked() because the page might already be mlocked by another | 344 | be mlocked by another task/VMA and we don't want to do extra work. We |
231 | task/vma and we don't want to do extra work. We especially do not want to | 345 | especially do not want to count an mlocked page more than once in the |
232 | count an mlocked page more than once in the statistics. If the page was | 346 | statistics. If the page was already mlocked, mlock_vma_page() need do nothing |
233 | already mlocked, mlock_vma_page() is done. | 347 | more. |
234 | 348 | ||
235 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the | 349 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the |
236 | page from the LRU, as it is likely on the appropriate active or inactive list | 350 | page from the LRU, as it is likely on the appropriate active or inactive list |
237 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will | 351 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put |
238 | putback the page--putback_lru_page()--which will notice that the page is now | 352 | back the page - by calling putback_lru_page() - which will notice that the page |
239 | mlocked and divert the page to the zone's unevictable LRU list. If | 353 | is now mlocked and divert the page to the zone's unevictable list. If |
240 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle | 354 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle |
241 | it later if/when it attempts to reclaim the page. | 355 | it later if and when it attempts to reclaim the page. |
242 | 356 | ||
243 | 357 | ||
244 | Mlocked Pages: Filtering Special Vmas | 358 | FILTERING SPECIAL VMAS |
359 | ---------------------- | ||
245 | 360 | ||
246 | mlock_fixup() filters several classes of "special" vmas: | 361 | mlock_fixup() filters several classes of "special" VMAs: |
247 | 362 | ||
248 | 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind | 363 | 1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind |
249 | these mappings are inherently pinned, so we don't need to mark them as | 364 | these mappings are inherently pinned, so we don't need to mark them as |
250 | mlocked. In any case, most of the pages have no struct page in which to | 365 | mlocked. In any case, most of the pages have no struct page in which to so |
251 | so mark the page. Because of this, get_user_pages() will fail for these | 366 | mark the page. Because of this, get_user_pages() will fail for these VMAs, |
252 | vmas, so there is no sense in attempting to visit them. | 367 | so there is no sense in attempting to visit them. |
253 | 368 | ||
254 | 2) vmas mapping hugetlbfs page are already effectively pinned into memory. | 369 | 2) VMAs mapping hugetlbfs page are already effectively pinned into memory. We |
255 | We don't need nor want to mlock() these pages. However, to preserve the | 370 | neither need nor want to mlock() these pages. However, to preserve the |
256 | prior behavior of mlock()--before the unevictable/mlock changes-- | 371 | prior behavior of mlock() - before the unevictable/mlock changes - |
257 | mlock_fixup() will call make_pages_present() in the hugetlbfs vma range | 372 | mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to |
258 | to allocate the huge pages and populate the ptes. | 373 | allocate the huge pages and populate the ptes. |
259 | 374 | ||
260 | 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of | 375 | 3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of |
261 | kernel pages, such as the vdso page, relay channel pages, etc. These pages | 376 | kernel pages, such as the VDSO page, relay channel pages, etc. These pages |
262 | are inherently unevictable and are not managed on the LRU lists. | 377 | are inherently unevictable and are not managed on the LRU lists. |
263 | mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls | 378 | mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls |
264 | make_pages_present() to populate the ptes. | 379 | make_pages_present() to populate the ptes. |
265 | 380 | ||
266 | Note that for all of these special vmas, mlock_fixup() does not set the | 381 | Note that for all of these special VMAs, mlock_fixup() does not set the |
267 | VM_LOCKED flag. Therefore, we won't have to deal with them later during | 382 | VM_LOCKED flag. Therefore, we won't have to deal with them later during |
268 | munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() | 383 | munlock(), munmap() or task exit. Neither does mlock_fixup() account these |
269 | account these vmas against the task's "locked_vm". | 384 | VMAs against the task's "locked_vm". |
270 | 385 | ||
271 | Mlocked Pages: Downgrading the Mmap Semaphore. | 386 | |
272 | 387 | munlock()/munlockall() SYSTEM CALL HANDLING | |
273 | mlock_fixup() must be called with the mmap semaphore held for write, because | 388 | ------------------------------------------- |
274 | it may have to merge or split vmas. However, mlocking a large region of | 389 | |
275 | memory can take a long time--especially if vmscan must reclaim pages to | 390 | The munlock() and munlockall() system calls are handled by the same functions - |
276 | satisfy the regions requirements. Faulting in a large region with the mmap | 391 | do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs |
277 | semaphore held for write can hold off other faults on the address space, in | 392 | lock operation indicated by an argument. So, these system calls are also |
278 | the case of a multi-threaded task. It can also hold off scans of the task's | 393 | handled by mlock_fixup(). Again, if called for an already munlocked VMA, |
279 | address space via /proc. While testing under heavy load, it was observed that | 394 | mlock_fixup() simply returns. Because of the VMA filtering discussed above, |
280 | the ps(1) command could be held off for many minutes while a large segment was | 395 | VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be |
281 | mlock()ed down. | ||
282 | |||
283 | To address this issue, and to make the system more responsive during mlock()ing | ||
284 | of large segments, mlock_fixup() downgrades the mmap semaphore to read mode | ||
285 | during the call to __mlock_vma_pages_range(). This works fine. However, the | ||
286 | callers of mlock_fixup() expect the semaphore to be returned in write mode. | ||
287 | So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not | ||
288 | support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore | ||
289 | and reacquire it in write mode. In a multi-threaded task, it is possible for | ||
290 | the task memory map to change while the semaphore is dropped. Therefore, | ||
291 | mlock_fixup() looks up the vma at the range start address after reacquiring | ||
292 | the semaphore in write mode and verifies that it still covers the original | ||
293 | range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of | ||
294 | mlock_fixup() have been changed to deal with this new error condition. | ||
295 | |||
296 | Note: when munlocking a region, all of the pages should already be resident-- | ||
297 | unless we have racing threads mlocking() and munlocking() regions. So, | ||
298 | unlocking should not have to wait for page allocations nor faults of any kind. | ||
299 | Therefore mlock_fixup() does not downgrade the semaphore for munlock(). | ||
300 | |||
301 | |||
302 | Mlocked Pages: munlock()/munlockall() System Call Handling | ||
303 | |||
304 | The munlock() and munlockall() system calls are handled by the same functions-- | ||
305 | do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock | ||
306 | vs lock operation indicated by an argument. So, these system calls are also | ||
307 | handled by mlock_fixup(). Again, if called for an already munlock()ed vma, | ||
308 | mlock_fixup() simply returns. Because of the vma filtering discussed above, | ||
309 | VM_LOCKED will not be set in any "special" vmas. So, these vmas will be | ||
310 | ignored for munlock. | 396 | ignored for munlock. |
311 | 397 | ||
312 | If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off | 398 | If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the |
313 | the specified range. The range is then munlocked via the function | 399 | specified range. The range is then munlocked via the function |
314 | __mlock_vma_pages_range()--the same function used to mlock a vma range-- | 400 | __mlock_vma_pages_range() - the same function used to mlock a VMA range - |
315 | passing a flag to indicate that munlock() is being performed. | 401 | passing a flag to indicate that munlock() is being performed. |
316 | 402 | ||
317 | Because the vma access protections could have been changed to PROT_NONE after | 403 | Because the VMA access protections could have been changed to PROT_NONE after |
318 | faulting in and mlocking pages, get_user_pages() was unreliable for visiting | 404 | faulting in and mlocking pages, get_user_pages() was unreliable for visiting |
319 | these pages for munlocking. Because we don't want to leave pages mlocked(), | 405 | these pages for munlocking. Because we don't want to leave pages mlocked, |
320 | get_user_pages() was enhanced to accept a flag to ignore the permissions when | 406 | get_user_pages() was enhanced to accept a flag to ignore the permissions when |
321 | fetching the pages--all of which should be resident as a result of previous | 407 | fetching the pages - all of which should be resident as a result of previous |
322 | mlock()ing. | 408 | mlocking. |
323 | 409 | ||
324 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling | 410 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling |
325 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked | 411 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked |
326 | flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() | 412 | flag using TestClearPageMlocked(). As with mlock_vma_page(), |
327 | use the Test*PageMlocked() function to handle the case where the page might | 413 | munlock_vma_page() use the Test*PageMlocked() function to handle the case where |
328 | have already been unlocked by another task. If the page was mlocked, | 414 | the page might have already been unlocked by another task. If the page was |
329 | munlock_vma_page() updates that zone statistics for the number of mlocked | 415 | mlocked, munlock_vma_page() updates that zone statistics for the number of |
330 | pages. Note, however, that at this point we haven't checked whether the page | 416 | mlocked pages. Note, however, that at this point we haven't checked whether |
331 | is mapped by other VM_LOCKED vmas. | 417 | the page is mapped by other VM_LOCKED VMAs. |
332 | 418 | ||
333 | We can't call try_to_munlock(), the function that walks the reverse map to check | 419 | We can't call try_to_munlock(), the function that walks the reverse map to |
334 | for other VM_LOCKED vmas, without first isolating the page from the LRU. | 420 | check for other VM_LOCKED VMAs, without first isolating the page from the LRU. |
335 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page | 421 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page |
336 | not be on an lru list. [More on these below.] However, the call to | 422 | not be on an LRU list [more on these below]. However, the call to |
337 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). | 423 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So, |
338 | So, we go ahead and clear PG_mlocked up front, as this might be the only chance | 424 | we go ahead and clear PG_mlocked up front, as this might be the only chance we |
339 | we have. If we can successfully isolate the page, we go ahead and | 425 | have. If we can successfully isolate the page, we go ahead and |
340 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone | 426 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone |
341 | page statistics if it finds another vma holding the page mlocked. If we fail | 427 | page statistics if it finds another VMA holding the page mlocked. If we fail |
342 | to isolate the page, we'll have left a potentially mlocked page on the LRU. | 428 | to isolate the page, we'll have left a potentially mlocked page on the LRU. |
343 | This is fine, because we'll catch it later when/if vmscan tries to reclaim the | 429 | This is fine, because we'll catch it later if and if vmscan tries to reclaim |
344 | page. This should be relatively rare. | 430 | the page. This should be relatively rare. |
345 | 431 | ||
346 | Mlocked Pages: Migrating Them... | 432 | |
347 | 433 | MIGRATING MLOCKED PAGES | |
348 | A page that is being migrated has been isolated from the lru lists and is | 434 | ----------------------- |
349 | held locked across unmapping of the page, updating the page's mapping | 435 | |
350 | [address_space] entry and copying the contents and state, until the | 436 | A page that is being migrated has been isolated from the LRU lists and is held |
351 | page table entry has been replaced with an entry that refers to the new | 437 | locked across unmapping of the page, updating the page's address space entry |
352 | page. Linux supports migration of mlocked pages and other unevictable | 438 | and copying the contents and state, until the page table entry has been |
353 | pages. This involves simply moving the PageMlocked and PageUnevictable states | 439 | replaced with an entry that refers to the new page. Linux supports migration |
354 | from the old page to the new page. | 440 | of mlocked pages and other unevictable pages. This involves simply moving the |
355 | 441 | PG_mlocked and PG_unevictable states from the old page to the new page. | |
356 | Note that page migration can race with mlocking or munlocking of the same | 442 | |
357 | page. This has been discussed from the mlock/munlock perspective in the | 443 | Note that page migration can race with mlocking or munlocking of the same page. |
358 | respective sections above. Both processes [migration, m[un]locking], hold | 444 | This has been discussed from the mlock/munlock perspective in the respective |
359 | the page locked. This provides the first level of synchronization. Page | 445 | sections above. Both processes (migration and m[un]locking) hold the page |
360 | migration zeros out the page_mapping of the old page before unlocking it, | 446 | locked. This provides the first level of synchronization. Page migration |
361 | so m[un]lock can skip these pages by testing the page mapping under page | 447 | zeros out the page_mapping of the old page before unlocking it, so m[un]lock |
362 | lock. | 448 | can skip these pages by testing the page mapping under page lock. |
363 | 449 | ||
364 | When completing page migration, we place the new and old pages back onto the | 450 | To complete page migration, we place the new and old pages back onto the LRU |
365 | lru after dropping the page lock. The "unneeded" page--old page on success, | 451 | after dropping the page lock. The "unneeded" page - old page on success, new |
366 | new page on failure--will be freed when the reference count held by the | 452 | page on failure - will be freed when the reference count held by the migration |
367 | migration process is released. To ensure that we don't strand pages on the | 453 | process is released. To ensure that we don't strand pages on the unevictable |
368 | unevictable list because of a race between munlock and migration, page | 454 | list because of a race between munlock and migration, page migration uses the |
369 | migration uses the putback_lru_page() function to add migrated pages back to | 455 | putback_lru_page() function to add migrated pages back to the LRU. |
370 | the lru. | 456 | |
371 | 457 | ||
372 | 458 | mmap(MAP_LOCKED) SYSTEM CALL HANDLING | |
373 | Mlocked Pages: mmap(MAP_LOCKED) System Call Handling | 459 | ------------------------------------- |
374 | 460 | ||
375 | In addition the the mlock()/mlockall() system calls, an application can request | 461 | In addition the the mlock()/mlockall() system calls, an application can request |
376 | that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() | 462 | that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap() |
377 | call. Furthermore, any mmap() call or brk() call that expands the heap by a | 463 | call. Furthermore, any mmap() call or brk() call that expands the heap by a |
378 | task that has previously called mlockall() with the MCL_FUTURE flag will result | 464 | task that has previously called mlockall() with the MCL_FUTURE flag will result |
379 | in the newly mapped memory being mlocked. Before the unevictable/mlock changes, | 465 | in the newly mapped memory being mlocked. Before the unevictable/mlock |
380 | the kernel simply called make_pages_present() to allocate pages and populate | 466 | changes, the kernel simply called make_pages_present() to allocate pages and |
381 | the page table. | 467 | populate the page table. |
382 | 468 | ||
383 | To mlock a range of memory under the unevictable/mlock infrastructure, the | 469 | To mlock a range of memory under the unevictable/mlock infrastructure, the |
384 | mmap() handler and task address space expansion functions call | 470 | mmap() handler and task address space expansion functions call |
385 | mlock_vma_pages_range() specifying the vma and the address range to mlock. | 471 | mlock_vma_pages_range() specifying the vma and the address range to mlock. |
386 | mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in | 472 | mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in |
387 | "Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will | 473 | "Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have |
388 | have already been set by the caller, in filtered vmas. Thus these vma's need | 474 | already been set by the caller, in filtered VMAs. Thus these VMA's need not be |
389 | not be visited for munlock when the region is unmapped. | 475 | visited for munlock when the region is unmapped. |
390 | 476 | ||
391 | For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to | 477 | For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to |
392 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), | 478 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), |
393 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before | 479 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before |
394 | attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore | 480 | attempting to fault/allocate and mlock the pages and "upgrades" the semaphore |
395 | back to write mode before returning. | 481 | back to write mode before returning. |
396 | 482 | ||
397 | The callers of mlock_vma_pages_range() will have already added the memory | 483 | The callers of mlock_vma_pages_range() will have already added the memory range |
398 | range to be mlocked to the task's "locked_vm". To account for filtered vmas, | 484 | to be mlocked to the task's "locked_vm". To account for filtered VMAs, |
399 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the | 485 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the |
400 | callers then subtract a non-negative return value from the task's locked_vm. | 486 | callers then subtract a non-negative return value from the task's locked_vm. A |
401 | A negative return value represent an error--for example, from get_user_pages() | 487 | negative return value represent an error - for example, from get_user_pages() |
402 | attempting to fault in a vma with PROT_NONE access. In this case, we leave | 488 | attempting to fault in a VMA with PROT_NONE access. In this case, we leave the |
403 | the memory range accounted as locked_vm, as the protections could be changed | 489 | memory range accounted as locked_vm, as the protections could be changed later |
404 | later and pages allocated into that region. | 490 | and pages allocated into that region. |
405 | 491 | ||
406 | 492 | ||
407 | Mlocked Pages: munmap()/exit()/exec() System Call Handling | 493 | munmap()/exit()/exec() SYSTEM CALL HANDLING |
494 | ------------------------------------------- | ||
408 | 495 | ||
409 | When unmapping an mlocked region of memory, whether by an explicit call to | 496 | When unmapping an mlocked region of memory, whether by an explicit call to |
410 | munmap() or via an internal unmap from exit() or exec() processing, we must | 497 | munmap() or via an internal unmap from exit() or exec() processing, we must |
411 | munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. | 498 | munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. |
412 | Before the unevictable/mlock changes, mlocking did not mark the pages in any | 499 | Before the unevictable/mlock changes, mlocking did not mark the pages in any |
413 | way, so unmapping them required no processing. | 500 | way, so unmapping them required no processing. |
414 | 501 | ||
415 | To munlock a range of memory under the unevictable/mlock infrastructure, the | 502 | To munlock a range of memory under the unevictable/mlock infrastructure, the |
416 | munmap() hander and task address space tear down function call | 503 | munmap() handler and task address space call tear down function |
417 | munlock_vma_pages_all(). The name reflects the observation that one always | 504 | munlock_vma_pages_all(). The name reflects the observation that one always |
418 | specifies the entire vma range when munlock()ing during unmap of a region. | 505 | specifies the entire VMA range when munlock()ing during unmap of a region. |
419 | Because of the vma filtering when mlocking() regions, only "normal" vmas that | 506 | Because of the VMA filtering when mlocking() regions, only "normal" VMAs that |
420 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). | 507 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). |
421 | 508 | ||
422 | munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() | 509 | munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup() |
423 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table | 510 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table |
424 | for the vma's memory range and munlock_vma_page() each resident page mapped by | 511 | for the VMA's memory range and munlock_vma_page() each resident page mapped by |
425 | the vma. This effectively munlocks the page, only if this is the last | 512 | the VMA. This effectively munlocks the page, only if this is the last |
426 | VM_LOCKED vma that maps the page. | 513 | VM_LOCKED VMA that maps the page. |
427 | |||
428 | 514 | ||
429 | Mlocked Page: try_to_unmap() | ||
430 | 515 | ||
431 | [Note: the code changes represented by this section are really quite small | 516 | try_to_unmap() |
432 | compared to the text to describe what happening and why, and to discuss the | 517 | -------------- |
433 | implications.] | ||
434 | 518 | ||
435 | Pages can, of course, be mapped into multiple vmas. Some of these vmas may | 519 | Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may |
436 | have VM_LOCKED flag set. It is possible for a page mapped into one or more | 520 | have VM_LOCKED flag set. It is possible for a page mapped into one or more |
437 | VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one | 521 | VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one |
438 | of the active or inactive LRU lists. This could happen if, for example, a | 522 | of the active or inactive LRU lists. This could happen if, for example, a task |
439 | task in the process of munlock()ing the page could not isolate the page from | 523 | in the process of munlocking the page could not isolate the page from the LRU. |
440 | the LRU. As a result, vmscan/shrink_page_list() might encounter such a page | 524 | As a result, vmscan/shrink_page_list() might encounter such a page as described |
441 | as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To | 525 | in section "vmscan's handling of unevictable pages". To handle this situation, |
442 | handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED | 526 | try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse |
443 | vmas while it is walking a page's reverse map. | 527 | map. |
444 | 528 | ||
445 | try_to_unmap() is always called, by either vmscan for reclaim or for page | 529 | try_to_unmap() is always called, by either vmscan for reclaim or for page |
446 | migration, with the argument page locked and isolated from the LRU. BUG_ON() | 530 | migration, with the argument page locked and isolated from the LRU. Separate |
447 | assertions enforce this requirement. Separate functions handle anonymous and | 531 | functions handle anonymous and mapped file pages, as these types of pages have |
448 | mapped file pages, as these types of pages have different reverse map | 532 | different reverse map mechanisms. |
449 | mechanisms. | 533 | |
450 | 534 | (*) try_to_unmap_anon() | |
451 | try_to_unmap_anon() | 535 | |
452 | 536 | To unmap anonymous pages, each VMA in the list anchored in the anon_vma | |
453 | To unmap anonymous pages, each vma in the list anchored in the anon_vma must be | 537 | must be visited - at least until a VM_LOCKED VMA is encountered. If the |
454 | visited--at least until a VM_LOCKED vma is encountered. If the page is being | 538 | page is being unmapped for migration, VM_LOCKED VMAs do not stop the |
455 | unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked | 539 | process because mlocked pages are migratable. However, for reclaim, if |
456 | pages are migratable. However, for reclaim, if the page is mapped into a | 540 | the page is mapped into a VM_LOCKED VMA, the scan stops. |
457 | VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap | 541 | |
458 | semphore of the mm_struct to which the vma belongs in read mode. If this is | 542 | try_to_unmap_anon() attempts to acquire in read mode the mmap semphore of |
459 | successful, try_to_unmap() will mlock the page via mlock_vma_page()--we | 543 | the mm_struct to which the VMA belongs. If this is successful, it will |
460 | wouldn't have gotten to try_to_unmap() if the page were already mlocked--and | 544 | mlock the page via mlock_vma_page() - we wouldn't have gotten to |
461 | will return SWAP_MLOCK, indicating that the page is unevictable. If the | 545 | try_to_unmap_anon() if the page were already mlocked - and will return |
462 | mmap semaphore cannot be acquired, we are not sure whether the page is really | 546 | SWAP_MLOCK, indicating that the page is unevictable. |
463 | unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. | 547 | |
464 | 548 | If the mmap semaphore cannot be acquired, we are not sure whether the page | |
465 | try_to_unmap_file() -- linear mappings | 549 | is really unevictable or not. In this case, try_to_unmap_anon() will |
466 | 550 | return SWAP_AGAIN. | |
467 | Unmapping of a mapped file page works the same, except that the scan visits | 551 | |
468 | all vmas that maps the page's index/page offset in the page's mapping's | 552 | (*) try_to_unmap_file() - linear mappings |
469 | reverse map priority search tree. It must also visit each vma in the page's | 553 | |
470 | mapping's non-linear list, if the list is non-empty. As for anonymous pages, | 554 | Unmapping of a mapped file page works the same as for anonymous mappings, |
471 | on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will | 555 | except that the scan visits all VMAs that map the page's index/page offset |
472 | attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, | 556 | in the page's mapping's reverse map priority search tree. It also visits |
473 | returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. | 557 | each VMA in the page's mapping's non-linear list, if the list is |
474 | 558 | non-empty. | |
475 | try_to_unmap_file() -- non-linear mappings | 559 | |
476 | 560 | As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file | |
477 | If a page's mapping contains a non-empty non-linear mapping vma list, then | 561 | page, try_to_unmap_file() will attempt to acquire the associated |
478 | try_to_un{map|lock}() must also visit each vma in that list to determine | 562 | mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this |
479 | whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit | 563 | is successful, and SWAP_AGAIN, if not. |
480 | all vmas in the non-linear list to ensure that the pages is not/should not be | 564 | |
481 | mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. | 565 | (*) try_to_unmap_file() - non-linear mappings |
482 | However, there is no easy way to determine whether the page is actually mapped | 566 | |
483 | in a given vma--either for unmapping or testing whether the VM_LOCKED vma | 567 | If a page's mapping contains a non-empty non-linear mapping VMA list, then |
484 | actually pins the page. | 568 | try_to_un{map|lock}() must also visit each VMA in that list to determine |
485 | 569 | whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit | |
486 | So, try_to_unmap_file() handles non-linear mappings by scanning a certain | 570 | all VMAs in the non-linear list to ensure that the pages is not/should not |
487 | number of pages--a "cluster"--in each non-linear vma associated with the page's | 571 | be mlocked. |
488 | mapping, for each file mapped page that vmscan tries to unmap. If this happens | 572 | |
489 | to unmap the page we're trying to unmap, try_to_unmap() will notice this on | 573 | If a VM_LOCKED VMA is found in the list, the scan could terminate. |
490 | return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it | 574 | However, there is no easy way to determine whether the page is actually |
491 | will return SWAP_AGAIN, causing vmscan to recirculate this page. We take | 575 | mapped in a given VMA - either for unmapping or testing whether the |
492 | advantage of the cluster scan in try_to_unmap_cluster() as follows: | 576 | VM_LOCKED VMA actually pins the page. |
493 | 577 | ||
494 | For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap | 578 | try_to_unmap_file() handles non-linear mappings by scanning a certain |
495 | semaphore of the associated mm_struct for read without blocking. If this | 579 | number of pages - a "cluster" - in each non-linear VMA associated with the |
496 | attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will | 580 | page's mapping, for each file mapped page that vmscan tries to unmap. If |
497 | retain the mmap semaphore for the scan; otherwise it drops it here. Then, | 581 | this happens to unmap the page we're trying to unmap, try_to_unmap() will |
498 | for each page in the cluster, if we're holding the mmap semaphore for a locked | 582 | notice this on return (page_mapcount(page) will be 0) and return |
499 | vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This | 583 | SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to |
500 | call is a no-op if the page is already locked, but will mlock any pages in | 584 | recirculate this page. We take advantage of the cluster scan in |
501 | the non-linear mapping that happen to be unlocked. If one of the pages so | 585 | try_to_unmap_cluster() as follows: |
502 | mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will | 586 | |
503 | return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan | 587 | For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the |
504 | to cull the page, rather than recirculating it on the inactive list. Again, | 588 | mmap semaphore of the associated mm_struct for read without blocking. |
505 | if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns | 589 | |
506 | SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but | 590 | If this attempt is successful and the VMA is VM_LOCKED, |
507 | couldn't be mlocked. | 591 | try_to_unmap_cluster() will retain the mmap semaphore for the scan; |
508 | 592 | otherwise it drops it here. | |
509 | 593 | ||
510 | Mlocked pages: try_to_munlock() Reverse Map Scan | 594 | Then, for each page in the cluster, if we're holding the mmap semaphore |
511 | 595 | for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to | |
512 | TODO/FIXME: a better name might be page_mlocked()--analogous to the | 596 | mlock the page. This call is a no-op if the page is already locked, |
513 | page_referenced() reverse map walker. | 597 | but will mlock any pages in the non-linear mapping that happen to be |
514 | 598 | unlocked. | |
515 | When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() | 599 | |
516 | System Call Handling" above--tries to munlock a page, it needs to | 600 | If one of the pages so mlocked is the page passed in to try_to_unmap(), |
517 | determine whether or not the page is mapped by any VM_LOCKED vma, without | 601 | try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default |
518 | actually attempting to unmap all ptes from the page. For this purpose, the | 602 | SWAP_AGAIN. This will allow vmscan to cull the page, rather than |
519 | unevictable/mlock infrastructure introduced a variant of try_to_unmap() called | 603 | recirculating it on the inactive list. |
520 | try_to_munlock(). | 604 | |
605 | Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it | ||
606 | returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED | ||
607 | VMA, but couldn't be mlocked. | ||
608 | |||
609 | |||
610 | try_to_munlock() REVERSE MAP SCAN | ||
611 | --------------------------------- | ||
612 | |||
613 | [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the | ||
614 | page_referenced() reverse map walker. | ||
615 | |||
616 | When munlock_vma_page() [see section "munlock()/munlockall() System Call | ||
617 | Handling" above] tries to munlock a page, it needs to determine whether or not | ||
618 | the page is mapped by any VM_LOCKED VMA without actually attempting to unmap | ||
619 | all PTEs from the page. For this purpose, the unevictable/mlock infrastructure | ||
620 | introduced a variant of try_to_unmap() called try_to_munlock(). | ||
521 | 621 | ||
522 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | 622 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and |
523 | mapped file pages with an additional argument specifing unlock versus unmap | 623 | mapped file pages with an additional argument specifing unlock versus unmap |
524 | processing. Again, these functions walk the respective reverse maps looking | 624 | processing. Again, these functions walk the respective reverse maps looking |
525 | for VM_LOCKED vmas. When such a vma is found for anonymous pages and file | 625 | for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file |
526 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions | 626 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions |
527 | attempt to acquire the associated mmap semphore, mlock the page via | 627 | attempt to acquire the associated mmap semphore, mlock the page via |
528 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the | 628 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the |
529 | pre-clearing of the page's PG_mlocked done by munlock_vma_page. | 629 | pre-clearing of the page's PG_mlocked done by munlock_vma_page. |
530 | 630 | ||
531 | If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap | 631 | If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap |
532 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() | 632 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to |
533 | to recycle the page on the inactive list and hope that it has better luck | 633 | recycle the page on the inactive list and hope that it has better luck with the |
534 | with the page next time. | 634 | page next time. |
535 | 635 | ||
536 | For file pages mapped into non-linear vmas, the try_to_munlock() logic works | 636 | For file pages mapped into non-linear VMAs, the try_to_munlock() logic works |
537 | slightly differently. On encountering a VM_LOCKED non-linear vma that might | 637 | slightly differently. On encountering a VM_LOCKED non-linear VMA that might |
538 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking | 638 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the |
539 | the page. munlock_vma_page() will just leave the page unlocked and let | 639 | page. munlock_vma_page() will just leave the page unlocked and let vmscan deal |
540 | vmscan deal with it--the usual fallback position. | 640 | with it - the usual fallback position. |
541 | 641 | ||
542 | Note that try_to_munlock()'s reverse map walk must visit every vma in a pages' | 642 | Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's |
543 | reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. | 643 | reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. |
544 | However, the scan can terminate when it encounters a VM_LOCKED vma and can | 644 | However, the scan can terminate when it encounters a VM_LOCKED VMA and can |
545 | successfully acquire the vma's mmap semphore for read and mlock the page. | 645 | successfully acquire the VMA's mmap semphore for read and mlock the page. |
546 | Although try_to_munlock() can be called many [very many!] times when | 646 | Although try_to_munlock() might be called a great many times when munlocking a |
547 | munlock()ing a large region or tearing down a large address space that has been | 647 | large region or tearing down a large address space that has been mlocked via |
548 | mlocked via mlockall(), overall this is a fairly rare event. | 648 | mlockall(), overall this is a fairly rare event. |
549 | 649 | ||
550 | Mlocked Page: Page Reclaim in shrink_*_list() | 650 | |
551 | 651 | PAGE RECLAIM IN shrink_*_list() | |
552 | shrink_active_list() culls any obviously unevictable pages--i.e., | 652 | ------------------------------- |
553 | !page_evictable(page, NULL)--diverting these to the unevictable lru | 653 | |
554 | list. However, shrink_active_list() only sees unevictable pages that | 654 | shrink_active_list() culls any obviously unevictable pages - i.e. |
555 | made it onto the active/inactive lru lists. Note that these pages do not | 655 | !page_evictable(page, NULL) - diverting these to the unevictable list. |
556 | have PageUnevictable set--otherwise, they would be on the unevictable list and | 656 | However, shrink_active_list() only sees unevictable pages that made it onto the |
557 | shrink_active_list would never see them. | 657 | active/inactive lru lists. Note that these pages do not have PageUnevictable |
658 | set - otherwise they would be on the unevictable list and shrink_active_list | ||
659 | would never see them. | ||
558 | 660 | ||
559 | Some examples of these unevictable pages on the LRU lists are: | 661 | Some examples of these unevictable pages on the LRU lists are: |
560 | 662 | ||
561 | 1) ramfs pages that have been placed on the lru lists when first allocated. | 663 | (1) ramfs pages that have been placed on the LRU lists when first allocated. |
664 | |||
665 | (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to | ||
666 | allocate or fault in the pages in the shared memory region. This happens | ||
667 | when an application accesses the page the first time after SHM_LOCK'ing | ||
668 | the segment. | ||
562 | 669 | ||
563 | 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to | 670 | (3) mlocked pages that could not be isolated from the LRU and moved to the |
564 | allocate or fault in the pages in the shared memory region. This happens | 671 | unevictable list in mlock_vma_page(). |
565 | when an application accesses the page the first time after SHM_LOCKing | ||
566 | the segment. | ||
567 | 672 | ||
568 | 3) Mlocked pages that could not be isolated from the lru and moved to the | 673 | (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't |
569 | unevictable list in mlock_vma_page(). | 674 | acquire the VMA's mmap semaphore to test the flags and set PageMlocked. |
675 | munlock_vma_page() was forced to let the page back on to the normal LRU | ||
676 | list for vmscan to handle. | ||
570 | 677 | ||
571 | 3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't | 678 | shrink_inactive_list() also diverts any unevictable pages that it finds on the |
572 | acquire the vma's mmap semaphore to test the flags and set PageMlocked. | 679 | inactive lists to the appropriate zone's unevictable list. |
573 | munlock_vma_page() was forced to let the page back on to the normal | ||
574 | LRU list for vmscan to handle. | ||
575 | 680 | ||
576 | shrink_inactive_list() also culls any unevictable pages that it finds on | 681 | shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd |
577 | the inactive lists, again diverting them to the appropriate zone's unevictable | 682 | after shrink_active_list() had moved them to the inactive list, or pages mapped |
578 | lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became | 683 | into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to |
579 | SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or | 684 | recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter, |
580 | pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from | 685 | but will pass on to shrink_page_list(). |
581 | the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice | ||
582 | the latter, but will pass on to shrink_page_list(). | ||
583 | 686 | ||
584 | shrink_page_list() again culls obviously unevictable pages that it could | 687 | shrink_page_list() again culls obviously unevictable pages that it could |
585 | encounter for similar reason to shrink_inactive_list(). Pages mapped into | 688 | encounter for similar reason to shrink_inactive_list(). Pages mapped into |
586 | VM_LOCKED vmas but without PG_mlocked set will make it all the way to | 689 | VM_LOCKED VMAs but without PG_mlocked set will make it all the way to |
587 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list | 690 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list |
588 | when try_to_unmap() returns SWAP_MLOCK, as discussed above. | 691 | when try_to_unmap() returns SWAP_MLOCK, as discussed above. |