author    Jonathan Herman <hermanjl@cs.unc.edu>  2013-01-17 16:15:55 -0500
committer Jonathan Herman <hermanjl@cs.unc.edu>  2013-01-17 16:15:55 -0500
commit    8dea78da5cee153b8af9c07a2745f6c55057fe12
tree      a8f4d49d63b1ecc92f2fddceba0655b2472c5bd9 /Documentation/vm
parent    406089d01562f1e2bf9f089fd7637009ebaad589
Patched in Tegra support.
Diffstat (limited to 'Documentation/vm')

-rw-r--r--  Documentation/vm/00-INDEX              2
-rw-r--r--  Documentation/vm/cleancache.txt       43
-rw-r--r--  Documentation/vm/frontswap.txt       278
-rw-r--r--  Documentation/vm/hugetlbpage.txt      10
-rw-r--r--  Documentation/vm/numa                  4
-rw-r--r--  Documentation/vm/pagemap.txt           6
-rw-r--r--  Documentation/vm/slub.txt              9
-rw-r--r--  Documentation/vm/transhuge.txt        81
-rw-r--r--  Documentation/vm/unevictable-lru.txt  22

9 files changed, 46 insertions, 409 deletions
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 5481c8ba341..dca82d7c83d 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -30,6 +30,8 @@ page_migration
 	- description of page migration in NUMA systems.
 pagemap.txt
 	- pagemap, from the userspace perspective
+slabinfo.c
+	- source code for a tool to get reports about slabs.
 slub.txt
 	- a short users guide for SLUB.
 unevictable-lru.txt
diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt
index 142fbb0f325..36c367c7308 100644
--- a/Documentation/vm/cleancache.txt
+++ b/Documentation/vm/cleancache.txt
@@ -46,11 +46,10 @@ a negative return value indicates failure.  A "put_page" will copy a
 the pool id, a file key, and a page index into the file.  (The combination
 of a pool id, a file key, and an index is sometimes called a "handle".)
 A "get_page" will copy the page, if found, from cleancache into kernel memory.
-An "invalidate_page" will ensure the page no longer is present in cleancache;
-an "invalidate_inode" will invalidate all pages associated with the specified
-file; and, when a filesystem is unmounted, an "invalidate_fs" will invalidate
-all pages in all files specified by the given pool id and also surrender
-the pool id.
+A "flush_page" will ensure the page no longer is present in cleancache;
+a "flush_inode" will flush all pages associated with the specified file;
+and, when a filesystem is unmounted, a "flush_fs" will flush all pages in
+all files specified by the given pool id and also surrender the pool id.
 
 An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
 to treat the pool as shared using a 128-bit UUID as a key.  On systems
@@ -63,12 +62,12 @@ of the kernel (e.g. by "tools" that control cleancache).  Or a
 cleancache implementation can simply disable shared_init by always
 returning a negative value.
 
-If a get_page is successful on a non-shared pool, the page is invalidated
-(thus making cleancache an "exclusive" cache).  On a shared pool, the page
-is NOT invalidated on a successful get_page so that it remains accessible to
+If a get_page is successful on a non-shared pool, the page is flushed (thus
+making cleancache an "exclusive" cache).  On a shared pool, the page
+is NOT flushed on a successful get_page so that it remains accessible to
 other sharers.  The kernel is responsible for ensuring coherency between
 cleancache (shared or not), the page cache, and the filesystem, using
-cleancache invalidate operations as required.
+cleancache flush operations as required.
 
 Note that cleancache must enforce put-put-get coherency and get-get
 coherency.  For the former, if two puts are made to the same handle but
@@ -78,22 +77,22 @@ if a get for a given handle fails, subsequent gets for that handle will
 never succeed unless preceded by a successful put with that handle.
 
 Last, cleancache provides no SMP serialization guarantees; if two
-different Linux threads are simultaneously putting and invalidating a page
+different Linux threads are simultaneously putting and flushing a page
 with the same handle, the results are indeterminate.  Callers must
 lock the page to ensure serial behavior.
 
 CLEANCACHE PERFORMANCE METRICS
 
-If properly configured, monitoring of cleancache is done via debugfs in
-the /sys/kernel/debug/mm/cleancache directory.  The effectiveness of cleancache
+Cleancache monitoring is done by sysfs files in the
+/sys/kernel/mm/cleancache directory.  The effectiveness of cleancache
 can be measured (across all filesystems) with:
 
 succ_gets	- number of gets that were successful
 failed_gets	- number of gets that failed
 puts		- number of puts attempted (all "succeed")
-invalidates	- number of invalidates attempted
+flushes		- number of flushes attempted
 
-A backend implementation may provide additional metrics.
+A backend implementatation may provide additional metrics.
 
 FAQ
 
@@ -144,7 +143,7 @@ systems.
 
 The core hooks for cleancache in VFS are in most cases a single line
 and the minimum set are placed precisely where needed to maintain
-coherency (via cleancache_invalidate operations) between cleancache,
+coherency (via cleancache_flush operations) between cleancache,
 the page cache, and disk.  All hooks compile into nothingness if
 cleancache is config'ed off and turn into a function-pointer-
 compare-to-NULL if config'ed on but no backend claims the ops
@@ -185,15 +184,15 @@ or for real kernel-addressable RAM, it makes perfect sense for
 transcendent memory.
 
 4) Why is non-shared cleancache "exclusive"?  And where is the
-   page "invalidated" after a "get"? (Minchan Kim)
+   page "flushed" after a "get"? (Minchan Kim)
 
 The main reason is to free up space in transcendent memory and
-to avoid unnecessary cleancache_invalidate calls.  If you want inclusive,
+to avoid unnecessary cleancache_flush calls.  If you want inclusive,
 the page can be "put" immediately following the "get".  If
 put-after-get for inclusive becomes common, the interface could
-be easily extended to add a "get_no_invalidate" call.
+be easily extended to add a "get_no_flush" call.
 
-The invalidate is done by the cleancache backend implementation.
+The flush is done by the cleancache backend implementation.
 
 5) What's the performance impact?
 
@@ -223,7 +222,7 @@ Some points for a filesystem to consider:
   as tmpfs should not enable cleancache)
 - To ensure coherency/correctness, the FS must ensure that all
   file removal or truncation operations either go through VFS or
-  add hooks to do the equivalent cleancache "invalidate" operations
+  add hooks to do the equivalent cleancache "flush" operations
 - To ensure coherency/correctness, either inode numbers must
   be unique across the lifetime of the on-disk file OR the
   FS must provide an "encode_fh" function.
@@ -244,11 +243,11 @@ If cleancache would use the inode virtual address instead of
 inode/filehandle, the pool id could be eliminated.  But, this
 won't work because cleancache retains pagecache data pages
 persistently even when the inode has been pruned from the
-inode unused list, and only invalidates the data page if the file
+inode unused list, and only flushes the data page if the file
 gets removed/truncated.  So if cleancache used the inode kva,
 there would be potential coherency issues if/when the inode
 kva is reused for a different file.  Alternately, if cleancache
-invalidated the pages when the inode kva was freed, much of the value
+flushed the pages when the inode kva was freed, much of the value
 of cleancache would be lost because the cache of pages in cleanache
 is potentially much larger than the kernel pagecache and is most
 useful if the pages survive inode cache removal.
diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt
deleted file mode 100644
index c71a019be60..00000000000
--- a/Documentation/vm/frontswap.txt
+++ /dev/null
@@ -1,278 +0,0 @@
-Frontswap provides a "transcendent memory" interface for swap pages.
-In some environments, dramatic performance savings may be obtained because
-swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
-
-(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
-and the only necessary changes to the core kernel for transcendent memory;
-all other supporting code -- the "backends" -- is implemented as drivers.
-See the LWN.net article "Transcendent memory in a nutshell" for a detailed
-overview of frontswap and related kernel parts:
-https://lwn.net/Articles/454795/ )
-
-Frontswap is so named because it can be thought of as the opposite of
-a "backing" store for a swap device.  The storage is assumed to be
-a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
-to the requirements of transcendent memory (such as Xen's "tmem", or
-in-kernel compressed memory, aka "zcache", or future RAM-like devices);
-this pseudo-RAM device is not directly accessible or addressable by the
-kernel and is of unknown and possibly time-varying size.  The driver
-links itself to frontswap by calling frontswap_register_ops to set the
-frontswap_ops funcs appropriately and the functions it provides must
-conform to certain policies as follows:
-
-An "init" prepares the device to receive frontswap pages associated
-with the specified swap device number (aka "type").  A "store" will
-copy the page to transcendent memory and associate it with the type and
-offset associated with the page.  A "load" will copy the page, if found,
-from transcendent memory into kernel memory, but will NOT remove the page
-from transcendent memory.  An "invalidate_page" will remove the page
-from transcendent memory and an "invalidate_area" will remove ALL pages
-associated with the swap type (e.g., like swapoff) and notify the "device"
-to refuse further stores with that swap type.
-
-Once a page is successfully stored, a matching load on the page will normally
-succeed.  So when the kernel finds itself in a situation where it needs
-to swap out a page, it first attempts to use frontswap.  If the store returns
-success, the data has been successfully saved to transcendent memory and
-a disk write and, if the data is later read back, a disk read are avoided.
-If a store returns failure, transcendent memory has rejected the data, and the
-page can be written to swap as usual.
-
-If a backend chooses, frontswap can be configured as a "writethrough
-cache" by calling frontswap_writethrough().  In this mode, the reduction
-in swap device writes is lost (and also a non-trivial performance advantage)
-in order to allow the backend to arbitrarily "reclaim" space used to
-store frontswap pages to more completely manage its memory usage.
-
-Note that if a page is stored and the page already exists in transcendent memory
-(a "duplicate" store), either the store succeeds and the data is overwritten,
-or the store fails AND the page is invalidated.  This ensures stale data may
-never be obtained from frontswap.
-
-If properly configured, monitoring of frontswap is done via debugfs in
-the /sys/kernel/debug/frontswap directory.  The effectiveness of
-frontswap can be measured (across all swap devices) with:
-
-failed_stores	- how many store attempts have failed
-loads		- how many loads were attempted (all should succeed)
-succ_stores	- how many store attempts have succeeded
-invalidates	- how many invalidates were attempted
-
-A backend implementation may provide additional metrics.
-
-FAQ
-
-1) Where's the value?
-
-When a workload starts swapping, performance falls through the floor.
-Frontswap significantly increases performance in many such workloads by
-providing a clean, dynamic interface to read and write swap pages to
-"transcendent memory" that is otherwise not directly addressable to the kernel.
-This interface is ideal when data is transformed to a different form
-and size (such as with compression) or secretly moved (as might be
-useful for write-balancing for some RAM-like devices).  Swap pages (and
-evicted page-cache pages) are a great use for this kind of slower-than-RAM-
-but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
-cleancache) interface to transcendent memory provides a nice way to read
-and write -- and indirectly "name" -- the pages.
-
-Frontswap -- and cleancache -- with a fairly small impact on the kernel,
-provides a huge amount of flexibility for more dynamic, flexible RAM
-utilization in various system configurations:
-
-In the single kernel case, aka "zcache", pages are compressed and
-stored in local memory, thus increasing the total anonymous pages
-that can be safely kept in RAM.  Zcache essentially trades off CPU
-cycles used in compression/decompression for better memory utilization.
-Benchmarks have shown little or no impact when memory pressure is
-low while providing a significant performance improvement (25%+)
-on some workloads under high memory pressure.
-
-"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
-support for clustered systems.  Frontswap pages are locally compressed
-as in zcache, but then "remotified" to another system's RAM.  This
-allows RAM to be dynamically load-balanced back-and-forth as needed,
-i.e. when system A is overcommitted, it can swap to system B, and
-vice versa.  RAMster can also be configured as a memory server so
-many servers in a cluster can swap, dynamically as needed, to a single
-server configured with a large amount of RAM... without pre-configuring
-how much of the RAM is available for each of the clients!
-
-In the virtual case, the whole point of virtualization is to statistically
-multiplex physical resources across the varying demands of multiple
-virtual machines.  This is really hard to do with RAM and efforts to do
-it well with no kernel changes have essentially failed (except in some
-well-publicized special-case workloads).
-Specifically, the Xen Transcendent Memory backend allows otherwise
-"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
-virtual machines, but the pages can be compressed and deduplicated to
-optimize RAM utilization.  And when guest OS's are induced to surrender
-underutilized RAM (e.g. with "selfballooning"), sudden unexpected
-memory pressure may result in swapping; frontswap allows those pages
-to be swapped to and from hypervisor RAM (if overall host system memory
-conditions allow), thus mitigating the potentially awful performance impact
-of unplanned swapping.
-
-A KVM implementation is underway and has been RFC'ed to lkml.  And,
-using frontswap, investigation is also underway on the use of NVM as
-a memory extension technology.
-
-2) Sure there may be performance advantages in some situations, but
-   what's the space/time overhead of frontswap?
-
-If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
-nothingness and the only overhead is a few extra bytes per swapon'ed
-swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
-registers, there is one extra global variable compared to zero for
-every swap page read or written.  If CONFIG_FRONTSWAP is enabled
-AND a frontswap backend registers AND the backend fails every "store"
-request (i.e. provides no memory despite claiming it might),
-CPU overhead is still negligible -- and since every frontswap fail
-precedes a swap page write-to-disk, the system is highly likely
-to be I/O bound and using a small fraction of a percent of a CPU
-will be irrelevant anyway.
-
-As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
-registers, one bit is allocated for every swap page for every swap
-device that is swapon'd.  This is added to the EIGHT bits (which
-was sixteen until about 2.6.34) that the kernel already allocates
-for every swap page for every swap device that is swapon'd.  (Hugh
-Dickins has observed that frontswap could probably steal one of
-the existing eight bits, but let's worry about that minor optimization
-later.)  For very large swap disks (which are rare) on a standard
-4K pagesize, this is 1MB per 32GB swap.
-
-When swap pages are stored in transcendent memory instead of written
-out to disk, there is a side effect that this may create more memory
-pressure that can potentially outweigh the other advantages.  A
-backend, such as zcache, must implement policies to carefully (but
-dynamically) manage memory limits to ensure this doesn't happen.
-
-3) OK, how about a quick overview of what this frontswap patch does
-   in terms that a kernel hacker can grok?
-
-Let's assume that a frontswap "backend" has registered during
-kernel initialization; this registration indicates that this
-frontswap backend has access to some "memory" that is not directly
-accessible by the kernel.  Exactly how much memory it provides is
-entirely dynamic and random.
-
-Whenever a swap-device is swapon'd frontswap_init() is called,
-passing the swap device number (aka "type") as a parameter.
-This notifies frontswap to expect attempts to "store" swap pages
-associated with that number.
-
-Whenever the swap subsystem is readying a page to write to a swap
-device (c.f swap_writepage()), frontswap_store is called.  Frontswap
-consults with the frontswap backend and if the backend says it does NOT
-have room, frontswap_store returns -1 and the kernel swaps the page
-to the swap device as normal.  Note that the response from the frontswap
-backend is unpredictable to the kernel; it may choose to never accept a
-page, it could accept every ninth page, or it might accept every
-page.  But if the backend does accept a page, the data from the page
-has already been copied and associated with the type and offset,
-and the backend guarantees the persistence of the data.  In this case,
-frontswap sets a bit in the "frontswap_map" for the swap device
-corresponding to the page offset on the swap device to which it would
-otherwise have written the data.
-
-When the swap subsystem needs to swap-in a page (swap_readpage()),
-it first calls frontswap_load() which checks the frontswap_map to
-see if the page was earlier accepted by the frontswap backend.  If
-it was, the page of data is filled from the frontswap backend and
-the swap-in is complete.  If not, the normal swap-in code is
-executed to obtain the page of data from the real swap device.
-
-So every time the frontswap backend accepts a page, a swap device read
-and (potentially) a swap device write are replaced by a "frontswap backend
-store" and (possibly) a "frontswap backend loads", which are presumably much
-faster.
-
-4) Can't frontswap be configured as a "special" swap device that is
-   just higher priority than any real swap device (e.g. like zswap,
-   or maybe swap-over-nbd/NFS)?
-
-No.  First, the existing swap subsystem doesn't allow for any kind of
-swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy,
-but this would require fairly drastic changes.  Even if it were
-rewritten, the existing swap subsystem uses the block I/O layer which
-assumes a swap device is fixed size and any page in it is linearly
-addressable.  Frontswap barely touches the existing swap subsystem,
-and works around the constraints of the block I/O subsystem to provide
-a great deal of flexibility and dynamicity.
-
-For example, the acceptance of any swap page by the frontswap backend is
-entirely unpredictable.  This is critical to the definition of frontswap
-backends because it grants completely dynamic discretion to the
-backend.  In zcache, one cannot know a priori how compressible a page is.
-"Poorly" compressible pages can be rejected, and "poorly" can itself be
-defined dynamically depending on current memory constraints.
-
-Further, frontswap is entirely synchronous whereas a real swap
-device is, by definition, asynchronous and uses block I/O.  The
-block I/O layer is not only unnecessary, but may perform "optimizations"
-that are inappropriate for a RAM-oriented device including delaying
-the write of some pages for a significant amount of time.  Synchrony is
-required to ensure the dynamicity of the backend and to avoid thorny race
-conditions that would unnecessarily and greatly complicate frontswap
-and/or the block I/O subsystem.  That said, only the initial "store"
-and "load" operations need be synchronous.  A separate asynchronous thread
-is free to manipulate the pages stored by frontswap.  For example,
-the "remotification" thread in RAMster uses standard asynchronous
-kernel sockets to move compressed frontswap pages to a remote machine.
-Similarly, a KVM guest-side implementation could do in-guest compression
-and use "batched" hypercalls.
-
-In a virtualized environment, the dynamicity allows the hypervisor
-(or host OS) to do "intelligent overcommit".  For example, it can
-choose to accept pages only until host-swapping might be imminent,
-then force guests to do their own swapping.
-
-There is a downside to the transcendent memory specifications for
-frontswap:  Since any "store" might fail, there must always be a real
-slot on a real swap device to swap the page.  Thus frontswap must be
-implemented as a "shadow" to every swapon'd device with the potential
-capability of holding every page that the swap device might have held
-and the possibility that it might hold no pages at all.  This means
-that frontswap cannot contain more pages than the total of swapon'd
-swap devices.  For example, if NO swap device is configured on some
-installation, frontswap is useless.  Swapless portable devices
-can still use frontswap but a backend for such devices must configure
-some kind of "ghost" swap device and ensure that it is never used.
-
-5) Why this weird definition about "duplicate stores"?  If a page
-   has been previously successfully stored, can't it always be
-   successfully overwritten?
-
-Nearly always it can, but no, sometimes it cannot.  Consider an example
-where data is compressed and the original 4K page has been compressed
-to 1K.  Now an attempt is made to overwrite the page with data that
-is non-compressible and so would take the entire 4K.  But the backend
-has no more space.  In this case, the store must be rejected.  Whenever
-frontswap rejects a store that would overwrite, it also must invalidate
-the old data and ensure that it is no longer accessible.  Since the
-swap subsystem then writes the new data to the read swap device,
-this is the correct course of action to ensure coherency.
-
-6) What is frontswap_shrink for?
-
-When the (non-frontswap) swap subsystem swaps out a page to a real
-swap device, that page is only taking up low-value pre-allocated disk
-space.  But if frontswap has placed a page in transcendent memory, that
-page may be taking up valuable real estate.  The frontswap_shrink
-routine allows code outside of the swap subsystem to force pages out
-of the memory managed by frontswap and back into kernel-addressable memory.
-For example, in RAMster, a "suction driver" thread will attempt
-to "repatriate" pages sent to a remote machine back to the local machine;
-this is driven using the frontswap_shrink mechanism when memory pressure
-subsides.
-
-7) Why does the frontswap patch create the new include file swapfile.h?
-
-The frontswap code depends on some swap-subsystem-internal data
-structures that have, over the years, moved back and forth between
-static and global.  This seemed a reasonable compromise:  Define
-them as global but declare them in a new include file that isn't
-included by the large number of source files that include swap.h.
-
-Dan Magenheimer, last updated April 9, 2012
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index 4ac359b7aa1..f8551b3879f 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -299,17 +299,11 @@ map_hugetlb.c.
 *******************************************************************
 
 /*
- * map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c
+ * hugepage-shm:  see Documentation/vm/hugepage-shm.c
  */
 
 *******************************************************************
 
 /*
- * hugepage-shm:  see tools/testing/selftests/vm/hugepage-shm.c
- */
-
-*******************************************************************
-
-/*
- * hugepage-mmap:  see tools/testing/selftests/vm/hugepage-mmap.c
+ * hugepage-mmap:  see Documentation/vm/hugepage-mmap.c
  */
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index ade01274212..a200a386429 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -109,11 +109,11 @@ to improve NUMA locality using various CPU affinity command line interfaces,
 such as taskset(1) and numactl(1), and program interfaces such as
 sched_setaffinity(2).  Further, one can modify the kernel's default local
 allocation behavior using Linux NUMA memory policy.
-[see Documentation/vm/numa_memory_policy.txt.]
+[see Documentation/vm/numa_memory_policy.]
 
 System administrators can restrict the CPUs and nodes' memories that a non-
 privileged user can specify in the scheduling or NUMA commands and functions
-using control groups and CPUsets.  [see Documentation/cgroups/cpusets.txt]
+using control groups and CPUsets.  [see Documentation/cgroups/CPUsets.txt]
 
 On architectures that do not hide memoryless nodes, Linux will include only
 zones [nodes] with memory in the zonelists.  This means that for a memoryless
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 7587493c67f..df09b9650a8 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt | |||
@@ -16,7 +16,7 @@ There are three components to pagemap: | |||
16 | * Bits 0-4 swap type if swapped | 16 | * Bits 0-4 swap type if swapped |
17 | * Bits 5-54 swap offset if swapped | 17 | * Bits 5-54 swap offset if swapped |
18 | * Bits 55-60 page shift (page size = 1<<page shift) | 18 | * Bits 55-60 page shift (page size = 1<<page shift) |
19 | * Bit 61 page is file-page or shared-anon | 19 | * Bit 61 reserved for future use |
20 | * Bit 62 page swapped | 20 | * Bit 62 page swapped |
21 | * Bit 63 page present | 21 | * Bit 63 page present |
22 | 22 | ||
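The entry layout above can be decoded from userspace. A minimal Python sketch (field names are illustrative, not from this file; bit positions follow the table above, plus the PFN in bits 0-54 that pagemap documents for present, non-swapped pages):

```python
# Decode a 64-bit /proc/pid/pagemap entry per the documented layout:
# bits 0-4 swap type and bits 5-54 swap offset (if swapped),
# bits 55-60 page shift, bit 62 swapped, bit 63 present.
def decode_pagemap_entry(entry):
    present = bool(entry >> 63 & 1)
    swapped = bool(entry >> 62 & 1)
    page_shift = entry >> 55 & 0x3F
    info = {"present": present, "swapped": swapped, "page_shift": page_shift}
    if swapped:
        info["swap_type"] = entry & 0x1F                  # bits 0-4
        info["swap_offset"] = (entry >> 5) & ((1 << 50) - 1)  # bits 5-54
    elif present:
        info["pfn"] = entry & ((1 << 55) - 1)             # bits 0-54
    return info
```

Reading the 8-byte entries themselves is a matter of seeking to `vaddr / page_size * 8` in /proc/pid/pagemap and unpacking little-endian u64 values.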
@@ -60,7 +60,6 @@ There are three components to pagemap: | |||
60 | 19. HWPOISON | 60 | 19. HWPOISON |
61 | 20. NOPAGE | 61 | 20. NOPAGE |
62 | 21. KSM | 62 | 21. KSM |
63 | 22. THP | ||
64 | 63 | ||
65 | Short descriptions to the page flags: | 64 | Short descriptions to the page flags: |
66 | 65 | ||
@@ -98,9 +97,6 @@ Short descriptions to the page flags: | |||
98 | 21. KSM | 97 | 21. KSM |
99 | identical memory pages dynamically shared between one or more processes | 98 | identical memory pages dynamically shared between one or more processes |
100 | 99 | ||
101 | 22. THP | ||
102 | contiguous pages which construct transparent hugepages | ||
103 | |||
104 | [IO related page flags] | 100 | [IO related page flags] |
105 | 1. ERROR IO error occurred | 101 | 1. ERROR IO error occurred |
106 | 3. UPTODATE page has up-to-date data | 102 | 3. UPTODATE page has up-to-date data |
diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt index b0c6d1bbb43..07375e73981 100644 --- a/Documentation/vm/slub.txt +++ b/Documentation/vm/slub.txt | |||
@@ -17,7 +17,7 @@ data and perform operation on the slabs. By default slabinfo only lists | |||
17 | slabs that have data in them. See "slabinfo -h" for more options when | 17 | slabs that have data in them. See "slabinfo -h" for more options when |
18 | running the command. slabinfo can be compiled with | 18 | running the command. slabinfo can be compiled with |
19 | 19 | ||
20 | gcc -o slabinfo tools/vm/slabinfo.c | 20 | gcc -o slabinfo Documentation/vm/slabinfo.c |
21 | 21 | ||
22 | Some of the modes of operation of slabinfo require that slub debugging | 22 | Some of the modes of operation of slabinfo require that slub debugging |
23 | be enabled on the command line. F.e. no tracking information will be | 23 | be enabled on the command line. F.e. no tracking information will be |
@@ -117,7 +117,7 @@ can be influenced by kernel parameters: | |||
117 | 117 | ||
118 | slub_min_objects=x (default 4) | 118 | slub_min_objects=x (default 4) |
119 | slub_min_order=x (default 0) | 119 | slub_min_order=x (default 0) |
120 | slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) | 120 | slub_max_order=x (default 1) |
121 | 121 | ||
122 | slub_min_objects allows to specify how many objects must at least fit | 122 | slub_min_objects allows to specify how many objects must at least fit |
123 | into one slab in order for the allocation order to be acceptable. | 123 | into one slab in order for the allocation order to be acceptable. |
@@ -131,10 +131,7 @@ slub_min_objects. | |||
131 | slub_max_order specifies the order at which slub_min_objects should no | 131 |
132 | longer be checked. This is useful to avoid SLUB trying to generate | 132 | longer be checked. This is useful to avoid SLUB trying to generate |
133 | super large order pages to fit slub_min_objects of a slab cache with | 133 | super large order pages to fit slub_min_objects of a slab cache with |
134 | large object sizes into one high order page. Setting command line | 134 | large object sizes into one high order page. |
135 | parameter debug_guardpage_minorder=N (N > 0) forces slub_max_order | ||
136 | to 0, which causes slabs to be allocated at the minimum possible | ||
137 | order. | ||
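The interplay of slub_min_objects and slub_max_order described above can be illustrated with a simplified order-selection model (a sketch of the documented policy only, not the kernel's actual C code; a 4096-byte page is assumed):

```python
# Simplified model of SLUB's page-order choice: pick the smallest order
# whose slab holds at least min_objects objects, but stop insisting on
# min_objects past max_order and just fit a single object.
PAGE_SIZE = 4096  # assumption for this sketch

def pick_order(object_size, min_objects=4, max_order=1):
    for order in range(max_order + 1):
        if (PAGE_SIZE << order) // object_size >= min_objects:
            return order
    # Beyond max_order the min_objects constraint is dropped: use the
    # smallest order that fits one object of this size.
    order = 0
    while (PAGE_SIZE << order) < object_size:
        order += 1
    return order
```

For example, 512-byte objects fit eight to an order-0 page, while 2048-byte objects need order 1 to satisfy the default minimum of four objects.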
138 | 135 | ||
139 | SLUB Debug output | 136 | SLUB Debug output |
140 | ----------------- | 137 | ----------------- |
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index 8785fb87d9c..29bdf62aac0 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt | |||
@@ -116,13 +116,6 @@ echo always >/sys/kernel/mm/transparent_hugepage/defrag | |||
116 | echo madvise >/sys/kernel/mm/transparent_hugepage/defrag | 116 | echo madvise >/sys/kernel/mm/transparent_hugepage/defrag |
117 | echo never >/sys/kernel/mm/transparent_hugepage/defrag | 117 | echo never >/sys/kernel/mm/transparent_hugepage/defrag |
118 | 118 | ||
119 | By default kernel tries to use huge zero page on read page fault. | ||
120 | It's possible to disable huge zero page by writing 0 or enable it | ||
121 | back by writing 1: | ||
122 | |||
123 | echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page | ||
124 | echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page | ||
125 | |||
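The echo commands above amount to writing short strings into sysfs files; a hedged helper sketch (the path layout is as documented in this file, writing requires root, and the function name is illustrative):

```python
import os

# Write a policy string ("always"/"madvise"/"never") or a 0/1 flag to a
# transparent_hugepage sysfs knob, mirroring the echo commands above.
def set_thp_knob(name, value, root="/sys/kernel/mm/transparent_hugepage"):
    path = os.path.join(root, name)
    with open(path, "w") as f:
        f.write(str(value))
```

For instance, `set_thp_knob("defrag", "never")` corresponds to `echo never >/sys/kernel/mm/transparent_hugepage/defrag`.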
126 | khugepaged will be automatically started when | 119 | khugepaged will be automatically started when |
127 | transparent_hugepage/enabled is set to "always" or "madvise", and it'll | 120 |
128 | be automatically shutdown if it's set to "never". | 121 | be automatically shutdown if it's set to "never". |
@@ -173,76 +166,6 @@ behavior. So to make them effective you need to restart any | |||
173 | application that could have been using hugepages. This also applies to | 166 | application that could have been using hugepages. This also applies to |
174 | the regions registered in khugepaged. | 167 | the regions registered in khugepaged. |
175 | 168 | ||
176 | == Monitoring usage == | ||
177 | |||
178 | The number of transparent huge pages currently used by the system is | ||
179 | available by reading the AnonHugePages field in /proc/meminfo. To | ||
180 | identify what applications are using transparent huge pages, it is | ||
181 | necessary to read /proc/PID/smaps and count the AnonHugePages fields | ||
182 | for each mapping. Note that reading the smaps file is expensive and | ||
183 | reading it frequently will incur overhead. | ||
184 | |||
185 | There are a number of counters in /proc/vmstat that may be used to | ||
186 | monitor how successfully the system is providing huge pages for use. | ||
187 | |||
188 | thp_fault_alloc is incremented every time a huge page is successfully | ||
189 | allocated to handle a page fault. This applies to both the | ||
190 | first time a page is faulted and for COW faults. | ||
191 | |||
192 | thp_collapse_alloc is incremented by khugepaged when it has found | ||
193 | a range of pages to collapse into one huge page and has | ||
194 | successfully allocated a new huge page to store the data. | ||
195 | |||
196 | thp_fault_fallback is incremented if a page fault fails to allocate | ||
197 | a huge page and instead falls back to using small pages. | ||
198 | |||
199 | thp_collapse_alloc_failed is incremented if khugepaged found a range | ||
200 | of pages that should be collapsed into one huge page but failed | ||
201 | the allocation. | ||
202 | |||
203 | thp_split is incremented every time a huge page is split into base | ||
204 | pages. This can happen for a variety of reasons but a common | ||
205 | reason is that a huge page is old and is being reclaimed. | ||
206 | |||
207 | thp_zero_page_alloc is incremented every time a huge zero page is | ||
208 | successfully allocated. It includes allocations which were | ||
209 | dropped due to a race with another allocation. Note, it doesn't count | ||
210 | every map of the huge zero page, only its allocation. | ||
211 | |||
212 | thp_zero_page_alloc_failed is incremented if the kernel fails to allocate | ||
213 | huge zero page and falls back to using small pages. | ||
214 | |||
215 | As the system ages, allocating huge pages may be expensive as the | ||
216 | system uses memory compaction to copy data around memory to free a | ||
217 | huge page for use. There are some counters in /proc/vmstat to help | ||
218 | monitor this overhead. | ||
219 | |||
220 | compact_stall is incremented every time a process stalls to run | ||
221 | memory compaction so that a huge page is free for use. | ||
222 | |||
223 | compact_success is incremented if the system compacted memory and | ||
224 | freed a huge page for use. | ||
225 | |||
226 | compact_fail is incremented if the system tries to compact memory | ||
227 | but fails. | ||
228 | |||
229 | compact_pages_moved is incremented each time a page is moved. If | ||
230 | this value is increasing rapidly, it implies that the system | ||
231 | is copying a lot of data to satisfy the huge page allocation. | ||
232 | It is possible that the cost of copying exceeds any savings | ||
233 | from reduced TLB misses. | ||
234 | |||
235 | compact_pagemigrate_failed is incremented when the underlying mechanism | ||
236 | for moving a page failed. | ||
237 | |||
238 | compact_blocks_moved is incremented each time memory compaction examines | ||
239 | a huge page aligned range of pages. | ||
240 | |||
241 | It is possible to establish how long the stalls were using the function | ||
242 | tracer to record how long was spent in __alloc_pages_nodemask and | ||
243 | using the mm_page_alloc tracepoint to identify which allocations were | ||
244 | for huge pages. | ||
245 | |||
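The per-process accounting described above can be approximated by summing the AnonHugePages fields across all mappings in /proc/PID/smaps; a sketch assuming the usual "AnonHugePages: <n> kB" field format (reading the file itself is left to the caller, since, as noted, smaps reads are expensive):

```python
# Sum AnonHugePages fields from smaps-format text, in kB, as suggested
# above for identifying which applications use transparent huge pages.
def anon_huge_kb(smaps_text):
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("AnonHugePages:"):
            # Lines look like: "AnonHugePages:      2048 kB"
            total += int(line.split()[1])
    return total
```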
246 | == get_user_pages and follow_page == | 169 | == get_user_pages and follow_page == |
247 | 170 | ||
248 | get_user_pages and follow_page if run on a hugepage, will return the | 171 | get_user_pages and follow_page if run on a hugepage, will return the |
@@ -291,7 +214,7 @@ unaffected. libhugetlbfs will also work fine as usual. | |||
291 | == Graceful fallback == | 214 | == Graceful fallback == |
292 | 215 | ||
293 | Code walking pagetables but unaware of huge pmds can simply call | 216 |
294 | split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by | 217 | split_huge_page_pmd(mm, pmd) where the pmd is the one returned by |
295 | pmd_offset. It's trivial to make the code transparent hugepage aware | 218 | pmd_offset. It's trivial to make the code transparent hugepage aware |
296 | by just grepping for "pmd_offset" and adding split_huge_page_pmd where | 219 | by just grepping for "pmd_offset" and adding split_huge_page_pmd where |
297 | missing after pmd_offset returns the pmd. Thanks to the graceful | 220 | missing after pmd_offset returns the pmd. Thanks to the graceful |
@@ -314,7 +237,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c | |||
314 | return NULL; | 237 | return NULL; |
315 | 238 | ||
316 | pmd = pmd_offset(pud, addr); | 239 | pmd = pmd_offset(pud, addr); |
317 | + split_huge_page_pmd(vma, addr, pmd); | 240 | + split_huge_page_pmd(mm, pmd); |
318 | if (pmd_none_or_clear_bad(pmd)) | 241 | if (pmd_none_or_clear_bad(pmd)) |
319 | return NULL; | 242 | return NULL; |
320 | 243 | ||
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index a68db7692ee..97bae3c576c 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt | |||
@@ -197,8 +197,12 @@ the pages are also "rescued" from the unevictable list in the process of | |||
197 | freeing them. | 197 | freeing them. |
198 | 198 | ||
199 | page_evictable() also checks for mlocked pages by testing an additional page | 199 | page_evictable() also checks for mlocked pages by testing an additional page |
200 | flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is | 200 | flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked, |
201 | faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED. | 201 | and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is |
202 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and | ||
203 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | ||
204 | efficient "culling" of pages in the fault path that are being faulted in to | ||
205 | VM_LOCKED VMAs. | ||
202 | 206 | ||
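The decision just described can be modeled compactly (a Python model of the logic only; the real implementation is kernel C, and these string flag names stand in for the actual page and VMA flag bits):

```python
# Model of the mlock part of page_evictable() as described above: a page
# is not evictable if PG_mlocked is already set, or if the supplied VMA
# is VM_LOCKED, in which case PG_mlocked is set as a side effect -- the
# "culling" of pages being faulted into VM_LOCKED VMAs.
def page_evictable_mlock(page_flags, vma_flags=None):
    if "PG_mlocked" in page_flags:
        return False
    if vma_flags is not None and "VM_LOCKED" in vma_flags:
        page_flags.add("PG_mlocked")  # models is_mlocked_vma() setting the flag
        return False
    return True
```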
203 | 207 | ||
204 | VMSCAN'S HANDLING OF UNEVICTABLE PAGES | 208 | VMSCAN'S HANDLING OF UNEVICTABLE PAGES |
@@ -367,8 +371,8 @@ mlock_fixup() filters several classes of "special" VMAs: | |||
367 | mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to | 371 | mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to |
368 | allocate the huge pages and populate the ptes. | 372 | allocate the huge pages and populate the ptes. |
369 | 373 | ||
370 | 3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages, | 374 | 3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of |
371 | such as the VDSO page, relay channel pages, etc. These pages | 375 | kernel pages, such as the VDSO page, relay channel pages, etc. These pages |
372 | are inherently unevictable and are not managed on the LRU lists. | 376 | are inherently unevictable and are not managed on the LRU lists. |
373 | mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls | 377 | mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls |
374 | make_pages_present() to populate the ptes. | 378 | make_pages_present() to populate the ptes. |
@@ -534,7 +538,7 @@ different reverse map mechanisms. | |||
534 | process because mlocked pages are migratable. However, for reclaim, if | 538 | process because mlocked pages are migratable. However, for reclaim, if |
535 | the page is mapped into a VM_LOCKED VMA, the scan stops. | 539 | the page is mapped into a VM_LOCKED VMA, the scan stops. |
536 | 540 | ||
537 | try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of | 541 | try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of |
538 | the mm_struct to which the VMA belongs. If this is successful, it will | 542 | the mm_struct to which the VMA belongs. If this is successful, it will |
539 | mlock the page via mlock_vma_page() - we wouldn't have gotten to | 543 | mlock the page via mlock_vma_page() - we wouldn't have gotten to |
540 | try_to_unmap_anon() if the page were already mlocked - and will return | 544 | try_to_unmap_anon() if the page were already mlocked - and will return |
@@ -615,11 +619,11 @@ all PTEs from the page. For this purpose, the unevictable/mlock infrastructure | |||
615 | introduced a variant of try_to_unmap() called try_to_munlock(). | 619 | introduced a variant of try_to_unmap() called try_to_munlock(). |
616 | 620 | ||
617 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | 621 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and |
618 | mapped file pages with an additional argument specifying unlock versus unmap | 622 | mapped file pages with an additional argument specifying unlock versus unmap |
619 | processing. Again, these functions walk the respective reverse maps looking | 623 | processing. Again, these functions walk the respective reverse maps looking |
620 | for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file | 624 | for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file |
621 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions | 625 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions |
622 | attempt to acquire the associated mmap semaphore, mlock the page via | 626 | attempt to acquire the associated mmap semaphore, mlock the page via |
623 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the | 627 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the |
624 | pre-clearing of the page's PG_mlocked done by munlock_vma_page. | 628 | pre-clearing of the page's PG_mlocked done by munlock_vma_page. |
625 | 629 | ||
@@ -637,7 +641,7 @@ with it - the usual fallback position. | |||
637 | Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's | 641 | Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's |
638 | reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. | 642 | reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. |
639 | However, the scan can terminate when it encounters a VM_LOCKED VMA and can | 643 | However, the scan can terminate when it encounters a VM_LOCKED VMA and can |
640 | successfully acquire the VMA's mmap semaphore for read and mlock the page. | 644 | successfully acquire the VMA's mmap semaphore for read and mlock the page. |
641 | Although try_to_munlock() might be called a great many times when munlocking a | 645 | Although try_to_munlock() might be called a great many times when munlocking a |
642 | large region or tearing down a large address space that has been mlocked via | 646 | large region or tearing down a large address space that has been mlocked via |
643 | mlockall(), overall this is a fairly rare event. | 647 | mlockall(), overall this is a fairly rare event. |
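The reverse-map walk just described can be sketched as a small model (illustrative only; the return names mirror the SWAP_* codes used in the text, and the key property is early termination once a lockable VM_LOCKED VMA is found):

```python
# Model of try_to_munlock()'s reverse-map walk as described above: visit
# the page's VMAs in turn; on a VM_LOCKED VMA whose mmap semaphore can be
# taken for read, re-mlock the page and terminate (SWAP_MLOCK).  If no
# VM_LOCKED VMA maps the page, the whole map was visited (SWAP_SUCCESS).
def try_to_munlock_model(vmas):
    for vma in vmas:
        if vma["vm_locked"]:
            if vma["mmap_sem_available"]:
                return "SWAP_MLOCK"   # page re-mlocked, scan stops here
            return "SWAP_AGAIN"       # contended semaphore: cannot decide now
    return "SWAP_SUCCESS"             # no VM_LOCKED VMA maps the page
```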
@@ -647,7 +651,7 @@ PAGE RECLAIM IN shrink_*_list() | |||
647 | ------------------------------- | 651 | ------------------------------- |
648 | 652 | ||
649 | shrink_active_list() culls any obviously unevictable pages - i.e. | 653 | shrink_active_list() culls any obviously unevictable pages - i.e. |
650 | !page_evictable(page) - diverting these to the unevictable list. | 654 | !page_evictable(page, NULL) - diverting these to the unevictable list. |
651 | However, shrink_active_list() only sees unevictable pages that made it onto the | 655 | However, shrink_active_list() only sees unevictable pages that made it onto the |
652 | active/inactive lru lists. Note that these pages do not have PageUnevictable | 656 | active/inactive lru lists. Note that these pages do not have PageUnevictable |
653 | set - otherwise they would be on the unevictable list and shrink_active_list | 657 | set - otherwise they would be on the unevictable list and shrink_active_list |