author	Mike Rapoport <rppt@linux.vnet.ibm.com>	2018-05-14 04:13:40 -0400
committer	Jonathan Corbet <corbet@lwn.net>	2018-05-21 11:30:58 -0400
commit	45c9a74f648a76e1118cf8024d11cba54bd64e37 (patch)
tree	60c97edb01661dcb96ae105509edb54ee367bb23
parent	aa00eaa9afb0cc350590668ba6a9ecd99cfd3ad7 (diff)
docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
Now that the administrative information for transparent huge pages is
nicely separated, move it to its own page under the admin guide.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-rw-r--r--	Documentation/admin-guide/kernel-parameters.txt	3
-rw-r--r--	Documentation/admin-guide/mm/index.rst	1
-rw-r--r--	Documentation/admin-guide/mm/transhuge.rst	418
-rw-r--r--	Documentation/vm/transhuge.rst	414
4 files changed, 423 insertions(+), 413 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 42f3e2884e7c..8d24270644a1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4313,7 +4313,8 @@
 			Format: [always|madvise|never]
 			Can be used to control the default behavior of the system
 			with respect to transparent hugepages.
-			See Documentation/vm/transhuge.rst for more details.
+			See Documentation/admin-guide/mm/transhuge.rst
+			for more details.
 
 	tsc=		Disable clocksource stability checks for TSC.
 			Format: <string>
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index a69aa69af255..8454be638108 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -27,4 +27,5 @@ the Linux memory management.
    numa_memory_policy
    pagemap
    soft-dirty
+   transhuge
    userfaultfd
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
new file mode 100644
index 000000000000..7ab93a8404b9
--- /dev/null
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -0,0 +1,418 @@
.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent Hugepage Support (THP) is an alternative means
of using huge pages to back virtual memory, with automatic promotion
and demotion of page sizes, and without the shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and
tmpfs/shmem, but in the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest, because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect. The
first factor consists of taking a single page fault for each 2M
virtual region touched by userland (reducing the enter/exit kernel
frequency by a factor of 512). This only matters the first time the
memory is accessed for the lifetime of a memory mapping. The second,
long lasting and much more important factor affects all subsequent
accesses to the memory for the whole runtime of the application. The
second factor consists of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory, in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB can map larger regions
   only if both KVM and the Linux guest are using hugepages, but a
   significant speedup already happens if only one of the two is using
   hugepages, just because TLB misses run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, the ``khugepaged`` daemon scans memory and collapses
sequences of basic pages into huge pages.

THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>` interface
and the madvise(2) and prctl(2) system calls.

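For example, a process that should never use THP regardless of the
system wide setting can opt out with prctl(2). A minimal sketch, not
taken from this document (it assumes PR_SET_THP_DISABLE, available
since Linux 3.15)::

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
            /* Disable THP for this process and its future children. */
            if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                    perror("prctl(PR_SET_THP_DISABLE)");

            /* PR_GET_THP_DISABLE returns 1 once THP is disabled. */
            printf("THP disabled: %d\n",
                   (int)prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
            return 0;
    }
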
Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or for other movable (or even
unmovable) entities. It doesn't require reservation to prevent
hugepage allocation failures from being noticeable to userland. It
allows paging and all other advanced VM features to be available on
the hugepages. It requires no modifications for applications to take
advantage of it.

Applications can however be further optimized to take advantage of
this feature, as they have been optimized before to avoid a flood of
mmap system calls for every malloc(4k). Optimizing userland is by no
means mandatory and khugepaged can already take care of long lived
page allocations even for hugepage unaware applications that deal with
large amounts of memory.

In certain cases, when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no good reason. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions,
to eliminate any risk of wasting precious bytes of memory while still
running faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

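A minimal sketch of such a critical region (the 256M size is an
arbitrary example value)::

    #include <stdio.h>
    #include <sys/mman.h>

    #define REGION (256UL << 20)    /* example: 256M working set */

    int main(void)
    {
            void *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Mark the region as a good candidate for huge pages. */
            if (madvise(p, REGION, MADV_HUGEPAGE))
                    perror("madvise(MADV_HUGEPAGE)");
            return 0;
    }
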
.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely
disabled (mostly for debugging purposes) or only enabled inside
MADV_HUGEPAGE regions (to avoid the risk of consuming more memory
resources) or enabled system wide. This can be achieved with one of::

    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free, either to
madvise regions only, or to never try to defrag memory and simply fall
back to regular pages unless hugepages are immediately available.
Clearly if we spend CPU time to defrag memory, we would expect to gain
even more by the fact we use hugepages later instead of regular pages.
This isn't always guaranteed, but it may be more likely if the
allocation is for a MADV_HUGEPAGE region.

::

    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
    means that an application requesting THP will stall on
    allocation failure and directly reclaim pages and compact
    memory in an effort to allocate a THP immediately. This may be
    desirable for virtual machines that benefit heavily from THP
    use and are willing to delay the VM start to utilise them.

defer
    means that an application will wake kswapd in the background
    to reclaim pages and wake kcompactd to compact memory so that
    THP is available in the near future. It's the responsibility
    of khugepaged to then install the THP pages later.

defer+madvise
    will enter direct reclaim and compaction like ``always``, but
    only for regions that have used madvise(MADV_HUGEPAGE); all
    other regions will wake kswapd in the background to reclaim
    pages and wake kcompactd to compact memory so that THP is
    available in the near future.

madvise
    will enter direct reclaim like ``always`` but only for regions
    that have used madvise(MADV_HUGEPAGE). This is the default
    behaviour.

never
    should be self-explanatory.

By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory
allocation library) may want to know the size (in bytes) of a
transparent hugepage::

    cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and
it'll be automatically shut down if it's set to "never".

Khugepaged controls
-------------------

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt::

    /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

for each pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value leads to programs using additional memory, while a
lower value reduces the THP performance gain. For example, with 4k
base pages and 2M huge pages there are 512 PTEs per huge page, so a
value of 511 allows collapsing even when only a single small page in
the range is mapped. The effect of max_ptes_none on CPU time is
negligible and can be ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste memory. A lower
value can prevent THPs from being collapsed, resulting in fewer pages
being collapsed into THPs, and lower memory access performance.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
on the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate a huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

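For example, mounting a tmpfs instance with ``huge=within_size`` can
be done from a shell with mount(8) or, equivalently, with the mount(2)
system call. A minimal sketch of the latter (the /mnt/huge mount point
is a placeholder and must already exist)::

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* Same effect as:
             *   mount -t tmpfs -o huge=within_size tmpfs /mnt/huge
             */
            if (mount("tmpfs", "/mnt/huge", "tmpfs", 0,
                      "huge=within_size")) {
                    perror("mount");
                    return 1;
            }
            return 0;
    }
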
There's also a sysfs knob to control hugepage allocation policy for
the internal shmem mount:
``/sys/kernel/mm/transparent_hugepage/shmem_enabled``. This mount is
used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem.

In addition to the policies listed above, shmem_enabled allows two
further values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;
force
    Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only
affect future behavior. So to make them effective you need to restart
any application that could have been using hugepages. This also
applies to the regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in
``/proc/meminfo``. To identify what applications are using anonymous
transparent huge pages, it is necessary to read ``/proc/PID/smaps``
and count the AnonHugePages fields for each mapping.

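A minimal sketch of such counting; the same approach works for the
FileHugeMapped fields described below::

    #include <stdio.h>

    /* Sum the AnonHugePages fields of /proc/PID/smaps. */
    int main(int argc, char **argv)
    {
            char path[64], line[256];
            unsigned long kb, total = 0;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%s/smaps",
                     argc > 1 ? argv[1] : "self");
            f = fopen(path, "r");
            if (!f) {
                    perror(path);
                    return 1;
            }
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
                            total += kb;
            fclose(f);
            printf("AnonHugePages total: %lu kB\n", total);
            return 0;
    }
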
The number of file transparent huge pages mapped to userspace is
available by reading the ShmemPmdMapped and ShmemHugePages fields in
``/proc/meminfo``. To identify what applications are mapping file
transparent huge pages, it is necessary to read ``/proc/PID/smaps``
and count the FileHugeMapped fields for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
    is incremented every time a huge page is successfully
    allocated to handle a page fault. This applies both to the
    first time a page is faulted and to COW faults.

thp_collapse_alloc
    is incremented by khugepaged when it has found
    a range of pages to collapse into one huge page and has
    successfully allocated a new huge page to store the data.

thp_fault_fallback
    is incremented if a page fault fails to allocate
    a huge page and instead falls back to using small pages.

thp_collapse_alloc_failed
    is incremented if khugepaged found a range
    of pages that should be collapsed into one huge page but failed
    the allocation.

thp_file_alloc
    is incremented every time a file huge page is successfully
    allocated.

thp_file_mapped
    is incremented every time a file huge page is mapped into
    user address space.

thp_split_page
    is incremented every time a huge page is split into base
    pages. This can happen for a variety of reasons but a common
    reason is that a huge page is old and is being reclaimed.
    This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
    is incremented if the kernel fails to split a huge
    page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
    is incremented when a huge page is put onto the split
    queue. This happens when a huge page is partially unmapped and
    splitting it would free up some memory. Pages on the split queue
    will be split under memory pressure.

thp_split_pmd
    is incremented every time a PMD is split into a table of PTEs.
    This can happen, for instance, when an application calls
    mprotect() or munmap() on part of a huge page. It doesn't split
    the huge page, only the page table entry.

thp_zero_page_alloc
    is incremented every time a huge zero page is
    successfully allocated. It includes allocations which were
    dropped due to a race with another allocation. Note that it
    doesn't count every map of the huge zero page, only its
    allocation.

thp_zero_page_alloc_failed
    is incremented if the kernel fails to allocate a
    huge zero page and falls back to using small pages.

thp_swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

thp_swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
    is incremented every time a process stalls to run
    memory compaction so that a huge page is free for use.

compact_success
    is incremented if the system compacted memory and
    freed a huge page for use.

compact_fail
    is incremented if the system tries to compact memory
    but fails.

compact_pages_moved
    is incremented each time a page is moved. If
    this value is increasing rapidly, it implies that the system
    is copying a lot of data to satisfy the huge page allocation.
    It is possible that the cost of copying exceeds any savings
    from reduced TLB misses.

compact_pagemigrate_failed
    is incremented when the underlying mechanism
    for moving a page failed.

compact_blocks_moved
    is incremented each time memory compaction examines
    a huge page aligned range of pages.

It is possible to establish how long the stalls were using the
function tracer to record how long was spent in
__alloc_pages_nodemask, and using the mm_page_alloc tracepoint to
identify which allocations were for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

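A minimal sketch combining this with the hpage_pmd_size knob shown
earlier (falling back to an assumed 2M huge page size when the sysfs
file is not available)::

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t hpage = 2UL << 20;    /* assumed fallback: 2M */
            void *p;
            FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/"
                            "hpage_pmd_size", "r");

            if (f) {
                    if (fscanf(f, "%zu", &hpage) != 1)
                            hpage = 2UL << 20;
                    fclose(f);
            }

            /* A hugepage aligned region that the kernel can map with
             * a huge page immediately at the first fault. */
            if (posix_memalign(&p, hpage, hpage)) {
                    fprintf(stderr, "posix_memalign failed\n");
                    return 1;
            }
            madvise(p, hpage, MADV_HUGEPAGE);
            free(p);
            return 0;
    }
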
Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.

diff --git a/Documentation/vm/transhuge.rst b/Documentation/vm/transhuge.rst
index 47c7e4742bc2..a8cf6809e36e 100644
--- a/Documentation/vm/transhuge.rst
+++ b/Documentation/vm/transhuge.rst
@@ -4,418 +4,8 @@
 Transparent Hugepage Support
 ============================
 
-Objective
-=========
-
-[... 409 further removed lines: a verbatim duplicate of the userspace
-content added above to Documentation/admin-guide/mm/transhuge.rst ...]
+This document describes the design principles of Transparent Hugepage
+(THP) Support and its interaction with other parts of memory
+management.
 
 Design principles
 =================