author    Mike Rapoport <rppt@linux.vnet.ibm.com>  2018-04-18 04:07:49 -0400
committer Jonathan Corbet <corbet@lwn.net>         2018-04-27 19:02:48 -0400
commit    1ad1335dc58646764eda7bb054b350934a1b23ec
tree      8c145819f0d380744d432512ea47d89c8b91a22c
parent    3a3f7e26e5544032a687fb05b5221883b97a59ae
docs/admin-guide/mm: start moving here files from Documentation/vm
Several documents in Documentation/vm fit quite well into the "admin/user
guide" category. The documents that don't overload the reader with lots of
implementation details and that provide a coherent description of a certain
feature can be moved to Documentation/admin-guide/mm.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/00-INDEX                |  10
-rw-r--r--  Documentation/vm/hugetlbpage.rst         | 381
-rw-r--r--  Documentation/vm/hwpoison.rst            |   2
-rw-r--r--  Documentation/vm/idle_page_tracking.rst  | 115
-rw-r--r--  Documentation/vm/index.rst               |   5
-rw-r--r--  Documentation/vm/pagemap.rst             | 197
-rw-r--r--  Documentation/vm/soft-dirty.rst          |  47
-rw-r--r--  Documentation/vm/userfaultfd.rst         | 241
8 files changed, 1 insertion(+), 997 deletions(-)
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index cda564d55b3c..f8a96ca16b7a 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -12,14 +12,10 @@ highmem.rst
   - Outline of highmem and common issues.
 hmm.rst
   - Documentation of heterogeneous memory management
-hugetlbpage.rst
-  - a brief summary of hugetlbpage support in the Linux kernel.
 hugetlbfs_reserv.rst
   - A brief overview of hugetlbfs reservation design/implementation.
 hwpoison.rst
   - explains what hwpoison is
-idle_page_tracking.rst
-  - description of the idle page tracking feature.
 ksm.rst
   - how to use the Kernel Samepage Merging feature.
 mmu_notifier.rst
@@ -34,16 +30,12 @@ page_frags.rst
   - description of page fragments allocator
 page_migration.rst
   - description of page migration in NUMA systems.
-pagemap.rst
-  - pagemap, from the userspace perspective
 page_owner.rst
   - tracking about who allocated each page
 remap_file_pages.rst
   - a note about remap_file_pages() system call
 slub.rst
   - a short users guide for SLUB.
-soft-dirty.rst
-  - short explanation for soft-dirty PTEs
 split_page_table_lock.rst
   - Separate per-table lock to improve scalability of the old page_table_lock.
 swap_numa.rst
@@ -52,8 +44,6 @@ transhuge.rst
   - Transparent Hugepage Support, alternative way of using hugepages.
 unevictable-lru.rst
   - Unevictable LRU infrastructure
-userfaultfd.rst
-  - description of userfaultfd system call
 z3fold.txt
   - outline of z3fold allocator for storing compressed pages
 zsmalloc.rst
diff --git a/Documentation/vm/hugetlbpage.rst b/Documentation/vm/hugetlbpage.rst
deleted file mode 100644
index 2b374d10284d..000000000000
--- a/Documentation/vm/hugetlbpage.rst
+++ /dev/null
@@ -1,381 +0,0 @@
.. _hugetlbpage:

=============
HugeTLB Pages
=============

Overview
========

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel. This support is built on top of the multiple page size
support provided by most modern architectures. For example, x86 CPUs normally
support 4K and 2M (1G if architecturally supported) page sizes, the ia64
architecture supports multiple page sizes (4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M), and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations, and is typically a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB
entries. This optimization is more critical now that bigger and bigger
physical memories (several GBs) are more readily available.

Users can use the huge page support in the Linux kernel either via the mmap
system call or via the standard SYSV shared memory system calls (shmget,
shmat).

First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.

The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool. It also displays
the default huge page size and information about the number of free, reserved
and surplus huge pages in the pool of huge pages of the default size.
The huge page size is needed for generating the proper alignment and
size of the arguments to system calls that map huge page regions.

The output of ``cat /proc/meminfo`` will include lines like::

   HugePages_Total: uuu
   HugePages_Free:  vvv
   HugePages_Rsvd:  www
   HugePages_Surp:  xxx
   Hugepagesize:    yyy kB
   Hugetlb:         zzz kB

where:

HugePages_Total
   is the size of the pool of huge pages.
HugePages_Free
   is the number of huge pages in the pool that are not yet
   allocated.
HugePages_Rsvd
   is short for "reserved," and is the number of huge pages for
   which a commitment to allocate from the pool has been made,
   but no allocation has yet been made. Reserved huge pages
   guarantee that an application will be able to allocate a
   huge page from the pool of huge pages at fault time.
HugePages_Surp
   is short for "surplus," and is the number of huge pages in
   the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
   maximum number of surplus huge pages is controlled by
   ``/proc/sys/vm/nr_overcommit_hugepages``.
Hugepagesize
   is the default huge page size (in kB).
Hugetlb
   is the total amount of memory (in kB) consumed by huge
   pages of all sizes.
   If huge pages of different sizes are in use, this number
   will exceed HugePages_Total \* Hugepagesize. To get more
   detailed information, please refer to
   ``/sys/kernel/mm/hugepages`` (described below).
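
The fields above are plain ``Name: value`` pairs, so they are easy to pull
apart with standard tools. The sketch below parses a sample of that
``/proc/meminfo`` output (the numbers are made-up example values; on a real
system, replace the here-string with the actual file contents):

```shell
# Parse huge page counters out of (sample) /proc/meminfo output.
# On a real system, use: meminfo=$(cat /proc/meminfo)
meminfo='HugePages_Total:      20
HugePages_Free:       10
HugePages_Rsvd:        4
HugePages_Surp:        0
Hugepagesize:       2048 kB'

total=$(printf '%s\n' "$meminfo" | awk '/^HugePages_Total:/ {print $2}')
free=$(printf '%s\n' "$meminfo" | awk '/^HugePages_Free:/ {print $2}')
size_kb=$(printf '%s\n' "$meminfo" | awk '/^Hugepagesize:/ {print $2}')

# Memory currently sitting in the default-size pool, in kB:
pool_kb=$(( total * size_kb ))
echo "$pool_kb"
```
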

``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
configured in the kernel.

``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.

Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under
memory pressure.

Once a number of huge pages have been pre-allocated to the kernel huge page
pool, a user with appropriate privilege can use either the mmap system call
or shared memory system calls to use the huge pages. See the discussion of
:ref:`Using Huge Pages <using_huge_pages>`, below.

The administrator can allocate persistent huge pages on the kernel boot
command line by specifying the "hugepages=N" parameter, where 'N' = the
number of huge pages requested. This is the most reliable method of
allocating huge pages as memory has not yet become fragmented.

Some platforms support multiple huge page sizes. To allocate huge pages
of a specific size, one must precede the huge pages boot command parameters
with a huge page size selection parameter "hugepagesz=<size>". <size> must
be specified in bytes with an optional scale suffix [kKmMgG]. The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.

When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
default sized persistent huge pages::

   echo 20 > /proc/sys/vm/nr_hugepages

This command will try to adjust the number of default sized huge pages in the
huge page pool to 20, allocating or freeing huge pages, as required.
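
For a rough sense of scale, the memory pinned by such a request is just the
page count times the huge page size. A small sketch, assuming the common x86
default huge page size of 2048 kB (the real value on a given system is
reported by ``Hugepagesize`` in ``/proc/meminfo``):

```shell
pages=20          # value written to nr_hugepages
size_kb=2048      # assumed default huge page size (x86: 2 MB)

# Total memory removed from general use if the allocation succeeds:
pool_kb=$(( pages * size_kb ))
pool_mb=$(( pool_kb / 1024 ))
echo "${pool_kb} kB (${pool_mb} MB)"
```
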

On a NUMA platform, the kernel will attempt to distribute the huge page pool
over the set of allowed nodes specified by the NUMA memory policy of the
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
:ref:`discussion below <mem_policy_and_hp_alloc>`
of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in the system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.

System administrators may want to put this command in one of the local rc
init files. This will enable the kernel to allocate huge pages early in
the boot process when the possibility of getting physically contiguous pages
is still very high. Administrators can verify the number of huge pages
actually allocated by checking the sysctl or meminfo. To check the per node
distribution of huge pages in a NUMA system, use::

   cat /sys/devices/system/node/node*/meminfo | fgrep Huge

``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
requested by applications. Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.

When increasing the huge page pool size via ``nr_hugepages``, any existing
surplus pages will first be promoted to persistent huge pages. Then, additional
huge pages will be allocated, if necessary and if possible, to fulfill
the new persistent huge page pool size.

The administrator may shrink the pool of persistent huge pages for
the default huge page size by setting the ``nr_hugepages`` sysctl to a
smaller value. The kernel will attempt to balance the freeing of huge pages
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
Any free huge pages on the selected nodes will be freed back to the kernel's
normal page pool.

Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
it becomes less than the number of huge pages in use will convert the balance
of the in-use huge pages to surplus huge pages. This will occur even if
the number of surplus pages would exceed the overcommit value. As long as
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
increased sufficiently, or the surplus huge pages go out of use and are freed--
no more surplus huge pages will be allowed to be allocated.

With support for multiple huge page pools at run-time available, much of
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
sysfs.
The ``/proc`` interfaces discussed above have been retained for backwards
compatibility. The root huge page control directory in sysfs is::

   /sys/kernel/mm/hugepages

For each huge page size supported by the running kernel, a subdirectory
will exist, of the form::

   hugepages-${size}kB

Inside each of these directories, the same set of files will exist::

   nr_hugepages
   nr_hugepages_mempolicy
   nr_overcommit_hugepages
   free_hugepages
   resv_hugepages
   surplus_hugepages

which function as described above for the default huge page-sized case.
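
The ``hugepages-${size}kB`` naming is mechanical: the size component is the
huge page size expressed in kB. A quick sketch of building those directory
names (the two sizes used here are common x86 examples, chosen purely for
illustration):

```shell
# Build the sysfs subdirectory name for a given huge page size in kB,
# mirroring the hugepages-${size}kB convention described above.
hugepages_dir() {
    size_kb=$1
    echo "hugepages-${size_kb}kB"
}

hugepages_dir 2048      # 2 MB pages
hugepages_dir 1048576   # 1 GB pages
```
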

.. _mem_policy_and_hp_alloc:

Interaction of Task Memory Policy with Huge Page Allocation/Freeing
===================================================================

Whether huge pages are allocated and freed via the ``/proc`` interface or
the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
NUMA nodes from which huge pages are allocated or freed are controlled by the
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
is ignored.

The recommended method to allocate or free huge pages to/from the kernel
huge page pool, using the ``nr_hugepages`` example above, is::

   numactl --interleave <node-list> echo 20 \
      >/proc/sys/vm/nr_hugepages_mempolicy

or, more succinctly::

   numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

This will allocate or free ``abs(20 - nr_hugepages)`` huge pages to or from
the nodes specified in <node-list>, depending on whether the number of
persistent huge pages is initially less than or greater than 20, respectively.
No huge pages will be allocated nor freed on any node not included in the
specified <node-list>.

When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
memory policy mode--bind, preferred, local or interleave--may be used. The
resulting effect on persistent huge page allocation is as follows:

#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst],
   persistent huge pages will be distributed across the node or nodes
   specified in the mempolicy as if "interleave" had been specified.
   However, if a node in the policy does not contain sufficient contiguous
   memory for a huge page, the allocation will not "fallback" to the nearest
   neighbor node with sufficient contiguous memory. To do this would cause
   undesirable imbalance in the distribution of the huge page pool, or
   possibly, allocation of persistent huge pages on nodes not allowed by
   the task's memory policy.

#. One or more nodes may be specified with the bind or interleave policy.
   If more than one node is specified with the preferred policy, only the
   lowest numeric id will be used. Local policy will select the node where
   the task is running at the time the nodes_allowed mask is constructed.
   For local policy to be deterministic, the task must be bound to a cpu or
   cpus in a single node. Otherwise, the task could be migrated to some
   other node at any time after launch and the resulting node will be
   indeterminate. Thus, local policy is not very useful for this purpose.
   Any of the other mempolicy modes may be used to specify a single node.

#. The nodes allowed mask will be derived from any non-default task mempolicy,
   whether this policy was set explicitly by the task itself or one of its
   ancestors, such as numactl. This means that if the task is invoked from a
   shell with non-default policy, that policy will be used. One can specify a
   node list of "all" with numactl --interleave or --membind [-m] to achieve
   interleaving over all nodes in the system or cpuset.

#. Any task mempolicy specified--e.g., using numactl--will be constrained by
   the resource limits of any cpuset in which the task runs. Thus, there will
   be no way for a task with non-default policy running in a cpuset with a
   subset of the system nodes to allocate huge pages outside the cpuset
   without first moving to a cpuset that contains all of the desired nodes.

#. Boot-time huge page allocation attempts to distribute the requested number
   of huge pages over all on-line nodes with memory.

Per Node Hugepages Attributes
=============================

A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under the system device of each
NUMA node with memory in::

   /sys/devices/system/node/node[0-9]*/hugepages/

Under this directory, the subdirectory for each supported huge page size
contains the following attribute files::

   nr_hugepages
   free_hugepages
   surplus_hugepages

The ``free_`` and ``surplus_`` attribute files are read-only. They return the
number of free and surplus [overcommitted] huge pages, respectively, on the
parent node.

The ``nr_hugepages`` attribute returns the total number of huge pages on the
specified node. When this attribute is written, the number of persistent huge
pages on the parent node will be adjusted to the specified value, if sufficient
resources exist, regardless of the task's mempolicy or cpuset constraints.

Note that the number of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.

.. _using_huge_pages:

Using Huge Pages
================

If user applications are going to request huge pages using the mmap system
call, then it is required that the system administrator mount a file system of
type hugetlbfs::

   mount -t hugetlbfs \
      -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
      min_size=<value>,nr_inodes=<value> none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.

The ``uid`` and ``gid`` options set the owner and group of the root of the
file system. By default the ``uid`` and ``gid`` of the current process
are taken.

The ``mode`` option sets the mode of the root of the file system to value &
01777. This value is given in octal. By default the value 0755 is picked.

If the platform supports multiple huge page sizes, the ``pagesize`` option can
be used to specify the huge page size and associated pool. ``pagesize``
is specified in bytes. If ``pagesize`` is not specified the platform's
default huge page size and associated pool will be used.

The ``size`` option sets the maximum value of memory (huge pages) allowed
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
The size is rounded down to the HPAGE_SIZE boundary.

The ``min_size`` option sets the minimum value of memory (huge pages) allowed
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
either bytes or a percentage of the huge page pool.
At mount time, the number of huge pages specified by ``min_size`` are reserved
for use by the filesystem.
If there are not enough free huge pages available, the mount will fail.
As huge pages are allocated to the filesystem and freed, the reserve count
is adjusted so that the sum of allocated and reserved huge pages is always
at least ``min_size``.

The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
can use.

If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
the command line then no limits are set.

For the ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.
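
As a sketch of what that suffix handling amounts to (this mirrors the
description above, not the kernel's actual option parser), a K/M/G suffix is
just a power-of-1024 multiplier:

```shell
# Expand an option value with an optional [KkMmGg] suffix into bytes,
# mirroring the size=2K -> 2048 example above.
to_bytes() {
    val=$1
    case $val in
        *[Kk]) echo $(( ${val%?} * 1024 )) ;;
        *[Mm]) echo $(( ${val%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${val%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$val" ;;
    esac
}

to_bytes 2K    # 2048
to_bytes 16M   # 16777216
```
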

While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with the right permissions) can be
used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
:ref:`map_hugetlb <map_hugetlb>` below.

Users who wish to use hugetlb memory via shared memory segments should be
members of a supplementary group, and the system admin needs to configure that
gid into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for the same or
different applications to use any combination of mmaps and shm* calls, though
the mount of the filesystem will be required for using mmap calls without
MAP_HUGETLB.

Syscalls that operate on memory backed by hugetlb pages only have their lengths
aligned to the native page size of the processor; they will normally fail with
errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
not hugepage aligned. For example, munmap(2) will fail if memory is backed by
a hugetlb page and the length is smaller than the hugepage size.


Examples
========

.. _map_hugetlb:

``map_hugetlb``
   see tools/testing/selftests/vm/map_hugetlb.c

``hugepage-shm``
   see tools/testing/selftests/vm/hugepage-shm.c

``hugepage-mmap``
   see tools/testing/selftests/vm/hugepage-mmap.c

The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.

.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst
index 070aa1e716b7..09bd24a92784 100644
--- a/Documentation/vm/hwpoison.rst
+++ b/Documentation/vm/hwpoison.rst
@@ -155,7 +155,7 @@ Testing
    value). This allows stress testing of many kinds of
    pages. The page_flags are the same as in /proc/kpageflags. The
    flag bits are defined in include/linux/kernel-page-flags.h and
-   documented in Documentation/vm/pagemap.rst
+   documented in Documentation/admin-guide/mm/pagemap.rst

 * Architecture specific MCE injector

diff --git a/Documentation/vm/idle_page_tracking.rst b/Documentation/vm/idle_page_tracking.rst
deleted file mode 100644
index d1c4609a5220..000000000000
--- a/Documentation/vm/idle_page_tracking.rst
+++ /dev/null
@@ -1,115 +0,0 @@
.. _idle_page_tracking:

==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature allows tracking which memory pages are being
accessed by a workload and which are idle. This information can be useful for
estimating the workload's working set size, which, in turn, can be taken into
account when configuring the workload parameters, setting memory cgroup limits,
or deciding where to place the workload within a compute cluster.

It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64; byte order is native. When a bit is
set, the corresponding page is idle.

A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value.

Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, and swap cache pages. For
other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO.

That said, in order to estimate the amount of pages that are not used by a
workload one should:

 1. Mark all the workload's pages as idle by setting corresponding bits in
    ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
    is placed in a memory cgroup.

 2. Wait until the workload accesses its working set.

 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
    If one wants to ignore certain types of pages, e.g. mlocked pages since
    they are not reclaimable, they can be filtered out using
    ``/proc/kpageflags``.

See Documentation/vm/pagemap.rst for more information about
``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
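
Because the bitmap is an array of 8-byte words, the file offset and bit
position for a given PFN follow directly from the #i/64 and #i%64 mapping
described above. A small sketch of that index arithmetic (pure calculation,
using a made-up PFN; it does not touch the real bitmap file):

```shell
# For a given PFN, compute where its bit lives in
# /sys/kernel/mm/page_idle/bitmap: which 64-bit word, which bit within
# that word, and the byte offset to seek to in the file.
pfn=200

word_index=$(( pfn / 64 ))         # array element #i/64
bit_index=$(( pfn % 64 ))          # bit #i%64 within that element
byte_offset=$(( word_index * 8 ))  # each element is 8 bytes

echo "word=$word_index bit=$bit_index offset=$byte_offset"
```
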

.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
considered referenced if it has been recently accessed via a process address
space, in which case one or more PTEs it is mapped to will have the Accessed
bit set, or if it was marked accessed explicitly by the kernel (see
mark_page_accessed()). The latter happens when:

 - a userspace process reads or writes a page using a system call (e.g. read(2)
   or write(2))

 - a page that is used for storing filesystem buffers is read or written,
   because a process needs filesystem metadata stored in it (e.g. lists a
   directory tree)

 - a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined
above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.

Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 6c451421a01e..ed58cb9f9675 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -13,15 +13,10 @@ various features of the Linux memory management
 .. toctree::
    :maxdepth: 1

-   hugetlbpage
-   idle_page_tracking
    ksm
    numa_memory_policy
-   pagemap
    transhuge
-   soft-dirty
    swap_numa
-   userfaultfd
    zswap

 Kernel developers MM documentation
diff --git a/Documentation/vm/pagemap.rst b/Documentation/vm/pagemap.rst
deleted file mode 100644
index 7ba8cbd57ad3..000000000000
--- a/Documentation/vm/pagemap.rst
+++ /dev/null
@@ -1,197 +0,0 @@
.. _pagemap:

=============================
Examining Process Page Tables
=============================

pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in ``/proc``.

There are four components to pagemap:

 * ``/proc/pid/pagemap``. This file lets a userspace process find out which
   physical frame each virtual page is mapped to. It contains one 64-bit
   value for each virtual page, containing the following data (from
   ``fs/proc/task_mmu.c``, above pagemap_read):

    * Bits 0-54  page frame number (PFN) if present
    * Bits 0-4   swap type if swapped
    * Bits 5-54  swap offset if swapped
    * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.rst)
    * Bit  56    page exclusively mapped (since 4.2)
    * Bits 57-60 zero
    * Bit  61    page is file-page or shared-anon (since 3.5)
    * Bit  62    page swapped
    * Bit  63    page present

   Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
   In 4.0 and 4.1 opens by unprivileged users fail with -EPERM. Starting from
   4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
   Reason: information about PFNs helps in exploiting the Rowhammer
   vulnerability.

   If the page is not present but in swap, then the PFN contains an
   encoding of the swap file number and the page's offset into the
   swap. Unmapped pages return a null PFN. This allows determining
   precisely which pages are mapped (or in swap) and comparing mapped
   pages between processes.

   Efficient users of this interface will use ``/proc/pid/maps`` to
   determine which areas of memory are actually mapped and llseek to
   skip over unmapped regions.

 * ``/proc/kpagecount``. This file contains a 64-bit count of the number of
   times each page is mapped, indexed by PFN.

 * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
   page, indexed by PFN.

   The flags are (from ``fs/proc/page.c``, above kpageflags_read):

   0.  LOCKED
   1.  ERROR
   2.  REFERENCED
   3.  UPTODATE
   4.  DIRTY
   5.  LRU
   6.  ACTIVE
   7.  SLAB
   8.  WRITEBACK
   9.  RECLAIM
   10. BUDDY
   11. MMAP
   12. ANON
   13. SWAPCACHE
   14. SWAPBACKED
   15. COMPOUND_HEAD
   16. COMPOUND_TAIL
   17. HUGE
   18. UNEVICTABLE
   19. HWPOISON
   20. NOPAGE
   21. KSM
   22. THP
   23. BALLOON
   24. ZERO_PAGE
   25. IDLE

 * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
   memory cgroup each page is charged to, indexed by PFN. Only available when
   CONFIG_MEMCG is set.
81
Short descriptions of the page flags
====================================
84
850 - LOCKED
86 page is being locked for exclusive access, e.g. by undergoing read/write IO
7 - SLAB
   page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator.
   When a compound page is used, SLUB/SLQB will only set this flag on the
   head page; SLOB will not flag it at all.
9110 - BUDDY
92 a free memory block managed by the buddy system allocator
93 The buddy system organizes free memory in blocks of various orders.
94 An order N block has 2^N physically contiguous pages, with the BUDDY flag
95 set for and _only_ for the first page.
9615 - COMPOUND_HEAD
   A compound page with order N consists of 2^N physically contiguous pages.
   A compound page with order 2 takes the form of "HTTT", where H denotes its
   head page and T denotes its tail page(s). The major consumers of compound
   pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc.
   memory allocators and various device drivers. However, in this interface,
   only huge/giga pages are made visible to end users.
10316 - COMPOUND_TAIL
104 A compound page tail (see description above).
10517 - HUGE
106 this is an integral part of a HugeTLB page
10719 - HWPOISON
108 hardware detected memory corruption on this page: don't touch the data!
10920 - NOPAGE
110 no page frame exists at the requested address
11121 - KSM
112 identical memory pages dynamically shared between one or more processes
11322 - THP
114 contiguous pages which construct transparent hugepages
11523 - BALLOON
116 balloon compaction page
11724 - ZERO_PAGE
118 zero page for pfn_zero or huge_zero page
11925 - IDLE
120 page has not been accessed since it was marked idle (see
121 Documentation/vm/idle_page_tracking.rst). Note that this flag may be
122 stale in case the page was accessed via a PTE. To make sure the flag
123 is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first.
124
125IO related page flags
126---------------------
127
1281 - ERROR
129 IO error occurred
3 - UPTODATE
   page has up-to-date data
   i.e. for file backed page: (in-memory data revision >= on-disk one)
1334 - DIRTY
134 page has been written to, hence contains new data
135 i.e. for file backed page: (in-memory data revision > on-disk one)
1368 - WRITEBACK
137 page is being synced to disk
138
139LRU related page flags
140----------------------
141
1425 - LRU
143 page is in one of the LRU lists
1446 - ACTIVE
145 page is in the active LRU list
18 - UNEVICTABLE
   page is in the unevictable (non-)LRU list. It is somehow pinned and
   not a candidate for LRU page reclaim, e.g. ramfs pages,
   shmctl(SHM_LOCK) and mlock() memory segments.
2 - REFERENCED
   page has been referenced since last LRU list enqueue/requeue
9 - RECLAIM
   page will be reclaimed soon after its pageout IO completes
15411 - MMAP
155 a memory mapped page
15612 - ANON
157 a memory mapped page that is not part of a file
15813 - SWAPCACHE
159 page is mapped to swap space, i.e. has an associated swap entry
16014 - SWAPBACKED
161 page is backed by swap/RAM
162
163The page-types tool in the tools/vm directory can be used to query the
164above flags.
165
166Using pagemap to do something useful
167====================================
168
169The general procedure for using pagemap to find out about a process' memory
170usage goes like this:
171
172 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
173 mapped to what.
174 2. Select the maps you are interested in -- all of them, or a particular
175 library, or the stack or the heap, etc.
176 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
177 4. Read a u64 for each page from pagemap.
178 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
179 just read, seek to that entry in the file, and read the data you want.
180
181For example, to find the "unique set size" (USS), which is the amount of
182memory that a process is using that is not shared with any other process,
183you can go through every map in the process, find the PFNs, look those up
184in kpagecount, and tally up the number of pages that are only referenced
185once.
186
187Other notes
188===========
189
190Reading from any of the files will return -EINVAL if you are not starting
191the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
192into the file), or if the size of the read is not a multiple of 8 bytes.
193
Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 on most architectures). Since Linux 3.11 their meaning changes
after the first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.
diff --git a/Documentation/vm/soft-dirty.rst b/Documentation/vm/soft-dirty.rst
deleted file mode 100644
index cb0cfd6672fa..000000000000
--- a/Documentation/vm/soft-dirty.rst
+++ /dev/null
@@ -1,47 +0,0 @@
1.. _soft_dirty:
2
3===============
4Soft-Dirty PTEs
5===============
6
Soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should:
9
10 1. Clear soft-dirty bits from the task's PTEs.
11
12 This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
13 task in question.
14
15 2. Wait some time.
16
17 3. Read soft-dirty bits from the PTEs.
18
    This is done by reading from ``/proc/PID/pagemap``. Bit 55 of each
    64-bit qword is the soft-dirty one. If set, the respective PTE was
    written to since step 1.
22
23
24Internally, to do this tracking, the writable bit is cleared from PTEs
25when the soft-dirty bit is cleared. So, after this, when the task tries to
26modify a page at some virtual address the #PF occurs and the kernel sets
27the soft-dirty bit on the respective PTE.
28
Note that although the task's entire address space is marked read-only after
the soft-dirty bits are cleared, the #PFs that occur after that are processed
quickly. This is because the pages are still mapped to physical memory, so all
the kernel does is detect this fact and set both the writable and soft-dirty
bits on the PTE.
34
While in most cases tracking memory changes by #PFs is more than enough,
there is still a scenario in which we can lose soft-dirty bits: a task
unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values
including soft-dirty bits. To notify the user space application about such
memory region renewal, the kernel always marks new memory regions (and
expanded regions) as soft-dirty.
42
43This feature is actively used by the checkpoint-restore project. You
44can find more details about it on http://criu.org
45
46
47-- Pavel Emelyanov, Apr 9, 2013
diff --git a/Documentation/vm/userfaultfd.rst b/Documentation/vm/userfaultfd.rst
deleted file mode 100644
index 5048cf661a8a..000000000000
--- a/Documentation/vm/userfaultfd.rst
+++ /dev/null
@@ -1,241 +0,0 @@
1.. _userfaultfd:
2
3===========
4Userfaultfd
5===========
6
7Objective
8=========
9
10Userfaults allow the implementation of on-demand paging from userland
11and more generally they allow userland to take control of various
12memory page faults, something otherwise only the kernel code could do.
13
For example, userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.
16
17Design
18======
19
20Userfaults are delivered and resolved through the userfaultfd syscall.
21
22The userfaultfd (aside from registering and unregistering virtual
23memory ranges) provides two primary functionalities:
24
251) read/POLLIN protocol to notify a userland thread of the faults
26 happening
27
282) various UFFDIO_* ioctls that can manage the virtual memory regions
29 registered in the userfaultfd that allows userland to efficiently
30 resolve the userfaults it receives via 1) or to manage the virtual
31 memory in the background
32
The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfault operations never
involve heavyweight structures like VMAs (in fact the userfaultfd
runtime load never takes the mmap_sem for writing).

VMAs are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span terabytes;
too many VMAs would be needed for that.
41
Once opened by invoking the syscall, the userfaultfd can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use the userfaultfd themselves
on the same region the manager is already tracking, which is a corner
case that would currently return -EBUSY).
49
50API
51===
52
When first opened, the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API
version); this specifies the read/POLLIN protocol userland intends to
speak on the UFFD and the uffdio_api.features userland requires. If
successful (i.e. if the requested uffdio_api.api is also spoken by the
running kernel and the requested features are going to be enabled), the
UFFDIO_API ioctl returns in uffdio_api.features and uffdio_api.ioctls
two 64-bit bitmasks of, respectively, all the available features of the
read(2) protocol and the generic ioctls available.
63
64The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
65defines what memory types are supported by the userfaultfd and what
66events, except page fault notifications, may be generated.
67
68If the kernel supports registering userfaultfd ranges on hugetlbfs
69virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
70uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
71set if the kernel supports registering userfaultfd ranges on shared
72memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
73MAP_SHARED, memfd_create, etc).
74
A userland application that wants to use userfaultfd with hugetlbfs
or shared memory needs to set the corresponding flag in
uffdio_api.features to enable those features.
78
If userland desires to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has the
appropriate UFFD_FEATURE_EVENT_* bits set. These events are described
in more detail below in the "Non-cooperative userfaultfd" section.
83
84Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
85be invoked (if present in the returned uffdio_api.ioctls bitmask) to
86register a memory range in the userfaultfd by setting the
87uffdio_register structure accordingly. The uffdio_register.mode
88bitmask will specify to the kernel which kind of faults to track for
89the range (UFFDIO_REGISTER_MODE_MISSING would track missing
90pages). The UFFDIO_REGISTER ioctl will return the
91uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
92userfaults on the range registered. Not all ioctls will necessarily be
93supported for all memory types depending on the underlying virtual
94memory backend (anonymous memory vs tmpfs vs real filebacked
95mappings).
96
97Userland can use the uffdio_register.ioctls to manage the virtual
98address space in the background (to add or potentially also remove
99memory from the userfaultfd registered range). This means a userfault
100could be triggering just before userland maps in the background the
101user-faulted page.
102
The primary ioctl to resolve userfaults is UFFDIO_COPY. It atomically
copies a page into the userfault-registered range and wakes up the
blocked userfaults (unless uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE
is set). The other ioctls work similarly to UFFDIO_COPY. They are
atomic in the sense that nothing can see a half-copied page, since the
page keeps userfaulting until the copy has finished.
110
111QEMU/KVM
112========
113
114QEMU/KVM is using the userfaultfd syscall to implement postcopy live
115migration. Postcopy live migration is one form of memory
116externalization consisting of a virtual machine running with part or
117all of its memory residing on a different node in the cloud. The
118userfaultfd abstraction is generic enough that not a single line of
119KVM kernel code had to be modified in order to add postcopy live
120migration to QEMU.
121
122Guest async page faults, FOLL_NOWAIT and all other GUP features work
123just fine in combination with userfaults. Userfaults trigger async
124page faults in the guest scheduler so those guest processes that
125aren't waiting for userfaults (i.e. network bound) can keep running in
126the guest vcpus.
127
128It is generally beneficial to run one pass of precopy live migration
129just before starting postcopy live migration, in order to avoid
130generating userfaults for readonly guest regions.
131
132The implementation of postcopy live migration currently uses one
133single bidirectional socket but in the future two different sockets
134will be used (to reduce the latency of the userfaults to the minimum
135possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
136
The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
142
A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread reads from
the userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
by the parallel QEMU migration thread).
149
150After the QEMU postcopy thread (running in the destination node) gets
151the userfault address it writes the information about the missing page
152into the socket. The QEMU source node receives the information and
153roughly "seeks" to that page address and continues sending all
154remaining missing pages from that new page offset. Soon after that
155(just the time to flush the tcp_wmem queue through the network) the
156migration thread in the QEMU running in the destination node will
157receive the page that triggered the userfault and it'll map it as
158usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
159was spontaneously sent by the source or if it was an urgent page
160requested through a userfault).
161
By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around, and a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin, and we seek
over it when receiving incoming userfaults. After sending each page the
bitmap is of course updated accordingly. This is also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).
173
174Non-cooperative userfaultfd
175===========================
176
177When the userfaultfd is monitored by an external manager, the manager
178must be able to track changes in the process virtual memory
179layout. Userfaultfd can notify the manager about such changes using
180the same read(2) protocol as for the page fault notifications. The
181manager has to explicitly enable these events by setting appropriate
182bits in uffdio_api.features passed to UFFDIO_API ioctl:
183
184UFFD_FEATURE_EVENT_FORK
185 enable userfaultfd hooks for fork(). When this feature is
186 enabled, the userfaultfd context of the parent process is
187 duplicated into the newly created process. The manager
188 receives UFFD_EVENT_FORK with file descriptor of the new
189 userfaultfd context in the uffd_msg.fork.
190
191UFFD_FEATURE_EVENT_REMAP
192 enable notifications about mremap() calls. When the
193 non-cooperative process moves a virtual memory area to a
194 different location, the manager will receive
195 UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
196 new addresses of the area and its original length.
197
198UFFD_FEATURE_EVENT_REMOVE
199 enable notifications about madvise(MADV_REMOVE) and
200 madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
201 be generated upon these calls to madvise. The uffd_msg.remove
202 will contain start and end addresses of the removed area.
203
204UFFD_FEATURE_EVENT_UNMAP
205 enable notifications about memory unmapping. The manager will
206 get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
207 end addresses of the unmapped area.
208
Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
are pretty similar, they differ quite a bit in the action expected from
the userfaultfd manager. In the former case, the virtual memory is
removed, but the area is not: the area remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault
is to zeromap the faulting address. However, in the latter case, when
an area is unmapped, either explicitly (with the munmap() system call)
or implicitly (e.g. during mremap()), the area is removed and in turn
the userfaultfd context for that area disappears too, and the manager
will not get further userland page faults from the removed area. Still,
the notification is required in order to prevent the manager from using
UFFDIO_COPY on the unmapped area.
222
Unlike userland page faults, which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The userfaultfd manager should
carefully synchronize calls to UFFDIO_COPY with the event
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
return -ENOSPC when the monitored process exits at the time of
UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed
its virtual memory layout simultaneously with an outstanding
UFFDIO_COPY operation.
233
The current asynchronous model of the event delivery is optimal for
single-threaded non-cooperative userfaultfd manager implementations. A
synchronous event delivery model could be added later as a new
userfaultfd feature to facilitate multithreading enhancements of the
non-cooperative manager, for example to allow UFFDIO_COPY ioctls to
run in parallel with the event reception. Single-threaded
implementations should continue to use the current async event
delivery model instead.