author		Mike Rapoport <rppt@linux.vnet.ibm.com>	2018-04-18 04:07:49 -0400
committer	Jonathan Corbet <corbet@lwn.net>	2018-04-27 19:02:48 -0400
commit		1ad1335dc58646764eda7bb054b350934a1b23ec (patch)
tree		8c145819f0d380744d432512ea47d89c8b91a22c /Documentation/vm
parent		3a3f7e26e5544032a687fb05b5221883b97a59ae (diff)
docs/admin-guide/mm: start moving here files from Documentation/vm
Several documents in Documentation/vm fit quite well into the "admin/user
guide" category. The documents that do not overload the reader with
implementation details and give a coherent description of a particular
feature can be moved to Documentation/admin-guide/mm.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/vm')
 Documentation/vm/00-INDEX               |  10 +-
 Documentation/vm/hugetlbpage.rst        | 381 ----
 Documentation/vm/hwpoison.rst           |   2 +-
 Documentation/vm/idle_page_tracking.rst | 115 ----
 Documentation/vm/index.rst              |   5 +-
 Documentation/vm/pagemap.rst            | 197 ----
 Documentation/vm/soft-dirty.rst         |  47 ----
 Documentation/vm/userfaultfd.rst        | 241 ----
 8 files changed, 1 insertion(+), 997 deletions(-)
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index cda564d55b3c..f8a96ca16b7a 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -12,14 +12,10 @@ highmem.rst
 	- Outline of highmem and common issues.
 hmm.rst
 	- Documentation of heterogeneous memory management
-hugetlbpage.rst
-	- a brief summary of hugetlbpage support in the Linux kernel.
 hugetlbfs_reserv.rst
 	- A brief overview of hugetlbfs reservation design/implementation.
 hwpoison.rst
 	- explains what hwpoison is
-idle_page_tracking.rst
-	- description of the idle page tracking feature.
 ksm.rst
 	- how to use the Kernel Samepage Merging feature.
 mmu_notifier.rst
@@ -34,16 +30,12 @@ page_frags.rst
 	- description of page fragments allocator
 page_migration.rst
 	- description of page migration in NUMA systems.
-pagemap.rst
-	- pagemap, from the userspace perspective
 page_owner.rst
 	- tracking about who allocated each page
 remap_file_pages.rst
 	- a note about remap_file_pages() system call
 slub.rst
 	- a short users guide for SLUB.
-soft-dirty.rst
-	- short explanation for soft-dirty PTEs
 split_page_table_lock.rst
 	- Separate per-table lock to improve scalability of the old page_table_lock.
 swap_numa.rst
@@ -52,8 +44,6 @@ transhuge.rst
 	- Transparent Hugepage Support, alternative way of using hugepages.
 unevictable-lru.rst
 	- Unevictable LRU infrastructure
-userfaultfd.rst
-	- description of userfaultfd system call
 z3fold.txt
 	- outline of z3fold allocator for storing compressed pages
 zsmalloc.rst
diff --git a/Documentation/vm/hugetlbpage.rst b/Documentation/vm/hugetlbpage.rst
deleted file mode 100644
index 2b374d10284d..000000000000
--- a/Documentation/vm/hugetlbpage.rst
+++ /dev/null
@@ -1,381 +0,0 @@
-.. _hugetlbpage:
-
-=============
-HugeTLB Pages
-=============
-
-Overview
-========
-
-The intent of this file is to give a brief summary of hugetlbpage support in
-the Linux kernel. This support is built on top of multiple page size support
-that is provided by most modern architectures. For example, x86 CPUs normally
-support 4K and 2M (1G if architecturally supported) page sizes, the ia64
-architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
-256M, and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
-translations. Typically this is a very scarce resource on a processor.
-Operating systems try to make the best use of the limited number of TLB
-resources. This optimization is more critical now as bigger and bigger
-physical memories (several GBs) are more readily available.
-
-Users can use the huge page support in the Linux kernel by either using the
-mmap system call or standard SYSV shared memory system calls (shmget, shmat).
-
-First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
-(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
-automatically when CONFIG_HUGETLBFS is selected) configuration
-options.
-
-The ``/proc/meminfo`` file provides information about the total number of
-persistent hugetlb pages in the kernel's huge page pool. It also displays
-the default huge page size and information about the number of free, reserved
-and surplus huge pages in the pool of huge pages of the default size.
-The huge page size is needed for generating the proper alignment and
-size of the arguments to system calls that map huge page regions.
-
-The output of ``cat /proc/meminfo`` will include lines like::
-
-	HugePages_Total: uuu
-	HugePages_Free:  vvv
-	HugePages_Rsvd:  www
-	HugePages_Surp:  xxx
-	Hugepagesize:    yyy kB
-	Hugetlb:         zzz kB
-
-where:
-
-HugePages_Total
-	is the size of the pool of huge pages.
-HugePages_Free
-	is the number of huge pages in the pool that are not yet
-	allocated.
-HugePages_Rsvd
-	is short for "reserved," and is the number of huge pages for
-	which a commitment to allocate from the pool has been made,
-	but no allocation has yet been made. Reserved huge pages
-	guarantee that an application will be able to allocate a
-	huge page from the pool of huge pages at fault time.
-HugePages_Surp
-	is short for "surplus," and is the number of huge pages in
-	the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
-	maximum number of surplus huge pages is controlled by
-	``/proc/sys/vm/nr_overcommit_hugepages``.
-Hugepagesize
-	is the default huge page size (in kB).
-Hugetlb
-	is the total amount of memory (in kB) consumed by huge
-	pages of all sizes.
-	If huge pages of different sizes are in use, this number
-	will exceed HugePages_Total \* Hugepagesize. To get more
-	detailed information, please refer to
-	``/sys/kernel/mm/hugepages`` (described below).
-
-
-``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
-configured in the kernel.
-
-``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
-pages in the kernel's huge page pool. "Persistent" huge pages will be
-returned to the huge page pool when freed by a task. A user with root
-privileges can dynamically allocate more or free some persistent huge pages
-by increasing or decreasing the value of ``nr_hugepages``.
-
-Pages that are used as huge pages are reserved inside the kernel and cannot
-be used for other purposes. Huge pages cannot be swapped out under
-memory pressure.
-
-Once a number of huge pages have been pre-allocated to the kernel huge page
-pool, a user with appropriate privilege can use either the mmap system call
-or shared memory system calls to use the huge pages. See the discussion of
-:ref:`Using Huge Pages <using_huge_pages>`, below.
-
-The administrator can allocate persistent huge pages on the kernel boot
-command line by specifying the "hugepages=N" parameter, where 'N' = the
-number of huge pages requested. This is the most reliable method of
-allocating huge pages as memory has not yet become fragmented.
-
-Some platforms support multiple huge page sizes. To allocate huge pages
-of a specific size, one must precede the huge pages boot command parameters
-with a huge page size selection parameter "hugepagesz=<size>". <size> must
-be specified in bytes with an optional scale suffix [kKmMgG]. The default huge
-page size may be selected with the "default_hugepagesz=<size>" boot parameter.
-
-When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
-indicates the current number of pre-allocated huge pages of the default size.
-Thus, one can use the following command to dynamically allocate/deallocate
-default sized persistent huge pages::
-
-	echo 20 > /proc/sys/vm/nr_hugepages
-
-This command will try to adjust the number of default sized huge pages in the
-huge page pool to 20, allocating or freeing huge pages, as required.
-
-On a NUMA platform, the kernel will attempt to distribute the huge page pool
-over the set of all allowed nodes specified by the NUMA memory policy of the
-task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
-task has default memory policy--is all on-line nodes with memory. Allowed
-nodes with insufficient available, contiguous memory for a huge page will be
-silently skipped when allocating persistent huge pages. See the
-:ref:`discussion below <mem_policy_and_hp_alloc>`
-of the interaction of task memory policy, cpusets and per node attributes
-with the allocation and freeing of persistent huge pages.
-
-The success or failure of huge page allocation depends on the amount of
-physically contiguous memory that is present in the system at the time of the
-allocation attempt. If the kernel is unable to allocate huge pages from
-some nodes in a NUMA system, it will attempt to make up the difference by
-allocating extra pages on other nodes with sufficient available contiguous
-memory, if any.
-
-System administrators may want to put this command in one of the local rc
-init files. This will enable the kernel to allocate huge pages early in
-the boot process when the possibility of getting physically contiguous pages
-is still very high. Administrators can verify the number of huge pages
-actually allocated by checking the sysctl or meminfo. To check the per node
-distribution of huge pages in a NUMA system, use::
-
-	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
-
-``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
-huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
-requested by applications. Writing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain that
-number of "surplus" huge pages from the kernel's normal page pool, when the
-persistent huge page pool is exhausted. As these surplus huge pages become
-unused, they are freed back to the kernel's normal page pool.
-
-When increasing the huge page pool size via ``nr_hugepages``, any existing
-surplus pages will first be promoted to persistent huge pages. Then, additional
-huge pages will be allocated, if necessary and if possible, to fulfill
-the new persistent huge page pool size.
-
-The administrator may shrink the pool of persistent huge pages for
-the default huge page size by setting the ``nr_hugepages`` sysctl to a
-smaller value. The kernel will attempt to balance the freeing of huge pages
-across all nodes in the memory policy of the task modifying ``nr_hugepages``.
-Any free huge pages on the selected nodes will be freed back to the kernel's
-normal page pool.
-
-Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
-it becomes less than the number of huge pages in use will convert the balance
-of the in-use huge pages to surplus huge pages. This will occur even if
-the number of surplus pages would exceed the overcommit value. As long as
-this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
-increased sufficiently, or the surplus huge pages go out of use and are freed--
-no more surplus huge pages will be allowed to be allocated.
-
-With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
-sysfs.
-The ``/proc`` interfaces discussed above have been retained for backwards
-compatibility. The root huge page control directory in sysfs is::
-
-	/sys/kernel/mm/hugepages
-
-For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form::
-
-	hugepages-${size}kB
-
-Inside each of these directories, the same set of files will exist::
-
-	nr_hugepages
-	nr_hugepages_mempolicy
-	nr_overcommit_hugepages
-	free_hugepages
-	resv_hugepages
-	surplus_hugepages
-
-which function as described above for the default huge page-sized case.
-
-.. _mem_policy_and_hp_alloc:
-
-Interaction of Task Memory Policy with Huge Page Allocation/Freeing
-===================================================================
-
-Whether huge pages are allocated and freed via the ``/proc`` interface or
-the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
-NUMA nodes from which huge pages are allocated or freed are controlled by the
-NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
-sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
-is ignored.
-
-The recommended method to allocate or free huge pages to/from the kernel
-huge page pool, using the ``nr_hugepages`` example above, is::
-
-	numactl --interleave <node-list> echo 20 \
-		>/proc/sys/vm/nr_hugepages_mempolicy
-
-or, more succinctly::
-
-	numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
-
-This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
-specified in <node-list>, depending on whether the number of persistent huge
-pages is initially less than or greater than 20, respectively. No huge pages
-will be allocated nor freed on any node not included in the specified
-<node-list>.
-
-When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
-memory policy mode--bind, preferred, local or interleave--may be used. The
-resulting effect on persistent huge page allocation is as follows:
-
-#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst],
-   persistent huge pages will be distributed across the node or nodes
-   specified in the mempolicy as if "interleave" had been specified.
-   However, if a node in the policy does not contain sufficient contiguous
-   memory for a huge page, the allocation will not "fallback" to the nearest
-   neighbor node with sufficient contiguous memory. To do this would cause
-   undesirable imbalance in the distribution of the huge page pool, or
-   possibly, allocation of persistent huge pages on nodes not allowed by
-   the task's memory policy.
-
-#. One or more nodes may be specified with the bind or interleave policy.
-   If more than one node is specified with the preferred policy, only the
-   lowest numeric id will be used. Local policy will select the node where
-   the task is running at the time the nodes_allowed mask is constructed.
-   For local policy to be deterministic, the task must be bound to a cpu or
-   cpus in a single node. Otherwise, the task could be migrated to some
-   other node at any time after launch and the resulting node will be
-   indeterminate. Thus, local policy is not very useful for this purpose.
-   Any of the other mempolicy modes may be used to specify a single node.
-
-#. The nodes allowed mask will be derived from any non-default task mempolicy,
-   whether this policy was set explicitly by the task itself or one of its
-   ancestors, such as numactl. This means that if the task is invoked from a
-   shell with non-default policy, that policy will be used. One can specify a
-   node list of "all" with numactl --interleave or --membind [-m] to achieve
-   interleaving over all nodes in the system or cpuset.
-
-#. Any task mempolicy specified--e.g., using numactl--will be constrained by
-   the resource limits of any cpuset in which the task runs. Thus, there will
-   be no way for a task with non-default policy running in a cpuset with a
-   subset of the system nodes to allocate huge pages outside the cpuset
-   without first moving to a cpuset that contains all of the desired nodes.
-
-#. Boot-time huge page allocation attempts to distribute the requested number
-   of huge pages over all on-line nodes with memory.
-
-Per Node Hugepages Attributes
-=============================
-
-A subset of the contents of the root huge page control directory in sysfs,
-described above, will be replicated under the system device of each
-NUMA node with memory in::
-
-	/sys/devices/system/node/node[0-9]*/hugepages/
-
-Under this directory, the subdirectory for each supported huge page size
-contains the following attribute files::
-
-	nr_hugepages
-	free_hugepages
-	surplus_hugepages
-
-The ``free_`` and ``surplus_`` attribute files are read-only. They return the
-number of free and surplus [overcommitted] huge pages, respectively, on the
-parent node.
-
-The ``nr_hugepages`` attribute returns the total number of huge pages on the
-specified node. When this attribute is written, the number of persistent huge
-pages on the parent node will be adjusted to the specified value, if sufficient
-resources exist, regardless of the task's mempolicy or cpuset constraints.
-
-Note that the number of overcommit and reserve pages remain global quantities,
-as we don't know until fault time, when the faulting task's mempolicy is
-applied, from which node the huge page allocation will be attempted.
-
-.. _using_huge_pages:
-
-Using Huge Pages
-================
-
-If the user applications are going to request huge pages using the mmap system
-call, then it is required that the system administrator mount a file system of
-type hugetlbfs::
-
-  mount -t hugetlbfs \
-	-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
-	min_size=<value>,nr_inodes=<value> none /mnt/huge
-
-This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
-``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.
-
-The ``uid`` and ``gid`` options set the owner and group of the root of the
-file system. By default the ``uid`` and ``gid`` of the current process
-are taken.
-
-The ``mode`` option sets the mode of the root of the file system to
-value & 01777. This value is given in octal. By default the value 0755
-is picked.
-
-If the platform supports multiple huge page sizes, the ``pagesize`` option can
-be used to specify the huge page size and associated pool. ``pagesize``
-is specified in bytes. If ``pagesize`` is not specified the platform's
-default huge page size and associated pool will be used.
-
-The ``size`` option sets the maximum amount of memory (huge pages) allowed
-for that filesystem (``/mnt/huge``). The ``size`` option can be specified
-in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
-The size is rounded down to the nearest HPAGE_SIZE boundary.
-
-The ``min_size`` option sets the minimum amount of memory (huge pages) allowed
-for the filesystem. ``min_size`` can be specified in the same way as ``size``,
-either bytes or a percentage of the huge page pool.
-At mount time, the number of huge pages specified by ``min_size`` are reserved
-for use by the filesystem.
-If there are not enough free huge pages available, the mount will fail.
-As huge pages are allocated to the filesystem and freed, the reserve count
-is adjusted so that the sum of allocated and reserved huge pages is always
-at least ``min_size``.
-
-The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
-can use.
-
-If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on the
-command line then no limits are set.
-
-For the ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
-use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
-For example, size=2K has the same meaning as size=2048.
-
-While read system calls are supported on files that reside on hugetlb
-file systems, write system calls are not.
-
-Regular chown, chgrp, and chmod commands (with the right permissions) can be
-used to change the file attributes on hugetlbfs.
-
-Also, it is important to note that no such mount command is required if
-applications are going to use only shmat/shmget system calls or mmap with
-MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
-:ref:`map_hugetlb <map_hugetlb>` below.
-
-Users who wish to use hugetlb memory via shared memory segments should be
-members of a supplementary group, and the system admin needs to configure that
-gid into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for the same or
-different applications to use any combination of mmaps and shm* calls, though
-the mount of the filesystem will be required for using mmap calls without
-MAP_HUGETLB.
-
-Syscalls that operate on memory backed by hugetlb pages only have their lengths
-aligned to the native page size of the processor; they will normally fail with
-errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
-not hugepage aligned. For example, munmap(2) will fail if memory is backed by
-a hugetlb page and the length is smaller than the hugepage size.
-
-
-Examples
-========
-
-.. _map_hugetlb:
-
-``map_hugetlb``
-	see tools/testing/selftests/vm/map_hugetlb.c
-
-``hugepage-shm``
-	see tools/testing/selftests/vm/hugepage-shm.c
-
-``hugepage-mmap``
-	see tools/testing/selftests/vm/hugepage-mmap.c
-
-The `libhugetlbfs`_ library provides a wide range of userspace tools
-to help with huge page usability, environment setup, and control.
-
-.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst
index 070aa1e716b7..09bd24a92784 100644
--- a/Documentation/vm/hwpoison.rst
+++ b/Documentation/vm/hwpoison.rst
@@ -155,7 +155,7 @@ Testing
   value). This allows stress testing of many kinds of
   pages. The page_flags are the same as in /proc/kpageflags. The
   flag bits are defined in include/linux/kernel-page-flags.h and
-  documented in Documentation/vm/pagemap.rst
+  documented in Documentation/admin-guide/mm/pagemap.rst

 * Architecture specific MCE injector

diff --git a/Documentation/vm/idle_page_tracking.rst b/Documentation/vm/idle_page_tracking.rst
deleted file mode 100644
index d1c4609a5220..000000000000
--- a/Documentation/vm/idle_page_tracking.rst
+++ /dev/null
@@ -1,115 +0,0 @@
-.. _idle_page_tracking:
-
-==================
-Idle Page Tracking
-==================
-
-Motivation
-==========
-
-The idle page tracking feature makes it possible to track which memory pages
-are being accessed by a workload and which are idle. This information can be
-useful for estimating the workload's working set size, which, in turn, can be
-taken into account when configuring the workload parameters, setting memory
-cgroup limits, or deciding where to place the workload within a compute
-cluster.
-
-It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
-
-.. _user_api:
-
-User API
-========
-
-The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
-Currently, it consists of a single read-write file,
-``/sys/kernel/mm/page_idle/bitmap``.
-
-The file implements a bitmap where each bit corresponds to a memory page. The
-bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
-mapped to bit #i%64 of array element #i/64; byte order is native. When a bit
-is set, the corresponding page is idle.
-
-A page is considered idle if it has not been accessed since it was marked idle
-(for more details on what "accessed" actually means see the :ref:`Implementation
-Details <impl_details>` section).
-To mark a page idle one has to set the bit corresponding to
-the page by writing to the file. A value written to the file is OR-ed with the
-current bitmap value.
-
-Only accesses to user memory pages are tracked. These are pages mapped to a
-process address space, page cache and buffer pages, and swap cache pages. For
-other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
-ignored, and hence such pages are never reported idle.
-
-For huge pages the idle flag is set only on the head page, so one has to read
-``/proc/kpageflags`` in order to correctly count idle huge pages.
-
-Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
--EINVAL if you are not starting the read/write on an 8-byte boundary, or
-if the size of the read/write is not a multiple of 8 bytes. Writing to
-this file beyond the max PFN will return -ENXIO.
-
-That said, in order to estimate the number of pages that are not used by a
-workload one should:
-
-1. Mark all the workload's pages as idle by setting the corresponding bits in
-   ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
-   ``/proc/pid/pagemap`` if the workload is represented by a process, or by
-   filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
-   is placed in a memory cgroup.
-
-2. Wait until the workload accesses its working set.
-
-3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
-   If one wants to ignore certain types of pages, e.g. mlocked pages since they
-   are not reclaimable, he or she can filter them out using
-   ``/proc/kpageflags``.
-
-See Documentation/vm/pagemap.rst for more information about
-``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
-
| 71 | .. _impl_details: | ||
| 72 | |||
| 73 | Implementation Details | ||
| 74 | ====================== | ||
| 75 | |||
| 76 | The kernel internally keeps track of accesses to user memory pages in order to | ||
| 77 | reclaim unreferenced pages first on memory shortage conditions. A page is | ||
| 78 | considered referenced if it has been recently accessed via a process address | ||
| 79 | space, in which case one or more PTEs it is mapped to will have the Accessed bit | ||
| 80 | set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The | ||
| 81 | latter happens when: | ||
| 82 | |||
| 83 | - a userspace process reads or writes a page using a system call (e.g. read(2) | ||
| 84 | or write(2)) | ||
| 85 | |||
| 86 | - a page that is used for storing filesystem buffers is read or written, | ||
| 87 | because a process needs filesystem metadata stored in it (e.g. lists a | ||
| 88 | directory tree) | ||
| 89 | |||
| 90 | - a page is accessed by a device driver using get_user_pages() | ||
| 91 | |||
| 92 | When a dirty page is written to swap or disk as a result of memory reclaim or | ||
| 93 | exceeding the dirty memory limit, it is not marked referenced. | ||
| 94 | |||
| 95 | The idle memory tracking feature adds a new page flag, the Idle flag. This flag | ||
| 96 | is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the | ||
| 97 | :ref:`User API <user_api>` | ||
| 98 | section), and cleared automatically whenever a page is referenced as defined | ||
| 99 | above. | ||
| 100 | |||
| 101 | When a page is marked idle, the Accessed bit must be cleared in all PTEs it is | ||
| 102 | mapped to, otherwise we will not be able to detect accesses to the page coming | ||
| 103 | from a process address space. To avoid interference with the reclaimer, which, | ||
| 104 | as noted above, uses the Accessed bit to promote actively referenced pages, one | ||
| 105 | more page flag is introduced, the Young flag. When the PTE Accessed bit is | ||
| 106 | cleared as a result of setting or updating a page's Idle flag, the Young flag | ||
| 107 | is set on the page. The reclaimer treats the Young flag as an extra PTE | ||
| 108 | Accessed bit and therefore will consider such a page as referenced. | ||
| 109 | |||
| 110 | Since the idle memory tracking feature is based on the memory reclaimer logic, | ||
| 111 | it only works with pages that are on an LRU list; other pages are silently | ||
| 112 | ignored. That means it will ignore a user memory page if it is isolated, but | ||
| 113 | since there are usually not many of them, it should not affect the overall | ||
| 114 | result noticeably. In order not to stall scanning of the idle page bitmap, | ||
| 115 | locked pages may be skipped too. | ||
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 6c451421a01e..ed58cb9f9675 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst | |||
| @@ -13,15 +13,10 @@ various features of the Linux memory management | |||
| 13 | .. toctree:: | 13 | .. toctree:: |
| 14 | :maxdepth: 1 | 14 | :maxdepth: 1 |
| 15 | 15 | ||
| 16 | hugetlbpage | ||
| 17 | idle_page_tracking | ||
| 18 | ksm | 16 | ksm |
| 19 | numa_memory_policy | 17 | numa_memory_policy |
| 20 | pagemap | ||
| 21 | transhuge | 18 | transhuge |
| 22 | soft-dirty | ||
| 23 | swap_numa | 19 | swap_numa |
| 24 | userfaultfd | ||
| 25 | zswap | 20 | zswap |
| 26 | 21 | ||
| 27 | Kernel developers MM documentation | 22 | Kernel developers MM documentation |
diff --git a/Documentation/vm/pagemap.rst b/Documentation/vm/pagemap.rst deleted file mode 100644 index 7ba8cbd57ad3..000000000000 --- a/Documentation/vm/pagemap.rst +++ /dev/null | |||
| @@ -1,197 +0,0 @@ | |||
| 1 | .. _pagemap: | ||
| 2 | |||
| 3 | ============================= | ||
| 4 | Examining Process Page Tables | ||
| 5 | ============================= | ||
| 6 | |||
| 7 | pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow | ||
| 8 | userspace programs to examine the page tables and related information by | ||
| 9 | reading files in ``/proc``. | ||
| 10 | |||
| 11 | There are four components to pagemap: | ||
| 12 | |||
| 13 | * ``/proc/pid/pagemap``. This file lets a userspace process find out which | ||
| 14 | physical frame each virtual page is mapped to. It contains one 64-bit | ||
| 15 | value for each virtual page, containing the following data (from | ||
| 16 | ``fs/proc/task_mmu.c``, above pagemap_read): | ||
| 17 | |||
| 18 | * Bits 0-54 page frame number (PFN) if present | ||
| 19 | * Bits 0-4 swap type if swapped | ||
| 20 | * Bits 5-54 swap offset if swapped | ||
| 21 | * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst) | ||
| 22 | * Bit 56 page exclusively mapped (since 4.2) | ||
| 23 | * Bits 57-60 zero | ||
| 24 | * Bit 61 page is file-page or shared-anon (since 3.5) | ||
| 25 | * Bit 62 page swapped | ||
| 26 | * Bit 63 page present | ||
| 27 | |||
| 28 | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. | ||
| 29 | In 4.0 and 4.1 opens by unprivileged users fail with -EPERM. Starting from | ||
| 30 | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. | ||
| 31 | Reason: information about PFNs helps in exploiting the Rowhammer vulnerability. | ||
| 32 | |||
| 33 | If the page is not present but in swap, then the PFN contains an | ||
| 34 | encoding of the swap file number and the page's offset into the | ||
| 35 | swap. Unmapped pages return a null PFN. This allows determining | ||
| 36 | precisely which pages are mapped (or in swap) and comparing mapped | ||
| 37 | pages between processes. | ||
| 38 | |||
| 39 | Efficient users of this interface will use ``/proc/pid/maps`` to | ||
| 40 | determine which areas of memory are actually mapped and llseek to | ||
| 41 | skip over unmapped regions. | ||
| 42 | |||
| 43 | * ``/proc/kpagecount``. This file contains a 64-bit count of the number of | ||
| 44 | times each page is mapped, indexed by PFN. | ||
| 45 | |||
| 46 | * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each | ||
| 47 | page, indexed by PFN. | ||
| 48 | |||
| 49 | The flags are (from ``fs/proc/page.c``, above kpageflags_read): | ||
| 50 | |||
| 51 | 0. LOCKED | ||
| 52 | 1. ERROR | ||
| 53 | 2. REFERENCED | ||
| 54 | 3. UPTODATE | ||
| 55 | 4. DIRTY | ||
| 56 | 5. LRU | ||
| 57 | 6. ACTIVE | ||
| 58 | 7. SLAB | ||
| 59 | 8. WRITEBACK | ||
| 60 | 9. RECLAIM | ||
| 61 | 10. BUDDY | ||
| 62 | 11. MMAP | ||
| 63 | 12. ANON | ||
| 64 | 13. SWAPCACHE | ||
| 65 | 14. SWAPBACKED | ||
| 66 | 15. COMPOUND_HEAD | ||
| 67 | 16. COMPOUND_TAIL | ||
| 68 | 17. HUGE | ||
| 69 | 18. UNEVICTABLE | ||
| 70 | 19. HWPOISON | ||
| 71 | 20. NOPAGE | ||
| 72 | 21. KSM | ||
| 73 | 22. THP | ||
| 74 | 23. BALLOON | ||
| 75 | 24. ZERO_PAGE | ||
| 76 | 25. IDLE | ||
| 77 | |||
| 78 | * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the | ||
| 79 | memory cgroup each page is charged to, indexed by PFN. Only available when | ||
| 80 | CONFIG_MEMCG is set. | ||
| 81 | |||
| 82 | Short descriptions to the page flags | ||
| 83 | ==================================== | ||
| 84 | |||
| 85 | 0 - LOCKED | ||
| 86 | page is being locked for exclusive access, e.g. by undergoing read/write IO | ||
| 87 | 7 - SLAB | ||
| 88 | page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator | ||
| 89 | When compound page is used, SLUB/SLQB will only set this flag on the head | ||
| 90 | page; SLOB will not flag it at all. | ||
| 91 | 10 - BUDDY | ||
| 92 | a free memory block managed by the buddy system allocator | ||
| 93 | The buddy system organizes free memory in blocks of various orders. | ||
| 94 | An order N block has 2^N physically contiguous pages, with the BUDDY flag | ||
| 95 | set for and _only_ for the first page. | ||
| 96 | 15 - COMPOUND_HEAD | ||
| 97 | A compound page with order N consists of 2^N physically contiguous pages. | ||
| 98 | A compound page with order 2 takes the form of "HTTT", where H denotes its | ||
| 99 | head page and T denotes its tail page(s). The major consumers of compound | ||
| 100 | pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc. | ||
| 101 | memory allocators and various device drivers. However in this interface, | ||
| 102 | only huge/giga pages are made visible to end users. | ||
| 103 | 16 - COMPOUND_TAIL | ||
| 104 | A compound page tail (see description above). | ||
| 105 | 17 - HUGE | ||
| 106 | this is an integral part of a HugeTLB page | ||
| 107 | 19 - HWPOISON | ||
| 108 | hardware detected memory corruption on this page: don't touch the data! | ||
| 109 | 20 - NOPAGE | ||
| 110 | no page frame exists at the requested address | ||
| 111 | 21 - KSM | ||
| 112 | identical memory pages dynamically shared between one or more processes | ||
| 113 | 22 - THP | ||
| 114 | contiguous pages which construct transparent hugepages | ||
| 115 | 23 - BALLOON | ||
| 116 | balloon compaction page | ||
| 117 | 24 - ZERO_PAGE | ||
| 118 | zero page for pfn_zero or huge_zero page | ||
| 119 | 25 - IDLE | ||
| 120 | page has not been accessed since it was marked idle (see | ||
| 121 | Documentation/vm/idle_page_tracking.rst). Note that this flag may be | ||
| 122 | stale in case the page was accessed via a PTE. To make sure the flag | ||
| 123 | is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. | ||
| 124 | |||
| 125 | IO related page flags | ||
| 126 | --------------------- | ||
| 127 | |||
| 128 | 1 - ERROR | ||
| 129 | IO error occurred | ||
| 130 | 3 - UPTODATE | ||
| 131 | page has up-to-date data | ||
| 132 | ie. for file backed page: (in-memory data revision >= on-disk one) | ||
| 133 | 4 - DIRTY | ||
| 134 | page has been written to, hence contains new data | ||
| 135 | i.e. for file backed page: (in-memory data revision > on-disk one) | ||
| 136 | 8 - WRITEBACK | ||
| 137 | page is being synced to disk | ||
| 138 | |||
| 139 | LRU related page flags | ||
| 140 | ---------------------- | ||
| 141 | |||
| 142 | 5 - LRU | ||
| 143 | page is in one of the LRU lists | ||
| 144 | 6 - ACTIVE | ||
| 145 | page is in the active LRU list | ||
| 146 | 18 - UNEVICTABLE | ||
| 147 | page is in the unevictable (non-)LRU list. It is somehow pinned and | ||
| 148 | not a candidate for LRU page reclaims, e.g. ramfs pages, | ||
| 149 | shmctl(SHM_LOCK) and mlock() memory segments | ||
| 150 | 2 - REFERENCED | ||
| 151 | page has been referenced since last LRU list enqueue/requeue | ||
| 152 | 9 - RECLAIM | ||
| 153 | page will be reclaimed soon after its pageout IO completed | ||
| 154 | 11 - MMAP | ||
| 155 | a memory mapped page | ||
| 156 | 12 - ANON | ||
| 157 | a memory mapped page that is not part of a file | ||
| 158 | 13 - SWAPCACHE | ||
| 159 | page is mapped to swap space, i.e. has an associated swap entry | ||
| 160 | 14 - SWAPBACKED | ||
| 161 | page is backed by swap/RAM | ||
| 162 | |||
| 163 | The page-types tool in the tools/vm directory can be used to query the | ||
| 164 | above flags. | ||
| 165 | |||
| 166 | Using pagemap to do something useful | ||
| 167 | ==================================== | ||
| 168 | |||
| 169 | The general procedure for using pagemap to find out about a process' memory | ||
| 170 | usage goes like this: | ||
| 171 | |||
| 172 | 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are | ||
| 173 | mapped to what. | ||
| 174 | 2. Select the maps you are interested in -- all of them, or a particular | ||
| 175 | library, or the stack or the heap, etc. | ||
| 176 | 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. | ||
| 177 | 4. Read a u64 for each page from pagemap. | ||
| 178 | 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you | ||
| 179 | just read, seek to that entry in the file, and read the data you want. | ||
| 180 | |||
| 181 | For example, to find the "unique set size" (USS), which is the amount of | ||
| 182 | memory that a process is using that is not shared with any other process, | ||
| 183 | you can go through every map in the process, find the PFNs, look those up | ||
| 184 | in kpagecount, and tally up the number of pages that are only referenced | ||
| 185 | once. | ||
| 186 | |||
| 187 | Other notes | ||
| 188 | =========== | ||
| 189 | |||
| 190 | Reading from any of the files will return -EINVAL if you are not starting | ||
| 191 | the read on an 8-byte boundary (e.g., if you sought an odd number of bytes | ||
| 192 | into the file), or if the size of the read is not a multiple of 8 bytes. | ||
| 193 | |||
| 194 | Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is | ||
| 195 | always 12 on most architectures). Since Linux 3.11 their meaning changes | ||
| 196 | after the first clear of soft-dirty bits. Since Linux 4.2 they are used for | ||
| 197 | flags unconditionally. | ||
diff --git a/Documentation/vm/soft-dirty.rst b/Documentation/vm/soft-dirty.rst deleted file mode 100644 index cb0cfd6672fa..000000000000 --- a/Documentation/vm/soft-dirty.rst +++ /dev/null | |||
| @@ -1,47 +0,0 @@ | |||
| 1 | .. _soft_dirty: | ||
| 2 | |||
| 3 | =============== | ||
| 4 | Soft-Dirty PTEs | ||
| 5 | =============== | ||
| 6 | |||
| 7 | Soft-dirty is a bit on a PTE which helps to track which pages a task | ||
| 8 | writes to. In order to do this tracking, one should: | ||
| 9 | |||
| 10 | 1. Clear soft-dirty bits from the task's PTEs. | ||
| 11 | |||
| 12 | This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the | ||
| 13 | task in question. | ||
| 14 | |||
| 15 | 2. Wait some time. | ||
| 16 | |||
| 17 | 3. Read soft-dirty bits from the PTEs. | ||
| 18 | |||
| 19 | This is done by reading from ``/proc/PID/pagemap``. Bit 55 of the | ||
| 20 | 64-bit qword is the soft-dirty bit. If set, the respective PTE was | ||
| 21 | written to since step 1. | ||
| 22 | |||
| 23 | |||
| 24 | Internally, to do this tracking, the writable bit is cleared from PTEs | ||
| 25 | when the soft-dirty bit is cleared. So, after this, when the task tries to | ||
| 26 | modify a page at some virtual address the #PF occurs and the kernel sets | ||
| 27 | the soft-dirty bit on the respective PTE. | ||
| 28 | |||
| 29 | Note that although the whole task's address space is marked read-only after | ||
| 30 | the soft-dirty bits are cleared, the #PFs that occur afterwards are processed | ||
| 31 | quickly. This is because the pages are still mapped to physical memory, so | ||
| 32 | all the kernel has to do is detect this and set both the writable and | ||
| 33 | soft-dirty bits on the PTE. | ||
| 34 | |||
| 35 | While in most cases tracking memory changes by #PF-s is more than enough | ||
| 36 | there is still a scenario when we can lose soft dirty bits -- a task | ||
| 37 | unmaps a previously mapped memory region and then maps a new one at exactly | ||
| 38 | the same place. When unmap is called, the kernel internally clears PTE values | ||
| 39 | including soft dirty bits. To notify user space application about such | ||
| 40 | memory region renewal the kernel always marks new memory regions (and | ||
| 41 | expanded regions) as soft dirty. | ||
| 42 | |||
| 43 | This feature is actively used by the checkpoint-restore project. You | ||
| 44 | can find more details about it on http://criu.org | ||
| 45 | |||
| 46 | |||
| 47 | -- Pavel Emelyanov, Apr 9, 2013 | ||
diff --git a/Documentation/vm/userfaultfd.rst b/Documentation/vm/userfaultfd.rst deleted file mode 100644 index 5048cf661a8a..000000000000 --- a/Documentation/vm/userfaultfd.rst +++ /dev/null | |||
| @@ -1,241 +0,0 @@ | |||
| 1 | .. _userfaultfd: | ||
| 2 | |||
| 3 | =========== | ||
| 4 | Userfaultfd | ||
| 5 | =========== | ||
| 6 | |||
| 7 | Objective | ||
| 8 | ========= | ||
| 9 | |||
| 10 | Userfaults allow the implementation of on-demand paging from userland | ||
| 11 | and more generally they allow userland to take control of various | ||
| 12 | memory page faults, something otherwise only the kernel code could do. | ||
| 13 | |||
| 14 | For example, userfaults allow a proper and more optimal implementation | ||
| 15 | of the PROT_NONE+SIGSEGV trick. | ||
| 16 | |||
| 17 | Design | ||
| 18 | ====== | ||
| 19 | |||
| 20 | Userfaults are delivered and resolved through the userfaultfd syscall. | ||
| 21 | |||
| 22 | The userfaultfd (aside from registering and unregistering virtual | ||
| 23 | memory ranges) provides two primary functionalities: | ||
| 24 | |||
| 25 | 1) read/POLLIN protocol to notify a userland thread of the faults | ||
| 26 | happening | ||
| 27 | |||
| 28 | 2) various UFFDIO_* ioctls that can manage the virtual memory regions | ||
| 29 | registered in the userfaultfd that allows userland to efficiently | ||
| 30 | resolve the userfaults it receives via 1) or to manage the virtual | ||
| 31 | memory in the background | ||
| 32 | |||
| 33 | The real advantage of userfaults compared to regular virtual memory | ||
| 34 | management of mremap/mprotect is that the userfaults in all their | ||
| 35 | operations never involve heavyweight structures like vmas (in fact the | ||
| 36 | userfaultfd runtime load never takes the mmap_sem for writing). | ||
| 37 | |||
| 38 | Vmas are not suitable for page- (or hugepage) granular fault tracking | ||
| 39 | when dealing with virtual address spaces that could span | ||
| 40 | Terabytes. Too many vmas would be needed for that. | ||
| 41 | |||
| 42 | Once opened by invoking the syscall, the userfaultfd can also be | ||
| 43 | passed using unix domain sockets to a manager process, so the same | ||
| 44 | manager process could handle the userfaults of a multitude of | ||
| 45 | different processes without them being aware of what is going on | ||
| 46 | (unless, of course, they later try to use the userfaultfd | ||
| 47 | themselves on the same region the manager is already tracking, which | ||
| 48 | is a corner case that would currently return -EBUSY). | ||
| 49 | |||
| 50 | API | ||
| 51 | === | ||
| 52 | |||
| 53 | When first opened the userfaultfd must be enabled invoking the | ||
| 54 | UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or | ||
| 55 | a later API version) which will specify the read/POLLIN protocol | ||
| 56 | userland intends to speak on the UFFD and the uffdio_api.features | ||
| 57 | userland requires. If successful (i.e. if the | ||
| 58 | requested uffdio_api.api is also spoken by the running kernel and the | ||
| 59 | requested features are going to be enabled), the UFFDIO_API ioctl returns | ||
| 60 | in uffdio_api.features and uffdio_api.ioctls two 64-bit bitmasks of, | ||
| 61 | respectively, all the available features of the read(2) protocol and | ||
| 62 | the generic ioctls available. | ||
| 63 | |||
| 64 | The uffdio_api.features bitmask returned by the UFFDIO_API ioctl | ||
| 65 | defines what memory types are supported by the userfaultfd and what | ||
| 66 | events, except page fault notifications, may be generated. | ||
| 67 | |||
| 68 | If the kernel supports registering userfaultfd ranges on hugetlbfs | ||
| 69 | virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in | ||
| 70 | uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be | ||
| 71 | set if the kernel supports registering userfaultfd ranges on shared | ||
| 72 | memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero | ||
| 73 | MAP_SHARED, memfd_create, etc). | ||
| 74 | |||
| 75 | The userland application that wants to use userfaultfd with hugetlbfs | ||
| 76 | or shared memory needs to set the corresponding flag in | ||
| 77 | uffdio_api.features to enable those features. | ||
| 78 | |||
| 79 | If the userland desires to receive notifications for events other than | ||
| 80 | page faults, it has to verify that uffdio_api.features has appropriate | ||
| 81 | UFFD_FEATURE_EVENT_* bits set. These events are described in more | ||
| 82 | detail below in "Non-cooperative userfaultfd" section. | ||
| 83 | |||
| 84 | Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should | ||
| 85 | be invoked (if present in the returned uffdio_api.ioctls bitmask) to | ||
| 86 | register a memory range in the userfaultfd by setting the | ||
| 87 | uffdio_register structure accordingly. The uffdio_register.mode | ||
| 88 | bitmask will specify to the kernel which kind of faults to track for | ||
| 89 | the range (UFFDIO_REGISTER_MODE_MISSING would track missing | ||
| 90 | pages). The UFFDIO_REGISTER ioctl will return the | ||
| 91 | uffdio_register.ioctls bitmask of ioctls that are suitable to resolve | ||
| 92 | userfaults on the range registered. Not all ioctls will necessarily be | ||
| 93 | supported for all memory types depending on the underlying virtual | ||
| 94 | memory backend (anonymous memory vs tmpfs vs real filebacked | ||
| 95 | mappings). | ||
| 96 | |||
| 97 | Userland can use the uffdio_register.ioctls to manage the virtual | ||
| 98 | address space in the background (to add or potentially also remove | ||
| 99 | memory from the userfaultfd registered range). This means a userfault | ||
| 100 | could be triggering just before userland maps in the background the | ||
| 101 | user-faulted page. | ||
| 102 | |||
| 103 | The primary ioctl to resolve userfaults is UFFDIO_COPY. That | ||
| 104 | atomically copies a page into the userfault registered range and wakes | ||
| 105 | up the blocked userfaults (unless uffdio_copy.mode & | ||
| 106 | UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to | ||
| 107 | UFFDIO_COPY. They're atomic in the sense that nothing can see a | ||
| 108 | half-copied page, since the page will keep userfaulting until the copy has | ||
| 109 | finished. | ||
| 110 | |||
| 111 | QEMU/KVM | ||
| 112 | ======== | ||
| 113 | |||
| 114 | QEMU/KVM is using the userfaultfd syscall to implement postcopy live | ||
| 115 | migration. Postcopy live migration is one form of memory | ||
| 116 | externalization consisting of a virtual machine running with part or | ||
| 117 | all of its memory residing on a different node in the cloud. The | ||
| 118 | userfaultfd abstraction is generic enough that not a single line of | ||
| 119 | KVM kernel code had to be modified in order to add postcopy live | ||
| 120 | migration to QEMU. | ||
| 121 | |||
| 122 | Guest async page faults, FOLL_NOWAIT and all other GUP features work | ||
| 123 | just fine in combination with userfaults. Userfaults trigger async | ||
| 124 | page faults in the guest scheduler so those guest processes that | ||
| 125 | aren't waiting for userfaults (i.e. network bound) can keep running in | ||
| 126 | the guest vcpus. | ||
| 127 | |||
| 128 | It is generally beneficial to run one pass of precopy live migration | ||
| 129 | just before starting postcopy live migration, in order to avoid | ||
| 130 | generating userfaults for readonly guest regions. | ||
| 131 | |||
| 132 | The implementation of postcopy live migration currently uses one | ||
| 133 | single bidirectional socket but in the future two different sockets | ||
| 134 | will be used (to reduce the latency of the userfaults to the minimum | ||
| 135 | possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). | ||
| 136 | |||
| 137 | The QEMU in the source node writes all pages that it knows are missing | ||
| 138 | in the destination node, into the socket, and the migration thread of | ||
| 139 | the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE | ||
| 140 | ioctls on the userfaultfd in order to map the received pages into the | ||
| 141 | guest (UFFDIO_ZEROPAGE is used if the source page was a zero page). | ||
| 142 | |||
| 143 | A different postcopy thread in the destination node listens with | ||
| 144 | poll() to the userfaultfd in parallel. When a POLLIN event is | ||
| 145 | generated after a userfault triggers, the postcopy thread reads from | ||
| 146 | the userfaultfd and receives the fault address (or -EAGAIN in case the | ||
| 147 | userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run | ||
| 148 | by the parallel QEMU migration thread). | ||
| 149 | |||
| 150 | After the QEMU postcopy thread (running in the destination node) gets | ||
| 151 | the userfault address it writes the information about the missing page | ||
| 152 | into the socket. The QEMU source node receives the information and | ||
| 153 | roughly "seeks" to that page address and continues sending all | ||
| 154 | remaining missing pages from that new page offset. Soon after that | ||
| 155 | (just the time to flush the tcp_wmem queue through the network) the | ||
| 156 | migration thread in the QEMU running in the destination node will | ||
| 157 | receive the page that triggered the userfault and it'll map it as | ||
| 158 | usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it | ||
| 159 | was spontaneously sent by the source or if it was an urgent page | ||
| 160 | requested through a userfault). | ||
| 161 | |||
| 162 | By the time the userfaults start, the QEMU in the destination node | ||
| 163 | doesn't need to keep any per-page state bitmap relative to the live | ||
| 164 | migration around and a single per-page bitmap has to be maintained in | ||
| 165 | the QEMU running in the source node to know which pages are still | ||
| 166 | missing in the destination node. The bitmap in the source node is | ||
| 167 | checked to find which missing pages to send in round robin and we seek | ||
| 168 | over it when receiving incoming userfaults. After sending each page of | ||
| 169 | course the bitmap is updated accordingly. It's also useful to avoid | ||
| 170 | sending the same page twice (in case the userfault is read by the | ||
| 171 | postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration | ||
| 172 | thread). | ||
| 173 | |||
| 174 | Non-cooperative userfaultfd | ||
| 175 | =========================== | ||
| 176 | |||
| 177 | When the userfaultfd is monitored by an external manager, the manager | ||
| 178 | must be able to track changes in the process virtual memory | ||
| 179 | layout. Userfaultfd can notify the manager about such changes using | ||
| 180 | the same read(2) protocol as for the page fault notifications. The | ||
| 181 | manager has to explicitly enable these events by setting appropriate | ||
| 182 | bits in uffdio_api.features passed to UFFDIO_API ioctl: | ||
| 183 | |||
| 184 | UFFD_FEATURE_EVENT_FORK | ||
| 185 | enable userfaultfd hooks for fork(). When this feature is | ||
| 186 | enabled, the userfaultfd context of the parent process is | ||
| 187 | duplicated into the newly created process. The manager | ||
| 188 | receives UFFD_EVENT_FORK with file descriptor of the new | ||
| 189 | userfaultfd context in the uffd_msg.fork. | ||
| 190 | |||
| 191 | UFFD_FEATURE_EVENT_REMAP | ||
| 192 | enable notifications about mremap() calls. When the | ||
| 193 | non-cooperative process moves a virtual memory area to a | ||
| 194 | different location, the manager will receive | ||
| 195 | UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and | ||
| 196 | new addresses of the area and its original length. | ||
| 197 | |||
| 198 | UFFD_FEATURE_EVENT_REMOVE | ||
| 199 | enable notifications about madvise(MADV_REMOVE) and | ||
| 200 | madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will | ||
| 201 | be generated upon these calls to madvise. The uffd_msg.remove | ||
| 202 | will contain start and end addresses of the removed area. | ||
| 203 | |||
| 204 | UFFD_FEATURE_EVENT_UNMAP | ||
| 205 | enable notifications about memory unmapping. The manager will | ||
| 206 | get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and | ||
| 207 | end addresses of the unmapped area. | ||
| 208 | |||
| 209 | Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP | ||
| 210 | are pretty similar, they differ considerably in the action expected from the | ||
| 211 | userfaultfd manager. In the former case, the virtual memory is | ||
| 212 | removed, but the area is not: it remains monitored by the | ||
| 213 | userfaultfd, and if a page fault occurs in that area it will be | ||
| 214 | delivered to the manager. The proper resolution for such a page fault is | ||
| 215 | to zeromap the faulting address. However, in the latter case, when an | ||
| 216 | area is unmapped, either explicitly (with munmap() system call), or | ||
| 217 | implicitly (e.g. during mremap()), the area is removed and in turn the | ||
| 218 | userfaultfd context for such area disappears too and the manager will | ||
| 219 | not get further userland page faults from the removed area. Still, the | ||
| 220 | notification is required in order to prevent the manager from using | ||
| 221 | UFFDIO_COPY on the unmapped area. | ||
| 222 | |||
| 223 | Unlike userland page faults which have to be synchronous and require | ||
| 224 | explicit or implicit wakeup, all the events are delivered | ||
| 225 | asynchronously and the non-cooperative process resumes execution as | ||
| 226 | soon as the manager executes read(). The userfaultfd manager should | ||
| 227 | carefully synchronize calls to UFFDIO_COPY with the events | ||
| 228 | processing. To aid the synchronization, the UFFDIO_COPY ioctl will | ||
| 229 | return -ENOSPC when the monitored process exits at the time of | ||
| 230 | UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed | ||
| 231 | its virtual memory layout simultaneously with outstanding UFFDIO_COPY | ||
| 232 | operation. | ||
| 233 | |||
| 234 | The current asynchronous model of the event delivery is optimal for | ||
| 235 | single threaded non-cooperative userfaultfd manager implementations. A | ||
| 236 | synchronous event delivery model can be added later as a new | ||
| 237 | userfaultfd feature to facilitate multithreading enhancements of the | ||
| 238 | non-cooperative manager, for example to allow UFFDIO_COPY ioctls to | ||
| 239 | run in parallel to the event reception. Single threaded | ||
| 240 | implementations should continue to use the current async event | ||
| 241 | delivery model instead. | ||
