| -rw-r--r-- | Documentation/vm/index.rst | 1 | ||||
| -rw-r--r-- | Documentation/vm/memory-model.rst | 183 |
2 files changed, 184 insertions, 0 deletions
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index b58cc3bfe777..e8d943b21cf9 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst | |||
| @@ -37,6 +37,7 @@ descriptions of data structures and algorithms. | |||
| 37 | hwpoison | 37 | hwpoison |
| 38 | hugetlbfs_reserv | 38 | hugetlbfs_reserv |
| 39 | ksm | 39 | ksm |
| 40 | memory-model | ||
| 40 | mmu_notifier | 41 | mmu_notifier |
| 41 | numa | 42 | numa |
| 42 | overcommit-accounting | 43 | overcommit-accounting |
diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst new file mode 100644 index 000000000000..382f72ace1fc --- /dev/null +++ b/Documentation/vm/memory-model.rst | |||
| @@ -0,0 +1,183 @@ | |||
| 1 | .. SPDX-License-Identifier: GPL-2.0 | ||
| 2 | |||
| 3 | .. _physical_memory_model: | ||
| 4 | |||
| 5 | ===================== | ||
| 6 | Physical Memory Model | ||
| 7 | ===================== | ||
| 8 | |||
| 9 | Physical memory in a system may be addressed in different ways. The | ||
| 10 | simplest case is when the physical memory starts at address 0 and | ||
| 11 | spans a contiguous range up to the maximal address. It could be, | ||
| 12 | however, that this range contains small holes that are not accessible | ||
| 13 | to the CPU. Then there could be several contiguous ranges at | ||
| 14 | completely distinct addresses. And, don't forget about NUMA, where | ||
| 15 | different memory banks are attached to different CPUs. | ||
| 16 | |||
| 17 | Linux abstracts this diversity using one of the three memory models: | ||
| 18 | FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what | ||
| 19 | memory models it supports, what the default memory model is and | ||
| 20 | whether it is possible to manually override that default. | ||
| 21 | |||
| 22 | .. note:: | ||
| 23 | At the time of this writing, DISCONTIGMEM is considered deprecated, | ||
| 24 | although it is still in use by several architectures. | ||
| 25 | |||
| 26 | All the memory models track the status of physical page frames using | ||
| 27 | :c:type:`struct page` arranged in one or more arrays. | ||
| 28 | |||
| 29 | Regardless of the selected memory model, there exists a one-to-one | ||
| 30 | mapping between the physical page frame number (PFN) and the | ||
| 31 | corresponding `struct page`. | ||
| 32 | |||
| 33 | Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn` | ||
| 34 | helpers that allow the conversion from PFN to `struct page` and vice | ||
| 35 | versa. | ||
| 36 | |||
| 37 | FLATMEM | ||
| 38 | ======= | ||
| 39 | |||
| 40 | The simplest memory model is FLATMEM. This model is suitable for | ||
| 41 | non-NUMA systems with contiguous, or mostly contiguous, physical | ||
| 42 | memory. | ||
| 43 | |||
| 44 | In the FLATMEM memory model, there is a global `mem_map` array that | ||
| 45 | maps the entire physical memory. For most architectures, the holes | ||
| 46 | have entries in the `mem_map` array. The `struct page` objects | ||
| 47 | corresponding to the holes are never fully initialized. | ||
| 48 | |||
| 49 | To allocate the `mem_map` array, architecture specific setup code | ||
| 50 | should call the :c:func:`free_area_init_node` function or its convenience | ||
| 51 | wrapper :c:func:`free_area_init`. However, the `mem_map` array is not | ||
| 52 | usable until the call to :c:func:`memblock_free_all` that hands all | ||
| 53 | the memory to the page allocator. | ||
| 54 | |||
| 55 | If an architecture enables the `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL` option, | ||
| 56 | it may free parts of the `mem_map` array that do not cover the | ||
| 57 | actual physical pages. In such a case, the architecture-specific | ||
| 58 | :c:func:`pfn_valid` implementation should take the holes in the | ||
| 59 | `mem_map` into account. | ||
| 60 | |||
| 61 | With FLATMEM, the conversion between a PFN and the `struct page` is | ||
| 62 | straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the | ||
| 63 | `mem_map` array. | ||
| 64 | |||
| 65 | `ARCH_PFN_OFFSET` defines the first page frame number for | ||
| 66 | systems whose physical memory starts at an address other than 0. | ||
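
The FLATMEM conversion described above can be modelled in plain user-space C. This is only a sketch: `struct page` is a one-field stand-in, the `ARCH_PFN_OFFSET` value and array size are made-up illustration values, and the `flat_*` helper names are hypothetical rather than kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed values for illustration: RAM starts at PFN 0x80000, 16 frames. */
#define ARCH_PFN_OFFSET	0x80000UL
#define DEMO_NR_PAGES	16UL

/* One global array covering the whole (tiny) physical memory. */
static struct page mem_map[DEMO_NR_PAGES];

/* PFN -> struct page: (PFN - ARCH_PFN_OFFSET) is the index into mem_map. */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
	assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < DEMO_NR_PAGES);
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* struct page -> PFN: the offset in mem_map plus ARCH_PFN_OFFSET. */
static unsigned long flat_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are a single pointer addition or subtraction, which is why FLATMEM is the cheapest model when memory is contiguous.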
| 67 | |||
| 68 | DISCONTIGMEM | ||
| 69 | ============ | ||
| 70 | |||
| 71 | The DISCONTIGMEM model treats the physical memory as a collection of | ||
| 72 | `nodes` similarly to how Linux NUMA support does. For each node Linux | ||
| 73 | constructs an independent memory management subsystem represented by | ||
| 74 | `struct pglist_data` (or `pg_data_t` for short). Among other | ||
| 75 | things, `pg_data_t` holds the `node_mem_map` array that maps | ||
| 76 | physical pages belonging to that node. The `node_start_pfn` field of | ||
| 77 | `pg_data_t` is the number of the first page frame belonging to that | ||
| 78 | node. | ||
| 79 | |||
| 80 | The architecture setup code should call :c:func:`free_area_init_node` for | ||
| 81 | each node in the system to initialize the `pg_data_t` object and its | ||
| 82 | `node_mem_map`. | ||
| 83 | |||
| 84 | Every `node_mem_map` behaves exactly like FLATMEM's `mem_map`: | ||
| 85 | every physical page frame in a node has a `struct page` entry in the | ||
| 86 | `node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the | ||
| 87 | `flags` field of the `struct page` encodes the node number of the | ||
| 88 | node hosting that page. | ||
| 89 | |||
| 90 | The conversion between a PFN and the `struct page` in the | ||
| 91 | DISCONTIGMEM model becomes slightly more complex as it has to determine | ||
| 92 | which node hosts the physical page and which `pg_data_t` object | ||
| 93 | holds the `struct page`. | ||
| 94 | |||
| 95 | Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid` | ||
| 96 | to convert a PFN to the node number. The opposite conversion helper | ||
| 97 | :c:func:`page_to_nid` is generic as it uses the node number encoded in | ||
| 98 | `page->flags`. | ||
| 99 | |||
| 100 | Once the node number is known, the PFN, taken relative to | ||
| 101 | `node_start_pfn`, indexes the appropriate `node_mem_map` array to | ||
| 102 | access the `struct page`. Conversely, the offset of the `struct page` | ||
| 103 | from the `node_mem_map` plus `node_start_pfn` is the PFN of that page. | ||
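
A user-space sketch of the two-step DISCONTIGMEM lookup follows. It makes several assumptions: `pg_data_t` is cut down to three fields, the node layout is invented, the node number occupies only the top bit of `flags`, and every `demo_*` name is hypothetical. In the kernel, `pfn_to_nid` is supplied by the architecture and is not a linear search.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	unsigned long flags;	/* top bit holds the node number in this sketch */
};

/* Heavily simplified stand-in for struct pglist_data. */
typedef struct pglist_data {
	struct page *node_mem_map;
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
} pg_data_t;

#define DEMO_NR_NODES	2
#define DEMO_NODE_SHIFT	(sizeof(unsigned long) * 8 - 1)

/* Assumed layout: node 0 covers PFNs 0x000-0x007, node 1 covers 0x100-0x107. */
static struct page demo_node0_map[8], demo_node1_map[8];
static pg_data_t demo_nodes[DEMO_NR_NODES] = {
	{ demo_node0_map, 0x000UL, 8 },
	{ demo_node1_map, 0x100UL, 8 },
};

/* Record the hosting node in page->flags for every page of every node. */
static void demo_init_nodes(void)
{
	for (int nid = 0; nid < DEMO_NR_NODES; nid++)
		for (unsigned long i = 0; i < demo_nodes[nid].node_spanned_pages; i++)
			demo_nodes[nid].node_mem_map[i].flags =
				(unsigned long)nid << DEMO_NODE_SHIFT;
}

/* Stand-in for the architecture's pfn_to_nid(); returns -1 for a hole. */
static int demo_pfn_to_nid(unsigned long pfn)
{
	for (int nid = 0; nid < DEMO_NR_NODES; nid++) {
		unsigned long start = demo_nodes[nid].node_start_pfn;

		if (pfn >= start && pfn - start < demo_nodes[nid].node_spanned_pages)
			return nid;
	}
	return -1;
}

/* Generic direction: the node number is read back from page->flags. */
static int demo_page_to_nid(const struct page *page)
{
	return (int)(page->flags >> DEMO_NODE_SHIFT);
}

static struct page *demo_pfn_to_page(unsigned long pfn)
{
	int nid = demo_pfn_to_nid(pfn);

	assert(nid >= 0);
	return demo_nodes[nid].node_mem_map + (pfn - demo_nodes[nid].node_start_pfn);
}

static unsigned long demo_page_to_pfn(const struct page *page)
{
	const pg_data_t *pgdat = &demo_nodes[demo_page_to_nid(page)];

	return (unsigned long)(page - pgdat->node_mem_map) + pgdat->node_start_pfn;
}
```

Compared with FLATMEM, each conversion needs one extra step to find the node before the array arithmetic can be done.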
| 104 | |||
| 105 | SPARSEMEM | ||
| 106 | ========= | ||
| 107 | |||
| 108 | SPARSEMEM is the most versatile memory model available in Linux and it | ||
| 109 | is the only memory model that supports several advanced features such | ||
| 110 | as hot-plug and hot-remove of the physical memory, alternative memory | ||
| 111 | maps for non-volatile memory devices and deferred initialization of | ||
| 112 | the memory map for larger systems. | ||
| 113 | |||
| 114 | The SPARSEMEM model presents the physical memory as a collection of | ||
| 115 | sections. A section is represented with :c:type:`struct mem_section` | ||
| 116 | that contains `section_mem_map` that is, logically, a pointer to an | ||
| 117 | array of struct pages. However, it is stored with some other magic | ||
| 118 | that aids section management. The section size and the maximal number | ||
| 119 | of sections are specified using the `SECTION_SIZE_BITS` and | ||
| 120 | `MAX_PHYSMEM_BITS` constants defined by each architecture that | ||
| 121 | supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a | ||
| 122 | physical address that an architecture supports, the | ||
| 123 | `SECTION_SIZE_BITS` is an arbitrary value. | ||
| 124 | |||
| 125 | The maximal number of sections is denoted `NR_MEM_SECTIONS` and | ||
| 126 | defined as | ||
| 127 | |||
| 128 | .. math:: | ||
| 129 | |||
| 130 | NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)} | ||
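
As a worked instance of the formula, the sketch below plugs in 27 for `SECTION_SIZE_BITS` and 46 for `MAX_PHYSMEM_BITS`. These numbers are assumed purely for illustration; each architecture defines its own. The `DEMO_*` and `demo_*` names are hypothetical.

```c
#include <assert.h>

/* Assumed example values; real ones come from the architecture headers. */
#define DEMO_SECTION_SIZE_BITS	27
#define DEMO_MAX_PHYSMEM_BITS	46

/* Number of address bits left over to select a section. */
#define DEMO_SECTIONS_SHIFT	(DEMO_MAX_PHYSMEM_BITS - DEMO_SECTION_SIZE_BITS)

/* NR_MEM_SECTIONS = 2 ^ (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) */
static unsigned long long demo_nr_mem_sections(void)
{
	return 1ULL << DEMO_SECTIONS_SHIFT;
}

/* Bytes of physical address space covered by one section. */
static unsigned long long demo_section_bytes(void)
{
	return 1ULL << DEMO_SECTION_SIZE_BITS;
}
```

With these inputs each section spans 2^27 bytes (128 MiB) and there can be at most 2^19 sections.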
| 131 | |||
| 132 | The `mem_section` objects are arranged in a two-dimensional array | ||
| 133 | called `mem_sections`. The size and placement of this array depend | ||
| 134 | on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of | ||
| 135 | sections: | ||
| 136 | |||
| 137 | * When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections` | ||
| 138 | array is static and has `NR_MEM_SECTIONS` rows. Each row holds a | ||
| 139 | single `mem_section` object. | ||
| 140 | * When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections` | ||
| 141 | array is dynamically allocated. Each row contains PAGE_SIZE worth of | ||
| 142 | `mem_section` objects and the number of rows is calculated to fit | ||
| 143 | all the memory sections. | ||
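
The row and column arithmetic for the `CONFIG_SPARSEMEM_EXTREME` case might look like the following sketch. The 4096-byte page size, the 16-byte `struct demo_mem_section` and all `demo_*` names are assumptions for illustration; the real structure layout and helpers differ.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	4096UL

/* Hypothetical 16-byte section descriptor (fixed-width for portability). */
struct demo_mem_section {
	uint64_t section_mem_map;
	uint64_t other;
};

/* Each row holds PAGE_SIZE worth of section descriptors. */
#define DEMO_SECTIONS_PER_ROOT \
	(DEMO_PAGE_SIZE / sizeof(struct demo_mem_section))

/* Row that holds a given section number. */
static unsigned long demo_section_root(unsigned long section_nr)
{
	return section_nr / DEMO_SECTIONS_PER_ROOT;
}

/* Position of the section inside its row. */
static unsigned long demo_section_index(unsigned long section_nr)
{
	return section_nr % DEMO_SECTIONS_PER_ROOT;
}

/* Rows needed to fit nr_sections descriptors, rounding up. */
static unsigned long demo_nr_section_roots(unsigned long nr_sections)
{
	return (nr_sections + DEMO_SECTIONS_PER_ROOT - 1) / DEMO_SECTIONS_PER_ROOT;
}
```

Because rows are allocated on demand in this configuration, only rows that contain a present section need to consume memory.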
| 144 | |||
| 145 | The architecture setup code should call :c:func:`memory_present` for | ||
| 146 | each active memory range or use :c:func:`memblocks_present` or | ||
| 147 | :c:func:`sparse_memory_present_with_active_regions` wrappers to | ||
| 148 | initialize the memory sections. Next, the actual memory maps should be | ||
| 149 | set up using :c:func:`sparse_init`. | ||
| 150 | |||
| 151 | With SPARSEMEM there are two possible ways to convert a PFN to the | ||
| 152 | corresponding `struct page`: "classic sparse" and "sparse | ||
| 153 | vmemmap". The selection is made at build time and it is determined by | ||
| 154 | the value of `CONFIG_SPARSEMEM_VMEMMAP`. | ||
| 155 | |||
| 156 | Classic sparse encodes the section number of a page in `page->flags` | ||
| 157 | and uses the high bits of a PFN to access the section that maps that page | ||
| 158 | frame. Inside a section, the PFN is the index to the array of pages. | ||
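
The classic sparse lookup can be sketched in user space as below. Assumptions: sections are only 16 pages long, `flags` stores nothing but the section number, the section array is one-dimensional, and the page array is indexed with the low PFN bits rather than with the encoded `section_mem_map` pointer the kernel keeps. All `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny sections for the demo: 2^4 = 16 page frames each. */
#define DEMO_PFN_SECTION_SHIFT	4
#define DEMO_PAGES_PER_SECTION	(1UL << DEMO_PFN_SECTION_SHIFT)
#define DEMO_NR_SECTIONS	4

struct page {
	unsigned long flags;	/* holds only the section number here */
};

struct demo_mem_section {
	struct page *section_mem_map;	/* NULL when the section is absent */
};

/* Only sections 1 and 3 are populated; 0 and 2 are holes. */
static struct page demo_sec1_pages[DEMO_PAGES_PER_SECTION];
static struct page demo_sec3_pages[DEMO_PAGES_PER_SECTION];
static struct demo_mem_section demo_sections[DEMO_NR_SECTIONS];

static void demo_sparse_init(void)
{
	demo_sections[1].section_mem_map = demo_sec1_pages;
	demo_sections[3].section_mem_map = demo_sec3_pages;
	for (unsigned long i = 0; i < DEMO_PAGES_PER_SECTION; i++) {
		demo_sec1_pages[i].flags = 1;
		demo_sec3_pages[i].flags = 3;
	}
}

/* High PFN bits select the section, low bits select the page within it. */
static struct page *demo_sparse_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn >> DEMO_PFN_SECTION_SHIFT;

	assert(nr < DEMO_NR_SECTIONS && demo_sections[nr].section_mem_map != NULL);
	return demo_sections[nr].section_mem_map +
	       (pfn & (DEMO_PAGES_PER_SECTION - 1));
}

/* The section number read from page->flags locates the owning array. */
static unsigned long demo_sparse_page_to_pfn(const struct page *page)
{
	unsigned long nr = page->flags;

	return (nr << DEMO_PFN_SECTION_SHIFT) +
	       (unsigned long)(page - demo_sections[nr].section_mem_map);
}
```

Holes cost nothing beyond an empty section descriptor, at the price of one extra memory access per conversion.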
| 159 | |||
| 160 | The sparse vmemmap uses a virtually mapped memory map to optimize | ||
| 161 | pfn_to_page and page_to_pfn operations. There is a global `struct | ||
| 162 | page *vmemmap` pointer that points to a virtually contiguous array of | ||
| 163 | `struct page` objects. A PFN is an index to that array and the | ||
| 164 | offset of the `struct page` from `vmemmap` is the PFN of that | ||
| 165 | page. | ||
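
For comparison, the sparse vmemmap conversion reduces to plain pointer arithmetic. In this sketch an ordinary static array stands in for the virtually contiguous range, and the `demo_*` names are hypothetical; in the kernel the range is a reserved region of virtual address space that is populated only where memory exists.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	unsigned long flags;
};

/* Stand-in for the reserved virtual range backing the memory map. */
static struct page demo_vmemmap_area[64];

/* Global base pointer, playing the role of `struct page *vmemmap`. */
static struct page *const demo_vmemmap = demo_vmemmap_area;

/* A PFN is simply an index into the virtually contiguous array. */
static struct page *demo_vmemmap_pfn_to_page(unsigned long pfn)
{
	assert(pfn < 64);
	return demo_vmemmap + pfn;
}

/* The offset of the struct page from the base is the PFN. */
static unsigned long demo_vmemmap_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - demo_vmemmap);
}
```

No section lookup and no bits in `page->flags` are needed, which is the optimization the text refers to.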
| 166 | |||
| 167 | To use vmemmap, an architecture has to reserve a range of virtual | ||
| 168 | addresses that will map the physical pages containing the memory | ||
| 169 | map and make sure that `vmemmap` points to that range. In addition, | ||
| 170 | the architecture should implement the :c:func:`vmemmap_populate` method | ||
| 171 | that will allocate the physical memory and create page tables for the | ||
| 172 | virtual memory map. If an architecture does not have any special | ||
| 173 | requirements for the vmemmap mappings, it can use the default | ||
| 174 | :c:func:`vmemmap_populate_basepages` provided by the generic memory | ||
| 175 | management. | ||
| 176 | |||
| 177 | The virtually mapped memory map allows storing `struct page` objects | ||
| 178 | for persistent memory devices in pre-allocated storage on those | ||
| 179 | devices. This storage is represented with :c:type:`struct vmem_altmap` | ||
| 180 | that is eventually passed to vmemmap_populate() through a long chain | ||
| 181 | of function calls. The vmemmap_populate() implementation may use the | ||
| 182 | `vmem_altmap` along with the :c:func:`altmap_alloc_block_buf` helper to | ||
| 183 | allocate the memory map on the persistent memory device. | ||
