Diffstat (limited to 'Documentation/vm')
-rw-r--r-- | Documentation/vm/hugetlbpage.txt | 262
-rw-r--r-- | Documentation/vm/hwpoison.txt | 182
-rw-r--r-- | Documentation/vm/ksm.txt | 29
-rw-r--r-- | Documentation/vm/page-types.c | 375
-rw-r--r-- | Documentation/vm/pagemap.txt | 8
-rw-r--r-- | Documentation/vm/slub.txt | 2
6 files changed, 639 insertions, 219 deletions
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index 82a7bd1800b2..bc31636973e3 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -11,23 +11,21 @@ This optimization is more critical now as bigger and bigger physical memories | |||
11 | (several GBs) are more readily available. | 11 | (several GBs) are more readily available. |
12 | 12 | ||
13 | Users can use the huge page support in Linux kernel by either using the mmap | 13 | Users can use the huge page support in Linux kernel by either using the mmap |
14 | system call or standard SYSv shared memory system calls (shmget, shmat). | 14 | system call or standard SYSV shared memory system calls (shmget, shmat). |
15 | 15 | ||
16 | First the Linux kernel needs to be built with the CONFIG_HUGETLBFS | 16 | First the Linux kernel needs to be built with the CONFIG_HUGETLBFS |
17 | (present under "File systems") and CONFIG_HUGETLB_PAGE (selected | 17 | (present under "File systems") and CONFIG_HUGETLB_PAGE (selected |
18 | automatically when CONFIG_HUGETLBFS is selected) configuration | 18 | automatically when CONFIG_HUGETLBFS is selected) configuration |
19 | options. | 19 | options. |
20 | 20 | ||
21 | The kernel built with huge page support should show the number of configured | 21 | The /proc/meminfo file provides information about the total number of |
22 | huge pages in the system by running the "cat /proc/meminfo" command. | 22 | persistent hugetlb pages in the kernel's huge page pool. It also displays |
23 | information about the number of free, reserved and surplus huge pages and the | ||
24 | default huge page size. The huge page size is needed for generating the | ||
25 | proper alignment and size of the arguments to system calls that map huge page | ||
26 | regions. | ||
23 | 27 | ||
24 | /proc/meminfo also provides information about the total number of hugetlb | 28 | The output of "cat /proc/meminfo" will include lines like: |
25 | pages configured in the kernel. It also displays information about the | ||
26 | number of free hugetlb pages at any time. It also displays information about | ||
27 | the configured huge page size - this is needed for generating the proper | ||
28 | alignment and size of the arguments to the above system calls. | ||
29 | |||
30 | The output of "cat /proc/meminfo" will have lines like: | ||
31 | 29 | ||
32 | ..... | 30 | ..... |
33 | HugePages_Total: vvv | 31 | HugePages_Total: vvv |
@@ -53,59 +51,63 @@ HugePages_Surp is short for "surplus," and is the number of huge pages in | |||
53 | /proc/filesystems should also show a filesystem of type "hugetlbfs" configured | 51 | /proc/filesystems should also show a filesystem of type "hugetlbfs" configured |
54 | in the kernel. | 52 | in the kernel. |
55 | 53 | ||
56 | /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb | 54 | /proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge |
57 | pages in the kernel. Super user can dynamically request more (or free some | 55 | pages in the kernel's huge page pool. "Persistent" huge pages will be |
58 | pre-configured) huge pages. | 56 | returned to the huge page pool when freed by a task. A user with root |
59 | The allocation (or deallocation) of hugetlb pages is possible only if there are | 57 | privileges can dynamically allocate more or free some persistent huge pages |
60 | enough physically contiguous free pages in system (freeing of huge pages is | 58 | by increasing or decreasing the value of 'nr_hugepages'. |
61 | possible only if there are enough hugetlb pages free that can be transferred | ||
62 | back to regular memory pool). | ||
63 | 59 | ||
64 | Pages that are used as hugetlb pages are reserved inside the kernel and cannot | 60 | Pages that are used as huge pages are reserved inside the kernel and cannot |
65 | be used for other purposes. | 61 | be used for other purposes. Huge pages cannot be swapped out under |
62 | memory pressure. | ||
66 | 63 | ||
67 | Once the kernel with Hugetlb page support is built and running, a user can | 64 | Once a number of huge pages have been pre-allocated to the kernel huge page |
68 | use either the mmap system call or shared memory system calls to start using | 65 | pool, a user with appropriate privilege can use either the mmap system call |
69 | the huge pages. It is required that the system administrator preallocate | 66 | or shared memory system calls to use the huge pages. See the discussion of |
70 | enough memory for huge page purposes. | 67 | Using Huge Pages, below. |
71 | 68 | ||
72 | The administrator can preallocate huge pages on the kernel boot command line by | 69 | The administrator can allocate persistent huge pages on the kernel boot |
73 | specifying the "hugepages=N" parameter, where 'N' = the number of huge pages | 70 | command line by specifying the "hugepages=N" parameter, where 'N' = the |
74 | requested. This is the most reliable method for preallocating huge pages as | 71 | number of huge pages requested. This is the most reliable method of |
75 | memory has not yet become fragmented. | 72 | allocating huge pages as memory has not yet become fragmented. |
76 | 73 | ||
77 | Some platforms support multiple huge page sizes. To preallocate huge pages | 74 | Some platforms support multiple huge page sizes. To allocate huge pages |
78 | of a specific size, one must precede the huge pages boot command parameters | 75 | of a specific size, one must precede the huge pages boot command parameters |
79 | with a huge page size selection parameter "hugepagesz=<size>". <size> must | 76 | with a huge page size selection parameter "hugepagesz=<size>". <size> must |
80 | be specified in bytes with optional scale suffix [kKmMgG]. The default huge | 77 | be specified in bytes with optional scale suffix [kKmMgG]. The default huge |
81 | page size may be selected with the "default_hugepagesz=<size>" boot parameter. | 78 | page size may be selected with the "default_hugepagesz=<size>" boot parameter. |
82 | 79 | ||
83 | /proc/sys/vm/nr_hugepages indicates the current number of configured [default | 80 | When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages |
84 | size] hugetlb pages in the kernel. Super user can dynamically request more | 81 | indicates the current number of pre-allocated huge pages of the default size. |
85 | (or free some pre-configured) huge pages. | 82 | Thus, one can use the following command to dynamically allocate/deallocate |
86 | 83 | default sized persistent huge pages: | |
87 | Use the following command to dynamically allocate/deallocate default sized | ||
88 | huge pages: | ||
89 | 84 | ||
90 | echo 20 > /proc/sys/vm/nr_hugepages | 85 | echo 20 > /proc/sys/vm/nr_hugepages |
91 | 86 | ||
92 | This command will try to configure 20 default sized huge pages in the system. | 87 | This command will try to adjust the number of default sized huge pages in the |
88 | huge page pool to 20, allocating or freeing huge pages, as required. | ||
89 | |||
93 | On a NUMA platform, the kernel will attempt to distribute the huge page pool | 90 | On a NUMA platform, the kernel will attempt to distribute the huge page pool |
94 | over the all on-line nodes. These huge pages, allocated when nr_hugepages | 91 | over the set of allowed nodes specified by the NUMA memory policy of the |
95 | is increased, are called "persistent huge pages". | 92 | task that modifies nr_hugepages. The default for the allowed nodes--when the |
93 | task has default memory policy--is all on-line nodes with memory. Allowed | ||
94 | nodes with insufficient available, contiguous memory for a huge page will be | ||
95 | silently skipped when allocating persistent huge pages. See the discussion | ||
96 | below of the interaction of task memory policy, cpusets and per node attributes | ||
97 | with the allocation and freeing of persistent huge pages. | ||
96 | 98 | ||
97 | The success or failure of huge page allocation depends on the amount of | 99 | The success or failure of huge page allocation depends on the amount of |
98 | physically contiguous memory that is preset in system at the time of the | 100 | physically contiguous memory that is present in system at the time of the |
99 | allocation attempt. If the kernel is unable to allocate huge pages from | 101 | allocation attempt. If the kernel is unable to allocate huge pages from |
100 | some nodes in a NUMA system, it will attempt to make up the difference by | 102 | some nodes in a NUMA system, it will attempt to make up the difference by |
101 | allocating extra pages on other nodes with sufficient available contiguous | 103 | allocating extra pages on other nodes with sufficient available contiguous |
102 | memory, if any. | 104 | memory, if any. |
103 | 105 | ||
104 | System administrators may want to put this command in one of the local rc init | 106 | System administrators may want to put this command in one of the local rc |
105 | files. This will enable the kernel to request huge pages early in the boot | 107 | init files. This will enable the kernel to allocate huge pages early in |
106 | process when the possibility of getting physical contiguous pages is still | 108 | the boot process when the possibility of getting physical contiguous pages |
107 | very high. Administrators can verify the number of huge pages actually | 109 | is still very high. Administrators can verify the number of huge pages |
108 | allocated by checking the sysctl or meminfo. To check the per node | 110 | actually allocated by checking the sysctl or meminfo. To check the per node |
109 | distribution of huge pages in a NUMA system, use: | 111 | distribution of huge pages in a NUMA system, use: |
110 | 112 | ||
111 | cat /sys/devices/system/node/node*/meminfo | fgrep Huge | 113 | cat /sys/devices/system/node/node*/meminfo | fgrep Huge |
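Done from C rather than the shell, the pool resize above is a single write to the sysctl file. A minimal sketch, assuming root privileges; the target of 20 pages mirrors the example:

/* Minimal sketch: resize the persistent huge page pool from C,
 * equivalent to "echo 20 > /proc/sys/vm/nr_hugepages" (root only). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");
	unsigned long n;

	if (!f) {
		perror("/proc/sys/vm/nr_hugepages");
		return EXIT_FAILURE;
	}
	fprintf(f, "20\n");	/* request a pool of 20 default sized huge pages */
	fclose(f);

	/* read back: the kernel may fall short if memory is fragmented */
	f = fopen("/proc/sys/vm/nr_hugepages", "r");
	if (f && fscanf(f, "%lu", &n) == 1)
		printf("pool now holds %lu huge pages\n", n);
	if (f)
		fclose(f);
	return 0;
}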
@@ -113,45 +115,47 @@ distribution of huge pages in a NUMA system, use: | |||
113 | /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of | 115 | /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of |
114 | huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are | 116 | huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are |
115 | requested by applications. Writing any non-zero value into this file | 117 | requested by applications. Writing any non-zero value into this file |
116 | indicates that the hugetlb subsystem is allowed to try to obtain "surplus" | 118 | indicates that the hugetlb subsystem is allowed to try to obtain that |
117 | huge pages from the buddy allocator, when the normal pool is exhausted. As | 119 | number of "surplus" huge pages from the kernel's normal page pool, when the |
118 | these surplus huge pages go out of use, they are freed back to the buddy | 120 | persistent huge page pool is exhausted. As these surplus huge pages become |
119 | allocator. | 121 | unused, they are freed back to the kernel's normal page pool. |
120 | 122 | ||
121 | When increasing the huge page pool size via nr_hugepages, any surplus | 123 | When increasing the huge page pool size via nr_hugepages, any existing surplus |
122 | pages will first be promoted to persistent huge pages. Then, additional | 124 | pages will first be promoted to persistent huge pages. Then, additional |
123 | huge pages will be allocated, if necessary and if possible, to fulfill | 125 | huge pages will be allocated, if necessary and if possible, to fulfill |
124 | the new huge page pool size. | 126 | the new persistent huge page pool size. |
125 | 127 | ||
126 | The administrator may shrink the pool of preallocated huge pages for | 128 | The administrator may shrink the pool of persistent huge pages for |
127 | the default huge page size by setting the nr_hugepages sysctl to a | 129 | the default huge page size by setting the nr_hugepages sysctl to a |
128 | smaller value. The kernel will attempt to balance the freeing of huge pages | 130 | smaller value. The kernel will attempt to balance the freeing of huge pages |
129 | across all on-line nodes. Any free huge pages on the selected nodes will | 131 | across all nodes in the memory policy of the task modifying nr_hugepages. |
130 | be freed back to the buddy allocator. | 132 | Any free huge pages on the selected nodes will be freed back to the kernel's |
131 | 133 | normal page pool. | |
132 | Caveat: Shrinking the pool via nr_hugepages such that it becomes less | 134 | |
133 | than the number of huge pages in use will convert the balance to surplus | 135 | Caveat: Shrinking the persistent huge page pool via nr_hugepages such that |
134 | huge pages even if it would exceed the overcommit value. As long as | 136 | it becomes less than the number of huge pages in use will convert the balance |
135 | this condition holds, however, no more surplus huge pages will be | 137 | of the in-use huge pages to surplus huge pages. This will occur even if |
136 | allowed on the system until one of the two sysctls are increased | 138 | the number of surplus pages would exceed the overcommit value. As long as |
137 | sufficiently, or the surplus huge pages go out of use and are freed. | 139 | this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is |
140 | increased sufficiently, or the surplus huge pages go out of use and are freed-- | ||
141 | no more surplus huge pages will be allowed to be allocated. | ||
138 | 142 | ||
139 | With support for multiple huge page pools at run-time available, much of | 143 | With support for multiple huge page pools at run-time available, much of |
140 | the huge page userspace interface has been duplicated in sysfs. The above | 144 | the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs. |
141 | information applies to the default huge page size which will be | 145 | The /proc interfaces discussed above have been retained for backwards |
142 | controlled by the /proc interfaces for backwards compatibility. The root | 146 | compatibility. The root huge page control directory in sysfs is: |
143 | huge page control directory in sysfs is: | ||
144 | 147 | ||
145 | /sys/kernel/mm/hugepages | 148 | /sys/kernel/mm/hugepages |
146 | 149 | ||
147 | For each huge page size supported by the running kernel, a subdirectory | 150 | For each huge page size supported by the running kernel, a subdirectory |
148 | will exist, of the form | 151 | will exist, of the form: |
149 | 152 | ||
150 | hugepages-${size}kB | 153 | hugepages-${size}kB |
151 | 154 | ||
152 | Inside each of these directories, the same set of files will exist: | 155 | Inside each of these directories, the same set of files will exist: |
153 | 156 | ||
154 | nr_hugepages | 157 | nr_hugepages |
158 | nr_hugepages_mempolicy | ||
155 | nr_overcommit_hugepages | 159 | nr_overcommit_hugepages |
156 | free_hugepages | 160 | free_hugepages |
157 | resv_hugepages | 161 | resv_hugepages |
@@ -159,6 +163,102 @@ Inside each of these directories, the same set of files will exist: | |||
159 | 163 | ||
160 | which function as described above for the default huge page-sized case. | 164 | which function as described above for the default huge page-sized case. |
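A sketch of how a monitoring tool might enumerate these per-size pools; only the hugepages-${size}kB naming convention described above is assumed:

/* Sketch: enumerate the per-size huge page pools under sysfs and print
 * each pool's nr_hugepages. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char base[] = "/sys/kernel/mm/hugepages";
	DIR *dir = opendir(base);
	struct dirent *d;
	char path[512];
	unsigned long nr;
	FILE *f;

	if (!dir) {
		perror(base);
		return 1;
	}
	while ((d = readdir(dir)) != NULL) {
		/* only the hugepages-<size>kB subdirectories are of interest */
		if (strncmp(d->d_name, "hugepages-", 10))
			continue;
		snprintf(path, sizeof(path), "%s/%s/nr_hugepages",
			 base, d->d_name);
		f = fopen(path, "r");
		if (f && fscanf(f, "%lu", &nr) == 1)
			printf("%s: %lu pages in pool\n", d->d_name, nr);
		if (f)
			fclose(f);
	}
	closedir(dir);
	return 0;
}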
161 | 165 | ||
166 | |||
167 | Interaction of Task Memory Policy with Huge Page Allocation/Freeing | ||
168 | |||
169 | Whether huge pages are allocated and freed via the /proc interface or | ||
170 | the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA | ||
171 | nodes from which huge pages are allocated or freed are controlled by the | ||
172 | NUMA memory policy of the task that modifies the nr_hugepages_mempolicy | ||
173 | sysctl or attribute. When the nr_hugepages attribute is used, mempolicy | ||
174 | is ignored. | ||
175 | |||
176 | The recommended method to allocate or free huge pages to/from the kernel | ||
177 | huge page pool, using the nr_hugepages example above, is: | ||
178 | |||
179 | numactl --interleave <node-list> echo 20 \ | ||
180 | >/proc/sys/vm/nr_hugepages_mempolicy | ||
181 | |||
182 | or, more succinctly: | ||
183 | |||
184 | numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy | ||
185 | |||
186 | This will allocate or free abs(20 - nr_hugepages) huge pages to or from the nodes ||
187 | specified in <node-list>, depending on whether the number of persistent huge pages ||
188 | is initially less than or greater than 20, respectively. No huge pages will be ||
189 | allocated or freed on any node not included in the specified <node-list>. ||
190 | |||
191 | When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any | ||
192 | memory policy mode--bind, preferred, local or interleave--may be used. The | ||
193 | resulting effect on persistent huge page allocation is as follows: | ||
194 | |||
195 | 1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt], | ||
196 | persistent huge pages will be distributed across the node or nodes | ||
197 | specified in the mempolicy as if "interleave" had been specified. | ||
198 | However, if a node in the policy does not contain sufficient contiguous | ||
199 | memory for a huge page, the allocation will not "fallback" to the nearest | ||
200 | neighbor node with sufficient contiguous memory. To do this would cause | ||
201 | undesirable imbalance in the distribution of the huge page pool, or | ||
202 | possibly, allocation of persistent huge pages on nodes not allowed by | ||
203 | the task's memory policy. | ||
204 | |||
205 | 2) One or more nodes may be specified with the bind or interleave policy. | ||
206 | If more than one node is specified with the preferred policy, only the | ||
207 | lowest numeric id will be used. Local policy will select the node where | ||
208 | the task is running at the time the nodes_allowed mask is constructed. | ||
209 | For local policy to be deterministic, the task must be bound to a cpu or | ||
210 | cpus in a single node. Otherwise, the task could be migrated to some | ||
211 | other node at any time after launch and the resulting node will be | ||
212 | indeterminate. Thus, local policy is not very useful for this purpose. | ||
213 | Any of the other mempolicy modes may be used to specify a single node. | ||
214 | |||
215 | 3) The nodes allowed mask will be derived from any non-default task mempolicy, | ||
216 | whether this policy was set explicitly by the task itself or one of its | ||
217 | ancestors, such as numactl. This means that if the task is invoked from a | ||
218 | shell with non-default policy, that policy will be used. One can specify a | ||
219 | node list of "all" with numactl --interleave or --membind [-m] to achieve | ||
220 | interleaving over all nodes in the system or cpuset. | ||
221 | |||
222 | 4) Any task mempolicy specified--e.g., using numactl--will be constrained by ||
223 | the resource limits of any cpuset in which the task runs. Thus, there will | ||
224 | be no way for a task with non-default policy running in a cpuset with a | ||
225 | subset of the system nodes to allocate huge pages outside the cpuset | ||
226 | without first moving to a cpuset that contains all of the desired nodes. | ||
227 | |||
228 | 5) Boot-time huge page allocation attempts to distribute the requested number | ||
229 | of huge pages over all on-line nodes with memory. (A C sketch of the mempolicy interaction follows this list.) ||
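The mempolicy interaction in points 1-4 above can also be exercised without numactl by setting the policy directly with set_mempolicy(2). A sketch, assuming a machine with nodes 0 and 1 (node mask 0x3) and the illustrative target of 20 pages:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define MPOL_INTERLEAVE 3	/* from <linux/mempolicy.h> */

int main(void)
{
	unsigned long nodemask = 0x3;	/* nodes 0 and 1 (assumed to exist) */
	FILE *f;

	/* interleave policy: huge pages will be distributed over nodes 0-1 */
	if (syscall(SYS_set_mempolicy, MPOL_INTERLEAVE,
		    &nodemask, 8 * sizeof(nodemask)) < 0) {
		perror("set_mempolicy");
		return 1;
	}
	f = fopen("/proc/sys/vm/nr_hugepages_mempolicy", "w");
	if (!f) {
		perror("/proc/sys/vm/nr_hugepages_mempolicy");
		return 1;
	}
	fprintf(f, "20\n");	/* adjust the pool on the allowed nodes only */
	fclose(f);
	return 0;
}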
230 | |||
231 | Per Node Hugepages Attributes | ||
232 | |||
233 | A subset of the contents of the root huge page control directory in sysfs, | ||
234 | described above, will be replicated under the system device of each ||
235 | NUMA node with memory in: | ||
236 | |||
237 | /sys/devices/system/node/node[0-9]*/hugepages/ | ||
238 | |||
239 | Under this directory, the subdirectory for each supported huge page size | ||
240 | contains the following attribute files: | ||
241 | |||
242 | nr_hugepages | ||
243 | free_hugepages | ||
244 | surplus_hugepages | ||
245 | |||
246 | The free_ and surplus_ attribute files are read-only. They return the number ||
247 | of free and surplus [overcommitted] huge pages, respectively, on the parent | ||
248 | node. | ||
249 | |||
250 | The nr_hugepages attribute returns the total number of huge pages on the | ||
251 | specified node. When this attribute is written, the number of persistent huge | ||
252 | pages on the parent node will be adjusted to the specified value, if sufficient | ||
253 | resources exist, regardless of the task's mempolicy or cpuset constraints. | ||
254 | |||
255 | Note that the number of overcommit and reserve pages remain global quantities, | ||
256 | as we don't know until fault time, when the faulting task's mempolicy is | ||
257 | applied, from which node the huge page allocation will be attempted. | ||
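A read-only sketch of these per-node attributes; node0 and the hugepages-2048kB subdirectory are illustrative, the latter depending on the platform's huge page sizes:

/* Sketch: print node 0's 2MB pool counters from the per-node
 * attributes described above. */
#include <stdio.h>

int main(void)
{
	const char *dir =
		"/sys/devices/system/node/node0/hugepages/hugepages-2048kB";
	const char *names[] = { "nr_hugepages", "free_hugepages",
				"surplus_hugepages" };
	char path[256];
	unsigned long val;
	int i;

	for (i = 0; i < 3; i++) {
		FILE *f;

		snprintf(path, sizeof(path), "%s/%s", dir, names[i]);
		f = fopen(path, "r");
		if (f && fscanf(f, "%lu", &val) == 1)
			printf("%s: %lu\n", names[i], val);
		if (f)
			fclose(f);
	}
	return 0;
}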
258 | |||
259 | |||
260 | Using Huge Pages | ||
261 | |||
162 | If the user applications are going to request huge pages using mmap system | 262 | If the user applications are going to request huge pages using mmap system |
163 | call, then it is required that system administrator mount a file system of | 263 | call, then it is required that system administrator mount a file system of |
164 | type hugetlbfs: | 264 | type hugetlbfs: |
@@ -206,9 +306,11 @@ map_hugetlb.c. | |||
206 | * requesting huge pages. | 306 | * requesting huge pages. |
207 | * | 307 | * |
208 | * For the ia64 architecture, the Linux kernel reserves Region number 4 for | 308 | * For the ia64 architecture, the Linux kernel reserves Region number 4 for |
209 | * huge pages. That means the addresses starting with 0x800000... will need | 309 | * huge pages. That means that if one requires a fixed address, a huge page |
210 | * to be specified. Specifying a fixed address is not required on ppc64, | 310 | * aligned address starting with 0x800000... will be required. If a fixed |
211 | * i386 or x86_64. | 311 | * address is not required, the kernel will select an address in the proper |
312 | * range. | ||
313 | * Other architectures, such as ppc64, i386 or x86_64 are not so constrained. | ||
212 | * | 314 | * |
213 | * Note: The default shared memory limit is quite low on many kernels, | 315 | * Note: The default shared memory limit is quite low on many kernels, |
214 | * you may need to increase it via: | 316 | * you may need to increase it via: |
@@ -237,14 +339,8 @@ map_hugetlb.c. | |||
237 | 339 | ||
238 | #define dprintf(x) printf(x) | 340 | #define dprintf(x) printf(x) |
239 | 341 | ||
240 | /* Only ia64 requires this */ | 342 | #define ADDR (void *)(0x0UL) /* let kernel choose address */ |
241 | #ifdef __ia64__ | ||
242 | #define ADDR (void *)(0x8000000000000000UL) | ||
243 | #define SHMAT_FLAGS (SHM_RND) | ||
244 | #else | ||
245 | #define ADDR (void *)(0x0UL) | ||
246 | #define SHMAT_FLAGS (0) | 343 | #define SHMAT_FLAGS (0) |
247 | #endif | ||
248 | 344 | ||
249 | int main(void) | 345 | int main(void) |
250 | { | 346 | { |
@@ -302,10 +398,12 @@ int main(void) | |||
302 | * example, the app is requesting memory of size 256MB that is backed by | 398 | * example, the app is requesting memory of size 256MB that is backed by |
303 | * huge pages. | 399 | * huge pages. |
304 | * | 400 | * |
305 | * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages. | 401 | * For the ia64 architecture, the Linux kernel reserves Region number 4 for |
306 | * That means the addresses starting with 0x800000... will need to be | 402 | * huge pages. That means that if one requires a fixed address, a huge page |
307 | * specified. Specifying a fixed address is not required on ppc64, i386 | 403 | * aligned address starting with 0x800000... will be required. If a fixed |
308 | * or x86_64. | 404 | * address is not required, the kernel will select an address in the proper |
405 | * range. | ||
406 | * Other architectures, such as ppc64, i386 or x86_64 are not so constrained. | ||
309 | */ | 407 | */ |
310 | #include <stdlib.h> | 408 | #include <stdlib.h> |
311 | #include <stdio.h> | 409 | #include <stdio.h> |
@@ -317,14 +415,8 @@ int main(void) | |||
317 | #define LENGTH (256UL*1024*1024) | 415 | #define LENGTH (256UL*1024*1024) |
318 | #define PROTECTION (PROT_READ | PROT_WRITE) | 416 | #define PROTECTION (PROT_READ | PROT_WRITE) |
319 | 417 | ||
320 | /* Only ia64 requires this */ | 418 | #define ADDR (void *)(0x0UL) /* let kernel choose address */ |
321 | #ifdef __ia64__ | ||
322 | #define ADDR (void *)(0x8000000000000000UL) | ||
323 | #define FLAGS (MAP_SHARED | MAP_FIXED) | ||
324 | #else | ||
325 | #define ADDR (void *)(0x0UL) | ||
326 | #define FLAGS (MAP_SHARED) | 419 | #define FLAGS (MAP_SHARED) |
327 | #endif | ||
328 | 420 | ||
329 | void check_bytes(char *addr) | 421 | void check_bytes(char *addr) |
330 | { | 422 | { |
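The listing above is cut off by the hunk boundaries; as a convenience, here is a compact, self-contained variant of the same mmap example. The /mnt/huge mount point and file name are assumptions standing in for the administrator's hugetlbfs mount:

/* Compact sketch of the hugepage mmap example: map 256MB from a file
 * on a mounted hugetlbfs and fault the huge pages in. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define LENGTH (256UL*1024*1024)

int main(void)
{
	int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0755);
	char *addr;
	unsigned long i;

	if (fd < 0) {
		perror("open");
		exit(1);
	}
	addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		unlink("/mnt/huge/example");
		exit(1);
	}
	for (i = 0; i < LENGTH; i++)	/* fault in the huge pages */
		addr[i] = (char)i;
	munmap(addr, LENGTH);
	close(fd);
	unlink("/mnt/huge/example");
	return 0;
}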
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
new file mode 100644
index 000000000000..12f9ba20ccb7
--- /dev/null
+++ b/Documentation/vm/hwpoison.txt
@@ -0,0 +1,182 @@ | |||
1 | What is hwpoison? | ||
2 | |||
3 | Upcoming Intel CPUs have support for recovering from some memory errors | ||
4 | (``MCA recovery''). This requires the OS to declare a page "poisoned", | ||
5 | kill the processes associated with it and avoid using it in the future. | ||
6 | |||
7 | This patchkit implements the necessary infrastructure in the VM. | ||
8 | |||
9 | To quote the overview comment: | ||
10 | |||
11 | * High level machine check handler. Handles pages reported by the | ||
12 | * hardware as being corrupted usually due to a 2bit ECC memory or cache | ||
13 | * failure. | ||
14 | * | ||
15 | * This focusses on pages detected as corrupted in the background. | ||
16 | * When the current CPU tries to consume corruption the currently | ||
17 | * running process can just be killed directly instead. This implies | ||
18 | * that if the error cannot be handled for some reason it's safe to | ||
19 | * just ignore it because no corruption has been consumed yet. Instead | ||
20 | * when that happens another machine check will happen. | ||
21 | * | ||
22 | * Handles page cache pages in various states. The tricky part | ||
23 | * here is that we can access any page asynchronous to other VM | ||
24 | * users, because memory failures could happen anytime and anywhere, | ||
25 | * possibly violating some of their assumptions. This is why this code | ||
26 | * has to be extremely careful. Generally it tries to use normal locking | ||
27 | * rules, as in get the standard locks, even if that means the | ||
28 | * error handling takes potentially a long time. | ||
29 | * | ||
30 | * Some of the operations here are somewhat inefficient and have non | ||
31 | * linear algorithmic complexity, because the data structures have not | ||
32 | * been optimized for this case. This is in particular the case | ||
33 | * for the mapping from a vma to a process. Since this case is expected | ||
34 | * to be rare we hope we can get away with this. | ||
35 | |||
36 | The code consists of the high level handler in mm/memory-failure.c, ||
37 | a new page poison bit and various checks in the VM to handle poisoned | ||
38 | pages. | ||
39 | |||
40 | The main target right now is KVM guests, but it works for all kinds | ||
41 | of applications. KVM support requires a recent qemu-kvm release. | ||
42 | |||
43 | For KVM use there was a need for a new signal type so that ||
44 | KVM can inject the machine check into the guest with the proper | ||
45 | address. This in theory allows other applications to handle | ||
46 | memory failures too. The expectation is that nearly all applications ||
47 | won't do that, but some very specialized ones might. | ||
48 | |||
49 | --- | ||
50 | |||
51 | There are two (actually three) modes memory failure recovery can be in: ||
52 | |||
53 | vm.memory_failure_recovery sysctl set to zero: | ||
54 | All memory failures cause a panic. Do not attempt recovery. | ||
55 | (on x86 this can also be affected by the tolerant level of the ||
56 | MCE subsystem) | ||
57 | |||
58 | early kill | ||
59 | (can be controlled globally and per process) | ||
60 | Send SIGBUS to the application as soon as the error is detected | ||
61 | This allows applications that can process memory errors in a gentle ||
62 | way (e.g. drop the affected object). ||
63 | This is the mode used by KVM qemu. | ||
64 | |||
65 | late kill | ||
66 | Send SIGBUS when the application runs into the corrupted page. | ||
67 | This is best for memory error unaware applications and is the default. ||
68 | Note some pages are always handled as late kill. | ||
69 | |||
70 | --- | ||
71 | |||
72 | User control: | ||
73 | |||
74 | vm.memory_failure_recovery | ||
75 | See sysctl.txt | ||
76 | |||
77 | vm.memory_failure_early_kill | ||
78 | Enable early kill mode globally | ||
79 | |||
80 | PR_MCE_KILL | ||
81 | Set early/late kill mode/revert to system default | ||
82 | arg1: PR_MCE_KILL_CLEAR: Revert to system default | ||
83 | arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode | ||
84 | PR_MCE_KILL_EARLY: Early kill | ||
85 | PR_MCE_KILL_LATE: Late kill | ||
86 | PR_MCE_KILL_DEFAULT: Use system global default | ||
87 | PR_MCE_KILL_GET | ||
88 | return current mode | ||
89 | |||
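A sketch of this prctl interface in use; the constant values are those from linux/prctl.h, guarded in case older headers lack them:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_MCE_KILL		/* values from linux/prctl.h */
#define PR_MCE_KILL		33
#define PR_MCE_KILL_GET		34
#endif
#ifndef PR_MCE_KILL_SET
#define PR_MCE_KILL_SET		1
#define PR_MCE_KILL_EARLY	1
#endif

int main(void)
{
	/* opt this thread into early kill: SIGBUS as soon as an error hits */
	if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) < 0) {
		perror("prctl(PR_MCE_KILL)");
		return 1;
	}
	printf("current kill mode: %ld\n",
	       (long)prctl(PR_MCE_KILL_GET, 0, 0, 0, 0));
	return 0;
}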
90 | |||
91 | --- | ||
92 | |||
93 | Testing: | ||
94 | |||
95 | madvise(MADV_HWPOISON, ....) | ||
96 | (as root) | ||
97 | Poison a page in the process for testing | ||
98 | |||
99 | |||
100 | hwpoison-inject module through debugfs | ||
101 | |||
102 | /sys/debug/hwpoison/ | ||
103 | |||
104 | corrupt-pfn | ||
105 | |||
106 | Inject hwpoison fault at PFN echoed into this file. This does | ||
107 | some early filtering to avoid corrupting unintended pages in test suites. ||
108 | |||
109 | unpoison-pfn | ||
110 | |||
111 | Software-unpoison page at PFN echoed into this file. This | ||
112 | way a page can be reused again. | ||
113 | This only works for Linux injected failures, not for real | ||
114 | memory failures. | ||
115 | |||
116 | Note these injection interfaces are not stable and might change between | ||
117 | kernel versions. ||
118 | |||
119 | corrupt-filter-dev-major | ||
120 | corrupt-filter-dev-minor | ||
121 | |||
122 | Only handle memory failures to pages associated with the file system defined | ||
123 | by block device major/minor. -1U is the wildcard value. | ||
124 | This should be only used for testing with artificial injection. | ||
125 | |||
126 | corrupt-filter-memcg | ||
127 | |||
128 | Limit injection to pages owned by a memory cgroup. Specified by the inode ||
129 | number of the memcg. ||
130 | |||
131 | Example: | ||
132 | mkdir /cgroup/hwpoison | ||
133 | |||
134 | usemem -m 100 -s 1000 & | ||
135 | echo `jobs -p` > /cgroup/hwpoison/tasks | ||
136 | |||
137 | memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ') | ||
138 | echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg | ||
139 | |||
140 | page-types -p `pidof init` --hwpoison # shall do nothing | ||
141 | page-types -p `pidof usemem` --hwpoison # poison its pages | ||
142 | |||
143 | corrupt-filter-flags-mask | ||
144 | corrupt-filter-flags-value | ||
145 | |||
146 | When specified, only poison pages if ((page_flags & mask) == value). | ||
147 | This allows stress testing of many kinds of pages. The page_flags | ||
148 | are the same as in /proc/kpageflags. The flag bits are defined in | ||
149 | include/linux/kernel-page-flags.h and documented in | ||
150 | Documentation/vm/pagemap.txt | ||
151 | |||
152 | Architecture specific MCE injector | ||
153 | |||
154 | x86 has mce-inject, mce-test | ||
155 | |||
156 | Some portable hwpoison test programs are in mce-test; see below. ||
157 | |||
158 | --- | ||
159 | |||
160 | References: | ||
161 | |||
162 | http://halobates.de/mce-lc09-2.pdf | ||
163 | Overview presentation from LinuxCon 09 | ||
164 | |||
165 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | ||
166 | Test suite (hwpoison specific portable tests in tsrc) | ||
167 | |||
168 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | ||
169 | x86 specific injector | ||
170 | |||
171 | |||
172 | --- | ||
173 | |||
174 | Limitations: | ||
175 | |||
176 | - Not all page types are supported and never will be. Most kernel internal ||
177 | objects cannot be recovered, only LRU pages for now. | ||
178 | - Right now hugepage support is missing. | ||
179 | |||
180 | --- | ||
181 | Andi Kleen, Oct 2009 | ||
182 | |||
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index 72a22f65960e..b392e496f816 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -16,9 +16,9 @@ by sharing the data common between them. But it can be useful to any | |||
16 | application which generates many instances of the same data. | 16 | application which generates many instances of the same data. |
17 | 17 | ||
18 | KSM only merges anonymous (private) pages, never pagecache (file) pages. | 18 | KSM only merges anonymous (private) pages, never pagecache (file) pages. |
19 | KSM's merged pages are at present locked into kernel memory for as long | 19 | KSM's merged pages were originally locked into kernel memory, but can now |
20 | as they are shared: so cannot be swapped out like the user pages they | 20 | be swapped out just like other user pages (but sharing is broken when they |
21 | replace (but swapping KSM pages should follow soon in a later release). | 21 | are swapped back in: ksmd must rediscover their identity and merge again). |
22 | 22 | ||
23 | KSM only operates on those areas of address space which an application | 23 | KSM only operates on those areas of address space which an application |
24 | has advised to be likely candidates for merging, by using the madvise(2) | 24 | has advised to be likely candidates for merging, by using the madvise(2) |
@@ -44,23 +44,15 @@ includes unmapped gaps (though working on the intervening mapped areas), | |||
44 | and might fail with EAGAIN if not enough memory for internal structures. | 44 | and might fail with EAGAIN if not enough memory for internal structures. |
45 | 45 | ||
46 | Applications should be considerate in their use of MADV_MERGEABLE, | 46 | Applications should be considerate in their use of MADV_MERGEABLE, |
47 | restricting its use to areas likely to benefit. KSM's scans may use | 47 | restricting its use to areas likely to benefit. KSM's scans may use a lot |
48 | a lot of processing power, and its kernel-resident pages are a limited | 48 | of processing power: some installations will disable KSM for that reason. |
49 | resource. Some installations will disable KSM for these reasons. | ||
50 | 49 | ||
51 | The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, | 50 | The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, |
52 | readable by all but writable only by root: | 51 | readable by all but writable only by root: |
53 | 52 | ||
54 | max_kernel_pages - set to maximum number of kernel pages that KSM may use | ||
55 | e.g. "echo 2000 > /sys/kernel/mm/ksm/max_kernel_pages" | ||
56 | Value 0 imposes no limit on the kernel pages KSM may use; | ||
57 | but note that any process using MADV_MERGEABLE can cause | ||
58 | KSM to allocate these pages, unswappable until it exits. | ||
59 | Default: 2000 (chosen for demonstration purposes) | ||
60 | |||
61 | pages_to_scan - how many present pages to scan before ksmd goes to sleep | 53 | pages_to_scan - how many present pages to scan before ksmd goes to sleep |
62 | e.g. "echo 200 > /sys/kernel/mm/ksm/pages_to_scan" | 54 | e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan" |
63 | Default: 200 (chosen for demonstration purposes) | 55 | Default: 100 (chosen for demonstration purposes) |
64 | 56 | ||
65 | sleep_millisecs - how many milliseconds ksmd should sleep before next scan | 57 | sleep_millisecs - how many milliseconds ksmd should sleep before next scan |
66 | e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" | 58 | e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" |
@@ -70,11 +62,12 @@ run - set 0 to stop ksmd from running but keep merged pages, | |||
70 | set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", | 62 | set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", |
71 | set 2 to stop ksmd and unmerge all pages currently merged, | 63 | set 2 to stop ksmd and unmerge all pages currently merged, |
72 | but leave mergeable areas registered for next run | 64 | but leave mergeable areas registered for next run |
73 | Default: 1 (for immediate use by apps which register) | 65 | Default: 0 (must be changed to 1 to activate KSM, |
66 | except if CONFIG_SYSFS is disabled) | ||
74 | 67 | ||
75 | The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: | 68 | The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: |
76 | 69 | ||
77 | pages_shared - how many shared unswappable kernel pages KSM is using | 70 | pages_shared - how many shared pages are being used |
78 | pages_sharing - how many more sites are sharing them i.e. how much saved | 71 | pages_sharing - how many more sites are sharing them i.e. how much saved |
79 | pages_unshared - how many pages unique but repeatedly checked for merging | 72 | pages_unshared - how many pages unique but repeatedly checked for merging |
80 | pages_volatile - how many pages changing too fast to be placed in a tree | 73 | pages_volatile - how many pages changing too fast to be placed in a tree |
@@ -86,4 +79,4 @@ pages_volatile embraces several different kinds of activity, but a high | |||
86 | proportion there would also indicate poor use of madvise MADV_MERGEABLE. | 79 | proportion there would also indicate poor use of madvise MADV_MERGEABLE. |
87 | 80 | ||
88 | Izik Eidus, | 81 | Izik Eidus, |
89 | Hugh Dickins, 30 July 2009 | 82 | Hugh Dickins, 17 Nov 2009 |
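To illustrate the madvise(2) interface described above, a minimal sketch that fills two anonymous pages with identical data and marks them mergeable; whether they actually merge depends on ksmd being activated through run:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12	/* from <asm-generic/mman-common.h> */
#endif

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 2 * psize, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 'x', 2 * psize);	/* two pages with identical contents */

	if (madvise(buf, 2 * psize, MADV_MERGEABLE) < 0) {
		perror("madvise(MADV_MERGEABLE)");	/* needs CONFIG_KSM */
		return 1;
	}
	pause();	/* give ksmd time to scan; watch pages_sharing grow */
	return 0;
}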
diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c
index fa1a30d9e9d5..66e9358e2144 100644
--- a/Documentation/vm/page-types.c
+++ b/Documentation/vm/page-types.c
@@ -1,8 +1,22 @@ | |||
1 | /* | 1 | /* |
2 | * page-types: Tool for querying page flags | 2 | * page-types: Tool for querying page flags |
3 | * | 3 | * |
4 | * This program is free software; you can redistribute it and/or modify it | ||
5 | * under the terms of the GNU General Public License as published by the Free | ||
6 | * Software Foundation; version 2. | ||
7 | * | ||
8 | * This program is distributed in the hope that it will be useful, but WITHOUT | ||
9 | * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or | ||
10 | * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for | ||
11 | * more details. | ||
12 | * | ||
13 | * You should find a copy of v2 of the GNU General Public License somewhere on | ||
14 | * your Linux system; if not, write to the Free Software Foundation, Inc., 59 | ||
15 | * Temple Place, Suite 330, Boston, MA 02111-1307 USA. | ||
16 | * | ||
4 | * Copyright (C) 2009 Intel corporation | 17 | * Copyright (C) 2009 Intel corporation |
5 | * Copyright (C) 2009 Wu Fengguang <fengguang.wu@intel.com> | 18 | * |
19 | * Authors: Wu Fengguang <fengguang.wu@intel.com> | ||
6 | */ | 20 | */ |
7 | 21 | ||
8 | #define _LARGEFILE64_SOURCE | 22 | #define _LARGEFILE64_SOURCE |
@@ -69,7 +83,9 @@ | |||
69 | #define KPF_COMPOUND_TAIL 16 | 83 | #define KPF_COMPOUND_TAIL 16 |
70 | #define KPF_HUGE 17 | 84 | #define KPF_HUGE 17 |
71 | #define KPF_UNEVICTABLE 18 | 85 | #define KPF_UNEVICTABLE 18 |
86 | #define KPF_HWPOISON 19 | ||
72 | #define KPF_NOPAGE 20 | 87 | #define KPF_NOPAGE 20 |
88 | #define KPF_KSM 21 | ||
73 | 89 | ||
74 | /* [32-] kernel hacking assistances */ | 90 | /* [32-] kernel hacking assistances */ |
75 | #define KPF_RESERVED 32 | 91 | #define KPF_RESERVED 32 |
@@ -95,7 +111,7 @@ | |||
95 | #define BIT(name) (1ULL << KPF_##name) | 111 | #define BIT(name) (1ULL << KPF_##name) |
96 | #define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL)) | 112 | #define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL)) |
97 | 113 | ||
98 | static char *page_flag_names[] = { | 114 | static const char *page_flag_names[] = { |
99 | [KPF_LOCKED] = "L:locked", | 115 | [KPF_LOCKED] = "L:locked", |
100 | [KPF_ERROR] = "E:error", | 116 | [KPF_ERROR] = "E:error", |
101 | [KPF_REFERENCED] = "R:referenced", | 117 | [KPF_REFERENCED] = "R:referenced", |
@@ -116,7 +132,9 @@ static char *page_flag_names[] = { | |||
116 | [KPF_COMPOUND_TAIL] = "T:compound_tail", | 132 | [KPF_COMPOUND_TAIL] = "T:compound_tail", |
117 | [KPF_HUGE] = "G:huge", | 133 | [KPF_HUGE] = "G:huge", |
118 | [KPF_UNEVICTABLE] = "u:unevictable", | 134 | [KPF_UNEVICTABLE] = "u:unevictable", |
135 | [KPF_HWPOISON] = "X:hwpoison", | ||
119 | [KPF_NOPAGE] = "n:nopage", | 136 | [KPF_NOPAGE] = "n:nopage", |
137 | [KPF_KSM] = "x:ksm", | ||
120 | 138 | ||
121 | [KPF_RESERVED] = "r:reserved", | 139 | [KPF_RESERVED] = "r:reserved", |
122 | [KPF_MLOCKED] = "m:mlocked", | 140 | [KPF_MLOCKED] = "m:mlocked", |
@@ -152,9 +170,6 @@ static unsigned long opt_size[MAX_ADDR_RANGES]; | |||
152 | static int nr_vmas; | 170 | static int nr_vmas; |
153 | static unsigned long pg_start[MAX_VMAS]; | 171 | static unsigned long pg_start[MAX_VMAS]; |
154 | static unsigned long pg_end[MAX_VMAS]; | 172 | static unsigned long pg_end[MAX_VMAS]; |
155 | static unsigned long voffset; | ||
156 | |||
157 | static int pagemap_fd; | ||
158 | 173 | ||
159 | #define MAX_BIT_FILTERS 64 | 174 | #define MAX_BIT_FILTERS 64 |
160 | static int nr_bit_filters; | 175 | static int nr_bit_filters; |
@@ -163,9 +178,16 @@ static uint64_t opt_bits[MAX_BIT_FILTERS]; | |||
163 | 178 | ||
164 | static int page_size; | 179 | static int page_size; |
165 | 180 | ||
166 | #define PAGES_BATCH (64 << 10) /* 64k pages */ | 181 | static int pagemap_fd; |
167 | static int kpageflags_fd; | 182 | static int kpageflags_fd; |
168 | 183 | ||
184 | static int opt_hwpoison; | ||
185 | static int opt_unpoison; | ||
186 | |||
187 | static const char hwpoison_debug_fs[] = "/debug/hwpoison"; | ||
188 | static int hwpoison_inject_fd; | ||
189 | static int hwpoison_forget_fd; | ||
190 | |||
169 | #define HASH_SHIFT 13 | 191 | #define HASH_SHIFT 13 |
170 | #define HASH_SIZE (1 << HASH_SHIFT) | 192 | #define HASH_SIZE (1 << HASH_SHIFT) |
171 | #define HASH_MASK (HASH_SIZE - 1) | 193 | #define HASH_MASK (HASH_SIZE - 1) |
@@ -207,6 +229,74 @@ static void fatal(const char *x, ...) | |||
207 | exit(EXIT_FAILURE); | 229 | exit(EXIT_FAILURE); |
208 | } | 230 | } |
209 | 231 | ||
232 | static int checked_open(const char *pathname, int flags) | ||
233 | { | ||
234 | int fd = open(pathname, flags); | ||
235 | |||
236 | if (fd < 0) { | ||
237 | perror(pathname); | ||
238 | exit(EXIT_FAILURE); | ||
239 | } | ||
240 | |||
241 | return fd; | ||
242 | } | ||
243 | |||
244 | /* | ||
245 | * pagemap/kpageflags routines | ||
246 | */ | ||
247 | |||
248 | static unsigned long do_u64_read(int fd, char *name, | ||
249 | uint64_t *buf, | ||
250 | unsigned long index, | ||
251 | unsigned long count) | ||
252 | { | ||
253 | long bytes; | ||
254 | |||
255 | if (index > ULONG_MAX / 8) | ||
256 | fatal("index overflow: %lu\n", index); | ||
257 | |||
258 | if (lseek(fd, index * 8, SEEK_SET) < 0) { | ||
259 | perror(name); | ||
260 | exit(EXIT_FAILURE); | ||
261 | } | ||
262 | |||
263 | bytes = read(fd, buf, count * 8); | ||
264 | if (bytes < 0) { | ||
265 | perror(name); | ||
266 | exit(EXIT_FAILURE); | ||
267 | } | ||
268 | if (bytes % 8) | ||
269 | fatal("partial read: %lu bytes\n", bytes); | ||
270 | |||
271 | return bytes / 8; | ||
272 | } | ||
273 | |||
274 | static unsigned long kpageflags_read(uint64_t *buf, | ||
275 | unsigned long index, | ||
276 | unsigned long pages) | ||
277 | { | ||
278 | return do_u64_read(kpageflags_fd, PROC_KPAGEFLAGS, buf, index, pages); | ||
279 | } | ||
280 | |||
281 | static unsigned long pagemap_read(uint64_t *buf, | ||
282 | unsigned long index, | ||
283 | unsigned long pages) | ||
284 | { | ||
285 | return do_u64_read(pagemap_fd, "/proc/pid/pagemap", buf, index, pages); | ||
286 | } | ||
287 | |||
288 | static unsigned long pagemap_pfn(uint64_t val) | ||
289 | { | ||
290 | unsigned long pfn; | ||
291 | |||
292 | if (val & PM_PRESENT) | ||
293 | pfn = PM_PFRAME(val); | ||
294 | else | ||
295 | pfn = 0; | ||
296 | |||
297 | return pfn; | ||
298 | } | ||
299 | |||
210 | 300 | ||
211 | /* | 301 | /* |
212 | * page flag names | 302 | * page flag names |
@@ -222,7 +312,7 @@ static char *page_flag_name(uint64_t flags) | |||
222 | present = (flags >> i) & 1; | 312 | present = (flags >> i) & 1; |
223 | if (!page_flag_names[i]) { | 313 | if (!page_flag_names[i]) { |
224 | if (present) | 314 | if (present) |
225 | fatal("unkown flag bit %d\n", i); | 315 | fatal("unknown flag bit %d\n", i); |
226 | continue; | 316 | continue; |
227 | } | 317 | } |
228 | buf[j++] = present ? page_flag_names[i][0] : '_'; | 318 | buf[j++] = present ? page_flag_names[i][0] : '_'; |
@@ -255,7 +345,8 @@ static char *page_flag_longname(uint64_t flags) | |||
255 | * page list and summary | 345 | * page list and summary |
256 | */ | 346 | */ |
257 | 347 | ||
258 | static void show_page_range(unsigned long offset, uint64_t flags) | 348 | static void show_page_range(unsigned long voffset, |
349 | unsigned long offset, uint64_t flags) | ||
259 | { | 350 | { |
260 | static uint64_t flags0; | 351 | static uint64_t flags0; |
261 | static unsigned long voff; | 352 | static unsigned long voff; |
@@ -281,7 +372,8 @@ static void show_page_range(unsigned long offset, uint64_t flags) | |||
281 | count = 1; | 372 | count = 1; |
282 | } | 373 | } |
283 | 374 | ||
284 | static void show_page(unsigned long offset, uint64_t flags) | 375 | static void show_page(unsigned long voffset, |
376 | unsigned long offset, uint64_t flags) | ||
285 | { | 377 | { |
286 | if (opt_pid) | 378 | if (opt_pid) |
287 | printf("%lx\t", voffset); | 379 | printf("%lx\t", voffset); |
@@ -362,6 +454,62 @@ static uint64_t well_known_flags(uint64_t flags) | |||
362 | return flags; | 454 | return flags; |
363 | } | 455 | } |
364 | 456 | ||
457 | static uint64_t kpageflags_flags(uint64_t flags) | ||
458 | { | ||
459 | flags = expand_overloaded_flags(flags); | ||
460 | |||
461 | if (!opt_raw) | ||
462 | flags = well_known_flags(flags); | ||
463 | |||
464 | return flags; | ||
465 | } | ||
466 | |||
467 | /* | ||
468 | * page actions | ||
469 | */ | ||
470 | |||
471 | static void prepare_hwpoison_fd(void) | ||
472 | { | ||
473 | char buf[100]; | ||
474 | |||
475 | if (opt_hwpoison && !hwpoison_inject_fd) { | ||
476 | sprintf(buf, "%s/corrupt-pfn", hwpoison_debug_fs); | ||
477 | hwpoison_inject_fd = checked_open(buf, O_WRONLY); | ||
478 | } | ||
479 | |||
480 | if (opt_unpoison && !hwpoison_forget_fd) { | ||
481 | sprintf(buf, "%s/renew-pfn", hwpoison_debug_fs); | ||
482 | hwpoison_forget_fd = checked_open(buf, O_WRONLY); | ||
483 | } | ||
484 | } | ||
485 | |||
486 | static int hwpoison_page(unsigned long offset) | ||
487 | { | ||
488 | char buf[100]; | ||
489 | int len; | ||
490 | |||
491 | len = sprintf(buf, "0x%lx\n", offset); | ||
492 | len = write(hwpoison_inject_fd, buf, len); | ||
493 | if (len < 0) { | ||
494 | perror("hwpoison inject"); | ||
495 | return len; | ||
496 | } | ||
497 | return 0; | ||
498 | } | ||
499 | |||
500 | static int unpoison_page(unsigned long offset) | ||
501 | { | ||
502 | char buf[100]; | ||
503 | int len; | ||
504 | |||
505 | len = sprintf(buf, "0x%lx\n", offset); | ||
506 | len = write(hwpoison_forget_fd, buf, len); | ||
507 | if (len < 0) { | ||
508 | perror("hwpoison forget"); | ||
509 | return len; | ||
510 | } | ||
511 | return 0; | ||
512 | } | ||
365 | 513 | ||
366 | /* | 514 | /* |
367 | * page frame walker | 515 | * page frame walker |
@@ -394,104 +542,83 @@ static int hash_slot(uint64_t flags) | |||
394 | exit(EXIT_FAILURE); | 542 | exit(EXIT_FAILURE); |
395 | } | 543 | } |
396 | 544 | ||
397 | static void add_page(unsigned long offset, uint64_t flags) | 545 | static void add_page(unsigned long voffset, |
546 | unsigned long offset, uint64_t flags) | ||
398 | { | 547 | { |
399 | flags = expand_overloaded_flags(flags); | 548 | flags = kpageflags_flags(flags); |
400 | |||
401 | if (!opt_raw) | ||
402 | flags = well_known_flags(flags); | ||
403 | 549 | ||
404 | if (!bit_mask_ok(flags)) | 550 | if (!bit_mask_ok(flags)) |
405 | return; | 551 | return; |
406 | 552 | ||
553 | if (opt_hwpoison) | ||
554 | hwpoison_page(offset); | ||
555 | if (opt_unpoison) | ||
556 | unpoison_page(offset); | ||
557 | |||
407 | if (opt_list == 1) | 558 | if (opt_list == 1) |
408 | show_page_range(offset, flags); | 559 | show_page_range(voffset, offset, flags); |
409 | else if (opt_list == 2) | 560 | else if (opt_list == 2) |
410 | show_page(offset, flags); | 561 | show_page(voffset, offset, flags); |
411 | 562 | ||
412 | nr_pages[hash_slot(flags)]++; | 563 | nr_pages[hash_slot(flags)]++; |
413 | total_pages++; | 564 | total_pages++; |
414 | } | 565 | } |
415 | 566 | ||
416 | static void walk_pfn(unsigned long index, unsigned long count) | 567 | #define KPAGEFLAGS_BATCH (64 << 10) /* 64k pages */ |
568 | static void walk_pfn(unsigned long voffset, | ||
569 | unsigned long index, | ||
570 | unsigned long count) | ||
417 | { | 571 | { |
572 | uint64_t buf[KPAGEFLAGS_BATCH]; | ||
418 | unsigned long batch; | 573 | unsigned long batch; |
419 | unsigned long n; | 574 | long pages; |
420 | unsigned long i; | 575 | unsigned long i; |
421 | 576 | ||
422 | if (index > ULONG_MAX / KPF_BYTES) | ||
423 | fatal("index overflow: %lu\n", index); | ||
424 | |||
425 | lseek(kpageflags_fd, index * KPF_BYTES, SEEK_SET); | ||
426 | |||
427 | while (count) { | 577 | while (count) { |
428 | uint64_t kpageflags_buf[KPF_BYTES * PAGES_BATCH]; | 578 | batch = min_t(unsigned long, count, KPAGEFLAGS_BATCH); |
429 | 579 | pages = kpageflags_read(buf, index, batch); | |
430 | batch = min_t(unsigned long, count, PAGES_BATCH); | 580 | if (pages == 0) |
431 | n = read(kpageflags_fd, kpageflags_buf, batch * KPF_BYTES); | ||
432 | if (n == 0) | ||
433 | break; | 581 | break; |
434 | if (n < 0) { | ||
435 | perror(PROC_KPAGEFLAGS); | ||
436 | exit(EXIT_FAILURE); | ||
437 | } | ||
438 | 582 | ||
439 | if (n % KPF_BYTES != 0) | 583 | for (i = 0; i < pages; i++) |
440 | fatal("partial read: %lu bytes\n", n); | 584 | add_page(voffset + i, index + i, buf[i]); |
441 | n = n / KPF_BYTES; | ||
442 | 585 | ||
443 | for (i = 0; i < n; i++) | 586 | index += pages; |
444 | add_page(index + i, kpageflags_buf[i]); | 587 | count -= pages; |
445 | |||
446 | index += batch; | ||
447 | count -= batch; | ||
448 | } | 588 | } |
449 | } | 589 | } |
450 | 590 | ||
451 | 591 | #define PAGEMAP_BATCH (64 << 10) | |
452 | #define PAGEMAP_BATCH 4096 | 592 | static void walk_vma(unsigned long index, unsigned long count) |
453 | static unsigned long task_pfn(unsigned long pgoff) | ||
454 | { | 593 | { |
455 | static uint64_t buf[PAGEMAP_BATCH]; | 594 | uint64_t buf[PAGEMAP_BATCH]; |
456 | static unsigned long start; | 595 | unsigned long batch; |
457 | static long count; | 596 | unsigned long pages; |
458 | uint64_t pfn; | 597 | unsigned long pfn; |
598 | unsigned long i; | ||
459 | 599 | ||
460 | if (pgoff < start || pgoff >= start + count) { | 600 | while (count) { |
461 | if (lseek64(pagemap_fd, | 601 | batch = min_t(unsigned long, count, PAGEMAP_BATCH); |
462 | (uint64_t)pgoff * PM_ENTRY_BYTES, | 602 | pages = pagemap_read(buf, index, batch); |
463 | SEEK_SET) < 0) { | 603 | if (pages == 0) |
464 | perror("pagemap seek"); | 604 | break; |
465 | exit(EXIT_FAILURE); | ||
466 | } | ||
467 | count = read(pagemap_fd, buf, sizeof(buf)); | ||
468 | if (count == 0) | ||
469 | return 0; | ||
470 | if (count < 0) { | ||
471 | perror("pagemap read"); | ||
472 | exit(EXIT_FAILURE); | ||
473 | } | ||
474 | if (count % PM_ENTRY_BYTES) { | ||
475 | fatal("pagemap read not aligned.\n"); | ||
476 | exit(EXIT_FAILURE); | ||
477 | } | ||
478 | count /= PM_ENTRY_BYTES; | ||
479 | start = pgoff; | ||
480 | } | ||
481 | 605 | ||
482 | pfn = buf[pgoff - start]; | 606 | for (i = 0; i < pages; i++) { |
483 | if (pfn & PM_PRESENT) | 607 | pfn = pagemap_pfn(buf[i]); |
484 | pfn = PM_PFRAME(pfn); | 608 | if (pfn) |
485 | else | 609 | walk_pfn(index + i, pfn, 1); |
486 | pfn = 0; | 610 | } |
487 | 611 | ||
488 | return pfn; | 612 | index += pages; |
613 | count -= pages; | ||
614 | } | ||
489 | } | 615 | } |
490 | 616 | ||
491 | static void walk_task(unsigned long index, unsigned long count) | 617 | static void walk_task(unsigned long index, unsigned long count) |
492 | { | 618 | { |
493 | int i = 0; | ||
494 | const unsigned long end = index + count; | 619 | const unsigned long end = index + count; |
620 | unsigned long start; | ||
621 | int i = 0; | ||
495 | 622 | ||
496 | while (index < end) { | 623 | while (index < end) { |
497 | 624 | ||
@@ -501,15 +628,11 @@ static void walk_task(unsigned long index, unsigned long count) | |||
501 | if (pg_start[i] >= end) | 628 | if (pg_start[i] >= end) |
502 | return; | 629 | return; |
503 | 630 | ||
504 | voffset = max_t(unsigned long, pg_start[i], index); | 631 | start = max_t(unsigned long, pg_start[i], index); |
505 | index = min_t(unsigned long, pg_end[i], end); | 632 | index = min_t(unsigned long, pg_end[i], end); |
506 | 633 | ||
507 | assert(voffset < index); | 634 | assert(start < index); |
508 | for (; voffset < index; voffset++) { | 635 | walk_vma(start, index - start); |
509 | unsigned long pfn = task_pfn(voffset); | ||
510 | if (pfn) | ||
511 | walk_pfn(pfn, 1); | ||
512 | } | ||
513 | } | 636 | } |
514 | } | 637 | } |
515 | 638 | ||
@@ -527,18 +650,14 @@ static void walk_addr_ranges(void) | |||
527 | { | 650 | { |
528 | int i; | 651 | int i; |
529 | 652 | ||
530 | kpageflags_fd = open(PROC_KPAGEFLAGS, O_RDONLY); | 653 | kpageflags_fd = checked_open(PROC_KPAGEFLAGS, O_RDONLY); |
531 | if (kpageflags_fd < 0) { | ||
532 | perror(PROC_KPAGEFLAGS); | ||
533 | exit(EXIT_FAILURE); | ||
534 | } | ||
535 | 654 | ||
536 | if (!nr_addr_ranges) | 655 | if (!nr_addr_ranges) |
537 | add_addr_range(0, ULONG_MAX); | 656 | add_addr_range(0, ULONG_MAX); |
538 | 657 | ||
539 | for (i = 0; i < nr_addr_ranges; i++) | 658 | for (i = 0; i < nr_addr_ranges; i++) |
540 | if (!opt_pid) | 659 | if (!opt_pid) |
541 | walk_pfn(opt_offset[i], opt_size[i]); | 660 | walk_pfn(0, opt_offset[i], opt_size[i]); |
542 | else | 661 | else |
543 | walk_task(opt_offset[i], opt_size[i]); | 662 | walk_task(opt_offset[i], opt_size[i]); |
544 | 663 | ||
@@ -565,28 +684,35 @@ static void usage(void) | |||
565 | 684 | ||
566 | printf( | 685 | printf( |
567 | "page-types [options]\n" | 686 | "page-types [options]\n" |
568 | " -r|--raw Raw mode, for kernel developers\n" | 687 | " -r|--raw Raw mode, for kernel developers\n" |
569 | " -a|--addr addr-spec Walk a range of pages\n" | 688 | " -d|--describe flags Describe flags\n" |
570 | " -b|--bits bits-spec Walk pages with specified bits\n" | 689 | " -a|--addr addr-spec Walk a range of pages\n" |
571 | " -p|--pid pid Walk process address space\n" | 690 | " -b|--bits bits-spec Walk pages with specified bits\n" |
691 | " -p|--pid pid Walk process address space\n" | ||
572 | #if 0 /* planned features */ | 692 | #if 0 /* planned features */ |
573 | " -f|--file filename Walk file address space\n" | 693 | " -f|--file filename Walk file address space\n" |
574 | #endif | 694 | #endif |
575 | " -l|--list Show page details in ranges\n" | 695 | " -l|--list Show page details in ranges\n" |
576 | " -L|--list-each Show page details one by one\n" | 696 | " -L|--list-each Show page details one by one\n" |
577 | " -N|--no-summary Don't show summay info\n" | 697 | " -N|--no-summary Don't show summay info\n" |
578 | " -h|--help Show this usage message\n" | 698 | " -X|--hwpoison hwpoison pages\n" |
699 | " -x|--unpoison unpoison pages\n" | ||
700 | " -h|--help Show this usage message\n" | ||
701 | "flags:\n" | ||
702 | " 0x10 bitfield format, e.g.\n" | ||
703 | " anon bit-name, e.g.\n" | ||
704 | " 0x10,anon comma-separated list, e.g.\n" | ||
579 | "addr-spec:\n" | 705 | "addr-spec:\n" |
580 | " N one page at offset N (unit: pages)\n" | 706 | " N one page at offset N (unit: pages)\n" |
581 | " N+M pages range from N to N+M-1\n" | 707 | " N+M pages range from N to N+M-1\n" |
582 | " N,M pages range from N to M-1\n" | 708 | " N,M pages range from N to M-1\n" |
583 | " N, pages range from N to end\n" | 709 | " N, pages range from N to end\n" |
584 | " ,M pages range from 0 to M-1\n" | 710 | " ,M pages range from 0 to M-1\n" |
585 | "bits-spec:\n" | 711 | "bits-spec:\n" |
586 | " bit1,bit2 (flags & (bit1|bit2)) != 0\n" | 712 | " bit1,bit2 (flags & (bit1|bit2)) != 0\n" |
587 | " bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n" | 713 | " bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n" |
588 | " bit1,~bit2 (flags & (bit1|bit2)) == bit1\n" | 714 | " bit1,~bit2 (flags & (bit1|bit2)) == bit1\n" |
589 | " =bit1,bit2 flags == (bit1|bit2)\n" | 715 | " =bit1,bit2 flags == (bit1|bit2)\n" |
590 | "bit-names:\n" | 716 | "bit-names:\n" |
591 | ); | 717 | ); |
592 | 718 | ||
@@ -624,11 +750,7 @@ static void parse_pid(const char *str) | |||
624 | opt_pid = parse_number(str); | 750 | opt_pid = parse_number(str); |
625 | 751 | ||
626 | sprintf(buf, "/proc/%d/pagemap", opt_pid); | 752 | sprintf(buf, "/proc/%d/pagemap", opt_pid); |
627 | pagemap_fd = open(buf, O_RDONLY); | 753 | pagemap_fd = checked_open(buf, O_RDONLY); |
628 | if (pagemap_fd < 0) { | ||
629 | perror(buf); | ||
630 | exit(EXIT_FAILURE); | ||
631 | } | ||
632 | 754 | ||
633 | sprintf(buf, "/proc/%d/maps", opt_pid); | 755 | sprintf(buf, "/proc/%d/maps", opt_pid); |
634 | file = fopen(buf, "r"); | 756 | file = fopen(buf, "r"); |
@@ -778,16 +900,28 @@ static void parse_bits_mask(const char *optarg) | |||
778 | add_bits_filter(mask, bits); | 900 | add_bits_filter(mask, bits); |
779 | } | 901 | } |
780 | 902 | ||
903 | static void describe_flags(const char *optarg) | ||
904 | { | ||
905 | uint64_t flags = parse_flag_names(optarg, 0); | ||
906 | |||
907 | printf("0x%016llx\t%s\t%s\n", | ||
908 | (unsigned long long)flags, | ||
909 | page_flag_name(flags), | ||
910 | page_flag_longname(flags)); | ||
911 | } | ||
781 | 912 | ||
782 | static struct option opts[] = { | 913 | static const struct option opts[] = { |
783 | { "raw" , 0, NULL, 'r' }, | 914 | { "raw" , 0, NULL, 'r' }, |
784 | { "pid" , 1, NULL, 'p' }, | 915 | { "pid" , 1, NULL, 'p' }, |
785 | { "file" , 1, NULL, 'f' }, | 916 | { "file" , 1, NULL, 'f' }, |
786 | { "addr" , 1, NULL, 'a' }, | 917 | { "addr" , 1, NULL, 'a' }, |
787 | { "bits" , 1, NULL, 'b' }, | 918 | { "bits" , 1, NULL, 'b' }, |
919 | { "describe" , 1, NULL, 'd' }, | ||
788 | { "list" , 0, NULL, 'l' }, | 920 | { "list" , 0, NULL, 'l' }, |
789 | { "list-each" , 0, NULL, 'L' }, | 921 | { "list-each" , 0, NULL, 'L' }, |
790 | { "no-summary", 0, NULL, 'N' }, | 922 | { "no-summary", 0, NULL, 'N' }, |
923 | { "hwpoison" , 0, NULL, 'X' }, | ||
924 | { "unpoison" , 0, NULL, 'x' }, | ||
791 | { "help" , 0, NULL, 'h' }, | 925 | { "help" , 0, NULL, 'h' }, |
792 | { NULL , 0, NULL, 0 } | 926 | { NULL , 0, NULL, 0 } |
793 | }; | 927 | }; |
@@ -799,7 +933,7 @@ int main(int argc, char *argv[]) | |||
799 | page_size = getpagesize(); | 933 | page_size = getpagesize(); |
800 | 934 | ||
801 | while ((c = getopt_long(argc, argv, | 935 | while ((c = getopt_long(argc, argv, |
802 | "rp:f:a:b:lLNh", opts, NULL)) != -1) { | 936 | "rp:f:a:b:d:lLNXxh", opts, NULL)) != -1) { |
803 | switch (c) { | 937 | switch (c) { |
804 | case 'r': | 938 | case 'r': |
805 | opt_raw = 1; | 939 | opt_raw = 1; |
@@ -816,6 +950,9 @@ int main(int argc, char *argv[]) | |||
816 | case 'b': | 950 | case 'b': |
817 | parse_bits_mask(optarg); | 951 | parse_bits_mask(optarg); |
818 | break; | 952 | break; |
953 | case 'd': | ||
954 | describe_flags(optarg); | ||
955 | exit(0); | ||
819 | case 'l': | 956 | case 'l': |
820 | opt_list = 1; | 957 | opt_list = 1; |
821 | break; | 958 | break; |
@@ -825,6 +962,14 @@ int main(int argc, char *argv[]) | |||
825 | case 'N': | 962 | case 'N': |
826 | opt_no_summary = 1; | 963 | opt_no_summary = 1; |
827 | break; | 964 | break; |
965 | case 'X': | ||
966 | opt_hwpoison = 1; | ||
967 | prepare_hwpoison_fd(); | ||
968 | break; | ||
969 | case 'x': | ||
970 | opt_unpoison = 1; | ||
971 | prepare_hwpoison_fd(); | ||
972 | break; | ||
828 | case 'h': | 973 | case 'h': |
829 | usage(); | 974 | usage(); |
830 | exit(0); | 975 | exit(0); |
@@ -844,7 +989,7 @@ int main(int argc, char *argv[]) | |||
844 | walk_addr_ranges(); | 989 | walk_addr_ranges(); |
845 | 990 | ||
846 | if (opt_list == 1) | 991 | if (opt_list == 1) |
847 | show_page_range(0, 0); /* drain the buffer */ | 992 | show_page_range(0, 0, 0); /* drain the buffer */ |
848 | 993 | ||
849 | if (opt_no_summary) | 994 | if (opt_no_summary) |
850 | return 0; | 995 | return 0; |
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 600a304a828c..df09b9650a81 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -57,7 +57,9 @@ There are three components to pagemap: | |||
57 | 16. COMPOUND_TAIL | 57 | 16. COMPOUND_TAIL |
58 | 17. HUGE | 58 | 17. HUGE |
59 | 18. UNEVICTABLE | 59 | 18. UNEVICTABLE |
60 | 19. HWPOISON | ||
60 | 20. NOPAGE | 61 | 20. NOPAGE |
62 | 21. KSM | ||
61 | 63 | ||
62 | Short descriptions to the page flags: | 64 | Short descriptions to the page flags: |
63 | 65 | ||
@@ -86,9 +88,15 @@ Short descriptions to the page flags: | |||
86 | 17. HUGE | 88 | 17. HUGE |
87 | this is an integral part of a HugeTLB page | 89 | this is an integral part of a HugeTLB page |
88 | 90 | ||
91 | 19. HWPOISON | ||
92 | hardware detected memory corruption on this page: don't touch the data! | ||
93 | |||
89 | 20. NOPAGE | 94 | 20. NOPAGE |
90 | no page frame exists at the requested address | 95 | no page frame exists at the requested address |
91 | 96 | ||
97 | 21. KSM | ||
98 | identical memory pages dynamically shared between one or more processes | ||
99 | |||
92 | [IO related page flags] | 100 | [IO related page flags] |
93 | 1. ERROR IO error occurred | 101 | 1. ERROR IO error occurred |
94 | 3. UPTODATE page has up-to-date data | 102 | 3. UPTODATE page has up-to-date data |
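A sketch of reading one entry from /proc/kpageflags (root only) and testing the two bits this patch adds, 19 HWPOISON and 21 KSM; the pfn argument is illustrative:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define KPF_HWPOISON	19
#define KPF_KSM		21

int main(int argc, char **argv)
{
	unsigned long pfn = (argc > 1) ? strtoul(argv[1], NULL, 0) : 0;
	int fd = open("/proc/kpageflags", O_RDONLY);
	uint64_t flags;

	if (fd < 0) {
		perror("/proc/kpageflags");
		return 1;
	}
	/* one little-endian 64-bit flag word per pfn */
	if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags))
	    != sizeof(flags)) {
		perror("pread");
		return 1;
	}
	printf("pfn %lu: flags 0x%016llx hwpoison=%d ksm=%d\n", pfn,
	       (unsigned long long)flags,
	       (int)((flags >> KPF_HWPOISON) & 1),
	       (int)((flags >> KPF_KSM) & 1));
	close(fd);
	return 0;
}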
diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index 510917ff59ed..b37300edf27c 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -245,7 +245,7 @@ been overwritten. Here a string of 8 characters was written into a slab that | |||
245 | has the length of 8 characters. However, an 8 character string needs a | 245 | has the length of 8 characters. However, an 8 character string needs a |
246 | terminating 0. That zero has overwritten the first byte of the Redzone field. | 246 | terminating 0. That zero has overwritten the first byte of the Redzone field. |
247 | After reporting the details of the issue encountered the FIX SLUB message | 247 | After reporting the details of the issue encountered the FIX SLUB message |
248 | tell us that SLUB has restored the Redzone to its proper value and then | 248 | tells us that SLUB has restored the Redzone to its proper value and then |
249 | system operations continue. | 249 | system operations continue. |
250 | 250 | ||
251 | Emergency operations: | 251 | Emergency operations: |