Diffstat (limited to 'Documentation/vm')
-rw-r--r-- | Documentation/vm/numa_memory_policy.txt | 281 |
1 files changed, 201 insertions, 80 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index dd4986497996..bad16d3f6a47 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -135,77 +135,58 @@ most general to most specific: | |||
135 | 135 | ||
136 | Components of Memory Policies | 136 | Components of Memory Policies |
137 | 137 | ||
138 | A Linux memory policy is a tuple consisting of a "mode" and an optional set | 138 | A Linux memory policy consists of a "mode", optional mode flags, and an |
139 | of nodes. The mode determine the behavior of the policy, while the | 139 | optional set of nodes. The mode determines the behavior of the policy, |
140 | optional set of nodes can be viewed as the arguments to the behavior. | 140 | the optional mode flags determine the behavior of the mode, and the |
141 | optional set of nodes can be viewed as the arguments to the policy | ||
142 | behavior. | ||
141 | 143 | ||
142 | Internally, memory policies are implemented by a reference counted | 144 | Internally, memory policies are implemented by a reference counted |
143 | structure, struct mempolicy. Details of this structure will be discussed | 145 | structure, struct mempolicy. Details of this structure will be discussed |
144 | in context, below, as required to explain the behavior. | 146 | in context, below, as required to explain the behavior. |
145 | 147 | ||
146 | Note: in some functions AND in the struct mempolicy itself, the mode | ||
147 | is called "policy". However, to avoid confusion with the policy tuple, | ||
148 | this document will continue to use the term "mode". | ||
149 | |||
150 | Linux memory policy supports the following 4 behavioral modes: | 148 | Linux memory policy supports the following 4 behavioral modes: |
151 | 149 | ||
152 | Default Mode--MPOL_DEFAULT: The behavior specified by this mode is | 150 | Default Mode--MPOL_DEFAULT: This mode is only used in the memory |
153 | context or scope dependent. | 151 | policy APIs. Internally, MPOL_DEFAULT is converted to the NULL |
154 | 152 | memory policy in all policy scopes. Any existing non-default policy | |
155 | As mentioned in the Policy Scope section above, during normal | 153 | will simply be removed when MPOL_DEFAULT is specified. As a result, |
156 | system operation, the System Default Policy is hard coded to | 154 | MPOL_DEFAULT means "fall back to the next most specific policy scope." |
157 | contain the Default mode. | ||
158 | |||
159 | In this context, default mode means "local" allocation--that is | ||
160 | attempt to allocate the page from the node associated with the cpu | ||
161 | where the fault occurs. If the "local" node has no memory, or the | ||
162 | node's memory can be exhausted [no free pages available], local | ||
163 | allocation will "fallback to"--attempt to allocate pages from-- | ||
164 | "nearby" nodes, in order of increasing "distance". | ||
165 | 155 | ||
166 | Implementation detail -- subject to change: "Fallback" uses | 156 | For example, a NULL or default task policy will fall back to the |
167 | a per node list of sibling nodes--called zonelists--built at | 157 | system default policy. A NULL or default vma policy will fall |
168 | boot time, or when nodes or memory are added or removed from | 158 | back to the task policy. |
169 | the system [memory hotplug]. These per node zonelist are | ||
170 | constructed with nodes in order of increasing distance based | ||
171 | on information provided by the platform firmware. | ||
172 | 159 | ||
173 | When a task/process policy or a shared policy contains the Default | 160 | When specified in one of the memory policy APIs, the Default mode |
174 | mode, this also means "local allocation", as described above. | 161 | does not use the optional set of nodes. |
175 | 162 | ||
176 | In the context of a VMA, Default mode means "fall back to task | 163 | It is an error for the set of nodes specified for this policy to |
177 | policy"--which may or may not specify Default mode. Thus, Default | 164 | be non-empty. |
178 | mode can not be counted on to mean local allocation when used | ||
179 | on a non-shared region of the address space. However, see | ||
180 | MPOL_PREFERRED below. | ||
181 | |||
182 | The Default mode does not use the optional set of nodes. | ||
183 | 165 | ||
184 | MPOL_BIND: This mode specifies that memory must come from the | 166 | MPOL_BIND: This mode specifies that memory must come from the |
185 | set of nodes specified by the policy. | 167 | set of nodes specified by the policy. Memory will be allocated from |
186 | 168 | the node in the set with sufficient free memory that is closest to | |
187 | The memory policy APIs do not specify an order in which the nodes | 169 | the node where the allocation takes place. |
188 | will be searched. However, unlike "local allocation", the Bind | ||
189 | policy does not consider the distance between the nodes. Rather, | ||
190 | allocations will fallback to the nodes specified by the policy in | ||
191 | order of numeric node id. Like everything in Linux, this is subject | ||
192 | to change. | ||
193 | 170 | ||
194 | MPOL_PREFERRED: This mode specifies that the allocation should be | 171 | MPOL_PREFERRED: This mode specifies that the allocation should be |
195 | attempted from the single node specified in the policy. If that | 172 | attempted from the single node specified in the policy. If that |
196 | allocation fails, the kernel will search other nodes, exactly as | 173 | allocation fails, the kernel will search other nodes, in order of |
197 | it would for a local allocation that started at the preferred node | 174 | increasing distance from the preferred node based on information |
198 | in increasing distance from the preferred node. "Local" allocation | 175 | provided by the platform firmware. |
199 | policy can be viewed as a Preferred policy that starts at the node | ||
200 | containing the cpu where the allocation takes place. | 176 | containing the cpu where the allocation takes place. |
201 | 177 | ||
202 | Internally, the Preferred policy uses a single node--the | 178 | Internally, the Preferred policy uses a single node--the |
203 | preferred_node member of struct mempolicy. A "distinguished | 179 | preferred_node member of struct mempolicy. When the internal |
204 | value of this preferred_node, currently '-1', is interpreted | 180 | mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and |
205 | as "the node containing the cpu where the allocation takes | 181 | the policy is interpreted as local allocation. "Local" allocation |
206 | place"--local allocation. This is the way to specify | 182 | policy can be viewed as a Preferred policy that starts at the node |
207 | local allocation for a specific range of addresses--i.e. for | 183 | containing the cpu where the allocation takes place. |
208 | VMA policies. | 184 | |
185 | It is possible for the user to specify that local allocation is | ||
186 | always preferred by passing an empty nodemask with this mode. | ||
187 | If an empty nodemask is passed, the policy cannot use the | ||
188 | MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described | ||
189 | below. | ||
209 | 190 | ||
210 | MPOL_INTERLEAVED: This mode specifies that page allocations be | 191 | MPOL_INTERLEAVED: This mode specifies that page allocations be |
211 | interleaved, on a page granularity, across the nodes specified in | 192 | interleaved, on a page granularity, across the nodes specified in |
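The argument rules stated in this hunk (MPOL_DEFAULT takes no nodes; MPOL_PREFERRED with an empty nodemask means local allocation and cannot carry a mode flag) can be sketched as a small user-space check. This is an illustration only, not the kernel's code: policy_args_valid() and the 64-bit nodemask_t are hypothetical, the numeric constants are assumed to mirror <linux/mempolicy.h>, and the requirement that Bind and Interleave receive a non-empty nodemask is an assumption not stated in the patch text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror <linux/mempolicy.h>; verify against your headers. */
enum { MPOL_DEFAULT = 0, MPOL_PREFERRED = 1, MPOL_BIND = 2, MPOL_INTERLEAVE = 3 };
#define MPOL_F_STATIC_NODES   (1u << 15)
#define MPOL_F_RELATIVE_NODES (1u << 14)

/* Hypothetical stand-in for a nodemask: bit N set means node N is in the set. */
typedef uint64_t nodemask_t;

/* Hypothetical helper: true if (mode, flags, nodes) obeys the documented rules. */
static bool policy_args_valid(int mode, unsigned flags, nodemask_t nodes)
{
    bool is_static = (flags & MPOL_F_STATIC_NODES) != 0;
    bool is_relative = (flags & MPOL_F_RELATIVE_NODES) != 0;

    /* The two flags cannot be combined. */
    if (is_static && is_relative)
        return false;

    switch (mode) {
    case MPOL_DEFAULT:
        /* It is an error for the node set to be non-empty. */
        return nodes == 0;
    case MPOL_PREFERRED:
        /* Empty nodemask means local allocation, which allows neither flag. */
        if (nodes == 0)
            return !is_static && !is_relative;
        return true;
    case MPOL_BIND:
    case MPOL_INTERLEAVE:
        /* Assumption: these modes need at least one node to work with. */
        return nodes != 0;
    default:
        return false;
    }
}
```

For instance, policy_args_valid(MPOL_PREFERRED, MPOL_F_STATIC_NODES, 0) is rejected, while the same call with flags of 0 is accepted as a request for local allocation.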
@@ -231,6 +212,154 @@ Components of Memory Policies | |||
231 | the temporary interleaved system default policy works in this | 212 | the temporary interleaved system default policy works in this |
232 | mode. | 213 | mode. |
233 | 214 | ||
215 | Linux memory policy supports the following optional mode flags: | ||
216 | |||
217 | MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by | ||
218 | the user should not be remapped if the task or VMA's set of allowed | ||
219 | nodes changes after the memory policy has been defined. | ||
220 | |||
221 | Without this flag, anytime a mempolicy is rebound because of a | ||
222 | change in the set of allowed nodes, the node (Preferred) or | ||
223 | nodemask (Bind, Interleave) is remapped to the new set of | ||
224 | allowed nodes. This may result in nodes being used that were | ||
225 | previously undesired. | ||
226 | |||
227 | With this flag, if the user-specified nodes overlap with the | ||
228 | nodes allowed by the task's cpuset, then the memory policy is | ||
229 | applied to their intersection. If the two sets of nodes do not | ||
230 | overlap, the Default policy is used. | ||
231 | |||
232 | For example, consider a task that is attached to a cpuset with | ||
233 | mems 1-3 that sets an Interleave policy over the same set. If | ||
234 | the cpuset's mems change to 3-5, the Interleave will now occur | ||
235 | over nodes 3, 4, and 5. With this flag, however, since only node | ||
236 | 3 is allowed from the user's nodemask, the "interleave" only | ||
237 | occurs over that node. If no nodes from the user's nodemask are | ||
238 | now allowed, the Default behavior is used. | ||
239 | |||
240 | MPOL_F_STATIC_NODES cannot be combined with the | ||
241 | MPOL_F_RELATIVE_NODES flag. It also cannot be used for | ||
242 | MPOL_PREFERRED policies that were created with an empty nodemask | ||
243 | (local allocation). | ||
244 | |||
245 | MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed | ||
246 | by the user will be mapped relative to the task or VMA's set of | ||
247 | allowed nodes. The kernel stores the user-passed nodemask, and | ||
248 | if the allowed nodes change, then that original nodemask will | ||
249 | be remapped relative to the new set of allowed nodes. | ||
250 | |||
251 | Without this flag (and without MPOL_F_STATIC_NODES), anytime a | ||
252 | mempolicy is rebound because of a change in the set of allowed | ||
253 | nodes, the node (Preferred) or nodemask (Bind, Interleave) is | ||
254 | remapped to the new set of allowed nodes. That remap may not | ||
255 | preserve the relative nature of the user's passed nodemask to its | ||
256 | set of allowed nodes upon successive rebinds: a nodemask of | ||
257 | 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of | ||
258 | allowed nodes is restored to its original state. | ||
259 | |||
260 | With this flag, the remap is done so that the node numbers from | ||
261 | the user's passed nodemask are relative to the set of allowed | ||
262 | nodes. In other words, if nodes 0, 2, and 4 are set in the user's | ||
263 | nodemask, the policy will be effected over the first (and in the | ||
264 | Bind or Interleave case, the third and fifth) nodes in the set of | ||
265 | allowed nodes. The nodemask passed by the user represents nodes | ||
266 | relative to task or VMA's set of allowed nodes. | ||
267 | |||
268 | If the user's nodemask includes nodes that are outside the range | ||
269 | of the new set of allowed nodes (for example, node 5 is set in | ||
270 | the user's nodemask when the set of allowed nodes is only 0-3), | ||
271 | then the remap wraps around to the beginning of the nodemask and, | ||
272 | if not already set, sets the node in the mempolicy nodemask. | ||
273 | |||
274 | For example, consider a task that is attached to a cpuset with | ||
275 | mems 2-5 that sets an Interleave policy over the same set with | ||
276 | MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the | ||
277 | interleave now occurs over nodes 3,5-7. If the cpuset's mems | ||
278 | then change to 0,2-3,5, then the interleave occurs over nodes | ||
279 | 0,2-3,5. | ||
280 | |||
281 | Thanks to the consistent remapping, applications preparing | ||
282 | nodemasks to specify memory policies using this flag should | ||
283 | disregard their current, actual cpuset imposed memory placement | ||
284 | and prepare the nodemask as if they were always located on | ||
285 | memory nodes 0 to N-1, where N is the number of memory nodes the | ||
286 | policy is intended to manage. Let the kernel then remap to the | ||
287 | set of memory nodes allowed by the task's cpuset, as that may | ||
288 | change over time. | ||
289 | |||
290 | MPOL_F_RELATIVE_NODES cannot be combined with the | ||
291 | MPOL_F_STATIC_NODES flag. It also cannot be used for | ||
292 | MPOL_PREFERRED policies that were created with an empty nodemask | ||
293 | (local allocation). | ||
294 | |||
295 | MEMORY POLICY REFERENCE COUNTING | ||
296 | |||
297 | To resolve use/free races, struct mempolicy contains an atomic reference | ||
298 | count field. Internal interfaces, mpol_get()/mpol_put() increment and | ||
299 | decrement this reference count, respectively. mpol_put() will only free | ||
300 | the structure back to the mempolicy kmem cache when the reference count | ||
301 | goes to zero. | ||
302 | |||
303 | When a new memory policy is allocated, its reference count is initialized | ||
304 | to '1', representing the reference held by the task that is installing the | ||
305 | new policy. When a pointer to a memory policy structure is stored in another | ||
306 | structure, another reference is added, as the task's reference will be dropped | ||
307 | on completion of the policy installation. | ||
308 | |||
309 | During run-time "usage" of the policy, we attempt to minimize atomic operations | ||
310 | on the reference count, as this can lead to cache lines bouncing between cpus | ||
311 | and NUMA nodes. "Usage" here means one of the following: | ||
312 | |||
313 | 1) querying of the policy, either by the task itself [using the get_mempolicy() | ||
314 | API discussed below] or by another task using the /proc/<pid>/numa_maps | ||
315 | interface. | ||
316 | |||
317 | 2) examination of the policy to determine the policy mode and associated node | ||
318 | or node lists, if any, for page allocation. This is considered a "hot | ||
319 | path". Note that for MPOL_BIND, the "usage" extends across the entire | ||
320 | allocation process, which may sleep during page reclamation, because the | ||
321 | BIND policy nodemask is used, by reference, to filter ineligible nodes. | ||
322 | |||
323 | We can avoid taking an extra reference during the usages listed above as | ||
324 | follows: | ||
325 | |||
326 | 1) we never need to get/free the system default policy as this is never | ||
327 | changed nor freed, once the system is up and running. | ||
328 | |||
329 | 2) for querying the policy, we do not need to take an extra reference on the | ||
330 | target task's task policy nor vma policies because we always acquire the | ||
331 | task's mm's mmap_sem for read during the query. The set_mempolicy() and | ||
332 | mbind() APIs [see below] always acquire the mmap_sem for write when | ||
333 | installing or replacing task or vma policies. Thus, there is no possibility | ||
334 | of a task or thread freeing a policy while another task or thread is | ||
335 | querying it. | ||
336 | |||
337 | 3) Page allocation usage of task or vma policy occurs in the fault path where | ||
338 | we hold the mmap_sem for read. Again, because replacing the task or vma | ||
339 | policy requires that the mmap_sem be held for write, the policy can't be | ||
340 | freed out from under us while we're using it for page allocation. | ||
341 | |||
342 | 4) Shared policies require special consideration. One task can replace a | ||
343 | shared memory policy while another task, with a distinct mmap_sem, is | ||
344 | querying or allocating a page based on the policy. To resolve this | ||
345 | potential race, the shared policy infrastructure adds an extra reference | ||
346 | to the shared policy during lookup while holding a spin lock on the shared | ||
347 | policy management structure. This requires that we drop this extra | ||
348 | reference when we're finished "using" the policy. We must drop the | ||
349 | extra reference on shared policies in the same query/allocation paths | ||
350 | used for non-shared policies. For this reason, shared policies are marked | ||
351 | as such, and the extra reference is dropped "conditionally"--i.e., only | ||
352 | for shared policies. | ||
353 | |||
354 | Because of this extra reference counting, and because we must lookup | ||
355 | shared policies in a tree structure under spinlock, shared policies are | ||
356 | more expensive to use in the page allocation path. This is especially | ||
357 | true for shared policies on shared memory regions shared by tasks running | ||
358 | on different NUMA nodes. This extra overhead can be avoided by always | ||
359 | falling back to task or system default policy for shared memory regions, | ||
360 | or by prefaulting the entire shared memory region into memory and locking | ||
361 | it down. However, this might not be appropriate for all applications. | ||
362 | |||
234 | MEMORY POLICY APIs | 363 | MEMORY POLICY APIs |
235 | 364 | ||
236 | Linux supports 3 system calls for controlling memory policy. These APIS | 365 | Linux supports 3 system calls for controlling memory policy. These APIS |
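The two remapping behaviors added by this hunk can be sketched in user space as follows. This is a minimal model, not the kernel implementation: nodemask_t is a hypothetical 64-bit mask (bit N set means node N), and remap_relative() assumes the "wrap around, then map onto the allowed set" description means folding the user's node numbers modulo the number of allowed nodes and then selecting the corresponding positions in the allowed set.

```c
#include <stdint.h>

/* Hypothetical stand-in for a nodemask: bit N set means node N is in the set. */
typedef uint64_t nodemask_t;

/* Count the nodes present in a mask. */
static int weight(nodemask_t m)
{
    int n = 0;
    while (m) {
        n += (int)(m & 1);
        m >>= 1;
    }
    return n;
}

/* MPOL_F_STATIC_NODES: no remapping, only the intersection with the allowed
 * set is used. An empty result means the Default behavior applies. */
static nodemask_t remap_static(nodemask_t user, nodemask_t allowed)
{
    return user & allowed;
}

/* MPOL_F_RELATIVE_NODES (assumed model): bit i of the user's mask selects the
 * (i mod N)-th allowed node, where N is the number of allowed nodes. */
static nodemask_t remap_relative(nodemask_t user, nodemask_t allowed)
{
    int n = weight(allowed);
    nodemask_t folded = 0, result = 0;
    int pos = 0;

    if (n == 0)
        return 0;

    /* Fold: wrap node numbers that fall outside 0..N-1 back to the start. */
    for (int bit = 0; bit < 64; bit++)
        if (user & (1ULL << bit))
            folded |= 1ULL << (bit % n);

    /* Onto: the pos-th allowed node is taken if relative bit pos is set. */
    for (int bit = 0; bit < 64; bit++) {
        if (!(allowed & (1ULL << bit)))
            continue;
        if (folded & (1ULL << pos))
            result |= 1ULL << bit;
        pos++;
    }
    return result;
}
```

Under this model, a user nodemask of 2-5 (0x3c) with allowed nodes 3-7 (0xf8) yields nodes 3,5-7 (0xe8), and a static mask of 1-3 against allowed nodes 3-5 leaves only node 3.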
@@ -251,7 +380,9 @@ Set [Task] Memory Policy: | |||
251 | Set's the calling task's "task/process memory policy" to mode | 380 | Set's the calling task's "task/process memory policy" to mode |
252 | specified by the 'mode' argument and the set of nodes defined | 381 | specified by the 'mode' argument and the set of nodes defined |
253 | by 'nmask'. 'nmask' points to a bit mask of node ids containing | 382 | by 'nmask'. 'nmask' points to a bit mask of node ids containing |
254 | at least 'maxnode' ids. | 383 | at least 'maxnode' ids. Optional mode flags may be passed by |
384 | combining the 'mode' argument with the flag (for example: | ||
385 | MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). | ||
255 | 386 | ||
256 | See the set_mempolicy(2) man page for more details | 387 | See the set_mempolicy(2) man page for more details |
257 | 388 | ||
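Because the optional flags travel in the same 'mode' argument, the receiving side has to separate them again. The sketch below shows one way to do that split; split_mode() and struct mode_parts are hypothetical names, and the constant values are assumed to match <linux/mempolicy.h>. With libnuma's <numaif.h>, the corresponding call would look like set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES, nmask, maxnode).

```c
/* Assumed to mirror <linux/mempolicy.h>; verify against your headers. */
enum { MPOL_DEFAULT = 0, MPOL_PREFERRED = 1, MPOL_BIND = 2, MPOL_INTERLEAVE = 3 };
#define MPOL_F_STATIC_NODES   (1u << 15)
#define MPOL_F_RELATIVE_NODES (1u << 14)
#define MPOL_MODE_FLAGS       (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)

/* Hypothetical container for the two halves of the combined argument. */
struct mode_parts {
    int mode;        /* one of the MPOL_* modes */
    unsigned flags;  /* zero or more MPOL_F_* mode flags */
};

/* Separate the optional mode flags from the behavioral mode. */
static struct mode_parts split_mode(int mode_arg)
{
    struct mode_parts p;

    p.flags = (unsigned)mode_arg & MPOL_MODE_FLAGS;
    p.mode = (int)((unsigned)mode_arg & ~MPOL_MODE_FLAGS);
    return p;
}
```

Keeping the flags in high bits leaves the small integer mode values untouched, so callers that pass a bare mode continue to work unchanged.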
@@ -303,29 +434,19 @@ MEMORY POLICIES AND CPUSETS | |||
303 | Memory policies work within cpusets as described above. For memory policies | 434 | Memory policies work within cpusets as described above. For memory policies |
304 | that require a node or set of nodes, the nodes are restricted to the set of | 435 | that require a node or set of nodes, the nodes are restricted to the set of |
305 | nodes whose memories are allowed by the cpuset constraints. If the nodemask | 436 | nodes whose memories are allowed by the cpuset constraints. If the nodemask |
306 | specified for the policy contains nodes that are not allowed by the cpuset, or | 437 | specified for the policy contains nodes that are not allowed by the cpuset and |
307 | the intersection of the set of nodes specified for the policy and the set of | 438 | MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes |
308 | nodes with memory is the empty set, the policy is considered invalid | 439 | specified for the policy and the set of nodes with memory is used. If the |
309 | and cannot be installed. | 440 | result is the empty set, the policy is considered invalid and cannot be |
310 | 441 | installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped | |
311 | The interaction of memory policies and cpusets can be problematic for a | 442 | onto and folded into the task's set of allowed nodes as previously described. |
312 | couple of reasons: | 443 | |
313 | 444 | The interaction of memory policies and cpusets can be problematic when tasks | |
314 | 1) the memory policy APIs take physical node id's as arguments. As mentioned | 445 | in two cpusets share access to a memory region, such as shared memory segments |
315 | above, it is illegal to specify nodes that are not allowed in the cpuset. | 446 | created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and |
316 | The application must query the allowed nodes using the get_mempolicy() | 447 | any of the tasks install shared policy on the region, only nodes whose |
317 | API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and | 448 | memories are allowed in both cpusets may be used in the policies. Obtaining |
318 | restrict itself to those nodes. However, the resources available to a | 449 | this information requires "stepping outside" the memory policy APIs to use the |
319 | cpuset can be changed by the system administrator, or a workload manager | 450 | cpuset information and requires that one know in what cpusets other task might |
320 | application, at any time. So, a task may still get errors attempting to | 451 | be attaching to the shared region. Furthermore, if the cpusets' allowed |
321 | specify policy nodes, and must query the allowed memories again. | 452 | memory sets are disjoint, "local" allocation is the only valid policy. |
322 | |||
323 | 2) when tasks in two cpusets share access to a memory region, such as shared | ||
324 | memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and | ||
325 | MAP_SHARED flags, and any of the tasks install shared policy on the region, | ||
326 | only nodes whose memories are allowed in both cpusets may be used in the | ||
327 | policies. Obtaining this information requires "stepping outside" the | ||
328 | memory policy APIs to use the cpuset information and requires that one | ||
329 | know in what cpusets other task might be attaching to the shared region. | ||
330 | Furthermore, if the cpusets' allowed memory sets are disjoint, "local" | ||
331 | allocation is the only valid policy. | ||
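The cpuset rules in this hunk reduce to set intersections, which the following sketch makes concrete for the case where MPOL_F_RELATIVE_NODES is not used. It is an illustration under stated assumptions: nodemask_t is a hypothetical 64-bit mask (bit N set means node N), restrict_to_cpuset() and shared_region_nodes() are made-up helper names, and returning -EINVAL for an empty intersection is an assumed way to express "the policy is considered invalid".

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a nodemask: bit N set means node N is in the set. */
typedef uint64_t nodemask_t;

/* Restrict a requested node set to the nodes the cpuset allows.
 * Stores the usable nodes in *effective and returns 0, or returns -EINVAL
 * when nothing is left and the policy cannot be installed. */
static int restrict_to_cpuset(nodemask_t requested, nodemask_t cpuset_mems,
                              nodemask_t *effective)
{
    nodemask_t usable = requested & cpuset_mems;

    if (usable == 0)
        return -EINVAL;
    *effective = usable;
    return 0;
}

/* Nodes usable in a shared policy on a region shared by tasks in two cpusets.
 * A result of 0 means the cpusets are disjoint, leaving only "local"
 * allocation as a valid policy. */
static nodemask_t shared_region_nodes(nodemask_t mems_a, nodemask_t mems_b)
{
    return mems_a & mems_b;
}
```

A task would still need to obtain both cpusets' memory sets from outside the memory policy APIs before it could call something like shared_region_nodes().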