Diffstat (limited to 'Documentation/vm/numa_memory_policy.txt')
-rw-r--r-- | Documentation/vm/numa_memory_policy.txt | 68 |
1 files changed, 68 insertions, 0 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 27b9507a3769..6719d642653f 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -311,6 +311,74 @@ Components of Memory Policies | |||
311 | MPOL_PREFERRED policies that were created with an empty nodemask | 311 | MPOL_PREFERRED policies that were created with an empty nodemask |
312 | (local allocation). | 312 | (local allocation). |
313 | 313 | ||
314 | MEMORY POLICY REFERENCE COUNTING | ||
315 | |||
316 | To resolve use/free races, struct mempolicy contains an atomic reference | ||
317 | count field. The internal interfaces mpol_get()/mpol_put() increment and | ||
318 | decrement this reference count, respectively. mpol_put() will only free | ||
319 | the structure back to the mempolicy kmem cache when the reference count | ||
320 | goes to zero. | ||
321 | |||
322 | When a new memory policy is allocated, its reference count is initialized | ||
323 | to '1', representing the reference held by the task that is installing the | ||
324 | new policy. When a pointer to a memory policy structure is stored in another | ||
325 | structure, another reference is added, as the task's reference will be dropped | ||
326 | on completion of the policy installation. | ||
327 | |||
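The reference-count lifecycle described above can be illustrated with a small userspace sketch. This is not the kernel implementation: the names policy_sketch, holder_sketch, sketch_new(), sketch_get(), sketch_put() and sketch_install() are hypothetical stand-ins for struct mempolicy, a structure holding a policy pointer, and the mpol_get()/mpol_put() interfaces, and C11 atomics with malloc()/free() are assumed in place of the kernel's atomic_t and the mempolicy kmem cache.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct mempolicy. */
struct policy_sketch {
    atomic_int refcnt;   /* atomic reference count */
    int mode;            /* placeholder for the policy mode */
};

/* Hypothetical structure that stores a pointer to a policy. */
struct holder_sketch {
    struct policy_sketch *policy;
};

static int sketch_frees; /* counts frees so the lifecycle can be observed */

/* Allocate a policy; the count starts at 1 for the installing task. */
static struct policy_sketch *sketch_new(int mode)
{
    struct policy_sketch *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    atomic_init(&p->refcnt, 1);
    p->mode = mode;
    return p;
}

/* Analogue of mpol_get(): take one more reference. */
static void sketch_get(struct policy_sketch *p)
{
    atomic_fetch_add(&p->refcnt, 1);
}

/* Analogue of mpol_put(): free only when the last reference is dropped. */
static void sketch_put(struct policy_sketch *p)
{
    /* atomic_fetch_sub() returns the value before the decrement. */
    if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
        sketch_frees++;
        free(p);
    }
}

/* Install a new policy in the holder, replacing any previous one. */
static void sketch_install(struct holder_sketch *h, int mode)
{
    struct policy_sketch *p = sketch_new(mode);
    struct policy_sketch *old = h->policy;

    if (!p)
        return;
    sketch_get(p);       /* reference owned by the stored pointer */
    h->policy = p;
    sketch_put(p);       /* installer's reference dropped on completion */
    if (old)
        sketch_put(old); /* the replaced policy loses the holder's reference */
}
```

After sketch_install() returns, the stored policy holds exactly one reference, the one owned by the holder. Installing a second policy drops that reference on the old one, which frees it if nothing else holds a reference.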
328 | During run-time "usage" of the policy, we attempt to minimize atomic operations | ||
329 | on the reference count, as this can lead to cache lines bouncing between cpus | ||
330 | and NUMA nodes. "Usage" here means one of the following: | ||
331 | |||
332 | 1) querying of the policy, either by the task itself [using the get_mempolicy() | ||
333 | API discussed below] or by another task using the /proc/<pid>/numa_maps | ||
334 | interface. | ||
335 | |||
336 | 2) examination of the policy to determine the policy mode and associated node | ||
337 | or node lists, if any, for page allocation. This is considered a "hot | ||
338 | path". Note that for MPOL_BIND, the "usage" extends across the entire | ||
339 | allocation process, which may sleep during page reclamation, because the | ||
340 | BIND policy nodemask is used, by reference, to filter ineligible nodes. | ||
341 | |||
342 | We can avoid taking an extra reference during the usages listed above as | ||
343 | follows: | ||
344 | |||
345 | 1) we never need to get/free the system default policy as this is never | ||
346 | changed nor freed, once the system is up and running. | ||
347 | |||
348 | 2) for querying the policy, we do not need to take an extra reference on the | ||
349 | target task's task policy nor vma policies because we always acquire the | ||
350 | task's mm's mmap_sem for read during the query. The set_mempolicy() and | ||
351 | mbind() APIs [see below] always acquire the mmap_sem for write when | ||
352 | installing or replacing task or vma policies. Thus, there is no possibility | ||
353 | of a task or thread freeing a policy while another task or thread is | ||
354 | querying it. | ||
355 | |||
356 | 3) Page allocation usage of task or vma policy occurs in the fault path where | ||
357 | we hold the mmap_sem for read. Again, because replacing the task or vma | ||
358 | policy requires that the mmap_sem be held for write, the policy can't be | ||
359 | freed out from under us while we're using it for page allocation. | ||
360 | |||
361 | 4) Shared policies require special consideration. One task can replace a | ||
362 | shared memory policy while another task, with a distinct mmap_sem, is | ||
363 | querying or allocating a page based on the policy. To resolve this | ||
364 | potential race, the shared policy infrastructure adds an extra reference | ||
365 | to the shared policy during lookup while holding a spin lock on the shared | ||
366 | policy management structure. This requires that we drop this extra | ||
367 | reference when we're finished "using" the policy. We must drop the | ||
368 | extra reference on shared policies in the same query/allocation paths | ||
369 | used for non-shared policies. For this reason, shared policies are marked | ||
370 | as such, and the extra reference is dropped "conditionally"--i.e., only | ||
371 | for shared policies. | ||
372 | |||
373 | Because of this extra reference counting, and because we must look up | ||
374 | shared policies in a tree structure under spinlock, shared policies are | ||
375 | more expensive to use in the page allocation path. This is especially | ||
376 | true for shared policies on shared memory regions shared by tasks running | ||
377 | on different NUMA nodes. This extra overhead can be avoided by always | ||
378 | falling back to task or system default policy for shared memory regions, | ||
379 | or by prefaulting the entire shared memory region into memory and locking | ||
380 | it down. However, this might not be appropriate for all applications. | ||
381 | |||
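The shared-policy handling in item 4 above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: struct spol, struct spol_store and the spol_*() helpers are hypothetical names, a single slot stands in for the tree of shared policies, and a C11 atomic_flag spin loop stands in for the kernel spin lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical policy object; 'shared' marks policies that take the
 * extra lookup reference. */
struct spol {
    atomic_int refcnt;
    bool shared;
};

/* Hypothetical stand-in for the shared policy management structure:
 * one slot instead of a tree, guarded by a simple spin lock. */
struct spol_store {
    atomic_flag lock;
    struct spol *pol;
};

static int spol_frees;   /* counts frees so the behavior can be observed */

static void spol_store_init(struct spol_store *s)
{
    atomic_flag_clear(&s->lock);
    s->pol = NULL;
}

static void spol_lock(struct spol_store *s)
{
    while (atomic_flag_test_and_set(&s->lock))
        ;                /* spin until the flag was previously clear */
}

static void spol_unlock(struct spol_store *s)
{
    atomic_flag_clear(&s->lock);
}

static struct spol *spol_alloc(bool shared)
{
    struct spol *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    atomic_init(&p->refcnt, 1);
    p->shared = shared;
    return p;
}

static void spol_put(struct spol *p)
{
    if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
        spol_frees++;
        free(p);
    }
}

/* Lookup takes the extra reference while the lock is held, so a
 * concurrent replace cannot free the policy before the caller uses it. */
static struct spol *spol_lookup(struct spol_store *s)
{
    struct spol *p;

    spol_lock(s);
    p = s->pol;
    if (p)
        atomic_fetch_add(&p->refcnt, 1);
    spol_unlock(s);
    return p;
}

/* Replace swaps the pointer under the lock; the store takes over the
 * caller's reference to newp and drops its reference to the old policy. */
static void spol_replace(struct spol_store *s, struct spol *newp)
{
    struct spol *old;

    spol_lock(s);
    old = s->pol;
    s->pol = newp;
    spol_unlock(s);
    if (old)
        spol_put(old);
}

/* Conditional put: common query/allocation paths call this for every
 * policy, but only shared policies hold an extra reference to drop. */
static void spol_cond_put(struct spol *p)
{
    if (p->shared)
        spol_put(p);
}
```

In this sketch a policy obtained through spol_lookup() stays valid across a concurrent spol_replace() and is freed only when the user calls spol_cond_put(). For a non-shared policy spol_cond_put() does nothing, matching the "conditional" drop described in the text.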
314 | MEMORY POLICY APIs | 382 | MEMORY POLICY APIs |
315 | 383 | ||
316 | Linux supports 3 system calls for controlling memory policy. These APIS | 384 | Linux supports 3 system calls for controlling memory policy. These APIS |