Diffstat (limited to 'Documentation/vm/numa_memory_policy.txt')
-rw-r--r-- Documentation/vm/numa_memory_policy.txt 68
1 files changed, 68 insertions, 0 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 27b9507a3769..6719d642653f 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -311,6 +311,74 @@ Components of Memory Policies
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

MEMORY POLICY REFERENCE COUNTING

To resolve use/free races, struct mempolicy contains an atomic reference
count field. The internal interfaces mpol_get() and mpol_put() increment and
decrement this reference count, respectively. mpol_put() frees the structure
back to the mempolicy kmem cache only when the reference count goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.
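
The counting rules above can be sketched in userspace C. This is only an
illustration using C11 atomics, not the kernel's implementation: struct policy,
struct holder, policy_new(), policy_get(), policy_put() and holder_install()
are hypothetical names standing in for struct mempolicy, the structure that
stores the policy pointer, and the mpol_get()/mpol_put() interfaces.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct mempolicy. */
struct policy {
	atomic_int refcnt;	/* number of holders of a pointer to this object */
	int mode;		/* placeholder for the policy mode */
};

/* Hypothetical structure that stores a policy pointer (think: task or vma). */
struct holder {
	struct policy *pol;
};

static int policies_freed;	/* demo-only counter so that frees are observable */

/* A new policy starts with one reference, owned by the installing task. */
static struct policy *policy_new(int mode)
{
	struct policy *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	atomic_init(&p->refcnt, 1);
	p->mode = mode;
	return p;
}

static void policy_get(struct policy *p)
{
	atomic_fetch_add(&p->refcnt, 1);
}

/* Free the object only when the last reference is dropped. */
static void policy_put(struct policy *p)
{
	if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
		free(p);
		policies_freed++;
	}
}

/* Storing the pointer takes its own reference; a replaced policy loses one. */
static void holder_install(struct holder *h, struct policy *p)
{
	struct policy *old = h->pol;

	policy_get(p);
	h->pol = p;
	if (old)
		policy_put(old);
}
```

In this sketch an installing task would call policy_new(), then
holder_install(), then policy_put() to drop its own reference, leaving the
holder's reference as the only one.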

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as these can cause cache lines to bounce between CPUs
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation. This is considered a "hot
   path". Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to take or drop a reference on the system default policy, as
   it is neither changed nor freed once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy or vma policies, because we always acquire the
   task's mm's mmap_sem for read during the query. The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_sem for write when
   installing or replacing task or vma policies. Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) page allocation usage of task or vma policy occurs in the fault path, where
   we hold the mmap_sem for read. Again, because replacing the task or vma
   policy requires that the mmap_sem be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) shared policies require special consideration. One task can replace a
   shared memory policy while another task, with a distinct mmap_sem, is
   querying or allocating a page based on the policy. To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure. This requires that we drop this extra
   reference when we're finished "using" the policy. We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies. For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally", i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under a spin lock, shared policies are
   more expensive to use in the page allocation path. This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes. This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down. However, this might not be appropriate for all applications.
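
The lookup-time reference and the conditional drop described in item 4 can be
sketched in userspace C as follows. All names here are hypothetical and chosen
for illustration only: a pthread mutex stands in for the spin lock, a single
slot stands in for the tree of shared policies, and freeing the policy when the
count reaches zero is left out to keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mempolicy; field names are illustrative. */
struct policy {
	atomic_int refcnt;
	bool shared;	/* marks policies that live in a shared policy structure */
};

/* Hypothetical shared policy container; one slot instead of a tree. */
struct shared_policy {
	pthread_mutex_t lock;	/* stands in for the spin lock */
	struct policy *pol;
};

/* Look up the shared policy and pin it before the lock is released. */
static struct policy *shared_policy_lookup(struct shared_policy *sp)
{
	struct policy *p;

	pthread_mutex_lock(&sp->lock);
	p = sp->pol;
	if (p)
		atomic_fetch_add(&p->refcnt, 1);
	pthread_mutex_unlock(&sp->lock);
	return p;
}

/*
 * Common query/allocation path: use the shared policy if there is one,
 * otherwise fall back to a non-shared policy.  The fallback is assumed to
 * be protected by other means (the mmap_sem in the text), so no reference
 * is taken on it.
 */
static struct policy *policy_for_use(struct shared_policy *sp,
				     struct policy *fallback)
{
	struct policy *p = sp ? shared_policy_lookup(sp) : NULL;

	return p ? p : fallback;
}

/* Drop the lookup reference only for shared policies; true if dropped. */
static bool policy_cond_put(struct policy *p)
{
	if (!p->shared)
		return false;
	atomic_fetch_sub(&p->refcnt, 1);
	return true;
}
```

A caller in this sketch pairs every policy_for_use() with a policy_cond_put()
once it has finished with the policy, so the same path serves both kinds of
policy and only the shared kind pays for the atomic operations.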

MEMORY POLICY APIs

Linux supports 3 system calls for controlling memory policy. These APIs