author Lee Schermerhorn <lee.schermerhorn@hp.com> 2008-04-28 05:13:16 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2008-04-28 11:58:24 -0400
commit 52cd3b074050dd664380b5e8cfc85d4a6ed8ad48
tree fcfcf55c0e81376ea34919fab26e29bedd7f3b88 /Documentation
parent a6020ed759404372e8be2b276e85e51735472cc9
mempolicy: rework mempolicy Reference Counting [yet again]

After further discussion with Christoph Lameter, it has become clear that my earlier attempts to clean up the mempolicy reference counting were a bit of overkill in some areas, resulting in superfluous ref/unref in what are usually fast paths. In other areas, further inspection reveals that I botched the unref for interleave policies.

A separate patch, suitable for upstream/stable trees, fixes up the known errors in the previous attempt to fix reference counting.

This patch reworks the memory policy reference counting and, one hopes, simplifies the code. Maybe I'll get it right this time.

See the update to the numa_memory_policy.txt document for a discussion of memory policy reference counting that motivates this patch.

Summary:

Lookup of a mempolicy based on (vma, address) need only add a reference for shared policies, and we need only unref the policy when finished for shared policies. So, this patch backs out all of the unneeded extra reference counting added by my previous attempt. It then unrefs only shared policies when we're finished with them, using the mpol_cond_put() [conditional put] helper function introduced by this patch.

Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma containing just the policy. read_swap_cache_async() can call alloc_page_vma() multiple times, so we can't let alloc_page_vma() unref the shared policy in this case. To avoid this, we make a copy of any non-null shared policy and remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading a page [or multiple pages] from swap, so the overhead should not be an issue here.

I introduced a new static inline function "mpol_cond_copy()" to copy the shared policy to an on-stack policy and remove the flags that would require a conditional free. The current implementation of mpol_cond_copy() assumes that the struct mempolicy contains no pointers to dynamically allocated structures that must be duplicated or reference counted during copy.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
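
The conditional copy described in the commit message can be illustrated with a small user-space sketch. This is not the kernel's mpol_cond_copy(); struct demo_policy, DEMO_F_SHARED and the demo_* helpers are hypothetical names invented for illustration, and the sketch assumes that the lookup reference on the original is dropped as soon as the copy exists. Like the commit message, it also assumes the structure holds no pointers that would need a deep copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define DEMO_F_SHARED 0x1u	/* hypothetical flag, playing the role of MPOL_F_SHARED */

/* Hypothetical stand-in for struct mempolicy. */
struct demo_policy {
	atomic_int refcnt;
	unsigned int flags;
	int mode;
};

/* Allocate a policy holding one reference for the caller. */
struct demo_policy *demo_new(unsigned int flags, int mode)
{
	struct demo_policy *pol = malloc(sizeof(*pol));

	if (!pol)
		return NULL;
	atomic_init(&pol->refcnt, 1);
	pol->flags = flags;
	pol->mode = mode;
	return pol;
}

/* Drop one reference; free the policy when the count reaches zero. */
void demo_put(struct demo_policy *pol)
{
	int old = atomic_fetch_sub(&pol->refcnt, 1);

	assert(old > 0);	/* catch an unbalanced put */
	if (old == 1)
		free(pol);
}

/*
 * If 'from' is a shared policy, copy it into the caller's storage 'to',
 * clear the shared flag on the copy and drop the lookup reference on the
 * original (an assumption of this sketch).  The copy can then be used any
 * number of times without a conditional put.  Non-shared policies are
 * returned unchanged.
 */
struct demo_policy *demo_cond_copy(struct demo_policy *to,
				   struct demo_policy *from)
{
	if (!from || !(from->flags & DEMO_F_SHARED))
		return from;
	atomic_init(&to->refcnt, 1);
	to->flags = from->flags & ~DEMO_F_SHARED;
	to->mode = from->mode;
	demo_put(from);
	return to;
}
```

A caller would pass the address of an on-stack struct demo_policy as 'to' and use the returned pointer for every subsequent allocation in the loop.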
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/vm/numa_memory_policy.txt | 68
1 files changed, 68 insertions, 0 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 27b9507a3769..6719d642653f 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -311,6 +311,74 @@ Components of Memory Policies
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

MEMORY POLICY REFERENCE COUNTING

To resolve use/free races, struct mempolicy contains an atomic reference
count field. The internal interfaces mpol_get() and mpol_put() increment and
decrement this reference count, respectively. mpol_put() frees the structure
back to the mempolicy kmem cache only when the reference count goes to
zero.
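
The get/put pattern described above can be sketched in user-space C with C11 atomics. This is only an illustration, not the kernel code: struct demo_policy and the demo_* functions are hypothetical names, and malloc()/free() stand in for the mempolicy kmem cache.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mempolicy; only the refcount matters. */
struct demo_policy {
	atomic_int refcnt;
	int mode;
};

/* Counts frees so the effect of the final put can be observed. */
static int demo_frees;

/* Allocate a policy holding one reference for the caller. */
struct demo_policy *demo_alloc(int mode)
{
	struct demo_policy *pol = malloc(sizeof(*pol));

	if (!pol)
		return NULL;
	atomic_init(&pol->refcnt, 1);
	pol->mode = mode;
	return pol;
}

/* Take an additional reference. */
void demo_get(struct demo_policy *pol)
{
	atomic_fetch_add(&pol->refcnt, 1);
}

/*
 * Drop a reference.  The structure is freed only by the caller that takes
 * the count from 1 to 0.  Returns 1 if the policy was freed, else 0.
 */
int demo_put(struct demo_policy *pol)
{
	int old = atomic_fetch_sub(&pol->refcnt, 1);

	assert(old > 0);	/* catch an unbalanced put */
	if (old != 1)
		return 0;
	free(pol);
	demo_frees++;
	return 1;
}
```

Because the decrement is a single atomic read-modify-write, exactly one caller observes the transition to zero, so the free cannot happen twice or while another reference is still held.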

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.
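
The installation sequence in the paragraph above can be sketched as follows. This is a simplified user-space model under stated assumptions: struct demo_holder is a hypothetical stand-in for whatever structure stores the policy pointer (a task or a vma, for example), all demo_* names are invented, and the locking that a real installation path needs is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mempolicy. */
struct demo_policy {
	atomic_int refcnt;
	int mode;
};

/* Hypothetical structure that stores a pointer to a policy. */
struct demo_holder {
	struct demo_policy *policy;
};

/* Allocate a policy; the count starts at 1 for the installing task. */
struct demo_policy *demo_new(int mode)
{
	struct demo_policy *pol = malloc(sizeof(*pol));

	if (!pol)
		return NULL;
	atomic_init(&pol->refcnt, 1);
	pol->mode = mode;
	return pol;
}

void demo_get(struct demo_policy *pol)
{
	atomic_fetch_add(&pol->refcnt, 1);
}

/* Drop a reference; free at zero.  Returns 1 if the policy was freed. */
int demo_put(struct demo_policy *pol)
{
	int old = atomic_fetch_sub(&pol->refcnt, 1);

	assert(old > 0);	/* catch an unbalanced put */
	if (old != 1)
		return 0;
	free(pol);
	return 1;
}

/* Install a new policy in 'h'.  Returns 0 on success, -1 on failure. */
int demo_install(struct demo_holder *h, int mode)
{
	struct demo_policy *pol = demo_new(mode);	/* count 1: installer */
	struct demo_policy *old;

	if (!pol)
		return -1;
	demo_get(pol);		/* count 2: reference for the stored pointer */
	old = h->policy;
	h->policy = pol;
	if (old)
		demo_put(old);	/* holder no longer points at the old policy */
	demo_put(pol);		/* installation done: installer's reference */
	return 0;
}
```

After demo_install() returns, the only remaining reference is the one that belongs to the stored pointer, which matches the rule that the installing task's reference is dropped on completion.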

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation. This is considered a "hot
   path". Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_sem for read during the query. The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_sem for write when
   installing or replacing task or vma policies. Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_sem for read. Again, because replacing the task or vma
   policy requires that the mmap_sem be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration. One task can replace a
   shared memory policy while another task, with a distinct mmap_sem, is
   querying or allocating a page based on the policy. To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure. This requires that we drop this extra
   reference when we're finished "using" the policy. We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies. For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path. This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes. This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down. However, this might not be appropriate for all applications.
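
The shared-policy rule in item 4 above (take an extra reference during lookup under a spin lock, then drop it conditionally) can be sketched in user-space C. This is an illustration under stated assumptions rather than the kernel implementation: all demo_* names and DEMO_F_SHARED are hypothetical, a C11 atomic_flag spin lock stands in for the kernel spin lock, and a single pointer stands in for the tree of shared policies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define DEMO_F_SHARED 0x1u	/* hypothetical "this policy is shared" marker */

/* Hypothetical stand-in for struct mempolicy. */
struct demo_policy {
	atomic_int refcnt;
	unsigned int flags;
};

/* Hypothetical shared-policy container guarded by a simple spin lock. */
struct demo_shared {
	atomic_flag lock;
	struct demo_policy *policy;
};

struct demo_policy *demo_new(unsigned int flags)
{
	struct demo_policy *pol = malloc(sizeof(*pol));

	if (!pol)
		return NULL;
	atomic_init(&pol->refcnt, 1);
	pol->flags = flags;
	return pol;
}

void demo_shared_init(struct demo_shared *sp, struct demo_policy *pol)
{
	atomic_flag_clear(&sp->lock);
	sp->policy = pol;
}

/* Look up the shared policy, taking the extra reference under the lock. */
struct demo_policy *demo_shared_lookup(struct demo_shared *sp)
{
	struct demo_policy *pol;

	while (atomic_flag_test_and_set_explicit(&sp->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	pol = sp->policy;
	if (pol)
		atomic_fetch_add(&pol->refcnt, 1);
	atomic_flag_clear_explicit(&sp->lock, memory_order_release);
	return pol;
}

/* Drop one reference; free the policy when the count reaches zero. */
void demo_put(struct demo_policy *pol)
{
	int old = atomic_fetch_sub(&pol->refcnt, 1);

	assert(old > 0);	/* catch an unbalanced put */
	if (old == 1)
		free(pol);
}

/* Conditional put: only shared policies carry a lookup reference. */
void demo_cond_put(struct demo_policy *pol)
{
	if (pol && (pol->flags & DEMO_F_SHARED))
		demo_put(pol);
}
```

The same demo_cond_put() call can sit at the end of a path that handles both kinds of policy: it is a flag test and nothing more for non-shared policies, so the atomic operation is paid only in the shared case.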

MEMORY POLICY APIs

Linux supports 3 system calls for controlling memory policy. These APIs