aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local PolicyLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | Now that we're using "preferred local" policy for system default, we need to make this as fast as possible. Because of the variable size of the mempolicy structure [based on size of nodemasks], the preferred_node may be in a different cacheline from the mode. This can result in accessing an extra cacheline in the normal case of system default policy. Suspect this is the cause of an observed 2-3% slowdown in page fault testing relative to kernel without this patch series. To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy flags member which is guaranteed [?] to be in the same cacheline as the mode itself. Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1 for both anon and shmem segments with system default and vma [preferred local] policy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: mPOL_PREFERRED cleanups for "local allocation"Lee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here are a couple of "cleanups" for MPOL_PREFERRED behavior when v.preferred_node < 0 -- i.e., "local allocation": 1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask() to fetch the nodemask associated with a policy. Currently, get_policy_nodemask() returns the set of nodes with memory, when the policy 'mode' is 'PREFERRED, and the preferred_node is < 0. Change to return an empty nodemask, as this is what was specified to achieve "local allocation". 2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is called to adjust any task and vma policy nodes to be valid in the new cpuset. However, when the policy is MPOL_PREFERRED, and the preferred_node is <0, no rebind is necessary. The "local allocation" indication is valid in any cpuset. Existing code will "do the right thing" because node_remap() will just return the argument node when it is outside of the valid range of node ids. However, I think it is clearer and cleaner to skip the remap explicitly in this case. 3) mpol_to_str() produces a printable, "human readable" string from a struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show "local", as this indicates local allocation, as the task migrates among nodes. Note that this matches the usage of "local allocation" in libnuma() and numactl. Without this change, I believe that node_set() [via set_bit()] will set bit 31, resulting in a misleading display. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: use MPOL_PREFERRED for system-wide default policyLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API [set_mempolicy(), mbind() and internal versions], the kernel simply installs a NULL struct mempolicy pointer in the appropriate context: task policy, vma policy, or shared policy. This causes any use of that policy to "fall back" to the next most specific policy scope. The only use of MPOL_DEFAULT to mean "local allocation" is in the system default policy. This requires extra checks/cases for MPOL_DEFAULT in many mempolicy.c functions. There is another, "preferred" way to specify local allocation via the APIs. That is using the MPOL_PREFERRED policy mode with an empty nodemask. Internally, the empty nodemask gets converted to a preferred_node id of '-1'. All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the node local to the cpu where the allocation occurs. System default policy, except during boot, is hard-coded to "local allocation". By using the MPOL_PREFERRED mode with a negative value of preferred node for system default policy, MPOL_DEFAULT will never occur in the 'policy' member of a struct mempolicy. Thus, we can remove all checks for MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation paths. In slab_node() return local node id when policy pointer is NULL. No need to set a pol value to take the switch default. Replace switch default with BUG()--i.e., shouldn't happen. With this patch MPOL_DEFAULT is only used in the APIs, including internal calls to do_set_mempolicy() and in the display of policy in /proc/<pid>/numa_maps. It always means "fall back" to the the next most specific policy scope. This simplifies the description of memory policies quite a bit, with no visible change in behavior. get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when the requested policy [task or vma/shared] is NULL. These are the values one would supply via set_mempolicy() or mbind() to achieve that condition--default behavior. This patch updates Documentation to reflect this change. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: rework mempolicy Reference Counting [yet again]Lee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After further discussion with Christoph Lameter, it has become clear that my earlier attempts to clean up the mempolicy reference counting were a bit of overkill in some areas, resulting in superflous ref/unref in what are usually fast paths. In other areas, further inspection reveals that I botched the unref for interleave policies. A separate patch, suitable for upstream/stable trees, fixes up the known errors in the previous attempt to fix reference counting. This patch reworks the memory policy referencing counting and, one hopes, simplifies the code. Maybe I'll get it right this time. See the update to the numa_memory_policy.txt document for a discussion of memory policy reference counting that motivates this patch. Summary: Lookup of mempolicy, based on (vma, address) need only add a reference for shared policy, and we need only unref the policy when finished for shared policies. So, this patch backs out all of the unneeded extra reference counting added by my previous attempt. It then unrefs only shared policies when we're finished with them, using the mpol_cond_put() [conditional put] helper function introduced by this patch. Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma containing just the policy. read_swap_cache_async() can call alloc_page_vma() multiple times, so we can't let alloc_page_vma() unref the shared policy in this case. To avoid this, we make a copy of any non-null shared policy and remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading a page [or multiple pages] from swap, so the overhead should not be an issue here. I introduced a new static inline function "mpol_cond_copy()" to copy the shared policy to an on-stack policy and remove the flags that would require a conditional free. The current implementation of mpol_cond_copy() assumes that the struct mempolicy contains no pointers to dynamically allocated structures that must be duplicated or reference counted during copy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: document {set|get}_policy() vm_ops APIsLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | Document mempolicy return value reference semantics assumed by the rest of the mempolicy code for the set_ and get_policy vm_ops in <linux/mm.h>--where the prototypes are defined--to inform any future mempolicy vm_op writers what the rest of the subsystem expects of them. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: mark shared policies for unrefLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of yet another rework of mempolicy reference counting, we want to be able to identify shared policies efficiently, because they have an extra ref taken on lookup that needs to be removed when we're finished using the policy. Note: the extra ref is required because the policies are shared between tasks/processes and can be changed/freed by one task while another task is using them--e.g., for page allocation. Building on David Rientjes mempolicy "mode flags" enhancement, this patch indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags member of the struct mempolicy added by David. MPOL_F_SHARED, and any future "internal mode flags" are reserved from bit zero up, as they will never be passed in the upper bits of the mode argument of a mempolicy API. I set the MPOL_F_SHARED flag when the policy is installed in the shared policy rb-tree. Don't need/want to clear the flag when removing from the tree as the mempolicy is freed [unref'd] internally to the sp_delete() function. However, a task could hold another reference on this mempolicy from a prior lookup. We need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will unref, eventually freeing, the mempolicy. A later patch in this series will introduce a function to conditionally unref [mpol_free] a policy. The MPOL_F_SHARED flag is one reason [currently the only reason] to unref/free a policy via the conditional free. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: rename struct mempolicy 'policy' member to 'mode'Lee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | The terms 'policy' and 'mode' are both used in various places to describe the semantics of the value stored in the 'policy' member of struct mempolicy. Furthermore, the term 'policy' is used to refer to that member, to the entire struct mempolicy and to the more abstract concept of the tuple consisting of a "mode" and an optional node or set of nodes. Recently, we have added "mode flags" that are passed in the upper bits of the 'mode' [or sometimes, 'policy'] member of the numa APIs. I'd like to resolve this confusion, which perhaps only exists in my mind, by renaming the 'policy' member to 'mode' throughout, and fixing up the Documentation. Man pages will be updated separately. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: fixup Fallback for Default Shmem PolicyLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | get_vma_policy() is not handling fallback to task policy correctly when the get_policy() vm_op returns NULL. The NULL overwrites the 'pol' variable that was holding the fallback task mempolicy. So, it was falling back directly to system default policy. Fix get_vma_policy() to use only non-NULL policy returned from the vma get_policy op. shm_get_policy() was falling back to current task's mempolicy if the "backing file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and the vma policy is null. This is incorrect for show_numa_maps() which is likely querying the numa_maps of some task other than current. Remove this fallback. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: write lock mmap_sem while changing task mempolicyLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | A read of /proc/<pid>/numa_maps holds the target task's mmap_sem for read while examining each vma's mempolicy. A vma's mempolicy can fall back to the task's policy. However, the task could be changing it's task policy and free the one that the show_numa_maps() is examining. To prevent this, grab the mmap_sem for write when updating task mempolicy. Pointed out to me by Christoph Lameter and extracted and reworked from Christoph's alternative mempol reference counting patch. This is analogous to the way that do_mbind() and do_get_mempolicy() prevent races between task's sharing an mm_struct [a.k.a. threads] setting and querying a mempolicy for a particular address. Note: this is necessary, but not sufficient, to allow us to stop taking an extra reference on "other task's mempolicy" in get_vma_policy. Subsequent patches will complete this update, allowing us to simplify the tests for whether we need to unref a mempolicy at various points in the code. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: rename mpol_copy to mpol_dupLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | This patch renames mpol_copy() to mpol_dup() because, well, that's what it does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an existing mempolicy, allocates a new one and copies the contents. In a later patch, I want to use the name mpol_copy() to copy the contents from one mempolicy to another like, e.g., strcpy() does for strings. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: rename mpol_free to mpol_putLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | This is a change that was requested some time ago by Mel Gorman. Makes sense to me, so here it is. Note: I retain the name "mpol_free_shared_policy()" because it actually does free the shared_policy, which is NOT a reference counted object. However, ... The mempolicy object[s] referenced by the shared_policy are reference counted, so mpol_put() is used to release the reference held by the shared_policy. The mempolicy might not be freed at this time, because some task attached to the shared object associated with the shared policy may be in the process of allocating a page based on the mempolicy. In that case, the task performing the allocation will hold a reference on the mempolicy, obtained via mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks holding such a reference have called mpol_put() for the mempolicy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Subject: [PATCH] hugetlb: vmstat events for huge page allocationsAdam Litke2008-04-28
| | | | | | | | | | | | | | | | | | | | | | Allocating huge pages directly from the buddy allocator is not guaranteed to succeed. Success depends on several factors (such as the amount of physical memory available and the level of fragmentation). With the addition of dynamic hugetlb pool resizing, allocations can occur much more frequently. For these reasons it is desirable to keep track of huge page allocation successes and failures. Add two new vmstat entries to track huge page allocations that succeed and fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE being enabled. [akpm@linux-foundation.org: reduced ifdeffery] Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Eric Munson <ebmunson@us.ibm.com> Tested-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andy Whitcroft <apw@shadowen.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmcoreinfo: add page flags valuesKen'ichi Ohmichi2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | Add some values of page flags to the vmcoreinfo data. The vmcoreinfo data has the minimum debugging information only for dump filtering. makedumpfile (dump filtering command) gets it to distinguish unnecessary pages, and makedumpfile creates a small dumpfile. An old makedumpfile (v1.2.4 or before) had assumed some values of page flags internally, and this implementation could not follow the change of these values. For example, Christoph Lameter is changing these values by the follwing patch: http://lkml.org/lkml/2008/2/29/463 So a new makedumpfile (v1.2.5) came to need these values and I created this patch to let the kernel output them. Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* s390: implement pte special bitNick Piggin2008-04-28
| | | | | | | | | | | | | | | | | | | Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for the user mappings. This requires the get_xip_page API to be changed to an address based one. Improve the API layering a little bit too, while we're here. This is required in order to support XIP filesystems on memory that isn't backed with struct page (but memory with struct page is still supported too). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xip: support non-struct page backed memoryNick Piggin2008-04-28
| | | | | | | | | | | | | | | | | | | Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for the user mappings. This requires the get_xip_page API to be changed to an address based one. Improve the API layering a little bit too, while we're here. This is required in order to support XIP filesystems on memory that isn't backed with struct page (but memory with struct page is still supported too). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* return pfn from direct_access, for XIPJared Hulbert2008-04-28
| | | | | | | | | | | | | | | | | | | | | Alter the block device ->direct_access() API to work with the new get_xip_mem() API (that requires both kaddr and pfn are returned). Some architectures will not do the right thing in their virt_to_page() for use by XIP (to translate from the kernel virtual address returned by direct_access(), to a user mappable pfn in XIP's page fault handler. However, we can't switch it to just return the pfn and not the kaddr, because we have no good way to get a kva from a pfn, and XIP requires the kva for its read(2) and write(2) handlers. So we have to return both. Signed-off-by: Jared Hulbert <jaredeh@gmail.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-mm@kvack.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add vm_insert_mixedNick Piggin2008-04-28
| | | | | | | | | | | | | | | | | vm_insert_mixed will insert either a raw pfn or a refcounted struct page into the page tables, depending on whether vm_normal_page() will return the page or not. With the introduction of the new pte bit, this is now a too tricky for drivers to be doing themselves. filemap_xip uses this in a subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce pte_special pte bitNick Piggin2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory model (which is more dynamic than most). Instead, they had proposed to implement it with an additional path through vm_normal_page(), using a bit in the pte to determine whether or not the page should be refcounted: vm_normal_page() { ... if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { #ifdef s390 if (!mixedmap_refcount_pte(pte)) return NULL; #else if (!pfn_valid(pfn)) return NULL; #endif goto out; } ... } This is fine, however if we are allowed to use a bit in the pte to determine refcountedness, we can use that to _completely_ replace all the vma based schemes. So instead of adding more cases to the already complex vma-based scheme, we can have a clearly seperate and simple pte-based scheme (and get slightly better code generation in the process): vm_normal_page() { #ifdef s390 if (!mixedmap_refcount_pte(pte)) return NULL; return pte_page(pte); #else ... #endif } And finally, we may rather make this concept usable by any architecture rather than making it s390 only, so implement a new type of pte state for this. Unfortunately the old vma based code must stay, because some architectures may not be able to spare pte bits. This makes vm_normal_page a little bit more ugly than we would like, but the 2 cases are clearly seperate. So introduce a pte_special pte state, and use it in mm/memory.c. It is currently a noop for all architectures, so this doesn't actually result in any compiled code changes to mm/memory.o. BTW: I haven't put vm_normal_page() into arch code as-per an earlier suggestion. The reason is that, regardless of where vm_normal_page is actually implemented, the *abstraction* is still exactly the same. Also, while it depends on whether the architecture has pte_special or not, that is the only two possible cases, and it really isn't an arch specific function -- the role of the arch code should be to provide primitive functions and accessors with which to build the core code; pte_special does that. We do not want architectures to know or care about vm_normal_page itself, and we definitely don't want them being able to invent something new there out of sight of mm/ code. If we made vm_normal_page an arch function, then we have to make vm_insert_mixed (next patch) an arch function too. So I don't think moving it to arch code fundamentally improves any abstractions, while it does practically make the code more difficult to follow, for both mm and arch developers, and easier to misuse. [akpm@linux-foundation.org: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce VM_MIXEDMAPJared Hulbert2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This series introduces some important infrastructure work. The overall result is that: 1. We now support XIP backed filesystems using memory that have no struct page allocated to them. And patches 6 and 7 actually implement this for s390. This is pretty important in a number of cases. As far as I understand, in the case of virtualisation (eg. s390), each guest may mount a readonly copy of the same filesystem (eg. the distro). Currently, guests need to allocate struct pages for this image. So if you have 100 guests, you already need to allocate more memory for the struct pages than the size of the image. I think. (Carsten?) For other (eg. embedded) systems, you may have a very large non- volatile filesystem. If you have to have struct pages for this, then your RAM consumption will go up proportionally to fs size. Even though it is just a small proportion, the RAM can be much more costly eg in terms of power, so every KB less that Linux uses makes it more attractive to a lot of these guys. 2. VM_MIXEDMAP allows us to support mappings where you actually do want to refcount _some_ pages in the mapping, but not others, and support COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM filesystem in progress. Future iterations of this filesystem will most likely want to migrate pages between pagecache and XIP backing, which is where the requirement for mixed (some refcounted, some not) comes from. 3. pte_special also has a peripheral usage that I need for my lockless get_user_pages patch. That was shown to speed up "oltp" on db2 by 10% on a 2 socket system, which is kind of significant because they scrounge for months to try to find 0.1% improvement on these workloads. I'm hoping we might finally be faster than AIX on pSeries with this :). My reference to lockless get_user_pages is not meant to justify this patchset (which doesn't include lockless gup), but just to show that pte_special is not some s390 specific thing that should be hidden in arch code or xip code: I definitely want to use it on at least x86 and powerpc as well. This patch: Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in that it can support COW mappings of arbitrary ranges including ranges without struct page *and* ranges with a struct page that we actually want to refcount (PFNMAP can only support COW in those cases where the un-COW-ed translations are mapped linearly in the virtual address, and can only support non refcounted ranges). VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it needs to avoid refcounting pfn_valid pages eg. for /dev/mem mappings). Signed-off-by: Jared Hulbert <jaredeh@gmail.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* smaps: account swap entriesPeter Zijlstra2008-04-28
| | | | | | | | | | | | Show the amount of swap for each vma. This can be used to see where all the swap goes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Matt Mackall <mpm@selenic.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PAGEFLAGS_EXTENDED and separate page flags for Head and TailChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having separate page flags for the head and the tail of a compound page allows the compiler to use bitops instead of operations on a word to check for a tail page. That is f.e. important for virt_to_head_page() which is used in various critical code paths (kfree for example): Code for PageTail(page) Before: mov (%rdi),%rdx page->flags mov %rdx,%rax 3 bytes and $0x12000,%eax 5 bytes cmp $0x12000,%rax 6 bytes je 897 <kfree+0xa7> After: mov (%rdi),%rax test $0x40,%ah (3 bytes) jne 887 <kfree+0x97> So we go from 14 bytes to 3 bytes and from 3 instructions to one. From the use of 2 registers we go to none. We can only use page flags for this if we have page flags available. This patch introduces CONFIG_PAGEFLAGS_EXTENDED that is set if pageflags are not scarce due to SPARSEMEM using page flags for its sectionid on 32 bit NUMA platforms. Additional page flag definitions can be added to the CONFIG_PAGEFLAGS_EXTENDED section in page-flags.h if the functionality depends on PAGEFLAGS_EXTENDED or if more page flag overlapping tricks are used for the !PAGEFLAGS_EXTENDED fallback (the upcoming virtual compound patch may hook in here and Rik's/Lee's additional page flags to solve the reclaim issues could also be added there [hint... hint... where are these patchsets?]). Avoiding the overlaying of Pg_reclaim also clears the way for possible use of compound pages for the pagecache or on the LRU. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: Get rid of __ZONE_COUNTChristoph Lameter2008-04-28
| | | | | | | | | | It was used to compensate because MAX_NR_ZONES was not available to the #ifdefs. Export MAX_NR_ZONES via the new mechanism and get rid of __ZONE_COUNT. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page flags: add PAGEFLAGS_FALSE for flags that are always falseChristoph Lameter2008-04-28
| | | | | | | | | Turns out that there are a number of times that a flag is simply always returning 0. Define a macro for that. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page flags: handle PG_uncached like all other flagsChristoph Lameter2008-04-28
| | | | | | | | | | | | Remove the special setup for PG_uncached and simply make it part of the enum. The page flag will only be allocated when the kernel build includes the uncached allocator. Acked-by: Dean Nelson <dcn@sgi.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pageflags: eliminate PG_xxx aliasesChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | Remove aliases of PG_xxx. We can easily drop those now and alias by specifying the PG_xxx flag in the macro that generates the functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pageflags: use proper page flag functions in XenChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | Xen uses bitops to manipulate page flags. Make it use proper page flag functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pageflags: convert to the use of new macrosChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | Replace explicit definitions of page flags through the use of macros. Significantly reduces the size of the definitions and removes a lot of opportunity for errors. Additonal page flags can typically be generated with a single line. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pageflags: introduce macros to generate page flag functionsChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a set of macros that generate functions to handle page flags. A page flag function group typically starts with either SETPAGEFLAG(<part of function name>,<part of PG_ flagname>) to create a set of page flag operations that are atomic. Or __SETPAGEFLAG(<part of function name>,<part of PG_ flagname) to create a set of page flag operations that are not atomic. Then additional operations can be added using the following macros TESTSCFLAG Create additional atomic test-and-set and test-and-clear functions TESTSETFLAG Create additional test and set function TESTCLEARFLAG Create additional test and clear function SETPAGEFLAG Create additional atomic set function CLEARPAGEFLAG Create additional atomic clear function __TESTPAGEFLAG Create additional non atomic set function __SETPAGEFLAG Create additional non atomic clear function Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pageflags: get rid of FLAGS_RESERVEDChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | | | | NR_PAGEFLAGS specifies the number of page flags we are using. From that we can calculate the number of bits leftover that can be used for zone, node (and maybe the sections id). There is no need anymore for FLAGS_RESERVED if we use NR_PAGEFLAGS. Use the new methods to make NR_PAGEFLAGS available via the preprocessor. NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields. These field widths have to be available to the preprocessor. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Miller <davem@davemloft.net> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pageflags: use an enum for the flagsChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | Use an enum to ease the maintenance of page flags. This is going to change the numbering from 0 to 18. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pageflags: standardize comment inclusion in asm-offsets.h and fix MIPSChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | Add the ability to pass comments into asm-offsets.h by generating asm output like -># comment line Mips needs this feature to preserve the comments that are in asm-mips/asm-offsets.h right now. Then remove the special handling for mips from Kbuild and convert mips to use the new string to include the comments. Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page_mapping(): add ifdef around reference to swapper_spaceAndrew Morton2008-04-28
| | | | | | | | | | This fixes the superh build when the pageflags patches are applied. But it shouldn't unless it's a gcc bug. Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kbuild: create a way to create preprocessor constants from C expressionsChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | | The use of enums create constants that are not available to the preprocessor when building the kernel (f.e. MAX_NR_ZONES). Arch code already has a way to export constants calculated to the preprocessor through the asm-offsets.c file. Generate something similar for the core kernel through kbuild. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sparsemem: vmemmap does not need section bitsChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | A set of patches that attempts to improve page flag handling. First of all a method is introduced to generate the page flag functions using macros. Then the number of page flags used by sparsemem is reduced. All page flag operations will no longer be macros. All flags will use inline function. Then we add a way to export enum constants to the preprocessor which allows us to get rid of __ZONE_COUNT and use the NR_PAGEFLAGS for the dynamic calculation of actually available page flags for fields. This patch: Sparsemem vmemmap does not need any section bits. This patch has the effect of reducing the number of bits used in page->flags by at least 6. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmallocinfo: add caller informationChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add caller information so that /proc/vmallocinfo shows where the allocation request for a slice of vmalloc memory originated. Results in output like this: 0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages 0xffffc20000801000-0xffffc20000806000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc 0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages 0xffffc20000c07000-0xffffc20000c0a000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc 0xffffc20000c0a000-0xffffc20000c0c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c0c000-0xffffc20000c0f000 12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap 0xffffc20000c10000-0xffffc20000c15000 20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap 0xffffc20000c16000-0xffffc20000c18000 8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap 0xffffc20000c18000-0xffffc20000c1a000 8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap 0xffffc20000c1a000-0xffffc20000c1c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c1c000-0xffffc20000c1e000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c1e000-0xffffc20000c20000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c20000-0xffffc20000c22000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c22000-0xffffc20000c24000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c24000-0xffffc20000c26000 8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap 0xffffc20000c26000-0xffffc20000c28000 8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap 0xffffc20000c28000-0xffffc20000c2d000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc 0xffffc20000c2d000-0xffffc20000c31000 16384 tcp_init+0xd5/0x31c pages=3 vmalloc 0xffffc20000c31000-0xffffc20000c34000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc 0xffffc20000c34000-0xffffc20000c36000 8192 init_vdso_vars+0xde/0x1f1 0xffffc20000c36000-0xffffc20000c38000 8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap 0xffffc20000c38000-0xffffc20000c3a000 8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap 0xffffc20000c3a000-0xffffc20000c3e000 16384 sys_swapon+0x509/0xa15 pages=3 vmalloc 0xffffc20000c40000-0xffffc20000c61000 135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap 0xffffc20000c61000-0xffffc20000c6a000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20000c6a000-0xffffc20000c73000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20000c73000-0xffffc20000c7c000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20000c7c000-0xffffc20000c7f000 12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc 0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap 0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc 0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages 0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages 0xffffc20002204000-0xffffc2000220d000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc2000220d000-0xffffc20002216000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20002216000-0xffffc2000221f000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc2000221f000-0xffffc20002228000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20002228000-0xffffc20002231000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20002231000-0xffffc20002234000 12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc 0xffffc20002240000-0xffffc20002261000 135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap 0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages 0xffffffffa0000000-0xffffffffa0022000 139264 module_alloc+0x4f/0x55 pages=33 vmalloc 0xffffffffa0022000-0xffffffffa0029000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc 0xffffffffa002b000-0xffffffffa0034000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc 0xffffffffa0034000-0xffffffffa003d000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc 0xffffffffa003d000-0xffffffffa0049000 49152 module_alloc+0x4f/0x55 pages=11 vmalloc 0xffffffffa0049000-0xffffffffa0050000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmalloc: show vmalloced areas via /proc/vmallocinfoChristoph Lameter2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement a new proc file that allows the display of the currently allocated vmalloc memory. It allows to see the users of vmalloc. That is important if vmalloc space is scarce (i386 for example). And it's going to be important for the compound page fallback to vmalloc. Many of the current users can be switched to use compound pages with fallback. This means that the number of users of vmalloc is reduced and page tables no longer necessary to access the memory. /proc/vmallocinfo allows to review how that reduction occurs. If memory becomes fragmented and larger order allocations are no longer possible then /proc/vmallocinfo allows to see which compound page allocations fell back to virtual compound pages. That is important for new users of virtual compound pages. Such as order 1 stack allocation etc that may fallback to virtual compound pages in the future. /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible information leakage. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: CONFIG_MMU=n build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: make early_pfn_to_nid() a C functionAndrew Morton2008-04-28
| | | | | | | | | | | | | | | | | | | | | Fix this (sparc64) mm/sparse-vmemmap.c: In function `vmemmap_verify': mm/sparse-vmemmap.c:64: warning: unused variable `pfn' by switching to a C function which touches its arg. (reason 3,555 why macros are bad) Also, the `nid' arg was misnamed. Reviewed-by: Christoph Lameter <clameter@sgi.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rotate_reclaimable_page() cleanupMiklos Szeredi2008-04-28
| | | | | | | | | | | | Clean up messy conditional calling of test_clear_page_writeback() from both rotate_reclaimable_page() and end_page_writeback(). The only user of rotate_reclaimable_page() is end_page_writeback() so this is OK. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c: fix indentationS.Caglar Onur2008-04-28
| | | | | | | | | | | zlc_setup(): handle jiffies wraparound (10ed273f5016c582413dfbc468dd084957d847e1) changes tab with spaces Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: save some bytes in mm_struct by filling holes on 64bitAndi Kleen2008-04-28
| | | | | | | | | | | Save some bytes in mm_struct by filling holes Putting int values together for better packing on 64bit shrinks sizeof(struct mm_struct) from 776 bytes to 764 bytes. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON tooAndi Kleen2008-04-28
| | | | | | | | | | | | Previously it was only enabled for CONFIG_DEBUG_SLAB. Not hooked into the slub runtime debug configuration, so you currently only get it with CONFIG_SLUB_DEBUG_ON, not plain CONFIG_SLUB_DEBUG Acked-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: fix parsing of tmpfs mpol mount optionLee Schermerhorn2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Parsing of new mode flags in the tmpfs mpol mount option is slightly broken: Setting a valid flag works OK: #mount -o remount,mpol=bind=static:1-2 /dev/shm #mount ... tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2) ... However, we can't remove them or change them, once we've set a valid flag: #mount -o remount,mpol=bind:1-2 /dev/shm #mount ... tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2) ... It SAYS it removed it, but that's just a copy of the input string. If we now try to set it to a different flag, we get: #mount -o remount,mpol=bind=relative:1-2 /dev/shm mount: /dev/shm not mounted already, or bad option And on the console, we see: tmpfs: Bad value 'bind' for mount option 'mpol' ^ lost remainder of string Furthermore, bogus flags are accepted with out error. Granted, they are a no-op: #mount -o remount,mpol=interleave=foo:0-3 /dev/shm #mount ... tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3) Again, that's just a copy of the input string shown by the mount command. This patch fixes the behavior by pre-zeroing the flags so that only one of the mutually exclusive flags can be set at one time. It also reports an error when an unrecognized flag is specified. The check for both flags being set is removed because it can't happen with this implementation. If we ever want to support multiple non-exclusive flags, this area will need rework and we will need to check that any mutually exclusive flags aren't specified. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: disallow static or relative flags for local preferred modeDavid Rientjes2008-04-28
| | | | | | | | | | | | | | | | | | | | | MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES don't mean anything for MPOL_PREFERRED policies that were created with an empty nodemask (for purely local allocations). They'll never be invalidated because the allowed mems of a task changes or need to be rebound relative to a cpuset's placement. Also fixes a bug identified by Lee Schermerhorn that disallowed empty nodemasks to be passed to MPOL_PREFERRED to specify local allocations. [A different, somewhat incomplete, patch already existed in 25-rc5-mm1.] Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: small header file cleanupDavid Rientjes2008-04-28
| | | | | | | | | | | | | | | | | | Removes forward definition of vm_area_struct in linux/mempolicy.h. We already get it from the linux/slab.h -> linux/gfp.h include. Removes the unused mpol_set_vma_default() macro from linux/mempolicy.h. Removes the extern definition of default_policy since it is only referenced, as it should be, in mm/mempolicy.c. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: create mempolicy_operations structureDavid Rientjes2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | Create a mempolicy_operations structure that currently points to two functions[*] for the various modes: int (*create)(struct mempolicy *, const nodemask_t *); void (*rebind)(struct mempolicy *, const nodemask_t *); This splits the implementation for the various modes out of two large functions, mpol_new() and mpol_rebind_policy(). Eventually it may be beneficial to add additional functions to accomodate the existing switch() statements in mm/mempolicy.c. [*] The ->create() function for MPOL_DEFAULT is currently NULL since no struct mempolicy is dynamically allocated. [Lee.Schermerhorn@hp.com: fix regression in the package mempolicy regression tests] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: move rebind functionsDavid Rientjes2008-04-28
| | | | | | | | | | | | | Move the mpol_rebind_{policy,task,mm}() functions after mpol_new() to avoid having to declare function prototypes. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: update NUMA memory policy documentationDavid Rientjes2008-04-28
| | | | | | | | | | | | | | Updates Documentation/vm/numa_memory_policy.txt and Documentation/filesystems/tmpfs.txt to describe optional mempolicy mode flags. Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: add MPOL_F_RELATIVE_NODES flagDavid Rientjes2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies nodemasks passed via set_mempolicy() or mbind() should be considered relative to the current task's mems_allowed. When the mempolicy is created, the passed nodemask is folded and mapped onto the current task's mems_allowed. For example, consider a task using set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a nodemask of 1-3. If current's mems_allowed is 4-7, the effected nodemask is 5-7 (the second, third, and fourth node of mems_allowed). If the same task is attached to a cpuset, the mempolicy nodemask is rebound each time the mems are changed. Some possible rebinds and results are: mems result 1-3 1-3 1-7 2-4 1,5-6 1,5-6 1,5-7 5-7 Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned to the resultant nodemask from the relative remap. In the MPOL_PREFERRED case, the preferred node is remapped from the currently effected nodemask to the relative nodemask. This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: add bitmap_onto() and bitmap_fold() operationsPaul Jackson2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following adds two more bitmap operators, bitmap_onto() and bitmap_fold(), with the usual cpumask and nodemask wrappers. The bitmap_onto() operator computes one bitmap relative to another. If the n-th bit in the origin mask is set, then the m-th bit of the destination mask will be set, where m is the position of the n-th set bit in the relative mask. The bitmap_fold() operator folds a bitmap into a second that has bit m set iff the input bitmap has some bit n set, where m == n mod sz, for the specified sz value. There are two substantive changes between this patch and its predecessor bitmap_relative: 1) Renamed bitmap_relative() to be bitmap_onto(). 2) Added bitmap_fold(). The essential motivation for bitmap_onto() is to provide a mechanism for converting a cpuset-relative CPU or Node mask to an absolute mask. Cpuset relative masks are written as if the current task were in a cpuset whose CPUs or Nodes were just the consecutive ones numbered 0..N-1, for some N. The bitmap_onto() operator is provided in anticipation of adding support for the first such cpuset relative mask, by the mbind() and set_mempolicy() system calls, using a planned flag of MPOL_F_RELATIVE_NODES. These bitmap operators (and their nodemask wrappers, in particular) will be used in code that converts the user specified cpuset relative memory policy to a specific system node numbered policy, given the current mems_allowed of the tasks cpuset. Such cpuset relative mempolicies will address two deficiencies of the existing interface between cpusets and mempolicies: 1) A task cannot at present reliably establish a cpuset relative mempolicy because there is an essential race condition, in that the tasks cpuset may be changed in between the time the task can query its cpuset placement, and the time the task can issue the applicable mbind or set_memplicy system call. 2) A task cannot at present establish what cpuset relative mempolicy it would like to have, if it is in a smaller cpuset than it might have mempolicy preferences for, because the existing interface only allows specifying mempolicies for nodes currently allowed by the cpuset. Cpuset relative mempolicies are useful for tasks that don't distinguish particularly between one CPU or Node and another, but only between how many of each are allowed, and the proper placement of threads and memory pages on the various CPUs and Nodes available. The motivation for the added bitmap_fold() can be seen in the following example. Let's say an application has specified some mempolicies that presume 16 memory nodes, including say a mempolicy that specified MPOL_F_RELATIVE_NODES (cpuset relative) nodes 12-15. Then lets say that application is crammed into a cpuset that only has 8 memory nodes, 0-7. If one just uses bitmap_onto(), this mempolicy, mapped to that cpuset, would ignore the requested relative nodes above 7, leaving it empty of nodes. That's not good; better to fold the higher nodes down, so that some nodes are included in the resulting mapped mempolicy. In this case, the mempolicy nodes 12-15 are taken modulo 8 (the weight of the mems_allowed of the confining cpuset), resulting in a mempolicy specifying nodes 4-7. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: <kosaki.motohiro@jp.fujitsu.com> Cc: <ray-lk@madrabbit.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: add MPOL_F_STATIC_NODES flagDavid Rientjes2008-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the node remap when the policy is rebound. Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of a union with cpuset_mems_allowed: struct mempolicy { ... union { nodemask_t cpuset_mems_allowed; nodemask_t user_nodemask; } w; } that stores the the nodemask that the user passed when he or she created the mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's passed nodemask intersected with the VMA or task's allowed nodes is always used when determining the preferred node, setting the MPOL_BIND zonelist, or creating the interleave nodemask. This happens whenever the policy is rebound, including when a task's cpuset assignment changes or the cpuset's mems are changed. This creates an interesting side-effect in that it allows the mempolicy "intent" to lie dormant and uneffected until it has access to the node(s) that it desires. For example, if you currently ask for an interleaved policy over a set of nodes that you do not have access to, the mempolicy is not created and the task continues to use the previous policy. With this change, however, it is possible to create the same mempolicy; it is only effected when access to nodes in the nodemask is acquired. It is also possible to mount tmpfs with the static nodemask behavior when specifying a node or nodemask. To do this, simply add "=static" immediately following the mempolicy mode at mount time: mount -o remount mpol=interleave=static:1-3 Also removes mpol_check_policy() and folds its logic into mpol_new() since it is now obsoleted. The unused vma_mpol_equal() is also removed. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>