aboutsummaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAge
...
* | | | mm: stop ptlock enlarging struct pageHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t, and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that. When 2.6.15 placed the split page table lock inside struct page (usually sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by letting the spinlock_t occupy both page->private and page->mapping). Should these debugging options be allowed to double the size of a struct page, when only one minority use of the page (as a page table) needs to fit a spinlock in there? Perhaps not. Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer, kmallocing a cacheline for the spinlock when it doesn't fit, but given up each time. Falling back to mm->page_table_lock (as we do when ptlock is not split) lets lockdep check out the strictest path anyway. And now that some arches allow 8192 cpus, use 999999 for infinity. (What has this got to do with KSM swapping? It doesn't care about the size of struct page, but may care about random junk in page->mapping - to be explained separately later.) Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: pass address down to rmap onesHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KSM swapping will know where page_referenced_one() and try_to_unmap_one() should look. It could hack page->index to get them to do what it wants, but it seems cleaner now to pass the address down to them. Make the same change to page_mkclean_one(), since it follows the same pattern; but there's no real need in its case. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: CONFIG_MMU for PG_mlockedHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove three degrees of obfuscation, left over from when we had CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only built when CONFIG_MMU, so don't need such conditions at all. Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from 169 defconfigs: leave those to evolve in due course. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: mlocking in try_to_unmap_oneHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's contorted mlock/munlock handling in try_to_unmap_anon() and try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping. Simplify it by moving it all down into try_to_unmap_one(). One thing is then lost, try_to_munlock()'s distinction between when no vma holds the page mlocked, and when a vma does mlock it, but we could not get mmap_sem to set the page flag. But its only caller takes no interest in that distinction (and is better testing SWAP_MLOCK anyway), so let's keep the code simple and return SWAP_AGAIN for both cases. try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly amusing: once unravelled, it turns out to have been choosing between two different ways of doing the same nothing. Ah, no, one way was actually returning SWAP_FAIL when it meant to return SWAP_SUCCESS. [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one] [akpm@linux-foundation.org: remove test of MLOCK_PAGES] Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: define PAGE_MAPPING_FLAGSHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set in page->mapping, with the higher bits a pointer to the anon_vma; and have defined PageKsm(page) as that with NULL anon_vma. But KSM swapping will need to store a pointer there: so in preparation for that, now define PAGE_MAPPING_FLAGS as the low two bits, including PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some other use for the bit emerges). Declare page_rmapping(page) to return the pointer part of page->mapping, and page_anon_vma(page) to return the anon_vma pointer when that's what it is. Use these in a few appropriate places: notably, unuse_vma() has been testing page->mapping, but is better to be testing page_anon_vma() (cases may be added in which flag bits are set without any pointer). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmscan: stop kswapd waiting on congestion when the min watermark is not ↵KOSAKI Motohiro2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | being met If reclaim fails to make sufficient progress, the priority is raised. Once the priority is higher, kswapd starts waiting on congestion. However, if the zone is below the min watermark then kswapd needs to continue working without delay as there is a danger of an increased rate of GFP_ATOMIC allocation failure. This patch changes the conditions under which kswapd waits on congestion by only going to sleep if the min watermarks are being met. [mel@csn.ul.ie: add stats to track how relevant the logic is] [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmscan: have kswapd sleep for a short interval and double check it should be ↵Mel Gorman2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | asleep After kswapd balances all zones in a pgdat, it goes to sleep. In the event of no IO congestion, kswapd can go to sleep very shortly after the high watermark was reached. If there are a constant stream of allocations from parallel processes, it can mean that kswapd went to sleep too quickly and the high watermark is not being maintained for sufficient length time. This patch makes kswapd go to sleep as a two-stage process. It first tries to sleep for HZ/10. If it is woken up by another process or the high watermark is no longer met, it's considered a premature sleep and kswapd continues work. Otherwise it goes fully to sleep. This adds more counters to distinguish between fast and slow breaches of watermarks. A "fast" premature sleep is one where the low watermark was hit in a very short time after kswapd going to sleep. A "slow" premature sleep indicates that the high watermark was breached after a very short interval. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Frans Pop <elendil@planet.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | rmap: move label `out' to a better placeHuang Shijie2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the code jumps to the `out', `referenced' is still zero. So there is no need to check it. Signed-off-by: Huang Shijie <shijie8@gmail.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | rmap: simplify try_to_unmap_file()Huang Shijie2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just simplify the code when `mlocked' is true. Signed-off-by: Huang Shijie <shijie8@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | rmap: fix the comment for try_to_unmap_anonHuang Shijie2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the comment for try_to_unmap_anon() with the new arguments. Signed-off-by: Huang Shijie <shijie8@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm/vmscan: change comment generic_file_write to __generic_file_aio_writeVincent Li2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 543ade1fc9 ("Streamline generic_file_* interfaces and filemap cleanups") removed generic_file_write() in filemap. Change the comment in vmscan pageout() to __generic_file_aio_write(). Signed-off-by: Vincent Li <macli@brc.ubc.ca> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap: rework map_swap_page() againLee Schermerhorn2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Seems that page_io.c doesn't really need to know that page_private(page) is the swp_entry 'val'. Rework map_swap_page() to do what its name says and map a page to a page offset in the swap space. The only other caller of map_swap_page() is internal to mm/swapfile.c and it does want to map a swap entry to the 'sector'. So rename map_swap_page() to map_swap_entry(), make it 'static' and and implement map_swap_page() as a wrapper around that. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: note SWAP_MAP_SHMEMHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While we're fiddling with the swap_map values, let's assign a particular value to shmem/tmpfs swap pages: their swap counts are never incremented, and it helps swapoff's try_to_unuse() a little if it can immediately distinguish those pages from process pages. Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED, we might as well use that 0xbf value for SWAP_MAP_SHMEM. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: swap count continuationsHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Swap is duplicated (reference count incremented by one) whenever the same swap page is inserted into another mm (when forking finds a swap entry in place of a pte, or when reclaim unmaps a pte to insert the swap entry). swap_info_struct's vmalloc'ed swap_map is the array of these reference counts: but what happens when the unsigned short (or unsigned char since the preceding patch) is full? (and its high bit is kept for a cache flag) We then lose track of it, never freeing, leaving it in use until swapoff: at which point we _hope_ that a single pass will have found all instances, assume there are no more, and will lose user data if we're wrong. Swapping of KSM pages has not yet been enabled; but it is implemented, and makes it very easy for a user to overflow the maximum swap count: possible with ordinary process pages, but unlikely, even when pid_max has been raised from PID_MAX_DEFAULT. This patch implements swap count continuations: when the count overflows, a continuation page is allocated and linked to the original vmalloc'ed map page, and this used to hold the continuation counts for that entry and its neighbours. These continuation pages are seldom referenced: the common paths all work on the original swap_map, only referring to a continuation page when the low "digit" of a count is incremented or decremented through SWAP_MAP_MAX. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: swap_map of chars not shortsHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned chars: it's still very unusual to reach a swap count of 126, and the next patch allows it to be extended indefinitely. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: SWAP_HAS_CACHE cleanupsHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Though swap_count() is useful, I'm finding that swap_has_cache() and encode_swapmap() obscure what happens in the swap_map entry, just at those points where I need to understand it. Remove them, and pass more usable "usage" values to scan_swap_map(), swap_entry_free() and __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: miscellaneous minor cleanupsHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION block, remove extraneous whitespace and return, fix typo in a comment. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: include first_swap_extentHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make better use of the space by folding first swap_extent into its swap_info_struct, instead of just the list_head: swap partitions need only that one, and for others it's used as a circular list anyway. [jirislaby@gmail.com: fix crash on double swapon] Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: change to array of pointersHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The swap_info_struct is only 76 or 104 bytes, but it does seem wrong to reserve an array of about 30 of them in bss, when most people will want only one. Change swap_info[] to an array of pointers. That does need a "type" field in the structure: pack it as a char with next type and short prio (aha, char is unsigned by default on PowerPC). Use the (admittedly peculiar) name "type" throughout for this index. /proc/swaps does not take swap_lock: I wouldn't want it to, but do take care with barriers when adding a new item to the array (never removed). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | swap_info: private to swapfile.cHugh Dickins2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The swap_info_struct is mostly private to mm/swapfile.c, with only one other in-tree user: get_swap_bio(). Adjust its interface to map_swap_page(), so that we can then remove get_swap_info_struct(). But there is a popular user out-of-tree, TuxOnIce: so leave the declaration of swap_info_struct in linux/swap.h. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Nigel Cunningham <ncunningham@crca.org.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmalloc(): adjust gfp mask passed on nested vmalloc() invocationJan Beulich2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - avoid wasting more precious resources (DMA or DMA32 pools), when being called through vmalloc_32{,_user}() - explicitly allow using high memory here even if the outer allocation request doesn't allow it Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: add gfp flags for NODEMASK_ALLOC slab allocationsDavid Rientjes2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Objects passed to NODEMASK_ALLOC() are relatively small in size and are backed by slab caches that are not of large order, traditionally never greater than PAGE_ALLOC_COSTLY_ORDER. Thus, using GFP_KERNEL for these allocations on large machines when CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in the allocation attempt, each time invoking both direct reclaim or the oom killer. This is of particular interest when using NODEMASK_ALLOC() from a mempolicy context (either directly in mm/mempolicy.c or the mempolicy constrained hugetlb allocations) since the oom killer always kills current when allocations are constrained by mempolicies. So for all present use cases in the kernel, current would end up being oom killed when direct reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but current would have sacrificed itself upon returning. This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations. All current use cases either directly from hugetlb code or indirectly via NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom killer when the slab allocator needs to allocate additional pages. The side-effect of this change is that all current use cases of either NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling when the allocation fails (never for CONFIG_NODES_SHIFT <= 8). All current use cases were audited and do have appropriate error handling at this time. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlinedDavid Rientjes2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if there are no present pages left. In such a situation, kswapd must also be stopped since it has nothing left to do. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | hugetlb: use only nodes with memory for huge pagesLee Schermerhorn2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Register per node hstate sysfs attributes only for nodes with memory. Global replacement of 'all online nodes" with "all nodes with memory" in mm/hugetlb.c. Suggested by David Rientjes. A subsequent patch will handle adding/removing of per node hstate sysfs attributes when nodes transition to/from memoryless state via memory hotplug. NOTE: this patch has not been tested with memoryless nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | hugetlb: add per node hstate attributesLee Schermerhorn2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the per huge page size control/query attributes to the per node sysdevs: /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/ nr_hugepages - r/w free_huge_pages - r/o surplus_huge_pages - r/o The patch attempts to re-use/share as much of the existing global hstate attribute initialization and handling, and the "nodes_allowed" constraint processing as possible. Calling set_max_huge_pages() with no node indicates a change to global hstate parameters. In this case, any non-default task mempolicy will be used to generate the nodes_allowed mask. A valid node id indicates an update to that node's hstate parameters, and the count argument specifies the target count for the specified node. From this info, we compute the target global count for the hstate and construct a nodes_allowed node mask contain only the specified node. Setting the node specific nr_hugepages via the per node attribute effectively ignores any task mempolicy or cpuset constraints. With this patch: (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB ./ ../ free_hugepages nr_hugepages surplus_hugepages Starting from: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 0 Node 2 HugePages_Free: 0 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 0 Allocate 16 persistent huge pages on node 2: (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages [Note that this is equivalent to: numactl -m 2 hugeadmin --pool-pages-min 2M:+16 ] Yields: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 16 Node 2 HugePages_Free: 16 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 16 Global controls work as expected--reduce pool to 8 persistent huge pages: (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 8 Node 2 HugePages_Free: 8 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | hugetlb: derive huge pages nodes allowed from task mempolicyLee Schermerhorn2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch derives a "nodes_allowed" node mask from the numa mempolicy of the task modifying the number of persistent huge pages to control the allocation, freeing and adjusting of surplus huge pages when the pool page count is modified via the new sysctl or sysfs attribute "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows: * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer is produced. This will cause the hugetlb subsystem to use node_online_map as the "nodes_allowed". This preserves the behavior before this patch. * For "preferred" mempolicy, including explicit local allocation, a nodemask with the single preferred node will be produced. "local" policy will NOT track any internode migrations of the task adjusting nr_hugepages. * For "bind" and "interleave" policy, the mempolicy's nodemask will be used. * Other than to inform the construction of the nodes_allowed node mask, the actual mempolicy mode is ignored. That is, all modes behave like interleave over the resulting nodes_allowed mask with no "fallback". See the updated documentation [next patch] for more information about the implications of this patch. Examples: Starting with: Node 0 HugePages_Total: 0 Node 1 HugePages_Total: 0 Node 2 HugePages_Total: 0 Node 3 HugePages_Total: 0 Default behavior [with or without this patch] balances persistent hugepage allocation across nodes [with sufficient contiguous memory]: sysctl vm.nr_hugepages[_mempolicy]=32 yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 8 Node 3 HugePages_Total: 8 Of course, we only have nr_hugepages_mempolicy with the patch, but with default mempolicy, nr_hugepages_mempolicy behaves the same as nr_hugepages. Applying mempolicy--e.g., with numactl [using '-m' a.k.a. '--membind' because it allows multiple nodes to be specified and it's easy to type]--we can allocate huge pages on individual nodes or sets of nodes. So, starting from the condition above, with 8 huge pages per node, add 8 more to node 2 using: numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40 This yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The incremental 8 huge pages were restricted to node 2 by the specified mempolicy. Similarly, we can use mempolicy to free persistent huge pages from specified nodes: numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32 yields: Node 0 HugePages_Total: 4 Node 1 HugePages_Total: 4 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The 8 huge pages freed were balanced over nodes 0 and 1. [rientjes@google.com: accomodate reworked NODEMASK_ALLOC] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | hugetlb: add nodemask arg to huge page alloc, free and surplus adjust functionsLee Schermerhorn2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for constraining huge page allocation and freeing by the controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer to the allocate, free and surplus adjustment functions. For now, pass NULL to indicate default behavior--i.e., use node_online_map. A subsqeuent patch will derive a non-default mask from the controlling task's numa mempolicy. Note that this method of updating the global hstate nr_hugepages under the constraint of a nodemask simplifies keeping the global state consistent--especially the number of persistent and surplus pages relative to reservations and overcommit limits. There are undoubtedly other ways to do this, but this works for both interfaces: mempolicy and per node attributes. [rientjes@google.com: fix HIGHMEM compile error] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | hugetlb: rework hstate_next_node_* functionsLee Schermerhorn2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modify the hstate_next_node* functions to allow them to be called to obtain the "start_nid". Then, whereas prior to this patch we unconditionally called hstate_next_node_to_{alloc|free}(), whether or not we successfully allocated/freed a huge page on the node, now we only call these functions on failure to alloc/free to advance to next allowed node. Factor out the next_node_allowed() function to handle wrap at end of node_online_map. In this version, the allowed nodes include all of the online nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: move inc_zone_page_state(NR_ISOLATED) to just isolated placeKOSAKI Motohiro2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed in right after isolate_page(). This patch does it. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()KOSAKI Motohiro2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On ia64, the following test program exit abnormally, because glibc thread library called abort(). ======================================================== (gdb) bt #0 0xa000000000010620 in __kernel_syscall_via_break () #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1 #2 0x2000000000324090 in abort () from /lib/libc.so.6.1 #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0 #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0 #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1 ======================================================== The fact is, glibc call munmap() when thread exitng time for freeing stack, and it assume munlock() never fail. However, munmap() often make vma splitting and it with many mapcount make -ENOMEM. Oh well, that's crazy, because stack unmapping never increase mapcount. The maxcount exceeding is only temporary. internal temporary exceeding shouldn't make ENOMEM. This patch does it. test_max_mapcount.c ================================================================== #include<stdio.h> #include<stdlib.h> #include<string.h> #include<pthread.h> #include<errno.h> #include<unistd.h> #define THREAD_NUM 30000 #define MAL_SIZE (8*1024*1024) void *wait_thread(void *args) { void *addr; addr = malloc(MAL_SIZE); sleep(10); return NULL; } void *wait_thread2(void *args) { sleep(60); return NULL; } int main(int argc, char *argv[]) { int i; pthread_t thread[THREAD_NUM], th; int ret, count = 0; pthread_attr_t attr; ret = pthread_attr_init(&attr); if(ret) { perror("pthread_attr_init"); } ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); if(ret) { perror("pthread_attr_setdetachstate"); } for (i = 0; i < THREAD_NUM; i++) { ret = pthread_create(&th, &attr, wait_thread, NULL); if(ret) { fprintf(stderr, "[%d] ", count); perror("pthread_create"); } else { printf("[%d] create OK.\n", count); } count++; ret = pthread_create(&thread[i], &attr, wait_thread2, NULL); if(ret) { fprintf(stderr, "[%d] ", count); perror("pthread_create"); } else { printf("[%d] create OK.\n", count); } count++; } sleep(3600); return 0; } ================================================================== [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | oom: dump stack and VM state when oom killer panicsDavid Rientjes2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds2009-12-14
|\ \ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, mce: Clean up thermal init by introducing intel_thermal_supported() x86, mce: Thermal monitoring depends on APIC being enabled x86: Gart: fix breakage due to IOMMU initialization cleanup x86: Move swiotlb initialization before dma32_free_bootmem x86: Fix build warning in arch/x86/mm/mmio-mod.c x86: Remove usedac in feature-removal-schedule.txt x86: Fix duplicated UV BAU interrupt vector nvram: Fix write beyond end condition; prove to gcc copy is safe mm: Adjust do_pages_stat() so gcc can see copy_from_user() is safe x86: Limit the number of processor bootup messages x86: Remove enabling x2apic message for every CPU doc: Add documentation for bootloader_{type,version} x86, msr: Add support for non-contiguous cpumasks x86: Use find_e820() instead of hard coded trampoline address x86, AMD: Fix stale cpuid4_info shared_map data in shared_cpu_map cpumasks Trivial percpu-naming-introduced conflicts in arch/x86/kernel/cpu/intel_cacheinfo.c
| * | | mm: Adjust do_pages_stat() so gcc can see copy_from_user() is safeH. Peter Anvin2009-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Slightly adjust the logic for determining the size of the copy_form_user() in do_pages_stat(); with this change, gcc can see that the copying is safe. Without this, we get a build error for i386 allyesconfig: /home/hpa/kernel/linux-2.6-tip.urgent/arch/x86/include/asm/uaccess_32.h:213: error: call to ‘copy_from_user_overflow’ declared with attribute error: copy_from_user() buffer size is not provably correct Unlike an earlier patch from Arjan, this doesn't introduce new variables; merely reshuffles the compare so that gcc can see that an overflow cannot happen. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> LKML-Reference: <20090926205406.30d55b08@infradead.org>
* | | | Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds2009-12-14
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf sched: Fix build failure on sparc perf bench: Add "all" pseudo subsystem and "all" pseudo suite perf tools: Introduce perf_session class perf symbols: Ditch dso->find_symbol perf symbols: Allow lookups by symbol name too perf symbols: Add missing "Variables" entry to map_type__name perf symbols: Add support for 'variable' symtabs perf symbols: Introduce ELF counterparts to symbol_type__is_a perf symbols: Introduce symbol_type__is_a perf symbols: Rename kthreads to kmaps, using another abstraction for it perf tools: Allow building for ARM hw-breakpoints: Handle bad modify_user_hw_breakpoint off-case return value perf tools: Allow cross compiling tracing, slab: Fix no callsite ifndef CONFIG_KMEMTRACE tracing, slab: Define kmem_cache_alloc_notrace ifdef CONFIG_TRACING Trivial conflict due to different fixes to modify_user_hw_breakpoint() in include/linux/hw_breakpoint.h
| * | | | tracing, slab: Fix no callsite ifndef CONFIG_KMEMTRACELi Zefan2009-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For slab, if CONFIG_KMEMTRACE and CONFIG_DEBUG_SLAB are not set, __do_kmalloc() will not track callers: # ./perf record -f -a -R -e kmem:kmalloc ^C # ./perf trace ... perf-2204 [000] 147.376774: kmalloc: call_site=c0529d2d ... perf-2204 [000] 147.400997: kmalloc: call_site=c0529d2d ... Xorg-1461 [001] 147.405413: kmalloc: call_site=0 ... Xorg-1461 [001] 147.405609: kmalloc: call_site=0 ... konsole-1776 [001] 147.405786: kmalloc: call_site=0 ... Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: linux-mm@kvack.org <linux-mm@kvack.org> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <4B21F8AE.6020804@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | tracing, slab: Define kmem_cache_alloc_notrace ifdef CONFIG_TRACINGLi Zefan2009-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define kmem_trace_alloc_{,node}_notrace() if CONFIG_TRACING is enabled, otherwise perf-kmem will show wrong stats ifndef CONFIG_KMEM_TRACE, because a kmalloc() memory allocation may be traced by both trace_kmalloc() and trace_kmem_cache_alloc(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: linux-mm@kvack.org <linux-mm@kvack.org> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <4B21F89A.7000801@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | tracing: Fix kmem event exportsIngo Molnar2009-11-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 53d0422 ("tracing: Convert some kmem events to DEFINE_EVENT") moved the kmem tracepoint creation from util.c to page_alloc.c, but forgot to move the exports. Move them back. Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> LKML-Reference: <4B0E286A.2000405@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | tracing: Convert some kmem events to DEFINE_EVENTLi Zefan2009-11-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use DECLARE_EVENT_CLASS to remove duplicate code: text data bss dec hex filename 333987 69800 27228 431015 693a7 mm/built-in.o.old 330030 69800 27228 427058 68432 mm/built-in.o 8 events are converted: kmem_alloc: kmalloc, kmem_cache_alloc kmem_alloc_node: kmalloc_node, kmem_cache_alloc_node kmem_free: kfree, kmem_cache_free mm_page: mm_page_alloc_zone_locked, mm_page_pcpu_drain No change in functionality. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> LKML-Reference: <4B0E286A.2000405@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2009-12-14
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits) m68k: rename global variable vmalloc_end to m68k_vmalloc_end percpu: add missing per_cpu_ptr_to_phys() definition for UP percpu: Fix kdump failure if booted with percpu_alloc=page percpu: make misc percpu symbols unique percpu: make percpu symbols in ia64 unique percpu: make percpu symbols in powerpc unique percpu: make percpu symbols in x86 unique percpu: make percpu symbols in xen unique percpu: make percpu symbols in cpufreq unique percpu: make percpu symbols in oprofile unique percpu: make percpu symbols in tracer unique percpu: make percpu symbols under kernel/ and mm/ unique percpu: remove some sparse warnings percpu: make alloc_percpu() handle array types vmalloc: fix use of non-existent percpu variable in put_cpu_var() this_cpu: Use this_cpu_xx in trace_functions_graph.c this_cpu: Use this_cpu_xx for ftrace this_cpu: Use this_cpu_xx in nmi handling this_cpu: Use this_cpu operations in RCU this_cpu: Use this_cpu ops for VM statistics ... Fix up trivial (famous last words) global per-cpu naming conflicts in arch/x86/kvm/svm.c mm/slab.c
| * \ \ \ \ Merge branch 'for-linus' into for-nextTejun Heo2009-12-07
| |\ \ \ \ \ | | |_|_|_|/ | |/| | | | | | | | | | | | | | | | Conflicts: mm/percpu.c
| | * | | | percpu: Fix kdump failure if booted with percpu_alloc=pageVivek Goyal2009-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | o kdump functionality reserves a per cpu area at boot time and exports the physical address of that area to user space through sys interface. This area stores some dump related information like cpu register states etc at the time of crash. o We were assuming that per cpu area always come from linearly mapped meory region and using __pa() to determine physical address. With percpu_alloc=page, per cpu area can come from vmalloc region also and __pa() breaks. o This patch implments a new function to convert per cpu address to physical address. Before the patch, crash_notes addresses looked as follows. cpu0 60fffff49800 cpu1 60fffff60800 cpu2 60fffff77800 These are bogus phsyical addresses. After the patch, address are following. cpu0 13eb44000 cpu1 13eb43000 cpu2 13eb42000 cpu3 13eb41000 These look fine. I got 4G of memory and /proc/iomem tell me following. 100000000-13fffffff : System RAM tj: * added missing asm/io.h include reported by Stephen Rothwell * repositioned per_cpu_ptr_phys() in percpu.c and added comment. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
| * | | | | percpu: make percpu symbols under kernel/ and mm/ uniqueTejun Heo2009-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates percpu related symbols under kernel/ and mm/ such that percpu symbols are unique and don't clash with local symbols. This serves two purposes of decreasing the possibility of global percpu symbol collision and allowing dropping per_cpu__ prefix from percpu symbols. * kernel/lockdep.c: s/lock_stats/cpu_lock_stats/ * kernel/sched.c: s/init_rq_rt/init_rt_rq_var/ (any better idea?) s/sched_group_cpus/sched_groups/ * kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/a * kernel/softlockup.c: s/(*)_timestamp/softlockup_\1_ts/ s/watchdog_task/softlockup_watchdog/ s/timestamp/ts/ for local variables * kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/ * mm/slab.c: s/reap_work/slab_reap_work/ s/reap_node/slab_reap_node/ * mm/vmstat.c: local variable changed to avoid collision with vmstat_work Partly based on Rusty Russell's "alloc_percpu: rename percpu vars which cause name clashes" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: (slab/vmstat) Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de>
| * | | | | percpu: remove some sparse warningsTejun Heo2009-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the following changes to remove some sparse warnings. * Make DEFINE_PER_CPU_SECTION() declare __pcpu_unique_* before defining it. * Annotate pcpu_extend_area_map() that it is entered with pcpu_lock held, releases it and then reacquires it. * Make percpu related macros use unique nested variable names. * While at it, add pcpu prefix to __size_call[_return]() macros as to-be-implemented sparse annotations will add percpu specific stuff to these macros. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | vmalloc: fix use of non-existent percpu variable in put_cpu_var()Tejun Heo2009-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vmalloc used non-existent percpu variable vmap_cpu_blocks instead of the intended vmap_block_queue. This went unnoticed because put_cpu_var() didn't evaluate the parameter. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>
| * | | | | Merge branch 'for-linus' into for-nextTejun Heo2009-10-12
| |\ \ \ \ \
| * | | | | | percpu: kill legacy percpu allocatorTejun Heo2009-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With ia64 converted, there's no arch left which still uses legacy percpu allocator. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Delightedly-acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org>
| | | | | | |
| \ \ \ \ \ \
| \ \ \ \ \ \
| \ \ \ \ \ \
| \ \ \ \ \ \
| \ \ \ \ \ \
*-----. \ \ \ \ \ \ Merge branches 'slab/fixes', 'slab/kmemleak', 'slub/perf' and 'slub/stats' ↵Pekka Enberg2009-12-12
|\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | into for-linus
| | | | * | | | | | | slub: allow stats to be clearedDavid Rientjes2009-10-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When collecting slub stats for particular workloads, it's necessary to collect each statistic for all caches before the job is even started because the counters are usually greater than zero just from boot and initialization. This allows a statistic to be cleared on each cpu by writing '0' to its sysfs file. This creates a baseline for statistics of interest before the workload is started. Setting a statistic to a particular value is not supported, so all values written to these files other than '0' returns -EINVAL. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
| | | * | | | | | | | SLUB: Fix __GFP_ZERO unlikely() annotationPekka Enberg2009-11-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The unlikely() annotation in slab_alloc() covers too much of the expression. It's actually very likely that the object is not NULL so use unlikely() only for the __GFP_ZERO expression like SLAB does. The patch reduces kernel text by 29 bytes on x86-64: text data bss dec hex filename 24185 8560 176 32921 8099 mm/slub.o.orig 24156 8560 176 32892 807c mm/slub.o Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
| | * | | | | | | | | slab, kmemleak: pass the correct pointer to kmemleak_erase()J. R. Okajima2009-12-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ____cache_alloc(), the variable 'ac' may be changed after cache_alloc_refill() and the following kmemleak_erase() may get an incorrect pointer. Update 'ac' after cache_alloc_refill() unconditionally. See the following URL for the discussion of this patch: http://marc.info/?l=linux-kernel&m=125873373124187&w=2 Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>