aboutsummaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAge
* [PATCH] remove invalidate_inode_pages()Andrew Morton2007-02-11
| | | | | | | | | | | Convert all calls to invalidate_inode_pages() into open-coded calls to invalidate_mapping_pages(). Leave the invalidate_inode_pages() wrapper in place for now, marked as deprecated. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Export invalidate_mapping_pages() to modulesAnton Altaparmakov2007-02-11
| | | | | | | | | | | | | It makes no sense to me to export invalidate_inode_pages() and not invalidate_mapping_pages() and I actually need invalidate_mapping_pages() because of its range specification ability... akpm: also remove the export of invalidate_inode_pages() by making it an inlined wrapper. Signed-off-by: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] lockdep: also check for freed locks in kmem_cache_free()Ingo Molnar2007-02-11
| | | | | | | | kmem_cache_free() was missing the check for freeing held locks. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] do not disturb page referenced state when unmapping memory rangeKen Chen2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When kernel unmaps an address range, it needs to transfer PTE state into page struct. Currently, kernel transfer access bit via mark_page_accessed(). The call to mark_page_accessed in the unmap path doesn't look logically correct. At unmap time, calling mark_page_accessed will causes page LRU state to be bumped up one step closer to more recently used state. It is causing quite a bit headache in a scenario when a process creates a shmem segment, touch a whole bunch of pages, then unmaps it. The unmapping takes a long time because mark_page_accessed() will start moving pages from inactive to active list. I'm not too much concerned with moving the page from one list to another in LRU. Sooner or later it might be moved because of multiple mappings from various processes. But it just doesn't look logical that when user asks a range to be unmapped, it's his intention that the process is no longer interested in these pages. Moving those pages to active list (or bumping up a state towards more active) seems to be an over reaction. It also prolongs unmapping latency which is the core issue I'm trying to solve. As suggested by Peter, we should still preserve the info on pte young pages, but not more. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ken Chen <kenchen@google.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] simplify shmem_aops.set_page_dirty() methodKen Chen2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | shmem backed file does not have page writeback, nor it participates in backing device's dirty or writeback accounting. So using generic __set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit overkill. It unnecessarily prolongs shm unmap latency. For example, on a densely populated large shm segment (sevearl GBs), the unmapping operation becomes painfully long. Because at unmap, kernel transfers dirty bit in PTE into page struct and to the radix tree tag. The operation of tagging the radix tree is particularly expensive because it has to traverse the tree from the root to the leaf node on every dirty page. What's bothering is that radix tree tag is used for page write back. However, shmem is memory backed and there is no page write back for such file system. And in the end, we spend all that time tagging radix tree and none of that fancy tagging will be used. So let's simplify it by introduce a new aops __set_page_dirty_no_writeback and this will speed up shm unmap. Signed-off-by: Ken Chen <kenchen@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Set CONFIG_ZONE_DMA for arches with GENERIC_ISA_DMAChristoph Lameter2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA channel management. Other functionality may still expect GFP_DMA to provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA is set independent of CONFIG_GENERIC_ISA_DMA. Undo the modifications to mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set theses explicitly in each arches Kconfig. Reviews must occur for each arch in order to determine if ZONE_DMA can be switched off. It can only be switched off if we know that all devices supported by a platform are capable of performing DMA transfers to all of memory (Some arches already support this: uml, avr32, sh sh64, parisc and IA64/Altix). In order to switch ZONE_DMA off conditionally, one would have to establish a scheme by which one can assure that no drivers are enabled that are only capable of doing I/O to a part of memory, or one needs to provide an alternate means of performing an allocation from a specific range of memory (like provided by alloc_pages_range()) and insure that all drivers use that call. In that case the arches alloc_dma_coherent() may need to be modified to call alloc_pages_range() instead of relying on GFP_DMA. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VMChristoph Lameter2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make ZONE_DMA optional in core code. - ifdef all code for ZONE_DMA and related definitions following the example for ZONE_DMA32 and ZONE_HIGHMEM. - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of 0. - Modify the VM statistics to work correctly without a DMA zone. - Modify slab to not create DMA slabs if there is no ZONE_DMA. [akpm@osdl.org: cleanup] [jdike@addtoit.com: build fix] [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] optional ZONE_DMA: introduce CONFIG_ZONE_DMAChristoph Lameter2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work without ZONE_DMA. CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture handles ISA DMA. First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch needs ZONE_DMA because ISA DMA devices are supported. We can catch this in mm/Kconfig and do not need to modify arch code. Second, arches may use ZONE_DMA in an unknown way. We set CONFIG_ZONE_DMA for all arches that do not set CONFIG_GENERIC_ISA_DMA in order to insure backwards compatibility. The arches may later undefine ZONE_DMA if their arch code has been verified to not depend on ZONE_DMA. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zoneChristoph Lameter2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset follows up on the earlier work in Andrew's tree to reduce the number of zones. The patches allow to go to a minimum of 2 zones. This one allows also to make ZONE_DMA optional and therefore the number of zones can be reduced to one. ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons why we would not want to have ZONE_DMA 1. Some arches do not need ZONE_DMA at all. 2. With the advent of IOMMUs DMA zones are no longer needed. The necessity of DMA zones may drastically be reduced in the future. This patchset allows a compilation of a kernel without that overhead. 3. Devices that require ISA DMA get rare these days. All my systems do not have any need for ISA DMA. 4. The presence of an additional zone unecessarily complicates VM operations because it must be scanned and balancing logic must operate on its. 5. With only ZONE_NORMAL one can reach the situation where we have only one zone. This will allow the unrolling of many loops in the VM and allows the optimization of varous code paths in the VM. 6. Having only a single zone in a NUMA system results in a 1-1 correspondence between nodes and zones. Various additional optimizations to critical VM paths become possible. Many systems today can operate just fine with a single zone. If you look at what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL will be empty instead). On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty. Why constantly look at an empty zone in /proc/zoneinfo and empty slab in /proc/slabinfo? Non i386 also frequently have no need for ZONE_DMA and zones stay empty. The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA). The RFC posted earlier (see http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots of #ifdefs in them. An effort has been made to minize the number of #ifdefs and make this as compact as possible. The job was made much easier by the ongoing efforts of others to extract common arch specific functionality. I have been running this for awhile now on my desktop and finally Linux is using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched: christoph@pentium940:~$ cat /proc/zoneinfo Node 0, zone Normal pages free 4435 min 1448 low 1810 high 2172 active 241786 inactive 210170 scanned 0 (a: 0 i: 0) spanned 524224 present 524224 nr_anon_pages 61680 nr_mapped 14271 nr_file_pages 390264 nr_slab_reclaimable 27564 nr_slab_unreclaimable 1793 nr_page_table_pages 449 nr_dirty 39 nr_writeback 0 nr_unstable 0 nr_bounce 0 cpu: 0 pcp: 0 count: 156 high: 186 batch: 31 cpu: 0 pcp: 1 count: 9 high: 62 batch: 15 vm stats threshold: 20 cpu: 1 pcp: 0 count: 177 high: 186 batch: 31 cpu: 1 pcp: 1 count: 12 high: 62 batch: 15 vm stats threshold: 20 all_unreclaimable: 0 prev_priority: 12 temp_priority: 12 start_pfn: 0 This patch: In two places in the VM we use ZONE_DMA to refer to the first zone. If ZONE_DMA is optional then other zones may be first. So simply replace ZONE_DMA with zone 0. This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone number related to a pgdat. However, we still need a zonetable to index all the zones for each node if this is a NUMA system. Therefore define ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags. [apw@shadowen.org: fix mismerge] Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Drop get_zone_counts()Christoph Lameter2007-02-11
| | | | | | | | Values are available via ZVC sums. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Drop __get_zone_counts()Christoph Lameter2007-02-11
| | | | | | | | Values are readily available via ZVC per node and global sums. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Drop nr_free_pages_pgdat()Christoph Lameter2007-02-11
| | | | | | | | | Function is unnecessary now. We can use the summing features of the ZVCs to get the values we need. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Drop free_pages()Christoph Lameter2007-02-11
| | | | | | | | | | | | | | | nr_free_pages is now a simple access to a global variable. Make it a macro instead of a function. The nr_free_pages now requires vmstat.h to be included. There is one occurrence in power management where we need to add the include. Directly refrer to global_page_state() there to clarify why the #include was added. [akpm@osdl.org: arm build fix] [akpm@osdl.org: sparc64 build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Reorder ZVCs according to cachelineChristoph Lameter2007-02-11
| | | | | | | | | | | | The global and per zone counter sums are in arrays of longs. Reorder the ZVCs so that the most frequently used ZVCs are put into the same cacheline. That way calculations of the global, node and per zone vm state touches only a single cacheline. This is mostly important for 64 bit systems were one 128 byte cacheline takes only 8 longs. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Use ZVC for free_pagesChristoph Lameter2007-02-11
| | | | | | | | | | | This is again simplifies some of the VM counter calculations through the use of the ZVC consolidated counters. [michal.k.k.piotrowski@gmail.com: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Use ZVC for inactive and active countsChristoph Lameter2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The determination of the dirty ratio to determine writeback behavior is currently based on the number of total pages on the system. However, not all pages in the system may be dirtied. Thus the ratio is always too low and can never reach 100%. The ratio may be particularly skewed if large hugepage allocations, slab allocations or device driver buffers make large sections of memory not available anymore. In that case we may get into a situation in which f.e. the background writeback ratio of 40% cannot be reached anymore which leads to undesired writeback behavior. This patchset fixes that issue by determining the ratio based on the actual pages that may potentially be dirty. These are the pages on the active and the inactive list plus free pages. The problem with those counts has so far been that it is expensive to calculate these because counts from multiple nodes and multiple zones will have to be summed up. This patchset makes these counters ZVC counters. This means that a current sum per zone, per node and for the whole system is always available via global variables and not expensive anymore to calculate. The patchset results in some other good side effects: - Removal of the various functions that sum up free, active and inactive page counts - Cleanup of the functions that display information via the proc filesystem. This patch: The use of a ZVC for nr_inactive and nr_active allows a simplification of some counter operations. More ZVC functionality is used for sums etc in the following patches. [akpm@osdl.org: UP build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] page_mkwrite caller race fixHugh Dickins2007-02-11
| | | | | | | | | | | | After do_wp_page has tested page_mkwrite, it must release old_page after acquiring page table lock, not before: at some stage that ordering got reversed, leaving a (very unlikely) window in which old_page might be truncated, freed, and reused in the same position. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] /proc/zoneinfo: fix vm stats displayAndrew Morton2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This early break prevents us from displaying info for the vm stats thresholds if the zone doesn't have any pages in its per-cpu pagesets. So my 800MB i386 box says: Node 0, zone DMA pages free 2365 min 16 low 20 high 24 active 0 inactive 0 scanned 0 (a: 0 i: 0) spanned 4096 present 4044 nr_anon_pages 0 nr_mapped 1 nr_file_pages 0 nr_slab_reclaimable 0 nr_slab_unreclaimable 0 nr_page_table_pages 0 nr_dirty 0 nr_writeback 0 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 protection: (0, 868, 868) pagesets all_unreclaimable: 0 prev_priority: 12 start_pfn: 0 Node 0, zone Normal pages free 199713 min 934 low 1167 high 1401 active 10215 inactive 4507 scanned 0 (a: 0 i: 0) spanned 225280 present 222420 nr_anon_pages 2685 nr_mapped 1110 nr_file_pages 12055 nr_slab_reclaimable 2216 nr_slab_unreclaimable 1527 nr_page_table_pages 213 nr_dirty 0 nr_writeback 0 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 protection: (0, 0, 0) pagesets cpu: 0 pcp: 0 count: 152 high: 186 batch: 31 cpu: 0 pcp: 1 count: 13 high: 62 batch: 15 vm stats threshold: 16 cpu: 1 pcp: 0 count: 34 high: 186 batch: 31 cpu: 1 pcp: 1 count: 10 high: 62 batch: 15 vm stats threshold: 16 all_unreclaimable: 0 prev_priority: 12 start_pfn: 4096 Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it was there in the first place.. Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Avoid excessive sorting of early_node_map[]Mel Gorman2007-02-11
| | | | | | | | | | | | | | find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort early_node_map[] on every call. This is an excessive amount of sorting and that can be avoided. This patch always searches the whole early_node_map[] in find_min_pfn_for_node() instead of returning the first value found. The map is then only sorted once when required. Successfully boot tested on a number of machines. [akpm@osdl.org: cleanup] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] slab: use parameter passed to cache_reap to determine pointer to ↵Christoph Lameter2007-02-11
| | | | | | | | | | | work structure Use the pointer passed to cache_reap to determine the work pointer and consolidate exit paths. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] slab: cache alloc cleanupsPekka Enberg2007-02-11
| | | | | | | | | | | | | | | | | | | Clean up __cache_alloc and __cache_alloc_node functions a bit. We no longer need to do NUMA_BUILD tricks and the UMA allocation path is much simpler. No functional changes in this patch. Note: saves few kernel text bytes on x86 NUMA build due to using gotos in __cache_alloc_node() and moving __GFP_THISNODE check in to fallback_alloc(). Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Manfred Spraul <manfred@colorfullife.com> Acked-by: Christoph Lameter <christoph@lameter.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] slab: remove broken PageSlab check from kfree_debugcheckPekka Enberg2007-02-11
| | | | | | | | | | | The PageSlab debug check in kfree_debugcheck() is broken for compound pages. It is also redundant as we already do BUG_ON for non-slab pages in page_get_cache() and page_get_slab() which are always called before we free any actual objects. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Add install_special_mappingRoland McGrath2007-02-09
| | | | | | | | | | | | | | | This patch adds a utility function install_special_mapping, for creating a special vma using a fixed set of preallocated pages as backing, such as for a vDSO. This consolidates some nearly identical code used for vDSO mapping reimplemented for different architectures. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] mm: show bounce pages in oom killer outputAndrew Morton2007-02-09
| | | | | | | | | | Also split that long line up - people like to send us wordwrapped oom-kill traces. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] hugetlb: preserve hugetlb pte dirty stateKen Chen2007-02-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | __unmap_hugepage_range() is buggy that it does not preserve dirty state of huge_pte when unmapping hugepage range. It causes data corruption in the event of dop_caches being used by sys admin. For example, an application creates a hugetlb file, modify pages, then unmap it. While leaving the hugetlb file alive, comes along sys admin doing a "echo 3 > /proc/sys/vm/drop_caches". drop_pagecache_sb() will happily free all pages that aren't marked dirty if there are no active mapping. Later when application remaps the hugetlb file back and all data are gone, triggering catastrophic flip over on application. Not only that, the internal resv_huge_pages count will also get all messed up. Fix it up by marking page dirty appropriately. Signed-off-by: Ken Chen <kenchen@google.com> Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: <stable@kernel.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] mm: remove find_trylock_pageNick Piggin2007-02-09
| | | | | | | | Remove find_trylock_page as per the removal schedule. Signed-off-by: Nick Piggin <npiggin@suse.de> [ Let's see if anybody screams ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "[PATCH] mm: micro optimise zone_watermark_ok"Linus Torvalds2007-01-31
| | | | | | | | | | | | | | | This reverts commit e80ee884ae0e3794ef2b65a18a767d502ad712ee. Pawel Sikora had a boot-time oops due to it - because the sign change invalidates the following comparisons, since 'free_pages' can be negative. The micro-optimization just isn't worth it. Bisected-by: Pawel Sikora <pluto@agmk.net> Acked-by: Andrew Morton <akpm@osdl.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Don't allow the stack to grow into hugetlb reserved regionsAdam Litke2007-01-30
| | | | | | | | | | | | | | | When expanding the stack, we don't currently check if the VMA will cross into an area of the address space that is reserved for hugetlb pages. Subsequent faults on the expanded portion of such a VMA will confuse the low-level MMU code, resulting in an OOPS. Check for this. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] mm: mremap correct rmap accountingHugh Dickins2007-01-30
| | | | | | | | | | | | | | | | | | | Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE. Instead of complicating that move_pte, just forget the minor optimization when mremapping, and change the one thing which needed it for correctness - filemap_xip use ZERO_PAGE(0) throughout instead of according to address. [ "There is no block device driver one could use for XIP on mips platforms" - Carsten Otte ] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andrew Morton <akpm@osdl.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix balance_dirty_page() calculations with CONFIG_HIGHMEMLinus Torvalds2007-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This makes balance_dirty_page() always base its calculations on the amount of non-highmem memory in the machine, rather than try to base it on total memory and then falling back on non-highmem memory if the mapping it was writing wasn't highmem capable. This not only fixes a situation where two different writers can have wildly different notions about what is a "balanced" dirty state, but it also means that people with highmem machines don't run into an OOM situation when regular memory fills up with dirty pages. We used to try to handle the latter case by scaling down the dirty_ratio if the machine had a lot of highmem pages in page_writeback_init(), but it wasn't aggressive enough for some situations, and since basing the dirty ratio on highmem memory was broken in the first place, let's just stop doing so. (A variation of this theme fixed Justin Piszcz's OOM problem when copying an 18GB file on a RAID setup). Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Justin Piszcz <jpiszcz@lucidpixels.com> Cc: Andrew Morton <akpm@osdl.org> Cc: Neil Brown <neilb@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Christoph Lameter <clameter@sgi.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] MM: Remove [PATCH] invalidate_inode_pages2_range() debugTrond Myklebust2007-01-26
| | | | | | | | | | | | NFS can handle the case where invalidate_inode_pages2_range() fails, so the premise behind commit 8258d4a574d3a8c01f0ef68aa26b969398a0e140 is now gone. Remove the WARN_ON_ONCE() which is causing users grief as we can see from http://bugzilla.kernel.org/show_bug.cgi?id=7826 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] i386 vDSO: use VM_ALWAYSDUMPRoland McGrath2007-01-26
| | | | | | | | | | | | | | | | | | | | | | | This patch fixes core dumps to include the vDSO vma, which is left out now. It removes the special-case core writing macros, which were not doing the right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the vma; there is no need for the fixmap page to be installed. It handles the CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from get_gate_vma after real vmas in the same way the /proc/PID/maps code does. This changes core dumps so they no longer include the non-PT_LOAD phdrs from the vDSO. I made the change to add them in the first place, but in turned out that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner to leave them out, and just let the phdrs inside the vDSO image speak for themselves. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Fix gate_vma.vm_flagsRoland McGrath2007-01-26
| | | | | | | | | | | | | | | This patch fixes the initialization of gate_vma.vm_flags and gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in /proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used (CONFIG_COMPAT_VDSO on i386). Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Resurrect 'try_to_free_buffers()' VM hackeryLinus Torvalds2007-01-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's not pretty, but it appears that ext3 with data=journal will clean pages without ever actually telling the VM that they are clean. This, in turn, will result in the VM (and balance_dirty_pages() in particular) to never realize that the pages got cleaned, and wait forever for an event that already happened. Technically, this seems to be a problem with ext3 itself, but it used to be hidden by 'try_to_free_buffers()' noticing this situation on its own, and just working around the filesystem problem. This commit re-instates that hack, in order to avoid a regression for the 2.6.20 release. This fixes bugzilla 7844: http://bugzilla.kernel.org/show_bug.cgi?id=7844 Peter Zijlstra points out that we should probably retain the debugging code that this removes from cancel_dirty_page(), and I agree, but for the imminent release we might as well just silence the warning too (since it's not a new bug: anything that triggers that warning has been around forever). Acked-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] mbind: restrict nodes to the currently allowed cpusetChristoph Lameter2007-01-23
| | | | | | | | | | | | | | | | | | | Currently one can specify an arbitrary node mask to mbind that includes nodes not allowed. If that is done with an interleave policy then we will go around all the nodes. Those outside of the currently allowed cpuset will be redirected to the border nodes. Interleave will then create imbalances at the borders of the cpuset. This patch restricts the nodes to the currently allowed cpuset. The RFC for this patch was discussed at http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] blktrace: only add a bounce trace when we really bounceJens Axboe2007-01-12
| | | | | | | | | | | Currently we issue a bounce trace when __blk_queue_bounce() is called, but that merely means that the device has a lower dma mask than the higher pages in the system. The bio itself may still be lower pages. So move the bounce trace into __blk_queue_bounce(), when we know there will actually be page bouncing. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NFS: Fix race in nfs_release_page()Trond Myklebust2007-01-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | NFS: Fix race in nfs_release_page() invalidate_inode_pages2() may find the dirty bit has been set on a page owing to the fact that the page may still be mapped after it was locked. Only after the call to unmap_mapping_range() are we sure that the page can no longer be dirtied. In order to fix this, NFS has hooked the releasepage() method and tries to write the page out between the call to unmap_mapping_range() and the call to remove_mapping(). This, however leads to deadlocks in the page reclaim code, where the page may be locked without holding a reference to the inode or dentry. Fix is to add a new address_space_operation, launder_page(), which will attempt to write out a dirty page without releasing the page lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Also, the bare SetPageDirty() can skew all sort of accounting leading to other nasties. [akpm@osdl.org: cleanup] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix sparsemem on CellDave Hansen2007-01-11
| | | | | | | | | | | | | | | | | | | | | Fix an oops experienced on the Cell architecture when init-time functions, early_*(), are called at runtime. It alters the call paths to make sure that the callers explicitly say whether the call is being made on behalf of a hotplug even, or happening at boot-time. It has been compile tested on ppc64, ia64, s390, i386 and x86_64. Acked-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Merge master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds2007-01-08
|\ | | | | | | | | | | | | | | | | | | | | | | * master.kernel.org:/home/rmk/linux-2.6-arm: [ARM] Provide basic printk_clock() implementation [ARM] Resolve fuse and direct-IO failures due to missing cache flushes [ARM] pass vma for flush_anon_page() [ARM] Fix potential MMCI bug [ARM] Fix kernel-mode undefined instruction aborts [ARM] 4082/1: iop3xx: fix iop33x gpio register offset [ARM] 4070/1: arch/arm/kernel: fix warnings from missing includes [ARM] 4079/1: iop: Update MAINTAINERS
| * [ARM] pass vma for flush_anon_page()Russell King2007-01-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since get_user_pages() may be used with processes other than the current process and calls flush_anon_page(), flush_anon_page() has to cope in some way with non-current processes. It may not be appropriate, or even desirable to flush a region of virtual memory cache in the current process when that is different to the process that we want the flush to occur for. Therefore, pass the vma into flush_anon_page() so that the architecture can work out whether the 'vmaddr' is for the current process or not. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* | [PATCH] shrink_all_memory(): fix lru_pages handlingAndrew Morton2007-01-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the end of shrink_all_memory() we forget to recalculate lru_pages: it can be zero. Fix that up, and add a helper function for this operation too. Also, recalculate lru_pages each time around the inner loop to get the balancing correct. Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] fix OOM killing of swapoffHugh Dickins2007-01-06
| | | | | | | | | | | | | | | | | | | | | | | | | | These days, if you swapoff when there isn't enough memory, OOM killer gives "BUG: scheduling while atomic" and the machine hangs: badness() needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any moment, but that doesn't really matter). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Check for populated zone in __drain_pagesChristoph Lameter2007-01-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both process_zones() and drain_node_pages() check for populated zones before touching pagesets. However, __drain_pages does not do so, This may result in a NULL pointer dereference for pagesets in unpopulated zones if a NUMA setup is combined with cpu hotplug. Initially the unpopulated zone has the pcp pointers pointing to the boot pagesets. Since the zone is not populated the boot pageset pointers will not be changed during page allocator and slab bootstrap. If a cpu is later brought down (first call to __drain_pages()) then the pcp pointers for cpus in unpopulated zones are set to NULL since __drain_pages does not first check for an unpopulated zone. If the cpu is then brought up again then we call process_zones() which will ignore the unpopulated zone. So the pageset pointers will still be NULL. If the cpu is then again brought down then __drain_pages will attempt to drain pages by following the NULL pageset pointer for unpopulated zones. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] fix BUG_ON(!PageSlab) from fallback_allocHugh Dickins2007-01-06
| | | | | | | | | | | | | | | | | | | | | | | | | | pdflush hit the BUG_ON(!PageSlab(page)) in kmem_freepages called from fallback_alloc: cache_grow already freed those pages when alloc_slabmgmt failed. But it wouldn't have freed them if __GFP_NO_GROW, so make sure fallback_alloc doesn't waste its time on that case. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Sanely size hash tables when using large base pagesPaul Mundt2007-01-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment the inode/dentry cache hash tables (common by way of alloc_large_system_hash()) are incorrectly sized by their respective detection logic when we attempt to use large base pages on systems with little memory. This results in odd behaviour when using a 64kB PAGE_SIZE, such as: Dentry cache hash table entries: 8192 (order: -1, 32768 bytes) Inode-cache hash table entries: 4096 (order: -2, 16384 bytes) The mount cache hash table is seemingly the only one that gets this right by directly taking PAGE_SIZE in to account. The following patch attempts to catch the bogus values and round it up to at least 0-order. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] swsusp: Do not fail if resume device is not setRafael J. Wysocki2007-01-06
|/ | | | | | | | | | | | | | In the kernels later than 2.6.19 there is a regression that makes swsusp fail if the resume device is not explicitly specified. It can be fixed by adding an additional parameter to mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device *) corresponding to the first available swap back to the caller. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Buglet in vmscan.cShantanu Goel2006-12-30
| | | | | | | | | | Fix a rather obvious buglet. Noticed while instrumenting the VM using /proc/vmstat. Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] page_mkclean_one(): fix call to set_pte_at()Al Viro2006-12-30
| | | | | | | | (akpm: macros are wonderful) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] MM: SLOB is broken by recent cleanup of slab.hDimitri Gorokhovik2006-12-30
| | | | | | | | | | | | | | | Recent cleanup of slab.h broke SLOB allocator: the routine kmem_cache_init has now the __init attribute for both slab.c and slob.c. This routine cannot be removed after init in the case of slob.c -- it serves as a timer callback. Provide a separate timer callback routine, call it once from kmem_cache_init, keep the __init attribute on the latter. Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fix oom killer kills current every time if there is memory-less-node ↵KAMEZAWA Hiroyuki2006-12-30
| | | | | | | | | | | | | | | | | | | | | take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>