aboutsummaryrefslogtreecommitdiffstats
path: root/mm/hugetlb.c
Commit message (Collapse)AuthorAge
* [PATCH] mm: make hugepages obey cpusets.Christoph Lameter2006-01-08
| | | | | | | | | | | | See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2 http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2 Make hugepages obey cpusets. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hugepages: fold find_or_alloc_pages into huge_no_page()Christoph Lameter2006-01-06
| | | | | | | | | | | | | | | | | | The number of parameters for find_or_alloc_page increases significantly after policy support is added to huge pages. Simplify the code by folding find_or_alloc_huge_page() into hugetlb_no_page(). Adam Litke objected to this piece in an earlier patch but I think this is a good simplification. Diffstat shows that we can get rid of almost half of the lines of find_or_alloc_page(). If we can find no consensus then lets simply drop this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Add NUMA policy support for huge pages.Christoph Lameter2006-01-06
| | | | | | | | | | | | | | | | | The huge_zonelist() function in the memory policy layer provides an list of zones ordered by NUMA distance. The hugetlb layer will walk that list looking for a zone that has available huge pages but is also in the nodeset of the current cpuset. This patch does not contain the folding of find_or_alloc_huge_page() that was controversial in the earlier discussion. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: dequeue a huge page near to this nodeChristoph Lameter2006-01-06
| | | | | | | | | | | | | | | | | | | | | | This was discussed at http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2 This patch changes the dequeueing to select a huge page near the node executing instead of always beginning to check for free nodes from node 0. This will result in a placement of the huge pages near the executing processor improving performance. The existing implementation can place the huge pages far away from the executing processor causing significant degradation of performance. The search starting from zero also means that the lower zones quickly run out of memory. Selecting a huge page near the process distributed the huge pages better. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Hugetlb: Copy on Write supportDavid Gibson2006-01-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be supported. This helps us to safely use hugetlb pages in many more applications. The patch makes the following changes. If needed, I also have it broken out according to the following paragraphs. 1. Add a pair of functions to set/clear write access on huge ptes. The writable check in make_huge_pte is moved out to the caller for use by COW later. 2. Hugetlb copy-on-write requires special case handling in the following situations: - copy_hugetlb_page_range() - Copied pages must be write protected so a COW fault will be triggered (if necessary) if those pages are written to. - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page cache. MAP_PRIVATE pages still need to be locked however. 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page() which handles the COW fault by making the actual copy. 4. Remove the check in hugetlbfs_file_map() so that MAP_PRIVATE mmaps will be allowed. Make MAP_HUGETLB exempt from the depricated VM_RESERVED mapping check. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Hugetlb: Reorganize hugetlb_fault to prepare for COWAdam Litke2006-01-06
| | | | | | | | | | | | | | | This patch splits the "no_page()" type activity into its own function, hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults and delegates to the appropriate handler depending on the type of fault. Right now we still have only hugetlb_no_page() but a later patch introduces a COW fault. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Hugetlb: Rename find_lock_page to find_or_alloc_huge_pageAdam Litke2006-01-06
| | | | | | | | | | | | | find_lock_huge_page() isn't a great name, since it does extra things not analagous to find_lock_page(). Rename it find_or_alloc_huge_page() which is closer to the mark. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Hugetlb: Remove duplicate i_size checkAdam Litke2006-01-06
| | | | | | | | | | | cleanup Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hugetlb: fix race in set_max_huge_pages for multiple updaters of ↵Eric Paris2005-11-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nr_huge_pages If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously it is possible for the nr_huge_pages variable to become incorrect. There is no locking in the set_max_huge_pages function around alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers to alloc_fresh_huge_page could race against each other as could a call to alloc_fresh_huge_page and a call to update_and_free_page. This patch just expands the area covered by the hugetlb_lock to cover the call into alloc_fresh_huge_page. I'm not sure how we could say that a sysctl section is performance critical where more specific locking would be needed. My reproducer was to run a couple copies of the following script simultaneously while [ true ]; do echo 1000 > /proc/sys/vm/nr_hugepages echo 500 > /proc/sys/vm/nr_hugepages echo 750 > /proc/sys/vm/nr_hugepages echo 100 > /proc/sys/vm/nr_hugepages echo 0 > /proc/sys/vm/nr_hugepages done and then watch /proc/meminfo and eventually you will see things like HugePages_Total: 100 HugePages_Free: 109 After applying the patch all seemed well. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] unexport hugetlb_total_pagesAdrian Bunk2005-11-07
| | | | | | | | I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ppc64: support 64k pagesBenjamin Herrenschmidt2005-11-06
| | | | | | | | | | | | | | | Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel base page size to 64K. The resulting kernel still boots on any hardware. On current machines with 4K pages support only, the kernel will maintain 16 "subpages" for each 64K page transparently. Note that while real 64K capable HW has been tested, the current patch will not enable it yet as such hardware is not released yet, and I'm still verifying with the firmware architects the proper to get the information from the newer hypervisors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hugetlb: demand fault handlerAdam Litke2005-10-30
| | | | | | | | | | | | | | Below is a patch to implement demand faulting for huge pages. The main motivation for changing from prefaulting to demand faulting is so that huge page memory areas can be allocated according to NUMA policy. Thanks to consolidated hugetlb code, switching the behavior requires changing only one fault handler. The bulk of the patch just moves the logic from hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page(). Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: unmap_vmas with inner ptlockHugh Dickins2005-10-30
| | | | | | | | | | | | | | | | | | | | Remove the page_table_lock from around the calls to unmap_vmas, and replace the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are now safe to descend without page_table_lock. Don't attempt fancy locking for hugepages, just take page_table_lock in unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor does unmap_vmas have much use for its mm arg now. The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without page_table_lock: if they're implemented at all, they typically come down to flush_cache_range (usually done outside page_table_lock) and flush_tlb_range (which we already audited for the mprotect case). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: ptd_alloc take ptlockHugh Dickins2005-10-30
| | | | | | | | | | | | | | | | | | | | | | Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: update_hiwaters just in timeHugh Dickins2005-10-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: rss = file_rss + anon_rssHugh Dickins2005-10-30
| | | | | | | | | | | | | | | | I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix handling spurious page fault for hugetlb regionHugh Dickins2005-10-20
| | | | | | | | | | This reverts commit 3359b54c8c07338f3a863d1109b42eebccdcf379 and replaces it with a cleaner version that is purely based on page table operations, so that the synchronization between inode size and hugetlb mappings becomes moot. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: hugetlb truncation fixesHugh Dickins2005-10-20
| | | | | | | | | | | | | | | | hugetlbfs allows truncation of its files (should it?), but hugetlb.c often forgets that: crashes and misaccounting ensue. copy_hugetlb_page_range better grab the src page_table_lock since we don't want to guess what happens if concurrently truncated. unmap_hugepage_range rss accounting must not assume the full range was mapped. follow_hugetlb_page must guard with page_table_lock and be prepared to exit early. Restyle copy_hugetlb_page_range with a for loop like the others there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hugetlb: move stale pte check into huge_pte_alloc()Adam Litke2005-09-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | Initial Post (Wed, 17 Aug 2005) This patch moves the if (! pte_none(*pte)) hugetlb_clean_stale_pgtable(pte); logic into huge_pte_alloc() so all of its callers can be immune to the bug described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246 > It turns out there is a bug in hugetlb_prefault(): with 3 level page table, > huge_pte_alloc() might return a pmd that points to a PTE page. It happens > if the virtual address for hugetlb mmap is recycled from previously used > normal page mmap. free_pgtables() might not scrub the pmd entry on > munmap and hugetlb_prefault skips on any pmd presence regardless what type > it is. Unless I am missing something, it seems more correct to place the check inside huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated. It also allows checking for this condition when lazily faulting huge pages later in the series. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix hugepage crash on failing mmap()David Gibson2005-08-05
| | | | | | | | | | | | | | | | This patch fixes a crash in the hugepage code. unmap_hugepage_area() was assuming that (due to prefault) PTEs must exist for all the area in question. However, this may not be the case, if mmap() encounters an error before the prefault and calls unmap_region() to clean up any partial mapping. Depending on the hugepage configuration, this crash can be triggered by an unpriveleged user. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Hugepage consolidationDavid Gibson2005-06-21
| | | | | | | | | | | | | | | | | | | | A lot of the code in arch/*/mm/hugetlbpage.c is quite similar. This patch attempts to consolidate a lot of the code across the arch's, putting the combined version in mm/hugetlb.c. There are a couple of uglyish hacks in order to covert all the hugepage archs, but the result is a very large reduction in the total amount of code. It also means things like hugepage lazy allocation could be implemented in one place, instead of six. Tested, at least a little, on ppc64, i386 and x86_64. Notes: - this patch changes the meaning of set_huge_pte() to be more analagous to set_pte() - does SH4 need s special huge_ptep_get_and_clear()?? Acked-by: William Lee Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds2005-04-16
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!