aboutsummaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAge
* [PATCH] uninline zone helpersKAMEZAWA Hiroyuki2006-03-27
| | | | | | | | | | | | | | | | | Helper functions for for_each_online_pgdat/for_each_zone look too big to be inlined. Speed of these helper macro itself is not very important. (inner loops are tend to do more work than this) This patch make helper function to be out-of-lined. inline out-of-line .text 005c0680 005bf6a0 005c0680 - 005bf6a0 = FE0 = 4Kbytes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] for_each_online_pgdat: remove pgdat_listKAMEZAWA Hiroyuki2006-03-27
| | | | | | | | | By using for_each_online_pgdat(), pgdat_list is not necessary now. This patch removes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] for_each_online_pgdat: renaming for_each_pgdatKAMEZAWA Hiroyuki2006-03-27
| | | | | | | | Replace for_each_pgdat() with for_each_online_pgdat(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] for_each_online_pgdat: for_each_bootmemKAMEZAWA Hiroyuki2006-03-27
| | | | | | | | | | | | | | | | Add a list_head to bootmem_data_t and make bootmems use it. bootmem list is sorted by node_boot_start. Only nodes against which init_bootmem() is called are linked to the list. (i386 allocates bootmem only from one node(0) not from all online nodes.) A summary: 1. for_each_online_pgdat() traverses all *online* nodes. 2. alloc_bootmem() allocates memory only from initialized-for-bootmem nodes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove zone_mem_mapKAMEZAWA Hiroyuki2006-03-27
| | | | | | | | | | | | | | This patch removes zone_mem_map. pfn_to_page uses pgdat, page_to_pfn uses zone. page_to_pfn can use pgdat instead of zone, which is only one user of zone_mem_map. By modifing it, we can remove zone_mem_map. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] unify pfn_to_page: generic functionsKAMEZAWA Hiroyuki2006-03-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are 3 memory models, FLATMEM, DISCONTIGMEM, SPARSEMEM. Each arch has its own page_to_pfn(), pfn_to_page() for each models. But most of them can use the same arithmetic. This patch adds asm-generic/memory_model.h, which includes generic page_to_pfn(), pfn_to_page() definitions for each memory model. When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are used instead of macro. This is enabled by some archs and reduces text size. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Hirokazu Takata <takata.hirokazu@renesas.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds2006-03-26
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment Kconfig help: MTD_JEDECPROBE already supports Intel Remove ugly debugging stuff do_mounts.c: Minor ROOT_DEV comment cleanup BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c BUG_ON() Conversion in mm/mempool.c BUG_ON() Conversion in mm/memory.c BUG_ON() Conversion in kernel/fork.c BUG_ON() Conversion in ipc/sem.c BUG_ON() Conversion in fs/ext2/ BUG_ON() Conversion in fs/hfs/ BUG_ON() Conversion in fs/dcache.c BUG_ON() Conversion in fs/buffer.c BUG_ON() Conversion in input/serio/hp_sdc_mlc.c BUG_ON() Conversion in md/dm-table.c BUG_ON() Conversion in md/dm-path-selector.c BUG_ON() Conversion in drivers/isdn BUG_ON() Conversion in drivers/char BUG_ON() Conversion in drivers/mtd/
| * BUG_ON() Conversion in mm/mempool.cEric Sesterhenn2006-03-26
| | | | | | | | | | | | | | | | this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
| * BUG_ON() Conversion in mm/memory.cEric Sesterhenn2006-03-26
| | | | | | | | | | | | | | | | this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
* | [PATCH] mempool: add kzalloc allocatorMatthew Dobson2006-03-26
| | | | | | | | | | | | | | | | | | | | | | | | Add another allocator to the common mempool code: a kzalloc/kfree allocator This will be used by the next patch in the series to replace a mempool-backed kzalloc allocator. It is also very likely that there will be more users in the future. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] mempool: add kmalloc allocatorMatthew Dobson2006-03-26
| | | | | | | | | | | | | | | | | | | | | | | | Add another allocator to the common mempool code: a kmalloc/kfree allocator This will be used by the next patch in the series to replace duplicate mempool-backed kmalloc allocators in several places in the kernel. It is also very likely that there will be more users in the future. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] mempool: use common mempool page allocatorMatthew Dobson2006-03-26
| | | | | | | | | | | | | | | | | | | | | | Convert two mempool users that currently use their own mempool-backed page allocators to use the generic mempool page allocator. Also included are 2 trivial whitespace fixes. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] mempool: add page allocatorMatthew Dobson2006-03-26
| | | | | | | | | | | | | | | | | | | | This will be used by the next patch in the series to replace duplicate mempool-backed page allocators in 2 places in the kernel. It is also likely that there will be more users in the future. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Add API for flushing Anon pagesJames Bottomley2006-03-26
|/ | | | | | | | | | | | | | | | Currently, get_user_pages() returns fully coherent pages to the kernel for anything other than anonymous pages. This is a problem for things like fuse and the SCSI generic ioctl SG_IO which can potentially wish to do DMA to anonymous pages passed in by users. The fix is to add a new memory management API: flush_anon_page() which is used in get_user_pages() to make anonymous pages coherent. Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86_64: Try to allocate node memmap near the end of nodeAndi Kleen2006-03-25
| | | | | | | | | This fixes problems with very large nodes (over 128GB) filling up all of the first 4GB with their mem_map and not leaving enough space for the swiotlb. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: restore vm_normal_page checkNick Piggin2006-03-25
| | | | | | | | | | Hugh is rightly concerned that the CONFIG_DEBUG_VM coverage has gone too far in vm_normal_page, considering that we expect production kernels to be shipped with the option turned off, and that the code has been under some large changes recently. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds2006-03-25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (21 commits) BUG_ON() Conversion in drivers/video/ BUG_ON() Conversion in drivers/parisc/ BUG_ON() Conversion in drivers/block/ BUG_ON() Conversion in sound/sparc/cs4231.c BUG_ON() Conversion in drivers/s390/block/dasd.c BUG_ON() Conversion in lib/swiotlb.c BUG_ON() Conversion in kernel/cpu.c BUG_ON() Conversion in ipc/msg.c BUG_ON() Conversion in block/elevator.c BUG_ON() Conversion in fs/coda/ BUG_ON() Conversion in fs/binfmt_elf_fdpic.c BUG_ON() Conversion in input/serio/hil_mlc.c BUG_ON() Conversion in md/dm-hw-handler.c BUG_ON() Conversion in md/bitmap.c The comment describing how MS_ASYNC works in msync.c is confusing rcu: undeclared variable used in documentation fix typos "wich" -> "which" typo patch for fs/ufs/super.c Fix simple typos tabify drivers/char/Makefile ...
| * The comment describing how MS_ASYNC works in msync.c is confusingAmos Waterland2006-03-24
| | | | | | | | | | | | | | because of a typo. This patch just changes "my" to "by", which I believe was the original intent. Signed-off-by: Adrian Bunk <bunk@stusta.de>
* | [PATCH] fix alloc_large_system_hash() roundupJohn Hawkes2006-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The "rounded up to nearest power of 2 in size" algorithm in alloc_large_system_hash is not correct. As coded, it takes an otherwise acceptable power-of-2 value and doubles it. For example, we see the error if we boot with thash_entries=2097152 which produces a hash table with 4194304 entries. Signed-off-by: John Hawkes <hawkes@sgi.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] find_task_by_pid() needs tasklist_lockAndrew Morton2006-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A couple of places are forgetting to take it. The kswapd case is probably unimportant. keventd_create_kthread() was racy. The whole thing is a bit flakey: you start a kernel thread, get its pid from kernel_thread() then look up its task_struct. a) It assumes that pid recycling takes a "long" time. b) We get a task_struct but no reference was taken on it. The owner of the kswapd and kthread task_struct*'s must assume that the new thread won't exit unexpectedly. Because if it does, they're left holding dead memory and any attempt to control or stop that task will crash. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] quieten zone_pcp_initAnton Blanchard2006-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In zone_pcp_init we print out all zones even if they are empty: On node 0 totalpages: 245760 DMA zone: 245760 pages, LIFO batch:31 DMA32 zone: 0 pages, LIFO batch:0 Normal zone: 0 pages, LIFO batch:0 HighMem zone: 0 pages, LIFO batch:0 To conserve dmesg space why not print only the non zero zones. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] mm: make page migration dependent on swap and NUMAChristoph Lameter2006-03-25
| | | | | | | | | | | | | | | | | | | | The page migration code could function without NUMA but we currently have no users for the non-NUMA case. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] slab: fix memory leak in alloc_kmemlistChristoph Lameter2006-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have had this memory leak for a while now. The situation is complicated by the use of alloc_kmemlist() as a function to resize various caches by do_tune_cpucache(). What we do here is first of all make sure that we deallocate properly in the loop over all the nodes. If we are just resizing caches then we can simply return with -ENOMEM if an allocation fails. If the cache is new then we need to rollback and remove all earlier allocations. We detect that a cache is new by checking if the link to the global cache chain has been setup. This is a bit hackish .... (also fix up too overlong lines that I added in the last patch...) Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] alloc_kmemlist: Some cleanup in preparation for a real memory leak fixChristoph Lameter2006-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inspired by Jesper Juhl's patch from today 1. Get rid of err We do not set it to anything else but zero. 2. Drop the CONFIG_NUMA stuff. There are definitions for alloc_alien_cache and free_alien_cache() that do the right thing for the non NUMA case. 3. Better naming of variables. 4. Remove redundant cachep->nodelists[node] expressions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] slab: Bypass free lists for __drain_alien_cache()Christoph Lameter2006-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __drain_alien_cache() currently drains objects by freeing them to the (remote) freelists of the original node. However, each node also has a shared list containing objects to be used on any processor of that node. We can avoid a number of remote node accesses by copying the pointers to the free objects directly into the remote shared array. And while we are at it: Skip alien draining if the alien cache spinlock is already taken. Kiran reported that this is a performance benefit. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] slab: add transfer_objects() functionChristoph Lameter2006-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | slabr_objects() can be used to transfer objects between various object caches of the slab allocator. It is currently only used during __cache_alloc() to retrieve elements from the shared array. We will be using it soon to transfer elements from the alien caches to the remote shared array. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] mm: use kmem_cache_zallocPekka Enberg2006-03-25
| | | | | | | | | | | | | | | | Convert mm/ to use the new kmem_cache_zalloc allocator. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] slab: optimize constant-size kzalloc callsPekka Enberg2006-03-25
| | | | | | | | | | | | | | | | | | | | As suggested by Eric Dumazet, optimize kzalloc() calls that pass a compile-time constant size. Please note that the patch increases kernel text slightly (~200 bytes for defconfig on x86). Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] slab: introduce kmem_cache_zalloc allocatorPekka Enberg2006-03-25
| | | | | | | | | | | | | | | | | | | | Introduce a memory-zeroing variant of kmem_cache_alloc. The allocator already exits in XFS and there are potential users for it so this patch makes the allocator available for the general public. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] slab: implement /proc/slab_allocatorsAl Viro2006-03-25
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] strndup_user()Davi Arnaut2006-03-24
| | | | | | | | | | | This patch series creates a strndup_user() function to easy copying C strings from userspace. Also we avoid common pitfalls like userspace modifying the final \0 after the strlen_user(). Signed-off-by: Davi Arnaut <davi.arnaut@gmail.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] msync(): use do_fsync()Andrew Morton2006-03-24
| | | | | | | | | No need to duplicate all that code. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] msync: fix return valueAndrew Morton2006-03-24
| | | | | | | | | | | | | | | | | | | | | | msync() does a strange thing. Essentially: vma = find_vma(); for ( ; ; ) { if (!vma) return -ENOMEM; ... vma = vma->vm_next; } so an msync() request which starts within or before a valid VMA and which ends within or beyond the final VMA will incorrectly return -ENOMEM. Fix. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] msync(MS_SYNC): don't hold mmap_sem while syncingAndrew Morton2006-03-24
| | | | | | | | | | It seems bad to hold mmap_sem while performing synchronous disk I/O. Alter the msync(MS_SYNC) code so that the lock is released while we sync the file. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] msync(): perform dirty page levellingAndrew Morton2006-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems sensible to perform dirty page throttling in msync: as the application dirties pages we can kick off pdflush early, or even force the msync() caller to perform writeout, or even throttle the msync() caller. The main effect of this is to start disk writeback earlier if we've just discovered that a large amount of pagecache has been dirtied. (Otherwise it wouldn't happen for up to five seconds, next time pdflush wakes up). It also will cause the page-dirtying process to get panalised for dirtying those pages rather than whacking someone else with the problem. We should do this for munmap() and possibly even exit(), too. We drop the mmap_sem while performing the dirty page balancing. It doesn't seem right to hold mmap_sem for that long. Note that this patch only affects MS_ASYNC. MS_SYNC will be syncing all the dirty pages anyway. We note that msync(MS_SYNC) does a full-file-sync inside mmap_sem, and always has. We can fix that up... The patch also tightens up the mmap_sem coverage in sys_msync(): no point in taking it while we perform the incoming arg checking. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] set_page_dirty() return value fixesAndrew Morton2006-03-24
| | | | | | | | | | | | We need set_page_dirty() to return true if it actually transitioned the page from a clean to dirty state. This wasn't right in a couple of places. Do a kernel-wide audit, fix things up. This leaves open the possibility of returning a negative errno from set_page_dirty() sometime in the future. But we don't do that at present. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] balance_dirty_pages_ratelimited: take nr_pages argAndrew Morton2006-03-24
| | | | | | | | Modify balance_dirty_pages_ratelimited() so that it can take a number-of-pages-which-I-just-dirtied argument. For msync(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fadvise(): write commandsAndrew Morton2006-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add two new linux-specific fadvise extensions(): LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file offsets `offset' and `offset+len'. Any pages which are currently under writeout are skipped, whether or not they are dirty. LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file offsets `offset' and `offset+len'. By combining these two operations the application may do several things: LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk. LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty pages at the disk. LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all of the currently dirty pages at the disk, wait until they have been written. It should be noted that none of these operations write out the file's metadata. So unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees here that the data will be available after a crash. To complete this suite of operations I guess we should have a "sync file metadata only" operation. This gives applications access to all the building blocks needed for all sorts of sync operations. But sync-metadata doesn't fit well with the fadvise() interface. Probably it should be a new syscall: sys_fmetadatasync(). The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64(). It is made to represent that last affected byte in the file (ie: it is inclusive). Generally, all these byterange and pagerange functions are inclusive so we can easily represent EOF with -1. As Ulrich notes, these two functions are somewhat abusive of the fadvise() concept, which appears to be "set the future policy for this fd". But these commands are a perfect fit with the fadvise() impementation, and several of the existing fadvise() commands are synchronous and don't affect future policy either. I think we can live with the slight incongruity. Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] filemap_fdatawrite_range() api: clarify -end parameterAndrew Morton2006-03-24
| | | | | | | | | I had trouble understanding working out whether filemap_fdatawrite_range()'s `end' parameter describes the last-byte-to-be-written or the last-plus-one. Clarify that in comments. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset: memory_spread_slab drop useless PF_SPREAD_PAGE checkPaul Jackson2006-03-24
| | | | | | | | | | | | | | | | | | | The hook in the slab cache allocation path to handle cpuset memory spreading for tasks in cpusets with 'memory_spread_slab' enabled has a modest performance bug. The hook calls into the memory spreading handler alternate_node_alloc() if either of 'memory_spread_slab' or 'memory_spread_page' is enabled, even though the handler does nothing (albeit harmlessly) for the page case Fix - drop PF_SPREAD_PAGE from the set of flag bits that are used to trigger a call to alternate_node_alloc(). The page case is handled by separate hooks -- see the calls conditioned on cpuset_do_page_mem_spread() in mm/filemap.c Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset memory spread slab cache optimizationsPaul Jackson2006-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | The hooks in the slab cache allocator code path for support of NUMA mempolicies and cpuset memory spreading are in an important code path. Many systems will use neither feature. This patch optimizes those hooks down to a single check of some bits in the current tasks task_struct flags. For non NUMA systems, this hook and related code is already ifdef'd out. The optimization is done by using another task flag, set if the task is using a non-default NUMA mempolicy. Taking this flag bit along with the PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset memory spreading' patch set, one can check for the combination of any of these special case memory placement mechanisms with a single test of the current tasks task_struct flags. This patch also tightens up the code, to save a few bytes of kernel text space, and moves some of it out of line. Due to the nested inlines called from multiple places, we were ending up with three copies of this code, which once we get off the main code path (for local node allocation) seems a bit wasteful of instruction memory. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset memory spread slab cache implementationPaul Jackson2006-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide the slab cache infrastructure to support cpuset memory spreading. See the previous patches, cpuset_mem_spread, for an explanation of cpuset memory spreading. This patch provides a slab cache SLAB_MEM_SPREAD flag. If set in the kmem_cache_create() call defining a slab cache, then any task marked with the process state flag PF_MEMSPREAD will spread memory page allocations for that cache over all the allowed nodes, instead of preferring the local (faulting) node. On systems not configured with CONFIG_NUMA, this results in no change to the page allocation code path for slab caches. On systems with cpusets configured in the kernel, but the "memory_spread" cpuset option not enabled for the current tasks cpuset, this adds a call to a cpuset routine and failed bit test of the processor state flag PF_SPREAD_SLAB. For tasks so marked, a second inline test is done for the slab cache flag SLAB_MEM_SPREAD, and if that is set and if the allocation is not in_interrupt(), this adds a call to to a cpuset routine that computes which of the tasks mems_allowed nodes should be preferred for this allocation. ==> This patch adds another hook into the performance critical code path to allocating objects from the slab cache, in the ____cache_alloc() chunk, below. The next patch optimizes this hook, reducing the impact of the combined mempolicy plus memory spreading hooks on this critical code path to a single check against the tasks task_struct flags word. This patch provides the generic slab flags and logic needed to apply memory spreading to a particular slab. A subsequent patch will mark a few specific slab caches for this placement policy. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset memory spread page cache implementation and hooksPaul Jackson2006-03-24
| | | | | | | | | | | | | | | | | | | | | | | | Change the page cache allocation calls to support cpuset memory spreading. See the previous patch, cpuset_mem_spread, for an explanation of cpuset memory spreading. On systems without cpusets configured in the kernel, this is no change. On systems with cpusets configured in the kernel, but the "memory_spread" cpuset option not enabled for the current tasks cpuset, this adds a call to a cpuset routine and failed bit test of the processor state flag PF_SPREAD_PAGE. On tasks in cpusets with "memory_spread" enabled, this adds a call to a cpuset routine that computes which of the tasks mems_allowed nodes should be preferred for this allocation. If memory spreading applies to a particular allocation, then any other NUMA mempolicy does not apply. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpusets: only wakeup kswapd for zones in the current cpusetChristoph Lameter2006-03-24
| | | | | | | | | | | If we get under some memory pressure in a cpuset (we only scan zones that are in the cpuset for memory) then kswapd is woken up for all zones. This patch only wakes up kswapd in zones that are part of the current cpuset. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Represent laptop_mode as jiffies internallyBart Samwel2006-03-24
| | | | | | | | | | | | | Make that the internal value for /proc/sys/vm/laptop_mode is stored as jiffies instead of seconds. Let the sysctl interface do the conversions, instead of doing on-the-fly conversions every time the value is used. Add a description of the fact that laptop_mode doubles as a flag and a timeout to the comment above the laptop_mode variable. Signed-off-by: Bart Samwel <bart@samwel.tk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Represent dirty_*_centisecs as jiffies internallyBart Samwel2006-03-24
| | | | | | | | | | | | | | | | | | | | | Make that the internal values for: /proc/sys/vm/dirty_writeback_centisecs /proc/sys/vm/dirty_expire_centisecs are stored as jiffies instead of centiseconds. Let the sysctl interface do the conversions with full precision using clock_t_to_jiffies, instead of doing overflow-sensitive on-the-fly conversions every time the values are used. Cons: apparent precision loss if HZ is not a multiple of 100, because of conversion back and forth. This is a common problem for all sysctl values that use proc_dointvec_userhz_jiffies. (There is only one other in-tree use, in net/core/neighbour.c.) Signed-off-by: Bart Samwel <bart@samwel.tk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23Jens Axboe2006-03-23
| | | | Signed-off-by: Jens Axboe <axboe@suse.de>
* [PATCH] ext3_readdir: use generic readaheadAndrew Morton2006-03-23
| | | | | | | | | | | | | | | | | | | | | | Linus points out that ext3_readdir's readahead only cuts in when ext3_readdir() is operating at the very start of the directory. So for large directories we end up performing no readahead at all and we suck. So take it all out and use the core VM's page_cache_readahead(). This means that ext3 directory reads will use all of readahead's dynamic sizing goop. Note that we're using the directory's filp->f_ra to hold the readahead state, but readahead is actually being performed against the underlying blockdev's address_space. Fortunately the readahead code is all set up to handle this. Tested with printk. It works. I was struggling to find a real workload which actually cared. (The patch also exports page_cache_readahead() to GPL modules) Cc: "Stephen C. Tweedie" <sct@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: userland interfaceRafael J. Wysocki2006-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces a user space interface for swsusp. The interface is based on a special character device, called the snapshot device, that allows user space processes to perform suspend and resume-related operations with the help of some ioctls and the read()/write() functions.  Additionally it allows these processes to allocate free swap pages from a selected swap partition, called the resume partition, so that they know which sectors of the resume partition are available to them. The interface uses the same low-level system memory snapshot-handling functions that are used by the built-it swap-writing/reading code of swsusp. The interface documentation is included in the patch. The patch assumes that the major and minor numbers of the snapshot device will be 10 (ie. misc device) and 231, the registration of which has already been requested. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: low level interfaceRafael J. Wysocki2006-03-23
| | | | | | | | | | | | | | | | | | Introduce the low level interface that can be used for handling the snapshot of the system memory by the in-kernel swap-writing/reading code of swsusp and the userland interface code (to be introduced shortly). Also change the way in which swsusp records the allocated swap pages and, consequently, simplifies the in-kernel swap-writing/reading code (this is necessary for the userland interface too). To this end, it introduces two helper functions in mm/swapfile.c, so that the swsusp code does not refer directly to the swap internals. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>