aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
* block: replace @ext_minors with GENHD_FL_EXT_DEVTTejun Heo2008-10-09
| | | | | | | | | | | | | | | With previous changes, it's meaningless to limit the number of partitions. Replace @ext_minors with GENHD_FL_EXT_DEVT such that setting the flag allows the disk to have maximum number of allowed partitions (only limited by the number of entries in parsed_partitions as determined by MAX_PART constant). This kills not-too-pretty alloc_disk_ext[_node]() functions and makes @minors parameter to alloc_disk[_node]() unnecessary. The parameter is left alone to avoid disturbing the users. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: make partition array dynamicTejun Heo2008-10-09
| | | | | | | | | | | | disk->__part used to be statically allocated to the maximum possible number of partitions. This patch makes partition array allocation dynamic. The added overhead is minimal as only real change is one memory dereference changed to RCU one. This saves both a bit of memory and cpu cycles iterating through unoccupied slots and makes increasing partition limit easier. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: move stats from disk to part0Tejun Heo2008-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_*() now updates part0 together if the specified partition is not part0. ie. part_stat_*() are now essentially all_stat_*(). * {disk|all}_stat_*() are gone. * part_round_stats() is updated similary. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc|dec}_in_fligh() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: kill GENHD_FL_FAIL and use part0->make_it_failTejun Heo2008-10-09
| | | | | | | | GENHD_FL_FAIL for disk is what make_it_fail is for parts. Kill it and use part0->make_it_fail. Sysfs node handling is unified too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: always set bdev->bd_partTejun Heo2008-10-09
| | | | | | | | | Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differenciate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: move holder_dir from disk to part0Tejun Heo2008-10-09
| | | | | | | | | | | | | Move disk->holder_dir to part0->holder_dir. Kill now mostly superflous bdev_get_holder(). While at it, kill superflous kobject_get/put() around holder_dir, slave_dir and cmd_filter creation and collapse disk_sysfs_add_subdirs() into register_disk(). These serve no purpose but obfuscating the code. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: move policy from disk to part0Tejun Heo2008-10-09
| | | | | | | Move disk->policy to part0->policy. Implement and use get_disk_ro(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: unify sysfs size node handlingTejun Heo2008-10-09
| | | | | | | | Now that capacity and __dev are moved to part0, part0 and others can share the same method. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: move capacity from disk to part0Tejun Heo2008-10-09
| | | | | | | | | Move disk->capacity to part0->nr_sects and convert all users who directly accessed the field to use {get|set}_capacity(). This is done early to allow the __dev field to be moved. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: introduce partition 0Tejun Heo2008-10-09
| | | | | | | | | | | | | | | | | | | | | | | genhd and partition code handled disk and partitions separately. All information about the whole disk was in struct genhd and partitions in struct hd_struct. However, the whole disk (part0) and other partitions have a lot in common and the data structures end up having good number of common fields and thus separate code paths doing the same thing. Also, the partition array was indexed by partno - 1 which gets pretty confusing at times. This patch introduces partition 0 and makes the partition array indexed by partno. Following patches will unify the handling of disk and parts piece-by-piece. This patch also implements disk_partitionable() which tests whether a disk is partitionable. With coming dynamic partition array change, the most common usage of disk_max_parts() will be testing whether a disk is partitionable and the number of max partitions will become much less important. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: implement and use {disk|part}_to_dev()Tejun Heo2008-10-09
| | | | | | | | | | | | Implement {disk|part}_to_dev() and use them to access generic device instead of directly dereferencing {disk|part}->dev. To make sure no user is left behind, rename generic devices fields to __dev. This is in preparation of unifying partition 0 handling with other partitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: implement extended dev numbersTejun Heo2008-10-09
| | | | | | | | | | | | | | | | | | | | Implement extended device numbers. A block driver can tell block layer that it wants to use extended device numbers. After the usual minor space is used up, block layer automatically allocates devt's from EXT_BLOCK_MAJOR. Currently only one major number is allocated for this but as the allocation is strictly on-demand, ~1mil minor space under it should suffice unless the system actually has more than ~1mil partitions and if that ever happens adding more majors to the extended devt area is easy. Due to internal implementation issues, the first partition can't be allocated on the extended area. In other words, genhd->minors should at least be 1. This limitation will be lifted by later changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: fix diskstats accessTejun Heo2008-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemtion is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disables preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now has @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: fix disk->part[] dereferencing raceTejun Heo2008-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | disk->part[] is protected by its matching bdev's lock. However, non-critical accesses like collecting stats and printing out sysfs and proc information used to be performed without any locking. As partitions can come and go dynamically, partitions can go away underneath those non-critical accesses. As some of those accesses are writes, this theoretically can lead to silent corruption. This patch fixes the race by using RCU for the partition array and dev reference counter to hold partitions. * Rename disk->part[] to disk->__part[] to make sure no one outside genhd layer proper accesses it directly. * Use RCU for disk->__part[] dereferencing. * Implement disk_{get|put}_part() which can be used to get and put partitions from gendisk respectively. * Iterators are implemented to help iterate through all partitions safely. * Functions which require RCU readlock are marked with _rcu suffix. * Use disk_put_part() in __blkdev_put() instead of directly putting the contained kobject. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: don't depend on consecutive minor spaceTejun Heo2008-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | * Implement disk_devt() and part_devt() and use them to directly access devt instead of computing it from ->major and ->first_minor. Note that all references to ->major and ->first_minor outside of block layer is used to determine devt of the disk (the part0) and as ->major and ->first_minor will continue to represent devt for the disk, converting these users aren't strictly necessary. However, convert them for consistency. * Implement disk_max_parts() to avoid directly deferencing genhd->minors. * Update bdget_disk() such that it doesn't assume consecutive minor space. * Move devt computation from register_disk() to add_disk() and make it the only one (all other usages use the initially determined value). These changes clean up the code and will help disk->part dereference fix and extended block device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: make variable and argument names more consistentTejun Heo2008-10-09
| | | | | | | | | | | | | | In hd_struct, @partno is used to denote partition number and a number of other places use @part to denote hd_struct. Functions use @part and @index instead. This causes confusion and makes it difficult to use consistent variable names for hd_struct. Always use @partno if a variable represents partition number. Also, print out functions use @f or @part for seq_file argument. Use @seqf uniformly instead. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: update add_partition() error handlingTejun Heo2008-10-09
| | | | | | | | | | | | | | | | | | | | | | | | d805dda4 tried to fix error case handling in add_partition() but had a few problems. * disk->part[] entry is set early and left dangling if operation fails. * Once device initialized, the last put_device() is responsible for freeing all the resources. The failure path freed part_stats and p regardless of put_device() causing double free. * holders subdir holds reference to the disk device, so failure path should remove it to release resources properly which was missing. This patch fixes the above problems and while at it move partition slot busy check into add_partition() for completeness and inlines holders subdirectory creation. Using separate function for it just obfuscates the code. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Abdel Benamrouche <draconux@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: allow deleting zero length partitionTejun Heo2008-10-09
| | | | | | | | | | | | | | delete_partition() was noop for zero length partition. As the addition code allows creating zero lenght partition and deletion is assumed to always succeed, this causes memory leak for zero length partitions. Allow zero length partitions to end their meaningless lives. While at it, allow deleting zero lenght partition via BLKPG_DEL_PARTITION ioctl too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* drop vmerge accountingMikulas Patocka2008-10-09
| | | | | | | | Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: drop virtual merging accountingMikulas Patocka2008-10-09
| | | | | | | Remove virtual merge accounting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Let the block device know when sectors can be discardedDavid Woodhouse2008-10-09
| | | | | | | | [hirofumi@mail.parknet.co.jp: discard _after_ checking for corrupt chains] Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* mm: tiny-shmem nommu fixNick Piggin2008-10-02
| | | | | | | | | | | | | | | | | | | | The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* inotify: fix lock ordering wrt do_page_fault's mmap_semNick Piggin2008-10-02
| | | | | | | | | | | | | Fix inotify lock order reversal with mmap_sem due to holding locks over copy_to_user. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Tested-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm owner: fix race between swapoff and exitBalbir Singh2008-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration. CPU0 CPU1 try_to_unuse looks at mm = task0->mm increments mm->mm_users task 0 exits mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm) mm_update_next_owner() leaves mmput(mm) decrements mm->mm_users task0 freed dereferencing mm->owner fails The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL. Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops. Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops. Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix NULL pointer dereference in proc_sys_compareLinus Torvalds2008-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The VFS interface for the 'd_compare()' is a bit special (read: 'odd'), because it really just essentially replaces a memcmp(). The filesystem is supposed to just compare the two names with whatever case-independent or other function. And when I say 'is supposed to', I obviously mean that 'procfs does odd things, and actually looks at the dentry that we don't even pass down, rather than just the name'. Which results in problems, because we actually call d_compare before we have even verified that the dentry is still hashed at all. And that causes a problm since the inode that procfs looks at may have been free'd and the d_inode pointer is NULL. procfs just assumes that all dentries are positive, since procfs itself never generates a negative one. But memory pressure will still result in the dentry getting torn down, and as it is removed by RCU, it still remains visible on some lists - and to d_compare. If the filesystem just did a name comparison, we wouldn't care. And we could just fix procfs to know about negative dentries too. But rather than have the low-level filesystems know about internal VFS details, just move the check for a unhashed dentry up a bit, so that we will only call d_compare on dentries that are still active. The actual oops this caused didn't look like a NULL pointer dereference because procfs did a 'container_of(inode, struct proc_inode, vfs_inode)' to get at its internal proc_inode information from the inode pointer, and accessed a field below the inode. So the oops would look something like BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50 and was seen on both x86-64 (Alexey Dobriyan and Hugh Dickins) and ppc64 (Hugh Dickins). Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-of-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://oss.sgi.com:8090/xfs/linux-2.6Linus Torvalds2008-09-26
|\ | | | | | | | | | | * git://oss.sgi.com:8090/xfs/linux-2.6: [XFS] Remove xfs_iext_irec_compact_full() [XFS] Fix extent list corruption in xfs_iext_irec_compact_full().
| * [XFS] Remove xfs_iext_irec_compact_full()Lachlan McIlroy2008-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yet another bug was found in xfs_iext_irec_compact_full() and while the source of the bug was found it wasn't an easy task to track it down because the conditions are very difficult to reproduce. A HUGE thank-you goes to Russell Cattelan and Eric Sandeen for their significant effort in tracking down the source of this corruption. xfs_iext_irec_compact_full() and xfs_iext_irec_compact_pages() are almost identical - they both compact indirect extent lists by moving extents from subsequent buffers into earlier ones. xfs_iext_irec_compact_pages() only moves extents if all of the extents in the next buffer will fit into the empty space in the buffer before it. xfs_iext_irec_compact_full() will go a step further and move part of the next buffer if all the extents wont fit. It will then shift the remaining extents in the next buffer up to the start of the buffer. The bug here was that we did not update er_extoff and this caused extent list corruption. It does not appear that this extra functionality gains us much. Calling xfs_iext_irec_compact_pages() instead will do a good enough job at compacting the indirect list and will be quicker too. For the case in xfs_iext_indirect_to_direct() the total number of extents in the indirect list will fit into one buffer so we will never need the extra functionality of xfs_iext_irec_compact_full() there. Also xfs_iext_irec_compact_pages() doesn't need to do a memmove() (the buffers will never overlap) so we don't want the performance hit that can incur. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32166a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
| * [XFS] Fix extent list corruption in xfs_iext_irec_compact_full().Lachlan McIlroy2008-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If we don't move all the records from the next buffer into the current buffer then we need to update the er_extoff field of the next buffer as we shift the remaining records to the start of the buffer. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32165a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Russell Cattelan <cattelan@thebarn.com>
* | Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6Linus Torvalds2008-09-26
|\ \ | |/ |/| | | | | | | | | | | * 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6: UBIFS: fix printk format warnings UBIFS: remove incorrect assert UBIFS: TNC / GC race fixes UBIFS: create the name of the background thread in every case
| * UBIFS: fix printk format warningsAlexander Beregalov2008-09-18
| | | | | | | | | | | | | | | | | | | | | | fs/ubifs/dir.c:428: warning: format '%llu' expects type 'long long unsigned int', but argument 5 has type 'long unsigned int' fs/ubifs/debug.c:541: warning: format '%llu' expects type 'long long unsigned int', but argument 2 has type 'long unsigned int' Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| * UBIFS: remove incorrect assertAdrian Hunter2008-09-17
| | | | | | | | | | | | | | | | | | The assert was not valid because one of the variables 'taken_empty_lebs' has transient values out of sync with the other variables. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| * UBIFS: TNC / GC race fixesAdrian Hunter2008-09-17
| | | | | | | | | | | | | | | | | | - update GC sequence number if any nodes may have been moved even if GC did not finish the LEB - don't ignore error return when reading Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| * UBIFS: create the name of the background thread in every caseSebastian Siewior2008-09-17
| | | | | | | | | | | | | | | | | | If the ubifs partition is mounted RO and then remounted RW we end up with no thread name in ubifs_remount_rw() and the thread appears nameless. Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* | 9p: use an IS_ERR test rather than a NULL testJulien Brunel2008-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of error, the function p9_client_walk returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes after an IS_ERR test should be deleted. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @match_bad_null_test@ expression x, E; statement S1,S2; @@ x = p9_client_walk(...) ... when != x = E * if (x != NULL) S1 else S2 // </smpl> Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | [XFS] Don't do I/O beyond eof when unreserving spaceLachlan McIlroy2008-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When unreserving space with boundaries that are not block aligned we round up the start and round down the end boundaries and then use this function, xfs_zero_remaining_bytes(), to zero the parts of the blocks that got dropped during the rounding. The problem is we don't consider if these blocks are beyond eof. Worse still is if we encounter delayed allocations beyond eof we will try to use the magic delayed allocation block number as a real block number. If the file size is ever extended to expose these blocks then we'll go through xfs_zero_eof() to zero them anyway. SGI-PV: 983683 SGI-Modid: xfs-linux-melb:xfs-kern:32055a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | [XFS] Fix use-after-free with buffersLachlan McIlroy2008-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a use-after-free issue where log completions access buffers via the buffer log item and the buffer has already been freed. Fix this by taking a reference on the buffer when attaching the buffer log item and release the hold when the buffer log item is detached and we no longer need the buffer. Also create a new function xfs_buf_item_free() to combine some common code. SGI-PV: 985757 SGI-Modid: xfs-linux-melb:xfs-kern:32025a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | [XFS] Prevent lockdep false positives when locking two inodes.David Chinner2008-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we call xfs_lock_two_inodes() to grab both the iolock and the ilock, then drop the ilocks on both inodes, then grab them again (as xfs_swap_extents() does) then lockdep will report a locking order problem. This is a false positive. To avoid this, disallow xfs_lock_two_inodes() fom locking both inode locks at once - force calers to make two separate calls. This means that nested dropping and regaining of the ilocks will retain the same lockdep subclass and so lockdep will not see anything wrong with this code. SGI-PV: 986238 SGI-Modid: xfs-linux-melb:xfs-kern:31999a Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Peter Leckie <pleckie@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* | [XFS] Fix barrier status change detection.David Chinner2008-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code in xlog_iodone() uses the wrong macro to check if the barrier has been cleared due to an EOPNOTSUPP error form the lower layer. SGI-PV: 986143 SGI-Modid: xfs-linux-melb:xfs-kern:31984a Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net> Signed-off-by: Peter Leckie <pleckie@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* | [XFS] Prevent direct I/O from mapping extents beyond eofLachlan McIlroy2008-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the help from some tracing I found that we try to map extents beyond eof when doing a direct I/O read. It appears that the way to inform the generic direct I/O path (ie do_direct_IO()) that we have breached eof is to return an unmapped buffer from xfs_get_blocks_direct(). This will cause do_direct_IO() to jump to the hole handling code where is will check for eof and then abort. This problem was found because a direct I/O read was trying to map beyond eof and was encountering delayed allocations. The delayed allocations beyond eof are speculative allocations and they didn't get converted when the direct I/O flushed the file because there was only enough space in the current AG to convert and write out the dirty pages within eof. Note that xfs_iomap_write_allocate() wont necessarily convert all the delayed allocation passed to it - it will return after allocating the first extent - so if the delayed allocation extends beyond eof then it will stay that way. SGI-PV: 983683 SGI-Modid: xfs-linux-melb:xfs-kern:31929a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | [XFS] Fix regression introduced by remount fixupChristoph Hellwig2008-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Logically we would return an error in xfs_fs_remount code to prevent users from believing they might have changed mount options using remount which can't be changed. But unfortunately mount(8) adds all options from mtab and fstab to the mount arguments in some cases so we can't blindly reject options, but have to check for each specified option if it actually differs from the currently set option and only reject it if that's the case. Until that is implemented we return success for every remount request, and silently ignore all options that we can't actually change. SGI-PV: 985710 SGI-Modid: xfs-linux-melb:xfs-kern:31908a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* | [XFS] Move memory allocations for log tracing out of the critical pathLachlan McIlroy2008-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory allocations for log->l_grant_trace and iclog->ic_trace are done on demand when the first event is logged. In xlog_state_get_iclog_space() we call xlog_trace_iclog() under a spinlock and allocating memory here can cause us to sleep with a spinlock held and deadlock the system. For the log grant tracing we use KM_NOSLEEP but that means we can lose trace entries. Since there is no locking to serialize the log grant tracing we could race and have multiple allocations and leak memory. So move the allocations to where we initialize the log/iclog structures. Use KM_NOFS to avoid recursing into the filesystem and drop log->l_trace since it's not even used. SGI-PV: 983738 SGI-Modid: xfs-linux-melb:xfs-kern:31896a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | rescan_partitions(): make device capacity errors non-fatalAndrew Morton2008-09-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Herton Krzesinski reports that the error-checking changes in 04ebd4aee52b06a2c38127d9208546e5b96f3a19 ("block/ioctl.c and fs/partition/check.c: check value returned by add_partition") cause his buggy USB camera to no longer mount. "The camera is an Olympus X-840. The original issue comes from the camera itself: its format program creates a partition with an off by one error". Buggy devices happen. It is better for the kernel to warn and to proceed with the mount. Reported-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br> Cc: Abdel Benamrouche <draconux@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: ifdef Quicklists in /proc/meminfoHugh Dickins2008-09-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A "Quicklists: 0 kB" line has just started appearing in /proc/meminfo, but most architectures (including x86) don't have them configured, so #ifdef it, like the highmem lines. And those architectures which do have quicklists configured are using them for page tables: so let's place it next to PageTables. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bfs: fix Lockdep warningEric Sesterhenn2008-09-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes: ============================================= [ INFO: possible recursive locking detected ] 2.6.27-rc5-00283-g70bb089 #68 --------------------------------------------- touch/6855 is trying to acquire lock: (&info->bfs_lock){--..}, at: [<c02262f5>] bfs_delete_inode+0x9e/0x18c but task is already holding lock: (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187 other info that might help us debug this: 2 locks held by touch/6855: #0: (&type->i_mutex_dir_key#5){--..}, at: [<c018ad13>] do_filp_open+0x10b/0x62f #1: (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187 stack backtrace: Pid: 6855, comm: touch Not tainted 2.6.27-rc5-00283-g70bb089 #68 [<c013e769>] validate_chain+0x458/0x9f4 [<c013bece>] ? trace_hardirqs_off+0xb/0xd [<c013f36b>] __lock_acquire+0x666/0x6e0 [<c013f440>] lock_acquire+0x5b/0x77 [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c [<c06aab74>] mutex_lock_nested+0xbc/0x234 [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c [<c02262f5>] bfs_delete_inode+0x9e/0x18c [<c0226257>] ? bfs_delete_inode+0x0/0x18c [<c01925e1>] generic_delete_inode+0x94/0xfe [<c019265d>] generic_drop_inode+0x12/0x12f [<c0191b7e>] iput+0x4b/0x4e [<c0226d1e>] bfs_create+0x163/0x187 [<c0188b42>] vfs_create+0xa6/0x114 [<c018adb5>] do_filp_open+0x1ad/0x62f [<c0107cdc>] ? native_sched_clock+0x82/0x96 [<c06ac309>] ? _spin_unlock+0x27/0x3c [<c019379e>] ? alloc_fd+0xbf/0xc9 [<c06ae2f4>] ? sub_preempt_count+0x9d/0xab [<c019379e>] ? alloc_fd+0xbf/0xc9 [<c0180391>] do_sys_open+0x42/0xb8 [<c041d564>] ? trace_hardirqs_on_thunk+0xc/0x10 [<c0180449>] sys_open+0x1e/0x26 [<c01038bd>] sysenter_do_call+0x12/0x31 ======================= The problem is that we don't unlock the bfs->lock mutex before calling iput (we do in the other cases). Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc: more debugging for "already registered" caseAlexey Dobriyan2008-09-13
| | | | | | | | | | | | | | | | | | | | | | | | | | Print parent directory name as well. The aim is to catch non-creation of parent directory when proc_mkdir will return NULL and all subsequent registrations go directly in /proc instead of intended directory. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Fixed insane printk string while at it. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for_linus' of ↵Linus Torvalds2008-09-11
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: udf: add llseek method udf: Fix error paths in udf_new_inode() udf: Fix lock inversion between iprune_mutex and alloc_mutex (v2)
| * | udf: add llseek methodChristoph Hellwig2008-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UDF currently doesn't set a llseek method for regular files, which means it will fall back to default_llseek. This means no one can seek beyond 2 Gigabytes on udf, and that there's not protection vs the i_size updates from writers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * | udf: Fix error paths in udf_new_inode()Jan Kara2008-08-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I case we failed to allocate memory for inode when creating it, we did not properly free block already allocated for this inode. Move memory allocation before the block allocation which fixes this issue (thanks for the idea go to Ingo Oeser <ioe-lkml@rameria.de>). Also remove a few superfluous initializations already done in udf_alloc_inode(). Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * | udf: Fix lock inversion between iprune_mutex and alloc_mutex (v2)Jan Kara2008-08-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A memory allocation inside alloc_mutex must not recurse back into the filesystem itself because that leads to lock inversion between iprune_mutex and alloc_mutex (and thus to deadlocks - see traces below). alloc_mutex is actually needed only to update allocation statistics in the superblock so we can drop it before we start allocating memory for the inode. tar D ffff81015b9c8c90 0 6614 6612 ffff8100d5a21a20 0000000000000086 0000000000000000 00000000ffff0000 ffff81015b9c8c90 ffff81015b8f0cd0 ffff81015b9c8ee0 0000000000000000 0000000000000003 0000000000000000 0000000000000000 0000000000000000 Call Trace: [<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b [<ffffffff803c1bef>] mutex_lock+0xa/0xb [<ffffffff8027f8c2>] shrink_icache_memory+0x38/0x200 [<ffffffff80257742>] shrink_slab+0xe3/0x15b [<ffffffff802579db>] try_to_free_pages+0x221/0x30d [<ffffffff8025657e>] isolate_pages_global+0x0/0x31 [<ffffffff8025324b>] __alloc_pages_internal+0x252/0x3ab [<ffffffff8026b08b>] cache_alloc_refill+0x22e/0x47b [<ffffffff8026ae37>] kmem_cache_alloc+0x3b/0x61 [<ffffffff8026b15b>] cache_alloc_refill+0x2fe/0x47b [<ffffffff8026b34e>] __kmalloc+0x76/0x9c [<ffffffffa00751f2>] :udf:udf_new_inode+0x202/0x2e2 [<ffffffffa007ae5e>] :udf:udf_create+0x2f/0x16d [<ffffffffa0078f27>] :udf:udf_lookup+0xa6/0xad ... kswapd0 D ffff81015b9d9270 0 125 2 ffff81015b903c28 0000000000000046 ffffffff8028cbb0 00000000fffffffb ffff81015b9d9270 ffff81015b8f0cd0 ffff81015b9d94c0 000000000271b490 ffffe2000271b458 ffffe2000271b420 ffffe20002728dc8 ffffe20002728d90 Call Trace: [<ffffffff8028cbb0>] __set_page_dirty+0xeb/0xf5 [<ffffffff8025403a>] get_dirty_limits+0x1d/0x22f [<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b [<ffffffff803c1bef>] mutex_lock+0xa/0xb [<ffffffffa0073f58>] :udf:udf_bitmap_free_blocks+0x47/0x1eb [<ffffffffa007df31>] :udf:udf_discard_prealloc+0xc6/0x172 [<ffffffffa007875a>] :udf:udf_clear_inode+0x1e/0x48 [<ffffffff8027f121>] clear_inode+0x6d/0xc4 [<ffffffff8027f7f2>] dispose_list+0x56/0xee [<ffffffff8027fa5a>] shrink_icache_memory+0x1d0/0x200 [<ffffffff80257742>] shrink_slab+0xe3/0x15b [<ffffffff80257e93>] kswapd+0x346/0x447 ... Reported-by: Tibor Tajti <tibor.tajti@gmail.com> Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de> Signed-off-by: Jan Kara <jack@suse.cz>
* | | ocfs2: Fix a bug in direct IO read.Tao Ma2008-09-10
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ocfs2 will become read-only if we try to read the bytes which pass the end of i_size. This can be easily reproduced by following steps: 1. mkfs a ocfs2 volume with bs=4k cs=4k and nosparse. 2. create a small file(say less than 100 bytes) and we will create the file which is allocated 1 cluster. 3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit. 4. The ocfs2 volume becomes read-only and dmesg shows: OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks: Inode 66010 has a hole at block 1 File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted. So suppress the ERROR message. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>