path: root/fs
Commit message (Collapse)AuthorAge
* Ocfs2: Handle empty list in lockres_seq_start() for dlmdebug.cTristan Ye2010-09-10
| | | | | | | | | | This patch tries to handle the case in which list 'dlm->tracking_list' is empty, to avoid accessing an invalid pointer. It fixes the following oops: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1287 Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
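The empty-list case described above can be illustrated with a small userspace sketch. It assumes a kernel-style circular doubly linked list; every name here (`sk_list`, `sk_lockres`, `sk_first_lockres`) is hypothetical and is not taken from dlmdebug.c.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on kernel lists. */
struct sk_list {
    struct sk_list *next, *prev;
};

/* Hypothetical tracked resource that embeds a list node. */
struct sk_lockres {
    int id;
    struct sk_list tracking;
};

void sk_list_init(struct sk_list *head)
{
    head->next = head->prev = head;
}

void sk_list_add_tail(struct sk_list *node, struct sk_list *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

int sk_list_empty(const struct sk_list *head)
{
    return head->next == head;
}

/*
 * Return the first tracked resource, or NULL when the list is empty.
 * Without the emptiness check, the pointer arithmetic below would be
 * applied to the list head itself and yield a pointer to a structure
 * that does not exist.
 */
struct sk_lockres *sk_first_lockres(struct sk_list *head)
{
    assert(head != NULL);
    if (sk_list_empty(head))
        return NULL;
    return (struct sk_lockres *)((char *)head->next -
                                 offsetof(struct sk_lockres, tracking));
}
```

A caller that starts iterating would test the result for NULL before touching any field of the entry.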
* Ocfs2: Re-access the journal after ocfs2_insert_extent() in dxdir codes.Tristan Ye2010-09-10
In ocfs2_dx_dir_rebalance(), we need to call journal_access on the blocks
again after calling ocfs2_insert_extent(), since growing an extent tree may
trigger ocfs2_extend_trans(), which makes the previous journal_access
meaningless.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* ocfs2: Fix lockdep warning in reflink.Tao Ma2010-09-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch change mutex_lock to a new subclass and add a new inode lock subclass for the target inode which caused this lockdep warning. ============================================= [ INFO: possible recursive locking detected ] 2.6.35+ #5 --------------------------------------------- reflink/11086 is trying to acquire lock: (Meta){+++++.}, at: [<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2] but task is already holding lock: (Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2] other info that might help us debug this: 6 locks held by reflink/11086: #0: (&sb->s_type->i_mutex_key#15/1){+.+.+.}, at: [<ffffffff820e09ec>] lookup_create+0x26/0x97 #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa06f99a0>] ocfs2_reflink_ioctl+0x4d3/0x1229 [ocfs2] #2: (Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2] #3: (&oi->ip_xattr_sem){+.+.+.}, at: [<ffffffffa06f9b58>] ocfs2_reflink_ioctl+0x68b/0x1229 [ocfs2] #4: (&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa06f9b67>] ocfs2_reflink_ioctl+0x69a/0x1229 [ocfs2] #5: (&sb->s_type->i_mutex_key#15/2){+.+...}, at: [<ffffffffa06f9d4f>] ocfs2_reflink_ioctl+0x882/0x1229 [ocfs2] stack backtrace: Pid: 11086, comm: reflink Not tainted 2.6.35+ #5 Call Trace: [<ffffffff82063dd9>] validate_chain+0x56e/0xd68 [<ffffffff82062275>] ? mark_held_locks+0x49/0x69 [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1 [<ffffffff82065a81>] lock_acquire+0xc6/0xed [<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2] [<ffffffffa06c9ade>] __ocfs2_cluster_lock+0x975/0xa0d [ocfs2] [<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2] [<ffffffffa06e107b>] ? ocfs2_wait_for_recovery+0x15/0x8a [ocfs2] [<ffffffffa06cb6ea>] ocfs2_inode_lock_full_nested+0x1ac/0xdc5 [ocfs2] [<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2] [<ffffffff820623a0>] ? 
trace_hardirqs_on_caller+0x10b/0x12f [<ffffffff82060193>] ? debug_mutex_free_waiter+0x4f/0x53 [<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2] [<ffffffffa06ce24a>] ? ocfs2_file_lock_res_init+0x66/0x78 [ocfs2] [<ffffffff820bb2d2>] ? might_fault+0x40/0x8d [<ffffffffa06df9f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2] [<ffffffff820ee5d3>] ? mntput_no_expire+0x1d/0xb0 [<ffffffff820e07b3>] ? path_put+0x2c/0x31 [<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d [<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae [<ffffffff8233a7f6>] ? _raw_spin_unlock+0x26/0x2a [<ffffffff8200299c>] ? sysret_check+0x27/0x62 [<ffffffff820e59ab>] sys_ioctl+0x57/0x7a [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
* ocfs2/lockdep: Move ip_xattr_sem out of ocfs2_xattr_get_nolock.Tao Ma2010-09-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the name shows, we shouldn't have any lock in ocfs2_xattr_get_nolock. so lift ip_xattr_sem to the caller. This should be safe for us since the only 2 callers are: 1. ocfs2_xattr_get which will lock the resources. 2. ocfs2_mknod which don't need this locking. And this also resolves the following lockdep warning. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.35+ #5 ------------------------------------------------------- reflink/30027 is trying to acquire lock: (&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2] but task is already holding lock: (&oi->ip_xattr_sem){++++..}, at: [<ffffffffa0673b58>] ocfs2_reflink_ioctl+0x68b/0x1226 [ocfs2] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&oi->ip_xattr_sem){++++..}: [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1 [<ffffffff82065a81>] lock_acquire+0xc6/0xed [<ffffffff82339650>] down_read+0x34/0x47 [<ffffffffa0691cb8>] ocfs2_xattr_get_nolock+0xa0/0x4e6 [ocfs2] [<ffffffffa069d64f>] ocfs2_get_acl_nolock+0x5c/0x132 [ocfs2] [<ffffffffa069d9c7>] ocfs2_init_acl+0x60/0x243 [ocfs2] [<ffffffffa066499d>] ocfs2_mknod+0xae8/0xfea [ocfs2] [<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2] [<ffffffff820e1c83>] vfs_create+0x9b/0xf4 [<ffffffff820e20bb>] do_last+0x2fd/0x5be [<ffffffff820e31c0>] do_filp_open+0x1fb/0x572 [<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7 [<ffffffff820d6dac>] sys_open+0x1b/0x1d [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b -> #2 (jbd2_handle){+.+...}: [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1 [<ffffffff82065a81>] lock_acquire+0xc6/0xed [<ffffffffa0604ff8>] start_this_handle+0x4a3/0x4bc [jbd2] [<ffffffffa06051d6>] jbd2__journal_start+0xba/0xee [jbd2] 
[<ffffffffa0605218>] jbd2_journal_start+0xe/0x10 [jbd2] [<ffffffffa065ca34>] ocfs2_start_trans+0xb7/0x19b [ocfs2] [<ffffffffa06645f3>] ocfs2_mknod+0x73e/0xfea [ocfs2] [<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2] [<ffffffff820e1c83>] vfs_create+0x9b/0xf4 [<ffffffff820e20bb>] do_last+0x2fd/0x5be [<ffffffff820e31c0>] do_filp_open+0x1fb/0x572 [<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7 [<ffffffff820d6dac>] sys_open+0x1b/0x1d [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b -> #1 (&journal->j_trans_barrier){.+.+..}: [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1 [<ffffffff82064fa9>] lock_release_non_nested+0x1e5/0x24b [<ffffffff82065999>] lock_release+0x158/0x17a [<ffffffff823389f6>] __mutex_unlock_slowpath+0xbf/0x11b [<ffffffff82338a5b>] mutex_unlock+0x9/0xb [<ffffffffa0679673>] ocfs2_free_ac_resource+0x31/0x67 [ocfs2] [<ffffffffa067c6bc>] ocfs2_free_alloc_context+0x11/0x1d [ocfs2] [<ffffffffa0633de0>] ocfs2_write_begin_nolock+0x141e/0x159b [ocfs2] [<ffffffffa0635523>] ocfs2_write_begin+0x11e/0x1e7 [ocfs2] [<ffffffff820a1297>] generic_file_buffered_write+0x10c/0x210 [<ffffffffa0653624>] ocfs2_file_aio_write+0x4cc/0x6d3 [ocfs2] [<ffffffff820d822d>] do_sync_write+0xc2/0x106 [<ffffffff820d897b>] vfs_write+0xae/0x131 [<ffffffff820d8e55>] sys_write+0x47/0x6f [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b -> #0 (&oi->ip_alloc_sem){+.+.+.}: [<ffffffff82063f92>] validate_chain+0x727/0xd68 [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1 [<ffffffff82065a81>] lock_acquire+0xc6/0xed [<ffffffff82339694>] down_write+0x31/0x52 [<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2] [<ffffffffa06599f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2] [<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d [<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae [<ffffffff820e59ab>] sys_ioctl+0x57/0x7a [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
* mm: Move vma_stack_continue into mm.hStefan Bader2010-09-09
| | | | | | | So it can be used by all that need to check for that. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'fixes' of git://oss.oracle.com/git/tma/linux-2.6Linus Torvalds2010-09-09
* 'fixes' of git://oss.oracle.com/git/tma/linux-2.6:
  ocfs2: Fix orphan add in ocfs2_create_inode_in_orphan
  ocfs2: split out ocfs2_prepare_orphan_dir() into locking and prep functions
  ocfs2: allow return of new inode block location before allocation of the inode
  ocfs2: use ocfs2_alloc_dinode_update_counts() instead of open coding
  ocfs2: split out inode alloc code from ocfs2_mknod_locked
  Ocfs2: Fix a regression bug from mainline commit(6b933c8e6f1a2f3118082c455eef25f9b1ac7b45).
  ocfs2: Fix deadlock when allocating page
  ocfs2: properly set and use inode group alloc hint
  ocfs2: Use the right group in nfs sync check.
  ocfs2: Flush drive's caches on fdatasync
  ocfs2: make __ocfs2_page_mkwrite handle file end properly.
  ocfs2: Fix incorrect checksum validation error
  ocfs2: Fix metaecc error messages
| * ocfs2: Fix orphan add in ocfs2_create_inode_in_orphanMark Fasheh2010-09-08
ocfs2_create_inode_in_orphan() is used by reflink to create the newly
reflinked inode simultaneously in the orphan dir. This allows us to easily
handle partially-reflinked files during recovery cleanup.

We have a problem though - the orphan dir stringifies inode # to determine
a unique name under which the orphan entry dirent can be created. Since
ocfs2_create_inode_in_orphan() needs the space allocated in the orphan dir
before it can allocate the inode, we currently call into the orphan code:

    /*
     * We give the orphan dir the root blkno to fake an orphan name,
     * and allocate enough space for our insertion.
     */
    status = ocfs2_prepare_orphan_dir(osb, &orphan_dir,
                                      osb->root_blkno,
                                      orphan_name, &orphan_insert);

Using osb->root_blkno might work fine on unindexed directories, but the
orphan dir can have an index. When it has that index, the above code fails
to allocate the proper index entry. Later, when we try to remove the file
from the orphan dir (using the actual inode #), the reflink operation will
fail.

To fix this, I created a function ocfs2_alloc_orphaned_file() which uses the
newly split out orphan and inode alloc code to figure out what the inode
block number will be (once allocated) and then prepare the orphan dir from
that data.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: split out ocfs2_prepare_orphan_dir() into locking and prep functionsMark Fasheh2010-09-08
| | | | | | | | | | | | | | | | | | We do this because ocfs2_create_inode_in_orphan() wants to order locking of the orphan dir with respect to locking of the inode allocator *before* making any changes to the directory. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: allow return of new inode block location before allocation of the inodeMark Fasheh2010-09-08
This helps code which needs to know the eventual block number of an inode
but can't allocate it yet due to transaction or lock ordering. For example,
ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
of the orphan dir because it can't yet know where the actual inode is
placed - that code is actually in ocfs2_mknod_locked. This is a problem when
the orphan dirs are indexed, as the junk inode number will create an index
entry which goes unused (and fails the later removal from the orphan dir).
Now with these interfaces, ocfs2_create_inode_in_orphan() can run the block
group search (and get back the inode block number) *before* any actual
allocation occurs.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: use ocfs2_alloc_dinode_update_counts() instead of open codingMark Fasheh2010-09-08
| | | | | | | | | | | | | | | | | | ocfs2_search_chain() makes the same updates as ocfs2_alloc_dinode_update_counts to the alloc inode. Instead of open coding the bitmap update, use our helper function. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: split out inode alloc code from ocfs2_mknod_lockedMark Fasheh2010-09-08
Do this by splitting the bulk of the function away from the inode allocation
code at the very top of ocfs2_mknod_locked(). Existing callers don't need to
change and won't see any difference. The new function created,
__ocfs2_mknod_locked(), will be used shortly.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * Ocfs2: Fix a regression bug from mainline ↵Tristan Ye2010-09-08
commit(6b933c8e6f1a2f3118082c455eef25f9b1ac7b45).

This patch fixes the regression introduced by commit 6b933c8...
('ocfs2: Avoid direct write if we fall back to buffered I/O'):

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1285

Commit 6b933c8e6f1a2f3118082c455eef25f9b1ac7b45 changed
__generic_file_aio_write to generic_file_buffered_write, which does not call
filemap_{write,wait}_range to flush the page cache when an O_DIRECT write
falls back to a buffered one. That broke the O_DIRECT semantics for
extending O_DIRECT writes.

This patch tries to guarantee that O_DIRECT writes which fall back to
buffered I/O are correctly flushed.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: Fix deadlock when allocating pageJan Kara2010-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | We cannot call grab_cache_page() when holding filesystem locks or with a transaction started as grab_cache_page() calls page allocation with GFP_KERNEL flag and thus page reclaim can recurse back into the filesystem causing deadlocks or various assertion failures. We have to use find_or_create_page() instead and pass it GFP_NOFS as we do with other allocations. Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: properly set and use inode group alloc hintMark Fasheh2010-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | We were setting ac->ac_last_group in ocfs2_claim_suballoc_bits from res->sr_bg_blkno. Unfortunately, res->sr_bg_blkno is going to be zero under normal (non-fragmented) circumstances. The discontig block group patches effectively turned off that feature. Fix this by correctly calculating what the next group hint should be. Acked-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: Use the right group in nfs sync check.Tao Ma2010-09-08
We have added discontig block groups, so an inode can now be allocated in a
discontig block group. Get the group in ocfs2_get_suballoc_slot_bit.

The old ocfs2_test_suballoc_bit gets the group block number from the
allocation inode, which is wrong. Fix it by passing the right group.

Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: Flush drive's caches on fdatasyncJan Kara2010-09-08
When the 'barrier' mount option is specified, we have to issue a cache flush
during fdatasync(2). We have to do this even if the inode doesn't have
I_DIRTY_DATASYNC set, because we still have to get written *data* to disk so
that it is not lost in case of a crash.

Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: make __ocfs2_page_mkwrite handle file end properly.Tao Ma2010-09-08
__ocfs2_page_mkwrite is currently broken in its handling of the file end.
1. The last page should be the page that contains i_size - 1.
2. The len in the last page is also calculated incorrectly.
So change them accordingly.

Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
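The two corrections listed above can be sketched in userspace C. This is only an illustration under stated assumptions: a 4096-byte page, and helper names (`sk_last_page_index`, `sk_len_in_last_page`) that are hypothetical rather than taken from the ocfs2 code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size of 4096 bytes; the kernel uses its own page macros. */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((uint64_t)1 << SK_PAGE_SHIFT)

/* Index of the page that holds the last byte of a file of i_size bytes. */
uint64_t sk_last_page_index(uint64_t i_size)
{
    assert(i_size > 0);
    return (i_size - 1) >> SK_PAGE_SHIFT;
}

/* Number of valid bytes in that last page, in the range 1..SK_PAGE_SIZE. */
uint64_t sk_len_in_last_page(uint64_t i_size)
{
    assert(i_size > 0);
    return ((i_size - 1) & (SK_PAGE_SIZE - 1)) + 1;
}
```

For a file of exactly 4096 bytes, shifting i_size itself would name page 1, which holds no data; working from i_size - 1 gives page 0 with a length of 4096.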
| * ocfs2: Fix incorrect checksum validation errorSunil Mushran2010-09-08
For local mounts, ocfs2_read_locked_inode() calls ocfs2_read_blocks_sync()
to read the inode off the disk. The latter first checks to see if that block
is cached in the journal, and, if so, returns that block. That is ok.

But ocfs2_read_locked_inode() goes wrong when it tries to validate the
checksum of such blocks. Blocks that are cached in the journal may not have
had their checksum computed as yet. We should not validate the checksums of
such blocks.

Fixes ossbz#1282
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1282

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Tao Ma <tao.ma@oracle.com>
| * ocfs2: Fix metaecc error messagesSunil Mushran2010-09-08
Like tools, the checksum validate function now prints the values in hex.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2010-09-08
|\ \ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix lock annotations fuse: flush background queue on connection close
| * | fuse: fix lock annotationsMiklos Szeredi2010-09-07
| | | | | | | | | | | | | | | | | | | | | | | | Sparse doesn't understand lock annotations of the form __releases(&foo->lock). Change them to __releases(foo->lock). Same for __acquires(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * | fuse: flush background queue on connection closeMiklos Szeredi2010-09-07
David Bartley reported that fuse can hang in fuse_get_req_nofail() when the
connection to the filesystem server is no longer active.

If bg_queue is not empty then flush_bg_queue() called from request_end() can
put more requests on to the pending queue. If this happens while ending
requests on the processing queue then those background requests will be
queued to the pending list and never ended.

Another problem is that fuse_dev_release() didn't wake up processes sleeping
on blocked_waitq.

Solve this by:
 a) flushing the background queue before calling end_requests() on the
    pending and processing queues
 b) setting blocked = 0 and waking up processes waiting on blocked_waitq

Thanks to David for an excellent bug report.

Reported-by: David Bartley <andareed@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
* | Merge branch 'for-2.6.36' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2010-09-07
|\ \ | | | | | | | | | | | | * 'for-2.6.36' of git://linux-nfs.org/~bfields/linux: nfsd4: mask out non-access bits in nfs4_access_to_omode
| * | nfsd4: mask out non-access bits in nfs4_access_to_omodeJ. Bruce Fields2010-09-02
| | | | | | | | | | | | | | | | | | This fixes an unnecessary BUG(). Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | | Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2010-09-07
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Make fiemap work with sparse files xfs: prevent 32bit overflow in space reservation xfs: Disallow 32bit project quota id xfs: improve buffer cache hash scalability
| * \ \ Merge branch '2.6.36-xfs-misc' of ↵Alex Elder2010-09-03
| |\ \ \ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev
| | * | | xfs: prevent 32bit overflow in space reservationDave Chinner2010-09-02
If we attempt to preallocate more than 2^32 blocks of space in a single
syscall, the transaction block reservation will overflow, leading to hangs
in the superblock block accounting code. This is trivially reproduced with
xfs_io. Fix the problem by capping the allocation reservation to the maximum
number of blocks a single xfs_bmapi() call can allocate (2^21 blocks).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
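The overflow and the cap can be sketched in userspace C. This is a hedged illustration, not the XFS code: it assumes the reservation is held in a 32-bit variable, and the constant `SK_MAX_ALLOC_BLOCKS` and both function names are hypothetical, using the 2^21 figure quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-call limit: the 2^21 blocks quoted in the commit message. */
#define SK_MAX_ALLOC_BLOCKS ((uint64_t)1 << 21)

/* Unsafe: narrowing a 64-bit block count to 32 bits wraps modulo 2^32. */
uint32_t sk_reserve_unchecked(uint64_t blocks)
{
    return (uint32_t)blocks;
}

/* Safer: clamp to the per-call maximum before narrowing. */
uint32_t sk_reserve_capped(uint64_t blocks)
{
    uint64_t capped = blocks < SK_MAX_ALLOC_BLOCKS ? blocks
                                                   : SK_MAX_ALLOC_BLOCKS;

    assert(capped <= UINT32_MAX);
    return (uint32_t)capped;
}
```

With the unchecked version a request for exactly 2^32 blocks reserves nothing at all, which is the kind of mismatch the cap avoids.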
| | * | | xfs: improve buffer cache hash scalabilityDave Chinner2010-09-02
When doing large parallel file creates on a 16p machine, a large amount of
time is being spent in _xfs_buf_find(). A system wide profile with perf top
shows this:

  1134740.00 19.3% _xfs_buf_find
   733142.00 12.5% __ticket_spin_lock

The problem is that the hash contains 45,000 buffers, and the hash table
width is only 256 buffers. That means we've got around 200 buffers per
chain, and searching it is quite expensive. The hash table size needs to
increase.

Secondly, every time we do a lookup, we promote the buffer we find to the
head of the hash chain. This is causing cachelines to be dirtied and causes
invalidation of cachelines across all CPUs that may have walked the hash
chain recently. Hence every walk of the hash chain is effectively a cold
cache walk. Remove the promotion to avoid this invalidation.

The results are:

  1045043.00 21.2% __ticket_spin_lock
   326184.00  6.6% _xfs_buf_find

A 70% drop in the CPU usage when looking up buffers. Unfortunately that does
not result in an increase in performance under this workload, as contention
on the inode_lock soaks up most of the reduction in CPU usage.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | | xfs: Make fiemap work with sparse filesTao Ma2010-09-03
In xfs_vn_fiemap, we set bmv_count to fi_extent_max + 1 and want to return
fi_extent_max extents, but this does not work for a sparse file. The reason
is that xfs_getbmap also records holes in 'out', while 'out' is allocated
with bmv_count (fi_extent_max + 1) entries, which does not account for
holes. So in the worst case, if the 'out' vector looks like
[hole, extent, hole, extent, hole, ... hole, extent, hole], we will only
return half of fi_extent_max extents.

This patch adds a new flag BMV_IF_NO_HOLES for bmv_iflags. With this flag
set, we don't use an 'out' entry in xfs_getbmap for a hole. The solution is
a bit ugly: it simply does not increase the index of the 'out' vector. I
felt that it is not easy to skip holes at the very beginning, since we have
complicated checks and some functions like xfs_getbmapx_fix_eof_hole that
adjust 'out'.

Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
| * | | xfs: Disallow 32bit project quota idArkadiusz Miśkiewicz2010-09-02
Currently on-disk structure is able to keep only 16bit project quota id, so
disallow 32bit ones. This fixes a problem where parts of kernel structures
holding project quota id are 32bit while parts (on-disk) are 16bit variables
which causes project quota member files to be inaccessible for some
operations (like mv/rm).

Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
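A minimal userspace sketch of the width mismatch described above, assuming a 32-bit in-memory id and a 16-bit on-disk field. The function names are hypothetical and the check is only modelled on the idea of rejecting ids that do not fit.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Reject ids that do not fit the (assumed) 16-bit on-disk field. */
int sk_check_projid(uint32_t projid)
{
    if (projid > UINT16_MAX)
        return -EINVAL;
    return 0;
}

/* 1 if the id read back from a 16-bit field equals the 32-bit id in memory. */
int sk_projid_roundtrips(uint32_t projid)
{
    uint16_t on_disk = (uint16_t)projid;   /* keeps only the low 16 bits */

    return (uint32_t)on_disk == projid;
}
```

An id that fails the round trip is one for which the in-memory and on-disk views disagree, which is why rejecting it up front is the simpler option.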
* | | Merge branch 'for-linus' of ↵Linus Torvalds2010-09-07
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: 9p: potential ERR_PTR() dereference
| * | | 9p: potential ERR_PTR() dereferenceDan Carpenter2010-08-30
| |/ / | | | | | | | | | | | | | | | | | | | | | p9_client_walk() can return error values if we run out of space or there is a problem with the network. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds2010-09-07
|\ \ \ | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: sysfs: checking for NULL instead of ERR_PTR
| * | | sysfs: checking for NULL instead of ERR_PTRDan Carpenter2010-09-03
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | d_path() returns an ERR_PTR and it doesn't return NULL. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: stable <stable@kernel.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
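The difference between a NULL check and an error-pointer check can be shown with a simplified userspace re-creation of the idiom. The helpers below are stand-ins modelled on the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(); all `sk_` names and the lookup function are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno value that is encoded in a pointer (assumed, as in the kernel). */
#define SK_MAX_ERRNO 4095

static inline void *sk_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline long sk_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int sk_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SK_MAX_ERRNO;
}

/* Hypothetical lookup that reports failure as an encoded errno, never NULL. */
const char *sk_lookup(int fail)
{
    if (fail)
        return sk_err_ptr(-ENOMEM);
    return "/sys/example";
}

/* Returns 0 on success or a negative errno; a NULL test would miss the error. */
int sk_use_path(int fail)
{
    const char *path = sk_lookup(fail);

    if (sk_is_err(path))
        return (int)sk_ptr_err(path);
    assert(path != NULL);
    return 0;
}
```

Note that the failing lookup returns a non-NULL pointer, so code that only compares against NULL would go on to dereference it.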
* | | Merge branch 'for-linus' of ↵Linus Torvalds2010-09-07
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix leak of shadow dat inode in error path of load_nilfs
| * | | nilfs2: fix leak of shadow dat inode in error path of load_nilfsRyusuke Konishi2010-08-29
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | If load_nilfs() gets an error while doing recovery, it will fail to free the shadow inode of dat (nilfs->ns_gc_dat). This fixes the leak issue. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* / / VFS: Sanity check mount flags passed to change_mnt_propagation()Valerie Aurora2010-09-07
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sanity check the flags passed to change_mnt_propagation(). Exactly one flag should be set. Return EINVAL otherwise. Userspace can pass in arbitrary combinations of MS_* flags to mount(). do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE is set. do_change_type() clears MS_REC and then calls change_mnt_propagation() with the rest of the user-supplied flags. change_mnt_propagation() clearly assumes only one flag is set but do_change_type() does not check that this is true. For example, mount() with flags MS_SHARED | MS_RDONLY does not actually make the mount shared or read-only but does clear MNT_UNBINDABLE. Signed-off-by: Valerie Aurora <vaurora@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
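The "exactly one flag" rule described above can be sketched as follows. This is an illustration under assumptions, not the kernel function: the `SK_MS_*` constants are local stand-ins for the real MS_* flags, and `sk_propagation_type` is a hypothetical name.

```c
#include <assert.h>

/* Local stand-ins for the mount flags; the real MS_* values live in kernel headers. */
#define SK_MS_RDONLY     (1UL << 0)
#define SK_MS_REC        (1UL << 14)
#define SK_MS_UNBINDABLE (1UL << 17)
#define SK_MS_PRIVATE    (1UL << 18)
#define SK_MS_SLAVE      (1UL << 19)
#define SK_MS_SHARED     (1UL << 20)

#define SK_PROPAGATION_MASK \
    (SK_MS_SHARED | SK_MS_PRIVATE | SK_MS_SLAVE | SK_MS_UNBINDABLE)

/*
 * Return the single propagation flag contained in 'flags' (ignoring
 * SK_MS_REC), or 0 if there is not exactly one such flag or if unrelated
 * flags are mixed in. A caller would turn 0 into EINVAL.
 */
unsigned long sk_propagation_type(unsigned long flags)
{
    unsigned long type = flags & ~SK_MS_REC;

    if (type & ~SK_PROPAGATION_MASK)
        return 0;   /* e.g. SK_MS_SHARED | SK_MS_RDONLY */
    if (type == 0 || (type & (type - 1)) != 0)
        return 0;   /* no flag, or more than one propagation flag */
    return type;
}
```

The `type & (type - 1)` test is the usual power-of-two check: it is zero only when a single bit is set.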
* | Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notifyLinus Torvalds2010-08-28
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fsnotify: drop two useless bools in the fnsotify main loop
  fsnotify: fix list walk order
  fanotify: Return EPERM when a process is not privileged
  fanotify: resize pid and reorder structure
  fanotify: drop duplicate pr_debug statement
  fanotify: flush outstanding perm requests on group destroy
  fsnotify: fix ignored mask handling between inode and vfsmount marks
  fanotify: add MAINTAINERS entry
  fsnotify: reset used_inode and used_vfsmount on each pass
  fanotify: do not dereference inode_mark when it is unset
| * | fsnotify: drop two useless bools in the fnsotify main loopEric Paris2010-08-27
The fsnotify main loop has 2 bools which indicate whether we processed the
inode or vfsmount mark in that particular pass through the loop. These bools
can be replaced with the inode_group and vfsmount_group variables, which
actually makes the code a little easier to understand.

Signed-off-by: Eric Paris <eparis@redhat.com>
| * | fsnotify: fix list walk orderEric Paris2010-08-27
Marks were stored on the inode and vfsmount mark list in order from highest
memory address to lowest memory address. The code to walk those lists
thought they were in order from lowest to highest, with unpredictable
results when trying to match up marks from each. It was possible that extra
events would be sent to userspace when inode marks ignoring events wouldn't
get matched with the vfsmount marks.

This problem only affected fanotify when using both vfsmount and inode marks
simultaneously.

Signed-off-by: Eric Paris <eparis@redhat.com>
| * | fanotify: Return EPERM when a process is not privilegedAndreas Gruenbacher2010-08-27
| | | | | | | | | | | | | | | | | | | | | | | | The appropriate error code when privileged operations are denied is EPERM, not EACCES. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Eric Paris <paris@paris.rdu.redhat.com>
| * | fanotify: drop duplicate pr_debug statementTvrtko Ursulin2010-08-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reminded me... you have two pr_debugs in fanotify_should_send_event which output redundant information. Maybe you intended it like that so it is selectable how much log spam you want, or if not you may want to apply this patch. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * | fanotify: flush outstanding perm requests on group destroyEric Paris2010-08-22
When an fanotify listener is closing it may cause a deadlock between the
listener and the original task doing an fs operation. If the original task
is waiting for a permissions response it will be holding the srcu lock. The
listener cannot clean up and exit until after that srcu lock is
synchronized. Thus deadlock. The fix introduced here is to stop accepting
new permissions events when a listener is shutting down and to grant
permission for all outstanding events. Thus the original task will
eventually release the srcu lock and the listener can complete shutdown.

Reported-by: Andreas Gruenbacher <agruen@suse.de>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
| * | fsnotify: fix ignored mask handling between inode and vfsmount marksEric Paris2010-08-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The interesting 2 list lockstep walking didn't quite work out if the inode marks only had ignores and the vfsmount list requested events. The code to shortcut list traversal would not run the inode list since it didn't have real event requests. This code forces inode list traversal when a vfsmount mark matches the event type. Maybe we could add an i_fsnotify_ignored_mask field to struct inode to get the shortcut back, but it doesn't seem worth it to grow struct inode again. I bet with the recent changes to lock the way we do now it would actually not be a major perf hit to just drop i_fsnotify_mark_mask altogether. But that is for another day. Signed-off-by: Eric Paris <eparis@redhat.com>
| * | fsnotify: reset used_inode and used_vfsmount on each passEric Paris2010-08-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fsnotify main loop has 2 booleans which tell if a particular mark was sent to the listeners or if it should be processed in the next pass. The problem is that the booleans were not reset on each traversal of the loop. So marks could get skipped even when they were not sent to the notifiers. Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * | fanotify: do not dereference inode_mark when it is unsetEric Paris2010-08-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | The fanotify code is supposed to get the group from the mark. It accidentally only used the inode_mark. If the vfsmount_mark was set but not the inode_mark it would deref the NULL inode_mark. Get the group from the correct place. Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com> Signed-off-by: Eric Paris <eparis@redhat.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2010-08-28
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: Fix encrypted file name lookup regression ecryptfs: properly mark init functions fs/ecryptfs: Return -ENOMEM on memory allocation failure
| * | | eCryptfs: Fix encrypted file name lookup regressionTyler Hicks2010-08-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes a regression caused by 21edad32205e97dc7ccb81a85234c77e760364c8 When file name encryption was enabled, ecryptfs_lookup() failed to use the encrypted and encoded version of the upper, plaintext, file name when performing a lookup in the lower file system. This made it impossible to lookup existing encrypted file names and any newly created files would have plaintext file names in the lower file system. https://bugs.launchpad.net/ecryptfs/+bug/623087 Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
| * | | ecryptfs: properly mark init functionsJerome Marchand2010-08-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some ecryptfs init functions are not prefixed by __init and thus not freed after initialization. This patch saved about 1kB in ecryptfs module. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
| * | | fs/ecryptfs: Return -ENOMEM on memory allocation failureJulia Lawall2010-08-27
In this code, 0 is returned on memory allocation failure, even though other
failures return -ENOMEM or other similar values.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression ret;
expression x,e1,e2,e3;
@@

ret = 0
... when != ret = e1
*x = \(kmalloc\|kcalloc\|kzalloc\)(...)
... when != ret = e2
if (x == NULL) { ... when != ret = e3
  return ret;
}
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
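The pattern that the semantic match looks for can be reproduced in a few lines of userspace C. This is a sketch under assumptions, not the eCryptfs code: the allocator hook `sk_alloc_fn` exists only so that a failure can be simulated, and all names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical allocator hook so that an allocation failure can be simulated. */
typedef void *(*sk_alloc_fn)(size_t);

void *sk_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Buggy shape: 'ret' is still 0 when the allocation fails. */
int sk_setup_buggy(sk_alloc_fn alloc)
{
    int ret = 0;
    void *buf = alloc(64);

    if (buf == NULL)
        return ret;     /* reports success despite the failure */
    free(buf);
    return ret;
}

/* Fixed shape: set the error code before leaving. */
int sk_setup_fixed(sk_alloc_fn alloc)
{
    int ret = 0;
    void *buf = alloc(64);

    if (buf == NULL) {
        ret = -ENOMEM;
        return ret;
    }
    free(buf);
    return ret;
}
```

Passing `malloc` as the hook exercises the success path, while `sk_failing_alloc` shows the difference between the two return values.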