aboutsummaryrefslogtreecommitdiffstats
path: root/fs/xfs
Commit message (Collapse)AuthorAge
* xfs: log all dirty inodes in xfs_fs_sync_fsChristoph Hellwig2011-12-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since Linux 2.6.36 the writeback code has introduces various measures for live lock prevention during sync(). Unfortunately some of these are actively harmful for the XFS model, where the inode gets marked dirty for metadata from the data I/O handler. The older_than_this checks that are now more strictly enforced since writeback: avoid livelocking WB_SYNC_ALL writeback by only calling into __writeback_inodes_sb and thus only sampling the current cut off time once. But on a slow enough devices the previous asynchronous sync pass might not have fully completed yet, and thus XFS might mark metadata dirty only after that sampling of the cut off time for the blocking pass already happened. I have not myself reproduced this myself on a real system, but by introducing artificial delay into the XFS I/O completion workqueues it can be reproduced easily. Fix this by iterating over all XFS inodes in ->sync_fs and log all that are dirty. This might log inode that only got redirtied after the previous pass, but given how cheap delayed logging of inodes is it isn't a major concern for performance. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: log the inode in ->write_inode calls for kupdateChristoph Hellwig2011-12-23
| | | | | | | | | | | | | | | | | | | | If the writeback code writes back an inode because it has expired we currently use the non-blockin ->write_inode path. This means any inode that is pinned is skipped. With delayed logging and a workload that has very little log traffic otherwise it is very likely that an inode that gets constantly written to is always pinned, and thus we keep refusing to write it. The VM writeback code at that point redirties it and doesn't try to write it again for another 30 seconds. This means under certain scenarious time based metadata writeback never happens. Fix this by calling into xfs_log_inode for kupdate in addition to data integrity syncs, and thus transfer the inode to the log ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: fix the logspace waiting algorithmChristoph Hellwig2011-12-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Apply the scheme used in log_regrant_write_log_space to wake up any other threads waiting for log space before the newly added one to log_regrant_write_log_space as well, and factor the code into readable helpers. For each of the queues we have add two helpers: - one to try to wake up all waiting threads. This helper will also be usable by xfs_log_move_tail once we remove the current opportunistic wakeups in it. - one to sleep on t_wait until enough log space is available, loosely modelled after Linux waitqueues. And use them to reimplement the guts of log_regrant_write_log_space and log_regrant_write_log_space. These two function now use one and the same algorithm for waiting on log space instead of subtly different ones before, with an option to completely unify them in the near future. Also move the filesystem shutdown handling to the common caller given that we had to touch it anyway. Based on hard debugging and an earlier patch from Chandra Seetharaman <sekharan@us.ibm.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Tested-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernelsChristoph Hellwig2011-12-06
| | | | | | | | | | | | | | The i_ino field in the VFS inode is of type unsigned long and thus can't hold the full 64-bit inode number on 32-bit kernels. We have the full inode number in the XFS inode, so use that one for nfs exports. Note that I've also switched the 32-bit file handles types to it, just to make the code more consistent and copy & paste errors less likely to happen. Reported-by: Guoquan Yang <ygq51@hotmail.com> Reported-by: Hank Peng <pengxihan@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: fix allocation length overflow in xfs_bmapi_write()Dave Chinner2011-12-02
| | | | | | | | | | | | | | | | | | | | | | | | | When testing the new xfstests --large-fs option that does very large file preallocations, this assert was tripped deep in xfs_alloc_vextent(): XFS: Assertion failed: args->minlen <= args->maxlen, file: fs/xfs/xfs_alloc.c, line: 2239 The allocation was trying to allocate a zero length extent because the lower 32 bits of the allocation length was zero. The remaining length of the allocation to be done was an exact multiple of 2^32 - the first case I saw was at 496TB remaining to be allocated. This turns out to be an overflow when converting the allocation length (a 64 bit quantity) into the extent length to allocate (a 32 bit quantity), and it requires the length to be allocated an exact multiple of 2^32 blocks to trip the assert. Fix it by limiting the extent lenth to allocate to MAXEXTLEN. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: fix attr2 vs large data fork assertChristoph Hellwig2011-11-29
| | | | | | | | | | | | | | | | | | | | | | | With Dmitry fsstress updates I've seen very reproducible crashes in xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims that the attributes would not fit inline into the inode after removing an attribute. It turns out that we were operating on an inode with lots of delalloc extents, and thus an if_bytes values for the data fork that is larger than biggest possible on-disk storage for it which utterly confuses the code near the end of xfs_attr_shortform_bytesfit. Fix this by always allowing the current attribute fork, like we already do for the attr1 format, given that delalloc conversion will take care for moving either the data or attribute area out of line if it doesn't fit at that point - or making the point moot by merging extents at this point. Also document the function better, and clean up some loose bits. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: force buffer writeback before blocking on the ilock in inode reclaimChristoph Hellwig2011-11-29
| | | | | | | | | | | | | | | | If we are doing synchronous inode reclaim we block the VM from making progress in memory reclaim. So if we encouter a flush locked inode promote it in the delwri list and wake up xfsbufd to write it out now. Without this we can get hangs of up to 30 seconds during workloads hitting synchronous inode reclaim. The scheme is copied from what we do for dquot reclaims. Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: validate acl countChristoph Hellwig2011-11-28
| | | | | | | | | This prevents in-memory corruption and possible panics if the on-disk ACL is badly corrupted. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: use doalloc flag in xfs_qm_dqattach_one()Mitsuo Hayasaka2011-11-15
| | | | | | | | | | | | | | | | | | | | | | | | | | The doalloc arg in xfs_qm_dqattach_one() is a flag that indicates whether a new area to handle quota information will be allocated if needed. Originally, it was passed to xfs_qm_dqget(), but has been removed by the following commit (probably by mistake): commit 8e9b6e7fa4544ea8a0e030c8987b918509c8ff47 Author: Christoph Hellwig <hch@lst.de> Date: Sun Feb 8 21:51:42 2009 +0100 xfs: remove the unused XFS_QMOPT_DQLOCK flag As the result, xfs_qm_dqget() called from xfs_qm_dqattach_one() never allocates the new area even if it is needed. This patch gives the doalloc arg to xfs_qm_dqget() in xfs_qm_dqattach_one() to fix this problem. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
* xfs: fix force shutdown handling in xfs_end_ioChristoph Hellwig2011-11-08
| | | | | | | | Ensure ioend->io_error gets propagated back to e.g. AIO completions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
* xfs: constify xfs_item_opsChristoph Hellwig2011-11-08
| | | | | | | | | | | The log item ops aren't nessecarily the biggest exploit vector, but marking them const is easy enough. Also remove the unused xfs_item_ops_t typedef while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
* xfs: Fix possible memory corruption in xfs_readlinkCarlos Maiolino2011-11-08
| | | | | | | | | | | | | | | | | | | | | | | | Fixes a possible memory corruption when the link is larger than MAXPATHLEN and XFS_DEBUG is not enabled. This also remove the S_ISLNK assert, since the inode mode is checked previously in xfs_readlink_by_handle() and via VFS. Updated to address concerns raised by Ben Hutchings about the loose attention paid to 32- vs 64-bit values, and the lack of handling a potentially negative pathlen value: - Changed type of "pathlen" to be xfs_fsize_t, to match that of ip->i_d.di_size - Added checking for a negative pathlen to the too-long pathlen test, and generalized the message that gets reported in that case to reflect the change As a result, if a negative pathlen were encountered, this function would return EFSCORRUPTED (and would fail an assertion for a debug build)--just as would a too-long pathlen. Signed-off-by: Alex Elder <aelder@sgi.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* filesystems: add set_nlink()Miklos Szeredi2011-11-02
| | | | | | | | | Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* treewide: use __printf not __attribute__((format(printf,...)))Joe Perches2011-10-31
| | | | | | | | | | | | | | | | | Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=*.[ch] -w "__attribute__" * | \ grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xfs: warn if direct reclaim tries to writeback pagesMel Gorman2011-10-31
| | | | | | | | | | | | | | | | | | | | | Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2011-10-28
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits) xfs: add AIL pushing tracepoints xfs: put in missed fix for merge problem xfs: do not flush data workqueues in xfs_flush_buftarg xfs: remove XFS_bflush xfs: remove xfs_buf_target_name xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks xfs: clean up xfs_ioerror_alert xfs: clean up buffer allocation xfs: remove buffers from the delwri list in xfs_buf_stale xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF xfs: remove XFS_BUF_FINISH_IOWAIT xfs: remove xfs_get_buftarg_list xfs: fix buffer flushing during unmount xfs: optimize fsync on directories xfs: reduce the number of log forces from tail pushing xfs: Don't allocate new buffers on every call to _xfs_buf_find xfs: simplify xfs_trans_ijoin* again xfs: unlock the inode before log force in xfs_change_file_space xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata ...
| * xfs: add AIL pushing tracepointsChristoph Hellwig2011-10-18
| | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * xfs: put in missed fix for merge problemAlex Elder2011-10-18
| | | | | | | | | | | | | | | | | | | | I intended to do this as part of fixing part of the conflict with the merge with Linus' tree, but evidently it didn't get included in the commit. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * Merge branch 'master' of ↵Alex Elder2011-10-17
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux Resolved conflicts: fs/xfs/xfs_trans_priv.h: - deleted struct xfs_ail field xa_flags - kept field xa_log_flush in struct xfs_ail fs/xfs/xfs_trans_ail.c: - in xfsaild_push(), in XFS_ITEM_PUSHBUF case, replaced "flush_log = 1" with "ailp->xa_log_flush++" Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: do not flush data workqueues in xfs_flush_buftargChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we call xfs_flush_buftarg (generally from sync or umount) it already is too late to flush the data workqueues, as I/O completion is signalled for them and we are thus already done with the data we would flush here. There are places where flushing them might be useful, but the current sync interface doesn't give us that opportunity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: remove XFS_bflushChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: remove xfs_buf_target_nameChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The calling convention that returns a pointer to a static buffer is fairly nasty, so just opencode it in the only caller that is left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacksChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use xfs_ioerror_alert instead of opencoding a very similar error message. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: clean up xfs_ioerror_alertChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of passing the block number and mount structure explicitly get them off the bp and fix make the argument order more natural. Also move it to xfs_buf.c and stop printing the device name given that we already get the fs name as part of xfs_alert, and we know what device is operates on because of the caller that gets printed, finally rename it to xfs_buf_ioerror_alert and pass __func__ as argument where it makes sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: clean up buffer allocationChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change _xfs_buf_initialize to allocate the buffer directly and rename it to xfs_buf_alloc now that is the only buffer allocation routine. Also remove the xfs_buf_deallocate wrapper around the kmem_zone_free calls for buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: remove buffers from the delwri list in xfs_buf_staleChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For each call to xfs_buf_stale we call xfs_buf_delwri_dequeue either directly before or after it, or are guaranteed by the surrounding conditionals that we are never called on delwri buffers. Simply this situation by moving the call to xfs_buf_delwri_dequeue into xfs_buf_stale. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALEChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REFChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: remove XFS_BUF_FINISH_IOWAITChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: remove xfs_get_buftarg_listChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code is unused and under a config option that doesn't exist, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: fix buffer flushing during unmountChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code to flush buffers in the umount code is a bit iffy: we first flush all delwri buffers out, but then might be able to queue up a new one when logging the sb counts. On a normal shutdown that one would get flushed out when doing the synchronous superblock write in xfs_unmountfs_writesb, but we skip that one if the filesystem has been shut down. Fix this by moving the delwri list flushing until just before unmounting the log, and while we're at it also remove the superflous delwri list and buffer lru flusing for the rt and log device that can never have cached or delwri buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Tested-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: optimize fsync on directoriesChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Directories are only updated transactionally, which means fsync only needs to flush the log the inode is currently dirty, but not bother with checking for dirty data, non-transactional updates, and most importanly doesn't have to flush disk caches except as part of a transaction commit. While the first two optimizations can't easily be measured, the latter actually makes a difference when doing lots of fsync that do not actually have to commit the inode, e.g. because an earlier fsync already pushed the log far enough. The new xfs_dir_fsync is identical to xfs_nfs_commit_metadata except for the prototype, but I'm not sure creating a common helper for the two is worth it given how simple the functions are. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: reduce the number of log forces from tail pushingDave Chinner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The AIL push code will issue a log force on ever single push loop that it exits and has encountered pinned items. It doesn't rescan these pinned items until it revisits the AIL from the start. Hence we only need to force the log once per walk from the start of the AIL to the target LSN. This results in numbers like this: xs_push_ail_flush..... 1456 xs_log_force......... 1485 For an 8-way 50M inode create workload - almost all the log forces are coming from the AIL pushing code. Reduce the number of log forces by only forcing the log if the previous walk found pinned buffers. This reduces the numbers to: xs_push_ail_flush..... 665 xs_log_force......... 682 For the same test. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: Don't allocate new buffers on every call to _xfs_buf_findDave Chinner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing ~1 million cache hit lookups to ~3000 buffer creates. That's almost 3 orders of magnitude more cahce hits than misses, so optimising for cache hits is quite important. In the cache hit case, we do not need to allocate a new buffer in case of a cache miss, so we are effectively hitting the allocator for no good reason for vast the majority of calls to _xfs_buf_find. 8-way create workloads are showing similar cache hit/miss ratios. The result is profiles that look like this: samples pcnt function DSO _______ _____ _______________________________ _________________ 1036.00 10.0% _xfs_buf_find [kernel.kallsyms] 582.00 5.6% kmem_cache_alloc [kernel.kallsyms] 519.00 5.0% __memcpy [kernel.kallsyms] 468.00 4.5% __ticket_spin_lock [kernel.kallsyms] 388.00 3.7% kmem_cache_free [kernel.kallsyms] 331.00 3.2% xfs_log_commit_cil [kernel.kallsyms] Further, there is a fair bit of work involved in initialising a new buffer once a cache miss has occurred and we currently do that under the rbtree spinlock. That increases spinlock hold time on what are heavily used trees. To fix this, remove the initialisation of the buffer from _xfs_buf_find() and only allocate the new buffer once we've had a cache miss. Initialise the buffer immediately after allocating it in xfs_buf_get, too, so that is it ready for insert if we get another cache miss after allocation. This minimises lock hold time and avoids unnecessary allocator churn. The resulting profiles look like: samples pcnt function DSO _______ _____ ___________________________ _________________ 8111.00 9.1% _xfs_buf_find [kernel.kallsyms] 4380.00 4.9% __memcpy [kernel.kallsyms] 4341.00 4.8% __ticket_spin_lock [kernel.kallsyms] 3401.00 3.8% kmem_cache_alloc [kernel.kallsyms] 2856.00 3.2% xfs_log_commit_cil [kernel.kallsyms] 2625.00 2.9% __kmalloc [kernel.kallsyms] 2380.00 2.7% kfree [kernel.kallsyms] 2016.00 2.3% kmem_cache_free [kernel.kallsyms] Showing a significant reduction in time spent doing allocation and freeing from slabs (kmem_cache_alloc and kmem_cache_free). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: simplify xfs_trans_ijoin* againChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no reason to keep a reference to the inode even if we unlock it during transaction commit because we never drop a reference between the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref back into xfs_trans_ijoin - the third argument decides if an unlock is needed now. I'm actually starting to wonder if allowing inodes to be unlocked at transaction commit really is worth the effort. The only real benefit is that they can be unlocked earlier when commiting a synchronous transactions, but that could be solved by doing the log force manually after the unlock, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: unlock the inode before log force in xfs_change_file_spaceChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let the transaction commit unlock the inode before it potentially causes a synchronous log force. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadataChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only read the LSN we need to push to with the ilock held, and then release it before we do the log force to improve concurrency. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: unlock the inode before log force in xfs_fsyncChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only read the LSN we need to push to with the ilock held, and then release it before we do the log force to improve concurrency. This also removes the only direct caller of _xfs_trans_commit, thus allowing it to be merged into the plain xfs_trans_commit again. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: XFS_TRANS_SWAPEXT is not a valid flag for xfs_trans_commitChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | XFS_TRANS_SWAPEXT is a transaction type, not a flag for xfs_trans_commit, so don't pass it in xfs_swap_extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: fix possible overflow in xfs_ioc_trim()Lukas Czerner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In xfs_ioc_trim it is possible that computing the last allocation group to discard might overflow for big start & len values, because the result might be bigger then xfs_agnumber_t which is 32 bit long. Fix this by not allowing the start and end block of the range to be beyond the end of the file system. Note that if the start is beyond the end of the file system we have to return -EINVAL, but in the "end" case we have to truncate it to the fs size. Also introduce "end" variable, rather than using start+len which which might be more confusing to get right as this bug shows. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: cleanup xfs_bmap.hChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert all function prototypes to the short form used elsewhere, and remove duplicates of comments already placed at the function body. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: dont ignore error code from xfs_bmbt_updateChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a case in xfs_bmap_add_extent_unwritten_real where we aren't passing the returned error on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: pass bmalloca to xfs_bmap_add_extent_hole_realChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All the parameters passed to xfs_bmap_add_extent_hole_real() are in the xfs_bmalloca structure now. Just pass the bmalloca parameter to the function instead of 8 separate parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: pass bmalloca to xfs_bmap_add_extent_delay_realChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All the parameters passed to xfs_bmap_add_extent_delay_real() are in the xfs_bmalloca structure now. Just pass the bmalloca parameter to the function instead of 8 separate parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: move logflags into bmallocaChristoph Hellwig2011-10-11
| | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: move lastx and nallocs into bmallocaDave Chinner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: move btree cursor into bmallocaDave Chinner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: do not keep local copies of allocation ranges in xfs_bmapi_allocateDave Chinner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: rename allocation range fields in struct xfs_bmallocaDave Chinner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: move firstblock and bmap freelist cursor into bmalloca structureDave Chinner2011-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than passing the firstblock and freelist structure around, embed it into the bmalloca structure and remove it from the function parameters. This also enables the minleft parameter to be set only once in xfs_bmapi_write(), and the freelist cursor directly queried in xfs_bmapi_allocate to clear it when the lowspace algorithm is activated. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>