path: root/fs/xfs/xfs_log.c
Commit message | Author | Age
...
* xfs: make the log ticket transaction id random (Dave Chinner, 2010-05-19)
  The transaction ID that is written to the log for a transaction is currently set by taking the lower 32 bits of the memory address of the ticket structure. This is not guaranteed to be unique, as tickets come from a slab and slots can be reallocated immediately after being freed. As a result, there is no guarantee of uniqueness in the ticket ID value.
  Fix this by assigning a random number to the ticket ID field, so that duplicates are extremely unlikely and transactions can no longer be mixed up during recovery due to duplicate IDs.
  Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: clean up xlog_write_adv_cnt (Christoph Hellwig, 2010-05-19)
  Replace the awkward xlog_write_adv_cnt with an inline helper that makes it more obvious that it modifies its parameters, and replace the use of an integer type for "ptr" with a real void pointer. Also move xlog_write_adv_cnt to xfs_log_priv.h, as it will be used outside of xfs_log.c in the delayed logging series.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <dchinner@redhat.com>
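A minimal userspace sketch of such a helper, assuming it advances a write pointer, shrinks the remaining length, and grows the record offset in one call, as the message implies. The name `write_adv_cnt` is hypothetical, and the sketch uses `char *` because standard C does not allow arithmetic on `void *` (kernel code relies on a GCC extension for that).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of an xlog_write_adv_cnt-style helper. Every value
 * it modifies is passed by pointer, so a caller can see at a glance that
 * all three are updated together. */
static inline void write_adv_cnt(char **ptr, int *len, int *off, size_t bytes)
{
	assert((size_t)*len >= bytes);  /* never advance past the buffer */
	*ptr += bytes;                  /* move the write cursor forward */
	*len -= (int)bytes;             /* less space remains */
	*off += (int)bytes;             /* the record offset grows */
}
```

Passing all three values by pointer makes the side effects visible at every call site, which is the readability point the commit makes.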
* xfs: introduce new internal log vector structure (Dave Chinner, 2010-05-19)
  The current log IO vector structure is a flat array and not extensible. To make it possible to keep separate log IO vectors for individual log items, we need a method of chaining log IO vectors together. Introduce a new log vector type that can be used to wrap the existing log IO vectors, and use that internally in the log. This means that the existing external interface (xfs_log_write) does not change, and hence no changes to the transaction commit code are required.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Reviewed-by: Dave Chinner <dchinner@redhat.com>
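The chaining idea can be sketched as a wrapper that points at an existing flat array and links to the next wrapper. The struct and field names below are hypothetical stand-ins, not the kernel's definitions; the point is that a caller passing one flat array still works, because that array is simply wrapped as a one-element chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a flat I/O vector entry and the new
 * chainable wrapper; field names are illustrative only. */
struct iovec_entry {
	void *base;
	int   len;
};

struct log_vec {
	struct log_vec     *next;     /* next item's vector, or NULL */
	int                 niovecs;  /* entries in the flat array below */
	struct iovec_entry *iovecs;   /* the existing flat array, wrapped */
};

/* Total bytes described by a whole chain of log vectors. */
static int log_vec_total_len(const struct log_vec *lv)
{
	int total = 0;

	for (; lv; lv = lv->next)
		for (int i = 0; i < lv->niovecs; i++)
			total += lv->iovecs[i].len;
	return total;
}

/* Wrap a caller's flat array as a single-element chain, so an external
 * interface that takes a flat array does not need to change. */
static struct log_vec wrap_flat_iovecs(struct iovec_entry *vecs, int n)
{
	struct log_vec lv = { .next = NULL, .niovecs = n, .iovecs = vecs };
	return lv;
}
```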
* xfs: reindent xlog_write (Christoph Hellwig, 2010-05-19)
  Reindent xlog_write to normal one tab indents and move all variable declarations into the closest enclosing block. Split from a bigger patch by Dave Chinner.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: factor xlog_write (Dave Chinner, 2010-05-19)
  xlog_write is a mess that takes a lot of effort to understand. It is a mass of nested loops with 4-space indents to get it to fit in 80 columns, and lots of funky variables whose meaning and purpose aren't obvious. Break it down into understandable chunks.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: log ticket reservation underestimates the number of iclogs (Dave Chinner, 2010-05-19)
  When allocating a ticket for a transaction, the ticket is initialised with the worst-case log space usage based on the number of bytes the transaction may consume. Part of this calculation is the number of log headers required for the iclog space used up by the transaction.
  This calculation makes an undocumented assumption that if the transaction uses the log header space reservation on an iclog, then it either consumes the entire iclog or it completes. That is - the transaction that is first in an iclog is the transaction that the log header reservation is accounted to. If the transaction is larger than the iclog, then it will use the entire iclog itself. Document this assumption.
  Further, the current calculation uses the rule that we can fit iclog_size bytes of transaction data into an iclog. This is incorrect - the amount of space available in an iclog for transaction data is the size of the iclog minus the space used for log record headers. This means that the calculation is out by 512 bytes per 32k of log space the transaction can consume. This is rarely an issue because maximally sized transactions are extremely uncommon, and for 4k block size filesystems maximal transaction reservations are about 400KB. Hence the error in this case is less than the size of an iclog, which makes it even harder to hit.
  However, anyone using larger directory blocks (16k directory blocks push the maximum transaction size to approx. 900k on a 4k block size filesystem) or a larger block size (e.g. 64k blocks push transactions to the 3-4MB size) could see the error grow to more than an iclog, and at this point the transaction is guaranteed to get a reservation underrun and shut down the filesystem.
  Fix this by adjusting the calculation to compute the correct number of iclogs required and account for them all up front.
  Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
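The arithmetic error can be reproduced with a simplified header count. This sketch assumes a 32 KiB iclog with a 512-byte record header (the figures the message uses) and ignores the extra per-split operation headers a real reservation also adds; `iclogs_naive` and `iclogs_corrected` are hypothetical names.

```c
#include <assert.h>

/* Round-up division, in the style of the kernel's howmany(). */
#define HOWMANY(x, y) (((x) + ((y) - 1)) / (y))

/* Old estimate: assumes a full iclog_size bytes of transaction data fit
 * into every iclog. */
static unsigned int iclogs_naive(unsigned int unit_bytes,
				 unsigned int iclog_size)
{
	return HOWMANY(unit_bytes, iclog_size);
}

/* Corrected estimate: only iclog_size - hdr_size bytes per iclog are
 * available for transaction data. */
static unsigned int iclogs_corrected(unsigned int unit_bytes,
				     unsigned int iclog_size,
				     unsigned int hdr_size)
{
	return HOWMANY(unit_bytes, iclog_size - hdr_size);
}
```

For a 4 MiB transaction, roughly the 64k-block case above, the old rule accounts for 128 iclogs while the data actually needs 131, so three header reservations are missing.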
* xfs: factor log item initialisation (Dave Chinner, 2010-05-19)
  Each log item type does manual initialisation of the log item. Delayed logging introduces new fields that need initialisation, so factor all the open coded initialisation into a common function first.
  Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: ensure that sync updates the log tail correctly (Dave Chinner, 2010-04-16)
  Updates to the VFS layer removed an extra ->sync_fs call into the filesystem during the sync process (from the quota code). Unfortunately the sync code was unknowingly relying on this call to make sure metadata buffers were flushed via a xfs_buftarg_flush() call to move the tail of the log forward in memory before the final transactions of the sync process were issued.
  As a result, the old code would write a very recent log tail value to the log by the end of the sync process, and so a subsequent crash would leave nothing for log recovery to do. Hence in qa test 182, log recovery only replayed a small handful of inode fsync transactions in this case.
  However, with the removal of the extra ->sync_fs call, the log tail was no longer moved forward with the inode fsync transactions near the end of the sync process, because the first (and only) buftarg flush occurred after these transactions went to disk. The result is that log recovery now sees a large number of transactions for metadata that is already on disk. This usually isn't a problem, but when the transactions include inode chunk allocation, the inode create transactions and all subsequent changes are replayed, as we cannot rely on what is on disk being valid. As a result, if the inode was written and contains unlogged changes, the unlogged changes are lost, thereby violating sync semantics.
  The fix is to always issue a transaction after the buftarg flush occurs if the log is not idle or covered. This results in a dummy transaction being written that contains the up-to-date log tail value, which will be very recent. Indeed, it will be at least as recent as the old code would have left on disk, so log recovery will behave exactly as it used to in this situation.
  Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: stop passing opaque handles to xfs_log.c routines (Christoph Hellwig, 2010-03-01)
  Currently we pass opaque xfs_log_ticket_t handles instead of struct xlog_ticket pointers, and void pointers instead of struct xlog_in_core pointers, to various log manager functions. Instead pass properly typed pointers after adding forward declarations for them to xfs_log.h, and adjust the touched function prototypes to the standard XFS style while at it.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <david@fromorbit.com>
  Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: cleanup up xfs_log_force calling conventions (Christoph Hellwig, 2010-01-21)
  Remove the XFS_LOG_FORCE argument, which was always set, and the XFS_LOG_URGE define, which was never used. Split xfs_log_force into two helpers - xfs_log_force, which forces the whole log, and xfs_log_force_lsn, which forces up to the specified LSN. The underlying implementations were already entirely separate, as were the users. Also re-indent the new _xfs_log_force/_xfs_log_force_lsn, which previously had a weird coding style.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: kill XLOG_VEC_SET_TYPE (Christoph Hellwig, 2010-01-21)
  This macro only obfuscates the log item type assignments, so kill it.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <david@fromorbit.com>
  Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: clean up log buffer writes (Christoph Hellwig, 2010-01-15)
  Don't bother using XFS_bwrite as it doesn't provide much code for our use case. Instead opencode it and fold xlog_bdstrat_cb into the new xlog_bdstrat helper.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: improve metadata I/O merging in the elevator (Dave Chinner, 2009-12-16)
  Change all async metadata buffers to use [READ|WRITE]_META I/O types so that the I/O doesn't get issued immediately. This allows merging of adjacent metadata requests but still prioritises them over bulk data. This shows a 10-15% improvement in sequential create speed of small files.
  Don't include the log buffers in this classification - leave them as sync types so they are issued immediately.
  Signed-off-by: Dave Chinner <dgc@sgi.com>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: event tracing support (Christoph Hellwig, 2009-12-15)
  Convert the old xfs tracing support that could only be used with the out of tree kdb and xfsidbg patches to use the generic event tracer.
  To use it, make sure CONFIG_EVENT_TRACING is enabled and then enable all xfs trace channels with:
    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
  or alternatively enable single events by doing the same in one event subdirectory, e.g.
    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
  or set more complex filters, etc. All this is described in more detail in Documentation/trace/events.txt. To read the events, do a
    cat /sys/kernel/debug/tracing/trace
  Compared to the last posting, this patch converts the tracing mostly to the one-tracepoint-per-callsite model that other users of the new tracing facility also employ. This allows very fine-grained control of the tracing, cleaner output of the traces, and also enables the perf tool to use each tracepoint as a virtual performance counter, allowing us to e.g. count how often certain workloads hit various spots in XFS. Take a look at http://lwn.net/Articles/346470/ for some examples.
  Also, the btree tracing isn't included at all yet, as it will require additional core tracing features not in mainline yet; I plan to deliver it later.
  And the really nice thing about this patch is that it actually removes many lines of code while adding this nice functionality:
    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: fix spin_is_locked assert on uni-processor builds (Christoph Hellwig, 2009-08-12)
  Without SMP or preemption spin_is_locked always returns false, so we can't do an assert with it. Instead use assert_spin_locked, which does the right thing on all builds.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
  Reported-by: Johannes Engel <jcnengel@googlemail.com>
  Tested-by: Johannes Engel <jcnengel@googlemail.com>
  Signed-off-by: Felix Blyakher <felixb@sgi.com>
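A toy model shows why the assertion misfires. It assumes, as the message states, that a uniprocessor build without preemption keeps no lock state at all; every name here is hypothetical, and the struct only mimics that behaviour rather than the kernel's spinlock implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Set to 1 to model an SMP build; 0 models UP without preemption. */
#define TOY_SMP 0

struct toy_spinlock {
#if TOY_SMP
	bool held;
#endif
	int unused;   /* keeps the struct non-empty in standard C */
};

static void toy_spin_lock(struct toy_spinlock *l)
{
#if TOY_SMP
	l->held = true;
#else
	(void)l;      /* nothing to record on UP */
#endif
}

/* On UP this always reports "unlocked", so an ASSERT built on it fires
 * even when the caller really holds the lock. */
static bool toy_spin_is_locked(const struct toy_spinlock *l)
{
#if TOY_SMP
	return l->held;
#else
	(void)l;
	return false;
#endif
}

/* assert_spin_locked-style check: enforced only where state exists. */
static void toy_assert_spin_locked(const struct toy_spinlock *l)
{
#if TOY_SMP
	assert(l->held);
#else
	(void)l;      /* no-op on UP: nothing meaningful to check */
#endif
}
```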
* xfs: inform the xfsaild of the push target before sleeping (Dave Chinner, 2009-04-06)
  When trying to reserve log space, we find the amount of space we need, then go to sleep waiting for space. When we are woken, we try to push the tail of the log forward to make sure we have space available. Unfortunately, this means that if there is no space available and everyone who needs space goes to sleep, there is no one left to push the tail of the log to make space available. Once we have a thread waiting for space to become available, the others queue up behind it in a FIFO, and none of them push the tail of the log.
  This can result in everyone going to sleep in xlog_grant_log_space() if the first sleeper races with the last I/O that moves the tail of the log forward. With no further I/O to move the tail of the log, there is nothing to wake the sleepers, and hence all transactions just stop.
  Fix this by making sure the xfsaild will create enough space for the transaction that is about to sleep, by moving the push target far enough forward to ensure that the current process will have enough space available when it is woken. That is, we push the AIL before we go to sleep. Because we've inserted the log ticket into the queue before we've pushed and gone to sleep, subsequent transactions will wait behind this one. Hence we are guaranteed to have space available when we are woken.
  Signed-off-by: Dave Chinner <david@fromorbit.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
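The fix hinges on computing a push target before sleeping. A simplified sketch, assuming a log position of (cycle, block) loosely modelled on an XFS LSN and a target that conservatively moves the tail forward by the full amount of space the sleeper needs; the real threshold calculation differs, and `push_target` is a hypothetical name.

```c
#include <assert.h>

/* A log position as (cycle, block), loosely modelled on an XFS LSN. */
struct log_pos {
	unsigned int cycle;
	unsigned int block;
};

/* Where the AIL should push the tail so that need_blocks become free
 * before a waiter sleeps: the current tail moved forward by the space
 * needed, wrapping into the next cycle at the end of the circular log. */
static struct log_pos push_target(struct log_pos tail,
				  unsigned int need_blocks,
				  unsigned int log_blocks)
{
	struct log_pos t = tail;

	assert(need_blocks <= log_blocks);
	t.block += need_blocks;
	if (t.block >= log_blocks) {
		t.block -= log_blocks;
		t.cycle++;
	}
	return t;
}
```

In the sequence the message describes, the ticket is queued first, the push target is set next, and only then does the task sleep.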
* xfs: validate log feature fields correctly (Dave Chinner, 2009-04-06)
  If the large log sector size feature bit is set in the superblock by accident (say, disk corruption), then the fields that are now considered valid are not checked on production kernels. The checks are present only as ASSERT statements, so they cause a panic on a debug kernel. Change this so that the fields are validity checked if the feature bit is set, and abort the log mount if the fields do not contain valid values.
  Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
  Signed-off-by: Dave Chinner <david@fromorbit.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
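Turning a debug-only ASSERT into a runtime check that fails the mount could look like the sketch below. The function name and the exact bounds (at least one 512-byte basic block, at most the device's sector size) are illustrative assumptions, not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define BBSHIFT 9  /* 512-byte basic blocks */

/* Hypothetical sketch: check the sector size field only when the feature
 * bit says it is meaningful, and return an error that aborts the mount
 * instead of relying on an ASSERT that production kernels compile out. */
static int validate_log_sector(bool has_sector_feature,
			       unsigned int sectlog,      /* log2(sector bytes) */
			       unsigned int max_sectlog)  /* log2(device max) */
{
	if (!has_sector_feature)
		return 0;          /* field unused: default sector size */
	if (sectlog < BBSHIFT)
		return -EINVAL;    /* smaller than a basic block */
	if (sectlog > max_sectlog)
		return -EINVAL;    /* larger than the device supports */
	return 0;
}
```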
* xfs: fix various typos (Malcolm Parsons, 2009-03-29)
  Signed-off-by: Malcolm Parsons <malcolm.parsons@gmail.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: cleanup log unmount handling (Christoph Hellwig, 2009-03-16)
  Kill the current xfs_log_unmount wrapper and opencode the two function calls in the only caller. Rename the current xfs_log_unmount_dealloc to xfs_log_unmount as it undoes xfs_log_mount and the new name makes that more clear.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <david@fromorbit.com>
* xfs: fix error handling in xfs_log_mount (Christoph Hellwig, 2009-02-12)
  We can't just call xfs_log_unmount_dealloc on any failure because the ail thread which is torn down by xfs_log_unmount_dealloc might not be initialized yet.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Felix Blyakher <felixb@sgi.com>
  Reported-by: Lachlan McIlroy <lachlan@sgi.com>
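The usual C pattern for this is a goto ladder in which each failure jumps to a label that undoes only the steps already completed. The steps and names below are hypothetical; the assertions in the teardown helpers stand in for the bug being fixed, which was tearing down an AIL thread that had never been started.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical log state: each flag records whether a setup step ran. */
struct toy_log {
	int log_allocated;
	int ail_started;
};

static void free_log(struct toy_log *l)
{
	assert(l->log_allocated);   /* only undo what was actually done */
	l->log_allocated = 0;
}

static void stop_ail(struct toy_log *l)
{
	assert(l->ail_started);     /* stopping an unstarted thread is the bug */
	l->ail_started = 0;
}

/* fail_at injects a failure: 0 = none, 1 = allocation, 2 = AIL start,
 * 3 = log recovery. */
static int toy_log_mount(struct toy_log *l, int fail_at)
{
	int error;

	if (fail_at == 1)
		return -ENOMEM;         /* nothing to undo yet */
	l->log_allocated = 1;

	if (fail_at == 2) {
		error = -ENOMEM;
		goto out_free_log;      /* AIL never started: skip stop_ail */
	}
	l->ail_started = 1;

	if (fail_at == 3) {
		error = -EIO;
		goto out_stop_ail;      /* undo both steps, newest first */
	}
	return 0;

out_stop_ail:
	stop_ail(l);
out_free_log:
	free_log(l);
	return error;
}
```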
* xfs: remove iclog calculation special cases (Christoph Hellwig, 2009-02-09)
  Our default has been to always use eight 32KB log buffers for a while now, so remove the special casing for larger block size filesystems to use the same or even a lower number of buffers.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <david@fromorbit.com>
* reduce l_icloglock roundtrips (Christoph Hellwig, 2008-12-03)
  All but one caller of xlog_state_want_sync drop and re-acquire l_icloglock around the call to it, just so that xlog_state_want_sync can acquire and drop it. Move all lock operations out of xlog_state_want_sync and assert that l_icloglock is held when it is called.
  Note that it would make sense to extend this scheme to xlog_state_release_iclog, but the locking in there is more complicated, and we'd like to keep the atomic_dec_and_lock optimization for those callers not already holding l_icloglock.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <david@fromorbit.com>
  Signed-off-by: Niv Sardi <xaiki@sgi.com>
* [XFS] sanitize xlog_in_core_t definition (Christoph Hellwig, 2008-11-30)
  Move all fields from xlog_iclog_fields_t into xlog_in_core_t instead of having them in a substructure and then using #defines to make it look like they were directly in xlog_in_core_t. Also document that xlog_in_core_2_t is grossly misnamed, and make all references to it typesafe.
  (First sent on September 15th)
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Chinner <david@fromorbit.com>
  Signed-off-by: Niv Sardi <xaiki@sgi.com>
* [XFS] fix NULL pointer dereference in xfs_log_force_umount (Christoph Hellwig, 2008-11-30)
  xfs_log_force_umount may be called very early during log recovery: if we fail a buffer read in xlog_recover_do_inode_trans, we abort the mount. But at that point log recovery has started delayed writeback of inode buffers. As part of the aborted mount we try to flush out all delwri buffers, but at that point we have already freed the superblock and set mp->m_sb_bp to NULL, and xfs_log_force_umount, which gets called after the inode buffer writeback, trips over it.
  Make xfs_log_force_umount a little more careful when accessing mp->m_sb_bp to avoid this.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
  Signed-off-by: Niv Sardi <xaiki@sgi.com>
* [XFS] Fix double free of log tickets (Dave Chinner, 2008-11-17)
  When an I/O error occurs during an intermediate commit on a rolling transaction, xfs_trans_commit() will free the transaction structure and the related ticket. However, the duplicate transaction that gets used as the transaction continues still contains a pointer to the ticket. Hence when the duplicate transaction is cancelled and freed, we free the ticket a second time.
  Add reference counting to the ticket so that we hold an extra reference to the ticket over the transaction commit. We drop the extra reference once we have checked that the transaction commit did not return an error, thus avoiding a double free on commit error.
  Credit to Nick Piggin for tripping over the problem.
  SGI-PV: 989741
  Signed-off-by: Dave Chinner <david@fromorbit.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
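Holding an extra reference across the commit can be sketched as follows. This is a single-threaded model with hypothetical names; the kernel uses atomic reference counts and its own helpers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted log ticket. */
struct ticket {
	int refcount;
};

static struct ticket *ticket_alloc(void)
{
	struct ticket *t = malloc(sizeof(*t));

	if (t)
		t->refcount = 1;    /* reference owned by the transaction */
	return t;
}

static struct ticket *ticket_get(struct ticket *t)
{
	t->refcount++;
	return t;
}

/* Drop one reference; returns 1 if this call freed the ticket. */
static int ticket_put(struct ticket *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0) {
		free(t);
		return 1;
	}
	return 0;
}
```

On a failed intermediate commit, the commit path drops the transaction's reference and the duplicate transaction's cancel drops the extra one, so the ticket is freed exactly once.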
* [XFS] handle memory allocation failures during log initialisation (Dave Chinner, 2008-11-10)
  When there is no memory left in the system, xfs_buf_get_noaddr() can fail. If this happens at mount time during xlog_alloc_log() we fail to catch the error and oops.
  Catch the error from xfs_buf_get_noaddr(), and allow other memory allocations to fail and catch those errors too. Report the error to the console and fail the mount with ENOMEM.
  Tested by manually injecting errors into xfs_buf_get_noaddr() and xlog_alloc_log().
  Version 2:
  o remove unnecessary casts of the returned pointer from kmem_zalloc()
  SGI-PV: 987246
  Signed-off-by: Dave Chinner <david@fromorbit.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
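Catching allocation failures during log setup typically means freeing whatever was already allocated and returning ENOMEM rather than carrying on with a NULL buffer. A hypothetical sketch with a fault-injection counter, in the spirit of the manual error injection the message mentions; the allocator, the counters, and `alloc_iclogs` are all illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_bufs;          /* allocations currently outstanding */
static int fail_after = -1;    /* fault injection: fail once this hits 0 */

static void *toy_buf_alloc(size_t size)
{
	if (fail_after == 0)
		return NULL;           /* injected allocation failure */
	if (fail_after > 0)
		fail_after--;
	void *p = calloc(1, size);
	if (p)
		live_bufs++;
	return p;
}

static void toy_buf_free(void *p)
{
	if (p) {
		free(p);
		live_bufs--;
	}
}

/* Allocate n in-core log buffers; on any failure, free the ones already
 * allocated and report -ENOMEM to the caller. */
static int alloc_iclogs(void **bufs, int n, size_t size)
{
	int i;

	for (i = 0; i < n; i++) {
		bufs[i] = toy_buf_alloc(size);
		if (!bufs[i])
			goto out_free;
	}
	return 0;

out_free:
	while (--i >= 0) {
		toy_buf_free(bufs[i]);
		bufs[i] = NULL;
	}
	return -ENOMEM;
}
```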
* [XFS] Finish removing the mount pointer from the AIL API (David Chinner, 2008-10-30)
  Change all the remaining AIL API functions that are passed struct xfs_mount pointers to pass pointers directly to the struct xfs_ail being used. With this conversion, all external access to the AIL is via the struct xfs_ail. Hence the operation and referencing of the AIL is almost entirely independent of the xfs_mount that is using it - it is now much more tightly tied to the log and the items it is tracking in the log than it is tied to the xfs_mount.
  SGI-PV: 988143
  SGI-Modid: xfs-linux-melb:xfs-kern:32353a
  Signed-off-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
* [XFS] Given the log a pointer to the AIL (David Chinner, 2008-10-30)
  When we need to go from the log to the AIL, we have to go via the xfs_mount. Add a xfs_ail pointer to the log so we can go directly to the AIL associated with the log.
  SGI-PV: 988143
  SGI-Modid: xfs-linux-melb:xfs-kern:32351a
  Signed-off-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
* [XFS] Move the AIL lock into the struct xfs_ail (David Chinner, 2008-10-30)
  Bring the ail lock inside the struct xfs_ail. This means the AIL can be entirely manipulated via the struct xfs_ail rather than needing both the struct xfs_mount and the struct xfs_ail.
  SGI-PV: 988143
  SGI-Modid: xfs-linux-melb:xfs-kern:32350a
  Signed-off-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
* [XFS] move the AIl traversal over to a consistent interface (David Chinner, 2008-10-30)
  With the new cursor interface, it makes sense to make all the traversing code use the cursor interface and make the old one go away. This means more of the AIL interfacing is done by passing struct xfs_ail pointers around the place instead of struct xfs_mount pointers.
  We can replace the use of xfs_trans_first_ail() in xfs_log_need_covered(), as it is only checking if the AIL is empty. We can do that with a call to xfs_trans_ail_tail() instead, where a returned zero LSN indicates an empty AIL.
  SGI-PV: 988143
  SGI-Modid: xfs-linux-melb:xfs-kern:32348a
  Signed-off-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
* [XFS] Use a cursor for AIL traversal. (David Chinner, 2008-10-30)
  Replace the generation number that currently ensures sanity of AIL traversal with an external cursor that is linked to the AIL.
  Basically, we store the next item in the cursor whenever we want to drop the AIL lock to do something to the current item. When we regain the lock, the current item may already be freed, so we can't reference it, but the next item in the traversal is already held in the cursor.
  When we move or delete an object, we search all the active cursors, and if there is an item match we clear the cursor(s) that point to the object. This forces the traversal to restart transparently.
  We don't invalidate the cursor on insert because the cursor still points to a valid item. If the item is inserted between the current item and the cursor it does not matter; the traversal is considered to be past the insertion point, so it will be picked up in the next traversal.
  Hence traversal restarts pretty much disappear altogether with this method of traversal, which should substantially reduce the overhead of pushing on a busy AIL.
  Version 2:
  o add restart logic
  o comment cursor interface
  o minor cleanups
  SGI-PV: 988143
  SGI-Modid: xfs-linux-melb:xfs-kern:32347a
  Signed-off-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
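The cursor scheme can be modelled with a list of active cursors that deletion scans. A minimal sketch under stated assumptions: a singly linked list instead of the kernel's list_head, an explicit `invalid` flag, and hypothetical names throughout. Insertion is not shown because, as described above, it never invalidates a cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal AIL: a singly linked list of items plus the list
 * of cursors currently traversing it. */
struct ail_item {
	struct ail_item *next;
	int id;
};

struct ail_cursor {
	struct ail_item   *item;     /* next item the traversal will visit */
	int                invalid;  /* set if that item was removed */
	struct ail_cursor *next;     /* chain of active cursors */
};

struct ail {
	struct ail_item   *head;
	struct ail_cursor *cursors;
};

static void cursor_init(struct ail *a, struct ail_cursor *c)
{
	c->item = NULL;
	c->invalid = 0;
	c->next = a->cursors;
	a->cursors = c;
}

static void cursor_done(struct ail *a, struct ail_cursor *c)
{
	for (struct ail_cursor **pp = &a->cursors; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			return;
		}
	}
}

/* Before dropping the lock to work on cur, remember its successor. */
static void cursor_set_next(struct ail_cursor *c, struct ail_item *cur)
{
	c->item = cur->next;
	c->invalid = 0;
}

/* After retaking the lock: continue from the saved item, or restart from
 * the head if that item was removed in the meantime. */
static struct ail_item *cursor_next(struct ail *a, struct ail_cursor *c)
{
	return c->invalid ? a->head : c->item;
}

/* Unlink an item and invalidate any cursor that was holding it. */
static void ail_delete(struct ail *a, struct ail_item *it)
{
	for (struct ail_item **pp = &a->head; *pp; pp = &(*pp)->next) {
		if (*pp == it) {
			*pp = it->next;
			break;
		}
	}
	for (struct ail_cursor *c = a->cursors; c; c = c->next) {
		if (c->item == it) {
			c->item = NULL;
			c->invalid = 1;
		}
	}
}
```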
* Fix barrier fail detection in XFS (Christoph Hellwig, 2008-10-10)
  Currently we disable barriers as soon as we get a buffer in xlog_iodone that has the XBF_ORDERED flag cleared. But this can be the case not only for buffers where the barrier failed, but also for the first buffer of a split log write in case of a log wraparound. Due to the disabled barriers we can easily get directory corruption on unclean shutdowns. So instead of using this check, add a new buffer flag for failed barrier writes.
  This is a regression vs 2.6.26 caused by the patch to use the right macro to check for the ORDERED flag, as we previously got true returned for every buffer.
  Thanks to Toei Rei for reporting the bug.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
  Reviewed-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Tim Shimmin <tes@sgi.com>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
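The difference between inferring failure from a missing flag and recording it explicitly fits in a few lines. The flag names and values are hypothetical stand-ins for XBF_ORDERED and the new failure flag.

```c
#include <assert.h>
#include <stdbool.h>

#define BUF_ORDERED         (1u << 0)  /* write issued as a barrier */
#define BUF_ORDERED_FAILED  (1u << 1)  /* barrier rejected by the device */

/* Old heuristic: "ORDERED is clear, so the barrier must have failed".
 * This is also true for the unflagged first half of a split log write. */
static bool barrier_failed_old(unsigned int flags)
{
	return !(flags & BUF_ORDERED);
}

/* Fixed check: only an explicit failure flag disables barriers. */
static bool barrier_failed_new(unsigned int flags)
{
	return (flags & BUF_ORDERED_FAILED) != 0;
}
```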
* [XFS] Fix barrier status change detection. (David Chinner, 2008-09-17)
  The current code in xlog_iodone() uses the wrong macro to check if the barrier has been cleared due to an EOPNOTSUPP error from the lower layer.
  SGI-PV: 986143
  SGI-Modid: xfs-linux-melb:xfs-kern:31984a
  Signed-off-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
  Signed-off-by: Peter Leckie <pleckie@sgi.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* [XFS] Move memory allocations for log tracing out of the critical path (Lachlan McIlroy, 2008-09-17)
  Memory allocations for log->l_grant_trace and iclog->ic_trace are done on demand when the first event is logged. In xlog_state_get_iclog_space() we call xlog_trace_iclog() under a spinlock and allocating memory here can cause us to sleep with a spinlock held and deadlock the system.
  For the log grant tracing we use KM_NOSLEEP but that means we can lose trace entries. Since there is no locking to serialize the log grant tracing we could race and have multiple allocations and leak memory.
  So move the allocations to where we initialize the log/iclog structures. Use KM_NOFS to avoid recursing into the filesystem and drop log->l_trace since it's not even used.
  SGI-PV: 983738
  SGI-Modid: xfs-linux-melb:xfs-kern:31896a
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
* [XFS] Fix use after free in xfs_log_done(). (Lachlan McIlroy, 2008-08-13)
  The ticket allocation code got reworked in 2.6.26 and we now free tickets whereas before we used to cache them so the use-after-free went undetected.
  SGI-PV: 985525
  SGI-Modid: xfs-linux-melb:xfs-kern:31877a
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: David Chinner <david@fromorbit.com>
* [XFS] Use KM_NOFS for debug trace buffers (Lachlan McIlroy, 2008-08-13)
  Use KM_NOFS to prevent recursion back into the filesystem, which can cause deadlocks.
  In the case of xfs_iread() we hold the lock on the inode cluster buffer while allocating memory for the trace buffers. If we recurse back into XFS to flush data, that may require a transaction to allocate extents, which needs log space. This can deadlock with the xfsaild thread, which can't push the tail of the log because it is trying to get the inode cluster buffer lock.
  SGI-PV: 981498
  SGI-Modid: xfs-linux-melb:xfs-kern:31838a
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
  Signed-off-by: David Chinner <david@fromorbit.com>
* [XFS] cleanup xfs_mountfs (Christoph Hellwig, 2008-08-13)
  Remove all the useless flags and the code keyed off them in xfs_mountfs.
  SGI-PV: 981498
  SGI-Modid: xfs-linux-melb:xfs-kern:31831a
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* [XFS] clean up stale references to semaphores (David Chinner, 2008-08-13)
  A lot of code has been converted away from semaphores, but there are still comments that reference semaphore behaviour. The log code is the worst offender. Update the comments to reflect what the code really does now.
  SGI-PV: 981498
  SGI-Modid: xfs-linux-melb:xfs-kern:31814a
  Signed-off-by: David Chinner <david@fromorbit.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* [XFS] Convert l_flushsema to a sv_t (Matthew Wilcox, 2008-07-28)
  The l_flushsema doesn't exactly have completion semantics, nor mutex semantics. It's used as a list of tasks which are waiting to be notified that a flush has completed. It was also being used in a way that was potentially racy, depending on the semaphore implementation.
  By using a sv_t instead of a semaphore we avoid the need for a separate counter, since we know we just need to wake everything on the queue.
  Original waitqueue implementation from Matthew Wilcox. Cleanup and conversion to sv_t by Christoph Hellwig.
  SGI-PV: 981507
  SGI-Modid: xfs-linux-melb:xfs-kern:31059a
  Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
  Signed-off-by: Christoph Hellwig <hch@infradead.org>
  Signed-off-by: David Chinner <dgc@sgi.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* [XFS] Ensure that 2 GiB xfs logs work properly. (Michael Nishimoto, 2008-07-28)
  We found this while experimenting with 2GiB xfs logs. The previous code never assumed that xfs logs would ever get so large.
  SGI-PV: 981502
  SGI-Modid: xfs-linux-melb:xfs-kern:31058a
  Signed-off-by: Michael Nishimoto <miken@agami.com>
  Signed-off-by: David Chinner <dgc@sgi.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
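The message does not say which calculations overflowed, but the general hazard is easy to show: 2 GiB is 2^31 bytes, one more than the largest signed 32-bit value, so converting a 2 GiB log's 512-byte basic-block count to bytes needs 64-bit arithmetic. `bb_to_bytes` is a hypothetical helper, not the kernel's macro.

```c
#include <assert.h>
#include <stdint.h>

#define BBSHIFT 9   /* 512-byte basic blocks */

/* Convert a basic-block count to bytes using 64-bit arithmetic. The same
 * shift done in a 32-bit int would overflow for a 2 GiB log. */
static int64_t bb_to_bytes(int64_t bbs)
{
	return bbs << BBSHIFT;
}
```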
* [XFS] Remove unused arg from kmem_free() (Denys Vlasenko, 2008-07-28)
  The kmem_free() function takes (ptr, size) arguments but doesn't actually use the second one. This patch removes the size argument from all callsites.
  SGI-PV: 981498
  SGI-Modid: xfs-linux-melb:xfs-kern:31050a
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
  Signed-off-by: David Chinner <dgc@sgi.com>
  Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* Fix reference counting race on log buffers (Dave Chinner, 2008-07-11)
  When we release the iclog, we do an atomic_dec_and_lock to determine if we are the last reference and need to trigger update of log headers and writeout. However, in xlog_state_get_iclog_space() we also need to check if we have the last reference count there. If we do, we release the log buffer, otherwise we decrement the reference count.
  But the compare and decrement in xlog_state_get_iclog_space() is not atomic, so both places can see a reference count of 2 and neither will release the iclog. That leads to a filesystem hang.
  Close the race by replacing the atomic_read() and atomic_dec() pair with atomic_add_unless() to ensure that they are executed atomically.
  Signed-off-by: Dave Chinner <david@fromorbit.com>
  Reviewed-by: Tim Shimmin <tes@sgi.com>
  Tested-by: Eric Sandeen <sandeen@sandeen.net>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
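The kernel's atomic_add_unless(v, a, u) adds a to v unless v equals u, as a single atomic operation. A userspace C11 analogue built on a compare-and-exchange loop, with hypothetical names, shows how this closes the race: the check for the last reference and the decrement can no longer be separated by another thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add a to *v unless *v == u, atomically. Returns true if the add
 * happened. */
static bool atomic_add_unless_sketch(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Drop a reference unless it is the last one. A false return means the
 * caller holds the final reference and must release the buffer itself. */
static bool ref_put_unless_last(atomic_int *refs)
{
	return atomic_add_unless_sketch(refs, -1, 1);
}
```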
* [XFS] Catch log unmount failures.David Chinner2008-04-17
| | | | | | | | | | | | Unmounting the log can fail. Unlikely, but it can. Catch all the error conditions and make sure they are propagated upwards. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30833a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* [XFS] Sanitise xfs_log_force error checking.David Chinner2008-04-17
| | | | | | | | | | | | | | | | | | | | xfs_log_force() is declared to return an error, but we almost never check it. We don't need to check it in most cases; if there's a log I/O error then we'll be shutting down the filesystem anyway and that means we'll catch the error somewhere else. However, on certain calls we should be returning an error (sync transactions, fsync, sync writes, etc.), so this isn't a pure black and white distinction. Hence make xfs_log_force() a void function that issues a warning to the syslog on error, and call _xfs_log_force() in all the places where we actually care about the error status returned. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30832a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
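The split between a void log force and an error-returning one can be sketched as follows. The names `struct mount`, `log_force_err()` and `log_force()` are hypothetical stand-ins for the XFS mount structure, `_xfs_log_force()` and `xfs_log_force()`. The I/O error is simulated by a flag, and the warning goes to stderr rather than the kernel log.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical mount structure for this sketch only. */
struct mount {
	int log_io_error; /* nonzero simulates a log I/O failure */
	int warnings;     /* warnings issued by the void wrapper */
};

/* Error-returning variant: callers that must report failure
 * (sync transactions, fsync, sync writes) use this one. */
static int log_force_err(struct mount *mp)
{
	assert(mp != NULL);
	return mp->log_io_error ? EIO : 0;
}

/* Void variant for the common case: the error is only reported, because
 * a log I/O error shuts the filesystem down and is caught elsewhere. */
static void log_force(struct mount *mp)
{
	int error = log_force_err(mp);

	if (error) {
		fprintf(stderr, "log force failed: error %d\n", error);
		mp->warnings++;
	}
}
```

Keeping the real work in one error-returning function means the void wrapper cannot drift from it; callers simply pick the variant that matches whether they can act on the error.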
* [XFS] replace remaining __FUNCTION__ occurrencesHarvey Harrison2008-04-17
| | | | | | | | | | | __FUNCTION__ is gcc-specific, use __func__ SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30775a Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* [XFS] Fix lock inversion in forced shutdown.David Chinner2008-04-17
| | | | | | | | | | | | | | | Recent changes to xlog_state_release_iclog() placed the grant_lock inside the icloglock. Forced unmount of the log takes them the opposite way around, but does not depend on that order for correctness. Fix the inversion by changing the order in which the locks are taken in xfs_log_force_umount(). SGI-PV: 979661 SGI-Modid: xfs-linux-melb:xfs-kern:30773a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
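A consistent lock order can be enforced mechanically. The sketch below is a tiny, single-threaded imitation of the kernel's lockdep checking, not XFS code. Each hypothetical `struct ranked_lock` carries a rank, and taking a lock ranked at or below one already held trips an assertion. The ranks encode the order the release path uses (icloglock outer, grant_lock inner), and `force_umount_sketch()` is a hypothetical forced-shutdown path written in that same order.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical spinlock with a rank: locks must be taken in strictly
 * increasing rank order. */
struct ranked_lock {
	atomic_flag flag;
	int rank;
};

/* Ranks of the locks currently held by this (single) thread, innermost last. */
static int held[8];
static int nheld;

static void rlock(struct ranked_lock *l)
{
	assert(nheld < 8);
	/* Taking a lock ranked at or below one already held is an inversion. */
	assert(nheld == 0 || l->rank > held[nheld - 1]);
	while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
		; /* spin */
	held[nheld++] = l->rank;
}

static void runlock(struct ranked_lock *l)
{
	/* This sketch only supports releasing locks in reverse order. */
	assert(nheld > 0 && held[nheld - 1] == l->rank);
	nheld--;
	atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

/* icloglock is the outer lock, grant_lock the inner one. */
static struct ranked_lock icloglock = { ATOMIC_FLAG_INIT, 1 };
static struct ranked_lock grant_lock = { ATOMIC_FLAG_INIT, 2 };

/* Forced-shutdown path taking the locks in the same order as the release path. */
static void force_umount_sketch(void)
{
	rlock(&icloglock);
	rlock(&grant_lock);
	/* ... shutdown work that needs both locks ... */
	runlock(&grant_lock);
	runlock(&icloglock);
}
```

Swapping the two `rlock()` calls in `force_umount_sketch()` would fail the ordering assertion on the first run, even without a second thread, which is the point of checking the order rather than waiting for a deadlock.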
* [XFS] Reorganise xlog_t for better cacheline isolation of contentionDavid Chinner2008-04-17
| | | | | | | | | | | | | | | To reduce contention on the log on systems with large CPU counts, separate out different parts of the xlog_t structure onto different cachelines. Move each lock onto a different cacheline along with all the members that are accessed/modified while that lock is held. Also, move the debugging code into debug code. SGI-PV: 978729 SGI-Modid: xfs-linux-melb:xfs-kern:30772a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
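The layout idea can be sketched with C11 `alignas`. This is a heavily simplified, hypothetical structure, not the real xlog_t. It assumes 64-byte cache lines (common on x86-64; the kernel itself provides `____cacheline_aligned_in_smp` for this purpose) and uses plain `int` fields as stand-ins for spinlocks. Each lock starts its own cache line and is followed by the fields it protects.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
#define CACHELINE 64

/* Hypothetical log structure for this sketch only. */
struct log_sketch {
	/* Read-mostly fields share the first line. */
	int logsize;

	/* icloglock (int stand-in) and the iclog state it protects. */
	alignas(CACHELINE) int icloglock;
	void *iclog;

	/* grant_lock (int stand-in) and the reservation counters it protects. */
	alignas(CACHELINE) int grant_lock;
	long grant_reserve_bytes;
	long grant_write_bytes;
};

/* Compile-time check that each lock begins a new cache line. */
static_assert(offsetof(struct log_sketch, icloglock) % CACHELINE == 0,
	      "icloglock must start a cache line");
static_assert(offsetof(struct log_sketch, grant_lock) % CACHELINE == 0,
	      "grant_lock must start a cache line");
```

CPUs hammering `grant_lock` then dirty only the line holding it and its counters, so they do not keep stealing the line that `icloglock` users need. The cost is a larger structure because of the padding.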
* [XFS] Remove the xlog_ticket allocatorDavid Chinner2008-04-17
| | | | | | | | | | | | | | | | The ticket allocator is just a simple slab implementation internal to the log. It requires the icloglock to be held when manipulating it and this contributes to contention on that lock. Just kill the entire allocator and use a memory zone instead. While there, allow us to gracefully fail allocation with ENOMEM. SGI-PV: 978729 SGI-Modid: xfs-linux-melb:xfs-kern:30771a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* [XFS] Per iclog callback chain lockDavid Chinner2008-04-17
| | | | | | | | | | | | | | | | | Rather than use the icloglock for protecting the iclog completion callback chain, use a new per-iclog lock so that walking the callback chain doesn't require holding a global lock. This reduces contention on the icloglock during transaction commit and log I/O completion by reducing the number of times we need to hold the global icloglock during these operations. SGI-PV: 978729 SGI-Modid: xfs-linux-melb:xfs-kern:30770a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
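A per-object callback lock might look like the following userspace sketch. All names (`spin_t`, `struct log_callback`, `struct iclog_cb`, `count_cb`) are hypothetical and are not the XFS definitions. Adding a callback and detaching the chain on I/O completion take only the iclog's own lock. The callbacks then run with that lock dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal spinlock standing in for the kernel's spinlock_t. */
typedef struct { atomic_flag f; } spin_t;

static void spin_init(spin_t *l) { atomic_flag_clear(&l->f); }
static void spin_lock(spin_t *l)
{
	while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
		; /* spin */
}
static void spin_unlock(spin_t *l)
{
	atomic_flag_clear_explicit(&l->f, memory_order_release);
}

/* Hypothetical completion callback node. */
struct log_callback {
	struct log_callback *next;
	void (*func)(void *arg);
	void *arg;
};

/* Each iclog owns the lock that protects its callback chain. */
struct iclog_cb {
	spin_t callback_lock;
	struct log_callback *head;
	struct log_callback **tail; /* points at the terminating NULL link */
};

static void iclog_cb_init(struct iclog_cb *ic)
{
	spin_init(&ic->callback_lock);
	ic->head = NULL;
	ic->tail = &ic->head;
}

/* Append under the per-iclog lock only; no log-wide lock involved. */
static void iclog_add_callback(struct iclog_cb *ic, struct log_callback *cb)
{
	cb->next = NULL;
	spin_lock(&ic->callback_lock);
	assert(*ic->tail == NULL);
	*ic->tail = cb;
	ic->tail = &cb->next;
	spin_unlock(&ic->callback_lock);
}

/* On I/O completion: detach the chain under the lock, then run the
 * callbacks without it. Returns the number of callbacks run. */
static int iclog_run_callbacks(struct iclog_cb *ic)
{
	spin_lock(&ic->callback_lock);
	struct log_callback *cb = ic->head;
	ic->head = NULL;
	ic->tail = &ic->head;
	spin_unlock(&ic->callback_lock);

	int n = 0;
	while (cb) {
		struct log_callback *next = cb->next; /* callback may free cb */
		cb->func(cb->arg);
		cb = next;
		n++;
	}
	return n;
}

/* Example callback: increments an int counter. */
static void count_cb(void *arg)
{
	++*(int *)arg;
}
```

Detaching the whole chain before running it keeps the lock hold time short and independent of how much work the callbacks do.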
* [XFS] Use atomics for iclog reference countingDavid Chinner2008-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | Now that we update the log tail LSN less frequently on transaction completion, we pass the contention straight to the global log state lock (l_iclog_lock) during transaction completion. We currently have to take this lock to decrement the iclog reference count. There is a reference count on each iclog, so we need to take the global lock for all refcount changes. When large numbers of processes are all doing small transactions, the iclog reference counts will be quite high, and the state change that absolutely requires the l_iclog_lock is the exception rather than the norm. Change the reference counting on the iclogs to use atomic_inc/dec so that we can use atomic_dec_and_lock during transaction completion and avoid the need for grabbing the l_iclog_lock for every reference count decrement except the one that matters: the last. SGI-PV: 975671 SGI-Modid: xfs-linux-melb:xfs-kern:30505a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
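The benefit of `atomic_dec_and_lock` comes from its fast path. The sketch below is a userspace model of the kernel helper's contract, not its actual source. `dec_and_lock()` returns true, with the lock held, only when this call dropped the count to zero. Every other decrement is a lock-free compare-and-swap. `spin_t`, `dec_unless_one()` and the `lock_acquisitions` counter are hypothetical helpers for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal spinlock standing in for the kernel's spinlock_t. */
typedef struct { atomic_flag f; } spin_t;

static int lock_acquisitions; /* how often the slow path took the lock */

static void spin_init(spin_t *l) { atomic_flag_clear(&l->f); }
static void spin_lock(spin_t *l)
{
	while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
		; /* spin */
	lock_acquisitions++; /* safe: updated only while holding the lock */
}
static void spin_unlock(spin_t *l)
{
	atomic_flag_clear_explicit(&l->f, memory_order_release);
}

/* Fast path: decrement unless the count is 1, without taking the lock. */
static bool dec_unless_one(atomic_int *v)
{
	int old = atomic_load(v);

	assert(old > 0); /* a refcount underflow would be a bug */
	while (old != 1) {
		if (atomic_compare_exchange_weak(v, &old, old - 1))
			return true;
	}
	return false;
}

/* Returns true with 'lock' held only if this call dropped the count to 0. */
static bool dec_and_lock(atomic_int *v, spin_t *lock)
{
	if (dec_unless_one(v))
		return false;          /* not the last reference: no lock taken */
	spin_lock(lock);
	if (atomic_fetch_sub(v, 1) == 1)
		return true;           /* last reference: caller must unlock */
	spin_unlock(lock);             /* raced with a new reference; back off */
	return false;
}
```

Dropping a reference from 3 to 0 this way takes the lock exactly once, on the final decrement, which mirrors the commit's goal of touching l_iclog_lock only for the reference that matters.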