aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/ctree.h
Commit message (Collapse)AuthorAge
...
* Btrfs: Wait for kernel threads to make progress during async submissionChris Mason2008-09-25
| | | | | | | | | | | Before this change, btrfs would use a bdi congestion function to make sure there weren't too many pending async checksum work items. This change makes the process creating async work items wait instead, leading to fewer congestion returns from the bdi. This improves pdflush background_writeout scanning. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Count async bios separately from async checksum work itemsChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix RHEL test for ClearPageFsMiscEric Sandeen2008-09-25
| | | | | | | | | | Newer RHEL5 kernels define both ClearPageFSMisc and ClearPageChecked, so test for both before redefining. Signed-off-by: Eric Sandeen <sandeen@redhat.com> --- Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix nodatacow for the new data=ordered modeYan Zheng2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Maintain a list of inodes that are delalloc and a way to wait on themChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix ioctl-initiated transactions vs wait_current_trans()Sage Weil2008-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in progress) breaks the transaction start/stop ioctls by making btrfs_start_transaction conditionally wait for the next transaction to start. If an application artificially is holding a transaction open, things deadlock. This workaround maintains a count of open ioctl-initiated transactions in fs_info, and avoids wait_current_trans() if any are currently open (in start_transaction() and btrfs_throttle()). The start transaction ioctl uses a new btrfs_start_ioctl_transaction() that _does_ call wait_current_trans(), effectively pushing the join/wait decision to the outer ioctl-initiated transaction. This more or less neuters btrfs_throttle() when ioctl-initiated transactions are in use, but that seems like a pretty fundamental consequence of wrapping lots of write()'s in a transaction. Btrfs has no way to tell if the application considers a given operation as part of it's transaction. Obviously, if the transaction start/stop ioctls aren't being used, there is no effect on current behavior. Signed-off-by: Sage Weil <sage@newdream.net> --- ctree.h | 1 + ioctl.c | 12 +++++++++++- transaction.c | 18 +++++++++++++----- transaction.h | 2 ++ 4 files changed, 27 insertions(+), 6 deletions(-) Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs_search_slot: reduce lock contention by cowing in two stagesChris Mason2008-09-25
| | | | | | | | | | | | | | | | A btree block cow has two parts, the first is to allocate a destination block and the second is to copy the old bock over. The first part needs locks in the extent allocation tree, and may need to do IO. This changeset splits that into a separate function that can be called without any tree locks held. btrfs_search_slot is changed to drop its path and start over if it has to COW a contended block. This often means that many writers will pre-alloc a new destination for a the same contended block, but they cache their prealloc for later use on lower levels in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix streaming read performance with checksumming onChris Mason2008-09-25
| | | | | | | | | | | | | | | | | | Large streaming reads make for large bios, which means each entry on the list async work queues represents a large amount of data. IO congestion throttling on the device was kicking in before the async worker threads decided a single thread was busy and needed some help. The end result was that a streaming read would result in a single CPU running at 100% instead of balancing the work off to other CPUs. This patch also changes the pre-IO checksum lookup done by reads to work on a per-bio basis instead of a per-page. This results in many extra btree lookups on large streaming reads. Doing the checksum lookup right before bio submit allows us to reuse searches while processing adjacent offsets. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: implement memory reclaim for leaf reference cacheYan2008-09-25
| | | | | | | | | | | | | | The memory reclaiming issue happens when snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After a old snapshot was completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Update and fix mount -o nodatacowYan Zheng2008-09-25
| | | | | | | | | | | | | | To check whether a given file extent is referenced by multiple snapshots, the checker walks down the fs tree through dead root and checks all tree blocks in the path. We can easily detect whether a given tree block is directly referenced by other snapshot. We can also detect any indirect reference from other snapshot by checking reference's generation. The checker can always detect multiple references, but can't reliably detect cases of single reference. So btrfs may do file data cow even there is only one reference. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Throttle operations if the reference cache gets too largeChris Mason2008-09-25
| | | | | | | | | | | | A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Leaf reference cache updateChris Mason2008-09-25
| | | | | | | | | | | | | | | This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside. This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping. Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a leaf reference cacheYan Zheng2008-09-25
| | | | | | | | | | Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Rev the disk format magicChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Create orphan inode records to prevent lost files after a crashJosef Bacik2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add ACL supportJosef Bacik2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Remove unused xattr codeJosef Bacik2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Implement new dir index formatJosef Bacik2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix the defragmention code and the block relocation code for data=orderedChris Mason2008-09-25
| | | | | | | | | | | Before setting an extent to delalloc, the code needs to wait for pending ordered extents. Also, the relocation code needs to wait for ordered IO before scanning the block group again. This is because the extents are not removed until the IO for the new extents is finished Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix some build problems on 2.6.18 based enterprise kernelsChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: alloc_mutex latency reductionChris Mason2008-09-25
| | | | | | | | This releases the alloc_mutex in a few places that hold it for over long operations. btrfs_lookup_block_group is changed so that it doesn't need the mutex at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Use mutex_lock_nested for tree lockingChris Mason2008-09-25
| | | | | | | | | | | | Lockdep has the notion of locking subclasses so that you can identify locks you expect to be taken after other locks of the same class. This changes the per-extent buffer btree locking routines to use a subclass based on the level in the tree. Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs max level is also 8. So when lockdep is on, use a lower max level. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix some data=ordered related data corruptionsChris Mason2008-09-25
| | | | | | | | | | | | | | | | | | Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Handle data checksumming on bios that span multiple ordered extentsChris Mason2008-09-25
| | | | | | | | Data checksumming is done right before the bio is sent down the IO stack, which means a single bio might span more than one ordered extent. In this case, the checksumming data is split between two ordered extents. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs_start_transaction: wait for commits in progress to finishChris Mason2008-09-25
| | | | | | | | | | | | | btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Use async helpers to deal with pages that have been improperly dirtiedChris Mason2008-09-25
| | | | | | | | | Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: New data=ordered implementationChris Mason2008-09-25
| | | | | | | | | | | | | | | | | | | | | | | | The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add locking around volume management (device add/remove/balance)Chris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Online btree defragmentation fixesChris Mason2008-09-25
| | | | | | | | | | The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add btree locking to the tree defragmentation codeChris Mason2008-09-25
| | | | | | | The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Replace the transaction work queue with kthreadsChris Mason2008-09-25
| | | | | | | This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a skip_locking parameter to struct path, and make various funcs ↵Chris Mason2008-09-25
| | | | | | | | | | | | | | | | honor it Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Drop locks in btrfs_search_slot when reading a tree block.Chris Mason2008-09-25
| | | | | | | | One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Replace the big fs_mutex with a collection of other locksChris Mason2008-09-25
| | | | | | | | Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Start btree concurrency work.Chris Mason2008-09-25
| | | | | | | | | | | | | | | The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a thread pool just for submit_bioChris Mason2008-09-25
| | | | | | | | If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: split out ioctl.cChristoph Hellwig2008-09-25
| | | | | | | | Split the ioctl handling out of inode.c into a file of it's own. Also fix up checkpatch.pl warnings for the moved code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a mount option to control worker thread pool sizeChris Mason2008-09-25
| | | | | | | | | | | | mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add async worker threads for pre and post IO checksummingChris Mason2008-09-25
| | | | | | | | | | | | | | | | | | | | Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs: sanity mount option parsing and early mount codeChristoph Hellwig2008-09-25
| | | | | | | Also adds lots of comments to describe what's going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: transaction ioctlsSage Weil2008-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These ioctls let a user application hold a transaction open while it performs a series of operations. A final ioctl does a sync on the fs (closing the current transaction). This is the main requirement for Ceph's OSD to be able to keep the data it's storing in a btrfs volume consistent, and AFAICS it works just fine. The application would do something like fd = ::open("some/file", O_RDONLY); ::ioctl(fd, BTRFS_IOC_TRANS_START); /* do a bunch of stuff */ ::ioctl(fd, BTRFS_IOC_TRANS_END); or just ::close(fd); And to ensure it commits to disk, ::ioctl(fd, BTRFS_IOC_SYNC); When a transaction is held open, the trans_handle is attached to the struct file (via private_data) so that it will get cleaned up if the process dies unexpectedly. A held transaction is also ended on fsync() to avoid a deadlock. A misbehaving application could also deliberately hold a transaction open, effectively locking up the FS, so it may make sense to restrict something like this to root or something. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Invalidate dcache entry after creating snapshot andSven Wegener2008-09-25
| | | | | | | | | | | | | | We need to invalidate an existing dcache entry after creating a new snapshot or subvolume, because a negative dache entry will stop us from accessing the new snapshot or subvolume. --- ctree.h | 23 +++++++++++++++++++++++ inode.c | 4 ++++ transaction.c | 4 ++++ 3 files changed, 31 insertions(+) Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Allocator fix variety packChris Mason2008-09-25
| | | | | | | | | | | | | | * Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Change the congestion functions to meter the number of async submits ↵Chris Mason2008-09-25
| | | | | | | | | as well The async submit workqueue was absorbing too many requests, leading to long stalls where the async submitters were stalling. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add mount -o degraded to allow mounts to continue with missing devicesChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Update nodatacow mode to support cloned single files and resizingChris Mason2008-09-25
| | | | | | | | | | | Before, nodatacow only checked to make sure multiple roots didn't have references on a single extent. This check makes sure that multiple inodes don't have references. nodatacow needed an extra check to see if the block group was currently readonly. This way cows forced by the chunk relocation code are honored. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Properly find the root for snapshotted blocks during chunk relocationChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for online device removalChris Mason2008-09-25
| | | | | | | | | | | | | This required a few structural changes to the code that manages bdev pointers: The VFS super block now gets an anon-bdev instead of a pointer to the lowest bdev. This allows us to avoid swapping the super block bdev pointer around at run time. The code to read in the super block no longer goes through the extent buffer interface. Things got ugly keeping the mapping constant. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Clone file data ioctlSage Weil2008-09-25
| | | | | | Add a new ioctl to clone file data Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add balance ioctl to restripe the chunksChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>