aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/tree-defrag.c
Commit message (Collapse)AuthorAge
* Btrfs: do extent allocation and reference count updates in the backgroundChris Mason2009-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Change btree locking to use explicit blocking pointsChris Mason2009-02-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix checkpatch.pl warningsChris Mason2009-01-05
| | | | | | | There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: nuke fs wide allocation mutex V2Josef Bacik2008-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
* Btrfs: add and improve commentsChris Mason2008-09-29
| | | | | | | | | | | This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work todo in cleaning and commenting the code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a write ahead tree log to optimize synchronous operationsChris Mason2008-09-25
| | | | | | | | | | | File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Online btree defragmentation fixesChris Mason2008-09-25
| | | | | | | | | | The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a per-inode csum mutex to avoid races creating csum itemsChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add btree locking to the tree defragmentation codeChris Mason2008-09-25
| | | | | | | The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Start btree concurrency work.Chris Mason2008-09-25
| | | | | | | | | | | | | | | The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Allocator fix variety packChris Mason2008-09-25
| | | | | | | | | | | | | | * Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Handle write errors on raid1 and raid10Chris Mason2008-09-25
| | | | | | | | | | | | When duplicate copies exist, writes are allowed to fail to one of those copies. This changeset includes a few changes that allow the FS to continue even when some IOs fail. It also adds verification of the parent generation number for btree blocks. This generation is stored in the pointer to a block, and it ensures that missed writes to are detected. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Pass down the expected generation number when reading tree blocksChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Verify checksums on tree blocks found without read_tree_blockChris Mason2008-09-25
| | | | | | | | | | | | | | | | | Checksums were only verified by btrfs_read_tree_block, which meant the functions to probe the page cache for blocks were not validating checksums. Normally this is fine because the buffers will only be in cache if they have already been validated. But, there is a window while the buffer is being read from disk where it could be up to date in the cache but not yet verified. This patch makes sure all buffers go through checksum verification before they are used. This is safer, and it prevents modification of buffers before they go through the csum code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Disable tree defrag in SSD modeChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Leave on the tree defragger in mount -o ssd, it still helps thereChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add mount -o ssd, which includes optimizations for seek free storageChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add back pointers from extents to the btree or file referencing themChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Optimize allocations as we need to mix data and metadata into one groupChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Make defrag check nodes against the progress key to prevent repeating ↵Chris Mason2008-09-25
| | | | | | work Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Tune the automatic defrag codeChris Mason2008-09-25
| | | | | | | | | | | 1) Forced defrag wasn't working properly (btrfsctl -d) because some cache only checks were incorrect. 2) Defrag only the leaves unless in forced defrag mode. 3) Don't use complex logic to figure out if a leaf is needs defrag Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Defrag only leaves, and only when the parent node has a single objectidChris Mason2008-09-25
| | | | | | | This allows us to defrag huge directories, but skip the expensive defrag case in more common usage, where it does not help as much. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Defrag: only walk into nodes with the defrag bit setChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Large block related defrag optimizationsChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big.Chris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add back the online defragging codeChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Allow tree blocks larger than the page sizeChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Create extent_buffer interface for large blocksizesChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Add support for defragging files via btrfsctl -d. Avoid OOM on extent treeChris Mason2007-09-10
| | | | | | defrag. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add BH_Defrag to mark buffers that are in need of defraggingChris Mason2007-08-10
| | | | | | | | This allows the tree walking code to defrag only the newly allocated buffers, it seems to be a good balance between perfect defragging and the performance hit of repeatedly reallocating blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Btree defrag on the extent-mapping tree as wellChris Mason2007-08-10
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Further reduce the concurrency penalty of defrag and drop_snapshotChris Mason2007-08-08
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add run time btree defrag, and an ioctl to force btree defragChris Mason2007-08-07
This adds two types of btree defrag, a run time form that tries to defrag recently allocated blocks in the btree when they are still in ram, and an ioctl that forces defrag of all btree blocks. File data blocks are not defragged yet, but this can make a huge difference in sequential btree reads. Signed-off-by: Chris Mason <chris.mason@oracle.com>