aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/ctree.c
Commit message (Collapse)AuthorAge
...
* Btrfs: free space accounting redoJosef Bacik2008-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) replace the per fs_info extent_io_tree that tracked free space with two rb-trees per block group to track free space areas via offset and size. The reason to do this is because most allocations come with a hint byte where to start, so we can usually find a chunk of free space at that hint byte to satisfy the allocation and get good space packing. If we cannot find free space at or after the given offset we fall back on looking for a chunk of the given size as close to that given offset as possible. When we fall back on the size search we also try to find a slot as close to the size we want as possible, to avoid breaking small chunks off of huge areas if possible. 2) remove the extent_io_tree that tracked the block group cache from fs_info and replaced it with an rb-tree thats tracks block group cache via offset. also added a per space_info list that tracks the block group cache for the particular space so we can lookup related block groups easily. 3) cleaned up the allocation code to make it a little easier to read and a little less complicated. Basically there are 3 steps, first look from our provided hint. If we couldn't find from that given hint, start back at our original search start and look for space from there. If that fails try to allocate space if we can and start looking again. If not we're screwed and need to start over again. 4) small fixes. there were some issues in volumes.c where we wouldn't allocate the rest of the disk. fixed cow_file_range to actually pass the alloc_hint, which has helped a good bit in making the fs_mark test I run have semi-normal results as we run out of space. Generally with data allocations we don't track where we last allocated from, so everytime we did a data allocation we'd search through every block group that we have looking for free space. Now searching a block group with no free space isn't terribly time consuming, it was causing a slight degradation as we got more data block groups. The alloc_hint has fixed this slight degredation and made things semi-normal. There is still one nagging problem I'm working on where we will get ENOSPC when there is definitely plenty of space. This only happens with metadata allocations, and only when we are almost full. So you generally hit the 85% mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm still tracking it down, but until then this seems to be pretty stable and make a significant performance gain. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Fix leaf overflow check in btrfs_insert_empty_itemsChris Mason2008-09-25
| | | | | | | It was incorrectly adding an extra sizeof(struct btrfs_item) and causing false positives (oops) Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: trivial sparse fixesChristoph Hellwig2008-09-25
| | | | | | | Fix a bunch of trivial sparse complaints. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: missing endianess conversion in insert_new_rootChristoph Hellwig2008-09-25
| | | | | | | Add two missing endianess conversions in this function, found by sparse. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a write ahead tree log to optimize synchronous operationsChris Mason2008-09-25
| | | | | | | | | | | File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs_search_slot: reduce lock contention by cowing in two stagesChris Mason2008-09-25
| | | | | | | | | | | | | | | | A btree block cow has two parts, the first is to allocate a destination block and the second is to copy the old bock over. The first part needs locks in the extent allocation tree, and may need to do IO. This changeset splits that into a separate function that can be called without any tree locks held. btrfs_search_slot is changed to drop its path and start over if it has to COW a contended block. This often means that many writers will pre-alloc a new destination for a the same contended block, but they cache their prealloc for later use on lower levels in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: implement memory reclaim for leaf reference cacheYan2008-09-25
| | | | | | | | | | | | | | The memory reclaiming issue happens when snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After a old snapshot was completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a leaf reference cacheYan Zheng2008-09-25
| | | | | | | | | | Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Fix path slots selection in btrfs_search_forwardYan2008-09-25
| | | | | | | We should decrease the found slot by one as btrfs_search_slot does when bin_search return 1 and node level > 0. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Create orphan inode records to prevent lost files after a crashJosef Bacik2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs_next_leaf: do readahead when skip_locking is turned onChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add locking around volume management (device add/remove/balance)Chris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Reduce contention on the root nodeChris Mason2008-09-25
| | | | | | | | | | | | | This calls unlock_up sooner in btrfs_search_slot in order to decrease the amount of work done with the higher level tree locks held. Also, it changes btrfs_tree_lock to spin for a big against the page lock before scheduling. This makes a big difference in context switch rate under highly contended workloads. Longer term, a better locking structure is needed than the page lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Online btree defragmentation fixesChris Mason2008-09-25
| | | | | | | | | | The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add btree locking to the tree defragmentation codeChris Mason2008-09-25
| | | | | | | The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Replace the transaction work queue with kthreadsChris Mason2008-09-25
| | | | | | | This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.Chris Mason2008-09-25
| | | | | | This lowers the impact of snapshot deletion on the rest of the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a skip_locking parameter to struct path, and make various funcs ↵Chris Mason2008-09-25
| | | | | | | | | | | | | | | | honor it Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Fix btrfs_next_leaf to check for new items after dropping locksChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Fix btrfs_del_ordered_inode to allow forcing the drop during unlinksChris Mason2008-09-25
| | | | | | | This allows us to delete an unlinked inode with dirty pages from the list instead of forcing commit to write these out before deleting the inode. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Drop locks in btrfs_search_slot when reading a tree block.Chris Mason2008-09-25
| | | | | | | | One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Replace the big fs_mutex with a collection of other locksChris Mason2008-09-25
| | | | | | | | Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Start btree concurrency work.Chris Mason2008-09-25
| | | | | | | | | | | | | | | The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Allocator fix variety packChris Mason2008-09-25
| | | | | | | | | | | | | | * Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Handle write errors on raid1 and raid10Chris Mason2008-09-25
| | | | | | | | | | | | When duplicate copies exist, writes are allowed to fail to one of those copies. This changeset includes a few changes that allow the FS to continue even when some IOs fail. It also adds verification of the parent generation number for btree blocks. This generation is stored in the pointer to a block, and it ensures that missed writes to are detected. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Pass down the expected generation number when reading tree blocksChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix balance_level to free the middle block if there is room in the ↵Chris Mason2008-09-25
| | | | | | | | | | left one balance level starts by trying to empty the middle block, and then pushes from the right to the middle. This might empty the right block and leave a small number of pointers in the middle. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't empty the middle buffer in push_nodes_for_insertChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix split_node to require more empty slots in the node as wellChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Make sure nodes have enough room for a double splitChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't wait on tree block writeback before freeing them anymoreChris Mason2008-09-25
| | | | | | | This isn't required anymore because we don't reallocate blocks that have already been written in this transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add chunk uuids and update multi-device back referencesChris Mason2008-09-25
| | | | | | | | | | | | | | | | | | | | Block headers now store the chunk tree uuid Chunk items records the device uuid for each stripes Device extent items record better back refs to the chunk tree Block groups record better back refs to the chunk tree The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY used to be the logical offset of the chunk. Now it is a chunk tree id, with the logical offset being stored in the offset field of the key. This allows a single chunk tree to record multiple logical address spaces, upping the number of bytes indexed by a chunk tree from 2^64 to 2^128. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Disable extra debugging checks on tree blocksChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Retry metadata reads in the face of checksum failuresChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Do metadata checksums for reads via a workqueueChris Mason2008-09-25
| | | | | | | | | | | | | | | | Before, metadata checksumming was done by the callers of read_tree_block, which would set EXTENT_CSUM bits in the extent tree to show that a given range of pages was already checksummed and didn't need to be verified again. But, those bits could go away via try_to_releasepage, and the end result was bogus checksum failures on pages that never left the cache. The new code validates checksums when the page is read. It is a little tricky because metadata blocks can span pages and a single read may end up going via multiple bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Change btrfs_map_block to return a structure with mappings for all stripesChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Properly dirty buffers in the split corner casesChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Verify checksums on tree blocks found without read_tree_blockChris Mason2008-09-25
| | | | | | | | | | | | | | | | | Checksums were only verified by btrfs_read_tree_block, which meant the functions to probe the page cache for blocks were not validating checksums. Normally this is fine because the buffers will only be in cache if they have already been validated. But, there is a window while the buffer is being read from disk where it could be up to date in the cache but not yet verified. This patch makes sure all buffers go through checksum verification before they are used. This is safer, and it prevents modification of buffers before they go through the csum code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Reorder the flags field in struct btrfs_header and record a flag on writeoutChris Mason2008-09-25
| | | | | | | | This allows detection of blocks that have already been written in the running transaction so they can be recowed instead of modified again. It is step one in trusting the transid field of the block pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for multiple devices per filesystemChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Call btrfs_cow_block while lowering tree level.Yan2008-09-25
| | | | | | | | | | When freeing root block of a tree, btrfs_free_extent' parameter 'ref_generation' is from root block itseft. When freeing non-root block, 'ref_generation' is from its parent. so when converting a non-root block to root block, we must guarantee its generation is equal to its parent's generation. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Copy correct tree when inserting into slot 0Chris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add inode item and backref in one insert, reducing cpu usageChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: During deletes and truncate, remove many items at once from the treeChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add data=ordered supportChris Mason2008-09-25
| | | | | | | | This forces file data extents down the disk along with the metadata that references them. The current implementation is fairly simple, and just writes out all of the dirty pages in an inode before the commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Force inlining off in a few places to save stack usageChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add readahead to the online shrinker, and a mount -o alloc_start= for ↵Chris Mason2008-09-25
| | | | | | testing Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Less aggressive readahead on deletesChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* kmalloc a few large stack objects in the btrfs_ioctl pathChris Mason2008-09-25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add mount option to turn off data cowChris Mason2008-09-25
| | | | | | | | | | | A number of workloads do not require copy on write data or checksumming. mount -o nodatasum to disable checksums and -o nodatacow to disable both copy on write and checksumming. In nodatacow mode, copy on write is still performed when a given extent is under snapshot. Signed-off-by: Chris Mason <chris.mason@oracle.com>