aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/ctree.c
Commit message (Collapse)AuthorAge
* Btrfs: fix split_leaf double split corner caseChris Mason2010-07-19
| | | | | | | | | | | | split_leaf was not properly balancing leaves when it was forced to split a leaf twice. This commit adds an extra push left and right before forcing the double split in hopes of getting the slot where we want to insert at either the start or end of the leaf. If the extra pushes do work, then we are able to avoid splitting twice and we keep the tree properly balanced. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix block generation verification raceYan, Zheng2010-05-26
| | | | | | | | | | After the path is released, the generation number got from block pointer is no long valid. The race may cause disk corruption, because verify_parent_transid() calls clear_extent_buffer_uptodate() when generation numbers mismatch. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Metadata ENOSPC handling for balanceYan, Zheng2010-05-25
| | | | | | | | | | | | | | This patch adds metadata ENOSPC handling for the balance code. It is consisted by following major changes: 1. Avoid COW tree leave in the phrase of merging tree. 2. Handle interaction with snapshot creation. 3. make the backref cache can live across transactions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Introduce contexts for metadata reservationYan, Zheng2010-05-25
| | | | | | | | | | | | | | | | | | Introducing metadata reseravtion contexts has two major advantages. First, it makes metadata reseravtion more traceable. Second, it can reclaim freed space and re-add them to the itself after transaction committed. Besides add btrfs_block_rsv structure and related helper functions, This patch contains following changes: Move code that decides if freed tree block should be pinned into btrfs_free_tree_block(). Make space accounting more accurate, mainly for handling read only block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2010-04-05
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: add check for changed leaves in setup_leaf_for_split Btrfs: create snapshot references in same commit as snapshot Btrfs: fix small race with delalloc flushing waitqueue's Btrfs: use add_to_page_cache_lru, use __page_cache_alloc Btrfs: fix chunk allocate size calculation Btrfs: kill max_extent mount option Btrfs: fail to mount if we have problems reading the block groups Btrfs: check btrfs_get_extent return for IS_ERR() Btrfs: handle kmalloc() failure in inode lookup ioctl Btrfs: dereferencing freed memory Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk() Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree() Btrfs: Remove unnecessary finish_wait() in wait_current_trans() Btrfs: add NULL check for do_walk_down() Btrfs: remove duplicate include in ioctl.c Fix trivial conflict in fs/btrfs/compression.c due to slab.h include cleanups.
| * Btrfs: add check for changed leaves in setup_leaf_for_splitChris Mason2010-04-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | setup_leaf_for_split needs to drop the path and search again, and has checks to see if the item we want to split changed size. But, it misses the case where the leaf changed and now has enough room for the item we want to insert. This adds an extra check to make sure the leaf really needs splitting before we call btrfs_split_leaf(), which keeps us from trying to split a leaf with a single item. btrfs_split_leaf() will blindly split the single item leaf, leaving us with one good leaf and one empty leaf and then a crash. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-30
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* Btrfs: Fix per root used space accountingYan, Zheng2009-12-17
| | | | | | | | | | The bytes_used field in root item was originally planned to trace the amount of used data and tree blocks. But it never worked right since we can't trace freeing of data accurately. This patch changes it to only trace the amount of tree blocks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add btrfs_duplicate_itemYan, Zheng2009-12-15
| | | | | | | | | | btrfs_duplicate_item duplicates item with new key, guaranteeing the source item and the new items are in the same tree leaf and contiguous. It allows us to split file extent in place, without using lock_extent to prevent bookend extent race. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: check size of inode backref before adding hardlinkYan, Zheng2009-09-24
| | | | | | | | | | | | | | | For every hardlink in btrfs, there is a corresponding inode back reference. All inode back references for hardlinks in a given directory are stored in single b-tree item. The size of b-tree item is limited by the size of b-tree leaf, so we can only create limited number of hardlinks to a given file in a directory. The original code lacks of the check, it oops if the number of hardlinks goes over the limit. This patch fixes the issue by adding check to btrfs_link and btrfs_rename. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Avoid delayed reference update loopingYan Zheng2009-07-24
| | | | | | | | | | | | btrfs_split_leaf and btrfs_del_items can end up in a loop where one is constantly spliting a given leaf and the other is constantly merging it back with the adjacent nodes. There is a better fix for this, but in the interest of something small, this patch just changes btrfs_del_items back to balancing less often. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix ordering of key field checks in btrfs_previous_itemYan Zheng2009-07-24
| | | | | | | | Check objectid of item before checking the item type, otherwise we may return zero for a key that is actually too low. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Remove code duplication in comp_keysDiego Calleja2009-07-24
| | | | | | | | comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just call it. Signed-off-by: Diego Calleja <diegocg@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix locking issue in btrfs_find_next_keyYan Zheng2009-07-22
| | | | | | | | | | | | | | | | | | When walking up the tree, btrfs_find_next_key assumes the upper level tree block is properly locked. This isn't always true even path->keep_locks is 1. This is because btrfs_find_next_key may advance path->slots[] several times instead of only once. When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found, we can't guarantee the original value of 'path->slots[level]' is 'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at 'level + 1' isn't locked. This patch fixes the issue by explicitly checking the locking state, re-searching the tree if it's not locked. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix double increment of path->slots[0] in btrfs_next_leafYan Zheng2009-07-22
| | | | | | | | | if 1 is returned by btrfs_search_slot, the path already points to the first item with 'key > searching key'. So increasing path->slots[0] by one is superfluous in that case. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: balance btree more oftenChris Mason2009-06-10
| | | | | | | | | With the new back reference code, the cost of a balance has gone down in terms of the number of back reference updates done. This commit makes us more aggressively balance leaves and nodes as they become less full. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: stop avoiding balancing at the end of the transaction.Chris Mason2009-06-10
| | | | | | | | | | | | | | | When the delayed reference code was added, some checks were added to avoid extra balancing while the delayed references were being flushed. This made for less efficient btrees, but it reduced the chances of loops where no forward progress was made because the balances made more delayed ref updates. With the new dead root removal code and the mixed back references, the extent allocation tree is no longer using precise back refs, and the delayed reference updates don't carry the risk of looping forever anymore. So, the balance avoidance is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)Yan Zheng2009-06-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't loop forever on metadata IO failuresChris Mason2009-05-14
| | | | | | | | | | | | | When a btrfs metadata read fails, the first thing we try to do is find a good copy on another mirror of the block. If this fails, read_tree_block() ends up returning a buffer that isn't up to date. The btrfs btree reading code was reworked to drop locks and repeat the search when IO was done, but the changes didn't add a check for failed reads. The end result was looping forever on buffers that were never going to become up to date. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: use the right node in reada_for_balanceChris Mason2009-04-20
| | | | | | | | | | | | | reada_for_balance was using the wrong index into the path node array, so it wasn't reading the right blocks. We never directly used the results of the read done by this function because the btree search is started over at the end. This fixes reada_for_balance to reada in the correct node and to avoid searching past the last slot in the node. It also makes sure to hold the parent lock while we are finding the nodes to read. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: BUG to BUG_ON changesStoyan Gaydarov2009-04-02
| | | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Optimize locking in btrfs_next_leaf()Chris Mason2009-04-03
| | | | | | | | | | btrfs_next_leaf was using blocking locks when it could have been using faster spinning ones instead. This adds a few extra checks around the pieces that block and switches over to spinning locks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: break up btrfs_search_slot into smaller piecesChris Mason2009-04-03
| | | | | | | | btrfs_search_slot was doing too many things at once. This breaks it up into more reasonable units. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: limit balancing work while flushing delayed refsChris Mason2009-03-24
| | | | | | | | | | | | | | | | | The delayed reference mechanism is responsible for all updates to the extent allocation trees, including those updates created while processing the delayed references. This commit tries to limit the amount of work that gets created during the final run of delayed refs before a commit. It avoids cowing new blocks unless it is required to finish the commit, and so it avoids new allocations that were not really required. The goal is to avoid infinite loops where we are always making more work on the final run of delayed refs. Over the long term we'll make a special log for the last delayed ref updates as well. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: leave btree locks spinning more oftenChris Mason2009-03-24
| | | | | | | | | | | | | | | | | | | | btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: reduce stack usage in some crucial tree balancing functionsChris Mason2009-03-24
| | | | | | | | | | | | Many of the tree balancing functions follow the same pattern. 1) cow a block 2) do something to the result This commit breaks them up into two functions so the variables and code required for part two don't suck down stack during part one. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: do extent allocation and reference count updates in the backgroundChris Mason2009-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: don't preallocate metadata blocks during btrfs_search_slotChris Mason2009-03-24
| | | | | | | | | | | | In order to avoid doing expensive extent management with tree locks held, btrfs_search_slot will preallocate tree blocks for use by COW without any tree locks held. A later commit moves all of the extent allocation work for COW into a delayed update mechanism, and this preallocation will no longer be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix spinlock assertions on UP systemsChris Mason2009-03-09
| | | | | | | | | | | btrfs_tree_locked was being used to make sure a given extent_buffer was properly locked in a few places. But, it wasn't correct for UP compiled kernels. This switches it to using assert_spin_locked instead, and renames it to btrfs_assert_tree_locked to better reflect how it was really being used. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: make a lockdep class for the extent buffer locksChris Mason2009-02-12
| | | | | | | | | | | | | | | | | | | | Btrfs is currently using spin_lock_nested with a nested value based on the tree depth of the block. But, this doesn't quite work because the max tree depth is bigger than what spin_lock_nested can deal with, and because locks are sometimes taken before the level field is filled in. The solution here is to use lockdep_set_class_and_name instead, and to set the class before unlocking the pages when the block is read from the disk and just after init of a freshly allocated tree block. btrfs_clear_path_blocking is also changed to take the locks in the proper order, and it also makes sure all the locks currently held are properly set to blocking before it tries to retake the spinlocks. Otherwise, lockdep gets upset about bad lock orderin. The lockdep magic cam from Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: remove btrfs_init_pathJeff Mahoney2009-02-12
| | | | | | | | | | | btrfs_init_path was initially used when the path objects were on the stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path isn't required. This patch removes it, and just uses kmem_cache_zalloc to zero out the object. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: balance_level checks !child after accessJeff Mahoney2009-02-12
| | | | | | | | The BUG_ON() is in the wrong spot. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: don't use spin_is_contendedChris Mason2009-02-09
| | | | | | | | | | | | | | | Btrfs was using spin_is_contended to see if it should drop locks before doing extent allocations during btrfs_search_slot. The idea was to avoid expensive searches in the tree unless the lock was actually contended. But, spin_is_contended is specific to the ticket spinlocks on x86, so this is causing compile errors everywhere else. In practice, the contention could easily appear some time after we started doing the extent allocation, and it makes more sense to always drop the lock instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Only prep for btree deletion balances when nodes are mostly emptyChris Mason2009-02-04
| | | | | | | | | | | | | | | Whenever an item deletion is done, we need to balance all the nodes in the tree to make sure we don't end up with an empty node if a pointer is deleted. This balance prep happens from the root of the tree down so we can drop our locks as we go. reada_for_balance was triggering read-ahead on neighboring nodes even when no balancing was required. This adds an extra check to avoid calling balance_level() and avoid reada_for_balance() when a balance won't be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix btrfs_unlock_up_safe to walk the entire pathChris Mason2009-02-04
| | | | | | | | | | btrfs_unlock_up_safe would break out at the first NULL node entry or unlocked node it found in the path. Some of the callers have missing nodes at the lower levels of the path, so this commit fixes things to check all the nodes in the path before returning. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: change btrfs_del_leaf to drop locks earlierChris Mason2009-02-04
| | | | | | | | | | | btrfs_del_leaf does two things. First it removes the pointer in the parent, and then it frees the block that has the leaf. It has the parent node locked for both operations. But, it only needs the parent locked while it is deleting the pointer. After that it can safely free the block without the parent locked. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Change btree locking to use explicit blocking pointsChris Mason2009-02-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: hash_lock is no longer neededChris Mason2009-02-04
| | | | | | | | | | | | | | | Before metadata is written to disk, it is updated to reflect that writeout has begun. Once this update is done, the block must be cow'd before it can be modified again. This update was originally synchronized by using a per-fs spinlock. Today the buffers for the metadata blocks are locked before writeout begins, and everyone that tests the flag has the buffer locked as well. So, the per-fs spinlock (called hash_lock for no good reason) is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: do less aggressive btree readaheadChris Mason2009-01-22
| | | | | | | | | | | | | | Just before reading a leaf, btrfs scans the node for blocks that are close by and reads them too. It tries to build up a large window of IO looking for blocks that are within a max distance from the top and bottom of the IO window. This patch changes things to just look for blocks within 64k of the target block. It will trigger less IO and make for lower latencies on the read size. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix checkpatch.pl warningsChris Mason2009-01-05
| | | | | | | There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: properly check free space for tree balancingYan Zheng2008-12-17
| | | | | | | | | | | | btrfs_insert_empty_items takes the space needed by the btrfs_item structure into account when calculating the required free space. So the tree balancing code shouldn't add sizeof(struct btrfs_item) to the size when checking the free space. This patch removes these superfluous additions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Fix compressed writes on truncated pagesChris Mason2008-12-15
| | | | | | | | | | | The compression code was using isize to limit the amount of data it sent through zlib. But, it wasn't properly limiting the looping to just the pages inside i_size. The end result was trying to compress too many pages, including those that had not been setup and properly locked down. This made the compression code oops while trying find_get_page on a page that didn't exist. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Delete csum items when freeing extentsChris Mason2008-12-10
| | | | | | | | | | | | | | | | This finishes off the new checksumming code by removing csum items for extents that are no longer in use. The trick is doing it without racing because a single csum item may hold csums for more than one extent. Extra checks are added to btrfs_csum_file_blocks to make sure that we are using the correct csum item after dropping locks. A new btrfs_split_item is added to split a single csum item so it can be split without dropping the leaf lock. This is used to remove csum bytes from the middle of an item. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Use map_private_extent_buffer during generic_bin_searchChris Mason2008-12-08
| | | | | | | | | | | | | It is possible that generic_bin_search will be called on a tree block that has not been locked. This happens because cache_block_block skips locking on the tree blocks. Since the tree block isn't locked, we aren't allowed to change the extent_buffer->map_token field. Using map_private_extent_buffer avoids any changes to the internal extent buffer fields. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: make things static and include the right headersChristoph Hellwig2008-12-02
| | | | | | | | Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: Christoph Hellwig <hch@lst.de>
* Btrfs: Some fixes for batching extent insert.Liu Hui2008-11-18
| | | | | | | | | | | | | | In insert_extents(), when ret==1 and last is not zero, it should check if the current inserted item is the last item in this batching inserts. If so, it should just break from loop. If not, 'cur = insert_list->next' will make no sense because the list is empty now, and 'op' will point to an unexpectable place. There are also some trivial fixs in this patch including one comment typo error and deleting two redundant lines. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Seed device supportYan Zheng2008-11-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Seed device is a special btrfs with SEEDING super flag set and can only be mounted in read-only mode. Seed devices allow people to create new btrfs on top of it. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) split code in btrfs_alloc_chunk into two parts. The first part does makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update device management code to handle seed device. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, btrfs kernel module will refuse to mount the FS. 3) make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: batch extent inserts/updates/deletions on the extent rootJosef Bacik2008-11-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While profiling the allocator I noticed a good amount of time was being spent in finish_current_insert and del_pending_extents, and as the filesystem filled up more and more time was being spent in those functions. This patch aims to try and reduce that problem. This happens two ways 1) track if we tried to delete an extent that we are going to update or insert. Once we get into finish_current_insert we discard any of the extents that were marked for deletion. This saves us from doing unnecessary work almost every time finish_current_insert runs. 2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for each individual extent and doing the needed operation, we instead keep the leaf around and see if there is anything else we can do on that leaf. On the insert case I introduced a btrfs_insert_some_items, which will take an array of keys with an array of data_sizes and try and squeeze in as many of those keys as possible, and then return how many keys it was able to insert. In the update case we search for an extent ref, update the ref and then loop through the leaf to see if any of the other refs we are looking to update are on that leaf, and then once we are done we release the path and search for the next ref we need to update. And finally for the deletion we try and delete the extent+ref in pairs, so we will try to find extent+ref pairs next to the extent we are trying to free and free them in bulk if possible. This along with the other cluster fix that Chris pushed out a bit ago helps make the allocator preform more uniformly as it fills up the disk. There is still a slight drop as we fill up the disk since we start having to stick new blocks in odd places which results in more COW's than on a empty fs, but the drop is not nearly as severe as it was before. Signed-off-by: Josef Bacik <jbacik@redhat.com>
* Btrfs: Improve metadata read latenciesChris Mason2008-11-13
| | | | | | | | | | | | | This fixes latency problems on metadata reads by making sure they don't go through the async submit queue, and by tuning down the amount of readahead done during btree searches. Also, the btrfs bdi congestion function is tuned to ignore the number of pending async bios and checksums pending. There is additional code that throttles new async bios now and the congestion function doesn't need to worry about it anymore. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: nuke fs wide allocation mutex V2Josef Bacik2008-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>