path: root/fs/btrfs/inode.c
Commit message (Author, Age)
...
* Btrfs: extent_map and data=ordered fixes for space balancing (Zheng Yan, 2008-09-26)
  * Add an EXTENT_BOUNDARY state bit to keep the writepage code from merging data extents that are in the process of being relocated. This allows us to do accounting for them properly.
  * The balancing code relocates data extents independently of the underlying inode. The extent_map code was modified to properly account for things moving around (invalidating extent_map caches in the inode).
  * Don't take the drop_mutex in the create_subvol ioctl. It isn't required.
  * Fix walking of the ordered extent list to avoid races with sys_unlink.
  * Change the lock ordering rules. Transaction start goes outside the drop_mutex. This allows btrfs_commit_transaction to directly drop the relocation trees.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Remove Btrfs compat code for older kernels (Chris Mason, 2008-09-25)
  Btrfs had compatibility code for kernels back to 2.6.18. These have been removed, and will be maintained in a separate backport git tree from now on.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix race against disk_i_size updates (Chris Mason, 2008-09-25)
  The code to update the on disk i_size happens before the ordered_extent record is removed. So, it is possible for multiple ordered_extent completion routines to run at the same time, and to find each other in the ordered tree. The end result is they both decide not to update disk_i_size, leaving it too small.
  This temporary fix just puts the updates inside the extent_mutex. A real solution would be stronger ordering of disk_i_size updates against removing the ordered extent from the tree.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Full back reference support (Zheng Yan, 2008-09-25)
  This patch makes the back reference system explicitly record the location of the parent node for all types of extents. The location of the parent node is placed into the offset field of the backref key. Every time a tree block is balanced, the back references for the affected lower level extents are updated.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: free space accounting redo (Josef Bacik, 2008-09-25)
  1) Replace the per fs_info extent_io_tree that tracked free space with two rb-trees per block group to track free space areas via offset and size. The reason to do this is that most allocations come with a hint byte where to start, so we can usually find a chunk of free space at that hint byte to satisfy the allocation and get good space packing. If we cannot find free space at or after the given offset we fall back on looking for a chunk of the given size as close to that given offset as possible. When we fall back on the size search we also try to find a slot as close to the size we want as possible, to avoid breaking small chunks off of huge areas if possible.
  2) Remove the extent_io_tree that tracked the block group cache from fs_info and replace it with an rb-tree that tracks the block group cache via offset. Also added a per space_info list that tracks the block group cache for the particular space so we can look up related block groups easily.
  3) Cleaned up the allocation code to make it a little easier to read and a little less complicated. Basically there are 3 steps: first look from our provided hint. If we couldn't find from that given hint, start back at our original search start and look for space from there. If that fails try to allocate space if we can and start looking again. If not we're screwed and need to start over again.
  4) Small fixes. There were some issues in volumes.c where we wouldn't allocate the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint, which has helped a good bit in making the fs_mark test I run have semi-normal results as we run out of space. Generally with data allocations we don't track where we last allocated from, so every time we did a data allocation we'd search through every block group that we have looking for free space. Although searching a block group with no free space isn't terribly time consuming, it was causing a slight degradation as we got more data block groups. The alloc_hint has fixed this slight degradation and made things semi-normal.
  There is still one nagging problem I'm working on where we will get ENOSPC when there is definitely plenty of space. This only happens with metadata allocations, and only when we are almost full. So you generally hit the 85% mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm still tracking it down, but until then this seems to be pretty stable and makes a significant performance gain.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
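The offset-then-size search described in item 1 of the free space accounting commit can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `struct free_span`, `find_free_span` and the helper names are hypothetical, and a linear scan over an array sorted by offset stands in for the two rb-trees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record; not the kernel's structure. */
struct free_span {
    uint64_t offset;
    uint64_t bytes;
};

/* First pass: first span at or after the hint that is large enough.
 * Assumes the array is sorted by offset. */
static const struct free_span *find_by_offset(const struct free_span *s,
                                              size_t n, uint64_t hint,
                                              uint64_t need)
{
    for (size_t i = 0; i < n; i++)
        if (s[i].offset >= hint && s[i].bytes >= need)
            return &s[i];
    return NULL;
}

static uint64_t distance(uint64_t a, uint64_t b)
{
    return a > b ? a - b : b - a;
}

/* Fallback: the smallest span that still fits, so small requests do not
 * carve pieces off huge areas; ties go to the span nearest the hint. */
static const struct free_span *find_by_size(const struct free_span *s,
                                            size_t n, uint64_t hint,
                                            uint64_t need)
{
    const struct free_span *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (s[i].bytes < need)
            continue;
        if (best == NULL || s[i].bytes < best->bytes ||
            (s[i].bytes == best->bytes &&
             distance(s[i].offset, hint) < distance(best->offset, hint)))
            best = &s[i];
    }
    return best;
}

/* Try the hint first, then fall back to the size search. */
const struct free_span *find_free_span(const struct free_span *s, size_t n,
                                       uint64_t hint, uint64_t need)
{
    assert(need > 0);
    const struct free_span *hit = find_by_offset(s, n, hint, need);
    return hit != NULL ? hit : find_by_size(s, n, hint, need);
}
```

A caller would pass the allocation hint and the requested size; a NULL result corresponds to the case where new space has to be allocated before searching again.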
* Btrfs: Dir fsync optimizations (Chris Mason, 2008-09-25)
  Drop i_mutex during the commit.
  Don't bother doing the fsync at all unless the dir is marked as dirtied and needing fsync in this transaction. For directories, this means that someone has unlinked a file from the dir without fsyncing the file.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix releasepage to properly keep dirty and writeback pages (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Update the highest objectid in a root after log replay is done (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* remove unused function btrfs_ilookup (Christoph Hellwig, 2008-09-25)
  btrfs_ilookup is unused, which is good because a normal filesystem should never have to use ilookup anyway. Remove it.
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a write ahead tree log to optimize synchronous operations (Chris Mason, 2008-09-25)
  File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: optimize btrget/set/removexattr (Christoph Hellwig, 2008-09-25)
  btrfs actually stores the whole xattr name, including the prefix, on disk, so using the generic resolver that strips off the prefix is not very helpful. Instead do the real on-disk xattrs manually and only use the generic resolver for synthetic xattrs like ACLs.
  (Sorry Josef for guiding you towards the wrong direction here initially)
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Optimise NFS readdir hack slightly; don't call readdir() again when done (David Woodhouse, 2008-09-25)
  Date: Sun, 17 Aug 2008 17:12:56 +0100
  Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Minor cleanup of btrfs_real_readdir() (David Woodhouse, 2008-09-25)
  Date: Sun, 17 Aug 2008 17:08:36 +0100
  Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Remove special cases for "." and ".." (David Woodhouse, 2008-09-25)
  Date: Sun, 17 Aug 2008 15:14:48 +0100
  We never get asked by the VFS to look up either of them, and we can handle the readdir() case a lot more simply, too.
  Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Implement our own copy of the nfsd readdir hack, for older kernels (David Woodhouse, 2008-09-25)
  Date: Wed, 6 Aug 2008 19:42:33 +0100
  Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Introduce btrfs_iget helper (Balaji Rao, 2008-09-25)
  Date: Mon, 21 Jul 2008 02:01:04 +0530
  This patch introduces a btrfs_iget helper to be used in NFS support.
  Signed-off-by: Balaji Rao <balajirrao@gmail.com>
  Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Lookup readpage checksums on bio submission again (Chris Mason, 2008-09-25)
  This optimization had been removed because I thought it was triggering csum errors. The real cause of the errors was elsewhere, and so this optimization is back.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix add_extent_mapping to check for duplicates across the whole range (Chris Mason, 2008-09-25)
  add_extent_mapping was allowing the insertion of overlapping extents. This never used to happen because it only inserted the extents from disk and those were never overlapping.
  But, with the data=ordered code, the disk and memory representations of the file are not the same. add_extent_mapping needs to ensure a new extent does not overlap before it inserts.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
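The overlap check that the add_extent_mapping commit calls for can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: `struct mapping`, `struct mapping_set` and `add_mapping` are hypothetical, a fixed array replaces the kernel's tree, and the -1 return value is not the kernel's error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPINGS 16

/* Hypothetical stand-in for an extent mapping: [start, start + len). */
struct mapping {
    uint64_t start;
    uint64_t len;
};

struct mapping_set {
    struct mapping items[MAX_MAPPINGS];
    size_t count;
};

/* Two half-open ranges overlap when each starts before the other ends. */
static bool ranges_overlap(const struct mapping *a, const struct mapping *b)
{
    return a->start < b->start + b->len && b->start < a->start + a->len;
}

/* Insert only if the whole new range is free of existing mappings.
 * Returns 0 on success, -1 on overlap or when the set is full. */
int add_mapping(struct mapping_set *set, uint64_t start, uint64_t len)
{
    struct mapping new_map = { start, len };

    assert(len > 0);
    for (size_t i = 0; i < set->count; i++)
        if (ranges_overlap(&set->items[i], &new_map))
            return -1;
    if (set->count == MAX_MAPPINGS)
        return -1;
    set->items[set->count++] = new_map;
    return 0;
}
```

The point of the sketch is that the check covers the entire new range, not just its start offset, so a mapping that begins in a hole but runs into an existing extent is still rejected.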
* Btrfs: Lower contention on the csum mutex (Chris Mason, 2008-09-25)
  This takes the csum mutex deeper in the call chain and releases it more often.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Init address_space->writeback_index properly (Chris Mason, 2008-09-25)
  The writeback_index field is used by write_cache_pages to pick up where writeback on a given inode left off. But, it is never set to a sane value, so writeback can often start at a random offset in the file.
  Kernels 2.6.28 and higher will have this fixed, but for everyone else, we also fill in the value in btrfs.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Avoid calling into the FS for the final iput on fake root inodes (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix nodatacow for the new data=ordered mode (Yan Zheng, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Get rid of BTRFS_I(inode)->index and use local vars instead (Chris Mason, 2008-09-25)
  rename and link don't always have a lock on the source inode, and our use of a per-inode index variable was racy. This changes things to store the index in a local variable instead.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs_lookup_bio_sums seems broken, go back to the readpage_io_hook for now (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Hold csum mutex while reading in sums during readpages (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Drop some debugging around the extent_map pinned flag (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix streaming read performance with checksumming on (Chris Mason, 2008-09-25)
  Large streaming reads make for large bios, which means each entry on the async work queue list represents a large amount of data. IO congestion throttling on the device was kicking in before the async worker threads decided a single thread was busy and needed some help.
  The end result was that a streaming read would result in a single CPU running at 100% instead of balancing the work off to other CPUs.
  This patch also changes the pre-IO checksum lookup done by reads to work on a per-bio basis instead of per-page. Per-page lookups result in many extra btree lookups on large streaming reads. Doing the checksum lookup right before bio submit allows us to reuse searches while processing adjacent offsets.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add compatibility for kernels >= 2.6.27-rc1 (Sven Wegener, 2008-09-25)
  Add a couple of #if's to follow API changes.
  Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: implement memory reclaim for leaf reference cache (Yan, 2008-09-25)
  The memory reclaiming issue happens when a snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount.
  The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After an old snapshot is completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
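The reclaim rule from the leaf reference cache commit (drop every cache entry created before the oldest remaining dead root) can be sketched as follows. This is a rough userspace illustration, assuming a flat array for the cache; `struct leaf_ref` here is a hypothetical simplification with only two fields, and `prune_leaf_refs` is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cache entry: disk location plus create time. */
struct leaf_ref {
    uint64_t bytenr;
    uint64_t create_time;
};

/* Compact the cache in place, keeping only entries created at or after the
 * create time of the oldest dead root. Returns the new entry count. */
size_t prune_leaf_refs(struct leaf_ref *refs, size_t n,
                       uint64_t oldest_dead_root_time)
{
    size_t kept = 0;

    assert(refs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (refs[i].create_time >= oldest_dead_root_time)
            refs[kept++] = refs[i];
    return kept;
}
```

Because the dead roots are linked in create-time order, the oldest one is simply the head of the list, which is why the sketch takes a single timestamp rather than the whole list.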
* Btrfs: Update and fix mount -o nodatacow (Yan Zheng, 2008-09-25)
  To check whether a given file extent is referenced by multiple snapshots, the checker walks down the fs tree through the dead root and checks all tree blocks in the path.
  We can easily detect whether a given tree block is directly referenced by another snapshot. We can also detect any indirect reference from another snapshot by checking the reference's generation. The checker can always detect multiple references, but can't reliably detect cases of a single reference. So btrfs may do file data cow even if there is only one reference.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Throttle operations if the reference cache gets too large (Chris Mason, 2008-09-25)
  A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up.
  Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Leaf reference cache update (Chris Mason, 2008-09-25)
  This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside.
  This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping.
  Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix .. lookup corner case (Yan, 2008-09-25)
  Inode ref item can be in the next leaf when we find "path->slots[0] == btrfs_header_nritems(...)".
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
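The corner case in the ".." lookup commit is the usual one for leaf-based trees: a search can leave the slot index equal to the number of items in the leaf, in which case the wanted item is the first one in the next leaf. A minimal userspace sketch, assuming a hypothetical linked list of leaves (`struct tree_leaf`, `struct tree_path` and `path_item_key` are illustrative names, not the btrfs structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEAF_CAPACITY 4

/* Hypothetical leaf: a few keys plus a link to the next leaf. */
struct tree_leaf {
    size_t nritems;
    uint64_t keys[LEAF_CAPACITY];
    struct tree_leaf *next;
};

/* Hypothetical search result: a leaf and a slot within it. */
struct tree_path {
    struct tree_leaf *leaf;
    size_t slot;
};

/* Read the key the path points at. When the slot is one past the last item,
 * step to the first item of the next leaf instead of reading past the end. */
bool path_item_key(struct tree_path *p, uint64_t *key)
{
    assert(p != NULL && p->leaf != NULL && key != NULL);
    if (p->slot >= p->leaf->nritems) {
        if (p->leaf->next == NULL || p->leaf->next->nritems == 0)
            return false;
        p->leaf = p->leaf->next;
        p->slot = 0;
    }
    *key = p->leaf->keys[p->slot];
    return true;
}
```

Without the slot check, the caller would read whatever follows the last item of the first leaf and conclude that the inode ref does not exist.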
* Btrfs: Remove unused variable in fixup_tree_root_location (Balaji Rao, 2008-09-25)
  Remove an unused variable 'path' in fixup_tree_root_location.
  Signed-off-by: Balaji Rao <balajirrao@gmail.com>
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Create orphan inode records to prevent lost files after a crash (Josef Bacik, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add ACL support (Josef Bacik, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Implement new dir index format (Josef Bacik, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Search data ordered extents first for checksums on read (Chris Mason, 2008-09-25)
  Checksum items are not inserted into the tree until all of the io from a given extent is complete. This means one dirty page from an extent may be written, freed, and then read again before the entire extent is on disk and the checksum item is inserted.
  The checksums themselves are stored in the ordered extent so they can be inserted in bulk when IO is complete. On read, if a checksum item isn't found, the ordered extents were being searched for a checksum record.
  This all worked most of the time, but the checksum insertion code tries to reduce the number of tree operations by pre-inserting checksum items based on i_size and a few other factors. This means the read code might find a checksum item that hasn't yet really been filled in.
  This commit changes things to check the ordered extents first and only dive into the btree if nothing was found. This removes the need for extra locking and is more reliable.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
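The lookup order introduced by the checksum search commit can be shown with a short sketch: consult the pending ordered extents first, and only fall through to the btree when they have no record. The code below is an assumption-laden illustration in userspace C; `struct csum_entry`, `find_read_csum` and the array-backed tables are hypothetical stand-ins for the ordered extent records and the checksum btree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checksum record keyed by file offset. */
struct csum_entry {
    uint64_t offset;
    uint32_t csum;
};

static bool lookup(const struct csum_entry *tab, size_t n, uint64_t offset,
                   uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].offset == offset) {
            *out = tab[i].csum;
            return true;
        }
    }
    return false;
}

/* Ordered extents win: a pre-inserted but not yet filled btree item for the
 * same offset is never consulted while the extent is still pending. */
bool find_read_csum(const struct csum_entry *ordered, size_t n_ordered,
                    const struct csum_entry *btree, size_t n_btree,
                    uint64_t offset, uint32_t *out)
{
    assert(out != NULL);
    if (lookup(ordered, n_ordered, offset, out))
        return true;
    return lookup(btree, n_btree, offset, out);
}
```

With the order reversed, a read of an offset whose btree item was pre-inserted would return the placeholder value and report a spurious checksum failure.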
* Btrfs: Take the csum mutex while reading checksums (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix some data=ordered related data corruptions (Chris Mason, 2008-09-25)
  Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first.
  But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage.
  For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore.
  This commit also replaces do_sync_mapping_range with a local copy of the related functions.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Index extent buffers in an rbtree (Chris Mason, 2008-09-25)
  Before, extent buffers were a temporary object, meant to map a number of pages at once and collect operations on them.
  But, a few extra fields have crept in, and they are also the best place to store a per-tree block lock field as well. This commit puts the extent buffers into an rbtree, and ensures a single extent buffer for each tree block.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Data ordered fixes (Chris Mason, 2008-09-25)
  * In btrfs_delete_inode, wait for ordered extents after calling truncate_inode_pages. This is much faster, and more correct.
  * Properly clear out the PageChecked bit everywhere we redirty the page.
  * Change the writepage fixup handler to lock the page range and check to see if an ordered extent had been inserted since the improperly dirtied page was discovered.
  * Wait for ordered extents outside the transaction. This isn't required for locking rules but does improve transaction latencies.
  * Reduce contention on the alloc_mutex by dropping it while incrementing refs on a node/leaf and while dropping refs on a leaf.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Fix btrfs_wait_ordered_extent_range to properly wait (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Keep extent mappings in ram until pending ordered extents are done (Chris Mason, 2008-09-25)
  It was possible for stale mappings from disk to be used instead of the new pending ordered extent. This adds a flag to the extent map struct to keep it pinned until the pending ordered extent is actually on disk.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is set (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Handle data checksumming on bios that span multiple ordered extents (Chris Mason, 2008-09-25)
  Data checksumming is done right before the bio is sent down the IO stack, which means a single bio might span more than one ordered extent. In this case, the checksumming data is split between two ordered extents.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
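Splitting a bio's checksums between ordered extents amounts to walking the bio one sector at a time and attaching each checksum to whichever extent covers that file offset. The sketch below only counts checksums per extent to keep it short; `struct ordered_range`, `split_bio_sums` and the explicit `sectorsize` parameter are assumptions for illustration, not the kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ordered extent: a file range plus a checksum counter. */
struct ordered_range {
    uint64_t file_offset;
    uint64_t len;
    size_t nr_sums;
};

/* Find the ordered range that covers a file offset, if any. */
static struct ordered_range *range_for(struct ordered_range *r, size_t n,
                                       uint64_t off)
{
    for (size_t i = 0; i < n; i++)
        if (off >= r[i].file_offset && off < r[i].file_offset + r[i].len)
            return &r[i];
    return NULL;
}

/* Walk the bio sector by sector and credit each checksum to the range that
 * owns the sector. Returns how many checksums were attached. */
size_t split_bio_sums(struct ordered_range *ranges, size_t n,
                      uint64_t bio_start, uint64_t bio_len,
                      uint64_t sectorsize)
{
    size_t attached = 0;

    assert(sectorsize > 0);
    for (uint64_t off = bio_start; off < bio_start + bio_len;
         off += sectorsize) {
        struct ordered_range *r = range_for(ranges, n, off);

        if (r == NULL)
            continue;
        r->nr_sums++;
        attached++;
    }
    return attached;
}
```

For a bio that starts in the second half of one 8 KiB extent and ends in the first half of the next, each extent receives one checksum when the sector size is 4 KiB.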
* Btrfs: Cleanup and comment ordered-data.c (Chris Mason, 2008-09-25)
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Add a per-inode lock around btrfs_drop_extents (Chris Mason, 2008-09-25)
  btrfs_drop_extents is always called with a range lock held on the inode. But, it may operate on extents outside that range as it drops and splits them.
  This patch adds a per-inode mutex that is held while calling btrfs_drop_extents and while inserting new extents into the tree. It prevents races from two procs working against adjacent ranges in the tree.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't pin pages in ram until the entire ordered extent is on disk. (Chris Mason, 2008-09-25)
  Checksum items are not inserted until the entire ordered extent is on disk, but individual pages might be clean and available for reclaim long before the whole extent is on disk.
  In order to allow those pages to be freed, we need to be able to search the list of ordered extents to find the checksum that is going to be inserted in the tree. This way if the page needs to be read back in before the checksums are in the btree, we'll be able to verify the checksum on the page.
  This commit adds the ability to search the pending ordered extents for a given offset in the file, and changes btrfs_releasepage to allow ordered pages to be freed.
  Signed-off-by: Chris Mason <chris.mason@oracle.com>