aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* Btrfs: always pin metadata in discard modeChris Mason2009-10-14
| | | | | | | | | | | | | We have an optimization in btrfs to allow blocks to be immediately freed if they were allocated in this transaction and never written. Otherwise they are pinned and freed when the transaction commits. This isn't optimal for discard mode because immediately freeing them means immediately discarding them. It is better to give the block to the pinning code and letting the (slow) discard happen later. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: enable discard supportChristoph Hellwig2009-10-14
| | | | | | | | | The discard support code in btrfs currently is guarded by ifdefs for BIO_RW_DISCARD, which is never defines as it's the name of an enum memeber. Just remove the useless ifdefs to actually enable the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add -o discard optionChristoph Hellwig2009-10-14
| | | | | | | | | | Enable discard by default is not a good idea given the the trim speed of SSD prototypes we've seen, and the carecteristics for many high-end arrays. Turn of discards by default and require the -o discard option to enable them on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: properly wait log writers during log syncYan, Zheng2009-10-14
| | | | | | | | | A recently fsync optimization make btrfs_sync_log skip calling wait_for_writer in the single log writer case. This is incorrect since the writer count can also be increased by btrfs_pin_log. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix possible ENOSPC problems with truncateJosef Bacik2009-10-14
| | | | | | | | | | There's a problem where we don't do any space reservation for truncates, which can cause you to OOPs because you will be allowed to go off in the weeds a bit since we don't account for the delalloc bytes that are created as a result of the truncate. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix btrfs acl #ifdef checksChris Mason2009-10-13
| | | | | | | | The btrfs acl code was #ifdefing for a define that didn't exist. This correctly matches it to the values used by the Kconfig file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: streamline tree-log btree block writeoutChris Mason2009-10-13
| | | | | | | | | | | | | | | | | | | | | | | | Syncing the tree log is a 3 phase operation. 1) write and wait for all the tree log blocks for a given root. 2) write and wait for all the tree log blocks for the tree of tree log roots. 3) write and wait for the super blocks (barriers here) This isn't as efficient as it could be because there is no requirement to wait for the blocks from step one to hit the disk before we start writing the blocks from step two. This commit changes the sequence so that we don't start waiting until all the tree blocks from both steps one and two have been sent to disk. We do this by breaking up btrfs_write_wait_marked_extents into two functions, which is trivial because it was already broken up into two parts. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: avoid tree log commit when there are no changesChris Mason2009-10-13
| | | | | | | | | | | | | | | | | | rpm has a habit of running fdatasync when the file hasn't changed. We already detect if a file hasn't been changed in the current transaction but it might have been sent to the tree-log in this transaction and not changed since the last call to fsync. In this case, we want to avoid a tree log sync, which includes a number of synchronous writes and barriers. This commit extends the existing tracking of the last transaction to change a file to also track the last sub-transaction. The end result is that rpm -ivh and -Uvh are roughly twice as fast, and on par with ext3. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: only write one super copy during fsyncChris Mason2009-10-13
| | | | | | | | | | | | During a tree-log commit for fsync, we've been writing at least two copies of the super block and forcing them to disk. The other filesystems write only one, and this change brings us on par with them. A full transaction commit will write all the super copies, so we still have redundant info written on a regular basis. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix file clone ioctl for bookend extentsChris Mason2009-10-09
| | | | | | | | | | | | | | | | | | | | | | The file clone ioctl was incorrectly taking the offset into the extent on disk into account when calculating the length of the cloned extent. The length never changes based on the offset into the physical extent. Test case: fallocate -l 1g image mke2fs image bcp image image2 e2fsck -f image2 (errors on image2) The math bug ends up wrapping the length of the extent, and things go wrong from there. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix uninit compiler warning in cow_file_range_nocowChris Mason2009-10-09
| | | | | | | | The extent_type variable was exposed uninit via a goto. It should be impossible to trigger because it is protected by a check on another variable, but this makes sure. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: constify dentry_operationsAlexey Dobriyan2009-10-09
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: optimize back reference update during btrfs_drop_snapshotYan, Zheng2009-10-09
| | | | | | | This patch reading level 0 tree blocks that already use full backrefs. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: remove negative dentry when deleting subvolumneYan, Zheng2009-10-09
| | | | | | | | | | The use of btrfs_dentry_delete is removing dentries from the dcache when deleting subvolumne. btrfs_dentry_delete ignores negative dentries. This is incorrect since if we don't remove the negative dentry, its parent dentry can't be removed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: optimize fsync for the single writer caseJosef Bacik2009-10-08
| | | | | | | | | | | This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie for new people to join the logging transaction if there is only ever 1 writer. This helps a little bit with latency where we have something like RPM where it will fdatasync every file it writes, and so waiting the 1 jiffie for every fdatasync really starts to add up. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: async delalloc flushing under space pressureJosef Bacik2009-10-08
| | | | | | | | | | | | This patch moves the delalloc flushing that occurs when we are under space pressure off to a async thread pool. This helps since we only free up metadata space when we actually insert the extent item, which means it takes quite a while for space to be free'ed up if we wait on all ordered extents. However, if space is freed up due to inline extents being inserted, we can wake people who are waiting up early, and they can finish their work. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: release delalloc reservations on extent item insertionJosef Bacik2009-10-08
| | | | | | | | | | | | | | | | | | | | | | | | This patch fixes an issue with the delalloc metadata space reservation code. The problem is we used to free the reservation as soon as we allocated the delalloc region. The problem with this is if we are not inserting an inline extent, we don't actually insert the extent item until after the ordered extent is written out. This patch does 3 things, 1) It moves the reservation clearing stuff into the ordered code, so when we remove the ordered extent we remove the reservation. 2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear delalloc bits in the cases where we want to clear the metadata reservation when we clear the delalloc extent, in the case that we do an inline extent or we invalidate the page. 3) It adds another waitqueue to the space info so that when we start a fs wide delalloc flush, anybody else who also hits that area will simply wait for the flush to finish and then try to make their allocation. This has been tested thoroughly to make sure we did not regress on performance. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: delay clearing EXTENT_DELALLOC for compressed extentsChris Mason2009-10-08
| | | | | | | | | | | | | | | | When compression is on, the cow_file_range code is farmed off to worker threads. This allows us to do significant CPU work in parallel on SMP machines. But it is a delicate balance around when we clear flags and how. In the past we cleared the delalloc flag immediately, which was safe because the pages stayed locked. But this is causing problems with the newest ENOSPC code, and with the recent extent state cleanups we can now clear the delalloc bit at the same time the uncompressed code does. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: cleanup extent_clear_unlock_delalloc flagsChris Mason2009-10-08
| | | | | | | | | extent_clear_unlock_delalloc has a growing set of ugly parameters that is very difficult to read and maintain. This switches to a flag field and well named flag defines. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix possible softlockup in the allocatorJosef Bacik2009-10-06
| | | | | | | | | | | | | | | | | | | | | | | Like the cluster allocating stuff, we can lockup the box with the normal allocation path. This happens when we 1) Start to cache a block group that is severely fragmented, but has a decent amount of free space. 2) Start to commit a transaction 3) Have the commit try and empty out some of the delalloc inodes with extents that are relatively large. The inodes will not be able to make the allocations because they will ask for allocations larger than a contiguous area in the free space cache. So we will wait for more progress to be made on the block group, but since we're in a commit the caching kthread won't make any more progress and it already has enough free space that wait_block_group_cache_progress will just return. So, if we wait and fail to make the allocation the next time around, just loop and go to the next block group. This keeps us from getting stuck in a softlockup. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix deadlock on async thread startupChris Mason2009-10-05
| | | | | | | | | | | | | | | | | | | | | | | The btrfs async worker threads are used for a wide variety of things, including processing bio end_io functions. This means that when the endio threads aren't running, the rest of the FS isn't able to do the final processing required to clear PageWriteback. The endio threads also try to exit as they become idle and start more as the work piles up. The problem is that starting more threads means kthreadd may need to allocate ram, and that allocation may wait until the global number of writeback pages on the system is below a certain limit. The result of that throttling is that end IO threads wait on kthreadd, who is waiting on IO to end, which will never happen. This commit fixes the deadlock by handing off thread startup to a dedicated thread. It also fixes a bug where the on-demand thread creation was creating far too many threads because it didn't take into account threads being started by other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix data space leak fixJosef Bacik2009-10-01
| | | | | | | | | | | | There is a problem where page_mkwrite can be called on a dirtied page that already has a delalloc range associated with it. The fix is to clear any delalloc bits for the range we are dirtying so the space accounting gets handled properly. This is the same thing we do in the normal write case, so we are consistent across the board. With this patch we no longer leak reserved space. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: take i_mutex before generic_write_checksChris Mason2009-10-01
| | | | | | | | | | btrfs_file_write was incorrectly calling generic_write_checks without taking i_mutex. This lead to problems with racing around i_size when doing O_APPEND writes. The fix here is to move i_mutex higher. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix arguments to btrfs_wait_on_page_writeback_rangeChristoph Hellwig2009-10-01
| | | | | | | | | wait_on_page_writeback_range/btrfs_wait_on_page_writeback_range takes a pagecache offset, not a byte offset into the file. Shift the arguments around to wait for the correct range Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix deadlock with free space handling and user transactionsSage Weil2009-09-29
| | | | | | | | | | | | If an ioctl-initiated transaction is open, we can't force a commit during the free space checks in order to free up pinned extents or else we deadlock. Just ENOSPC instead. A more satisfying solution that reserves space for the entire user transaction up front is forthcoming... Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix error cases for ioctl transactionsSage Weil2009-09-29
| | | | | | | | Fix leak of vfsmount write reference and open_ioctl_trans reference on ENOMEM. Clean up the error paths while we're at it. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL codeChris Ball2009-09-29
| | | | | | | | | We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're currently not using it and are testing CONFIG_FS_POSIX_ACL instead. CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs". Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: introduce missing kfreeJulia Lawall2009-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Error handling code following a kzalloc should free the allocated data. The semantic match that finds the problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; statement S; expression E; identifier f,f1,l; position p1,p2; expression *ptr != NULL; @@ x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...); ... if (x == NULL) S <... when != x when != if (...) { <+...x...+> } ( x->f1 = E | (x->f1 == NULL || ...) | f(...,x->f1,...) ) ...> ( return \(0\|<+...x...+>\|ptr\); | return@p2 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; @@ print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix setting umask when POSIX ACLs are not enabledChris Ball2009-09-29
| | | | | | | | | | We currently set sb->s_flags |= MS_POSIXACL unconditionally, which is incorrect -- it tells the VFS that it shouldn't set umask because we will, yet we don't set it ourselves if we aren't using POSIX ACLs, so the umask ends up ignored. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: proper -ENOSPC handlingJosef Bacik2009-09-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: hash the btree inode during fill_superYan Zheng2009-09-24
| | | | | | | The snapshot deletion patches dropped this line, but the inode needs to be hashed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: relocate file extents in clustersYan, Zheng2009-09-24
| | | | | | | | | | | The extent relocation code copy file extents one by one when relocating data block group. This is inefficient if file extents are small. This patch makes the relocation code copy file extents in clusters. So we can can make better use of read-ahead. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: don't rename file into dummy directoryYan, Zheng2009-09-24
| | | | | | | | | | | A recent change enforces only one access point to each subvolume. The first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point, all other subvolume links are linked to dummy empty directories. The dummy directories are temporary inodes that only in memory, so we can not rename file into them. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: check size of inode backref before adding hardlinkYan, Zheng2009-09-24
| | | | | | | | | | | | | | | For every hardlink in btrfs, there is a corresponding inode back reference. All inode back references for hardlinks in a given directory are stored in single b-tree item. The size of b-tree item is limited by the size of b-tree leaf, so we can only create limited number of hardlinks to a given file in a directory. The original code lacks of the check, it oops if the number of hardlinks goes over the limit. This patch fixes the issue by adding check to btrfs_link and btrfs_rename. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix releasepage to avoid unlocking extents we haven't lockedChris Mason2009-09-23
| | | | | | | | | | | | | | | | | During releasepage, we try to drop any extent_state structs for the bye offsets of the page we're releaseing. But the code was incorrectly telling clear_extent_bit to delete the state struct unconditionallly. Normally this would be fine because we have the page locked, but other parts of btrfs will lock down an entire extent, the most common place being IO completion. releasepage was deleting the extent state without first locking the extent, which may result in removing a state struct that another process had locked down. The fix here is to leave the NODATASUM and EXTENT_LOCKED bits alone in releasepage. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix test_range_bit for whole file extentsChris Mason2009-09-23
| | | | | | | | | | If test_range_bit finds an extent that goes all the way to (u64)-1, it can incorrectly wrap the u64 instead of treaing it like the end of the address space. This just adds a check for the highest possible offset so we don't wrap. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix errors handling cached state in set/clear_extent_bitChris Mason2009-09-23
| | | | | | | | | | | | | | | | | | | | Both set and clear_extent_bit allow passing a cached state struct to reduce rbtree search times. clear_extent_bit was improperly bypassing some of the checks around making sure the extent state fields were correct for a given operation. The fix used here (from Yan Zheng) is to use the hit_next goto target instead of jumping all the way down to start clearing bits without making sure the cached state was exactly correct for the operation we were doing. This also fixes up the setting of the start variable for both ops in the case where we find an overlapping extent that begins before the range we want to change. In both cases we were incorrectly going backwards from the original requested change. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix early enospc during balancingChris Mason2009-09-22
| | | | | | | | | | | | | | | | | | We now do extra checks before a balance to make sure there is room for the balance to take place. One of the checks was testing to see if we were trying to balance away the last block group of a given type. If there is no space available for new chunks, we should not try and balance away the last block group of a give type. But, the code wasn't checking for available chunk space, and so it was exiting too soon. The fix here is to combine some of the checks and make sure we try to allocate new chunks when we're balancing the last block group. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: deal with NULL space infoChris Mason2009-09-22
| | | | | | | | | After a balance it is briefly possible for the space info field in the inode to be NULL. This adds some checks to make sure things properly deal with the NULL value. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: account for space used by the super mirrorsJosef Bacik2009-09-21
| | | | | | | | | | | | As we get closer to proper -ENOSPC handling in btrfs, we need more accurate space accounting for the space info's. Currently we exclude the free space for the super mirrors, but the space they take up isn't accounted for in any of the counters. This patch introduces bytes_super, which keeps track of the amount of bytes used for a super mirror in the block group cache and space info. This makes sure that our free space caclucations will be completely accurate. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix extent entry threshold calculationJosef Bacik2009-09-21
| | | | | | | | | | | | | | | There is a slight problem with the extent entry threshold calculation for the free space cache. We only adjust the threshold down as we add bitmaps, but never actually adjust the threshold up as we add bitmaps. This means we could fragment the free space so badly that we end up using all bitmaps to describe the free space, use all the free space which would result in the bitmaps being freed, but then go to add free space again as we delete things and immediately add bitmaps since the extent threshold would still be 0. Now as we free bitmaps the extent threshold will be ratcheted up to allow more extent entries to be added. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: remove dead codeJosef Bacik2009-09-21
| | | | | | | | This patch removes a bunch of dead code from the snapshot removal stuff. It was confusing me when doing the metadata ENOSPC stuff so I killed it. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix bitmap size trackingJosef Bacik2009-09-21
| | | | | | | | | | | | | When we first go to add free space, we allocate a new info and set the offset and bytes to the space we are adding. This is fine, except we actually set the size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd end up counting the same space twice. This isn't a huge deal, it just makes the allocator behave weirdly since it will think that a bitmap entry has more space than it ends up actually having. I used a BUG_ON() to catch when this problem happened, and with this patch I no longer get the BUG_ON(). Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: don't keep retrying a block group if we fail to allocate a clusterJosef Bacik2009-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | The box can get locked up in the allocator if we happen upon a block group under these conditions: 1) During a commit, so caching threads cannot make progress 2) Our block group currently is in the middle of being cached 3) Our block group currently has plenty of free space in it 4) Our block group is so fragmented that it ends up having no free space chunks larger than min_bytes calculated by btrfs_find_space_cluster. What happens is we try and do btrfs_find_space_cluster, which fails because it is unable to find enough free space chunks that are large than min_bytes and are close enough together. Since the block group is not cached we do a wait_block_group_cache_progress, which waits for the number of bytes we need, except the block group already has _plenty_ of free space, its just severely fragmented, so we loop and try again, ad infinitum. This patch keeps us from waiting on the block group to finish caching if we failed to find a free space cluster before. It also makes sure that we don't even try to find a free space cluster if we are on our last loop in the allocator, since we will have tried everything at this point at it is futile. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: make balance code choose more wisely when relocatingJosef Bacik2009-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we can panic the box if the first block group we go to move is of a type where there is no space left to move those extents. For example, if we fill the disk up with data, and then we try to balance and we have no room to move the data nor room to allocate new chunks, we will panic. Change this by checking to see if we have room to move this chunk around, and if not, return -ENOSPC and move on to the next chunk. This will make sure we remove block groups that are moveable, like if we have alot of empty metadata block groups, and then that way we make room to be able to balance our data chunks as well. Tested this with an fs that would panic on btrfs-vol -b normally, but no longer panics with this patch. V1->V2: -actually search for a free extent on the device to make sure we can allocate a chunk if need be. -fix btrfs_shrink_device to make sure we actually try to relocate all the chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r we don't remove the device with data still on it. -check to make sure the block group we are going to relocate isn't the last one in that particular space -fix a bug in btrfs_shrink_device where we would change the device's size and not fix it if we fail to do our relocate Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix arithmetic error in clone ioctlSage Weil2009-09-21
| | | | | | | | Fix an arithmetic error that was breaking extents cloned via the clone ioctl starting in the second half of a file. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add snapshot/subvolume destroy ioctlYan, Zheng2009-09-21
| | | | | | | | This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being used and doesn't contains links to other subvolumes can be destroyed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: change how subvolumes are organizedYan, Zheng2009-09-21
| | | | | | | | | | | | | | | | | | | | | btrfs allows subvolumes and snapshots anywhere in the directory tree. If we snapshot a subvolume that contains a link to other subvolume called subvolA, subvolA can be accessed through both the original subvolume and the snapshot. This is similar to creating hard link to directory, and has the very similar problems. The aim of this patch is enforcing there is only one access point to each subvolume. Only the first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point. The first directory entry is distinguished by checking root forward reference. If the corresponding root forward reference is missing, we know the entry is not the first one. This patch also adds snapshot/subvolume rename support, the code allows rename subvolume link across subvolumes. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: do not reuse objectid of deleted snapshot/subvolYan, Zheng2009-09-21
| | | | | | | | | | | | | The new back reference format does not allow reusing objectid of deleted snapshot/subvol. So we use ++highest_objectid to allocate objectid for new snapshot/subvol. Now we use ++highest_objectid to allocate objectid for both new inode and new snapshot/subvolume, so this patch removes 'find hole' code in btrfs_find_free_objectid. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: speed up snapshot droppingYan, Zheng2009-09-21
| | | | | | | | | | | | | | | | | This patch contains two changes to avoid unnecessary tree block reads during snapshot dropping. First, check tree block's reference count and flags before reading the tree block. if reference count > 1 and there is no need to update backrefs, we can avoid reading the tree block. Second, save when snapshot was created in root_key.offset. we can compare block pointer's generation with snapshot's creation generation during updating backrefs. If a given block was created before snapshot was created, the snapshot can't be the tree block's owner. So we can avoid reading the block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>