aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAge
...
| | * Btrfs: fix worker thread double spin_lock_irqChris Mason2009-09-15
| | | | | | | | | | | | | | | | | | | | | The exit-on-idle code for async worker threads was incorrectly calling spin_lock_irq with interrupts already off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | * Btrfs: fix async worker startup raceChris Mason2009-09-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a new worker thread starts, it is placed into the list of idle threads. But, this may race with a check for idle done by the worker thread itself, resulting in a double list_add operation. This fix adds a check to make sure the idle thread addition is done properly. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | * Merge branch 'master' of ↵Chris Mason2009-09-11
| | |\ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
| | | * Btrfs: zero page past end of inline file itemsChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When btrfs_get_extent is reading inline file items for readpage, it needs to copy the inline extent into the page. If the inline extent doesn't cover all of the page, that means there is a hole in the file, or that our file is smaller than one page. readpage does zeroing for the case where the file is smaller than one page, but nobody is currently zeroing for the case where there is a hole after the inline item. This commit changes btrfs_get_extent to zero fill the page past the end of the inline item. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: fix btrfs page_mkwrite to return locked pageChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This closes a whole where the page may be written before the page_mkwrite caller has a chance to dirty it (thanks to Nick Piggin) Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: Fix extent replacment raceChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Data COW means that whenever we write to a file, we replace any old extent pointers with new ones. There was a window where a readpage might find the old extent pointers on disk and cache them in the extent_map tree in ram in the middle of a given write replacing them. Even though both the readpage and the write had their respective bytes in the file locked, the extent readpage inserts may cover more bytes than it had locked down. This commit closes the race by keeping the new extent pinned in the extent map tree until after the on-disk btree is properly setup with the new extent pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: Use PagePrivate2 to track pages in the data=ordered code.Chris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs writes go through delalloc to the data=ordered code. This makes sure that all of the data is on disk before the metadata that references it. The tracking means that we have to make sure each page in an extent is fully written before we add that extent into the on-disk btree. This was done in the past by setting the EXTENT_ORDERED bit for the range of an extent when it was added to the data=ordered code, and then clearing the EXTENT_ORDERED bit in the extent state tree as each page finished IO. One of the reasons we had to do this was because sometimes pages are magically dirtied without page_mkwrite being called. The EXTENT_ORDERED bit is checked at writepage time, and if it isn't there, our page become dirty without going through the proper path. These bit operations make for a number of rbtree searches for each page, and can cause considerable lock contention. This commit switches from the EXTENT_ORDERED bit to use PagePrivate2. As pages go into the ordered code, PagePrivate2 is set on each one. This is a cheap operation because we already have all the pages locked and ready to go. As IO finishes, the PagePrivate2 bit is cleared and the ordered accoutning is updated for each page. At writepage time, if the PagePrivate2 bit is missing, we go into the writepage fixup code to handle improperly dirtied pages. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: use a cached state for extent state operations during delallocChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes the btrfs code to find delalloc ranges in the extent state tree to use the new state caching code from set/test bit. It reduces one of the biggest causes of rbtree searches in the writeback path. test_range_bit is also modified to take the cached state as a starting point while searching. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: don't lock bits in the extent tree during writepageChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At writepage time, we have the page locked and we have the extent_map entry for this extent pinned in the extent_map tree. So, the page can't go away and its mapping can't change. There is no need for the extra extent_state lock bits during writepage. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: cache values for locking extentsChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many of the btrfs extent state tree users follow the same pattern. They lock an extent range in the tree, do some operation and then unlock. This translates to at least 2 rbtree searches, and maybe more if they are doing operations on the extent state tree. A locked extent in the tree isn't going to be merged or changed, and so we can safely return the extent state structure as a cached handle. This changes set_extent_bit to give back a cached handle, and also changes both set_extent_bit and clear_extent_bit to use the cached handle if it is available. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: reduce CPU usage in the extent_state treeChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs is currently mirroring some of the page state bits into its extent state tree. The goal behind this was to use it in supporting blocksizes other than the page size. But, we don't currently support that, and we're using quite a lot of CPU on the rb tree and its spin lock. This commit starts a series of cleanups to reduce the amount of work done in the extent state tree as part of each IO. This commit: * Adds the ability to lock an extent in the state tree and also set other bits. The idea is to do locking and delalloc in one call * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits. Btrfs is using a combination of the page bits and the ordered write code for this instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: Fix new state initialization orderChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the extent state tree is manipulated, there are call backs that are used to take extra actions when different state bits are set or cleared. One example of this is a counter for the total number of delayed allocation bytes in a single inode and in the whole FS. When new states are inserted, this callback is being done before we properly setup the new state. This hasn't caused problems before because the lock bit was always done first, and the existing call backs don't care about the lock bit. This patch makes sure the state is properly setup before using the callback, which is important for later optimizations that do more work without using the lock bit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: switch extent_map to a rw lockChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two main users of the extent_map tree. The first is regular file inodes, where it is evenly spread between readers and writers. The second is the chunk allocation tree, which maps blocks from logical addresses to phyiscal ones, and it is 99.99% reads. The mapping tree is a point of lock contention during heavy IO workloads, so this commit switches things to a rw lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: tweak congestion backoffChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The btrfs io submission thread tries to back off congested devices in favor of rotating off to another disk. But, it tries to make sure it submits at least some IO before rotating on (the others may be congested too), and so it has a magic number of requests it tries to write before it hops. This makes the magic number smaller. Testing shows that we're spending too much time on congested devices and leaving the other devices idle. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: use larger nr_to_write for larger extentsChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When btrfs fills a large delayed allocation extent, it is a good idea to try and convince the write_cache_pages caller to go ahead and write a good chunk of that extent. The extra IO is basically free because we know it is contiguous. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: reduce worker thread spin_lock_irq hold timesChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes the btrfs worker threads to batch work items into a local list. It allows us to pull work items in large chunks and significantly reduces the number of times we need to take the worker thread spinlock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: keep irqs on more often in the worker threadsChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The btrfs worker thread spinlock was being used both for the queueing of IO and for the processing of ordered events. The ordered events never happen from end_io handlers, and so they don't need to use the _irq version of spinlocks. This adds a dedicated lock to the ordered lists so they don't have to run with irqs off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: optimize set extent bitChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Btrfs set_extent_bit call currently searches the rbtree every time it needs to find more extent_state objects to fill the requested operation. This adds a simple test with rb_next to see if the next object in the tree was adjacent to the one we just found. If so, we skip the search and just use the next object. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| | | * Btrfs: Allow worker threads to exit when idleChris Mason2009-09-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Btrfs worker threads don't currently die off after they have been idle for a while, leading to a lot of threads sitting around doing nothing for each mount. Also, they are unable to start atomically (from end_io hanlders). This commit reworks the worker threads so they can be started from end_io handlers (just setting a flag that asks for a thread to be added at a later date) and so they can exit if they have been idle for a long time. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | | | Merge branch 'hwpoison' of ↵Linus Torvalds2009-09-24
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits) HWPOISON: Enable error_remove_page on btrfs HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs HWPOISON: Add madvise() based injector for hardware poisoned pages v4 HWPOISON: Enable error_remove_page for NFS HWPOISON: Enable .remove_error_page for migration aware file systems HWPOISON: The high level memory error handler in the VM v7 HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process HWPOISON: shmem: call set_page_dirty() with locked page HWPOISON: Define a new error_remove_page address space op for async truncation HWPOISON: Add invalidate_inode_page HWPOISON: Refactor truncate to allow direct truncating of page v2 HWPOISON: check and isolate corrupted free pages v2 HWPOISON: Handle hardware poisoned pages in try_to_unmap HWPOISON: Use bitmask/action code for try_to_unmap behaviour HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 HWPOISON: Add poison check to page fault handling HWPOISON: Add basic support for poisoned pages in fault handler v3 HWPOISON: Add new SIGBUS error codes for hardware poison signals HWPOISON: Add support for poison swap entries v2 HWPOISON: Export some rmap vma locking to outside world ...
| * | | HWPOISON: Enable error_remove_page on btrfsAndi Kleen2009-09-16
| | | | | | | | | | | | | | | | | | | | | | | | Cc: chris.mason@oracle.com Signed-off-by: Andi Kleen <ak@linux.intel.com>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2009-09-22
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) trivial: fix typo in aic7xxx comment trivial: fix comment typo in drivers/ata/pata_hpt37x.c trivial: typo in kernel-parameters.txt trivial: fix typo in tracing documentation trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c trivial: remove unnecessary semicolons trivial: Fix duplicated word "options" in comment trivial: kbuild: remove extraneous blank line after declaration of usage() trivial: improve help text for mm debug config options trivial: doc: hpfall: accept disk device to unload as argument trivial: doc: hpfall: reduce risk that hpfall can do harm trivial: SubmittingPatches: Fix reference to renumbered step trivial: fix typos "man[ae]g?ment" -> "management" trivial: media/video/cx88: add __init/__exit macros to cx88 drivers trivial: fix typo in CONFIG_DEBUG_FS in gcov doc trivial: fix missing printk space in amd_k7_smp_check trivial: fix typo s/ketymap/keymap/ in comment trivial: fix typo "to to" in multiple files trivial: fix typos in comments s/DGBU/DBGU/ ...
| * | | | trivial: remove unnecessary semicolonsJoe Perches2009-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | | const: mark remaining inode_operations as constAlexey Dobriyan2009-09-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | const: mark remaining address_space_operations constAlexey Dobriyan2009-09-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | const: mark remaining super_operations constAlexey Dobriyan2009-09-22
|/ / / / | | | | | | | | | | | | | | | | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | fs: Assign bdi in super_blockJens Axboe2009-09-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We do this automatically in get_sb_bdev() from the set_bdev_super() callback. Filesystems that have their own private backing_dev_info must assign that in ->fill_super(). Note that ->s_bdi assignment is required for proper writeback! Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | | | writeback: get rid of wbc->for_writepagesJens Axboe2009-09-16
|/ / / | | | | | | | | | | | | | | | | | | It's only set, it's never checked. Kill it. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | | Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2009-09-14
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits) block: use blkdev_issue_discard in blk_ioctl_discard Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads block: don't assume device has a request list backing in nr_requests store block: Optimal I/O limit wrapper cfq: choose a new next_req when a request is dispatched Seperate read and write statistics of in_flight requests aoe: end barrier bios with EOPNOTSUPP block: trace bio queueing trial only when it occurs block: enable rq CPU completion affinity by default cfq: fix the log message after dispatched a request block: use printk_once cciss: memory leak in cciss_init_one() splice: update mtime and atime on files block: make blk_iopoll_prep_sched() follow normal 0/1 return convention cfq-iosched: get rid of must_alloc flag block: use interrupts disabled version of raise_softirq_irqoff() block: fix comment in blk-iopoll.c block: adjust default budget for blk-iopoll block: fix long lines in block/blk-iopoll.c block: add blk-iopoll, a NAPI like approach for block devices ...
| * | | block: use blkdev_issue_discard in blk_ioctl_discardChristoph Hellwig2009-09-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard, the only difference between the two is that blkdev_issue_discard needs to send a barrier discard request and blk_ioctl_discard a non-barrier one, and blk_ioctl_discard needs to wait on the request. To facilitates this add a flags argument to blkdev_issue_discard to control both aspects of the behaviour. This will be very useful later on for using the waiting funcitonality for other callers. Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| * | | bio: first step in sanitizing the bio->bi_rw flag testingJens Axboe2009-09-11
| |/ / | | | | | | | | | | | | | | | | | | | | | Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* / / writeback: add name to backing_dev_infoJens Axboe2009-09-11
|/ / | | | | | | | | | | | | | | This enables us to track who does what and print info. Its main use is catching dirty inodes on the default_backing_dev_info, so we can fix that up. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | btrfs: fix inode rbtree corruptionFrom: Nick Piggin2009-08-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Node may not be inserted over existing node. This causes inode tree corruption and I was seeing crashes in inode_tree_del which I can not reproduce after this patch. The other way to fix this would be to tie inode lifetime in the rbtree with inode while not in freeing state. I had a look at this but it is not so trivial at this point. At least this patch gets things working again. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Acked-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2009-08-07
|\| | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix balancing oops when invalidate_inode_pages2 returns EBUSY Btrfs: correct error-handling zlib error handling Btrfs: remove superfluous NULL pointer check in btrfs_rename() Btrfs: make sure the async caching thread advances the key Btrfs: fix btrfs_remove_from_free_space corner case
| * Btrfs: fix balancing oops when invalidate_inode_pages2 returns EBUSYYan Zheng2009-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | invalidate_inode_pages2_range may return -EBUSY occasionally which results Oops. This patch fixes the issue by moving invalidate_inode_pages2_range into a loop and keeping calling it until the return value is not -EBUSY. The EBUSY return is temporary, and can happen when the btrfs release page function is unable to release a page because the EXTENT_LOCK bit is set. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: correct error-handling zlib error handlingJulia Lawall2009-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | find_zlib_workspace returns an ERR_PTR value in an error case instead of NULL. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @match exists@ expression x, E; statement S1, S2; @@ x = find_zlib_workspace(...) ... when != x = E ( * if (x == NULL || ...) S1 else S2 | * if (x == NULL && ...) S1 else S2 ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: remove superfluous NULL pointer check in btrfs_rename()Bartlomiej Zolnierkiewicz2009-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | This takes care of the following entry from Dan's list: fs/btrfs/inode.c +4788 btrfs_rename(36) warning: variable derefenced before check 'old_inode' Reported-by: Dan Carpenter <error27@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Eugene Teo <eteo@redhat.com> Cc: Julia Lawall <julia@diku.dk> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: make sure the async caching thread advances the keyChris Mason2009-07-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The async caching thread can end up looping forever if a given search puts it at the last key in a leaf. It will end up calling btrfs_next_leaf and then checking if it needs to politely drop the read semaphore. Most of the time this looping isn't noticed because it is able to make progress the next time around. But, during log replay, we wait on the async caching thread to finish, and the async thread is waiting on the commit, and no progress is really made. The fix used here is to copy the key out of the next leaf, that way our search lands there properly. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: fix btrfs_remove_from_free_space corner caseJosef Bacik2009-07-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yan Zheng hit a problem where we tried to remove some free space but failed because we couldn't find the free space entry. This is because the free space was held within a bitmap that had a starting offset well before the actual offset of the free space, and there were free space extents that were in the same range as that offset, so tree_search_offset returned with NULL because we couldn't find a free space extent that had that offset. This is fixed by making sure that if we fail to find the entry, we re-search again with bitmap_only set to 1 and do an offset_to_bitmap so we can get the appropriate bitmap. A similar problem happens in btrfs_alloc_from_bitmap for the clustering code, but that is not as bad since we will just go and redo our cluster allocation. Also this adds some debugging checks to make sure that the free space we are trying to remove from the bitmap is in fact there. This can probably go away after a while, but since this code is only used by the tree-logging stuff it would be nice to run with it for a while to make sure there are no problems. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2009-07-30
|\| | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: be more polite in the async caching threads Btrfs: preserve commit_root for async caching
| * Btrfs: be more polite in the async caching threadsChris Mason2009-07-30
| | | | | | | | | | | | | | | | | | The semaphore used by the async caching threads can prevent a transaction commit, which can make the FS appear to stall. This releases the semaphore more often when a transaction commit is in progress. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: preserve commit_root for async cachingYan Zheng2009-07-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The async block group caching code uses the commit_root pointer to get a stable version of the extent allocation tree for scanning. This copy of the tree root isn't going to change and it significantly reduces the complexity of the scanning code. During a commit, we have a loop where we update the extent allocation tree root. We need to loop because updating the root pointer in the tree of tree roots may allocate blocks which may change the extent allocation tree. Right now the commit_root pointer is changed inside this loop. It is more correct to change the commit_root pointer only after all the looping is done. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2009-07-28
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (22 commits) Btrfs: Fix async caching interaction with unmount Btrfs: change how we unpin extents Btrfs: Correct redundant test in add_inode_ref Btrfs: find smallest available device extent during chunk allocation Btrfs: clear all space_info->full after removing a block group Btrfs: make flushoncommit mount option correctly wait on ordered_extents Btrfs: Avoid delayed reference update looping Btrfs: Fix ordering of key field checks in btrfs_previous_item Btrfs: find_free_dev_extent doesn't handle holes at the start of the device Btrfs: Remove code duplication in comp_keys Btrfs: async block group caching Btrfs: use hybrid extents+bitmap rb tree for free space Btrfs: Fix crash on read failures at mount Btrfs: remove of redundant btrfs_header_level Btrfs: adjust NULL test Btrfs: Remove broken sanity check from btrfs_rmap_block() Btrfs: convert nested spin_lock_irqsave to spin_lock Btrfs: make sure all dirty blocks are written at commit time Btrfs: fix locking issue in btrfs_find_next_key Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf ...
| * Btrfs: Fix async caching interaction with unmountYan Zheng2009-07-28
| | | | | | | | | | | | | | | | | | | | - don't stop the caching thread until btrfs_commit_super return. - if caching is interrupted by umount, set last to (u64)-1. otherwise the un-scanned range of block group will be considered as free extent. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: change how we unpin extentsJosef Bacik2009-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are racy with async block caching and unpinning extents. This patch makes things much less complicated by only unpinning the extent if the block group is cached. We check the block_group->cached var under the block_group->lock spin lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters, and unpin the extent and add the free space back. If it is not set to this, we start the caching of the block group so the next time we unpin extents we can unpin the extent. This keeps us from racing with the async caching threads, lets us kill the fs wide async thread counter, and keeps us from having to set DELALLOC bits for every extent we hit if there are caching kthreads going. One thing that needed to be changed was btrfs_free_super_mirror_extents. Now instead of just looking for LOCKED extents, we also look for DIRTY extents, since we could have left some extents pinned in the previous transaction that will never get freed now that we are unmounting, which would cause us to leak memory. So btrfs_free_super_mirror_extents has been changed to btrfs_free_pinned_extents, and it will clear the extents locked for the super mirror, and any remaining pinned extents that may be present. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: Correct redundant test in add_inode_refJulia Lawall2009-07-27
| | | | | | | | | | | | | | | | | | | | | | dir has already been tested. It seems that this test should be on the recently returned value inode. A simplified version of the semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: find smallest available device extent during chunk allocationChris Mason2009-07-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocating new block group is easy when the disk has plenty of space. But things get difficult as the disk fills up, especially if the FS has been run through btrfs-vol -b. The balance operation is likely to make the total bytes available on the device greater than the largest extent we'll actually be able to allocate. But the device extent allocation code incorrectly assumes that a device with 5G free will be able to allocate a 5G extent. It isn't normally a problem because device extents don't get freed unless btrfs-vol -b is run. This fixes the device extent allocator to remember the largest free extent it can find, and then uses that value as a fallback. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: clear all space_info->full after removing a block groupChris Mason2009-07-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs allocates individual extents from block groups, and each block group has a specific type. It may hold metadata, data mirrored or striped etc. When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r) we free block groups. Once a block group is freed, the space it was using on the device may be available for use by new block groups. btrfs_remove_block_group was clearing the flag that said 'our devices are full, don't even try to allocate new block groups', but it was only clearing that flag for a specific type of block group. This commit clears the full flag for all of the types of block groups, making it much more likely that we'll be able to balance space when the drive is close to full. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: make flushoncommit mount option correctly wait on ordered_extentsSage Weil2009-07-24
| | | | | | | | | | | | | | | | | | | | | | | | | | The commit_transaction call to wait_ordered_extents when snap_pending passes nocow_only=1 to process only NOCOW or PREALLOC extents. This isn't correct for the 'flushoncommit' mode, as it skips extents we just started IO on in start_delalloc_inodes. So, in the flushoncommit case, wait on all ordered extents. Otherwise, only pass the nocow_only flag to wait_ordered_extents if snap_pending. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: Avoid delayed reference update loopingYan Zheng2009-07-24
| | | | | | | | | | | | | | | | | | | | | | | | btrfs_split_leaf and btrfs_del_items can end up in a loop where one is constantly spliting a given leaf and the other is constantly merging it back with the adjacent nodes. There is a better fix for this, but in the interest of something small, this patch just changes btrfs_del_items back to balancing less often. Signed-off-by: Chris Mason <chris.mason@oracle.com>