aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
...
| * | | | | | | | btrfs: ignore device open failures in __btrfs_open_devicesEric Sandeen2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This: # mkfs.btrfs /dev/sdb{1,2} ; wipefs -a /dev/sdb1; mount /dev/sdb2 /mnt/test would lead to a blkdev open/close mismatch when the mount fails, and a permanently busy (opened O_EXCL) sdb2: # wipefs -a /dev/sdb2 wipefs: error: /dev/sdb2: probing initialization failed: Device or resource busy It's because btrfs_open_devices() may open some devices, fail on the last one, and return that failure stored in "ret." The mount then fails, but the caller then does not clean up the open devices. Chris assures me that: "btrfs_open_devices just means: go off and open every bdev you can from this uuid. It should return success if we opened any of them at all." So change the logic to ignore any open failures; just skip processing of that device. Later on it's decided whether we have enough devices to continue. Reported-by: Jan Safranek <jsafrane@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: improve the performance of the csums lookupMiao Xie2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is very likely that there are several blocks in bio, it is very inefficient if we get their csums one by one. This patch improves this problem by getting the csums in batch. According to the result of the following test, the execute time of __btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us). # dd if=<mnt>/file of=/dev/null bs=1M count=1024 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: fix bad extent loggingJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A user sent me a btrfs-image of a file system that was panicing on mount during the log recovery. I had originally thought these problems were from a bug in the free space cache code, but that was just a symptom of the problem. The problem is if your application does something like this [prealloc][prealloc][prealloc] the internal extent maps will merge those all together into one extent map, even though on disk they are 3 separate extents. So if you go to write into one of these ranges the extent map will be right since we use the physical extent when doing the write, but when we log the extents they will use the wrong sizes for the remainder prealloc space. If this doesn't happen to trip up the free space cache (which it won't in a lot of cases) then you will get bogus entries in your extent tree which will screw stuff up later. The data and such will still work, but everything else is broken. This patch fixes this by not allowing extents that are on the modified list to be merged. This has the side effect that we are no longer adding everything to the modified list all the time, which means we now have to call btrfs_drop_extents every time we log an extent into the tree. So this allows me to drop all this speciality code I was using to get around calling btrfs_drop_extents. With this patch the testcase I've created no longer creates a bogus file system after replaying the log. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: log ram bytes properlyJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When logging changed extents I was logging ram_bytes as the current length, which isn't correct, it's supposed to be the ram bytes of the original extent. This is for compression where even if we split the extent we need to know the ram bytes so when we uncompress the extent we know how big it will be. This was still working out right with compression for some reason but I think we were getting lucky. It was definitely off for prealloc which is why I noticed it, btrfsck was complaining about it. With this patch btrfsck no longer complains after a log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: don't wait on ordered extents if we have a trans openJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave was hitting a lockdep warning because we're now properly taking the ordered operations mutex in the ordered wait stuff. This is because some cases we will have a trans handle when we are flushing delalloc space, but we can't wait on ordered extents because we could potentially deadlock, so fix this by not doing the wait if we have a trans handle. Thanks Reported-and-tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: fix error handling in make/read block groupJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I noticed that we will add a block group to the space info before we add it to the block group cache rb tree, so we could potentially allocate from the block group before it's able to be searched for. I don't think this is too much of a problem, the race window is microscopic, but just in case move the tree insertion to above the space info linking. This makes it easier to adjust the error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real error handling setup for these scenarios. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: fix double free in the iterate_extent_inodes()Wang Shilong2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If btrfs_find_all_roots() fails, 'roots' has been freed or 'roots' fails to allocate. We don't need to free it outside btrfs_find_all_roots() again.Fix it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: kill some BUG_ONs() in the find_parent_nodes()Wang Shilong2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The reason that BUG_ON() happens in these places is just because of ENOMEM. We try ro return ENOMEM rather than trigger BUG_ON(), the caller will abort the transaction thus avoiding the kernel panic. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: compare relevant parts of delayed tree refsJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A user reported a panic while running a balance. What was happening was he was relocating a block, which added the reference to the relocation tree. Then relocation would walk through the relocation tree and drop that reference and free that block, and then it would walk down a snapshot which referenced the same block and add another ref to the block. The problem is this was all happening in the same transaction, so the parent block was free'ed up when we drop our reference which was immediately available for allocation, and then it was used _again_ to add a reference for the same block from a different snapshot. This resulted in something like this in the delayed ref tree add ref to 90234880, parent=2067398656, ref_root 1766, level 1 del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1 add ref to 90234880, parent=2067398656, ref_root 1767, level 1 as you can see the ref_root's don't match, because when we inc the ref we use the header owner, which is the original tree the block belonged to, instead of the data reloc tree. Then when we remove the extent we use the reloc tree objectid. But none of this matters, since it is a shared reference which means only the parent matters. When the delayed ref stuff runs it adds all the increments first, and then does all the drops, to make sure that we don't delete the ref if we net a positive ref count. But tree blocks aren't allowed to have multiple refs from the same block, so this panics when it tries to add the second ref. We need the add and the drop to cancel each other out in memory so we only do the final add. So to fix this we need to adjust how the delayed refs are added to the tree. Only the ref_root matters when it is a normal backref, and only the parent matters when it is a shared backref. So make our decision based on what ref type we have. This allows us to keep the ref_root in memory in case anybody wants to use it for something else, and it allows the delayed refs to be merged properly so we don't end up with this panic. With this patch the users image no longer panics on mount, and it has a clean fsck after a normal mount/umount cycle. Thanks, Cc: stable@vger.kernel.org Reported-by: Roman Mamedov <rm@romanrm.ru> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: fix infinite loop when we abort on mountJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Testing my enospc log code I managed to abort a transaction during mount, which put me into an infinite loop. This is because of two things, first we don't reset trans_no_join if we abort during transaction commit, which will force anybody trying to start a transaction to just loop endlessly waiting for it to be set to 0. But this is still just a symptom, the second issue is we don't set the fs state to error during errors on mount. This is because we don't want to do the flip read only thing during mount, but we still really want to set the fs state to an error to keep us from even getting to the trans_no_join check. So fix both of these things, make sure to reset trans_no_join if we abort during a commit, and make sure we set the fs state to error no matter if we're mounting or not. This should keep us from getting into this infinite loop again. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: fix a warning when disabling quotaWang Shilong2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Steps to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs sub create <mnt>/subv i=1 while [ $i -le 10000 ] do dd if=/dev/zero of=<mnt>/subv/data_$i bs=1K count=1 i=$(($i+1)) if [ $i -eq 500 ] then btrfs quota disable $mnt fi done dmesg Obviously, this warn_on() is unnecessary, and it will be easily triggered. Just remove it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: pass NULL instead of 0Liu Bo2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | set_extent_bit()'s (u64 *failed_start) expects NULL not 0. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: make subvol creation/deletion killable in the early stagesDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The subvolume ioctls block on the parent directory mutex that can be held by other concurrent snapshot activity for a long time. Give the user at least some chance to get out of this situation by allowing to send a kill signal. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: cover more error codes in btrfs_decode_errorDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: make orphan cleanup less verboseDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The messages btrfs: unlinked 123 orphans btrfs: truncated 456 orphans are not useful to regular users and raise questions whether there are problems with the filesystem. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: deprecate subvolrootid mount optionDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This mount option was a workaround when subvol= assumed path relative to the default subvolume, not the toplevel one. This was fixed long time ago and subvolrootid has no effect. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: Include the device in most error printk()sSimon Kirby2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With more than one btrfs volume mounted, it can be very difficult to find out which volume is hitting an error. btrfs_error() will print this, but it is currently rigged as more of a fatal error handler, while many of the printk()s are currently for debugging and yet-unhandled cases. This patch just changes the functions where the device information is already available. Some cases remain where the root or fs_info is not passed to the function emitting the error. This may introduce some confusion with volumes backed by multiple devices emitting errors referring to the primary device in the set instead of the one on which the error occurred. Use btrfs_printk(fs_info, format, ...) rather than writing the device string every time, and introduce macro wrappers ala XFS for brevity. Since the function already cannot be used for continuations, print a newline as part of the btrfs_printk() message rather than at each caller. Signed-off-by: Simon Kirby <sim@hostway.ca> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: update kconfig titleDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Kconfig title does not make much sense after the cleanup of CONFIG_EXPERIMENTAL option, align the wording with other filesystems. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: clean snapshots one by oneDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each time pick one dead root from the list and let the caller know if it's needed to continue. This should improve responsiveness during umount and balance which at some point waits for cleaning all currently queued dead roots. A new dead root is added to the end of the list, so the snapshots disappear in the order of deletion. The snapshot cleaning work is now done only from the cleaner thread and the others wake it if needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: Cleanup some redundant codes in btrfs_log_inode()Zhi Yong Wu2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: Cleanup some redundant codes in btrfs_lookup_csums_range()Zhi Yong Wu2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: share stop worker codeLiu Bo2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Share the exactly same code of stopping workers. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: add a incompatible format change for smaller metadata extent refsJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently store the first key of the tree block inside the reference for the tree block in the extent tree. This takes up quite a bit of space. Make a new key type for metadata which holds the level as the offset and completely removes storing the btrfs_tree_block_info inside the extent ref. This reduces the size from 51 bytes to 33 bytes per extent reference for each tree block. In practice this results in a 30-35% decrease in the size of our extent tree, which means we COW less and can keep more of the extent tree in memory which makes our heavy metadata operations go much faster. This is not an automatic format change, you must enable it at mkfs time or with btrfstune. This patch deals with having metadata stored as either the old format or the new format so it is easy to convert. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: use helper to cleanup tree rootsLiu Bo2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | free_root_pointers() has been introduced to cleanup all of tree roots, so just use it instead. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: cleanup unused arguments of btrfs_csum_dataLiu Bo2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Argument 'root' is no more used in btrfs_csum_data(). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: clean up transaction abort messagesDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The transaction abort stacktrace is printed only once per module lifetime, but we'd like to see it each time it happens per mounted filesystem. Introduce a fs_state flag that records it. Tweak the messages around abort: * add error number to the first abort * print the exact negative errno from btrfs_decode_error * clean up btrfs_decode_error and callers * no dots at the end of the messages Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | btrfs: merge save_error_info helpers into oneDavid Sterba2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: add some free space cache testsJosef Bacik2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We keep hitting bugs in the tree log replay because btrfs_remove_free_space doesn't account for some corner case. So add a bunch of tests to try and fully test btrfs_remove_free_space since the only time it is called is during tree log replay. These tests all finish successfully, so as we find more of these bugs we need to add to these tests to make sure we don't regress in fixing things. I've hidden the tests behind a Kconfig option, but they take no time to run so all btrfs developers should have this turned on all the time. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | | | | | Btrfs: cleanup unused functionLiu Bo2013-04-29
| | |_|_|/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_abort_devices() is no more used. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* | | | | | | | Merge tag 'for-linus-v3.10-rc1-2' of git://oss.sgi.com/xfs/xfsLinus Torvalds2013-05-09
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs update (#2) from Ben Myers: - add CONFIG_XFS_WARN, a step between zero debugging and CONFIG_XFS_DEBUG. - fix attrmulti and attrlist to fall back to vmalloc when kmalloc fails. * tag 'for-linus-v3.10-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: fallback to vmalloc for large buffers in xfs_compat_attrlist_by_handle xfs: fallback to vmalloc for large buffers in xfs_attrlist_by_handle xfs: introduce CONFIG_XFS_WARN
| * | | | | | | | xfs: fallback to vmalloc for large buffers in xfs_compat_attrlist_by_handleEric Sandeen2013-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shamelessly copied from dchinner's: ad650f5b xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get xfsdump uses a large buffer for extended attributes, which has a kmalloc'd shadow buffer in the kernel. This can fail after the system has been running for some time as it is a high order allocation. Add a fallback to vmalloc so that it doesn't require contiguous memory and so won't randomly fail while xfsdump is running. This was done for xfs_attrlist_by_handle but xfs_compat_attrlist_by_handle (the 32-bit version) needs the same attention. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | | | | xfs: fallback to vmalloc for large buffers in xfs_attrlist_by_handleEric Sandeen2013-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shamelessly copied from dchinner's: ad650f5b xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get xfsdump uses for a large buffer for extended attributes, which has a kmalloc'd shadow buffer in the kernel. This can fail after the system has been running for some time as it is a high order allocation. Add a fallback to vmalloc so that it doesn't require contiguous memory and so won't randomly fail while xfsdump is running. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | | | | xfs: introduce CONFIG_XFS_WARNDave Chinner2013-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running a CONFIG_XFS_DEBUG kernel in production environments is not the best idea as it introduces significant overhead, can change the behaviour of algorithms (such as allocation) to improve test coverage, and (most importantly) panic the machine on non-fatal errors. There are many cases where all we want to do is run a kernel with more bounds checking enabled, such as is provided by the ASSERT() statements throughout the code, but without all the potential overhead and drawbacks. This patch converts all the ASSERT statements to evaluate as WARN_ON(1) statements and hence if they fail dump a warning and a stack trace to the log. This has minimal overhead and does not change any algorithms, and will allow us to find strange "out of bounds" problems more easily on production machines. There are a few places where assert statements contain debug only code. These are converted to be debug-or-warn only code so that we still get all the assert checks in the code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
* | | | | | | | | Merge tag 'nfs-for-3.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2013-05-09
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more NFS client bugfixes from Trond Myklebust: - Ensure that we match the 'sec=' mount flavour against the server list - Fix the NFSv4 byte range locking in the presence of delegations - Ensure that we conform to the NFSv4.1 spec w.r.t. freeing lock stateids - Fix a pNFS data server connection race * tag 'nfs-for-3.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS4.1 Fix data server connection race NFSv3: match sec= flavor against server list NFSv4.1: Ensure that we free the lock stateid on the server NFSv4: Convert nfs41_free_stateid to use an asynchronous RPC call SUNRPC: Don't spam syslog with "Pseudoflavor not found" messages NFSv4.x: Fix handling of partially delegated locks
| * | | | | | | | | NFS4.1 Fix data server connection raceAndy Adamson2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike meta data server mounts which support multiple mount points to the same server via struct nfs_server, data servers support a single connection. Concurrent calls to setup the data server connection can race where the first call allocates the nfs_client struct, and before the cache struct nfs_client pointer can be set, a second call also tries to setup the connection, finds the already allocated nfs_client, bumps the reference count, re-initializes the session,etc. This results in a hanging data server session after umount. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | | | | | | | | NFSv3: match sec= flavor against server listWeston Andros Adamson2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Older linux clients match the 'sec=' mount option flavor against the server's flavor list (if available) and return EPERM if the specified flavor or AUTH_NULL (which "matches" any flavor) is not found. Recent changes skip this step and allow the vfs mount even though no operations will succeed, creating a 'dud' mount. This patch reverts back to the old behavior of matching specified flavors against the server list and also returns EPERM when no sec= is specified and none of the flavors returned by the server are supported by the client. Example of behavior change: the server's /etc/exports: /export/krb5 *(sec=krb5,rw,no_root_squash) old client behavior: $ uname -a Linux one.apikia.fake 3.8.8-202.fc18.x86_64 #1 SMP Wed Apr 17 23:25:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux $ sudo mount -v -o sec=sys,vers=3 zero:/export/krb5 /mnt mount.nfs: timeout set for Sun May 5 17:32:04 2013 mount.nfs: trying text-based options 'sec=sys,vers=3,addr=192.168.100.10' mount.nfs: prog 100003, trying vers=3, prot=6 mount.nfs: trying 192.168.100.10 prog 100003 vers 3 prot TCP port 2049 mount.nfs: prog 100005, trying vers=3, prot=17 mount.nfs: trying 192.168.100.10 prog 100005 vers 3 prot UDP port 20048 mount.nfs: mount(2): Permission denied mount.nfs: access denied by server while mounting zero:/export/krb5 recently changed behavior: $ uname -a Linux one.apikia.fake 3.9.0-testing+ #2 SMP Fri May 3 20:29:32 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux $ sudo mount -v -o sec=sys,vers=3 zero:/export/krb5 /mnt mount.nfs: timeout set for Sun May 5 17:37:17 2013 mount.nfs: trying text-based options 'sec=sys,vers=3,addr=192.168.100.10' mount.nfs: prog 100003, trying vers=3, prot=6 mount.nfs: trying 192.168.100.10 prog 100003 vers 3 prot TCP port 2049 mount.nfs: prog 100005, trying vers=3, prot=17 mount.nfs: trying 192.168.100.10 prog 100005 vers 3 prot UDP port 20048 $ ls /mnt ls: cannot open directory /mnt: Permission denied $ sudo ls /mnt ls: cannot open directory /mnt: Permission denied $ sudo df /mnt df: ‘/mnt’: Permission denied df: no file systems processed $ sudo umount /mnt $ Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | | | | | | | | NFSv4.1: Ensure that we free the lock stateid on the serverTrond Myklebust2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This ensures that the server doesn't need to keep huge numbers of lock stateids waiting around for the final CLOSE. See section 8.2.4 in RFC5661. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | | | | | | | | NFSv4: Convert nfs41_free_stateid to use an asynchronous RPC callTrond Myklebust2013-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The main reason for doing this is will be to allow for an asynchronous RPC mode that we can use for freeing lock stateids as per section 8.2.4 of RFC5661. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | | | | | | | | NFSv4.x: Fix handling of partially delegated locksTrond Myklebust2013-05-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a NFS client receives a delegation for a file after it has taken a lock on that file, we can currently end up in a situation where we mistakenly skip unlocking that file. The following patch swaps an erroneous check in nfs4_proc_unlck for whether or not the file has a delegation to one which checks whether or not we hold a lock stateid for that file. Reported-by: Chuck Lever <Chuck.Lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>=3.7] Tested-by: Chuck Lever <Chuck.Lever@oracle.com>
* | | | | | | | | | Merge tag 'f2fs-for-v3.10' of ↵Linus Torvalds2013-05-08
|\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "This patch-set includes the following major enhancement patches. - introduce a new gloabl lock scheme - add tracepoints on several major functions - fix the overall cleaning process focused on victim selection - apply the block plugging to merge IOs as much as possible - enhance management of free nids and its list - enhance the readahead mode for node pages - address several cretical deadlock conditions - reduce lock_page calls The other minor bug fixes and enhancements are as follows. - calculation mistakes: overflow - bio types: READ, READA, and READ_SYNC - fix the recovery flow, data races, and null pointer errors" * tag 'f2fs-for-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits) f2fs: cover free_nid management with spin_lock f2fs: optimize scan_nat_page() f2fs: code cleanup for scan_nat_page() and build_free_nids() f2fs: bugfix for alloc_nid_failed() f2fs: recover when journal contains deleted files f2fs: continue to mount after failing recovery f2fs: avoid deadlock during evict after f2fs_gc f2fs: modify the number of issued pages to merge IOs f2fs: remove useless #include <linux/proc_fs.h> as we're now using sysfs as debug entry. f2fs: fix inconsistent using of NM_WOUT_THRESHOLD f2fs: check truncation of mapping after lock_page f2fs: enhance alloc_nid and build_free_nids flows f2fs: add a tracepoint on f2fs_new_inode f2fs: check nid == 0 in add_free_nid f2fs: add REQ_META about metadata requests for submit f2fs: give a chance to merge IOs by IO scheduler f2fs: avoid frequent background GC f2fs: add tracepoints to debug checkpoint request f2fs: add tracepoints for write page operations f2fs: add tracepoints to debug the block allocation ...
| * | | | | | | | | | f2fs: cover free_nid management with spin_lockJaegeuk Kim2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After build_free_nids() searches free nid candidates from nat pages and current journal blocks, it checks all the candidates if they are allocated so that the nat cache has its nid with an allocated block address. In this procedure, previously we used list_for_each_entry_safe(fnid, next_fnid, &nm_i->free_nid_list, list). But, this is not covered by free_nid_list_lock, resulting in null pointer bug. This patch moves this checking routine inside add_free_nid() in order not to use the spin_lock. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: optimize scan_nat_page()Haicheng Li2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When nm_i->fcnt > 2 * MAX_FREE_NIDS, stop scanning other NAT entries. Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> [Jaegeuk Kim: fix handling the return value of add_free_nid()] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: code cleanup for scan_nat_page() and build_free_nids()Haicheng Li2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch does two cleanups: 1. remove unused variable "fcnt" in build_free_nids(). 2. make scan_nat_page() as void type and remove useless variable "fcnt". Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: bugfix for alloc_nid_failed()Haicheng Li2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Directly drop the free_nid cache when nm_i->fcnt > 2 * MAX_FREE_NIDS Since there is NOT nmi->free_nid_list_lock spinlock protection between a sequential calling of alloc_nid() and alloc_nid_failed(), some other threads may already add new free_nid to the free_nid_list during this period. We need to make sure nmi->fcnt is never > 2 * MAX_FREE_NIDS. Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> [Jaegeuk Kim: fit the coding style] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: recover when journal contains deleted filesChris Fries2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When recovering a journal file with fsync data for files that have been deleted, don't bail out on recovery. Signed-off-by: Chris Fries <C.Fries@motorola.com> Reviewed-by: Russell Knize <rknize2@motorola.com> Reviewed-by: Jason Hrycay <jason.hrycay@motorola.com> [Jaegeuk Kim: fit the coding style] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: continue to mount after failing recoveryChris Fries2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When unable to roll forward the journal, we shouldn't bail out and not mount, we should continue to attempt the mount. Bad recovery data is likely unrecoverable at this point, and requiring the user to try to mount again doesn't solve any issues. Signed-off-by: Chris Fries <C.Fries@motorola.com> Reviewed-by: Russell Knize <rknize2@motorola.com> Reviewed-by: Jason Hrycay <jason.hrycay@motorola.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: avoid deadlock during evict after f2fs_gcJaegeuk Kim2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | o Deadlock case #1 Thread 1: - writeback_sb_inodes - do_writepages - f2fs_write_data_pages - write_cache_pages - f2fs_write_data_page - f2fs_balance_fs - wait mutex_lock(gc_mutex) Thread 2: - f2fs_balance_fs - mutex_lock(gc_mutex) - f2fs_gc - f2fs_iget - wait iget_locked(inode->i_lock) Thread 3: - do_unlinkat - iput - lock(inode->i_lock) - evict - inode_wait_for_writeback o Deadlock case #2 Thread 1: - __writeback_single_inode : set I_SYNC - do_writepages - f2fs_write_data_page - f2fs_balance_fs - f2fs_gc - iput - evict - inode_wait_for_writeback(I_SYNC) In order to avoid this, even though iput is called with the zero-reference count, we need to stop the eviction procedure if the inode is on writeback. So this patch links f2fs_drop_inode which checks the I_SYNC flag. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: modify the number of issued pages to merge IOsJaegeuk Kim2013-04-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When testing f2fs on an SSD, I found some 128 page IOs followed by 1 page IO were issued by f2fs_write_node_pages. This means that there were some mishandling flows which degrades performance. Previous f2fs_write_node_pages determines the number of pages to be written, nr_to_write, as follows. 1. The bio_get_nr_vecs returns 129 pages. 2. The bio_alloc makes a room for 128 pages. 3. The initial 128 pages go into one bio. 4. The existing bio is submitted, and a new bio is prepared for the last 1 page. 5. Finally, sync_node_pages submits the last 1 page bio. The problem is from the use of bio_get_nr_vecs, so this patch replace it with max_hw_blocks using queue_max_sectors. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: remove useless #include <linux/proc_fs.h> as we're now using sysfs as ↵Haicheng Li2013-04-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | debug entry. Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
| * | | | | | | | | | f2fs: fix inconsistent using of NM_WOUT_THRESHOLDHaicheng Li2013-04-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | try_to_free_nats() is usually called with parameter nr_shrink as "nm_i->nat_cnt - NM_WOUT_THRESHOLD" by flush_nat_entries() during checkpointing process. However, this is inconsistent with the actual threshold check as "if (nm_i->nat_cnt < 2 * NM_WOUT_THRESHOLD)" , which will ignore the free_nats requests when NM_WOUT_THRESHOLD < nm_i->nat_cnt < 2 * NM_WOUT_THRESHOLD So fix the threshold check condition. Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>