aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorFilipe Manana <fdmanana@suse.com>2018-11-12 05:23:58 -0500
committerDavid Sterba <dsterba@suse.com>2018-11-13 07:49:43 -0500
commitaab15e8ec25765cf7968c72cbec7583acf99d8a4 (patch)
tree75f30255aa9b32ca7f8b2bfe8eb2241983fd90f0
parentf8397d69daef06d358430d3054662fb597e37c00 (diff)
Btrfs: fix rare chances for data loss when doing a fast fsync
After the simplification of the fast fsync patch done recently by commit b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") and commit e7175a692765 ("btrfs: remove the wait ordered logic in the log_one_extent path"), we got a very short time window where we can get extents logged without writeback completing first or extents logged without logging the respective data checksums. Both issues can only happen when doing a non-full (fast) fsync. As soon as we enter btrfs_sync_file() we trigger writeback, then lock the inode and then wait for the writeback to complete before starting to log the inode. However before we acquire the inode's lock and after we started writeback, it's possible that more writes happened and dirtied more pages. If that happened and those pages get writeback triggered while we are logging the inode (for example, the VM subsystem triggering it due to memory pressure, or another concurrent fsync), we end up seeing the respective extent maps in the inode's list of modified extents and will log matching file extent items without waiting for the respective ordered extents to complete, meaning that either of the following will happen: 1) We log an extent after its writeback finishes but before its checksums are added to the csum tree, leading to -EIO errors when attempting to read the extent after a log replay. 2) We log an extent before its writeback finishes. Therefore after the log replay we will have a file extent item pointing to an unwritten extent (and without the respective data checksums as well). This could not happen before the fast fsync patch simplification, because for any extent we found in the list of modified extents, we would wait for its respective ordered extent to finish writeback or collect its checksums for logging if it did not complete yet. Fix this by triggering writeback again after acquiring the inode's lock and before waiting for ordered extents to complete. Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the log_one_extent path") Fixes: b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-rw-r--r--fs/btrfs/file.c24
1 files changed, 24 insertions, 0 deletions
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 97c7a086f7bd..b92b7f05c3d5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2089,6 +2089,30 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
2089 atomic_inc(&root->log_batch); 2089 atomic_inc(&root->log_batch);
2090 2090
2091 /* 2091 /*
2092 * Before we acquired the inode's lock, someone may have dirtied more
2093 * pages in the target range. We need to make sure that writeback for
2094 * any such pages does not start while we are logging the inode, because
2095 * if it does, any of the following might happen when we are not doing a
2096 * full inode sync:
2097 *
2098 * 1) We log an extent after its writeback finishes but before its
2099 * checksums are added to the csum tree, leading to -EIO errors
2100 * when attempting to read the extent after a log replay.
2101 *
2102 * 2) We can end up logging an extent before its writeback finishes.
2103 * Therefore after the log replay we will have a file extent item
2104 * pointing to an unwritten extent (and no data checksums as well).
2105 *
2106 * So trigger writeback for any eventual new dirty pages and then we
2107 * wait for all ordered extents to complete below.
2108 */
2109 ret = start_ordered_ops(inode, start, end);
2110 if (ret) {
2111 inode_unlock(inode);
2112 goto out;
2113 }
2114
2115 /*
2092 * We have to do this here to avoid the priority inversion of waiting on 2116 * We have to do this here to avoid the priority inversion of waiting on
2093 * IO of a lower priority task while holding a transaciton open. 2117 * IO of a lower priority task while holding a transaciton open.
2094 */ 2118 */