Btrfs: add semaphore to synchronize direct IO writes with fsync

Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs: implement unlocked dio write"), we started having races between such writes and concurrent fsync operations that use the fast fsync path. These races were addressed in the patches titled "Btrfs: fix race between fsync and lockless direct IO writes" and "Btrfs: fix race between fsync and direct IO writes for prealloc extents".

The races happened because the direct IO path, like every other write path, creates extent maps followed by the corresponding ordered extents, while the fast fsync path collected ordered extents first and extent maps second. This made it possible to log file extent items (based on the collected extent maps) without waiting for the corresponding ordered extents to complete (get their IO done).

The two fixes mentioned above made the direct IO path create the ordered extents first and then the extent maps, while the fsync path attempts to collect any new ordered extents after it collects the extent maps. This was simple and did not require adding any synchronization primitive to any data structure (struct btrfs_inode for example), but it makes things more fragile for future development and makes the direct IO path an exception compared to the other write paths.

This change adds a read-write semaphore to the btrfs inode structure. The direct IO path creates the extent maps and the ordered extents while holding read access on that semaphore, and the fast fsync path collects extent maps and ordered extents while holding write access on it. The logic for the direct IO write path is encapsulated in a new helper function that is used for both cow and nocow direct IO writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
author: Filipe Manana <fdmanana@suse.com> 2016-05-12 08:53:36 -0400
committer: Filipe Manana <fdmanana@suse.com> 2016-05-12 20:59:36 -0400
commit: 5f9a8a51d8b95505d8de8b7191ae2ed8c504d4af (patch)
tree: d97a7f5d321694e09c3046e9027c23a02d6a5878 /fs/btrfs/tree-log.c
parent: f78c436c3931e7df713688028f2b4faf72bf9f2a (diff)
1 files changed, 14 insertions, 37 deletions
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a24a0ba523d6..003a826f4cff 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4141,6 +4141,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
        INIT_LIST_HEAD(&extents);
+        down_write(&BTRFS_I(inode)->dio_sem);
        write_lock(&tree->lock);
        test_gen = root->fs_info->last_trans_committed;
@@ -4169,13 +4170,20 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
        }
        list_sort(NULL, &extents, extent_cmp);
+        btrfs_get_logged_extents(inode, logged_list, start, end);
        /*
-         * Collect any new ordered extents within the range. This is to
+         * Some ordered extents started by fsync might have completed
-         * prevent logging file extent items without waiting for the disk
+         * before we could collect them into the list logged_list, which
-         * location they point to being written. We do this only to deal
+         * means they're gone, not in our logged_list nor in the inode's
-         * with races against concurrent lockless direct IO writes.
+         * ordered tree. We want the application/user space to know an
+         * error happened while attempting to persist file data so that
+         * it can take proper action. If such error happened, we leave
+         * without writing to the log tree and the fsync must report the
+         * file data write error and not commit the current transaction.
         */
-        btrfs_get_logged_extents(inode, logged_list, start, end);
+        ret = btrfs_inode_check_errors(inode);
+        if (ret)
+                ctx->io_err = ret;
 process:
        while (!list_empty(&extents)) {
                em = list_entry(extents.next, struct extent_map, list);
@@ -4202,6 +4210,7 @@ process:
        }
        WARN_ON(!list_empty(&extents));
        write_unlock(&tree->lock);
+        up_write(&BTRFS_I(inode)->dio_sem);
        btrfs_release_path(path);
        return ret;
@@ -4623,23 +4632,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
        mutex_lock(&BTRFS_I(inode)->log_mutex);
        /*
-         * Collect ordered extents only if we are logging data. This is to
-         * ensure a subsequent request to log this inode in LOG_INODE_ALL mode
-         * will process the ordered extents if they still exists at the time,
-         * because when we collect them we test and set for the flag
-         * BTRFS_ORDERED_LOGGED to prevent multiple log requests to process the
-         * same ordered extents. The consequence for the LOG_INODE_ALL log mode
-         * not processing the ordered extents is that we end up logging the
-         * corresponding file extent items, based on the extent maps in the
-         * inode's extent_map_tree's modified_list, without logging the
-         * respective checksums (since the may still be only attached to the
-         * ordered extents and have not been inserted in the csum tree by
-         * btrfs_finish_ordered_io() yet).
-         */
-        if (inode_only == LOG_INODE_ALL)
-                btrfs_get_logged_extents(inode, &logged_list, start, end);
-        /*
         * a brute force approach to making sure we get the most uptodate
         * copies of everything.
         */
@@ -4846,21 +4838,6 @@ log_extents:
                        goto out_unlock;
        }
        if (fast_search) {
-                /*
-                 * Some ordered extents started by fsync might have completed
-                 * before we collected the ordered extents in logged_list, which
-                 * means they're gone, not in our logged_list nor in the inode's
-                 * ordered tree. We want the application/user space to know an
-                 * error happened while attempting to persist file data so that
-                 * it can take proper action. If such error happened, we leave
-                 * without writing to the log tree and the fsync must report the
-                 * file data write error and not commit the current transaction.
-                 */
-                err = btrfs_inode_check_errors(inode);
-                if (err) {
-                        ctx->io_err = err;
-                        goto out_unlock;
-                }
                ret = btrfs_log_changed_extents(trans, root, inode, dst_path,
                                                &logged_list, ctx, start, end);
                if (ret) {