Fix race when checking i_size on direct i/o read

So far I've had one ACK for this, and no other comments. So I think it is probably time to send this via some suitable tree. I'm guessing that the vfs tree would be the most appropriate route, but not sure that there is one at the moment (don't see anything recent at kernel.org) so in that case I think -mm is the "back up plan". Al, please let me know if you will take this? Steve. --------------------- Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes" thread on linux-fsdevel, this patch is my current version of the fix proposed as option (b) in that thread. Removing the i_size test from the direct i/o read path at vfs level means that filesystems now have to deal with requests which are beyond i_size themselves. These I've divided into three sets: a) Those with "no op" ->direct_IO (9p, cifs, ceph) These are obviously not going to be an issue b) Those with "home brew" ->direct_IO (nfs, fuse) I've been told that NFS should not have any problem with the larger i_size, however I've added an extra test to FUSE to duplicate the original behaviour just to be on the safe side. c) Those using __blockdev_direct_IO() These call through to ->get_block() which should deal with the EOF condition correctly. I've verified that with GFS2 and I believe that Zheng has verified it for ext4. I've also run the test on XFS and it passes both before and after this change. The part of the patch in filemap.c looks a lot larger than it really is - there are only two lines of real change. The rest is just indentation of the contained code. There remains a test of i_size though, which was added for btrfs. It doesn't cause the other filesystems a problem as the test is performed after ->direct_IO has been called. It is possible that there is a race that does matter to btrfs, however this patch doesn't change that, so its still an overall improvement. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: Zheng Liu <gnehzuil.liu@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
author: Steven Whitehouse <swhiteho@redhat.com> 2014-01-24 09:42:22 -0500
committer: Al Viro <viro@zeniv.linux.org.uk> 2014-01-26 08:26:42 -0500
commit: 9fe55eea7e4b444bafc42fa0000cc2d1d2847275 (patch)
tree: 4672f575cce42e8147d31bc742a692340c08ccf2 /mm
parent: 2796e4cec525a2b1cace3b29b2f02735dafea007 (diff)
1 files changed, 20 insertions, 22 deletions
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..01842867c9d2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1428,30 +1428,28 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
                if (!count)
                        goto out; /* skip atime */
                size = i_size_read(inode);
-                if (pos < size) {
+                retval = filemap_write_and_wait_range(mapping, pos,
-                        retval = filemap_write_and_wait_range(mapping, pos,
                                        pos + iov_length(iov, nr_segs) - 1);
-                        if (!retval) {
+                if (!retval) {
-                                retval = mapping->a_ops->direct_IO(READ, iocb,
+                        retval = mapping->a_ops->direct_IO(READ, iocb,
-                                                        iov, pos, nr_segs);
+                                                           iov, pos, nr_segs);
-                        }
+                }
-                        if (retval > 0) {
+                if (retval > 0) {
-                                *ppos = pos + retval;
+                        *ppos = pos + retval;
-                                count -= retval;
+                        count -= retval;
-                        }
+                }
-                        /*
+                /*
-                         * Btrfs can have a short DIO read if we encounter
+                 * Btrfs can have a short DIO read if we encounter
-                         * compressed extents, so if there was an error, or if
+                 * compressed extents, so if there was an error, or if
-                         * we've already read everything we wanted to, or if
+                 * we've already read everything we wanted to, or if
-                         * there was a short read because we hit EOF, go ahead
+                 * there was a short read because we hit EOF, go ahead
-                         * and return.  Otherwise fallthrough to buffered io for
+                 * and return.  Otherwise fallthrough to buffered io for
-                         * the rest of the read.
+                 * the rest of the read.
-                         */
+                 */
-                        if (retval < 0 || !count || *ppos >= size) {
+                if (retval < 0 || !count || *ppos >= size) {
-                                file_accessed(filp);
+                        file_accessed(filp);
-                                goto out;
+                        goto out;
-                        }
                }
        }
author	Steven Whitehouse <swhiteho@redhat.com>	2014-01-24 09:42:22 -0500
committer	Al Viro <viro@zeniv.linux.org.uk>	2014-01-26 08:26:42 -0500
commit	9fe55eea7e4b444bafc42fa0000cc2d1d2847275 (patch)
tree	4672f575cce42e8147d31bc742a692340c08ccf2 /mm
parent	2796e4cec525a2b1cace3b29b2f02735dafea007 (diff)

diff --git a/mm/filemap.c b/mm/filemap.c index b7749a92021c..01842867c9d2 100644 --- a/mm/filemap.c +++ b/mm/filemap.c
@@ -1428,30 +1428,28 @@ generic_file_aio_read(struct kiocb iocb, const struct iovec iov,
1428	if (!count)	1428	if (!count)
1429	goto out; /* skip atime */	1429	goto out; /* skip atime */
1430	size = i_size_read(inode);	1430	size = i_size_read(inode);
1431	if (pos < size) {	1431	retval = filemap_write_and_wait_range(mapping, pos,
1432	retval = filemap_write_and_wait_range(mapping, pos,
1433	pos + iov_length(iov, nr_segs) - 1);	1432	pos + iov_length(iov, nr_segs) - 1);
1434	if (!retval) {	1433	if (!retval) {
1435	retval = mapping->a_ops->direct_IO(READ, iocb,	1434	retval = mapping->a_ops->direct_IO(READ, iocb,
1436	iov, pos, nr_segs);	1435	iov, pos, nr_segs);
1437	}	1436	}
1438	if (retval > 0) {	1437	if (retval > 0) {
1439	*ppos = pos + retval;	1438	*ppos = pos + retval;
1440	count -= retval;	1439	count -= retval;
1441	}	1440	}
1442		1441
1443	/*	1442	/*
1444	* Btrfs can have a short DIO read if we encounter	1443	* Btrfs can have a short DIO read if we encounter
1445	* compressed extents, so if there was an error, or if	1444	* compressed extents, so if there was an error, or if
1446	* we've already read everything we wanted to, or if	1445	* we've already read everything we wanted to, or if
1447	* there was a short read because we hit EOF, go ahead	1446	* there was a short read because we hit EOF, go ahead
1448	* and return. Otherwise fallthrough to buffered io for	1447	* and return. Otherwise fallthrough to buffered io for
1449	* the rest of the read.	1448	* the rest of the read.
1450	*/	1449	*/
1451	if (retval < 0 \|\| !count \|\| *ppos >= size) {	1450	if (retval < 0 \|\| !count \|\| *ppos >= size) {
1452	file_accessed(filp);	1451	file_accessed(filp);
1453	goto out;	1452	goto out;
1454	}
1455	}	1453	}
1456	}	1454	}
1457		1455