xfs: rework zero range to prevent invalid i_size updates

The zero range operation is analogous to fallocate with the exception of
converting the range to zeroes. That is, it attempts to allocate zeroed
blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.

The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond EOF, the page
write induced for the partial end page can itself increase the inode
size, even if the zero range request is not supposed to update i_size
(via KEEP_SIZE, similar to an fallocate beyond EOF).

The latter behavior not only incorrectly increases the inode size, but
can lead to stray delalloc blocks on the inode. Typically, post-EOF
preallocation blocks are either truncated on release or inode eviction,
or explicitly written to by xfs_zero_eof() on natural file size
extension. If the inode size increases due to zero range, however, the
associated blocks leak into the address space having never been
converted or mapped to pagecache pages. A direct I/O to such an
uncovered range cannot convert the extent via writeback and will BUG().
For example:

    $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
    ...
    $ xfs_io -d -c "pread 128k 128k" <file>
    <BUG>

If the entire delalloc extent happens to have no page coverage
whatsoever (e.g., delalloc conversion couldn't find a large enough free
space extent), even a full file writeback won't convert what's left of
the extent and we'll assert on inode eviction.

Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
Use the existing hole punch and prealloc mechanisms as primitives for
zero range. This implementation is neither efficient nor ideal, as we
write back dirty data over the range and remove existing extents rather
than convert them to unwritten. The writeback, however, is currently the
only mechanism available to ensure consistency between pagecache and
extent state. Even a pagecache truncate/delalloc punch prior to hole
punch has led to inconsistencies due to racing with writeback.

This provides a consistent, correct implementation of zero range that
survives fsstress/fsx testing without assert failures. The
implementation can be optimized from this point forward once the
fundamental issue of pagecache and delalloc extent state consistency is
addressed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
author: Brian Foster <bfoster@redhat.com> 2014-10-29 19:35:11 -0400
committer: Dave Chinner <david@fromorbit.com> 2014-10-29 19:35:11 -0400
commit: 5d11fb4b9a1d90983452c029b5e1377af78fda49 (patch)
tree: 919a9214bdce26cd6dcc3f6dbeef247a4c1d6c95 /fs/xfs
parent: f55fefd1a5a339b1bd08c120b93312d6eb64a9fb (diff)
1 files changed, 20 insertions, 52 deletions
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 92e8f99a5857..281002689d64 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1338,7 +1338,10 @@ xfs_free_file_space(
         goto out;
 }
 
-
+/*
+ * Preallocate and zero a range of a file. This mechanism has the allocation
+ * semantics of fallocate and in addition converts data in the range to zeroes.
+ */
 int
 xfs_zero_file_space(
         struct xfs_inode        *ip,
@@ -1346,65 +1349,30 @@ xfs_zero_file_space(
         xfs_off_t               len)
 {
         struct xfs_mount        *mp = ip->i_mount;
-        uint                    granularity;
-        xfs_off_t               start_boundary;
-        xfs_off_t               end_boundary;
+        uint                    blksize;
         int                     error;
 
         trace_xfs_zero_file_space(ip);
 
-        granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
+        blksize = 1 << mp->m_sb.sb_blocklog;
 
         /*
-         * Round the range of extents we are going to convert inwards.  If the
-         * offset is aligned, then it doesn't get changed so we zero from the
-         * start of the block offset points to.
+         * Punch a hole and prealloc the range. We use hole punch rather than
+         * unwritten extent conversion for two reasons:
+         *
+         * 1.) Hole punch handles partial block zeroing for us.
+         *
+         * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
+         * by virtue of the hole punch.
          */
-        start_boundary = round_up(offset, granularity);
-        end_boundary = round_down(offset + len, granularity);
-
-        ASSERT(start_boundary >= offset);
-        ASSERT(end_boundary <= offset + len);
-
-        if (start_boundary < end_boundary - 1) {
-                /*
-                 * Writeback the range to ensure any inode size updates due to
-                 * appending writes make it to disk (otherwise we could just
-                 * punch out the delalloc blocks).
-                 */
-                error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-                                start_boundary, end_boundary - 1);
-                if (error)
-                        goto out;
-                truncate_pagecache_range(VFS_I(ip), start_boundary,
-                                         end_boundary - 1);
-
-                /* convert the blocks */
-                error = xfs_alloc_file_space(ip, start_boundary,
-                                        end_boundary - start_boundary - 1,
-                                        XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT);
-                if (error)
-                        goto out;
-
-                /* We've handled the interior of the range, now for the edges */
-                if (start_boundary != offset) {
-                        error = xfs_iozero(ip, offset, start_boundary - offset);
-                        if (error)
-                                goto out;
-                }
-
-                if (end_boundary != offset + len)
-                        error = xfs_iozero(ip, end_boundary,
-                                           offset + len - end_boundary);
-
-        } else {
-                /*
-                 * It's either a sub-granularity range or the range spanned lies
-                 * partially across two adjacent blocks.
-                 */
-                error = xfs_iozero(ip, offset, len);
-        }
+        error = xfs_free_file_space(ip, offset, len);
+        if (error)
+                goto out;
 
+        error = xfs_alloc_file_space(ip, round_down(offset, blksize),
+                                     round_up(offset + len, blksize) -
+                                     round_down(offset, blksize),
+                                     XFS_BMAPI_PREALLOC);
 out:
         return error;
 