xfs: don't use speculative prealloc for small files

Dedicated small file workloads have been seeing significant free space fragmentation causing premature inode allocation failure when large inode sizes are in use. A particular test case showed that a workload that runs to a real ENOSPC on 256 byte inodes would fail inode allocation with ENOSPC at about 80% full with 512 byte inodes, and at about 50% full with 1024 byte inodes.

The same workload, when run with -o allocsize=4096 on 1024 byte inodes, would run to being 100% full before giving ENOSPC. That is, no freespace fragmentation at all.

The issue was caused by the specific IO pattern the application had - the framework it was using did not support direct IO, and so it was emulating it by using fadvise(DONT_NEED). The result was that the data was getting written back before the speculative prealloc had been trimmed from memory by the close(), and so small single block files were being allocated with 2 blocks, and then having one truncated away. This left lots of small 4k free space extents, and hence each new 8k allocation would take another 8k from contiguous free space and turn it into 4k of allocated space and 4k of free space.

Hence inode allocation, which requires contiguous, aligned allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k (1024 byte inodes), can fail to find sufficiently large freespace and hence fail while there is still lots of free space available.

There's a simple fix for this, and one that has precedent in the allocator code already - don't do speculative allocation unless the size of the file is larger than a certain size. In this case, that size is the minimum default preallocation size: mp->m_writeio_blocks. And to keep with the concept of being nice to people when the files are still relatively small, cap the prealloc to mp->m_writeio_blocks until the file goes over a stripe unit in size, at which point we'll fall back to the current behaviour based on the last extent size.

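The size thresholds described above can be sketched as a small standalone model. This is only an illustration, not the kernel code: struct mount_model, enum prealloc_policy and prealloc_policy_for() are hypothetical names, the byte conversion stands in for XFS_FSB_TO_B(), and the other conditions checked by xfs_iomap_eof_want_preallocate() (such as real blocks past EOF) are left out. It also assumes that a fixed allocsize results in a preallocation of mp->m_writeio_blocks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct xfs_mount fields the patch reads. */
struct mount_model {
	uint64_t block_size;      /* filesystem block size in bytes */
	uint64_t writeio_blocks;  /* models mp->m_writeio_blocks */
	uint64_t dalign_blocks;   /* models mp->m_dalign (stripe unit), 0 if none */
	bool     fixed_allocsize; /* models XFS_MOUNT_DFLT_IOSIZE (-o allocsize=) */
};

enum prealloc_policy {
	PREALLOC_NONE,    /* no speculative preallocation at all */
	PREALLOC_WRITEIO, /* preallocate mp->m_writeio_blocks */
	PREALLOC_DYNAMIC, /* size the preallocation from the last extent */
};

/* Classify a file of isize bytes by the thresholds in the commit message. */
enum prealloc_policy
prealloc_policy_for(const struct mount_model *mp, uint64_t isize)
{
	/* A fixed allocsize skips both new checks (assumption: it then
	 * preallocates m_writeio_blocks, as set by the mount option). */
	if (mp->fixed_allocsize)
		return PREALLOC_WRITEIO;

	/* Very small file: likely the only write it will ever see. */
	if (isize < mp->writeio_blocks * mp->block_size)
		return PREALLOC_NONE;

	/* Still below one stripe unit: keep to the minimum prealloc. */
	if (isize < mp->dalign_blocks * mp->block_size)
		return PREALLOC_WRITEIO;

	/* Anything larger behaves as it did before the patch. */
	return PREALLOC_DYNAMIC;
}
```

With illustrative values of 4096 byte blocks, 16 writeio blocks and a 64 block stripe unit, a 4k file gets no preallocation, a 64k file gets the minimum, and a 256k file falls through to the existing dynamic sizing. When no stripe unit is configured (dalign_blocks is 0) the middle case never triggers in this model.
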
This will effectively turn off speculative prealloc for very small files, keep preallocation low for small files, and behave as it currently does for any file larger than a stripe unit. This completely avoids the freespace fragmentation problem this particular IO pattern was causing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
author: Dave Chinner <dchinner@redhat.com> 2013-06-27 02:04:48 -0400
committer: Ben Myers <bpm@sgi.com> 2013-06-27 14:27:37 -0400
commit: 133eeb1747c33b6d75483c074b27d4e5e02286dc (patch)
tree: e63b28edac4667334dd3a7ce0c99b6278e91c4a5 /fs/xfs/xfs_iomap.c
parent: 34eefc06a06f496b92c3267a0601129a932c7174 (diff)
1 files changed, 13 insertions, 0 deletions
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8f8aaee7f379..6a7096422295 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -284,6 +284,15 @@ xfs_iomap_eof_want_preallocate(
                return 0;

        /*
+         * If the file is smaller than the minimum prealloc and we are using
+         * dynamic preallocation, don't do any preallocation at all as it is
+         * likely this is the only write to the file that is going to be done.
+         */
+        if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE) &&
+            XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_writeio_blocks))
+                return 0;
+
+        /*
         * If there are any real blocks past eof, then don't
         * do any speculative allocation.
         */
@@ -345,6 +354,10 @@ xfs_iomap_eof_prealloc_initial_size(
        if (mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)
                return 0;

+        /* If the file is small, then use the minimum prealloc */
+        if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign))
+                return 0;
+
        /*
         * As we write multiple pages, the offset will always align to the
         * start of a page and hence point to a hole at EOF. i.e. if the size is