author	Shaohua Li <shli@kernel.org>	2014-02-06 15:04:21 -0500
committer	Linus Torvalds <torvalds@linux-foundation.org>	2014-02-06 16:48:51 -0500
commit	579f82901f6f41256642936d7e632f3979ad76d4
tree	13fbb21ce5ef3cefccc80675411614f4b9bca9d0
parent	fb951eb5e167de9f07973ce0dfff674a2019bfab
swap: add a simple detector for inappropriate swapin readahead
This is a patch to improve the swap readahead algorithm. It's from Hugh and I slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin is sequential. This may be ok on harddisk, because large reads have relatively small costs, and if the readahead pages are unneeded they can be reclaimed easily - though, what if their allocation forced reclaim of useful pages? But on SSD devices large reads are more expensive than small ones: if the readahead pages are unneeded, reading them in causes significant overhead.

This patch adds very simplistic random read detection. Stealing the PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages() simply looks at readahead's current success rate, and narrows or widens its readahead window accordingly. There is little science to its heuristic: it's about as stupid as can be whilst remaining effective.

The tables below show elapsed times (in centiseconds) when running a single repetitive swapping load across a 1000MB mapping in 900MB ram with 1GB swap (the harddisk tests had taken painfully long when I used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch which Shaohua showed to be defective; HughNew this Nov 14 patch, with page_cluster as usual at default of 3 (8-page reads); HughPC4 this same patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0 (1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for sequential access to the mapping, cycling five times around; Rand for the same number of random touches. Anon for a MAP_PRIVATE anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did not optimize Shmem: seen below. Konstantin's approach was perhaps mistuned, 50% slower on Seq: did not compete and is not shown below.

HDD          Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
Seq Anon       73921    76210    75611    76904    78191   121542
Seq Shmem      73601    73176    73855    72947    74543   118322
Rand Anon     895392   831243   871569   845197   846496   841680
Rand Shmem   1058375  1053486   827935   764955   764376   756489

SSD          Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
Seq Anon       24634    24198    24673    25107    21614    70018
Seq Shmem      24959    24932    25052    25703    22030    69678
Rand Anon      43014    26146    28075    25989    26935    25901
Rand Shmem     45349    45215    28249    24268    24138    24332

These tests are, of course, two extremes of a very simple case: under heavier mixed loads I've not yet observed any consistent improvement or degradation, and wider testing would be welcome.

Shaohua Li:

Tests show Vanilla is slightly better in the sequential workload than Hugh's patch. I observed that with Hugh's patch the readahead size is sometimes shrunk too fast (from 8 to 1 immediately) in a sequential workload if there is no hit, and in such a case continuing to do readahead is actually good. I don't prepare a sophisticated algorithm for the sequential workload because so far we can't guarantee sequentially accessed pages are swapped out sequentially. So I slightly changed Hugh's heuristic - don't shrink the readahead size too fast.

Here is my test result (unit: seconds, average of 3 runs):

		Vanilla	Hugh	New
Seq		356	370	360
Random		4525	2447	2444

The attached graph is the swapin/swapout throughput I collected with 'vmstat 2'. The first part is running a random workload (till around 1200 on the x-axis) and the second part is running a sequential workload. swapin and swapout throughput are almost identical in steady state in both workloads, which is the expected behavior; in Vanilla, by contrast, swapin is much bigger than swapout, especially in the random workload (because of wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
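[Editor's note] As a rough illustration of the window-sizing heuristic described above - not the kernel code itself, which follows in the diff - here is a minimal userspace C sketch. The sim_* names, the PAGE_CLUSTER macro and the example fault sequence are invented for this sketch; in the kernel the "hits" count comes from TestClearPageReadahead() in lookup_swap_cache().

	/*
	 * Minimal userspace sketch of the swapin readahead window heuristic.
	 * Illustration only: the real implementation is swapin_nr_pages().
	 */
	#include <stdio.h>

	#define PAGE_CLUSTER 3			/* kernel default: 8-page window */

	static unsigned long sim_prev_offset;	/* last faulting swap offset */
	static unsigned int sim_last_ra;	/* previous window, for slow shrink */

	static unsigned int sim_nr_pages(unsigned long offset, unsigned int hits)
	{
		unsigned int pages, max_pages = 1u << PAGE_CLUSTER;

		pages = hits + 2;		/* "+ 2" is purely empirical */
		if (pages == 2) {
			/* No hits: only read ahead if the fault looks sequential. */
			if (offset != sim_prev_offset + 1 && offset != sim_prev_offset - 1)
				pages = 1;
			sim_prev_offset = offset;
		} else {
			unsigned int roundup = 4;
			while (roundup < pages)	/* round up to a power of two */
				roundup <<= 1;
			pages = roundup;
		}
		if (pages > max_pages)
			pages = max_pages;
		if (pages < sim_last_ra / 2)	/* Shaohua's tweak: shrink slowly */
			pages = sim_last_ra / 2;
		sim_last_ra = pages;
		return pages;
	}

	int main(void)
	{
		/* Many hits widen the window; repeated misses collapse it gradually. */
		printf("6 hits        -> %u pages\n", sim_nr_pages(100, 6));  /* 8 */
		printf("0 hits        -> %u pages\n", sim_nr_pages(101, 0));  /* 4: slow shrink */
		printf("0 hits, rand  -> %u pages\n", sim_nr_pages(9000, 0)); /* 2 */
		printf("0 hits, rand  -> %u pages\n", sim_nr_pages(42, 0));   /* 1: no readahead */
		return 0;
	}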
-rw-r--r--	include/linux/page-flags.h	4
-rw-r--r--	mm/swap_state.c	63
2 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e464b4e987e8..d1fe1a761047 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1)
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 98e85e9c2b2d..e76ace30d436 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void)
 	return ret;
 }
 
+static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages());
@@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			atomic_inc(&swapin_readahead_hits);
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -389,6 +394,50 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	return found_page;
 }
 
+static unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int pages, max_pages, last_ra;
+	static atomic_t last_readahead_pages;
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	/*
+	 * This heuristic has been found to work well on both sequential and
+	 * random loads, swapping to hard disk or to SSD: please don't ask
+	 * what the "+ 2" means, it just happens to work well, that's all.
+	 */
+	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	if (pages == 2) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check whether swap type is same).
+		 */
+		if (offset != prev_offset + 1 && offset != prev_offset - 1)
+			pages = 1;
+		prev_offset = offset;
+	} else {
+		unsigned int roundup = 4;
+		while (roundup < pages)
+			roundup <<= 1;
+		pages = roundup;
+	}
+
+	if (pages > max_pages)
+		pages = max_pages;
+
+	/* Don't shrink readahead too fast */
+	last_ra = atomic_read(&last_readahead_pages) / 2;
+	if (pages < last_ra)
+		pages = last_ra;
+	atomic_set(&last_readahead_pages, pages);
+
+	return pages;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
@@ -412,11 +461,16 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+	if (!mask)
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -430,10 +484,13 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
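
[Editor's note] For reference, the cluster read in swapin_readahead() stays aligned to the window size: with mask = pages - 1 (pages a power of two), offset & ~mask and offset | mask bound an aligned block containing the faulting offset. A small standalone illustration with arbitrary example values:

	/* Standalone illustration of the aligned-window arithmetic used in
	 * swapin_readahead(); the offset and window below are example values. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long offset = 0x1234;		/* faulting swap offset */
		unsigned long pages = 8;		/* window from swapin_nr_pages() */
		unsigned long mask = pages - 1;

		unsigned long start = offset & ~mask;	/* 0x1230 */
		unsigned long end = offset | mask;	/* 0x1237 */

		printf("read offsets 0x%lx..0x%lx around fault at 0x%lx\n",
		       start, end, offset);
		return 0;
	}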