author     Fengguang Wu <wfg@mail.ustc.edu.cn>  2007-07-19 04:48:01 -0400
committer  Linus Torvalds <torvalds@woody.linux-foundation.org>  2007-07-19 13:04:44 -0400
commit     122a21d11cbfda6d1e33cbc8ae9e4c4ee2f1886e
tree       e13f4e2dd0f838f5f922ed047e5ee56bf3546f21
parent     5ce1110b92b31d079aa443e967f43a2294e01194
readahead: on-demand readahead logic

This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance.  It is also fully integrated with adaptive readahead.

It is designed to be called on demand:
	- on a missing page, to do synchronous readahead
	- on a lookahead page, to do asynchronous readahead

In this way it eliminates the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read.  It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.

HEURISTICS

The logic deals with four cases:

	- sequential-next
		found a consistent readahead window, so push it forward

	- random
		standalone small read, so read as is

	- sequential-first
		create a new readahead window for a sequential/oversize request

	- lookahead-clueless
		hit a lookahead page not associated with the readahead window,
		so create a new readahead window and ramp it up

In each case, three parameters are determined:

	- readahead index: where the next readahead begins
	- readahead size:  how much to readahead
	- lookahead size:  when to do the next readahead (for pipelining)

BEHAVIORS

The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:

	- It no longer imposes strict sequential checks.
	  It might help some interleaved cases, and clustered random reads.
	  It does introduce the risk of a random lookahead hit triggering an
	  unexpected readahead.  But in general it is more likely to do good
	  than to do evil.

	- Interleaved reads are supported in a minimal way.
	  Their chances of being detected and properly handled are still low.

	- Readahead thrashing is better handled.
	  The current readahead leads to tiny average I/O sizes, because it
	  never turns back for the thrashed pages.  They have to be faulted in
	  by do_generic_mapping_read() one by one.  Whereas the on-demand
	  readahead will redo readahead for them.

OVERHEADS

The new code reduces the overheads of

	- excessively calling the readahead routine on small sized reads
	  (the current readahead code insists on seeing all requests)

	- doing a lot of pointless page-cache lookups for small cached files
	  (the current readahead only turns itself off after 256 cache hits;
	  unfortunately most files are < 1MB, so they never get that chance)

That accounts for speedups of

	- 0.3% on 1-page sequential reads on sparse file
	- 1.2% on 1-page cache hot sequential reads
	- 3.2% on 256-page cache hot sequential reads
	- 1.3% on cache hot `tar /lib`

However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly.  That is a 1% overhead for 1-page random reads
on sparse file.

PERFORMANCE

The basic benchmark setup is

	- 2.6.20 kernel with on-demand readahead
	- 1MB max readahead size
	- 2.9GHz Intel Core 2 CPU
	- 2GB memory
	- 160G/8M Hitachi SATA II 7200 RPM disk

The benchmarks show that

	- it maintains the same performance for trivial sequential/random reads
	- sysbench/OLTP performance on MySQL gains up to 8%
	- performance on readahead thrashing gains up to 3 times

iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k

	                         2.6.20       on-demand    gain
	first run
	  "  Initial write "   61437.27       64521.53    +5.0%
	  "        Rewrite "   47893.02       48335.20    +0.9%
	  "           Read "   62111.84       62141.49    +0.0%
	  "        Re-read "   62242.66       62193.17    -0.1%
	  "   Reverse Read "   50031.46       49989.79    -0.1%
	  "    Stride read "    8657.61        8652.81    -0.1%
	  "    Random read "   13914.28       13898.23    -0.1%
	  " Mixed workload "   19069.27       19033.32    -0.2%
	  "   Random write "   14849.80       14104.38    -5.0%
	  "         Pwrite "   62955.30       65701.57    +4.4%
	  "          Pread "   62209.99       62256.26    +0.1%

	second run
	  "  Initial write "   60810.31       66258.69    +9.0%
	  "        Rewrite "   49373.89       57833.66   +17.1%
	  "           Read "   62059.39       62251.28    +0.3%
	  "        Re-read "   62264.32       62256.82    -0.0%
	  "   Reverse Read "   49970.96       50565.72    +1.2%
	  "    Stride read "    8654.81        8638.45    -0.2%
	  "    Random read "   13901.44       13949.91    +0.3%
	  " Mixed workload "   19041.32       19092.04    +0.3%
	  "   Random write "   14019.99       14161.72    +1.0%
	  "         Pwrite "   64121.67       68224.17    +6.4%
	  "          Pread "   62225.08       62274.28    +0.1%

In summary, writes are unstable, reads are pretty close on average:

	access pattern        2.6.20       on-demand    gain
	Read                 62085.61       62196.38    +0.2%
	Re-read              62253.49       62224.99    -0.0%
	Reverse Read         50001.21       50277.75    +0.6%
	Stride read           8656.21        8645.63    -0.1%
	Random read          13907.86       13924.07    +0.1%
	Mixed workload       19055.29       19062.68    +0.0%
	Pread                62217.53       62265.27    +0.1%

aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

	              2.6.20     on-demand   delta
	sequential    92.57s      92.54s     -0.0%
	random       311.87s     312.15s     +0.1%

sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
	 --file-total-size=4G --file-block-size=64K \
	 --num-threads=001 --max-requests=10000 --max-time=900 run

	threads      2.6.20      on-demand   delta
	first run
	      1     59.1974s     59.2262s    +0.0%
	      2     58.0575s     58.2269s    +0.3%
	      4     48.0545s     47.1164s    -2.0%
	      8     41.0684s     41.2229s    +0.4%
	     16     35.8817s     36.4448s    +1.6%
	     32     32.6614s     32.8240s    +0.5%
	     64     23.7601s     24.1481s    +1.6%
	    128     24.3719s     23.8225s    -2.3%
	    256     23.2366s     22.0488s    -5.1%

	second run
	      1     59.6720s     59.5671s    -0.2%
	      8     41.5158s     41.9541s    +1.1%
	     64     25.0200s     23.9634s    -4.2%
	    256     22.5491s     20.9486s    -7.1%

Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:

	sum all up  495.046s     491.514s    -0.7%

sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
	 --mysql-socket=/var/run/mysqld/mysqld.sock \
	 --mysql-user=root --mysql-password=readahead \
	 --num-threads=064 --max-requests=10000 --max-time=900 run

	10000-transactions run
	threads      2.6.20      on-demand   gain
	      1       62.81        64.56     +2.8%
	      2       67.97        70.93     +4.4%
	      4       81.81        85.87     +5.0%
	      8       94.60        97.89     +3.5%
	     16       99.07       104.68     +5.7%
	     32       95.93       104.28     +8.7%
	     64       96.48       103.68     +7.5%

	5000-transactions run
	      1       48.21        48.65     +0.9%
	      8       68.60        70.19     +2.3%
	     64       70.57        74.72     +5.9%

	2000-transactions run
	      1       37.57        38.04     +1.3%
	      2       38.43        38.99     +1.5%
	      4       45.39        46.45     +2.3%
	      8       51.64        52.36     +1.4%
	     16       54.39        55.18     +1.5%
	     32       52.13        54.49     +4.5%
	     64       54.13        54.61     +0.9%

These are interesting results.  Some investigation shows that

	- MySQL is accessing the db file non-uniformly: some parts are
	  hotter than others

	- It is mostly doing 4-page random reads, and sometimes doing two
	  reads in a row; the latter one triggers a 16-page readahead.

	- The on-demand readahead leaves many lookahead pages (flagged
	  PG_readahead) there.  Many of them will be hit, and trigger
	  more readahead pages, which might save more seeks.

	- Naturally, the readahead windows tend to lie in hot areas,
	  and the lookahead pages in hot areas are more likely to be hit.

	- The higher the overall read density, the larger the possible gain.

That also explains the adaptive readahead tricks for clustered random reads.

readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single", and start one new 100KB/s stream
every second, until reaching 200 streams.

	              max throughput    min avg I/O size
	2.6.20:            5MB/s              16KB
	on-demand:        15MB/s             140KB

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
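
The four cases listed under HEURISTICS can be modeled in user space roughly as follows. This is only a sketch for reading along with the patch, not kernel code: `struct ra_model`, `enum ra_case` and `classify_read()` are hypothetical names, and the `lookahead_hit` flag is an assumed stand-in for "the caller passed a non-NULL page flagged PG_readahead". The branch order mirrors ondemand_readahead() in the diff.

```c
#include <stdbool.h>

/* Hypothetical user-space model; NOT the kernel's struct file_ra_state. */
struct ra_model {
	unsigned long prev_index;      /* last page of the previous read request */
	unsigned long lookahead_index; /* page that triggers async readahead */
	unsigned long readahead_index; /* first page past the readahead window */
	unsigned long max;             /* max readahead size, in pages */
};

enum ra_case {
	RA_SEQUENTIAL_NEXT,    /* push the existing window forward */
	RA_RANDOM,             /* read as is, leave readahead state alone */
	RA_SEQUENTIAL_FIRST,   /* start a new window */
	RA_LOOKAHEAD_CLUELESS, /* new window, ramped up immediately */
};

/*
 * Classify one read the way the patch orders its checks.
 * lookahead_hit == true models "page != NULL" in the kernel function.
 */
static enum ra_case classify_read(const struct ra_model *ra,
				  bool lookahead_hit,
				  unsigned long offset,
				  unsigned long req_size)
{
	/* Loose sequential test: next/same page, or an oversize request. */
	bool sequential = (offset - ra->prev_index <= 1UL) ||
			  (req_size > ra->max);

	/* Hit on the recorded lookahead or readahead index. */
	if (offset && (offset == ra->lookahead_index ||
		       offset == ra->readahead_index))
		return RA_SEQUENTIAL_NEXT;

	/* Cache miss that does not look sequential. */
	if (!lookahead_hit && !sequential)
		return RA_RANDOM;

	/* Lookahead page hit, but the readahead state does not match it. */
	if (lookahead_hit)
		return RA_LOOKAHEAD_CLUELESS;

	/* Sequential cache miss, first read, or oversize random read. */
	return RA_SEQUENTIAL_FIRST;
}
```

With a window recorded as lookahead_index = 20 and readahead_index = 30, a read at page 20 or 30 classifies as sequential-next, a 4-page cache miss far away at page 500 as random, and a lookahead-page hit at page 500 as lookahead-clueless.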
-rw-r--r--  include/linux/mm.h    6
-rw-r--r--  mm/readahead.c      174
2 files changed, 180 insertions, 0 deletions
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8e12b3b6110..619c0e80cf0c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1138,6 +1138,12 @@ int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
+unsigned long page_cache_readahead_ondemand(struct address_space *mapping,
+			  struct file_ra_state *ra,
+			  struct file *filp,
+			  struct page *page,
+			  pgoff_t offset,
+			  unsigned long size);
 unsigned long page_cache_readahead(struct address_space *mapping,
 			  struct file_ra_state *ra,
 			  struct file *filp,
diff --git a/mm/readahead.c b/mm/readahead.c
index 072ce8f8357d..c094e4f5a250 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -611,3 +611,177 @@ unsigned long ra_submit(struct file_ra_state *ra,
 	return actual;
 }
 EXPORT_SYMBOL_GPL(ra_submit);
+
+/*
+ * Get the previous window size, ramp it up, and
+ * return it as the new window size.
+ */
+static unsigned long get_next_ra_size2(struct file_ra_state *ra,
+						unsigned long max)
+{
+	unsigned long cur = ra->readahead_index - ra->ra_index;
+	unsigned long newsize;
+
+	if (cur < max / 16)
+		newsize = cur * 4;
+	else
+		newsize = cur * 2;
+
+	return min(newsize, max);
+}
+
+/*
+ * On-demand readahead design.
+ *
+ * The fields in struct file_ra_state represent the most-recently-executed
+ * readahead attempt:
+ *
+ *                    |-------- last readahead window -------->|
+ *       |-- application walking here -->|
+ * ======#============|==================#=====================|
+ *       ^la_index    ^ra_index          ^lookahead_index      ^readahead_index
+ *
+ * [ra_index, readahead_index) represents the last readahead window.
+ *
+ * [la_index, lookahead_index] is where the application would be walking(in
+ * the common case of cache-cold sequential reads): the last window was
+ * established when the application was at la_index, and the next window will
+ * be bring in when the application reaches lookahead_index.
+ *
+ * To overlap application thinking time and disk I/O time, we do
+ * `readahead pipelining': Do not wait until the application consumed all
+ * readahead pages and stalled on the missing page at readahead_index;
+ * Instead, submit an asynchronous readahead I/O as early as the application
+ * reads on the page at lookahead_index.  Normally lookahead_index will be
+ * equal to ra_index, for maximum pipelining.
+ *
+ * In interleaved sequential reads, concurrent streams on the same fd can
+ * be invalidating each other's readahead state. So we flag the new readahead
+ * page at lookahead_index with PG_readahead, and use it as readahead
+ * indicator. The flag won't be set on already cached pages, to avoid the
+ * readahead-for-nothing fuss, saving pointless page cache lookups.
+ *
+ * prev_index tracks the last visited page in the _previous_ read request.
+ * It should be maintained by the caller, and will be used for detecting
+ * small random reads. Note that the readahead algorithm checks loosely
+ * for sequential patterns. Hence interleaved reads might be served as
+ * sequential ones.
+ *
+ * There is a special-case: if the first page which the application tries to
+ * read happens to be the first page of the file, it is assumed that a linear
+ * read is about to happen and the window is immediately set to the initial size
+ * based on I/O request size and the max_readahead.
+ *
+ * The code ramps up the readahead size aggressively at first, but slow down as
+ * it approaches max_readhead.
+ */
+
+/*
+ * A minimal readahead algorithm for trivial sequential/random reads.
+ */
+static unsigned long
+ondemand_readahead(struct address_space *mapping,
+		   struct file_ra_state *ra, struct file *filp,
+		   struct page *page, pgoff_t offset,
+		   unsigned long req_size)
+{
+	unsigned long max;	/* max readahead pages */
+	pgoff_t ra_index;	/* readahead index */
+	unsigned long ra_size;	/* readahead size */
+	unsigned long la_size;	/* lookahead size */
+	int sequential;
+
+	max = ra->ra_pages;
+	sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
+
+	/*
+	 * Lookahead/readahead hit, assume sequential access.
+	 * Ramp up sizes, and push forward the readahead window.
+	 */
+	if (offset && (offset == ra->lookahead_index ||
+			offset == ra->readahead_index)) {
+		ra_index = ra->readahead_index;
+		ra_size = get_next_ra_size2(ra, max);
+		la_size = ra_size;
+		goto fill_ra;
+	}
+
+	/*
+	 * Standalone, small read.
+	 * Read as is, and do not pollute the readahead state.
+	 */
+	if (!page && !sequential) {
+		return __do_page_cache_readahead(mapping, filp,
+						offset, req_size, 0);
+	}
+
+	/*
+	 * It may be one of
+	 * 	- first read on start of file
+	 * 	- sequential cache miss
+	 * 	- oversize random read
+	 * Start readahead for it.
+	 */
+	ra_index = offset;
+	ra_size = get_init_ra_size(req_size, max);
+	la_size = ra_size > req_size ? ra_size - req_size : ra_size;
+
+	/*
+	 * Hit on a lookahead page without valid readahead state.
+	 * E.g. interleaved reads.
+	 * Not knowing its readahead pos/size, bet on the minimal possible one.
+	 */
+	if (page) {
+		ra_index++;
+		ra_size = min(4 * ra_size, max);
+	}
+
+fill_ra:
+	ra_set_index(ra, offset, ra_index);
+	ra_set_size(ra, ra_size, la_size);
+
+	return ra_submit(ra, mapping, filp);
+}
+
+/**
+ * page_cache_readahead_ondemand - generic file readahead
+ * @mapping: address_space which holds the pagecache and I/O vectors
+ * @ra: file_ra_state which holds the readahead state
+ * @filp: passed on to ->readpage() and ->readpages()
+ * @page: the page at @offset, or NULL if non-present
+ * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
+ * @req_size: hint: total size of the read which the caller is performing in
+ *            PAGE_CACHE_SIZE units
+ *
+ * page_cache_readahead_ondemand() is the entry point of readahead logic.
+ * This function should be called when it is time to perform readahead:
+ * 1) @page == NULL
+ *    A cache miss happened, time for synchronous readahead.
+ * 2) @page != NULL && PageReadahead(@page)
+ *    A look-ahead hit occured, time for asynchronous readahead.
+ */
+unsigned long
+page_cache_readahead_ondemand(struct address_space *mapping,
+				struct file_ra_state *ra, struct file *filp,
+				struct page *page, pgoff_t offset,
+				unsigned long req_size)
+{
+	/* no read-ahead */
+	if (!ra->ra_pages)
+		return 0;
+
+	if (page) {
+		ClearPageReadahead(page);
+
+		/*
+		 * Defer asynchronous read-ahead on IO congestion.
+		 */
+		if (bdi_read_congested(mapping->backing_dev_info))
+			return 0;
+	}
+
+	/* do read-ahead */
+	return ondemand_readahead(mapping, ra, filp, page,
+				offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_readahead_ondemand);
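
To see how quickly the window grows under the ramp-up rule in get_next_ra_size2(), the rule can be replayed in user space. The sketch below is illustrative only: `next_ra_size()` and `ramp_up()` are hypothetical helpers, `next_ra_size()` takes the current window size directly instead of deriving it from struct file_ra_state, and the 4-page starting window is an assumption (the real initial size comes from get_init_ra_size(), which is not part of this diff).

```c
/*
 * Same growth rule as get_next_ra_size2() in the patch:
 * quadruple while the window is small (below max/16), otherwise double,
 * and never exceed max.  Sizes are in pages.
 */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	unsigned long newsize;

	if (cur < max / 16)
		newsize = cur * 4;
	else
		newsize = cur * 2;

	return newsize < max ? newsize : max;
}

/*
 * Hypothetical helper: record successive window sizes, starting at
 * `start`, until the window reaches `max` or `n` entries are written.
 * Returns the number of entries stored in sizes[].
 */
static int ramp_up(unsigned long start, unsigned long max,
		   unsigned long *sizes, int n)
{
	unsigned long cur = start;
	int i = 0;

	while (i < n) {
		sizes[i++] = cur;
		if (cur >= max)
			break;
		cur = next_ra_size(cur, max);
	}
	return i;
}
```

Assuming 4KB pages, the 1MB max readahead size from the benchmark setup corresponds to max = 256 pages. Starting from the assumed 4-page window, ramp_up(4, 256, sizes, 8) stores 4, 16, 32, 64, 128, 256: one quadrupling step, then doubling until the cap is hit after five pushes of the window.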