author | Robin Dong <sanbai@taobao.com> | 2012-10-08 19:29:05 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2012-10-09 03:22:19 -0400 |
commit | d741c9cdeee6a569dae0dbbaf028065402955b59 (patch) | |
tree | acdff4cf25911e5ba8294822bf2e8c74b4d67974 /mm | |
parent | 314e51b9851b4f4e8ab302243ff5a6fc6147f379 (diff) |
mm: fix nonuniform page status when writing new file with small buffer
When writing a new file with a 2048-byte buffer, such as write(fd, buffer,
2048), generic_perform_write() is called twice for every page:
write_begin
mark_page_accessed(page)
write_end
write_begin
mark_page_accessed(page)
write_end
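The write pattern described above can be sketched from user space roughly as follows. This is only an illustration, not part of the patch: the helper name write_in_half_pages() and its parameters are hypothetical, and it assumes the page size is a multiple of 2048 bytes.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 2048	/* half of a 4 KiB page, as in the commit message */

/*
 * Hypothetical helper: write `pages` pages of zeroes to `path` using
 * CHUNK-byte write() calls, so a 4 KiB page is touched by two writes.
 * Returns the number of write() calls issued, or -1 on error.
 */
static long write_in_half_pages(const char *path, long pages, long page_size)
{
	char buffer[CHUNK];
	long total = pages * page_size;
	long calls = 0;
	int fd;

	assert(page_size % CHUNK == 0);	/* assumption stated in the text */

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	memset(buffer, 0, sizeof(buffer));
	for (long done = 0; done < total; done += CHUNK) {
		if (write(fd, buffer, CHUNK) != CHUNK) {
			close(fd);
			return -1;
		}
		calls++;
	}
	close(fd);
	return calls;
}
```

With a 4096-byte page size, writing 14 pages this way issues 28 write() calls, two per page.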
Pages 1-13 are added to the lru-pvecs in write_begin() and are *NOT* moved
to the active_list, even though they have been accessed twice, because they
are not yet PageLRU(page). But when the 14th page arrives, all pages in the
lru-pvecs are moved to the inactive_list (by __lru_cache_add()) in the first
write_begin(), so the 14th page *is* PageLRU(page). After the second
write_end(), only the 14th page is on the active_list.
In a Hadoop environment we do hit this situation: after writing a file, we
find that only the 14th, 28th, 42nd... pages are on the active_list and the
others are on the inactive_list. When kswapd runs and shrinks the
inactive_list, only the 14th, 28th... pages of the file remain in memory,
readahead requests are broken up into chunks of only 52k (13*4k), and
system performance falls dramatically.
This problem can also be reproduced with the steps below (the machine has 8G of memory):
1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
2. cat another 7.5G file to /dev/null
3. vmtouch -m 1G -v /test/file.out, it will show:
/test/file.out
[oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144
An 'o' means that some pages are in memory but some are not.
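For readers without vmtouch, a similar residency count can be obtained with mincore(2). The sketch below is an assumption-laden illustration for Linux: resident_pages() is a hypothetical helper name, and it only reports how many pages of a file are currently in the page cache, not which LRU list they are on.

```c
#define _DEFAULT_SOURCE	/* for the mincore() declaration on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of `path` are resident in
 * the page cache. Stores the file's page count in *total and returns
 * the resident count, or -1 on error (including an empty file).
 */
static long resident_pages(const char *path, long *total)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long resident = 0;
	struct stat st;
	unsigned char *vec;
	void *map;
	int fd;

	assert(page_size > 0);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}
	*total = (st.st_size + page_size - 1) / page_size;

	/* Mapping the file does not fault its pages in by itself. */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	vec = malloc(*total);
	if (!vec || mincore(map, st.st_size, vec) < 0) {
		free(vec);
		munmap(map, st.st_size);
		return -1;
	}
	for (long i = 0; i < *total; i++)
		resident += vec[i] & 1;	/* bit 0 set: page is resident */

	free(vec);
	munmap(map, st.st_size);
	return resident;
}
```

Calling this on /test/file.out after step 2 should give a resident/total pair comparable to the one vmtouch prints.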
The solution for this problem is simple: the 14th page should be added to
lru_add_pvecs before mark_page_accessed(), just like the other pages.
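The effect of the ordering change can be modelled in user space. The following is a simplified simulation, not kernel code: struct sim and every function in it are hypothetical stand-ins, pages are plain integers, PAGEVEC_SIZE is assumed to be 14 as the commit message implies, and mark_accessed() keeps only the "referenced and on the LRU" condition of mark_page_accessed().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGEVEC_SIZE 14	/* assumed pagevec capacity, per the commit message */
#define MAX_PAGES 64

struct sim {
	int pvec[PAGEVEC_SIZE];	/* pages buffered but not yet on the LRU */
	int nr;
	bool on_lru[MAX_PAGES];
	bool referenced[MAX_PAGES];
	bool active[MAX_PAGES];
};

/* Move every buffered page onto the (inactive) LRU. */
static void flush(struct sim *s)
{
	for (int i = 0; i < s->nr; i++)
		s->on_lru[s->pvec[i]] = true;
	s->nr = 0;
}

/* Old order: add first, then flush if that filled the pagevec. */
static void lru_cache_add_old(struct sim *s, int page)
{
	s->pvec[s->nr++] = page;
	if (s->nr == PAGEVEC_SIZE)
		flush(s);
}

/* New order: flush if already full, then add. */
static void lru_cache_add_new(struct sim *s, int page)
{
	if (s->nr == PAGEVEC_SIZE)
		flush(s);
	s->pvec[s->nr++] = page;
}

/* A second reference activates a page only if it is already on the LRU. */
static void mark_accessed(struct sim *s, int page)
{
	if (s->referenced[page] && s->on_lru[page] && !s->active[page])
		s->active[page] = true;
	else
		s->referenced[page] = true;
}

/*
 * Write `pages` new pages, two sub-page writes each. Returns how many
 * pages end up active; *first_active is the 1-based index of the first
 * activated page, or -1 if none was activated.
 */
static int count_activated(int pages, bool fixed, int *first_active)
{
	struct sim s;
	int count = 0;

	assert(pages <= MAX_PAGES);
	memset(&s, 0, sizeof(s));
	*first_active = -1;

	for (int page = 0; page < pages; page++) {
		if (fixed)
			lru_cache_add_new(&s, page);
		else
			lru_cache_add_old(&s, page);
		mark_accessed(&s, page);	/* first 2048-byte write */
		mark_accessed(&s, page);	/* second 2048-byte write */
		if (s.active[page]) {
			if (*first_active < 0)
				*first_active = page + 1;
			count++;
		}
	}
	return count;
}
```

In this model, writing 42 pages with the old ordering activates exactly three pages (the 14th, 28th and 42nd), while the new ordering activates none, so all pages of the file are treated uniformly.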
[akpm@linux-foundation.org: tweak comment]
[akpm@linux-foundation.org: grab better comment from the v3 patch]
Signed-off-by: Robin Dong <sanbai@taobao.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm')
-rw-r--r-- | mm/swap.c | 11 |
1 files changed, 10 insertions, 1 deletions
@@ -446,13 +446,22 @@ void mark_page_accessed(struct page *page)
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
+/*
+ * Order of operations is important: flush the pagevec when it's already
+ * full, not when adding the last page, to make sure that last page is
+ * not added to the LRU directly when passed to this function. Because
+ * mark_page_accessed() (called after this when writing) only activates
+ * pages that are on the LRU, linear writes in subpage chunks would see
+ * every PAGEVEC_SIZE page activated, which is unexpected.
+ */
 void __lru_cache_add(struct page *page, enum lru_list lru)
 {
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
 
 	page_cache_get(page);
-	if (!pagevec_add(pvec, page))
+	if (!pagevec_space(pvec))
 		__pagevec_lru_add(pvec, lru);
+	pagevec_add(pvec, page);
 	put_cpu_var(lru_add_pvecs);
 }
 EXPORT_SYMBOL(__lru_cache_add);