aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@macmini.osdl.org>2006-12-29 13:00:58 -0500
committerLinus Torvalds <torvalds@macmini.osdl.org>2006-12-29 13:00:58 -0500
commit7658cc289288b8ae7dd2c2224549a048431222b3 (patch)
treebbd697764987e6eaf9cb8c49af3524d6a579dfdc
parent3bf8ba38f38d3647368e4edcf7d019f9f8d9184a (diff)
VM: Fix nasty and subtle race in shared mmap'ed page writeback
The VM layer (on the face of it, fairly reasonably) expected that when it does a ->writepage() call to the filesystem, it would write out the full page at that point in time. Especially since it had earlier marked the whole page dirty with "set_page_dirty()". But that isn't actually the case: ->writepage() does not actually write a page, it writes the parts of the page that have been explicitly marked dirty before, *and* that had not got written out for other reasons since the last time we told it they were dirty. That last caveat is the important one. Which _most_ of the time ends up being the whole page (since we had called "set_page_dirty()" on the page earlier), but if the filesystem had done any dirty flushing of its own (for example, to honor some internal write ordering guarantees), it might end up doing only a partial page IO (or none at all) when ->writepage() is actually called. That is the correct thing in general (since we actually often _want_ only the known-dirty parts of the page to be written out), but the shared dirty page handling had implicitly forgotten about these details, and had a number of cases where it was doing just the "->writepage()" part, without telling the low-level filesystem that the whole page might have been re-dirtied as part of being mapped writably into user space. Since most of the time the FS did actually write out the full page, we didn't notice this for a loong time, and this needed some really odd patterns to trigger. But it caused occasional corruption with rtorrent and with the Debian "apt" database, because both use shared mmaps to update the end result. This fixes it. Finally. After way too much hair-pulling. Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Martin J. Bligh <mbligh@google.com> Acked-by: Martin Michlmayr <tbm@cyrius.com> Acked-by: Martin Johansson <martin@fatbob.nu> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Andrei Popa <andrei.popa@i-neo.ro> Cc: High Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@osdl.org>, Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Gordon Farquharson <gordonfarquharson@gmail.com> Cc: Guillaume Chazarain <guichaz@yahoo.fr> Cc: Theodore Tso <tytso@mit.edu> Cc: Kenneth Cheng <kenneth.w.chen@intel.com> Cc: Tobias Diedrich <ranma@tdiedrich.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-rw-r--r--mm/page-writeback.c45
1 files changed, 37 insertions, 8 deletions
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c9248d..1d2fc89ca56d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
862{ 862{
863 struct address_space *mapping = page_mapping(page); 863 struct address_space *mapping = page_mapping(page);
864 864
865 if (!mapping) 865 if (mapping && mapping_cap_account_dirty(mapping)) {
866 return TestClearPageDirty(page); 866 /*
867 867 * Yes, Virginia, this is indeed insane.
868 if (TestClearPageDirty(page)) { 868 *
869 if (mapping_cap_account_dirty(mapping)) { 869 * We use this sequence to make sure that
870 page_mkclean(page); 870 * (a) we account for dirty stats properly
871 * (b) we tell the low-level filesystem to
872 * mark the whole page dirty if it was
873 * dirty in a pagetable. Only to then
874 * (c) clean the page again and return 1 to
875 * cause the writeback.
876 *
877 * This way we avoid all nasty races with the
878 * dirty bit in multiple places and clearing
879 * them concurrently from different threads.
880 *
881 * Note! Normally the "set_page_dirty(page)"
882 * has no effect on the actual dirty bit - since
883 * that will already usually be set. But we
884 * need the side effects, and it can help us
885 * avoid races.
886 *
887 * We basically use the page "master dirty bit"
888 * as a serialization point for all the different
889 * threads doing their things.
890 *
891 * FIXME! We still have a race here: if somebody
892 * adds the page back to the page tables in
893 * between the "page_mkclean()" and the "TestClearPageDirty()",
894 * we might have it mapped without the dirty bit set.
895 */
896 if (page_mkclean(page))
897 set_page_dirty(page);
898 if (TestClearPageDirty(page)) {
871 dec_zone_page_state(page, NR_FILE_DIRTY); 899 dec_zone_page_state(page, NR_FILE_DIRTY);
900 return 1;
872 } 901 }
873 return 1; 902 return 0;
874 } 903 }
875 return 0; 904 return TestClearPageDirty(page);
876} 905}
877EXPORT_SYMBOL(clear_page_dirty_for_io); 906EXPORT_SYMBOL(clear_page_dirty_for_io);
878 907