author     Nick Piggin <npiggin@suse.de>                    2008-06-23 08:30:30 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>  2008-06-23 14:28:32 -0400
commit     945754a1754f9d4c2974a8241ad4f92fad7f3a6a (patch)
tree       d310dc918d8094a1a6e00e15b24f7953616d7a82 /mm/memory.c
parent     672ca28e300c17bf8d792a2a7a8631193e580c74 (diff)
mm: fix race in COW logic
There is a race in the COW logic.  It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page,
however it is possible to have two racing do_wp_page()ers with one
causing the other to mistakenly believe it is safe to take the shortcut
when it is not.  This could lead to data corruption.

Process 1 and process 2 each have a wp pte of the same anon page (ie.
one forked the other).  The page's mapcount is 2.  Then they both
attempt to write to it around the same time...

    proc1 (CPU0)                     proc2 thr1 (CPU1)             proc2 thr2 (CPU3)

    do_wp_page()                     do_wp_page()
                                     trylock_page()
                                     can_share_swap_page()
                                       load page mapcount (==2)
                                     reuse = 0
                                     pte unlock
                                     copy page to new_page
                                     pte lock
                                     page_remove_rmap(page);
    trylock_page()
    can_share_swap_page()
      load page mapcount (==1)
    reuse = 1
    ptep_set_access_flags (allow W)

    write private key into page                                    read from page
                                     ptep_clear_flush()
                                     set_pte_at(pte of new_page)

Fix this by moving the page_remove_rmap of the old page after the pte
clear and flush.  Potentially the entire branch could be moved down
here, but in order to stay consistent, I won't (should probably move
all the *_mm_counter stuff with one patch).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
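To make the window concrete, the buggy ordering can be modelled in plain
user-space C. This is a minimal sketch, not kernel code: mapcount,
pte_proc2, proc2_cow() and proc1_write() are hypothetical stand-ins for
the page mapcount, proc2's pte, and the two do_wp_page() paths, and the
bad interleaving is only probabilistic.

/* Build: cc -pthread cow_race.c -o cow_race */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static char page_data[64];                  /* the shared anon page          */
static atomic_int mapcount = 2;             /* two mappings, as after fork() */
static char *_Atomic pte_proc2 = page_data; /* proc2's "pte"                 */

/* proc2 thr1: the COW path, with the BUGGY ordering. */
static void *proc2_cow(void *arg)
{
	static char new_page[64];
	(void)arg;

	atomic_fetch_sub(&mapcount, 1);     /* page_remove_rmap() runs first...  */
	/* ...window: proc1 can now see mapcount == 1 and reuse the old page
	 * while pte_proc2 still points at it and other threads can read it.    */
	atomic_store(&pte_proc2, new_page); /* set_pte_at(pte of new_page)       */
	return NULL;
}

/* proc1: the reuse shortcut. */
static void *proc1_write(void *arg)
{
	(void)arg;
	if (atomic_load(&mapcount) == 1)    /* can_share_swap_page() sees 1      */
		page_data[0] = 'X';         /* reuse = 1: "write private key"    */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, proc2_cow, NULL);
	pthread_create(&t2, NULL, proc1_write, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* If the write hit the window, proc2 thr2 could have read 'X' through
	 * its still-live mapping of the old page.                              */
	printf("old page byte: %c\n", page_data[0] ? page_data[0] : '-');
	return 0;
}

Nothing in proc1_write() ever looks at pte_proc2, which is the point: the
only thing guarding the shortcut is mapcount, so it must not drop to 1
while a live mapping of the old page remains.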
Diffstat (limited to 'mm/memory.c')
-rw-r--r--  mm/memory.c | 27
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 423e0e7c2f73..d14b251a25a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1785,7 +1785,6 @@ gotten:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
-			page_remove_rmap(old_page, vma);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
@@ -1807,6 +1806,32 @@ gotten:
 	lru_cache_add_active(new_page);
 	page_add_new_anon_rmap(new_page, vma, address);
 
+	if (old_page) {
+		/*
+		 * Only after switching the pte to the new page may
+		 * we remove the mapcount here. Otherwise another
+		 * process may come and find the rmap count decremented
+		 * before the pte is switched to the new page, and
+		 * "reuse" the old page writing into it while our pte
+		 * here still points into it and can be read by other
+		 * threads.
+		 *
+		 * The critical issue is to order this
+		 * page_remove_rmap with the ptp_clear_flush above.
+		 * Those stores are ordered by (if nothing else,)
+		 * the barrier present in the atomic_add_negative
+		 * in page_remove_rmap.
+		 *
+		 * Then the TLB flush in ptep_clear_flush ensures that
+		 * no process can access the old page before the
+		 * decremented mapcount is visible. And the old page
+		 * cannot be reused until after the decremented
+		 * mapcount is visible. So transitively, TLBs to
+		 * old page will be flushed before it can be reused.
+		 */
+		page_remove_rmap(old_page, vma);
+	}
+
 	/* Free the old page.. */
 	new_page = old_page;
 	ret |= VM_FAULT_WRITE;
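In terms of the hypothetical user-space model above, the patch amounts to
swapping the two statements in proc2_cow(): publish the new mapping first
(in the kernel, ptep_clear_flush() has also shot down stale TLB entries
by that point), and only then drop the old mapcount:

/* Corrected ordering for the sketch above (mirrors the patch). */
static void *proc2_cow_fixed(void *arg)
{
	static char new_page[64];
	(void)arg;

	atomic_store(&pte_proc2, new_page); /* ptep_clear_flush() + set_pte_at() */
	atomic_fetch_sub(&mapcount, 1);     /* page_remove_rmap() now runs last  */
	return NULL;
}

In the model, the default sequentially consistent atomics order the two
stores; that is the role the comment assigns to the barrier in
atomic_add_negative() inside page_remove_rmap().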