author	Andrea Arcangeli <andrea@qumranet.com>	2008-07-28 18:46:29 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2008-07-28 19:30:21 -0400
commit	cddb8a5c14aa89810b40495d94d3d2a0faee6619
tree	d0b47b071f7d2dd1d6f9c36084aa8cfcef90d1da
parent	7906d00cd1f687268f0a3599442d113767795ae6
mmu-notifiers: core
With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.  There are secondary MMUs (with secondary sptes and secondary tlbs) too.  sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte".  In the GRU case there's no actual secondary pte and there's only a secondary tlb, because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case with KVM, where a secondary tlb miss will walk sptes in hardware and will refill the secondary tlb transparently to software if the corresponding spte is present).  Just as zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte, because they're part of the guest working set.  Furthermore a spte unmap event can immediately lead to a page being freed when the pin is released (requiring the same complex and relatively slow tlb_gather SMP-safe logic we have in zap_page_range, which can be avoided completely if the spte unmap event doesn't require unpinning the page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU, so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of the guest physical address space.  Furthermore it spares the code that tears down the secondary MMU mappings from implementing tlb_gather-like logic as in zap_page_range, which would require many IPIs to flush the other CPUs' tlbs for each fixed number of sptes unmapped.

To give an example: if what happens on the primary MMU is a protection downgrade (from writable to write-protected), the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages; that will trigger a do_wp_page if get_user_pages was called with write=1, and it'll re-establish an updated spte or secondary-tlb-mapping on the copied page.  Or it will set up a readonly spte or readonly tlb mapping if it's a guest read, i.e. if it calls get_user_pages with write=0.  This is just an example.

This allows any page pointed to by any pte (and in turn visible in the primary CPU MMU) to be mapped into a secondary MMU (be it a pure tlb like GRU, or a full MMU with both sptes and a secondary tlb like the shadow-pagetable layer with kvm), or into a remote DMA in software like XPMEM (hence the need to schedule in the XPMEM code to send the invalidate to the remote node, while there's no need to schedule in kvm/gru as it's an immediate event like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests reliably.  And having this feature and removing the page pin allows several other optimizations that simplify life considerably.

Dependencies:

1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm".  This allows mmu notifier users to keep track of whether the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter increased in range_begin and decreased in range_end (see the sketch after the changelog below).
No secondary MMU page fault is allowed to map any spte or secondary tlb reference while the VM is in the middle of range_begin/end, as any page returned by get_user_pages in that critical section could later be freed immediately without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page).  To stop all page freeing and pagetable overwrites, the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too.

2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows compiling an external KVM module against a kernel with mmu notifiers enabled, and from the next pull from kvm.git we'll start using them.  GRU/XPMEM will also be able to continue development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can also enable MMU_NOTIFIER in the same way KVM does (even if KVM=n).  This guarantees nobody selects MMU_NOTIFIER=y if KVM, GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR.  Because mmu_notifier_register is called during driver startup, a failure can be handled gracefully.  Here is an example of the change applied to kvm to register the mmu notifiers.  Usually when a driver starts up, other allocations are required anyway and -ENOMEM failure paths already exist.

 struct kvm *kvm_arch_create_vm(void)
 {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+       int err;

        if (!kvm)
                return ERR_PTR(-ENOMEM);

        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+       if (err) {
+               kfree(kvm);
+               return ERR_PTR(err);
+       }
+
        return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

The patch also adds a few needed but missing includes that would otherwise prevent the kernel from compiling after these changes on non-x86 archs (x86 didn't need them by luck).

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
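To illustrate dependency 1) above: here is a minimal, hypothetical sketch (not part of this patch) of how a secondary-MMU driver could track the range_begin/end critical section with an atomic counter and refuse to establish sptes from its fault path while the counter is elevated.  Only the mmu_notifier_ops callbacks and their signatures come from this patch; all my_* names, the idea of zapping sptes in range_start, and the -EAGAIN retry are assumptions for illustration (a real driver would also need locking to close the window between the counter check and the spte install).

 #include <linux/mmu_notifier.h>
 #include <asm/atomic.h>

 /* Hypothetical driver hooks assumed to exist elsewhere in the driver: */
 void my_drop_sptes(struct mm_struct *mm, unsigned long start, unsigned long end);
 int my_map_spte(struct mm_struct *mm, unsigned long address);

 static atomic_t my_range_count = ATOMIC_INIT(0);  /* nonzero while inside range_begin/end */

 static void my_invalidate_range_start(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
 {
        atomic_inc(&my_range_count);            /* entering the critical section */
        my_drop_sptes(mm, start, end);          /* hypothetical: zap secondary mappings/tlb */
 }

 static void my_invalidate_range_end(struct mmu_notifier *mn,
                                     struct mm_struct *mm,
                                     unsigned long start, unsigned long end)
 {
        atomic_dec(&my_range_count);            /* leaving the critical section */
 }

 static const struct mmu_notifier_ops my_mmu_notifier_ops = {
        .invalidate_range_start = my_invalidate_range_start,
        .invalidate_range_end   = my_invalidate_range_end,
 };

 /*
  * Hypothetical secondary-MMU fault path: a page returned by get_user_pages
  * may only be mapped by an spte if no range invalidate is in progress.
  */
 static int my_secondary_fault(struct mm_struct *mm, unsigned long address)
 {
        if (atomic_read(&my_range_count))
                return -EAGAIN;                 /* retry after range_end */
        return my_map_spte(mm, address);        /* hypothetical */
 }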
Diffstat (limited to 'mm/memory.c')
-rw-r--r--	mm/memory.c	35
1 file changed, 29 insertions, 6 deletions
diff --git a/mm/memory.c b/mm/memory.c
index a8ca04faaea6..67f0ab9077d9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -652,6 +653,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
        unsigned long next;
        unsigned long addr = vma->vm_start;
        unsigned long end = vma->vm_end;
+       int ret;
 
        /*
         * Don't copy ptes where a page fault will fill them correctly.
@@ -667,17 +669,33 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
        if (is_vm_hugetlb_page(vma))
                return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+       /*
+        * We need to invalidate the secondary MMU mappings only when
+        * there could be a permission downgrade on the ptes of the
+        * parent mm. And a permission downgrade will only happen if
+        * is_cow_mapping() returns true.
+        */
+       if (is_cow_mapping(vma->vm_flags))
+               mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
+       ret = 0;
        dst_pgd = pgd_offset(dst_mm, addr);
        src_pgd = pgd_offset(src_mm, addr);
        do {
                next = pgd_addr_end(addr, end);
                if (pgd_none_or_clear_bad(src_pgd))
                        continue;
-               if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-                                               vma, addr, next))
-                       return -ENOMEM;
+               if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+                                           vma, addr, next))) {
+                       ret = -ENOMEM;
+                       break;
+               }
        } while (dst_pgd++, src_pgd++, addr = next, addr != end);
-       return 0;
+
+       if (is_cow_mapping(vma->vm_flags))
+               mmu_notifier_invalidate_range_end(src_mm,
+                                                 vma->vm_start, end);
+       return ret;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
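The copy_page_range() hunk above follows the general pattern this patch applies to every pte-modifying path in mm/memory.c: bracket the primary-MMU update with the range notifiers so secondary MMUs can drop their sptes and secondary tlb entries before any affected page can be freed or have its protection downgraded.  A minimal caller-side sketch of that contract follows; my_modify_ptes() is a hypothetical stand-in for the actual pte work, not a function from this patch.

 #include <linux/mmu_notifier.h>

 /* Hypothetical: whatever clears/downgrades ptes and flushes the primary TLB. */
 void my_modify_ptes(struct mm_struct *mm, unsigned long start, unsigned long end);

 static void my_update_range(struct mm_struct *mm,
                             unsigned long start, unsigned long end)
 {
        mmu_notifier_invalidate_range_start(mm, start, end);    /* secondary MMUs drop mappings */
        my_modify_ptes(mm, start, end);
        mmu_notifier_invalidate_range_end(mm, start, end);      /* secondary faults may map again */
 }

As in the hunk above, callers that cannot cause a permission downgrade (e.g. non-COW mappings in copy_page_range()) can skip the bracketing entirely.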
@@ -881,7 +899,9 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
        unsigned long start = start_addr;
        spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
        int fullmm = (*tlbp)->fullmm;
+       struct mm_struct *mm = vma->vm_mm;
 
+       mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
        for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
                unsigned long end;
 
@@ -946,6 +966,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
                }
        }
 out:
+       mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
        return start;   /* which is now the end (or restart) address */
 }
 
@@ -1616,10 +1637,11 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 {
        pgd_t *pgd;
        unsigned long next;
-       unsigned long end = addr + size;
+       unsigned long start = addr, end = addr + size;
        int err;
 
        BUG_ON(addr >= end);
+       mmu_notifier_invalidate_range_start(mm, start, end);
        pgd = pgd_offset(mm, addr);
        do {
                next = pgd_addr_end(addr, end);
@@ -1627,6 +1649,7 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
                if (err)
                        break;
        } while (pgd++, addr = next, addr != end);
+       mmu_notifier_invalidate_range_end(mm, start, end);
        return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1839,7 +1862,7 @@ gotten:
         * seen in the presence of one thread doing SMC and another
         * thread doing COW.
         */
-       ptep_clear_flush(vma, address, page_table);
+       ptep_clear_flush_notify(vma, address, page_table);
        set_pte_at(mm, address, page_table, entry);
        update_mmu_cache(vma, address, entry);
        lru_cache_add_active(new_page);
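For reference, the ptep_clear_flush_notify() used in the do_wp_page() hunk above is a helper added elsewhere in this series (include/linux/mmu_notifier.h).  Paraphrased, not the verbatim definition, it behaves roughly like the sketch below: clear the pte and flush the primary TLB, then tell secondary MMUs to invalidate their mapping of that single page.

 #include <linux/mm.h>
 #include <linux/mmu_notifier.h>

 /* Rough, paraphrased equivalent of ptep_clear_flush_notify() (sketch only). */
 static inline pte_t my_ptep_clear_flush_notify(struct vm_area_struct *vma,
                                                unsigned long address, pte_t *ptep)
 {
        pte_t pte;

        pte = ptep_clear_flush(vma, address, ptep);             /* clear pte, flush primary TLB */
        mmu_notifier_invalidate_page(vma->vm_mm, address);      /* drop sptes/secondary TLB for this page */
        return pte;
 }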