| author | Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> | 2012-06-20 04:00:26 -0400 |
|---|---|---|
| committer | Avi Kivity <avi@redhat.com> | 2012-07-11 09:51:23 -0400 |
| commit | 58d8b1728ea3da391ef01c43a384ea06ce4b7c8a (patch) | |
| tree | d63f20f37df1672b7d9e25733a24d4a1770cf09e /Documentation/virtual | |
| parent | 6fbc277053836a4d80c72a0843bcbc7595b31e87 (diff) | |
KVM: MMU: document mmu-lock and fast page fault
Document fast page fault and mmu-lock in locking.txt
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Diffstat (limited to 'Documentation/virtual')
| -rw-r--r-- | Documentation/virtual/kvm/locking.txt | 130 |
1 files changed, 129 insertions, 1 deletions
diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt index 3b4cd3bf5631..41b7ac9884b5 100644 --- a/Documentation/virtual/kvm/locking.txt +++ b/Documentation/virtual/kvm/locking.txt | |||
| @@ -6,7 +6,129 @@ KVM Lock Overview | |||
| 6 | 6 | ||
| 7 | (to be written) | 7 | (to be written) |
| 8 | 8 | ||
| 9 | 2. Reference | 9 | 2: Exception |
| 10 | ------------ | ||
| 11 | |||
| 12 | Fast page fault: | ||
| 13 | |||
| 14 | Fast page fault is the fast path which fixes the guest page fault outside of | |
| 15 | the mmu-lock on x86. Currently, a page fault can take the fast path only if the | |
| 16 | shadow page table is present and the fault is caused by write-protection, which | |
| 17 | means we just need to change the W bit of the spte. | |
| 18 | |||
| 19 | To avoid all the races, we use the SPTE_HOST_WRITEABLE bit and the | |
| 20 | SPTE_MMU_WRITEABLE bit on the spte: | |
| 21 | - SPTE_HOST_WRITEABLE means the gfn is writable on the host. | |
| 22 | - SPTE_MMU_WRITEABLE means the gfn is writable on the mmu. The bit is set when | |
| 23 | the gfn is writable on the guest mmu and it is not write-protected by shadow | |
| 24 | page write-protection. | |
| 25 | |||
| 26 | On the fast page fault path, we use cmpxchg to atomically set the spte W | |
| 27 | bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1. This | |
| 28 | is safe because any change to these bits can be detected by cmpxchg. | |
| 29 | |||
| 30 | But we need to carefully check these cases: | |
| 31 | 1): The mapping from gfn to pfn | |
| 32 | The mapping from gfn to pfn may change, since we can only ensure that the pfn | |
| 33 | is not changed during cmpxchg. This is an ABA problem; for example, the case | |
| 34 | below can happen: | |
| 35 | |||
| 36 | At the beginning: | ||
| 37 | gpte = gfn1 | ||
| 38 | gfn1 is mapped to pfn1 on host | ||
| 39 | spte is the shadow page table entry corresponding to gpte and | |
| 40 | spte = pfn1 | ||
| 41 | |||
| 42 | VCPU 0 VCPU 1 | |
| 43 | on fast page fault path: | ||
| 44 | |||
| 45 | old_spte = *spte; | ||
| 46 | pfn1 is swapped out: | ||
| 47 | spte = 0; | ||
| 48 | |||
| 49 | pfn1 is re-alloced for gfn2. | ||
| 50 | |||
| 51 | gpte is changed to point to | ||
| 52 | gfn2 by the guest: | ||
| 53 | spte = pfn1; | ||
| 54 | |||
| 55 | if (cmpxchg(spte, old_spte, old_spte+W)) | |
| 56 | mark_page_dirty(vcpu->kvm, gfn1) | ||
| 57 | OOPS!!! | ||
| 58 | |||
| 59 | We dirty-log gfn1, which means gfn2 is lost in the dirty bitmap. | |
| 60 | |||
| 61 | For a direct sp, we can easily avoid it since the spte of a direct sp is fixed | |
| 62 | to the gfn. For an indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic() | |
| 63 | to pin the gfn to the pfn, because after gfn_to_pfn_atomic(): | |
| 64 | - We hold a refcount on the pfn, which means the pfn can not be freed and | |
| 65 | reused for another gfn. | |
| 66 | - The pfn is writable, which means it can not be shared between different gfns | |
| 67 | by KSM. | |
| 68 | |||
| 69 | Then, we can ensure the dirty bitmap is correctly set for a gfn. | |
| 70 | |||
| 71 | Currently, to keep things simple, we disable fast page fault for | |
| 72 | indirect shadow pages. | |
| 73 | |||
| 74 | 2): Dirty bit tracking | ||
| 75 | In the original code, the spte can be fast updated (non-atomically) if the | |
| 76 | spte is read-only and the Accessed bit has already been set, since the | |
| 77 | Accessed bit and Dirty bit can not be lost. | |
| 78 | |||
| 79 | But this is no longer true with fast page fault, since the spte can be marked | |
| 80 | writable between reading the spte and updating it, as in the case below: | |
| 81 | |||
| 82 | At the beginning: | ||
| 83 | spte.W = 0 | ||
| 84 | spte.Accessed = 1 | ||
| 85 | |||
| 86 | VCPU 0 VCPU 1 | |
| 87 | In mmu_spte_clear_track_bits(): | ||
| 88 | |||
| 89 | old_spte = *spte; | ||
| 90 | |||
| 91 | /* 'if' condition is satisfied. */ | ||
| 92 | if (old_spte.Accessed == 1 && | |
| 93 | old_spte.W == 0) | ||
| 94 | spte = 0ull; | ||
| 95 | on fast page fault path: | ||
| 96 | spte.W = 1 | ||
| 97 | memory write on the spte: | ||
| 98 | spte.Dirty = 1 | ||
| 99 | |||
| 100 | |||
| 101 | else | ||
| 102 | old_spte = xchg(spte, 0ull) | ||
| 103 | |||
| 104 | |||
| 105 | if (old_spte.Accessed == 1) | |
| 106 | kvm_set_pfn_accessed(spte.pfn); | ||
| 107 | if (old_spte.Dirty == 1) | ||
| 108 | kvm_set_pfn_dirty(spte.pfn); | ||
| 109 | OOPS!!! | ||
| 110 | |||
| 111 | The Dirty bit is lost in this case. | ||
| 112 | |||
| 113 | In order to avoid this kind of issue, we always treat the spte as "volatile" | |
| 114 | if it can be updated outside of mmu-lock (see spte_has_volatile_bits()). This | |
| 115 | means the spte is always updated atomically in this case. | |
| 116 | |||
| 117 | 3): Flush TLBs due to spte updates | |
| 118 | If the spte is updated from writable to readonly, we should flush all TLBs, | ||
| 119 | otherwise rmap_write_protect will find a read-only spte, even though the | ||
| 120 | writable spte might be cached on a CPU's TLB. | ||
| 121 | |||
| 122 | As mentioned before, the spte can be updated to writable outside of mmu-lock on | |
| 123 | the fast page fault path. To make the path easy to audit, we check whether TLBs | |
| 124 | need to be flushed for this reason in mmu_spte_update(), since this is a common | |
| 125 | function to update the spte (present -> present). | |
| 126 | |||
| 127 | Since the spte is "volatile" if it can be updated outside of mmu-lock, we always | |
| 128 | update the spte atomically, so the race caused by fast page fault can be avoided. | |
| 129 | See the comments in spte_has_volatile_bits() and mmu_spte_update(). | |
| 130 | |||
| 131 | 3. Reference | ||
| 10 | ------------ | 132 | ------------ |
| 11 | 133 | ||
| 12 | Name: kvm_lock | 134 | Name: kvm_lock |
| @@ -23,3 +145,9 @@ Arch: x86 | |||
| 23 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} | 145 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} |
| 24 | - tsc offset in vmcb | 146 | - tsc offset in vmcb |
| 25 | Comment: 'raw' because updating the tsc offsets must not be preempted. | 147 | Comment: 'raw' because updating the tsc offsets must not be preempted. |
| 148 | |||
| 149 | Name: kvm->mmu_lock | ||
| 150 | Type: spinlock_t | ||
| 151 | Arch: any | ||
| 152 | Protects: - shadow page/shadow tlb entry | |
| 153 | Comment: it is a spinlock since it is used in mmu notifier. | ||
