author | Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> | 2012-06-20 04:00:26 -0400 |
---|---|---|
committer | Avi Kivity <avi@redhat.com> | 2012-07-11 09:51:23 -0400 |
commit | 58d8b1728ea3da391ef01c43a384ea06ce4b7c8a (patch) | |
tree | d63f20f37df1672b7d9e25733a24d4a1770cf09e | |
parent | 6fbc277053836a4d80c72a0843bcbc7595b31e87 (diff) |
KVM: MMU: document mmu-lock and fast page fault
Document fast page fault and mmu-lock in locking.txt
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
-rw-r--r-- | Documentation/virtual/kvm/locking.txt | 130 |
1 files changed, 129 insertions, 1 deletions
diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt index 3b4cd3bf5631..41b7ac9884b5 100644 --- a/Documentation/virtual/kvm/locking.txt +++ b/Documentation/virtual/kvm/locking.txt | |||
@@ -6,7 +6,129 @@ KVM Lock Overview | |||
6 | 6 | ||
7 | (to be written) | 7 | (to be written) |
8 | 8 | ||
9 | 2. Reference | 9 | 2: Exception |
10 | ------------ | ||
11 | |||
12 | Fast page fault: | ||
13 | |||
14 | Fast page fault is the fast path which fixes the guest page fault outside | ||
15 | the mmu-lock on x86. Currently, the page fault can be fast only if the | ||
16 | shadow page table is present and the fault is caused by write-protection, | ||
17 | which means we just need to change the W bit of the spte. | ||
18 | |||
19 | To avoid all the races, we use the SPTE_HOST_WRITEABLE bit and the | ||
20 | SPTE_MMU_WRITEABLE bit on the spte: | ||
21 | - SPTE_HOST_WRITEABLE means the gfn is writable on host. | ||
22 | - SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when | ||
23 | the gfn is writable on guest mmu and it is not write-protected by shadow | ||
24 | page write-protection. | ||
25 | |||
26 | On the fast page fault path, we use cmpxchg to atomically set the spte W | ||
27 | bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1. This | ||
28 | is safe because any change to these bits is detected by cmpxchg. | ||
29 | |||
30 | But we need to carefully check these cases: | ||
31 | 1): The mapping from gfn to pfn | ||
32 | The mapping from gfn to pfn may be changed since we can only ensure the pfn | ||
33 | is not changed during cmpxchg. This is an ABA problem; for example, the case | ||
34 | below can happen: | ||
35 | |||
36 | At the beginning: | ||
37 | gpte = gfn1 | ||
38 | gfn1 is mapped to pfn1 on host | ||
39 | spte is the shadow page table entry corresponding with gpte and | ||
40 | spte = pfn1 | ||
41 | |||
42 |        VCPU 0                           VCPU 1 | ||
43 | on fast page fault path: | ||
44 | |||
45 |    old_spte = *spte; | ||
46 |                                  pfn1 is swapped out: | ||
47 |                                     spte = 0; | ||
48 | |||
49 |                                  pfn1 is re-alloced for gfn2. | ||
50 | |||
51 |                                  gpte is changed to point to | ||
52 |                                  gfn2 by the guest: | ||
53 |                                     spte = pfn1; | ||
54 | |||
55 |    if (cmpxchg(spte, old_spte, old_spte+W)) | ||
56 |        mark_page_dirty(vcpu->kvm, gfn1) | ||
57 |             OOPS!!! | ||
58 | |||
59 | We dirty-log gfn1, which means gfn2 is lost in the dirty bitmap. | ||
60 | |||
61 | For direct sp, we can easily avoid it since the spte of direct sp is fixed | ||
62 | to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic() | ||
63 | to pin gfn to pfn, because after gfn_to_pfn_atomic(): | ||
64 | - We hold a refcount on the pfn, which means the pfn cannot be freed and | ||
65 | reused for another gfn. | ||
66 | - The pfn is writable, which means it cannot be shared between different gfns | ||
67 | by KSM. | ||
68 | |||
69 | Then, we can ensure the dirty bitmap is correctly set for a gfn. | ||
70 | |||
71 | Currently, to keep things simple, we disable fast page fault for | ||
72 | indirect shadow pages. | ||
73 | |||
74 | 2): Dirty bit tracking | ||
75 | In the original code, the spte can be updated quickly (non-atomically) if the | ||
76 | spte is read-only and the Accessed bit has already been set, since the | ||
77 | Accessed bit and Dirty bit cannot be lost. | ||
78 | |||
79 | But this is no longer true with fast page fault, since the spte can be marked | ||
80 | writable between reading the spte and updating it, as in the case below: | ||
81 | |||
82 | At the beginning: | ||
83 | spte.W = 0 | ||
84 | spte.Accessed = 1 | ||
85 | |||
86 |        VCPU 0                          VCPU 1 | ||
87 | In mmu_spte_clear_track_bits(): | ||
88 | |||
89 |    old_spte = *spte; | ||
90 | |||
91 |    /* 'if' condition is satisfied. */ | ||
92 |    if (old_spte.Accessed == 1 && | ||
93 |         old_spte.W == 0) | ||
94 |       spte = 0ull; | ||
95 |                                          on fast page fault path: | ||
96 |                                              spte.W = 1 | ||
97 |                                          memory write on the spte: | ||
98 |                                              spte.Dirty = 1 | ||
99 | |||
100 | |||
101 |    else | ||
102 |       old_spte = xchg(spte, 0ull) | ||
103 | |||
104 | |||
105 |    if (old_spte.Accessed == 1) | ||
106 |       kvm_set_pfn_accessed(spte.pfn); | ||
107 |    if (old_spte.Dirty == 1) | ||
108 |       kvm_set_pfn_dirty(spte.pfn); | ||
109 |       OOPS!!! | ||
110 | |||
111 | The Dirty bit is lost in this case. | ||
112 | |||
113 | In order to avoid this kind of issue, we always treat the spte as "volatile" | ||
114 | if it can be updated out of mmu-lock (see spte_has_volatile_bits()), which | ||
115 | means the spte is always updated atomically in this case. | ||
116 | |||
117 | 3): Flush TLBs due to spte update | ||
118 | If the spte is updated from writable to readonly, we should flush all TLBs, | ||
119 | otherwise rmap_write_protect will find a read-only spte, even though the | ||
120 | writable spte might be cached on a CPU's TLB. | ||
121 | |||
122 | As mentioned before, the spte can be updated to writable out of mmu-lock on | ||
123 | the fast page fault path. To make the path easy to audit, we check whether | ||
124 | TLBs need to be flushed for this reason in mmu_spte_update(), since this is | ||
125 | a common function to update spte (present -> present). | ||
126 | |||
127 | Since the spte is "volatile" if it can be updated out of mmu-lock, we always | ||
128 | update the spte atomically, so the race caused by fast page fault is avoided. | ||
129 | See the comments in spte_has_volatile_bits() and mmu_spte_update(). | ||
130 | |||
131 | 3. Reference | ||
10 | ------------ | 132 | ------------ |
11 | 133 | ||
12 | Name: kvm_lock | 134 | Name: kvm_lock |
@@ -23,3 +145,9 @@ Arch: x86 | |||
23 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} | 145 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} |
24 | - tsc offset in vmcb | 146 | - tsc offset in vmcb |
25 | Comment: 'raw' because updating the tsc offsets must not be preempted. | 147 | Comment: 'raw' because updating the tsc offsets must not be preempted. |
148 | |||
149 | Name: kvm->mmu_lock | ||
150 | Type: spinlock_t | ||
151 | Arch: any | ||
152 | Protects: -shadow page/shadow tlb entry | ||
153 | Comment: it is a spinlock since it is used in mmu notifier. | ||
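
The two lockless techniques the patch documents (setting the W bit with cmpxchg, and clearing a "volatile" spte with an atomic exchange so that a concurrently set Dirty bit is not lost) can be sketched in user-space C. This is only an illustration, not the kernel code: C11 atomics stand in for the kernel's cmpxchg/xchg, the bit positions are made up, and the helper names (spte_locklessly_modifiable, fast_fix_write_protect, clear_track_bits) are hypothetical. It also assumes the precondition for the fast path is the pair of bits the text defines, SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; the real kernel masks differ. */
#define SPTE_W              (1ull << 1)
#define SPTE_ACCESSED       (1ull << 5)
#define SPTE_DIRTY          (1ull << 6)
#define SPTE_HOST_WRITEABLE (1ull << 10)
#define SPTE_MMU_WRITEABLE  (1ull << 11)

typedef _Atomic uint64_t spte_t;

/* Assumption: both software bits must be set before W may be set locklessly. */
static bool spte_locklessly_modifiable(uint64_t spte)
{
    uint64_t need = SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE;

    return (spte & need) == need;
}

/*
 * Fast path: set W only if the spte still holds the value we read.
 * If another CPU changed any bit in between, the compare-exchange fails
 * and the caller would fall back to the slow path under mmu-lock.
 */
static bool fast_fix_write_protect(spte_t *sptep)
{
    uint64_t old = atomic_load(sptep);

    if (!spte_locklessly_modifiable(old) || (old & SPTE_W))
        return false;
    return atomic_compare_exchange_strong(sptep, &old, old | SPTE_W);
}

/*
 * Clear a "volatile" spte with an atomic exchange. The returned value
 * includes any Accessed/Dirty bits set concurrently, so none are lost.
 */
static uint64_t clear_track_bits(spte_t *sptep)
{
    return atomic_exchange(sptep, 0);
}
```

A plain read followed by a plain store of zero would reproduce the lost-Dirty-bit race from case 2; returning the value from the exchange is what lets the caller propagate Accessed and Dirty to the pfn.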