Diffstat (limited to 'Documentation/virtual/kvm')
-rw-r--r-- | Documentation/virtual/kvm/api.txt | 34 | ||||
-rw-r--r-- | Documentation/virtual/kvm/locking.txt | 130 | ||||
-rw-r--r-- | Documentation/virtual/kvm/msr.txt | 33 | ||||
-rw-r--r-- | Documentation/virtual/kvm/ppc-pv.txt | 2 |
4 files changed, 196 insertions, 3 deletions
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 2c9948379469..bf33aaa4c59f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1946,6 +1946,40 @@ the guest using the specified gsi pin. The irqfd is removed using | |||
1946 | the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd | 1946 | the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd |
1947 | and kvm_irqfd.gsi. | 1947 | and kvm_irqfd.gsi. |
1948 | 1948 | ||
1949 | 4.76 KVM_PPC_ALLOCATE_HTAB | ||
1950 | |||
1951 | Capability: KVM_CAP_PPC_ALLOC_HTAB | ||
1952 | Architectures: powerpc | ||
1953 | Type: vm ioctl | ||
1954 | Parameters: Pointer to u32 containing hash table order (in/out) | ||
1955 | Returns: 0 on success, -1 on error | ||
1956 | |||
1957 | This requests the host kernel to allocate an MMU hash table for a | ||
1958 | guest using the PAPR paravirtualization interface. This only does | ||
1959 | anything if the kernel is configured to use the Book 3S HV style of | ||
1960 | virtualization. Otherwise the capability doesn't exist and the ioctl | ||
1961 | returns an ENOTTY error. The rest of this description assumes Book 3S | ||
1962 | HV. | ||
1963 | |||
1964 | There must be no vcpus running when this ioctl is called; if there | ||
1965 | are, it will do nothing and return an EBUSY error. | ||
1966 | |||
1967 | The parameter is a pointer to a 32-bit unsigned integer variable | ||
1968 | containing the order (log base 2) of the desired size of the hash | ||
1969 | table, which must be between 18 and 46. On successful return from the | ||
1970 | ioctl, it will have been updated with the order of the hash table that | ||
1971 | was allocated. | ||
1972 | |||
1973 | If no hash table has been allocated when any vcpu is asked to run | ||
1974 | (with the KVM_RUN ioctl), the host kernel will allocate a | ||
1975 | default-sized hash table (16 MB). | ||
1976 | |||
1977 | If this ioctl is called when a hash table has already been allocated, | ||
1978 | the kernel will clear out the existing hash table (zero all HPTEs) and | ||
1979 | return the hash table order in the parameter. (If the guest is using | ||
1980 | the virtualized real-mode area (VRMA) facility, the kernel will | ||
1981 | re-create the VRMA HPTEs on the next KVM_RUN of any vcpu.) | ||
1982 | |||
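The call described above can be sketched from userspace as follows. This is a minimal sketch, assuming a VM file descriptor obtained elsewhere (for example via KVM_CREATE_VM); the helper name alloc_htab() is hypothetical and not part of the KVM API. The range check simply mirrors the 18..46 limit stated in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper (not part of the KVM API): ask the host for a hash
 * table of order *order on the VM referred to by vm_fd.  On success,
 * *order holds the order of the table that was actually allocated.
 */
static int alloc_htab(int vm_fd, uint32_t *order)
{
	__u32 htab_order;

	assert(order != NULL);

	/* The text requires the order to be between 18 and 46. */
	if (*order < 18 || *order > 46) {
		errno = EINVAL;
		return -1;
	}

	htab_order = *order;
	/* Per the text: EBUSY if vcpus are running, ENOTTY without Book 3S HV. */
	if (ioctl(vm_fd, KVM_PPC_ALLOCATE_HTAB, &htab_order) < 0)
		return -1;

	*order = htab_order;	/* in/out parameter, updated by the kernel */
	return 0;
}
```

Passing an order of 24 would request a 16 MB table, the same size the text gives as the default.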
1949 | 1983 | ||
1950 | 5. The kvm_run structure | 1984 | 5. The kvm_run structure |
1951 | ------------------------ | 1985 | ------------------------ |
diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index 3b4cd3bf5631..41b7ac9884b5 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -6,7 +6,129 @@ KVM Lock Overview | |||
6 | 6 | ||
7 | (to be written) | 7 | (to be written) |
8 | 8 | ||
9 | 2. Reference | 9 | 2. Exception
10 | ------------ | ||
11 | |||
12 | Fast page fault: | ||
13 | |||
14 | Fast page fault is the fast path which fixes the guest page fault out of | ||
15 | the mmu-lock on x86. Currently, the page fault can be fast only if the | ||
16 | shadow page table is present and the fault is caused by write-protect; that | ||
17 | means we just need to change the W bit of the spte. | ||
18 | |||
19 | What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and | ||
20 | SPTE_MMU_WRITEABLE bit on the spte: | ||
21 | - SPTE_HOST_WRITEABLE means the gfn is writable on the host. | ||
22 | - SPTE_MMU_WRITEABLE means the gfn is writable on the mmu. The bit is set when | ||
23 | the gfn is writable on the guest mmu and it is not write-protected by shadow | ||
24 | page write-protection. | ||
25 | |||
26 | On fast page fault path, we will use cmpxchg to atomically set the spte W | ||
27 | bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1. This | ||
28 | is safe because any change to these bits can be detected by cmpxchg. | ||
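The cmpxchg step can be sketched in portable C11 as follows. This is only an illustration, not the kernel's code: the function name fast_pf_fix_spte() and the bit positions are assumptions made for the example, and C11 atomics stand in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions only; the real layout is defined by the x86 MMU code. */
#define SPTE_W              (1ULL << 1)
#define SPTE_HOST_WRITEABLE (1ULL << 10)
#define SPTE_MMU_WRITEABLE  (1ULL << 11)

/*
 * Hypothetical fast path: set the W bit without holding mmu-lock.
 * Returns true if the spte was made writable, false if the fault must
 * take the slow path (bits not set, or the spte changed under us).
 */
static bool fast_pf_fix_spte(_Atomic uint64_t *sptep)
{
	uint64_t old_spte, new_spte;

	assert(sptep != NULL);
	old_spte = atomic_load(sptep);

	/* Both software-writable bits must be set in our snapshot. */
	if (!(old_spte & SPTE_HOST_WRITEABLE) || !(old_spte & SPTE_MMU_WRITEABLE))
		return false;

	new_spte = old_spte | SPTE_W;
	/* Fails if any bit of the spte changed since the snapshot was taken. */
	return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}
```

Because the compare covers the whole 64-bit entry, a concurrent change to either software bit makes the exchange fail, which is the property the paragraph above relies on.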
29 | |||
30 | But we need to carefully check these cases: | ||
31 | 1): The mapping from gfn to pfn | ||
32 | The mapping from gfn to pfn may be changed since we can only ensure the pfn | ||
33 | is not changed during cmpxchg. This is an ABA problem; for example, the case | ||
34 | below can happen: | ||
35 | |||
36 | At the beginning: | ||
37 | gpte = gfn1 | ||
38 | gfn1 is mapped to pfn1 on host | ||
39 | spte is the shadow page table entry corresponding with gpte and | ||
40 | spte = pfn1 | ||
41 | |||
42 |        VCPU 0                           VCPU 1 | ||
43 | on fast page fault path: | ||
44 | |||
45 |    old_spte = *spte; | ||
46 |                                  pfn1 is swapped out: | ||
47 |                                     spte = 0; | ||
48 | |||
49 |                                  pfn1 is re-alloced for gfn2. | ||
50 | |||
51 |                                  gpte is changed to point to | ||
52 |                                  gfn2 by the guest: | ||
53 |                                     spte = pfn1; | ||
54 | |||
55 |    if (cmpxchg(spte, old_spte, old_spte+W)) | ||
56 |         mark_page_dirty(vcpu->kvm, gfn1) | ||
57 |              OOPS!!! | ||
58 | |||
59 | We dirty-log gfn1, which means gfn2 is lost in the dirty bitmap. | ||
60 | |||
61 | For direct sp, we can easily avoid it since the spte of direct sp is fixed | ||
62 | to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic() | ||
63 | to pin gfn to pfn, because after gfn_to_pfn_atomic(): | ||
64 | - We hold a reference on the pfn, which means the pfn cannot be freed and | ||
65 | reused for another gfn. | ||
66 | - The pfn is writable, which means it cannot be shared between different gfns | ||
67 | by KSM. | ||
68 | |||
69 | Then, we can ensure the dirty bitmap is correctly set for a gfn. | ||
70 | |||
71 | Currently, to simplify things, we disable fast page fault for indirect | ||
72 | shadow pages. | ||
73 | |||
74 | 2): Dirty bit tracking | ||
75 | In the original code, the spte can be fast updated (non-atomically) if the | ||
76 | spte is read-only and the Accessed bit has already been set, since the | ||
77 | Accessed bit and Dirty bit cannot be lost. | ||
78 | |||
79 | But this is no longer true with fast page fault, since the spte can be marked | ||
80 | writable between reading the spte and updating it, as in the case below: | ||
81 | |||
82 | At the beginning: | ||
83 | spte.W = 0 | ||
84 | spte.Accessed = 1 | ||
85 | |||
86 |        VCPU 0                                       VCPU 1 | ||
87 | In mmu_spte_clear_track_bits(): | ||
88 | |||
89 |    old_spte = *spte; | ||
90 | |||
91 |    /* 'if' condition is satisfied. */ | ||
92 |    if (old_spte.Accessed == 1 && | ||
93 |         old_spte.W == 0) | ||
94 |       spte = 0ull; | ||
95 |                                          on fast page fault path: | ||
96 |                                              spte.W = 1 | ||
97 |                                          memory write on the spte: | ||
98 |                                              spte.Dirty = 1 | ||
99 | |||
100 | |||
101 |    else | ||
102 |       old_spte = xchg(spte, 0ull) | ||
103 | |||
104 | |||
105 |    if (old_spte.Accessed == 1) | ||
106 |       kvm_set_pfn_accessed(spte.pfn); | ||
107 |    if (old_spte.Dirty == 1) | ||
108 |       kvm_set_pfn_dirty(spte.pfn); | ||
109 |       OOPS!!! | ||
110 | |||
111 | The Dirty bit is lost in this case. | ||
112 | |||
113 | In order to avoid this kind of issue, we always treat the spte as "volatile" | ||
114 | if it can be updated out of mmu-lock (see spte_has_volatile_bits()); that is, | ||
115 | the spte is always atomically updated in this case. | ||
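A minimal C11 sketch of the "always atomic" clear is shown here, assuming illustrative bit positions and a hypothetical function name; it is not the kernel's mmu_spte_clear_track_bits(). The point is that the old value is taken from the exchange itself, so a Dirty bit set concurrently cannot be missed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions only. */
#define SPTE_ACCESSED (1ULL << 5)
#define SPTE_DIRTY    (1ULL << 6)

/* Hypothetical result type: which tracking bits the old spte carried. */
struct track_bits {
	bool accessed;
	bool dirty;
};

/*
 * Clear a volatile spte and report its Accessed/Dirty state.  The value
 * examined is exactly the one that was replaced, so there is no window
 * between "read the spte" and "zap the spte".
 */
static struct track_bits spte_clear_track_bits(_Atomic uint64_t *sptep)
{
	struct track_bits bits;
	uint64_t old_spte;

	assert(sptep != NULL);
	old_spte = atomic_exchange(sptep, 0);

	bits.accessed = (old_spte & SPTE_ACCESSED) != 0;
	bits.dirty = (old_spte & SPTE_DIRTY) != 0;
	return bits;
}
```

A caller would then propagate the two flags to the pfn, as the diagram above does with kvm_set_pfn_accessed() and kvm_set_pfn_dirty().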
116 | |||
117 | 3): Flush TLBs due to spte update | ||
118 | If the spte is updated from writable to readonly, we should flush all TLBs, | ||
119 | otherwise rmap_write_protect will find a read-only spte, even though the | ||
120 | writable spte might be cached on a CPU's TLB. | ||
121 | |||
122 | As mentioned before, the spte can be updated to writable out of mmu-lock on | ||
123 | the fast page fault path. To make the path easy to audit, we check in | ||
124 | mmu_spte_update() whether TLBs need to be flushed for this reason, since it | ||
125 | is a common function to update the spte (present -> present). | ||
126 | |||
127 | Since the spte is "volatile" if it can be updated out of mmu-lock, we always | ||
128 | atomically update the spte, so the race caused by fast page fault is avoided. | ||
129 | See the comments in spte_has_volatile_bits() and mmu_spte_update(). | ||
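The flush decision can be sketched as below, under the assumption that the update uses an atomic exchange and that only the W bit matters for the decision. The function name and bit position are hypothetical; the real mmu_spte_update() considers more state than this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit position only. */
#define SPTE_W (1ULL << 1)

/*
 * Hypothetical present -> present update.  Returns true when the caller
 * should flush TLBs: the entry that was actually replaced was writable
 * (possibly made so by a concurrent fast page fault) and the new one is not.
 */
static bool spte_update_needs_flush(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	uint64_t old_spte;

	assert(sptep != NULL);
	/* The exchange returns the value really present at replacement time. */
	old_spte = atomic_exchange(sptep, new_spte);

	return (old_spte & SPTE_W) && !(new_spte & SPTE_W);
}
```

Basing the decision on the exchanged value rather than on an earlier read is what keeps a racing writable update from being overlooked.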
130 | |||
131 | 3. Reference | ||
10 | ------------ | 132 | ------------ |
11 | 133 | ||
12 | Name: kvm_lock | 134 | Name: kvm_lock |
@@ -23,3 +145,9 @@ Arch: x86 | |||
23 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} | 145 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} |
24 | - tsc offset in vmcb | 146 | - tsc offset in vmcb |
25 | Comment: 'raw' because updating the tsc offsets must not be preempted. | 147 | Comment: 'raw' because updating the tsc offsets must not be preempted. |
148 | |||
149 | Name: kvm->mmu_lock | ||
150 | Type: spinlock_t | ||
151 | Arch: any | ||
152 | Protects: - shadow page/shadow tlb entry | ||
153 | Comment: it is a spinlock since it is used in mmu notifier. | ||
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 96b41bd97523..730471048583 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -223,3 +223,36 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 | |||
223 | steal: the amount of time in which this vCPU did not run, in | 223 | steal: the amount of time in which this vCPU did not run, in |
224 | nanoseconds. Time during which the vcpu is idle, will not be | 224 | nanoseconds. Time during which the vcpu is idle, will not be |
225 | reported as steal time. | 225 | reported as steal time. |
226 | |||
227 | MSR_KVM_EOI_EN: 0x4b564d04 | ||
228 | data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 | ||
229 | when disabled. Bit 1 is reserved and must be zero. When PV end of | ||
230 | interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned | ||
231 | physical address of a 4 byte memory area which must be in guest RAM and | ||
232 | must be zeroed. | ||
233 | |||
234 | The first, least significant bit of the 4 byte memory location will be | ||
235 | written to by the hypervisor, typically at the time of interrupt | ||
236 | injection. A value of 1 means that the guest can skip writing EOI to the apic | ||
237 | (using MSR or MMIO write); instead, it is sufficient to signal | ||
238 | EOI by clearing the bit in guest memory - this location will | ||
239 | later be polled by the hypervisor. | ||
240 | A value of 0 means that the EOI write is required. | ||
241 | |||
242 | It is always safe for the guest to ignore the optimization and perform | ||
243 | the APIC EOI write anyway. | ||
244 | |||
245 | The hypervisor is guaranteed to only modify this least | ||
246 | significant bit while in the current VCPU context; this means that the | ||
247 | guest does not need to use either lock prefix or memory ordering | ||
248 | primitives to synchronise with the hypervisor. | ||
249 | |||
250 | However, the hypervisor can set and clear this memory bit at any time. | ||
251 | Therefore, to make sure the hypervisor does not interrupt the | ||
252 | guest and clear the least significant bit in the memory area | ||
253 | in the window between the guest testing it (to detect | ||
254 | whether it can skip the EOI apic write) and the guest | ||
255 | clearing it (to signal EOI to the hypervisor), the | ||
256 | guest must both read the least significant bit in the memory area and | ||
257 | clear it using a single CPU instruction, such as test and clear, or | ||
258 | compare and exchange. | ||
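The guest side of this protocol might look like the following sketch. The names pv_eoi_msr_value() and pv_eoi_try_skip() are hypothetical, and the sketch assumes the remaining bits of the 4 byte area stay zero. A C11 compare-and-exchange is used because it typically maps to a single instruction on x86; it also emits a lock prefix, which is stronger than the text requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PV_EOI_ENABLE 1ULL	/* bit 0 of the MSR value, per the text */

/* Build the MSR value for a 4-byte aligned guest physical address. */
static uint64_t pv_eoi_msr_value(uint64_t gpa)
{
	assert((gpa & 3) == 0);	/* bits 1:0 are not part of the address */
	return gpa | PV_EOI_ENABLE;
}

/*
 * Test and clear the PV EOI bit in one atomic step.  Returns true if the
 * bit was set (EOI is signalled by the clear, no APIC write is needed),
 * false if the guest must perform the normal APIC EOI write.
 */
static bool pv_eoi_try_skip(_Atomic uint32_t *pv_eoi_word)
{
	uint32_t expected = 1;

	assert(pv_eoi_word != NULL);
	/* Succeeds, storing 0, only if the word currently reads 1. */
	return atomic_compare_exchange_strong(pv_eoi_word, &expected, 0);
}
```

When pv_eoi_try_skip() returns false, the guest falls back to the APIC EOI write, which the text notes is always safe.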
diff --git a/Documentation/virtual/kvm/ppc-pv.txt b/Documentation/virtual/kvm/ppc-pv.txt
index 6e7c37050930..4911cf95c67e 100644
--- a/Documentation/virtual/kvm/ppc-pv.txt
+++ b/Documentation/virtual/kvm/ppc-pv.txt
@@ -109,8 +109,6 @@ The following bits are safe to be set inside the guest: | |||
109 | 109 | ||
110 | MSR_EE | 110 | MSR_EE |
111 | MSR_RI | 111 | MSR_RI |
112 | MSR_CR | ||
113 | MSR_ME | ||
114 | 112 | ||
115 | If any other bit changes in the MSR, please still use mtmsr(d). | 113 | If any other bit changes in the MSR, please still use mtmsr(d). |
116 | 114 | ||