Diffstat (limited to 'Documentation/virtual/kvm/mmu.txt')
-rw-r--r-- | Documentation/virtual/kvm/mmu.txt | 91 |
1 files changed, 83 insertions, 8 deletions
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 43fcb761ed16..290894176142 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt | |||
@@ -191,12 +191,12 @@ Shadow pages contain the following information: | |||
191 | A counter keeping track of how many hardware registers (guest cr3 or | 191 | A counter keeping track of how many hardware registers (guest cr3 or |
192 | pdptrs) are now pointing at the page. While this counter is nonzero, the | 192 | pdptrs) are now pointing at the page. While this counter is nonzero, the |
193 | page cannot be destroyed. See role.invalid. | 193 | page cannot be destroyed. See role.invalid. |
194 | multimapped: | 194 | parent_ptes: |
195 | Whether there exist multiple sptes pointing at this page. | 195 | The reverse mapping for the pte/ptes pointing at this page's spt. If |
196 | parent_pte/parent_ptes: | 196 | parent_ptes bit 0 is zero, only one spte points at this page and |
197 | If multimapped is zero, parent_pte points at the single spte that points at | 197 | parent_ptes points at this single spte; otherwise, there exist multiple |
198 | this page's spt. Otherwise, parent_ptes points at a data structure | 198 | sptes pointing at this page and (parent_ptes & ~0x1) points at a data |
199 | with a list of parent_ptes. | 199 | structure with a list of parent_ptes. |
200 | unsync: | 200 | unsync: |
201 | If true, then the translations in this page may not match the guest's | 201 | If true, then the translations in this page may not match the guest's |
202 | translation. This is equivalent to the state of the tlb when a pte is | 202 | translation. This is equivalent to the state of the tlb when a pte is |
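Note on the parent_ptes encoding in the hunk above: the new text folds the old multimapped flag into bit 0 of the pointer word. A minimal sketch of that tagging scheme follows, assuming pointers are at least 2-byte aligned so bit 0 is free. The names spte_t, struct parent_list, encode_single, encode_many and parent_count are hypothetical stand-ins, not the kernel's own types or helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only. */
typedef uint64_t spte_t;

struct parent_list {            /* hypothetical list of parent sptes */
	size_t  count;
	spte_t *entries[4];
};

/* Bit 0 of the word selects between the two encodings. */
#define PARENT_MANY ((uintptr_t)0x1)

/* Single parent: store the spte pointer itself (bit 0 is zero). */
static inline uintptr_t encode_single(spte_t *spte)
{
	return (uintptr_t)spte;
}

/* Multiple parents: store the list pointer with bit 0 set. */
static inline uintptr_t encode_many(struct parent_list *list)
{
	return (uintptr_t)list | PARENT_MANY;
}

static inline int has_many_parents(uintptr_t parent_ptes)
{
	return (parent_ptes & PARENT_MANY) != 0;
}

/* Corresponds to (parent_ptes & ~0x1) in the text. */
static inline struct parent_list *parent_list_of(uintptr_t parent_ptes)
{
	return (struct parent_list *)(parent_ptes & ~PARENT_MANY);
}

/* Number of sptes currently pointing at the page. */
static inline size_t parent_count(uintptr_t parent_ptes)
{
	if (!parent_ptes)
		return 0;
	if (!has_many_parents(parent_ptes))
		return 1;
	return parent_list_of(parent_ptes)->count;
}
```

The point of the encoding is that the common single-parent case costs one word and no allocation, while the reader can still tell the two cases apart without a separate flag field.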
@@ -210,6 +210,24 @@ Shadow pages contain the following information: | |||
210 | A bitmap indicating which sptes in spt point (directly or indirectly) at | 210 | A bitmap indicating which sptes in spt point (directly or indirectly) at |
211 | pages that may be unsynchronized. Used to quickly locate all unsynchronized | 211 | pages that may be unsynchronized. Used to quickly locate all unsynchronized |
212 | pages reachable from a given page. | 212 | pages reachable from a given page. |
213 | mmu_valid_gen: | ||
214 | Generation number of the page. It is compared with kvm->arch.mmu_valid_gen | ||
215 | during hash table lookup, and used to skip invalidated shadow pages (see | ||
216 | "Zapping all pages" below.) | ||
217 | clear_spte_count: | ||
218 | Only present on 32-bit hosts, where a 64-bit spte cannot be written | ||
219 | atomically. The reader uses this while running outside the MMU lock | ||
220 | to detect in-progress updates and retry the read until the writer has | ||
221 | finished the write. | ||
222 | write_flooding_count: | ||
223 | A guest may write to a page table many times, causing a lot of | ||
224 | emulations if the page needs to be write-protected (see "Synchronized | ||
225 | and unsynchronized pages" below). Leaf pages can be unsynchronized | ||
226 | so that they do not trigger frequent emulation, but this is not | ||
227 | possible for non-leafs. This field counts the number of emulations | ||
228 | since the last time the page table was actually used; if emulation | ||
229 | is triggered too frequently on this page, KVM will unmap the page | ||
230 | to avoid emulation in the future. | ||
213 | 231 | ||
214 | Reverse map | 232 | Reverse map |
215 | =========== | 233 | =========== |
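Note on write_flooding_count in the hunk above: the counting rule it describes can be sketched roughly as below. This is an illustration under stated assumptions, not the KVM implementation. FLOOD_THRESHOLD is a hypothetical value (the text only says "too frequently"), and struct shadow_page_sketch, note_emulated_write and note_page_used are made-up names.

```c
#include <stdbool.h>

/* Hypothetical threshold; the documentation does not give a number. */
#define FLOOD_THRESHOLD 3

/* Hypothetical, minimal model of a write-protected shadow page. */
struct shadow_page_sketch {
	unsigned int write_flooding_count;
	bool mapped;
};

/*
 * Called each time a guest write to the page table had to be emulated.
 * Returns true if the page was unmapped because emulation became too
 * frequent.
 */
static inline bool note_emulated_write(struct shadow_page_sketch *sp)
{
	if (++sp->write_flooding_count >= FLOOD_THRESHOLD) {
		sp->mapped = false;
		return true;
	}
	return false;
}

/* Called when the page table is actually used: the count starts over. */
static inline void note_page_used(struct shadow_page_sketch *sp)
{
	sp->write_flooding_count = 0;
}
```

Resetting on use is what makes the counter measure "emulations since the page table was last used" rather than a lifetime total.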
@@ -258,14 +276,26 @@ This is the most complicated event. The cause of a page fault can be: | |||
258 | 276 | ||
259 | Handling a page fault is performed as follows: | 277 | Handling a page fault is performed as follows: |
260 | 278 | ||
279 | - if the RSV bit of the error code is set, the page fault is caused by the | ||
280 | guest accessing MMIO and cached MMIO information is available. | ||
281 | - walk shadow page table | ||
282 | - check for valid generation number in the spte (see "Fast invalidation of | ||
283 | MMIO sptes" below) | ||
284 | - cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and | ||
285 | vcpu->arch.mmio_gfn, and call the emulator | ||
286 | - If both the P bit and the R/W bit of the error code are set, this could | ||
287 | possibly be handled as a "fast page fault" (fixed without taking the MMU | ||
288 | lock). See the description in Documentation/virtual/kvm/locking.txt. | ||
261 | - if needed, walk the guest page tables to determine the guest translation | 289 | - if needed, walk the guest page tables to determine the guest translation |
262 | (gva->gpa or ngpa->gpa) | 290 | (gva->gpa or ngpa->gpa) |
263 | - if permissions are insufficient, reflect the fault back to the guest | 291 | - if permissions are insufficient, reflect the fault back to the guest |
264 | - determine the host page | 292 | - determine the host page |
265 | - if this is an mmio request, there is no host page; call the emulator | 293 | - if this is an mmio request, there is no host page; cache the info to |
266 | to emulate the instruction instead | 294 | vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn |
267 | - walk the shadow page table to find the spte for the translation, | 295 | - walk the shadow page table to find the spte for the translation, |
268 | instantiating missing intermediate page tables as necessary | 296 | instantiating missing intermediate page tables as necessary |
297 | - If this is an mmio request, cache the mmio info in the spte and set some | ||
298 | reserved bits on the spte (see callers of kvm_mmu_set_mmio_spte_mask) | ||
269 | - try to unsynchronize the page | 299 | - try to unsynchronize the page |
270 | - if successful, we can let the guest continue and modify the gpte | 300 | - if successful, we can let the guest continue and modify the gpte |
271 | - emulate the instruction | 301 | - emulate the instruction |
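Note on the new page-fault steps in the hunk above: the first two added bullets amount to a dispatch on the x86 page-fault error code. A rough sketch follows, assuming the architectural bit positions (P is bit 0, W/R is bit 1, RSVD is bit 3). The macro names, enum fault_path and classify_fault are illustrative and are not KVM's identifiers.

```c
#include <stdint.h>

/* x86 page-fault error code bits used by the steps above. */
#define PFERR_PRESENT (1u << 0)   /* P: fault on a present translation */
#define PFERR_WRITE   (1u << 1)   /* W/R: fault was caused by a write */
#define PFERR_RSVD    (1u << 3)   /* RSVD: a reserved bit was set */

/* Hypothetical labels for the three outcomes in the list. */
enum fault_path {
	FAULT_MMIO_CACHED,      /* RSV set: use the cached MMIO information */
	FAULT_FAST_CANDIDATE,   /* P and R/W set: may be fixed without MMU lock */
	FAULT_SLOW              /* everything else: full handling */
};

static inline enum fault_path classify_fault(uint32_t error_code)
{
	const uint32_t fast = PFERR_PRESENT | PFERR_WRITE;

	/* The reserved-bit check comes first, as in the list above. */
	if (error_code & PFERR_RSVD)
		return FAULT_MMIO_CACHED;
	if ((error_code & fast) == fast)
		return FAULT_FAST_CANDIDATE;
	return FAULT_SLOW;
}
```

A "candidate" result only means the fast path may be attempted; whether it succeeds depends on conditions described in locking.txt that are not modeled here.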
@@ -351,6 +381,51 @@ causes its write_count to be incremented, thus preventing instantiation of | |||
351 | a large spte. The frames at the end of an unaligned memory slot have | 381 | a large spte. The frames at the end of an unaligned memory slot have |
352 | artificially inflated ->write_counts so they can never be instantiated. | 382 | artificially inflated ->write_counts so they can never be instantiated. |
353 | 383 | ||
384 | Zapping all pages (page generation count) | ||
385 | ========================================= | ||
386 | |||
387 | For large-memory guests, walking and zapping all pages is really slow | ||
388 | (because there are a lot of pages), and also blocks memory accesses of | ||
389 | all VCPUs because it needs to hold the MMU lock. | ||
390 | |||
391 | To make this more scalable, kvm maintains a global generation number | ||
392 | which is stored in kvm->arch.mmu_valid_gen. Every shadow page stores | ||
393 | the current global generation-number into sp->mmu_valid_gen when it | ||
394 | is created. Pages with a mismatching generation number are "obsolete". | ||
395 | |||
396 | When KVM needs to zap all sptes of all shadow pages, it simply increases the | ||
397 | global generation number and then reloads the root shadow pages on all vcpus. | ||
398 | As the VCPUs create new shadow page tables, the old pages are not used | ||
399 | because of the mismatching generation number. | ||
400 | |||
401 | KVM then walks through all pages and zaps obsolete pages. While the zap | ||
402 | operation needs to take the MMU lock, the lock can be released periodically | ||
403 | so that the VCPUs can make progress. | ||
404 | |||
405 | Fast invalidation of MMIO sptes | ||
406 | =============================== | ||
407 | |||
408 | As mentioned in "Reaction to events" above, kvm will cache MMIO | ||
409 | information in leaf sptes. When a new memslot is added or an existing | ||
410 | memslot is changed, this information may become stale and needs to be | ||
411 | invalidated. This also needs to hold the MMU lock while walking all | ||
412 | shadow pages, and is made more scalable with a similar technique. | ||
413 | |||
414 | MMIO sptes have a few spare bits, which are used to store a | ||
415 | generation number. The global generation number is stored in | ||
416 | kvm_memslots(kvm)->generation, and increased whenever guest memory info | ||
417 | changes. This generation number is distinct from the one described in | ||
418 | the previous section. | ||
419 | |||
420 | When KVM finds an MMIO spte, it checks the generation number of the spte. | ||
421 | If the generation number of the spte does not equal the global generation | ||
422 | number, it will ignore the cached MMIO information and handle the page | ||
423 | fault through the slow path. | ||
424 | |||
425 | Since only 19 bits are used to store the generation number in an mmio spte, all | ||
426 | pages are zapped when there is an overflow. | ||
427 | |||
428 | |||
354 | Further reading | 429 | Further reading |
355 | =============== | 430 | =============== |
356 | 431 | ||
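Note on the two generation numbers introduced in the last hunk: both follow the same pattern of bumping a global counter and treating anything stamped with an older value as stale. The sketch below models that pattern under several assumptions. struct vm_sketch, struct page_sketch and all function names are hypothetical, the placement of the 19 bits inside an spte is not modeled, and treating a wrap to zero as the overflow signal is an assumption about how the last paragraph would be implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* The text states that 19 bits hold the MMIO generation in an spte. */
#define MMIO_GEN_BITS 19
#define MMIO_GEN_MASK ((1u << MMIO_GEN_BITS) - 1)

/* Hypothetical per-VM state holding the two distinct counters. */
struct vm_sketch {
	unsigned long mmu_valid_gen;        /* shadow page generation */
	uint32_t      memslots_generation;  /* MMIO spte generation */
};

/* Hypothetical shadow page carrying the generation it was created in. */
struct page_sketch {
	unsigned long mmu_valid_gen;
};

/* A new shadow page records the current global generation. */
static inline void page_init(struct page_sketch *sp, const struct vm_sketch *vm)
{
	sp->mmu_valid_gen = vm->mmu_valid_gen;
}

/* Pages with a mismatching generation number are "obsolete". */
static inline bool page_is_obsolete(const struct page_sketch *sp,
				    const struct vm_sketch *vm)
{
	return sp->mmu_valid_gen != vm->mmu_valid_gen;
}

/* Zapping all pages only bumps the counter; old pages are reaped later. */
static inline void zap_all_fast(struct vm_sketch *vm)
{
	vm->mmu_valid_gen++;
}

/* Generation value that would be stored in a newly created MMIO spte. */
static inline uint32_t mmio_gen_for_spte(const struct vm_sketch *vm)
{
	return vm->memslots_generation & MMIO_GEN_MASK;
}

/* A cached MMIO spte is usable only if its generation is current. */
static inline bool mmio_spte_is_current(uint32_t spte_gen,
					const struct vm_sketch *vm)
{
	return spte_gen == mmio_gen_for_spte(vm);
}

/*
 * Called when memslots change. Returns true when the 19-bit value has
 * wrapped, in which case the caller would have to zap all pages.
 */
static inline bool memslots_changed(struct vm_sketch *vm)
{
	vm->memslots_generation++;
	return (vm->memslots_generation & MMIO_GEN_MASK) == 0;
}
```

In both cases the expensive walk over all shadow pages is replaced by a counter increment plus a cheap comparison at lookup time, which is why neither invalidation needs to hold the MMU lock for the full walk.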