Diffstat (limited to 'Documentation/virtual/kvm/mmu.txt')
-rw-r--r-- Documentation/virtual/kvm/mmu.txt | 91
1 files changed, 83 insertions, 8 deletions
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 43fcb761ed16..290894176142 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -191,12 +191,12 @@ Shadow pages contain the following information:
     A counter keeping track of how many hardware registers (guest cr3 or
     pdptrs) are now pointing at the page. While this counter is nonzero, the
     page cannot be destroyed. See role.invalid.
-  multimapped:
-    Whether there exist multiple sptes pointing at this page.
-  parent_pte/parent_ptes:
-    If multimapped is zero, parent_pte points at the single spte that points at
-    this page's spt. Otherwise, parent_ptes points at a data structure
-    with a list of parent_ptes.
+  parent_ptes:
+    The reverse mapping for the pte/ptes pointing at this page's spt. If
+    parent_ptes bit 0 is zero, only one spte points at this page and
+    parent_ptes points at this single spte; otherwise, there exist multiple
+    sptes pointing at this page and (parent_ptes & ~0x1) points at a data
+    structure with a list of parent_ptes.
   unsync:
     If true, then the translations in this page may not match the guest's
     translation. This is equivalent to the state of the tlb when a pte is
@@ -210,6 +210,24 @@ Shadow pages contain the following information:
     A bitmap indicating which sptes in spt point (directly or indirectly) at
     pages that may be unsynchronized. Used to quickly locate all unsynchronized
     pages reachable from a given page.
+  mmu_valid_gen:
+    Generation number of the page. It is compared with kvm->arch.mmu_valid_gen
+    during hash table lookup, and used to skip invalidated shadow pages (see
+    "Zapping all pages" below).
+  clear_spte_count:
+    Only present on 32-bit hosts, where a 64-bit spte cannot be written
+    atomically. The reader uses this while running outside the MMU lock
+    to detect in-progress updates and retry the read until the writer has
+    finished the write.
+  write_flooding_count:
+    A guest may write to a page table many times, causing a lot of
+    emulations if the page needs to be write-protected (see "Synchronized
+    and unsynchronized pages" below). Leaf pages can be unsynchronized
+    so that they do not trigger frequent emulation, but this is not
+    possible for non-leaf pages. This field counts the number of emulations
+    since the last time the page table was actually used; if emulation
+    is triggered too frequently on this page, KVM will unmap the page
+    to avoid emulation in the future.
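Editorial note, not part of the patch: the parent_ptes encoding described
in this hunk (bit 0 used as a tag on a pointer-sized value) can be sketched
in user-space C as below. This is only an illustration under stated
assumptions: struct parent_list, PARENT_MANY and the parent_*() helpers are
hypothetical names, and treating a zero value as "no parent recorded" is an
assumption rather than something the text states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the "data structure with a list of parent_ptes". */
struct parent_list {
	size_t count;
	uint64_t *sptes[4];
};

/* Bit 0 set: the remaining bits point at a struct parent_list. */
#define PARENT_MANY ((uintptr_t)1)

/* Store a single parent spte pointer; its bit 0 must already be clear. */
static uintptr_t parent_encode_one(uint64_t *spte)
{
	assert(((uintptr_t)spte & PARENT_MANY) == 0);
	return (uintptr_t)spte;
}

/* Store a pointer to a list of parents and tag it with bit 0. */
static uintptr_t parent_encode_many(struct parent_list *list)
{
	assert(((uintptr_t)list & PARENT_MANY) == 0);
	return (uintptr_t)list | PARENT_MANY;
}

static int parent_is_many(uintptr_t parent_ptes)
{
	return (parent_ptes & PARENT_MANY) != 0;
}

/* Number of sptes pointing at the page (assumption: 0 means none). */
static size_t parent_count(uintptr_t parent_ptes)
{
	if (parent_ptes == 0)
		return 0;
	if (!parent_is_many(parent_ptes))
		return 1;
	/* Mask off the tag bit before dereferencing, as in (parent_ptes & ~0x1). */
	return ((struct parent_list *)(parent_ptes & ~PARENT_MANY))->count;
}
```

The tag works because both pointed-to objects are aligned to more than one
byte, so bit 0 of a valid pointer is always zero and is free to carry the
"single or multiple" flag that the removed multimapped field used to hold.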
 
 Reverse map
 ===========
@@ -258,14 +276,26 @@ This is the most complicated event. The cause of a page fault can be:
 
 Handling a page fault is performed as follows:
 
+ - if the RSV bit of the error code is set, the page fault is caused by the
+   guest accessing MMIO and cached MMIO information is available.
+   - walk shadow page table
+   - check for valid generation number in the spte (see "Fast invalidation of
+     MMIO sptes" below)
+   - cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and
+     vcpu->arch.mmio_gfn, and call the emulator
+ - if both the P bit and the R/W bit of the error code are set, this could
+   possibly be handled as a "fast page fault" (fixed without taking the MMU
+   lock). See the description in Documentation/virtual/kvm/locking.txt.
  - if needed, walk the guest page tables to determine the guest translation
    (gva->gpa or ngpa->gpa)
    - if permissions are insufficient, reflect the fault back to the guest
  - determine the host page
-   - if this is an mmio request, there is no host page; call the emulator
-     to emulate the instruction instead
+   - if this is an mmio request, there is no host page; cache the info to
+     vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn
  - walk the shadow page table to find the spte for the translation,
    instantiating missing intermediate page tables as necessary
+   - if this is an mmio request, cache the mmio info to the spte and set some
+     reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)
  - try to unsynchronize the page
    - if successful, we can let the guest continue and modify the gpte
  - emulate the instruction
@@ -351,6 +381,51 @@ causes its write_count to be incremented, thus preventing instantiation of
 a large spte. The frames at the end of an unaligned memory slot have
 artificially inflated ->write_counts so they can never be instantiated.
 
+Zapping all pages (page generation count)
+=========================================
+
+For guests with large memory, walking and zapping all pages is slow
+(because there are many pages), and it also blocks memory accesses of
+all VCPUs because it must hold the MMU lock.
+
+To make this more scalable, KVM maintains a global generation number
+which is stored in kvm->arch.mmu_valid_gen. Every shadow page stores
+the current global generation number into sp->mmu_valid_gen when it
+is created. Pages with a mismatching generation number are "obsolete".
+
+When KVM needs to zap all shadow page sptes, it simply increments the global
+generation number and then reloads the root shadow pages on all VCPUs. As the
+VCPUs create new shadow page tables, the old pages are not used because of
+the mismatching generation number.
+
+KVM then walks through all pages and zaps obsolete pages. While the zap
+operation needs to take the MMU lock, the lock can be released periodically
+so that the VCPUs can make progress.
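Editorial note, not part of the patch: the generation-count scheme in this
section can be modelled in a few lines of user-space C. The sketch below
only illustrates the idea; struct toy_kvm, struct toy_sp and the toy_*()
helpers are hypothetical names, there is no real locking, and the "zapped"
flag stands in for actually freeing a shadow page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kvm->arch.mmu_valid_gen and sp->mmu_valid_gen. */
struct toy_kvm {
	unsigned long mmu_valid_gen;
};

struct toy_sp {
	unsigned long mmu_valid_gen;
	bool zapped;
};

/* A new shadow page records the global generation at creation time. */
static void toy_sp_init(const struct toy_kvm *kvm, struct toy_sp *sp)
{
	sp->mmu_valid_gen = kvm->mmu_valid_gen;
	sp->zapped = false;
}

/* A page whose generation differs from the global one is obsolete. */
static bool toy_sp_is_obsolete(const struct toy_kvm *kvm,
			       const struct toy_sp *sp)
{
	return sp->mmu_valid_gen != kvm->mmu_valid_gen;
}

/* Invalidating every existing page is a single increment. */
static void toy_invalidate_all(struct toy_kvm *kvm)
{
	kvm->mmu_valid_gen++;
}

/*
 * Reclaim obsolete pages and return how many were zapped. In the scheme
 * described above this walk may drop and retake the MMU lock between
 * pages; the toy model has no lock, so that step is only noted here.
 */
static size_t toy_zap_obsolete(const struct toy_kvm *kvm,
			       struct toy_sp *pages, size_t n)
{
	size_t zapped = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (pages[i].zapped || !toy_sp_is_obsolete(kvm, &pages[i]))
			continue;
		pages[i].zapped = true;
		zapped++;
	}
	return zapped;
}
```

The point of the design is that invalidation itself is constant-time, and
the expensive walk is deferred and can be interrupted, because a lookup that
meets an obsolete page simply skips it.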
+
+Fast invalidation of MMIO sptes
+===============================
+
+As mentioned in "Reaction to events" above, KVM caches MMIO
+information in leaf sptes. When a new memslot is added or an existing
+memslot is changed, this information may become stale and needs to be
+invalidated. Doing so also requires holding the MMU lock while walking
+all shadow pages, and is made more scalable with a similar technique.
+
+MMIO sptes have a few spare bits, which are used to store a
+generation number. The global generation number is stored in
+kvm_memslots(kvm)->generation, and is incremented whenever guest memory
+info changes. This generation number is distinct from the one described
+in the previous section.
+
+When KVM finds an MMIO spte, it checks the generation number of the spte.
+If the generation number of the spte does not equal the global generation
+number, it will ignore the cached MMIO information and handle the page
+fault through the slow path.
+
+Since only 19 bits are used to store the generation number in an MMIO spte,
+all pages are zapped when there is an overflow.
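Editorial note, not part of the patch: the freshness check and the reason
for zapping on overflow can be illustrated with the user-space sketch
below. It assumes only what the text states (a 19-bit generation compared
against the global memslots generation); the TOY_* macros and toy_*()
helpers are hypothetical names, and the sketch does not model where the
bits live inside the spte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MMIO_GEN_BITS 19
#define TOY_MMIO_GEN_MASK ((UINT64_C(1) << TOY_MMIO_GEN_BITS) - 1)

/* The part of the global generation that fits in an MMIO spte. */
static uint64_t toy_mmio_gen(uint64_t memslots_generation)
{
	return memslots_generation & TOY_MMIO_GEN_MASK;
}

/* Cached MMIO info is usable only if the stored generation still matches. */
static bool toy_mmio_spte_is_fresh(uint64_t spte_gen,
				   uint64_t memslots_generation)
{
	assert(spte_gen <= TOY_MMIO_GEN_MASK);
	return spte_gen == toy_mmio_gen(memslots_generation);
}

/*
 * Meant to be called right after the global generation is incremented:
 * true when the 19-bit field has wrapped around to zero. From that point a
 * stale spte could compare equal again, so every page has to be zapped.
 */
static bool toy_mmio_gen_wrapped(uint64_t new_generation)
{
	return toy_mmio_gen(new_generation) == 0;
}
```

For example, a spte that stored generation 5 would wrongly look fresh again
once the global generation reaches 2^19 + 5 if nothing were zapped at the
wrap, which is the aliasing the overflow rule prevents.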
+
+
 Further reading
 ===============
 