4 files changed, 196 insertions, 3 deletions
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 2c9948379469..bf33aaa4c59f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1946,6 +1946,40 @@ the guest using the specified gsi pin.  The irqfd is removed using
 the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
 and kvm_irqfd.gsi.
+4.76 KVM_PPC_ALLOCATE_HTAB
+Capability: KVM_CAP_PPC_ALLOC_HTAB
+Architectures: powerpc
+Type: vm ioctl
+Parameters: Pointer to u32 containing hash table order (in/out)
+Returns: 0 on success, -1 on error
+This requests the host kernel to allocate an MMU hash table for a
+guest using the PAPR paravirtualization interface.  This only does
+anything if the kernel is configured to use the Book 3S HV style of
+virtualization.  Otherwise the capability doesn't exist and the ioctl
+returns an ENOTTY error.  The rest of this description assumes Book 3S
+HV.
+There must be no vcpus running when this ioctl is called; if there
+are, it will do nothing and return an EBUSY error.
+The parameter is a pointer to a 32-bit unsigned integer variable
+containing the order (log base 2) of the desired size of the hash
+table, which must be between 18 and 46.  On successful return from the
+ioctl, it will have been updated with the order of the hash table that
+was allocated.
+If no hash table has been allocated when any vcpu is asked to run
+(with the KVM_RUN ioctl), the host kernel will allocate a
+default-sized hash table (16 MB).
+If this ioctl is called when a hash table has already been allocated,
+the kernel will clear out the existing hash table (zero all HPTEs) and
+return the hash table order in the parameter.  (If the guest is using
+the virtualized real-mode area (VRMA) facility, the kernel will
+re-create the VRMA HPTEs on the next KVM_RUN of any vcpu.)
 5. The kvm_run structure
 ------------------------
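
To make the "order" parameter of KVM_PPC_ALLOCATE_HTAB in the api.txt hunk above concrete, here is a minimal sketch of the arithmetic only. The helper names are hypothetical, and treating the 18..46 bounds as inclusive is an assumption; the patch text only says "between 18 and 46".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bounds quoted in the documentation text; inclusive is an assumption. */
#define HTAB_ORDER_MIN 18u
#define HTAB_ORDER_MAX 46u

/* Hypothetical helper: is the requested order inside the documented range? */
static bool htab_order_valid(uint32_t order)
{
	return order >= HTAB_ORDER_MIN && order <= HTAB_ORDER_MAX;
}

/* Hypothetical helper: the order is log base 2 of the table size in bytes. */
static uint64_t htab_size_bytes(uint32_t order)
{
	assert(order < 64);	/* a shift by 64 or more is undefined */
	return UINT64_C(1) << order;
}
```

An order of 24 gives 2^24 bytes, which is the 16 MB default size mentioned in the text. With a VM file descriptor, userspace would presumably pass a pointer to such a value as the ioctl argument and read back the order that was actually allocated.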
diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index 3b4cd3bf5631..41b7ac9884b5 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -6,7 +6,129 @@ KVM Lock Overview
 (to be written)
-2. Reference
+2. Exception
+------------
+Fast page fault:
+Fast page fault is the fast path which fixes the guest page fault out of
+the mmu-lock on x86. Currently, the page fault can be fast only if the
+shadow page table is present and the fault is caused by write-protection,
+which means we only need to change the W bit of the spte.
+What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and
+SPTE_MMU_WRITEABLE bit on the spte:
+- SPTE_HOST_WRITEABLE means the gfn is writable on host.
+- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
+  the gfn is writable on guest mmu and it is not write-protected by shadow
+  page write-protection.
+On the fast page fault path, we use cmpxchg to atomically set the spte W
+bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1. This
+is safe because any change to these bits is detected by cmpxchg.
+But we need to carefully check these cases:
+1): The mapping from gfn to pfn
+The mapping from gfn to pfn may be changed since we can only ensure the pfn
+is not changed during cmpxchg. This is an ABA problem; for example, the case
+below can happen:
+At the beginning:
+gpte = gfn1
+gfn1 is mapped to pfn1 on host
+spte is the shadow page table entry corresponding with gpte and
+spte = pfn1
+   VCPU 0                           VCPU 1
+on fast page fault path:
+   old_spte = *spte;
+                                 pfn1 is swapped out:
+                                    spte = 0;
+                                 pfn1 is re-alloced for gfn2.
+                                 gpte is changed to point to
+                                 gfn2 by the guest:
+                                    spte = pfn1;
+   if (cmpxchg(spte, old_spte, old_spte+W))
+        mark_page_dirty(vcpu->kvm, gfn1)
+             OOPS!!!
+We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap.
+For direct sp, we can easily avoid it since the spte of direct sp is fixed
+to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
+to pin gfn to pfn, because after gfn_to_pfn_atomic():
+- We hold a reference on the pfn, which means the pfn can not be freed and
+  reused for another gfn.
+- The pfn is writable, which means it can not be shared between different
+  gfns by KSM.
+Then, we can ensure the dirty bitmap is correctly set for a gfn.
+Currently, to keep things simple, we disable fast page fault for
+indirect shadow pages.
+2): Dirty bit tracking
+In the original code, the spte can be fast updated (non-atomically) if the
+spte is read-only and the Accessed bit has already been set, since the
+Accessed bit and Dirty bit can not be lost.
+But this is no longer true with fast page fault, since the spte can be marked
+writable between reading the spte and updating it, as in the case below:
+At the beginning:
+spte.W = 0
+spte.Accessed = 1
+   VCPU 0                                       VCPU 1
+In mmu_spte_clear_track_bits():
+   old_spte = *spte;
+   /* 'if' condition is satisfied. */
+   if (old_spte.Accessed == 1 &&
+        old_spte.W == 0)
+      spte = 0ull;
+                                         on fast page fault path:
+                                             spte.W = 1
+                                         memory write on the spte:
+                                             spte.Dirty = 1
+   else
+      old_spte = xchg(spte, 0ull)
+   if (old_spte.Accessed == 1)
+      kvm_set_pfn_accessed(spte.pfn);
+   if (old_spte.Dirty == 1)
+      kvm_set_pfn_dirty(spte.pfn);
+      OOPS!!!
+The Dirty bit is lost in this case.
+In order to avoid this kind of issue, we always treat the spte as "volatile"
+if it can be updated out of mmu-lock (see spte_has_volatile_bits()), which
+means the spte is always atomically updated in this case.
+3): Flushing TLBs when the spte is updated
+If the spte is updated from writable to readonly, we should flush all TLBs,
+otherwise rmap_write_protect will find a read-only spte, even though the
+writable spte might be cached on a CPU's TLB.
+As mentioned before, the spte can be updated to writable out of mmu-lock on
+the fast page fault path. To make the path easy to audit, we check whether
+TLBs need to be flushed for this reason in mmu_spte_update(), since this is
+the common function used to update an spte (present -> present).
+Since the spte is "volatile" if it can be updated out of mmu-lock, we always
+atomically update the spte, so the race caused by fast page fault is avoided.
+See the comments in spte_has_volatile_bits() and mmu_spte_update().
+3. Reference
 ------------
 Name:           kvm_lock
@@ -23,3 +145,9 @@ Arch:		x86
 Protects:       - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
                - tsc offset in vmcb
 Comment:        'raw' because updating the tsc offsets must not be preempted.
+Name:           kvm->mmu_lock
+Type:           spinlock_t
+Arch:           any
+Protects:       - shadow page/shadow tlb entry
+Comment:        it is a spinlock since it is used in mmu notifier.
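
The two atomic updates described in the locking.txt hunk above (the cmpxchg on the fast page fault path, and the atomic clear of a "volatile" spte) can be sketched in portable C11 as follows. This is a userspace illustration under stated assumptions, not the kernel code: the function names are hypothetical, the bit positions are illustrative only, and C11 atomics stand in for the kernel's cmpxchg and xchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions; the patch text does not give the spte layout. */
#define SPTE_W              (UINT64_C(1) << 1)
#define SPTE_DIRTY          (UINT64_C(1) << 6)
#define SPTE_HOST_WRITEABLE (UINT64_C(1) << 9)
#define SPTE_MMU_WRITEABLE  (UINT64_C(1) << 10)

/*
 * Hypothetical fast path: set W only if both software-writable bits are set,
 * and only if the spte is unchanged since it was read. Returns true on success.
 */
static bool fast_fix_write_protect(_Atomic uint64_t *sptep)
{
	assert(sptep != NULL);
	uint64_t old = atomic_load(sptep);

	if (!(old & SPTE_HOST_WRITEABLE) || !(old & SPTE_MMU_WRITEABLE))
		return false;
	/* The exchange fails if any bit of the spte changed after the load. */
	return atomic_compare_exchange_strong(sptep, &old, old | SPTE_W);
}

/*
 * Hypothetical clear of a "volatile" spte: an atomic exchange returns the
 * value that was actually replaced, so a concurrently set Dirty bit is seen.
 */
static uint64_t spte_clear_atomic(_Atomic uint64_t *sptep)
{
	assert(sptep != NULL);
	return atomic_exchange(sptep, 0);
}
```

The compare-exchange alone does not address the ABA case in 1), which is why the text restricts the fast path to direct shadow pages.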
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 96b41bd97523..730471048583 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -223,3 +223,36 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
                steal: the amount of time in which this vCPU did not run, in
                nanoseconds. Time during which the vcpu is idle, will not be
                reported as steal time.
+MSR_KVM_EOI_EN: 0x4b564d04
+        data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
+        when disabled.  Bit 1 is reserved and must be zero.  When PV end of
+        interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
+        physical address of a 4 byte memory area which must be in guest RAM and
+        must be zeroed.
+        The first, least significant bit of the 4 byte memory location will be
+        written to by the hypervisor, typically at the time of interrupt
+        injection.  Value of 1 means that guest can skip writing EOI to the apic
+        (using MSR or MMIO write); instead, it is sufficient to signal
+        EOI by clearing the bit in guest memory - this location will
+        later be polled by the hypervisor.
+        Value of 0 means that the EOI write is required.
+        It is always safe for the guest to ignore the optimization and perform
+        the APIC EOI write anyway.
+        Hypervisor is guaranteed to only modify this least
+        significant bit while in the current VCPU context. This means that
+        guest does not need to use either lock prefix or memory ordering
+        primitives to synchronise with the hypervisor.
+        However, hypervisor can set and clear this memory bit at any time.
+        Therefore, to make sure hypervisor does not interrupt the
+        guest and clear the least significant bit in the memory area
+        in the window between guest testing it to detect
+        whether it can skip the EOI apic write and guest
+        clearing it to signal EOI to the hypervisor,
+        guest must both read the least significant bit in the memory area and
+        clear it using a single CPU instruction, such as test and clear, or
+        compare and exchange.
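
The MSR_KVM_EOI_EN rules in the msr.txt hunk above can be sketched as two small guest-side helpers. Both names are hypothetical. The read-and-clear uses a C11 atomic fetch-and, which is stronger than the text requires (the text says no lock prefix or ordering primitive is needed); it is used here only because it is a portable way to express "read and clear in one operation".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: build the MSR value from the guest physical address
 * of the 4-byte flag. Bits 63-2 carry the address, bit 0 enables PV EOI,
 * and bit 1 stays zero because the address is 4-byte aligned.
 */
static uint64_t pv_eoi_msr_value(uint64_t gpa)
{
	assert((gpa & 3u) == 0);
	return gpa | 1u;
}

/*
 * Hypothetical helper: returns true if bit 0 was set, meaning the APIC EOI
 * write can be skipped. The bit is cleared by the same read-modify-write,
 * so there is no window between the test and the clear.
 */
static bool pv_eoi_test_and_clear(_Atomic uint32_t *flag)
{
	assert(flag != NULL);
	uint32_t old = atomic_fetch_and(flag, ~UINT32_C(1));
	return (old & 1u) != 0;
}
```

When the helper returns false, the guest falls back to the normal APIC EOI write, which the text says is always safe.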
diff --git a/Documentation/virtual/kvm/ppc-pv.txt b/Documentation/virtual/kvm/ppc-pv.txt
index 6e7c37050930..4911cf95c67e 100644
--- a/Documentation/virtual/kvm/ppc-pv.txt
+++ b/Documentation/virtual/kvm/ppc-pv.txt
@@ -109,8 +109,6 @@ The following bits are safe to be set inside the guest:
  MSR_EE
  MSR_RI
-  MSR_CR
-  MSR_ME
 If any other bit changes in the MSR, please still use mtmsr(d).
