author     Linus Torvalds <torvalds@linux-foundation.org>  2017-07-06 21:38:31 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>  2017-07-06 21:38:31 -0400
commit     c136b84393d4e340e1b53fc7f737dd5827b19ee5 (patch)
tree       985a1bdfafe7ec5ce2d3c738f601cad3998d8ce9
parent     e0f25a3f2d052e36ff67a9b4db835c3e27e950d8 (diff)
parent     1372324b328cd5dabaef5e345e37ad48c63df2a9 (diff)
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "PPC:
   - Better machine check handling for HV KVM
   - Ability to support guests with threads=2, 4 or 8 on POWER9
   - Fix for a race that could cause delayed recognition of signals
   - Fix for a bug where POWER9 guests could sleep with interrupts pending

  ARM:
   - VCPU request overhaul
   - allow timer and PMU to have their interrupt number selected from userspace
   - workaround for Cavium erratum 30115
   - handling of memory poisoning
   - the usual crop of fixes and cleanups

  s390:
   - initial machine check forwarding
   - migration support for the CMMA page hinting information
   - cleanups and fixes

  x86:
   - nested VMX bugfixes and improvements
   - more reliable NMI window detection on AMD
   - APIC timer optimizations

  Generic:
   - VCPU request overhaul + documentation of common code patterns
   - kvm_stat improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
  Update my email address
  kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
  x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12
  kvm: x86: mmu: allow A/D bits to be disabled in an mmu
  x86: kvm: mmu: make spte mmio mask more explicit
  x86: kvm: mmu: dead code thanks to access tracking
  KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
  KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
  KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
  KVM: x86: remove ignored type attribute
  KVM: LAPIC: Fix lapic timer injection delay
  KVM: lapic: reorganize restart_apic_timer
  KVM: lapic: reorganize start_hv_timer
  kvm: nVMX: Check memory operand to INVVPID
  KVM: s390: Inject machine check into the nested guest
  KVM: s390: Inject machine check into the guest
  tools/kvm_stat: add new interactive command 'b'
  tools/kvm_stat: add new command line switch '-i'
  tools/kvm_stat: fix error on interactive command 'g'
  KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit
  ...
-rw-r--r--  Documentation/admin-guide/kernel-parameters.txt | 12
-rw-r--r--  Documentation/arm64/silicon-errata.txt | 1
-rw-r--r--  Documentation/virtual/kvm/api.txt | 172
-rw-r--r--  Documentation/virtual/kvm/devices/s390_flic.txt | 15
-rw-r--r--  Documentation/virtual/kvm/devices/vcpu.txt | 41
-rw-r--r--  Documentation/virtual/kvm/devices/vm.txt | 33
-rw-r--r--  Documentation/virtual/kvm/mmu.txt | 4
-rw-r--r--  Documentation/virtual/kvm/vcpu-requests.rst | 307
-rw-r--r--  MAINTAINERS | 6
-rw-r--r--  arch/arm/include/asm/kvm_host.h | 28
-rw-r--r--  arch/arm/include/uapi/asm/kvm.h | 8
-rw-r--r--  arch/arm/kvm/guest.c | 51
-rw-r--r--  arch/arm/kvm/handle_exit.c | 1
-rw-r--r--  arch/arm/kvm/hyp/switch.c | 2
-rw-r--r--  arch/arm/kvm/reset.c | 16
-rw-r--r--  arch/arm64/Kconfig | 11
-rw-r--r--  arch/arm64/include/asm/arch_gicv3.h | 2
-rw-r--r--  arch/arm64/include/asm/cpucaps.h | 3
-rw-r--r--  arch/arm64/include/asm/cputype.h | 2
-rw-r--r--  arch/arm64/include/asm/esr.h | 24
-rw-r--r--  arch/arm64/include/asm/kvm_host.h | 6
-rw-r--r--  arch/arm64/include/asm/kvm_hyp.h | 1
-rw-r--r--  arch/arm64/include/asm/sysreg.h | 23
-rw-r--r--  arch/arm64/include/uapi/asm/kvm.h | 3
-rw-r--r--  arch/arm64/kernel/cpu_errata.c | 21
-rw-r--r--  arch/arm64/kvm/guest.c | 9
-rw-r--r--  arch/arm64/kvm/handle_exit.c | 1
-rw-r--r--  arch/arm64/kvm/hyp/switch.c | 15
-rw-r--r--  arch/arm64/kvm/reset.c | 16
-rw-r--r--  arch/arm64/kvm/sys_regs.c | 27
-rw-r--r--  arch/arm64/kvm/vgic-sys-reg-v3.c | 45
-rw-r--r--  arch/mips/kvm/trap_emul.c | 2
-rw-r--r--  arch/mips/kvm/vz.c | 2
-rw-r--r--  arch/powerpc/include/asm/kvm_book3s.h | 1
-rw-r--r--  arch/powerpc/include/asm/kvm_book3s_asm.h | 2
-rw-r--r--  arch/powerpc/include/asm/kvm_host.h | 13
-rw-r--r--  arch/powerpc/include/asm/kvm_ppc.h | 2
-rw-r--r--  arch/powerpc/include/asm/ppc-opcode.h | 2
-rw-r--r--  arch/powerpc/include/uapi/asm/kvm.h | 6
-rw-r--r--  arch/powerpc/kernel/asm-offsets.c | 3
-rw-r--r--  arch/powerpc/kernel/mce.c | 1
-rw-r--r--  arch/powerpc/kvm/book3s_hv.c | 511
-rw-r--r--  arch/powerpc/kvm/book3s_hv_builtin.c | 2
-rw-r--r--  arch/powerpc/kvm/book3s_hv_interrupts.S | 8
-rw-r--r--  arch/powerpc/kvm/book3s_hv_ras.c | 18
-rw-r--r--  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 165
-rw-r--r--  arch/powerpc/kvm/book3s_xive.c | 4
-rw-r--r--  arch/powerpc/kvm/booke.c | 2
-rw-r--r--  arch/powerpc/kvm/emulate.c | 4
-rw-r--r--  arch/powerpc/kvm/powerpc.c | 45
-rw-r--r--  arch/s390/include/asm/ctl_reg.h | 4
-rw-r--r--  arch/s390/include/asm/kvm_host.h | 33
-rw-r--r--  arch/s390/include/asm/nmi.h | 6
-rw-r--r--  arch/s390/include/uapi/asm/kvm.h | 12
-rw-r--r--  arch/s390/kvm/gaccess.c | 43
-rw-r--r--  arch/s390/kvm/interrupt.c | 91
-rw-r--r--  arch/s390/kvm/kvm-s390.c | 373
-rw-r--r--  arch/s390/kvm/kvm-s390.h | 2
-rw-r--r--  arch/s390/kvm/priv.c | 103
-rw-r--r--  arch/s390/kvm/vsie.c | 25
-rw-r--r--  arch/x86/include/asm/kvm_host.h | 50
-rw-r--r--  arch/x86/include/asm/msr-index.h | 2
-rw-r--r--  arch/x86/kvm/cpuid.h | 8
-rw-r--r--  arch/x86/kvm/emulate.c | 84
-rw-r--r--  arch/x86/kvm/lapic.c | 116
-rw-r--r--  arch/x86/kvm/lapic.h | 2
-rw-r--r--  arch/x86/kvm/mmu.c | 155
-rw-r--r--  arch/x86/kvm/mmu.h | 2
-rw-r--r--  arch/x86/kvm/mmutrace.h | 6
-rw-r--r--  arch/x86/kvm/svm.c | 95
-rw-r--r--  arch/x86/kvm/vmx.c | 83
-rw-r--r--  arch/x86/kvm/x86.c | 14
-rw-r--r--  include/kvm/arm_arch_timer.h | 8
-rw-r--r--  include/kvm/arm_pmu.h | 6
-rw-r--r--  include/kvm/arm_vgic.h | 14
-rw-r--r--  include/linux/irqchip/arm-gic-v3.h | 6
-rw-r--r--  include/linux/kvm_host.h | 12
-rw-r--r--  include/uapi/linux/kvm.h | 35
-rwxr-xr-x  tools/kvm/kvm_stat/kvm_stat | 669
-rw-r--r--  tools/kvm/kvm_stat/kvm_stat.txt | 12
-rw-r--r--  virt/kvm/arm/aarch32.c | 2
-rw-r--r--  virt/kvm/arm/arch_timer.c | 139
-rw-r--r--  virt/kvm/arm/arm.c | 82
-rw-r--r--  virt/kvm/arm/hyp/vgic-v3-sr.c | 823
-rw-r--r--  virt/kvm/arm/mmu.c | 23
-rw-r--r--  virt/kvm/arm/pmu.c | 117
-rw-r--r--  virt/kvm/arm/psci.c | 8
-rw-r--r--  virt/kvm/arm/vgic/vgic-irqfd.c | 2
-rw-r--r--  virt/kvm/arm/vgic/vgic-mmio-v2.c | 24
-rw-r--r--  virt/kvm/arm/vgic/vgic-mmio-v3.c | 22
-rw-r--r--  virt/kvm/arm/vgic/vgic-mmio.c | 68
-rw-r--r--  virt/kvm/arm/vgic/vgic-mmio.h | 12
-rw-r--r--  virt/kvm/arm/vgic/vgic-v3.c | 45
-rw-r--r--  virt/kvm/arm/vgic/vgic.c | 68
-rw-r--r--  virt/kvm/kvm_main.c | 12
95 files changed, 4250 insertions, 967 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f24ee1c99412..aa1d4409fe0a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1862,6 +1862,18 @@
1862 for all guests. 1862 for all guests.
1863 Default is 1 (enabled) if in 64-bit or 32-bit PAE mode. 1863 Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
1864 1864
1865 kvm-arm.vgic_v3_group0_trap=
1866 [KVM,ARM] Trap guest accesses to GICv3 group-0
1867 system registers
1868
1869 kvm-arm.vgic_v3_group1_trap=
1870 [KVM,ARM] Trap guest accesses to GICv3 group-1
1871 system registers
1872
1873 kvm-arm.vgic_v3_common_trap=
1874 [KVM,ARM] Trap guest accesses to GICv3 common
1875 system registers
1876
1865 kvm-intel.ept= [KVM,Intel] Disable extended page tables 1877 kvm-intel.ept= [KVM,Intel] Disable extended page tables
1866 (virtualized MMU) support on capable Intel chips. 1878 (virtualized MMU) support on capable Intel chips.
1867 Default is 1 (enabled) 1879 Default is 1 (enabled)
diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt
index 10f2dddbf449..f5f93dca54b7 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -62,6 +62,7 @@ stable kernels.
62| Cavium | ThunderX GICv3 | #23154 | CAVIUM_ERRATUM_23154 | 62| Cavium | ThunderX GICv3 | #23154 | CAVIUM_ERRATUM_23154 |
63| Cavium | ThunderX Core | #27456 | CAVIUM_ERRATUM_27456 | 63| Cavium | ThunderX Core | #27456 | CAVIUM_ERRATUM_27456 |
64| Cavium | ThunderX SMMUv2 | #27704 | N/A | 64| Cavium | ThunderX SMMUv2 | #27704 | N/A |
65| Cavium | ThunderX Core | #30115 | CAVIUM_ERRATUM_30115 |
65| | | | | 66| | | | |
66| Freescale/NXP | LS2080A/LS1043A | A-008585 | FSL_ERRATUM_A008585 | 67| Freescale/NXP | LS2080A/LS1043A | A-008585 | FSL_ERRATUM_A008585 |
67| | | | | 68| | | | |
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 4029943887a3..3a9831b72945 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3255,6 +3255,141 @@ Otherwise, if the MCE is a corrected error, KVM will just
3255store it in the corresponding bank (provided this bank is 3255store it in the corresponding bank (provided this bank is
3256not holding a previously reported uncorrected error). 3256not holding a previously reported uncorrected error).
3257 3257
32584.107 KVM_S390_GET_CMMA_BITS
3259
3260Capability: KVM_CAP_S390_CMMA_MIGRATION
3261Architectures: s390
3262Type: vm ioctl
3263Parameters: struct kvm_s390_cmma_log (in, out)
3264Returns: 0 on success, a negative value on error
3265
3266This ioctl is used to get the values of the CMMA bits on the s390
3267architecture. It is meant to be used in two scenarios:
3268- During live migration to save the CMMA values. Live migration needs
3269 to be enabled first via the KVM_S390_VM_MIGRATION_START VM attribute.
3270- To non-destructively peek at the CMMA values, with the flag
3271 KVM_S390_CMMA_PEEK set.
3272
3273The ioctl takes parameters via the kvm_s390_cmma_log struct. The desired
3274values are written to a buffer whose location is indicated via the "values"
3275member in the kvm_s390_cmma_log struct. The values in the input struct are
3276also updated as needed.
3277Each CMMA value takes up one byte.
3278
3279struct kvm_s390_cmma_log {
3280 __u64 start_gfn;
3281 __u32 count;
3282 __u32 flags;
3283 union {
3284 __u64 remaining;
3285 __u64 mask;
3286 };
3287 __u64 values;
3288};
3289
3290start_gfn is the number of the first guest frame whose CMMA values are
3291to be retrieved,
3292
3293count is the length of the buffer in bytes,
3294
3295values points to the buffer where the result will be written to.
3296
3297If count is greater than KVM_S390_SKEYS_MAX, then it is considered to be
3298KVM_S390_SKEYS_MAX. KVM_S390_SKEYS_MAX is re-used for consistency with
3299other ioctls.
3300
3301The result is written in the buffer pointed to by the field values, and
3302the values of the input parameter are updated as follows.
3303
3304Depending on the flags, different actions are performed. The only
3305supported flag so far is KVM_S390_CMMA_PEEK.
3306
3307The default behaviour if KVM_S390_CMMA_PEEK is not set is:
3308start_gfn will indicate the first page frame whose CMMA bits were dirty.
3309It is not necessarily the same as the one passed as input, as clean pages
3310are skipped.
3311
3312count will indicate the number of bytes actually written in the buffer.
3313It can (and very often will) be smaller than the input value, since the
3314buffer is only filled until 16 bytes of clean values are found (which
3315are then not copied in the buffer). Since a CMMA migration block needs
3316the base address and the length, for a total of 16 bytes, we will send
3317back some clean data if there is some dirty data afterwards, as long as
3318the size of the clean data does not exceed the size of the header. This
3319allows minimizing the amount of data to be saved or transferred over
3320the network at the expense of more roundtrips to userspace. The next
3321invocation of the ioctl will skip over all the clean values, saving
3322potentially more than just the 16 bytes we found.
3323
3324If KVM_S390_CMMA_PEEK is set:
3325the existing storage attributes are read even when not in migration
3326mode, and no other action is performed;
3327
3328the output start_gfn will be equal to the input start_gfn,
3329
3330the output count will be equal to the input count, except if the end of
3331memory has been reached.
3332
3333In both cases:
3334the field "remaining" will indicate the total number of dirty CMMA values
3335still remaining, or 0 if KVM_S390_CMMA_PEEK is set and migration mode is
3336not enabled.
3337
3338mask is unused.
3339
3340values points to the userspace buffer where the result will be stored.
3341
3342This ioctl can fail with -ENOMEM if not enough memory can be allocated to
3343complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
3344KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with
3345-EFAULT if the userspace address is invalid or if no page table is
3346present for the addresses (e.g. when using hugepages).
3347
33484.108 KVM_S390_SET_CMMA_BITS
3349
3350Capability: KVM_CAP_S390_CMMA_MIGRATION
3351Architectures: s390
3352Type: vm ioctl
3353Parameters: struct kvm_s390_cmma_log (in)
3354Returns: 0 on success, a negative value on error
3355
3356This ioctl is used to set the values of the CMMA bits on the s390
3357architecture. It is meant to be used during live migration to restore
3358the CMMA values, but there are no restrictions on its use.
3359The ioctl takes parameters via the kvm_s390_cmma_values struct.
3360Each CMMA value takes up one byte.
3361
3362struct kvm_s390_cmma_log {
3363 __u64 start_gfn;
3364 __u32 count;
3365 __u32 flags;
3366 union {
3367 __u64 remaining;
3368 __u64 mask;
3369 };
3370 __u64 values;
3371};
3372
3373start_gfn indicates the starting guest frame number,
3374
3375count indicates how many values are to be considered in the buffer,
3376
3377flags is not used and must be 0.
3378
3379mask indicates which PGSTE bits are to be considered.
3380
3381remaining is not used.
3382
3383values points to the buffer in userspace where to store the values.
3384
3385This ioctl can fail with -ENOMEM if not enough memory can be allocated to
3386complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
3387the count field is too large (e.g. more than KVM_S390_CMMA_SIZE_MAX) or
3388if the flags field was not 0, with -EFAULT if the userspace address is
3389invalid, if invalid pages are written to (e.g. after the end of memory)
3390or if no page table is present for the addresses (e.g. when using
3391hugepages).
3392
32585. The kvm_run structure 33935. The kvm_run structure
3259------------------------ 3394------------------------
3260 3395
@@ -3996,6 +4131,34 @@ Parameters: none
3996Allow use of adapter-interruption suppression. 4131Allow use of adapter-interruption suppression.
3997Returns: 0 on success; -EBUSY if a VCPU has already been created. 4132Returns: 0 on success; -EBUSY if a VCPU has already been created.
3998 4133
41347.11 KVM_CAP_PPC_SMT
4135
4136Architectures: ppc
4137Parameters: vsmt_mode, flags
4138
4139Enabling this capability on a VM provides userspace with a way to set
4140the desired virtual SMT mode (i.e. the number of virtual CPUs per
4141virtual core). The virtual SMT mode, vsmt_mode, must be a power of 2
4142between 1 and 8. On POWER8, vsmt_mode must also be no greater than
4143the number of threads per subcore for the host. Currently flags must
4144be 0. A successful call to enable this capability will result in
4145vsmt_mode being returned when the KVM_CAP_PPC_SMT capability is
4146subsequently queried for the VM. This capability is only supported by
4147HV KVM, and can only be set before any VCPUs have been created.
4148The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT
4149modes are available.
4150
41517.12 KVM_CAP_PPC_FWNMI
4152
4153Architectures: ppc
4154Parameters: none
4155
4156With this capability a machine check exception in the guest address
4157space will cause KVM to exit the guest with an NMI exit reason. This
4158enables QEMU to build an error log and branch to the guest kernel's
4159registered machine check handling routine. Without this capability KVM
4160will branch to the guest's 0x200 interrupt vector.
4161
39998. Other capabilities. 41628. Other capabilities.
4000---------------------- 4163----------------------
4001 4164
@@ -4157,3 +4320,12 @@ Currently the following bits are defined for the device_irq_level bitmap:
4157Future versions of kvm may implement additional events. These will get 4320Future versions of kvm may implement additional events. These will get
4158indicated by returning a higher number from KVM_CHECK_EXTENSION and will be 4321indicated by returning a higher number from KVM_CHECK_EXTENSION and will be
4159listed above. 4322listed above.
4323
43248.10 KVM_CAP_PPC_SMT_POSSIBLE
4325
4326Architectures: ppc
4327
4328Querying this capability returns a bitmap indicating the possible
4329virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N
4330(counting from the right) is set, then a virtual SMT mode of 2^N is
4331available.
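A rough userspace sketch of the KVM_S390_GET_CMMA_BITS flow documented above; the vm_fd descriptor, the 4 KiB buffer and the bare-bones error handling are assumptions of the example, not part of the API:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Sketch: walk all dirty CMMA values once migration mode is active. */
    static int save_cmma(int vm_fd)
    {
            __u8 buf[4096];
            struct kvm_s390_cmma_log log;
            __u64 next_gfn = 0;

            do {
                    memset(&log, 0, sizeof(log));
                    log.start_gfn = next_gfn;
                    log.count = sizeof(buf);        /* capped at KVM_S390_SKEYS_MAX */
                    log.values = (__u64)(unsigned long)buf;

                    if (ioctl(vm_fd, KVM_S390_GET_CMMA_BITS, &log) < 0)
                            return -1;

                    /* buf[0..log.count-1] now holds the values for the frames
                     * starting at the (possibly advanced) log.start_gfn. */
                    next_gfn = log.start_gfn + log.count;
            } while (log.remaining);

            return 0;
    }

Setting KVM_S390_CMMA_PEEK in log.flags would instead read the attributes non-destructively, without requiring migration mode.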
diff --git a/Documentation/virtual/kvm/devices/s390_flic.txt b/Documentation/virtual/kvm/devices/s390_flic.txt
index c2518cea8ab4..2f1cbf1301d2 100644
--- a/Documentation/virtual/kvm/devices/s390_flic.txt
+++ b/Documentation/virtual/kvm/devices/s390_flic.txt
@@ -16,6 +16,7 @@ FLIC provides support to
16- register and modify adapter interrupt sources (KVM_DEV_FLIC_ADAPTER_*) 16- register and modify adapter interrupt sources (KVM_DEV_FLIC_ADAPTER_*)
17- modify AIS (adapter-interruption-suppression) mode state (KVM_DEV_FLIC_AISM) 17- modify AIS (adapter-interruption-suppression) mode state (KVM_DEV_FLIC_AISM)
18- inject adapter interrupts on a specified adapter (KVM_DEV_FLIC_AIRQ_INJECT) 18- inject adapter interrupts on a specified adapter (KVM_DEV_FLIC_AIRQ_INJECT)
19- get/set all AIS mode states (KVM_DEV_FLIC_AISM_ALL)
19 20
20Groups: 21Groups:
21 KVM_DEV_FLIC_ENQUEUE 22 KVM_DEV_FLIC_ENQUEUE
@@ -136,6 +137,20 @@ struct kvm_s390_ais_req {
136 an isc according to the adapter-interruption-suppression mode on condition 137 an isc according to the adapter-interruption-suppression mode on condition
137 that the AIS capability is enabled. 138 that the AIS capability is enabled.
138 139
140 KVM_DEV_FLIC_AISM_ALL
141 Gets or sets the adapter-interruption-suppression mode for all ISCs. Takes
142 a kvm_s390_ais_all describing:
143
144struct kvm_s390_ais_all {
145 __u8 simm; /* Single-Interruption-Mode mask */
146 __u8 nimm; /* No-Interruption-Mode mask */
147};
148
149 simm contains Single-Interruption-Mode mask for all ISCs, nimm contains
150 No-Interruption-Mode mask for all ISCs. Each bit in simm and nimm corresponds
151 to an ISC (MSB0 bit 0 to ISC 0 and so on). The combination of the simm and
152 nimm bits gives the AIS mode for an ISC.
153
139Note: The KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR device ioctls executed on 154Note: The KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR device ioctls executed on
140FLIC with an unknown group or attribute gives the error code EINVAL (instead of 155FLIC with an unknown group or attribute gives the error code EINVAL (instead of
141ENXIO, as specified in the API documentation). It is not possible to conclude 156ENXIO, as specified in the API documentation). It is not possible to conclude
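A minimal sketch of reading this attribute group, assuming flic_fd is the file descriptor returned by KVM_CREATE_DEVICE for the FLIC (the helper name is ours):

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Sketch: dump the AIS masks covering all eight ISCs. */
    static void dump_ais_all(int flic_fd)
    {
            struct kvm_s390_ais_all ais = {};
            struct kvm_device_attr attr = {
                    .group = KVM_DEV_FLIC_AISM_ALL,
                    .addr  = (__u64)(unsigned long)&ais,
            };

            if (ioctl(flic_fd, KVM_GET_DEVICE_ATTR, &attr) == 0)
                    printf("simm=%#x nimm=%#x\n", ais.simm, ais.nimm);
    }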
diff --git a/Documentation/virtual/kvm/devices/vcpu.txt b/Documentation/virtual/kvm/devices/vcpu.txt
index 02f50686c418..2b5dab16c4f2 100644
--- a/Documentation/virtual/kvm/devices/vcpu.txt
+++ b/Documentation/virtual/kvm/devices/vcpu.txt
@@ -16,7 +16,9 @@ Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a
16Returns: -EBUSY: The PMU overflow interrupt is already set 16Returns: -EBUSY: The PMU overflow interrupt is already set
17 -ENXIO: The overflow interrupt not set when attempting to get it 17 -ENXIO: The overflow interrupt not set when attempting to get it
18 -ENODEV: PMUv3 not supported 18 -ENODEV: PMUv3 not supported
19 -EINVAL: Invalid PMU overflow interrupt number supplied 19 -EINVAL: Invalid PMU overflow interrupt number supplied or
20 trying to set the IRQ number without using an in-kernel
21 irqchip.
20 22
21A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt 23A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
22number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt 24number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
@@ -25,11 +27,36 @@ all vcpus, while as an SPI it must be a separate number per vcpu.
25 27
261.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT 281.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
27Parameters: no additional parameter in kvm_device_attr.addr 29Parameters: no additional parameter in kvm_device_attr.addr
28Returns: -ENODEV: PMUv3 not supported 30Returns: -ENODEV: PMUv3 not supported or GIC not initialized
29 -ENXIO: PMUv3 not properly configured as required prior to calling this 31 -ENXIO: PMUv3 not properly configured or in-kernel irqchip not
30 attribute 32 configured as required prior to calling this attribute
31 -EBUSY: PMUv3 already initialized 33 -EBUSY: PMUv3 already initialized
32 34
33Request the initialization of the PMUv3. This must be done after creating the 35Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel
34in-kernel irqchip. Creating a PMU with a userspace irqchip is currently not 36virtual GIC implementation, this must be done after initializing the in-kernel
35supported. 37irqchip.
38
39
402. GROUP: KVM_ARM_VCPU_TIMER_CTRL
41Architectures: ARM,ARM64
42
432.1. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_VTIMER
442.2. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_PTIMER
45Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
46 pointer to an int
47Returns: -EINVAL: Invalid timer interrupt number
48 -EBUSY: One or more VCPUs has already run
49
50A value describing the architected timer interrupt number when connected to an
51in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the
52attribute overrides the default values (see below).
53
54KVM_ARM_VCPU_TIMER_IRQ_VTIMER: The EL1 virtual timer intid (default: 27)
55KVM_ARM_VCPU_TIMER_IRQ_PTIMER: The EL1 physical timer intid (default: 30)
56
57Setting the same PPI for different timers will prevent the VCPUs from running.
58Setting the interrupt number on a VCPU configures all VCPUs created at that
59time to use the number provided for a given timer, overwriting any previously
60configured values on other VCPUs. Userspace should configure the interrupt
61numbers on at least one VCPU after creating all VCPUs and before running any
62VCPUs.
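A hedged userspace sketch of selecting a non-default virtual timer PPI via the group above; vcpu_fd and the chosen intid of 28 are assumptions of the example:

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Sketch: move the EL1 virtual timer to PPI 28 before any VCPU runs. */
    static void set_vtimer_irq(int vcpu_fd)
    {
            int intid = 28;                 /* must satisfy 16 <= intid < 32 */
            struct kvm_device_attr attr = {
                    .group = KVM_ARM_VCPU_TIMER_CTRL,
                    .attr  = KVM_ARM_VCPU_TIMER_IRQ_VTIMER,
                    .addr  = (__u64)(unsigned long)&intid,
            };

            if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
                    perror("KVM_ARM_VCPU_TIMER_IRQ_VTIMER");
    }

As described above, doing this once, after all VCPUs are created and before any of them run, propagates the chosen number to every VCPU.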
diff --git a/Documentation/virtual/kvm/devices/vm.txt b/Documentation/virtual/kvm/devices/vm.txt
index 575ccb022aac..903fc926860b 100644
--- a/Documentation/virtual/kvm/devices/vm.txt
+++ b/Documentation/virtual/kvm/devices/vm.txt
@@ -222,3 +222,36 @@ Allows user space to disable dea key wrapping, clearing the wrapping key.
222 222
223Parameters: none 223Parameters: none
224Returns: 0 224Returns: 0
225
2265. GROUP: KVM_S390_VM_MIGRATION
227Architectures: s390
228
2295.1. ATTRIBUTE: KVM_S390_VM_MIGRATION_STOP (w/o)
230
231Allows userspace to stop migration mode, needed for PGSTE migration.
232Setting this attribute when migration mode is not active will have no
233effect.
234
235Parameters: none
236Returns: 0
237
2385.2. ATTRIBUTE: KVM_S390_VM_MIGRATION_START (w/o)
239
240Allows userspace to start migration mode, needed for PGSTE migration.
241Setting this attribute when migration mode is already active will have
242no effect.
243
244Parameters: none
245Returns: -ENOMEM if there is not enough free memory to start migration mode
246 -EINVAL if the state of the VM is invalid (e.g. no memory defined)
247 0 in case of success.
248
2495.3. ATTRIBUTE: KVM_S390_VM_MIGRATION_STATUS (r/o)
250
251Allows userspace to query the status of migration mode.
252
253Parameters: address of a buffer in user space to store the data (u64) to;
254 the data itself is either 0 if migration mode is disabled or 1
255 if it is enabled
256Returns: -EFAULT if the given address is not accessible from kernel space
257 0 in case of success.
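For illustration only, starting migration mode and reading back its status might look like the following sketch (vm_fd is the VM file descriptor; error handling is omitted):

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Sketch: enter migration mode, then query whether it is enabled. */
    static void start_and_check_migration(int vm_fd)
    {
            __u64 state = 0;
            struct kvm_device_attr start = {
                    .group = KVM_S390_VM_MIGRATION,
                    .attr  = KVM_S390_VM_MIGRATION_START,
            };
            struct kvm_device_attr status = {
                    .group = KVM_S390_VM_MIGRATION,
                    .attr  = KVM_S390_VM_MIGRATION_STATUS,
                    .addr  = (__u64)(unsigned long)&state,
            };

            ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &start);
            ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &status);
            printf("migration mode: %s\n", state ? "enabled" : "disabled");
    }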
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 481b6a9c25d5..f50d45b1e967 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -179,6 +179,10 @@ Shadow pages contain the following information:
179 shadow page; it is also used to go back from a struct kvm_mmu_page 179 shadow page; it is also used to go back from a struct kvm_mmu_page
180 to a memslot, through the kvm_memslots_for_spte_role macro and 180 to a memslot, through the kvm_memslots_for_spte_role macro and
181 __gfn_to_memslot. 181 __gfn_to_memslot.
182 role.ad_disabled:
183 Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D
184 bits before Haswell; shadow EPT page tables also cannot use A/D bits
185 if the L1 hypervisor does not enable them.
182 gfn: 186 gfn:
183 Either the guest page table containing the translations shadowed by this 187 Either the guest page table containing the translations shadowed by this
184 page, or the base page frame for linear translations. See role.direct. 188 page, or the base page frame for linear translations. See role.direct.
diff --git a/Documentation/virtual/kvm/vcpu-requests.rst b/Documentation/virtual/kvm/vcpu-requests.rst
new file mode 100644
index 000000000000..5feb3706a7ae
--- /dev/null
+++ b/Documentation/virtual/kvm/vcpu-requests.rst
@@ -0,0 +1,307 @@
1=================
2KVM VCPU Requests
3=================
4
5Overview
6========
7
8KVM supports an internal API enabling threads to request a VCPU thread to
9perform some activity. For example, a thread may request a VCPU to flush
10its TLB with a VCPU request. The API consists of the following functions::
11
12 /* Check if any requests are pending for VCPU @vcpu. */
13 bool kvm_request_pending(struct kvm_vcpu *vcpu);
14
15 /* Check if VCPU @vcpu has request @req pending. */
16 bool kvm_test_request(int req, struct kvm_vcpu *vcpu);
17
18 /* Clear request @req for VCPU @vcpu. */
19 void kvm_clear_request(int req, struct kvm_vcpu *vcpu);
20
21 /*
22 * Check if VCPU @vcpu has request @req pending. When the request is
23 * pending it will be cleared and a memory barrier, which pairs with
24 * another in kvm_make_request(), will be issued.
25 */
26 bool kvm_check_request(int req, struct kvm_vcpu *vcpu);
27
28 /*
29 * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
30 * with another in kvm_check_request(), prior to setting the request.
31 */
32 void kvm_make_request(int req, struct kvm_vcpu *vcpu);
33
34 /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
35 bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
36
37Typically a requester wants the VCPU to perform the activity as soon
38as possible after making the request. This means most requests
39(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
40and kvm_make_all_cpus_request() has the kicking of all VCPUs built
41into it.
42
43VCPU Kicks
44----------
45
46The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
47order to perform some KVM maintenance. To do so, an IPI is sent, forcing
48a guest mode exit. However, a VCPU thread may not be in guest mode at the
49time of the kick. Therefore, depending on the mode and state of the VCPU
50thread, there are two other actions a kick may take. All three actions
51are listed below:
52
531) Send an IPI. This forces a guest mode exit.
542) Wake a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest
55 mode that wait on waitqueues. Waking them removes the threads from
56 the waitqueues, allowing the threads to run again. This behavior
57 may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
583) Nothing. When the VCPU is not in guest mode and the VCPU thread is not
59 sleeping, then there is nothing to do.
60
61VCPU Mode
62---------
63
64VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
65guest is running in guest mode or not, as well as some specific
66outside guest mode states. The architecture may use ``vcpu->mode`` to
67ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
68as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
69even to ensure IPI acknowledgements are waited upon (see "Waiting for
70Acknowledgements"). The following modes are defined:
71
72OUTSIDE_GUEST_MODE
73
74 The VCPU thread is outside guest mode.
75
76IN_GUEST_MODE
77
78 The VCPU thread is in guest mode.
79
80EXITING_GUEST_MODE
81
82 The VCPU thread is transitioning from IN_GUEST_MODE to
83 OUTSIDE_GUEST_MODE.
84
85READING_SHADOW_PAGE_TABLES
86
87 The VCPU thread is outside guest mode, but it wants the sender of
88 certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
89 thread is done reading the page tables.
90
91VCPU Request Internals
92======================
93
94VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
95This means general bitops, like those documented in [atomic-ops]_ could
96also be used, e.g. ::
97
98 clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
99
100However, VCPU request users should refrain from doing so, as it would
101break the abstraction. The first 8 bits are reserved for architecture
102independent requests, all additional bits are available for architecture
103dependent requests.
104
105Architecture Independent Requests
106---------------------------------
107
108KVM_REQ_TLB_FLUSH
109
110 KVM's common MMU notifier may need to flush all of a guest's TLB
111 entries, calling kvm_flush_remote_tlbs() to do so. Architectures that
112 choose to use the common kvm_flush_remote_tlbs() implementation will
113 need to handle this VCPU request.
114
115KVM_REQ_MMU_RELOAD
116
117 When shadow page tables are used and memory slots are removed it's
118 necessary to inform each VCPU to completely refresh the tables. This
119 request is used for that.
120
121KVM_REQ_PENDING_TIMER
122
123 This request may be made from a timer handler run on the host on behalf
124 of a VCPU. It informs the VCPU thread to inject a timer interrupt.
125
126KVM_REQ_UNHALT
127
128 This request may be made from the KVM common function kvm_vcpu_block(),
129 which is used to emulate an instruction that causes a CPU to halt until
130 one of an architecture-specific set of events and/or interrupts is
131 received (determined by checking kvm_arch_vcpu_runnable()). When that
132 event or interrupt arrives kvm_vcpu_block() makes the request. This is
133 in contrast to when kvm_vcpu_block() returns due to any other reason,
134 such as a pending signal, which does not indicate the VCPU's halt
135 emulation should stop, and therefore does not make the request.
136
137KVM_REQUEST_MASK
138----------------
139
140VCPU requests should be masked by KVM_REQUEST_MASK before using them with
141bitops. This is because only the lower 8 bits are used to represent the
142request's number. The upper bits are used as flags. Currently only two
143flags are defined.
144
145VCPU Request Flags
146------------------
147
148KVM_REQUEST_NO_WAKEUP
149
150 This flag is applied to requests that only need immediate attention
151 from VCPUs running in guest mode. That is, sleeping VCPUs do not need
152 to be awaken for these requests. Sleeping VCPUs will handle the
153 requests when they are awaken later for some other reason.
154
155KVM_REQUEST_WAIT
156
157 When requests with this flag are made with kvm_make_all_cpus_request(),
158 then the caller will wait for each VCPU to acknowledge its IPI before
159 proceeding. This flag only applies to VCPUs that would receive IPIs.
160 If, for example, the VCPU is sleeping, so no IPI is necessary, then
161 the requesting thread does not wait. This means that this flag may be
162 safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for
163 Acknowledgements" for more information about requests with
164 KVM_REQUEST_WAIT.
165
166VCPU Requests with Associated State
167===================================
168
169Requesters that want the receiving VCPU to handle new state need to ensure
170the newly written state is observable to the receiving VCPU thread's CPU
171by the time it observes the request. This means a write memory barrier
172must be inserted after writing the new state and before setting the VCPU
173request bit. Additionally, on the receiving VCPU thread's side, a
174corresponding read barrier must be inserted after reading the request bit
175and before proceeding to read the new state associated with it. See
176scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
177[memory-barriers]_.
178
179The pair of functions, kvm_check_request() and kvm_make_request(), provide
180the memory barriers, allowing this requirement to be handled internally by
181the API.
182
183Ensuring Requests Are Seen
184==========================
185
186When making requests to VCPUs, we want to avoid the receiving VCPU
187 executing in guest mode for an arbitrarily long time without handling the
188request. We can be sure this won't happen as long as we ensure the VCPU
189thread checks kvm_request_pending() before entering guest mode and that a
190kick will send an IPI to force an exit from guest mode when necessary.
191Extra care must be taken to cover the period after the VCPU thread's last
192kvm_request_pending() check and before it has entered guest mode, as kick
193IPIs will only trigger guest mode exits for VCPU threads that are in guest
194mode or at least have already disabled interrupts in order to prepare to
195enter guest mode. This means that an optimized implementation (see "IPI
196Reduction") must be certain when it's safe to not send the IPI. One
197solution, which all architectures except s390 apply, is to:
198
199- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
200 the last kvm_request_pending() check;
201- enable interrupts atomically when entering the guest.
202
203This solution also requires memory barriers to be placed carefully in both
204the requesting thread and the receiving VCPU. With the memory barriers we
205can exclude the possibility of a VCPU thread observing
206!kvm_request_pending() on its last check and then not receiving an IPI for
207the next request made of it, even if the request is made immediately after
208the check. This is done by way of the Dekker memory barrier pattern
209(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables,
210this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting
211them into the pattern gives::
212
213 CPU1 CPU2
214 ================= =================
215 local_irq_disable();
216 WRITE_ONCE(vcpu->mode, IN_GUEST_MODE); kvm_make_request(REQ, vcpu);
217 smp_mb(); smp_mb();
218 if (kvm_request_pending(vcpu)) { if (READ_ONCE(vcpu->mode) ==
219 IN_GUEST_MODE) {
220 ...abort guest entry... ...send IPI...
221 } }
222
223As stated above, the IPI is only useful for VCPU threads in guest mode or
224that have already disabled interrupts. This is why this specific case of
225the Dekker pattern has been extended to disable interrupts before setting
226``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to
227pedantically implement the memory barrier pattern, guaranteeing the
228compiler doesn't interfere with ``vcpu->mode``'s carefully planned
229accesses.
230
231IPI Reduction
232-------------
233
234As only one IPI is needed to get a VCPU to check for any/all requests,
235they may be coalesced. This is easily done by having the first IPI
236sending kick also change the VCPU mode to something !IN_GUEST_MODE. The
237transitional state, EXITING_GUEST_MODE, is used for this purpose.
238
239Waiting for Acknowledgements
240----------------------------
241
242Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
243be sent, and the acknowledgements to be waited upon, even when the target
244VCPU threads are in modes other than IN_GUEST_MODE. For example, one case
245is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
246is set after disabling interrupts. To support these cases, the
247KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
248checking that the VCPU is IN_GUEST_MODE to checking that it is not
249OUTSIDE_GUEST_MODE.
250
251Request-less VCPU Kicks
252-----------------------
253
254As the determination of whether or not to send an IPI depends on the
255two-variable Dekker memory barrier pattern, it's clear that
256request-less VCPU kicks are almost never correct. Without the assurance
257that a non-IPI generating kick will still result in an action by the
258receiving VCPU, as the final kvm_request_pending() check does for
259request-accompanying kicks, then the kick may not do anything useful at
260all. If, for instance, a request-less kick was made to a VCPU that was
261just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
262the VCPU thread may continue its entry without actually having done
263whatever it was the kick was meant to initiate.
264
265One exception is x86's posted interrupt mechanism. In this case, however,
266even the request-less VCPU kick is coupled with the same
267local_irq_disable() + smp_mb() pattern described above; the ON bit
268(Outstanding Notification) in the posted interrupt descriptor takes the
269role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is
270set before reading ``vcpu->mode``; dually, in the VCPU thread,
271vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
272IN_GUEST_MODE.
273
274Additional Considerations
275=========================
276
277Sleeping VCPUs
278--------------
279
280VCPU threads may need to consider requests before and/or after calling
281functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they
282do or not, and, if they do, which requests need consideration, is
283architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
284to check if it should awaken. One reason to do so is to provide
285architectures a function where requests may be checked if necessary.
286
287Clearing Requests
288-----------------
289
290Generally it only makes sense for the receiving VCPU thread to clear a
291request. However, in some circumstances, such as when the requesting
292thread and the receiving VCPU thread are executed serially, such as when
293they are the same thread, or when they are using some form of concurrency
294control to temporarily execute synchronously, then it's possible to know
295that the request may be cleared immediately, rather than waiting for the
296receiving VCPU thread to handle the request in VCPU RUN. The only current
297examples of this are kvm_vcpu_block() calls made by VCPUs to block
298themselves. A possible side-effect of that call is to make the
299KVM_REQ_UNHALT request, which may then be cleared immediately when the
300VCPU returns from the call.
301
302References
303==========
304
305.. [atomic-ops] Documentation/core-api/atomic_ops.rst
306.. [memory-barriers] Documentation/memory-barriers.txt
307.. [lwn-mb] https://lwn.net/Articles/573436/
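As a kernel-side sketch of the request-with-state pattern described in "VCPU Requests with Associated State"; KVM_REQ_EXAMPLE, the new_state field and consume() are invented for illustration and are not existing KVM symbols:

    /* Requesting thread: publish the state, then make the request and kick. */
    vcpu->arch.new_state = state;              /* hypothetical per-VCPU field */
    kvm_make_request(KVM_REQ_EXAMPLE, vcpu);   /* barrier pairs with kvm_check_request() */
    kvm_vcpu_kick(vcpu);

    /* VCPU thread, checked before every guest entry: */
    if (kvm_request_pending(vcpu)) {
            if (kvm_check_request(KVM_REQ_EXAMPLE, vcpu))
                    consume(vcpu->arch.new_state); /* read barrier already issued */
    }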
diff --git a/MAINTAINERS b/MAINTAINERS
index 75ac9dc85804..1c1d106a3347 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7350,7 +7350,7 @@ F: arch/powerpc/kvm/
7350 7350
7351KERNEL VIRTUAL MACHINE for s390 (KVM/s390) 7351KERNEL VIRTUAL MACHINE for s390 (KVM/s390)
7352M: Christian Borntraeger <borntraeger@de.ibm.com> 7352M: Christian Borntraeger <borntraeger@de.ibm.com>
7353M: Cornelia Huck <cornelia.huck@de.ibm.com> 7353M: Cornelia Huck <cohuck@redhat.com>
7354L: linux-s390@vger.kernel.org 7354L: linux-s390@vger.kernel.org
7355W: http://www.ibm.com/developerworks/linux/linux390/ 7355W: http://www.ibm.com/developerworks/linux/linux390/
7356T: git git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git 7356T: git git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git
@@ -11268,7 +11268,7 @@ S: Supported
11268F: drivers/iommu/s390-iommu.c 11268F: drivers/iommu/s390-iommu.c
11269 11269
11270S390 VFIO-CCW DRIVER 11270S390 VFIO-CCW DRIVER
11271M: Cornelia Huck <cornelia.huck@de.ibm.com> 11271M: Cornelia Huck <cohuck@redhat.com>
11272M: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> 11272M: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
11273L: linux-s390@vger.kernel.org 11273L: linux-s390@vger.kernel.org
11274L: kvm@vger.kernel.org 11274L: kvm@vger.kernel.org
@@ -13814,7 +13814,7 @@ F: include/uapi/linux/virtio_*.h
13814F: drivers/crypto/virtio/ 13814F: drivers/crypto/virtio/
13815 13815
13816VIRTIO DRIVERS FOR S390 13816VIRTIO DRIVERS FOR S390
13817M: Cornelia Huck <cornelia.huck@de.ibm.com> 13817M: Cornelia Huck <cohuck@redhat.com>
13818M: Halil Pasic <pasic@linux.vnet.ibm.com> 13818M: Halil Pasic <pasic@linux.vnet.ibm.com>
13819L: linux-s390@vger.kernel.org 13819L: linux-s390@vger.kernel.org
13820L: virtualization@lists.linux-foundation.org 13820L: virtualization@lists.linux-foundation.org
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index f0e66577ce05..127e2dd2e21c 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -44,7 +44,9 @@
44#define KVM_MAX_VCPUS VGIC_V2_MAX_CPUS 44#define KVM_MAX_VCPUS VGIC_V2_MAX_CPUS
45#endif 45#endif
46 46
47#define KVM_REQ_VCPU_EXIT (8 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 47#define KVM_REQ_SLEEP \
48 KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
49#define KVM_REQ_IRQ_PENDING KVM_ARCH_REQ(1)
48 50
49u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode); 51u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode);
50int __attribute_const__ kvm_target_cpu(void); 52int __attribute_const__ kvm_target_cpu(void);
@@ -233,8 +235,6 @@ struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
233struct kvm_vcpu __percpu **kvm_get_running_vcpus(void); 235struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
234void kvm_arm_halt_guest(struct kvm *kvm); 236void kvm_arm_halt_guest(struct kvm *kvm);
235void kvm_arm_resume_guest(struct kvm *kvm); 237void kvm_arm_resume_guest(struct kvm *kvm);
236void kvm_arm_halt_vcpu(struct kvm_vcpu *vcpu);
237void kvm_arm_resume_vcpu(struct kvm_vcpu *vcpu);
238 238
239int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); 239int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices);
240unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu); 240unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu);
@@ -291,20 +291,12 @@ static inline void kvm_arm_init_debug(void) {}
291static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {} 291static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
292static inline void kvm_arm_clear_debug(struct kvm_vcpu *vcpu) {} 292static inline void kvm_arm_clear_debug(struct kvm_vcpu *vcpu) {}
293static inline void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu) {} 293static inline void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu) {}
294static inline int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu, 294
295 struct kvm_device_attr *attr) 295int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
296{ 296 struct kvm_device_attr *attr);
297 return -ENXIO; 297int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
298} 298 struct kvm_device_attr *attr);
299static inline int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu, 299int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
300 struct kvm_device_attr *attr) 300 struct kvm_device_attr *attr);
301{
302 return -ENXIO;
303}
304static inline int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
305 struct kvm_device_attr *attr)
306{
307 return -ENXIO;
308}
309 301
310#endif /* __ARM_KVM_HOST_H__ */ 302#endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
index 5e3c673fa3f4..5db2d4c6a55f 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -203,6 +203,14 @@ struct kvm_arch_memory_slot {
203#define KVM_DEV_ARM_VGIC_LINE_LEVEL_INTID_MASK 0x3ff 203#define KVM_DEV_ARM_VGIC_LINE_LEVEL_INTID_MASK 0x3ff
204#define VGIC_LEVEL_INFO_LINE_LEVEL 0 204#define VGIC_LEVEL_INFO_LINE_LEVEL 0
205 205
206/* Device Control API on vcpu fd */
207#define KVM_ARM_VCPU_PMU_V3_CTRL 0
208#define KVM_ARM_VCPU_PMU_V3_IRQ 0
209#define KVM_ARM_VCPU_PMU_V3_INIT 1
210#define KVM_ARM_VCPU_TIMER_CTRL 1
211#define KVM_ARM_VCPU_TIMER_IRQ_VTIMER 0
212#define KVM_ARM_VCPU_TIMER_IRQ_PTIMER 1
213
206#define KVM_DEV_ARM_VGIC_CTRL_INIT 0 214#define KVM_DEV_ARM_VGIC_CTRL_INIT 0
207#define KVM_DEV_ARM_ITS_SAVE_TABLES 1 215#define KVM_DEV_ARM_ITS_SAVE_TABLES 1
208#define KVM_DEV_ARM_ITS_RESTORE_TABLES 2 216#define KVM_DEV_ARM_ITS_RESTORE_TABLES 2
diff --git a/arch/arm/kvm/guest.c b/arch/arm/kvm/guest.c
index fa6182a40941..1e0784ebbfd6 100644
--- a/arch/arm/kvm/guest.c
+++ b/arch/arm/kvm/guest.c
@@ -301,3 +301,54 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
301{ 301{
302 return -EINVAL; 302 return -EINVAL;
303} 303}
304
305int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
306 struct kvm_device_attr *attr)
307{
308 int ret;
309
310 switch (attr->group) {
311 case KVM_ARM_VCPU_TIMER_CTRL:
312 ret = kvm_arm_timer_set_attr(vcpu, attr);
313 break;
314 default:
315 ret = -ENXIO;
316 break;
317 }
318
319 return ret;
320}
321
322int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
323 struct kvm_device_attr *attr)
324{
325 int ret;
326
327 switch (attr->group) {
328 case KVM_ARM_VCPU_TIMER_CTRL:
329 ret = kvm_arm_timer_get_attr(vcpu, attr);
330 break;
331 default:
332 ret = -ENXIO;
333 break;
334 }
335
336 return ret;
337}
338
339int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
340 struct kvm_device_attr *attr)
341{
342 int ret;
343
344 switch (attr->group) {
345 case KVM_ARM_VCPU_TIMER_CTRL:
346 ret = kvm_arm_timer_has_attr(vcpu, attr);
347 break;
348 default:
349 ret = -ENXIO;
350 break;
351 }
352
353 return ret;
354}
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index f86a9aaef462..54442e375354 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -72,6 +72,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
72 trace_kvm_wfx(*vcpu_pc(vcpu), false); 72 trace_kvm_wfx(*vcpu_pc(vcpu), false);
73 vcpu->stat.wfi_exit_stat++; 73 vcpu->stat.wfi_exit_stat++;
74 kvm_vcpu_block(vcpu); 74 kvm_vcpu_block(vcpu);
75 kvm_clear_request(KVM_REQ_UNHALT, vcpu);
75 } 76 }
76 77
77 kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu)); 78 kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
diff --git a/arch/arm/kvm/hyp/switch.c b/arch/arm/kvm/hyp/switch.c
index 624a510d31df..ebd2dd46adf7 100644
--- a/arch/arm/kvm/hyp/switch.c
+++ b/arch/arm/kvm/hyp/switch.c
@@ -237,8 +237,10 @@ void __hyp_text __noreturn __hyp_panic(int cause)
237 237
238 vcpu = (struct kvm_vcpu *)read_sysreg(HTPIDR); 238 vcpu = (struct kvm_vcpu *)read_sysreg(HTPIDR);
239 host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context); 239 host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
240 __timer_save_state(vcpu);
240 __deactivate_traps(vcpu); 241 __deactivate_traps(vcpu);
241 __deactivate_vm(vcpu); 242 __deactivate_vm(vcpu);
243 __banked_restore_state(host_ctxt);
242 __sysreg_restore_state(host_ctxt); 244 __sysreg_restore_state(host_ctxt);
243 } 245 }
244 246
diff --git a/arch/arm/kvm/reset.c b/arch/arm/kvm/reset.c
index 1da8b2d14550..5ed0c3ee33d6 100644
--- a/arch/arm/kvm/reset.c
+++ b/arch/arm/kvm/reset.c
@@ -37,16 +37,6 @@ static struct kvm_regs cortexa_regs_reset = {
37 .usr_regs.ARM_cpsr = SVC_MODE | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT, 37 .usr_regs.ARM_cpsr = SVC_MODE | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT,
38}; 38};
39 39
40static const struct kvm_irq_level cortexa_ptimer_irq = {
41 { .irq = 30 },
42 .level = 1,
43};
44
45static const struct kvm_irq_level cortexa_vtimer_irq = {
46 { .irq = 27 },
47 .level = 1,
48};
49
50 40
51/******************************************************************************* 41/*******************************************************************************
52 * Exported reset function 42 * Exported reset function
@@ -62,16 +52,12 @@ static const struct kvm_irq_level cortexa_vtimer_irq = {
62int kvm_reset_vcpu(struct kvm_vcpu *vcpu) 52int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
63{ 53{
64 struct kvm_regs *reset_regs; 54 struct kvm_regs *reset_regs;
65 const struct kvm_irq_level *cpu_vtimer_irq;
66 const struct kvm_irq_level *cpu_ptimer_irq;
67 55
68 switch (vcpu->arch.target) { 56 switch (vcpu->arch.target) {
69 case KVM_ARM_TARGET_CORTEX_A7: 57 case KVM_ARM_TARGET_CORTEX_A7:
70 case KVM_ARM_TARGET_CORTEX_A15: 58 case KVM_ARM_TARGET_CORTEX_A15:
71 reset_regs = &cortexa_regs_reset; 59 reset_regs = &cortexa_regs_reset;
72 vcpu->arch.midr = read_cpuid_id(); 60 vcpu->arch.midr = read_cpuid_id();
73 cpu_vtimer_irq = &cortexa_vtimer_irq;
74 cpu_ptimer_irq = &cortexa_ptimer_irq;
75 break; 61 break;
76 default: 62 default:
77 return -ENODEV; 63 return -ENODEV;
@@ -84,5 +70,5 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
84 kvm_reset_coprocs(vcpu); 70 kvm_reset_coprocs(vcpu);
85 71
86 /* Reset arch_timer context */ 72 /* Reset arch_timer context */
87 return kvm_timer_vcpu_reset(vcpu, cpu_vtimer_irq, cpu_ptimer_irq); 73 return kvm_timer_vcpu_reset(vcpu);
88} 74}
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f7a934ff707..192208ea2842 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -488,6 +488,17 @@ config CAVIUM_ERRATUM_27456
488 488
489 If unsure, say Y. 489 If unsure, say Y.
490 490
491config CAVIUM_ERRATUM_30115
492 bool "Cavium erratum 30115: Guest may disable interrupts in host"
493 default y
494 help
495 On ThunderX T88 pass 1.x through 2.2, T81 pass 1.0 through
496 1.2, and T83 Pass 1.0, KVM guest execution may disable
497 interrupts in host. Trapping both GICv3 group-0 and group-1
498 accesses sidesteps the issue.
499
500 If unsure, say Y.
501
491config QCOM_FALKOR_ERRATUM_1003 502config QCOM_FALKOR_ERRATUM_1003
492 bool "Falkor E1003: Incorrect translation due to ASID change" 503 bool "Falkor E1003: Incorrect translation due to ASID change"
493 default y 504 default y
diff --git a/arch/arm64/include/asm/arch_gicv3.h b/arch/arm64/include/asm/arch_gicv3.h
index 1a98bc8602a2..8cef47fa2218 100644
--- a/arch/arm64/include/asm/arch_gicv3.h
+++ b/arch/arm64/include/asm/arch_gicv3.h
@@ -89,7 +89,7 @@ static inline void gic_write_ctlr(u32 val)
89 89
90static inline void gic_write_grpen1(u32 val) 90static inline void gic_write_grpen1(u32 val)
91{ 91{
92 write_sysreg_s(val, SYS_ICC_GRPEN1_EL1); 92 write_sysreg_s(val, SYS_ICC_IGRPEN1_EL1);
93 isb(); 93 isb();
94} 94}
95 95
diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index b3aab8a17868..8d2272c6822c 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -38,7 +38,8 @@
38#define ARM64_WORKAROUND_REPEAT_TLBI 17 38#define ARM64_WORKAROUND_REPEAT_TLBI 17
39#define ARM64_WORKAROUND_QCOM_FALKOR_E1003 18 39#define ARM64_WORKAROUND_QCOM_FALKOR_E1003 18
40#define ARM64_WORKAROUND_858921 19 40#define ARM64_WORKAROUND_858921 19
41#define ARM64_WORKAROUND_CAVIUM_30115 20
41 42
42#define ARM64_NCAPS 20 43#define ARM64_NCAPS 21
43 44
44#endif /* __ASM_CPUCAPS_H */ 45#endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 0984d1b3a8f2..235e77d98261 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -86,6 +86,7 @@
86 86
87#define CAVIUM_CPU_PART_THUNDERX 0x0A1 87#define CAVIUM_CPU_PART_THUNDERX 0x0A1
88#define CAVIUM_CPU_PART_THUNDERX_81XX 0x0A2 88#define CAVIUM_CPU_PART_THUNDERX_81XX 0x0A2
89#define CAVIUM_CPU_PART_THUNDERX_83XX 0x0A3
89 90
90#define BRCM_CPU_PART_VULCAN 0x516 91#define BRCM_CPU_PART_VULCAN 0x516
91 92
@@ -96,6 +97,7 @@
96#define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73) 97#define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73)
97#define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX) 98#define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
98#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX) 99#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)
100#define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX)
99#define MIDR_QCOM_FALKOR_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR_V1) 101#define MIDR_QCOM_FALKOR_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR_V1)
100 102
101#ifndef __ASSEMBLY__ 103#ifndef __ASSEMBLY__
diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 28bf02efce76..8cabd57b6348 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -19,6 +19,7 @@
19#define __ASM_ESR_H 19#define __ASM_ESR_H
20 20
21#include <asm/memory.h> 21#include <asm/memory.h>
22#include <asm/sysreg.h>
22 23
23#define ESR_ELx_EC_UNKNOWN (0x00) 24#define ESR_ELx_EC_UNKNOWN (0x00)
24#define ESR_ELx_EC_WFx (0x01) 25#define ESR_ELx_EC_WFx (0x01)
@@ -182,6 +183,29 @@
182#define ESR_ELx_SYS64_ISS_SYS_CNTFRQ (ESR_ELx_SYS64_ISS_SYS_VAL(3, 3, 0, 14, 0) | \ 183#define ESR_ELx_SYS64_ISS_SYS_CNTFRQ (ESR_ELx_SYS64_ISS_SYS_VAL(3, 3, 0, 14, 0) | \
183 ESR_ELx_SYS64_ISS_DIR_READ) 184 ESR_ELx_SYS64_ISS_DIR_READ)
184 185
186#define esr_sys64_to_sysreg(e) \
187 sys_reg((((e) & ESR_ELx_SYS64_ISS_OP0_MASK) >> \
188 ESR_ELx_SYS64_ISS_OP0_SHIFT), \
189 (((e) & ESR_ELx_SYS64_ISS_OP1_MASK) >> \
190 ESR_ELx_SYS64_ISS_OP1_SHIFT), \
191 (((e) & ESR_ELx_SYS64_ISS_CRN_MASK) >> \
192 ESR_ELx_SYS64_ISS_CRN_SHIFT), \
193 (((e) & ESR_ELx_SYS64_ISS_CRM_MASK) >> \
194 ESR_ELx_SYS64_ISS_CRM_SHIFT), \
195 (((e) & ESR_ELx_SYS64_ISS_OP2_MASK) >> \
196 ESR_ELx_SYS64_ISS_OP2_SHIFT))
197
198#define esr_cp15_to_sysreg(e) \
199 sys_reg(3, \
200 (((e) & ESR_ELx_SYS64_ISS_OP1_MASK) >> \
201 ESR_ELx_SYS64_ISS_OP1_SHIFT), \
202 (((e) & ESR_ELx_SYS64_ISS_CRN_MASK) >> \
203 ESR_ELx_SYS64_ISS_CRN_SHIFT), \
204 (((e) & ESR_ELx_SYS64_ISS_CRM_MASK) >> \
205 ESR_ELx_SYS64_ISS_CRM_SHIFT), \
206 (((e) & ESR_ELx_SYS64_ISS_OP2_MASK) >> \
207 ESR_ELx_SYS64_ISS_OP2_SHIFT))
208
185#ifndef __ASSEMBLY__ 209#ifndef __ASSEMBLY__
186#include <asm/types.h> 210#include <asm/types.h>
187 211
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1f252a95bc02..d68630007b14 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -42,7 +42,9 @@
42 42
43#define KVM_VCPU_MAX_FEATURES 4 43#define KVM_VCPU_MAX_FEATURES 4
44 44
45#define KVM_REQ_VCPU_EXIT (8 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 45#define KVM_REQ_SLEEP \
46 KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
47#define KVM_REQ_IRQ_PENDING KVM_ARCH_REQ(1)
46 48
47int __attribute_const__ kvm_target_cpu(void); 49int __attribute_const__ kvm_target_cpu(void);
48int kvm_reset_vcpu(struct kvm_vcpu *vcpu); 50int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
@@ -334,8 +336,6 @@ struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
334struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void); 336struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
335void kvm_arm_halt_guest(struct kvm *kvm); 337void kvm_arm_halt_guest(struct kvm *kvm);
336void kvm_arm_resume_guest(struct kvm *kvm); 338void kvm_arm_resume_guest(struct kvm *kvm);
337void kvm_arm_halt_vcpu(struct kvm_vcpu *vcpu);
338void kvm_arm_resume_vcpu(struct kvm_vcpu *vcpu);
339 339
340u64 __kvm_call_hyp(void *hypfn, ...); 340u64 __kvm_call_hyp(void *hypfn, ...);
341#define kvm_call_hyp(f, ...) __kvm_call_hyp(kvm_ksym_ref(f), ##__VA_ARGS__) 341#define kvm_call_hyp(f, ...) __kvm_call_hyp(kvm_ksym_ref(f), ##__VA_ARGS__)
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index b18e852d27e8..4572a9b560fa 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -127,6 +127,7 @@ int __vgic_v2_perform_cpuif_access(struct kvm_vcpu *vcpu);
127 127
128void __vgic_v3_save_state(struct kvm_vcpu *vcpu); 128void __vgic_v3_save_state(struct kvm_vcpu *vcpu);
129void __vgic_v3_restore_state(struct kvm_vcpu *vcpu); 129void __vgic_v3_restore_state(struct kvm_vcpu *vcpu);
130int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu);
130 131
131void __timer_save_state(struct kvm_vcpu *vcpu); 132void __timer_save_state(struct kvm_vcpu *vcpu);
132void __timer_restore_state(struct kvm_vcpu *vcpu); 133void __timer_restore_state(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index b4d13d9267ff..16e44fa9b3b6 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -180,14 +180,31 @@
180 180
181#define SYS_VBAR_EL1 sys_reg(3, 0, 12, 0, 0) 181#define SYS_VBAR_EL1 sys_reg(3, 0, 12, 0, 0)
182 182
183#define SYS_ICC_IAR0_EL1 sys_reg(3, 0, 12, 8, 0)
184#define SYS_ICC_EOIR0_EL1 sys_reg(3, 0, 12, 8, 1)
185#define SYS_ICC_HPPIR0_EL1 sys_reg(3, 0, 12, 8, 2)
186#define SYS_ICC_BPR0_EL1 sys_reg(3, 0, 12, 8, 3)
187#define SYS_ICC_AP0Rn_EL1(n) sys_reg(3, 0, 12, 8, 4 | n)
188#define SYS_ICC_AP0R0_EL1 SYS_ICC_AP0Rn_EL1(0)
189#define SYS_ICC_AP0R1_EL1 SYS_ICC_AP0Rn_EL1(1)
190#define SYS_ICC_AP0R2_EL1 SYS_ICC_AP0Rn_EL1(2)
191#define SYS_ICC_AP0R3_EL1 SYS_ICC_AP0Rn_EL1(3)
192#define SYS_ICC_AP1Rn_EL1(n) sys_reg(3, 0, 12, 9, n)
193#define SYS_ICC_AP1R0_EL1 SYS_ICC_AP1Rn_EL1(0)
194#define SYS_ICC_AP1R1_EL1 SYS_ICC_AP1Rn_EL1(1)
195#define SYS_ICC_AP1R2_EL1 SYS_ICC_AP1Rn_EL1(2)
196#define SYS_ICC_AP1R3_EL1 SYS_ICC_AP1Rn_EL1(3)
183#define SYS_ICC_DIR_EL1 sys_reg(3, 0, 12, 11, 1) 197#define SYS_ICC_DIR_EL1 sys_reg(3, 0, 12, 11, 1)
198#define SYS_ICC_RPR_EL1 sys_reg(3, 0, 12, 11, 3)
184#define SYS_ICC_SGI1R_EL1 sys_reg(3, 0, 12, 11, 5) 199#define SYS_ICC_SGI1R_EL1 sys_reg(3, 0, 12, 11, 5)
185#define SYS_ICC_IAR1_EL1 sys_reg(3, 0, 12, 12, 0) 200#define SYS_ICC_IAR1_EL1 sys_reg(3, 0, 12, 12, 0)
186#define SYS_ICC_EOIR1_EL1 sys_reg(3, 0, 12, 12, 1) 201#define SYS_ICC_EOIR1_EL1 sys_reg(3, 0, 12, 12, 1)
202#define SYS_ICC_HPPIR1_EL1 sys_reg(3, 0, 12, 12, 2)
187#define SYS_ICC_BPR1_EL1 sys_reg(3, 0, 12, 12, 3) 203#define SYS_ICC_BPR1_EL1 sys_reg(3, 0, 12, 12, 3)
188#define SYS_ICC_CTLR_EL1 sys_reg(3, 0, 12, 12, 4) 204#define SYS_ICC_CTLR_EL1 sys_reg(3, 0, 12, 12, 4)
189#define SYS_ICC_SRE_EL1 sys_reg(3, 0, 12, 12, 5) 205#define SYS_ICC_SRE_EL1 sys_reg(3, 0, 12, 12, 5)
190#define SYS_ICC_GRPEN1_EL1 sys_reg(3, 0, 12, 12, 7) 206#define SYS_ICC_IGRPEN0_EL1 sys_reg(3, 0, 12, 12, 6)
207#define SYS_ICC_IGRPEN1_EL1 sys_reg(3, 0, 12, 12, 7)
191 208
192#define SYS_CONTEXTIDR_EL1 sys_reg(3, 0, 13, 0, 1) 209#define SYS_CONTEXTIDR_EL1 sys_reg(3, 0, 13, 0, 1)
193#define SYS_TPIDR_EL1 sys_reg(3, 0, 13, 0, 4) 210#define SYS_TPIDR_EL1 sys_reg(3, 0, 13, 0, 4)
@@ -287,8 +304,8 @@
287#define SCTLR_ELx_M 1 304#define SCTLR_ELx_M 1
288 305
289#define SCTLR_EL2_RES1 ((1 << 4) | (1 << 5) | (1 << 11) | (1 << 16) | \ 306#define SCTLR_EL2_RES1 ((1 << 4) | (1 << 5) | (1 << 11) | (1 << 16) | \
290 (1 << 16) | (1 << 18) | (1 << 22) | (1 << 23) | \ 307 (1 << 18) | (1 << 22) | (1 << 23) | (1 << 28) | \
291 (1 << 28) | (1 << 29)) 308 (1 << 29))
292 309
293#define SCTLR_ELx_FLAGS (SCTLR_ELx_M | SCTLR_ELx_A | SCTLR_ELx_C | \ 310#define SCTLR_ELx_FLAGS (SCTLR_ELx_M | SCTLR_ELx_A | SCTLR_ELx_C | \
294 SCTLR_ELx_SA | SCTLR_ELx_I) 311 SCTLR_ELx_SA | SCTLR_ELx_I)
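
The SYS_ICC_* additions above are all produced by the sys_reg() helper, which packs an (Op0, Op1, CRn, CRm, Op2) tuple into one integer. A minimal stand-alone sketch of that packing, assuming the usual arm64 MRS/MSR field positions (Op0 at bit 19, Op1 at 16, CRn at 12, CRm at 8, Op2 at 5), shows how SYS_ICC_AP0Rn_EL1(n) lands on Op2 values 4..7:

    /* sketch: packing Op0/Op1/CRn/CRm/Op2 the way sys_reg() is assumed to */
    #include <stdio.h>

    #define SR_OP0_SHIFT 19
    #define SR_OP1_SHIFT 16
    #define SR_CRN_SHIFT 12
    #define SR_CRM_SHIFT  8
    #define SR_OP2_SHIFT  5

    static unsigned int sys_reg(unsigned int op0, unsigned int op1,
                                unsigned int crn, unsigned int crm,
                                unsigned int op2)
    {
            return (op0 << SR_OP0_SHIFT) | (op1 << SR_OP1_SHIFT) |
                   (crn << SR_CRN_SHIFT) | (crm << SR_CRM_SHIFT) |
                   (op2 << SR_OP2_SHIFT);
    }

    int main(void)
    {
            int n;

            /* SYS_ICC_AP0Rn_EL1(n) = sys_reg(3, 0, 12, 8, 4 | n): n = 0..3 -> Op2 = 4..7 */
            for (n = 0; n < 4; n++)
                    printf("SYS_ICC_AP0R%d_EL1 = 0x%x\n", n, sys_reg(3, 0, 12, 8, 4 | n));
            /* SYS_ICC_IGRPEN1_EL1 keeps the old GRPEN1 encoding (3, 0, 12, 12, 7) */
            printf("SYS_ICC_IGRPEN1_EL1 = 0x%x\n", sys_reg(3, 0, 12, 12, 7));
            return 0;
    }
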
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 70eea2ecc663..9f3ca24bbcc6 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -232,6 +232,9 @@ struct kvm_arch_memory_slot {
232#define KVM_ARM_VCPU_PMU_V3_CTRL 0 232#define KVM_ARM_VCPU_PMU_V3_CTRL 0
233#define KVM_ARM_VCPU_PMU_V3_IRQ 0 233#define KVM_ARM_VCPU_PMU_V3_IRQ 0
234#define KVM_ARM_VCPU_PMU_V3_INIT 1 234#define KVM_ARM_VCPU_PMU_V3_INIT 1
235#define KVM_ARM_VCPU_TIMER_CTRL 1
236#define KVM_ARM_VCPU_TIMER_IRQ_VTIMER 0
237#define KVM_ARM_VCPU_TIMER_IRQ_PTIMER 1
235 238
236/* KVM_IRQ_LINE irq field index values */ 239/* KVM_IRQ_LINE irq field index values */
237#define KVM_ARM_IRQ_TYPE_SHIFT 24 240#define KVM_ARM_IRQ_TYPE_SHIFT 24
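
The three KVM_ARM_VCPU_TIMER_* constants above are consumed through the per-vCPU device-attribute ioctls, as described in Documentation/virtual/kvm/devices/vcpu.txt. A hedged userspace sketch (set_vtimer_irq() and its error handling are illustrative, not taken from any VMM; the attribute payload is assumed to be a pointer to an int holding the PPI number):

    /* sketch: picking the virtual timer PPI for a vCPU via KVM_SET_DEVICE_ATTR */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    int set_vtimer_irq(int vcpu_fd, int ppi)
    {
            struct kvm_device_attr attr = {
                    .group = KVM_ARM_VCPU_TIMER_CTRL,
                    .attr  = KVM_ARM_VCPU_TIMER_IRQ_VTIMER,
                    .addr  = (__u64)(unsigned long)&ppi,   /* pointer to the IRQ number */
            };

            if (ioctl(vcpu_fd, KVM_HAS_DEVICE_ATTR, &attr))
                    return -1;      /* kernel without this attribute */
            return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
    }

The same shape with KVM_ARM_VCPU_TIMER_IRQ_PTIMER selects the physical timer interrupt; configuring either is only expected to succeed before the vCPU has first run.
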
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 2ed2a7657711..0e27f86ee709 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -133,6 +133,27 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
133 MIDR_RANGE(MIDR_THUNDERX_81XX, 0x00, 0x00), 133 MIDR_RANGE(MIDR_THUNDERX_81XX, 0x00, 0x00),
134 }, 134 },
135#endif 135#endif
136#ifdef CONFIG_CAVIUM_ERRATUM_30115
137 {
138 /* Cavium ThunderX, T88 pass 1.x - 2.2 */
139 .desc = "Cavium erratum 30115",
140 .capability = ARM64_WORKAROUND_CAVIUM_30115,
141 MIDR_RANGE(MIDR_THUNDERX, 0x00,
142 (1 << MIDR_VARIANT_SHIFT) | 2),
143 },
144 {
145 /* Cavium ThunderX, T81 pass 1.0 - 1.2 */
146 .desc = "Cavium erratum 30115",
147 .capability = ARM64_WORKAROUND_CAVIUM_30115,
148 MIDR_RANGE(MIDR_THUNDERX_81XX, 0x00, 0x02),
149 },
150 {
151 /* Cavium ThunderX, T83 pass 1.0 */
152 .desc = "Cavium erratum 30115",
153 .capability = ARM64_WORKAROUND_CAVIUM_30115,
154 MIDR_RANGE(MIDR_THUNDERX_83XX, 0x00, 0x00),
155 },
156#endif
136 { 157 {
137 .desc = "Mismatched cache line size", 158 .desc = "Mismatched cache line size",
138 .capability = ARM64_MISMATCHED_CACHE_LINE_SIZE, 159 .capability = ARM64_MISMATCHED_CACHE_LINE_SIZE,
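
The MIDR_RANGE() bounds in the new erratum entries are composite (variant, revision) values; "(1 << MIDR_VARIANT_SHIFT) | 2" is ThunderX T88 pass 2.2. A stand-alone sketch of the comparison the capability matching effectively performs, assuming the architected MIDR_EL1 layout (variant in bits [23:20], revision in bits [3:0]) and an illustrative model value:

    /* sketch: is this MIDR within a model's (variant,revision) window? */
    #include <stdbool.h>
    #include <stdio.h>

    #define MIDR_REV_MASK       0x0000000fu
    #define MIDR_VARIANT_SHIFT  20
    #define MIDR_VARIANT_MASK   (0xfu << MIDR_VARIANT_SHIFT)

    static bool midr_in_range(unsigned int midr, unsigned int model,
                              unsigned int rv_min, unsigned int rv_max)
    {
            unsigned int rv = midr & (MIDR_VARIANT_MASK | MIDR_REV_MASK);

            if ((midr & ~(MIDR_VARIANT_MASK | MIDR_REV_MASK)) != model)
                    return false;
            return rv >= rv_min && rv <= rv_max;
    }

    int main(void)
    {
            /* illustrative ThunderX T88 model value (implementer 0x43, part 0x0a1); assumed */
            unsigned int t88      = 0x430f0a10;
            unsigned int pass_2_1 = t88 | (1u << MIDR_VARIANT_SHIFT) | 1;  /* variant 1, rev 1 */

            /* erratum 30115, T88: passes 1.x - 2.2, i.e. 0x00 .. variant 1 / revision 2 */
            printf("pass 2.1 affected: %d\n",
                   midr_in_range(pass_2_1, t88, 0x00, (1u << MIDR_VARIANT_SHIFT) | 2));
            return 0;
    }
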
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index b37446a8ffdb..5c7f657dd207 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -390,6 +390,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
390 case KVM_ARM_VCPU_PMU_V3_CTRL: 390 case KVM_ARM_VCPU_PMU_V3_CTRL:
391 ret = kvm_arm_pmu_v3_set_attr(vcpu, attr); 391 ret = kvm_arm_pmu_v3_set_attr(vcpu, attr);
392 break; 392 break;
393 case KVM_ARM_VCPU_TIMER_CTRL:
394 ret = kvm_arm_timer_set_attr(vcpu, attr);
395 break;
393 default: 396 default:
394 ret = -ENXIO; 397 ret = -ENXIO;
395 break; 398 break;
@@ -407,6 +410,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
407 case KVM_ARM_VCPU_PMU_V3_CTRL: 410 case KVM_ARM_VCPU_PMU_V3_CTRL:
408 ret = kvm_arm_pmu_v3_get_attr(vcpu, attr); 411 ret = kvm_arm_pmu_v3_get_attr(vcpu, attr);
409 break; 412 break;
413 case KVM_ARM_VCPU_TIMER_CTRL:
414 ret = kvm_arm_timer_get_attr(vcpu, attr);
415 break;
410 default: 416 default:
411 ret = -ENXIO; 417 ret = -ENXIO;
412 break; 418 break;
@@ -424,6 +430,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
424 case KVM_ARM_VCPU_PMU_V3_CTRL: 430 case KVM_ARM_VCPU_PMU_V3_CTRL:
425 ret = kvm_arm_pmu_v3_has_attr(vcpu, attr); 431 ret = kvm_arm_pmu_v3_has_attr(vcpu, attr);
426 break; 432 break;
433 case KVM_ARM_VCPU_TIMER_CTRL:
434 ret = kvm_arm_timer_has_attr(vcpu, attr);
435 break;
427 default: 436 default:
428 ret = -ENXIO; 437 ret = -ENXIO;
429 break; 438 break;
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index fa1b18e364fc..17d8a1677a0b 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -89,6 +89,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
89 trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false); 89 trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
90 vcpu->stat.wfi_exit_stat++; 90 vcpu->stat.wfi_exit_stat++;
91 kvm_vcpu_block(vcpu); 91 kvm_vcpu_block(vcpu);
92 kvm_clear_request(KVM_REQ_UNHALT, vcpu);
92 } 93 }
93 94
94 kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu)); 95 kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
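
kvm_vcpu_block() can leave KVM_REQ_UNHALT set when the vCPU is woken, and arm/arm64 never consume that request, so the added kvm_clear_request() simply discards it before re-entering the guest. A compact stand-alone model of the request-bit pattern this relies on (a sketch in the spirit of the vcpu-requests.rst documentation added by this merge, not the kernel's actual helpers):

    /* sketch: VCPU requests as bits in one atomic word */
    #include <stdatomic.h>
    #include <stdbool.h>

    struct vcpu { _Atomic unsigned long requests; };

    #define REQ_UNHALT 0    /* bit number; real request macros may also carry flag bits */

    static void req_make(struct vcpu *v, int req)
    {
            /* release: publish any payload written before raising the request */
            atomic_fetch_or_explicit(&v->requests, 1UL << req, memory_order_release);
    }

    static bool req_check(struct vcpu *v, int req)
    {
            /* consume side: test and clear (only the vCPU thread does this) */
            unsigned long old = atomic_fetch_and_explicit(&v->requests, ~(1UL << req),
                                                          memory_order_acq_rel);
            return old & (1UL << req);
    }

    static void req_clear(struct vcpu *v, int req)
    {
            /* what kvm_handle_wfx() now does with KVM_REQ_UNHALT: drop it unseen */
            atomic_fetch_and_explicit(&v->requests, ~(1UL << req), memory_order_relaxed);
    }

    int main(void)
    {
            struct vcpu v = { .requests = 0 };

            req_make(&v, REQ_UNHALT);
            req_clear(&v, REQ_UNHALT);
            return req_check(&v, REQ_UNHALT);   /* 0: nothing left pending */
    }
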
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index aede1658aeda..945e79c641c4 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -350,6 +350,20 @@ again:
350 } 350 }
351 } 351 }
352 352
353 if (static_branch_unlikely(&vgic_v3_cpuif_trap) &&
354 exit_code == ARM_EXCEPTION_TRAP &&
355 (kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_SYS64 ||
356 kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_CP15_32)) {
357 int ret = __vgic_v3_perform_cpuif_access(vcpu);
358
359 if (ret == 1) {
360 __skip_instr(vcpu);
361 goto again;
362 }
363
364 /* 0 falls through to be handled out of EL2 */
365 }
366
353 fp_enabled = __fpsimd_enabled(); 367 fp_enabled = __fpsimd_enabled();
354 368
355 __sysreg_save_guest_state(guest_ctxt); 369 __sysreg_save_guest_state(guest_ctxt);
@@ -422,6 +436,7 @@ void __hyp_text __noreturn __hyp_panic(void)
422 436
423 vcpu = (struct kvm_vcpu *)read_sysreg(tpidr_el2); 437 vcpu = (struct kvm_vcpu *)read_sysreg(tpidr_el2);
424 host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context); 438 host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
439 __timer_save_state(vcpu);
425 __deactivate_traps(vcpu); 440 __deactivate_traps(vcpu);
426 __deactivate_vm(vcpu); 441 __deactivate_vm(vcpu);
427 __sysreg_restore_host_state(host_ctxt); 442 __sysreg_restore_host_state(host_ctxt);
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 561badf93de8..3256b9228e75 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -46,16 +46,6 @@ static const struct kvm_regs default_regs_reset32 = {
46 COMPAT_PSR_I_BIT | COMPAT_PSR_F_BIT), 46 COMPAT_PSR_I_BIT | COMPAT_PSR_F_BIT),
47}; 47};
48 48
49static const struct kvm_irq_level default_ptimer_irq = {
50 .irq = 30,
51 .level = 1,
52};
53
54static const struct kvm_irq_level default_vtimer_irq = {
55 .irq = 27,
56 .level = 1,
57};
58
59static bool cpu_has_32bit_el1(void) 49static bool cpu_has_32bit_el1(void)
60{ 50{
61 u64 pfr0; 51 u64 pfr0;
@@ -108,8 +98,6 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
108 */ 98 */
109int kvm_reset_vcpu(struct kvm_vcpu *vcpu) 99int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
110{ 100{
111 const struct kvm_irq_level *cpu_vtimer_irq;
112 const struct kvm_irq_level *cpu_ptimer_irq;
113 const struct kvm_regs *cpu_reset; 101 const struct kvm_regs *cpu_reset;
114 102
115 switch (vcpu->arch.target) { 103 switch (vcpu->arch.target) {
@@ -122,8 +110,6 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
122 cpu_reset = &default_regs_reset; 110 cpu_reset = &default_regs_reset;
123 } 111 }
124 112
125 cpu_vtimer_irq = &default_vtimer_irq;
126 cpu_ptimer_irq = &default_ptimer_irq;
127 break; 113 break;
128 } 114 }
129 115
@@ -137,5 +123,5 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
137 kvm_pmu_vcpu_reset(vcpu); 123 kvm_pmu_vcpu_reset(vcpu);
138 124
139 /* Reset timer */ 125 /* Reset timer */
140 return kvm_timer_vcpu_reset(vcpu, cpu_vtimer_irq, cpu_ptimer_irq); 126 return kvm_timer_vcpu_reset(vcpu);
141} 127}
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 0fe27024a2e1..77862881ae86 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -56,7 +56,8 @@
56 */ 56 */
57 57
58static bool read_from_write_only(struct kvm_vcpu *vcpu, 58static bool read_from_write_only(struct kvm_vcpu *vcpu,
59 const struct sys_reg_params *params) 59 struct sys_reg_params *params,
60 const struct sys_reg_desc *r)
60{ 61{
61 WARN_ONCE(1, "Unexpected sys_reg read to write-only register\n"); 62 WARN_ONCE(1, "Unexpected sys_reg read to write-only register\n");
62 print_sys_reg_instr(params); 63 print_sys_reg_instr(params);
@@ -64,6 +65,16 @@ static bool read_from_write_only(struct kvm_vcpu *vcpu,
64 return false; 65 return false;
65} 66}
66 67
68static bool write_to_read_only(struct kvm_vcpu *vcpu,
69 struct sys_reg_params *params,
70 const struct sys_reg_desc *r)
71{
72 WARN_ONCE(1, "Unexpected sys_reg write to read-only register\n");
73 print_sys_reg_instr(params);
74 kvm_inject_undefined(vcpu);
75 return false;
76}
77
67/* 3 bits per cache level, as per CLIDR, but non-existent caches always 0 */ 78/* 3 bits per cache level, as per CLIDR, but non-existent caches always 0 */
68static u32 cache_levels; 79static u32 cache_levels;
69 80
@@ -93,7 +104,7 @@ static bool access_dcsw(struct kvm_vcpu *vcpu,
93 const struct sys_reg_desc *r) 104 const struct sys_reg_desc *r)
94{ 105{
95 if (!p->is_write) 106 if (!p->is_write)
96 return read_from_write_only(vcpu, p); 107 return read_from_write_only(vcpu, p, r);
97 108
98 kvm_set_way_flush(vcpu); 109 kvm_set_way_flush(vcpu);
99 return true; 110 return true;
@@ -135,7 +146,7 @@ static bool access_gic_sgi(struct kvm_vcpu *vcpu,
135 const struct sys_reg_desc *r) 146 const struct sys_reg_desc *r)
136{ 147{
137 if (!p->is_write) 148 if (!p->is_write)
138 return read_from_write_only(vcpu, p); 149 return read_from_write_only(vcpu, p, r);
139 150
140 vgic_v3_dispatch_sgi(vcpu, p->regval); 151 vgic_v3_dispatch_sgi(vcpu, p->regval);
141 152
@@ -773,7 +784,7 @@ static bool access_pmswinc(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
773 return trap_raz_wi(vcpu, p, r); 784 return trap_raz_wi(vcpu, p, r);
774 785
775 if (!p->is_write) 786 if (!p->is_write)
776 return read_from_write_only(vcpu, p); 787 return read_from_write_only(vcpu, p, r);
777 788
778 if (pmu_write_swinc_el0_disabled(vcpu)) 789 if (pmu_write_swinc_el0_disabled(vcpu))
779 return false; 790 return false;
@@ -953,7 +964,15 @@ static const struct sys_reg_desc sys_reg_descs[] = {
953 964
954 { SYS_DESC(SYS_VBAR_EL1), NULL, reset_val, VBAR_EL1, 0 }, 965 { SYS_DESC(SYS_VBAR_EL1), NULL, reset_val, VBAR_EL1, 0 },
955 966
967 { SYS_DESC(SYS_ICC_IAR0_EL1), write_to_read_only },
968 { SYS_DESC(SYS_ICC_EOIR0_EL1), read_from_write_only },
969 { SYS_DESC(SYS_ICC_HPPIR0_EL1), write_to_read_only },
970 { SYS_DESC(SYS_ICC_DIR_EL1), read_from_write_only },
971 { SYS_DESC(SYS_ICC_RPR_EL1), write_to_read_only },
956 { SYS_DESC(SYS_ICC_SGI1R_EL1), access_gic_sgi }, 972 { SYS_DESC(SYS_ICC_SGI1R_EL1), access_gic_sgi },
973 { SYS_DESC(SYS_ICC_IAR1_EL1), write_to_read_only },
974 { SYS_DESC(SYS_ICC_EOIR1_EL1), read_from_write_only },
975 { SYS_DESC(SYS_ICC_HPPIR1_EL1), write_to_read_only },
957 { SYS_DESC(SYS_ICC_SRE_EL1), access_gic_sre }, 976 { SYS_DESC(SYS_ICC_SRE_EL1), access_gic_sre },
958 977
959 { SYS_DESC(SYS_CONTEXTIDR_EL1), access_vm_reg, reset_val, CONTEXTIDR_EL1, 0 }, 978 { SYS_DESC(SYS_CONTEXTIDR_EL1), access_vm_reg, reset_val, CONTEXTIDR_EL1, 0 },
diff --git a/arch/arm64/kvm/vgic-sys-reg-v3.c b/arch/arm64/kvm/vgic-sys-reg-v3.c
index 6260b69e5622..116786d2e8e8 100644
--- a/arch/arm64/kvm/vgic-sys-reg-v3.c
+++ b/arch/arm64/kvm/vgic-sys-reg-v3.c
@@ -268,36 +268,21 @@ static bool access_gic_sre(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
268 return true; 268 return true;
269} 269}
270static const struct sys_reg_desc gic_v3_icc_reg_descs[] = { 270static const struct sys_reg_desc gic_v3_icc_reg_descs[] = {
271 /* ICC_PMR_EL1 */ 271 { SYS_DESC(SYS_ICC_PMR_EL1), access_gic_pmr },
272 { Op0(3), Op1(0), CRn(4), CRm(6), Op2(0), access_gic_pmr }, 272 { SYS_DESC(SYS_ICC_BPR0_EL1), access_gic_bpr0 },
273 /* ICC_BPR0_EL1 */ 273 { SYS_DESC(SYS_ICC_AP0R0_EL1), access_gic_ap0r },
274 { Op0(3), Op1(0), CRn(12), CRm(8), Op2(3), access_gic_bpr0 }, 274 { SYS_DESC(SYS_ICC_AP0R1_EL1), access_gic_ap0r },
275 /* ICC_AP0R0_EL1 */ 275 { SYS_DESC(SYS_ICC_AP0R2_EL1), access_gic_ap0r },
276 { Op0(3), Op1(0), CRn(12), CRm(8), Op2(4), access_gic_ap0r }, 276 { SYS_DESC(SYS_ICC_AP0R3_EL1), access_gic_ap0r },
277 /* ICC_AP0R1_EL1 */ 277 { SYS_DESC(SYS_ICC_AP1R0_EL1), access_gic_ap1r },
278 { Op0(3), Op1(0), CRn(12), CRm(8), Op2(5), access_gic_ap0r }, 278 { SYS_DESC(SYS_ICC_AP1R1_EL1), access_gic_ap1r },
279 /* ICC_AP0R2_EL1 */ 279 { SYS_DESC(SYS_ICC_AP1R2_EL1), access_gic_ap1r },
280 { Op0(3), Op1(0), CRn(12), CRm(8), Op2(6), access_gic_ap0r }, 280 { SYS_DESC(SYS_ICC_AP1R3_EL1), access_gic_ap1r },
281 /* ICC_AP0R3_EL1 */ 281 { SYS_DESC(SYS_ICC_BPR1_EL1), access_gic_bpr1 },
282 { Op0(3), Op1(0), CRn(12), CRm(8), Op2(7), access_gic_ap0r }, 282 { SYS_DESC(SYS_ICC_CTLR_EL1), access_gic_ctlr },
283 /* ICC_AP1R0_EL1 */ 283 { SYS_DESC(SYS_ICC_SRE_EL1), access_gic_sre },
284 { Op0(3), Op1(0), CRn(12), CRm(9), Op2(0), access_gic_ap1r }, 284 { SYS_DESC(SYS_ICC_IGRPEN0_EL1), access_gic_grpen0 },
285 /* ICC_AP1R1_EL1 */ 285 { SYS_DESC(SYS_ICC_IGRPEN1_EL1), access_gic_grpen1 },
286 { Op0(3), Op1(0), CRn(12), CRm(9), Op2(1), access_gic_ap1r },
287 /* ICC_AP1R2_EL1 */
288 { Op0(3), Op1(0), CRn(12), CRm(9), Op2(2), access_gic_ap1r },
289 /* ICC_AP1R3_EL1 */
290 { Op0(3), Op1(0), CRn(12), CRm(9), Op2(3), access_gic_ap1r },
291 /* ICC_BPR1_EL1 */
292 { Op0(3), Op1(0), CRn(12), CRm(12), Op2(3), access_gic_bpr1 },
293 /* ICC_CTLR_EL1 */
294 { Op0(3), Op1(0), CRn(12), CRm(12), Op2(4), access_gic_ctlr },
295 /* ICC_SRE_EL1 */
296 { Op0(3), Op1(0), CRn(12), CRm(12), Op2(5), access_gic_sre },
297 /* ICC_IGRPEN0_EL1 */
298 { Op0(3), Op1(0), CRn(12), CRm(12), Op2(6), access_gic_grpen0 },
299 /* ICC_GRPEN1_EL1 */
300 { Op0(3), Op1(0), CRn(12), CRm(12), Op2(7), access_gic_grpen1 },
301}; 286};
302 287
303int vgic_v3_has_cpu_sysregs_attr(struct kvm_vcpu *vcpu, bool is_write, u64 id, 288int vgic_v3_has_cpu_sysregs_attr(struct kvm_vcpu *vcpu, bool is_write, u64 id,
diff --git a/arch/mips/kvm/trap_emul.c b/arch/mips/kvm/trap_emul.c
index a563759fd142..6a0d7040d882 100644
--- a/arch/mips/kvm/trap_emul.c
+++ b/arch/mips/kvm/trap_emul.c
@@ -1094,7 +1094,7 @@ static void kvm_trap_emul_check_requests(struct kvm_vcpu *vcpu, int cpu,
1094 struct mm_struct *mm; 1094 struct mm_struct *mm;
1095 int i; 1095 int i;
1096 1096
1097 if (likely(!vcpu->requests)) 1097 if (likely(!kvm_request_pending(vcpu)))
1098 return; 1098 return;
1099 1099
1100 if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) { 1100 if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) {
diff --git a/arch/mips/kvm/vz.c b/arch/mips/kvm/vz.c
index 71d8856ade64..74805035edc8 100644
--- a/arch/mips/kvm/vz.c
+++ b/arch/mips/kvm/vz.c
@@ -2337,7 +2337,7 @@ static int kvm_vz_check_requests(struct kvm_vcpu *vcpu, int cpu)
2337 int ret = 0; 2337 int ret = 0;
2338 int i; 2338 int i;
2339 2339
2340 if (!vcpu->requests) 2340 if (!kvm_request_pending(vcpu))
2341 return 0; 2341 return 0;
2342 2342
2343 if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) { 2343 if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) {
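
Both MIPS call sites stop peeking at vcpu->requests directly and go through the new kvm_request_pending() accessor from the VCPU-request rework, so the "any request at all?" test is a single, stable read. A trivial stand-alone sketch of what such an accessor amounts to:

    /* sketch: one volatile load of the request word, then act on that snapshot */
    #include <stdbool.h>
    #include <stdio.h>

    struct vcpu { unsigned long requests; };

    static bool request_pending(struct vcpu *vcpu)
    {
            return *(volatile unsigned long *)&vcpu->requests != 0;
    }

    int main(void)
    {
            struct vcpu v = { .requests = 1UL << 3 };

            if (request_pending(&v))
                    printf("something is pending, walk the individual request bits\n");
            return 0;
    }
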
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 2bf35017ffc0..b8d5b8e35244 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -86,7 +86,6 @@ struct kvmppc_vcore {
86 u16 last_cpu; 86 u16 last_cpu;
87 u8 vcore_state; 87 u8 vcore_state;
88 u8 in_guest; 88 u8 in_guest;
89 struct kvmppc_vcore *master_vcore;
90 struct kvm_vcpu *runnable_threads[MAX_SMT_THREADS]; 89 struct kvm_vcpu *runnable_threads[MAX_SMT_THREADS];
91 struct list_head preempt_list; 90 struct list_head preempt_list;
92 spinlock_t lock; 91 spinlock_t lock;
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index b148496ffe36..7cea76f11c26 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -81,7 +81,7 @@ struct kvm_split_mode {
81 u8 subcore_size; 81 u8 subcore_size;
82 u8 do_nap; 82 u8 do_nap;
83 u8 napped[MAX_SMT_THREADS]; 83 u8 napped[MAX_SMT_THREADS];
84 struct kvmppc_vcore *master_vcs[MAX_SUBCORES]; 84 struct kvmppc_vcore *vc[MAX_SUBCORES];
85}; 85};
86 86
87/* 87/*
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 9c51ac4b8f36..8b3f1238d07f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -35,6 +35,7 @@
35#include <asm/page.h> 35#include <asm/page.h>
36#include <asm/cacheflush.h> 36#include <asm/cacheflush.h>
37#include <asm/hvcall.h> 37#include <asm/hvcall.h>
38#include <asm/mce.h>
38 39
39#define KVM_MAX_VCPUS NR_CPUS 40#define KVM_MAX_VCPUS NR_CPUS
40#define KVM_MAX_VCORES NR_CPUS 41#define KVM_MAX_VCORES NR_CPUS
@@ -52,8 +53,8 @@
52#define KVM_IRQCHIP_NUM_PINS 256 53#define KVM_IRQCHIP_NUM_PINS 256
53 54
54/* PPC-specific vcpu->requests bit members */ 55/* PPC-specific vcpu->requests bit members */
55#define KVM_REQ_WATCHDOG 8 56#define KVM_REQ_WATCHDOG KVM_ARCH_REQ(0)
56#define KVM_REQ_EPR_EXIT 9 57#define KVM_REQ_EPR_EXIT KVM_ARCH_REQ(1)
57 58
58#include <linux/mmu_notifier.h> 59#include <linux/mmu_notifier.h>
59 60
@@ -267,6 +268,8 @@ struct kvm_resize_hpt;
267 268
268struct kvm_arch { 269struct kvm_arch {
269 unsigned int lpid; 270 unsigned int lpid;
271 unsigned int smt_mode; /* # vcpus per virtual core */
 272	unsigned int emul_smt_mode;	/* emulated SMT mode, on P9 */
270#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE 273#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
271 unsigned int tlb_sets; 274 unsigned int tlb_sets;
272 struct kvm_hpt_info hpt; 275 struct kvm_hpt_info hpt;
@@ -285,6 +288,7 @@ struct kvm_arch {
285 cpumask_t need_tlb_flush; 288 cpumask_t need_tlb_flush;
286 cpumask_t cpu_in_guest; 289 cpumask_t cpu_in_guest;
287 u8 radix; 290 u8 radix;
291 u8 fwnmi_enabled;
288 pgd_t *pgtable; 292 pgd_t *pgtable;
289 u64 process_table; 293 u64 process_table;
290 struct dentry *debugfs_dir; 294 struct dentry *debugfs_dir;
@@ -566,6 +570,7 @@ struct kvm_vcpu_arch {
566 ulong wort; 570 ulong wort;
567 ulong tid; 571 ulong tid;
568 ulong psscr; 572 ulong psscr;
573 ulong hfscr;
569 ulong shadow_srr1; 574 ulong shadow_srr1;
570#endif 575#endif
571 u32 vrsave; /* also USPRG0 */ 576 u32 vrsave; /* also USPRG0 */
@@ -579,7 +584,7 @@ struct kvm_vcpu_arch {
579 ulong mcsrr0; 584 ulong mcsrr0;
580 ulong mcsrr1; 585 ulong mcsrr1;
581 ulong mcsr; 586 ulong mcsr;
582 u32 dec; 587 ulong dec;
583#ifdef CONFIG_BOOKE 588#ifdef CONFIG_BOOKE
584 u32 decar; 589 u32 decar;
585#endif 590#endif
@@ -710,6 +715,7 @@ struct kvm_vcpu_arch {
710 unsigned long pending_exceptions; 715 unsigned long pending_exceptions;
711 u8 ceded; 716 u8 ceded;
712 u8 prodded; 717 u8 prodded;
718 u8 doorbell_request;
713 u32 last_inst; 719 u32 last_inst;
714 720
715 struct swait_queue_head *wqp; 721 struct swait_queue_head *wqp;
@@ -722,6 +728,7 @@ struct kvm_vcpu_arch {
722 int prev_cpu; 728 int prev_cpu;
723 bool timer_running; 729 bool timer_running;
724 wait_queue_head_t cpu_run; 730 wait_queue_head_t cpu_run;
731 struct machine_check_event mce_evt; /* Valid if trap == 0x200 */
725 732
726 struct kvm_vcpu_arch_shared *shared; 733 struct kvm_vcpu_arch_shared *shared;
727#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE) 734#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
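
The request renumbering here mirrors the arm64 hunk earlier in the diff: architecture-specific requests are now expressed as an index above a common base, optionally OR'd with behaviour flags, instead of hand-picked raw bit numbers. A sketch of that encoding; the base of 8 and the two flag bits below follow how this series describes the common code, but treat the exact constants as assumptions:

    /* sketch: arch request = common base + index, with optional flag bits on top */
    #include <stdio.h>

    #define REQUEST_MASK       0xff          /* low byte: the request's bit number       */
    #define REQUEST_NO_WAKEUP  (1u << 8)     /* don't wake a sleeping vCPU for this       */
    #define REQUEST_WAIT       (1u << 9)     /* kicker waits until the vCPU acknowledges  */
    #define REQUEST_ARCH_BASE  8             /* bits 0..7 reserved for common requests    */

    #define ARCH_REQ_FLAGS(nr, flags)  (((nr) + REQUEST_ARCH_BASE) | (flags))
    #define ARCH_REQ(nr)               ARCH_REQ_FLAGS(nr, 0)

    int main(void)
    {
            /* PPC: same bit numbers as the old literals 8 and 9, minus the magic */
            printf("KVM_REQ_WATCHDOG -> bit %d\n", ARCH_REQ(0) & REQUEST_MASK);
            printf("KVM_REQ_EPR_EXIT -> bit %d\n", ARCH_REQ(1) & REQUEST_MASK);
            /* arm64: KVM_REQ_SLEEP keeps bit 8 but also carries WAIT|NO_WAKEUP */
            printf("KVM_REQ_SLEEP    -> 0x%x\n",
                   (unsigned int)ARCH_REQ_FLAGS(0, REQUEST_WAIT | REQUEST_NO_WAKEUP));
            return 0;
    }
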
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e0d88c38602b..ba5fadd6f3c9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -315,6 +315,8 @@ struct kvmppc_ops {
315 struct irq_bypass_producer *); 315 struct irq_bypass_producer *);
316 int (*configure_mmu)(struct kvm *kvm, struct kvm_ppc_mmuv3_cfg *cfg); 316 int (*configure_mmu)(struct kvm *kvm, struct kvm_ppc_mmuv3_cfg *cfg);
317 int (*get_rmmu_info)(struct kvm *kvm, struct kvm_ppc_rmmu_info *info); 317 int (*get_rmmu_info)(struct kvm *kvm, struct kvm_ppc_rmmu_info *info);
318 int (*set_smt_mode)(struct kvm *kvm, unsigned long mode,
319 unsigned long flags);
318}; 320};
319 321
320extern struct kvmppc_ops *kvmppc_hv_ops; 322extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 3a8d278e7421..1a9b45198c06 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -103,6 +103,8 @@
103#define OP_31_XOP_STBUX 247 103#define OP_31_XOP_STBUX 247
104#define OP_31_XOP_LHZX 279 104#define OP_31_XOP_LHZX 279
105#define OP_31_XOP_LHZUX 311 105#define OP_31_XOP_LHZUX 311
106#define OP_31_XOP_MSGSNDP 142
107#define OP_31_XOP_MSGCLRP 174
106#define OP_31_XOP_MFSPR 339 108#define OP_31_XOP_MFSPR 339
107#define OP_31_XOP_LWAX 341 109#define OP_31_XOP_LWAX 341
108#define OP_31_XOP_LHAX 343 110#define OP_31_XOP_LHAX 343
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 07fbeb927834..8cf8f0c96906 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -60,6 +60,12 @@ struct kvm_regs {
60 60
61#define KVM_SREGS_E_FSL_PIDn (1 << 0) /* PID1/PID2 */ 61#define KVM_SREGS_E_FSL_PIDn (1 << 0) /* PID1/PID2 */
62 62
63/* flags for kvm_run.flags */
64#define KVM_RUN_PPC_NMI_DISP_MASK (3 << 0)
65#define KVM_RUN_PPC_NMI_DISP_FULLY_RECOV (1 << 0)
66#define KVM_RUN_PPC_NMI_DISP_LIMITED_RECOV (2 << 0)
67#define KVM_RUN_PPC_NMI_DISP_NOT_RECOV (3 << 0)
68
63/* 69/*
64 * Feature bits indicate which sections of the sregs struct are valid, 70 * Feature bits indicate which sections of the sregs struct are valid,
65 * both in KVM_GET_SREGS and KVM_SET_SREGS. On KVM_SET_SREGS, registers 71 * both in KVM_GET_SREGS and KVM_SET_SREGS. On KVM_SET_SREGS, registers
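
These kvm_run.flags bits are filled in by the HV machine-check exit path further down in this diff, together with exit_reason = KVM_EXIT_NMI. A hedged sketch of the consuming side in userspace (the reporting strings are purely illustrative):

    /* sketch: userspace reading the NMI disposition after a KVM_EXIT_NMI on PPC */
    #include <linux/kvm.h>
    #include <stdio.h>

    void handle_exit_nmi(struct kvm_run *run)
    {
            if (run->exit_reason != KVM_EXIT_NMI)
                    return;

            switch (run->flags & KVM_RUN_PPC_NMI_DISP_MASK) {
            case KVM_RUN_PPC_NMI_DISP_FULLY_RECOV:
                    printf("machine check fully recovered, guest may continue\n");
                    break;
            case KVM_RUN_PPC_NMI_DISP_LIMITED_RECOV:
                    printf("machine check recovered with limits\n");
                    break;
            default:    /* KVM_RUN_PPC_NMI_DISP_NOT_RECOV */
                    printf("unrecovered machine check, decide whether to stop the guest\n");
                    break;
            }
    }
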
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 709e23425317..ae8e89e0d083 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -485,6 +485,7 @@ int main(void)
485 OFFSET(KVM_ENABLED_HCALLS, kvm, arch.enabled_hcalls); 485 OFFSET(KVM_ENABLED_HCALLS, kvm, arch.enabled_hcalls);
486 OFFSET(KVM_VRMA_SLB_V, kvm, arch.vrma_slb_v); 486 OFFSET(KVM_VRMA_SLB_V, kvm, arch.vrma_slb_v);
487 OFFSET(KVM_RADIX, kvm, arch.radix); 487 OFFSET(KVM_RADIX, kvm, arch.radix);
488 OFFSET(KVM_FWNMI, kvm, arch.fwnmi_enabled);
488 OFFSET(VCPU_DSISR, kvm_vcpu, arch.shregs.dsisr); 489 OFFSET(VCPU_DSISR, kvm_vcpu, arch.shregs.dsisr);
489 OFFSET(VCPU_DAR, kvm_vcpu, arch.shregs.dar); 490 OFFSET(VCPU_DAR, kvm_vcpu, arch.shregs.dar);
490 OFFSET(VCPU_VPA, kvm_vcpu, arch.vpa.pinned_addr); 491 OFFSET(VCPU_VPA, kvm_vcpu, arch.vpa.pinned_addr);
@@ -513,6 +514,7 @@ int main(void)
513 OFFSET(VCPU_PENDING_EXC, kvm_vcpu, arch.pending_exceptions); 514 OFFSET(VCPU_PENDING_EXC, kvm_vcpu, arch.pending_exceptions);
514 OFFSET(VCPU_CEDED, kvm_vcpu, arch.ceded); 515 OFFSET(VCPU_CEDED, kvm_vcpu, arch.ceded);
515 OFFSET(VCPU_PRODDED, kvm_vcpu, arch.prodded); 516 OFFSET(VCPU_PRODDED, kvm_vcpu, arch.prodded);
517 OFFSET(VCPU_DBELL_REQ, kvm_vcpu, arch.doorbell_request);
516 OFFSET(VCPU_MMCR, kvm_vcpu, arch.mmcr); 518 OFFSET(VCPU_MMCR, kvm_vcpu, arch.mmcr);
517 OFFSET(VCPU_PMC, kvm_vcpu, arch.pmc); 519 OFFSET(VCPU_PMC, kvm_vcpu, arch.pmc);
518 OFFSET(VCPU_SPMC, kvm_vcpu, arch.spmc); 520 OFFSET(VCPU_SPMC, kvm_vcpu, arch.spmc);
@@ -542,6 +544,7 @@ int main(void)
542 OFFSET(VCPU_WORT, kvm_vcpu, arch.wort); 544 OFFSET(VCPU_WORT, kvm_vcpu, arch.wort);
543 OFFSET(VCPU_TID, kvm_vcpu, arch.tid); 545 OFFSET(VCPU_TID, kvm_vcpu, arch.tid);
544 OFFSET(VCPU_PSSCR, kvm_vcpu, arch.psscr); 546 OFFSET(VCPU_PSSCR, kvm_vcpu, arch.psscr);
547 OFFSET(VCPU_HFSCR, kvm_vcpu, arch.hfscr);
545 OFFSET(VCORE_ENTRY_EXIT, kvmppc_vcore, entry_exit_map); 548 OFFSET(VCORE_ENTRY_EXIT, kvmppc_vcore, entry_exit_map);
546 OFFSET(VCORE_IN_GUEST, kvmppc_vcore, in_guest); 549 OFFSET(VCORE_IN_GUEST, kvmppc_vcore, in_guest);
547 OFFSET(VCORE_NAPPING_THREADS, kvmppc_vcore, napping_threads); 550 OFFSET(VCORE_NAPPING_THREADS, kvmppc_vcore, napping_threads);
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 5f9eada3519b..a9bfa49f3698 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -405,6 +405,7 @@ void machine_check_print_event_info(struct machine_check_event *evt,
405 break; 405 break;
406 } 406 }
407} 407}
408EXPORT_SYMBOL_GPL(machine_check_print_event_info);
408 409
409uint64_t get_mce_fault_addr(struct machine_check_event *evt) 410uint64_t get_mce_fault_addr(struct machine_check_event *evt)
410{ 411{
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 773b35d16a0b..0b436df746fc 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -46,6 +46,8 @@
46#include <linux/of.h> 46#include <linux/of.h>
47 47
48#include <asm/reg.h> 48#include <asm/reg.h>
49#include <asm/ppc-opcode.h>
50#include <asm/disassemble.h>
49#include <asm/cputable.h> 51#include <asm/cputable.h>
50#include <asm/cacheflush.h> 52#include <asm/cacheflush.h>
51#include <asm/tlbflush.h> 53#include <asm/tlbflush.h>
@@ -645,6 +647,7 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
645 unsigned long stolen; 647 unsigned long stolen;
646 unsigned long core_stolen; 648 unsigned long core_stolen;
647 u64 now; 649 u64 now;
650 unsigned long flags;
648 651
649 dt = vcpu->arch.dtl_ptr; 652 dt = vcpu->arch.dtl_ptr;
650 vpa = vcpu->arch.vpa.pinned_addr; 653 vpa = vcpu->arch.vpa.pinned_addr;
@@ -652,10 +655,10 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
652 core_stolen = vcore_stolen_time(vc, now); 655 core_stolen = vcore_stolen_time(vc, now);
653 stolen = core_stolen - vcpu->arch.stolen_logged; 656 stolen = core_stolen - vcpu->arch.stolen_logged;
654 vcpu->arch.stolen_logged = core_stolen; 657 vcpu->arch.stolen_logged = core_stolen;
655 spin_lock_irq(&vcpu->arch.tbacct_lock); 658 spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
656 stolen += vcpu->arch.busy_stolen; 659 stolen += vcpu->arch.busy_stolen;
657 vcpu->arch.busy_stolen = 0; 660 vcpu->arch.busy_stolen = 0;
658 spin_unlock_irq(&vcpu->arch.tbacct_lock); 661 spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
659 if (!dt || !vpa) 662 if (!dt || !vpa)
660 return; 663 return;
661 memset(dt, 0, sizeof(struct dtl_entry)); 664 memset(dt, 0, sizeof(struct dtl_entry));
@@ -675,6 +678,26 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
675 vcpu->arch.dtl.dirty = true; 678 vcpu->arch.dtl.dirty = true;
676} 679}
677 680
681/* See if there is a doorbell interrupt pending for a vcpu */
682static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
683{
684 int thr;
685 struct kvmppc_vcore *vc;
686
687 if (vcpu->arch.doorbell_request)
688 return true;
689 /*
690 * Ensure that the read of vcore->dpdes comes after the read
691 * of vcpu->doorbell_request. This barrier matches the
692 * lwsync in book3s_hv_rmhandlers.S just before the
693 * fast_guest_return label.
694 */
695 smp_rmb();
696 vc = vcpu->arch.vcore;
697 thr = vcpu->vcpu_id - vc->first_vcpuid;
698 return !!(vc->dpdes & (1 << thr));
699}
700
678static bool kvmppc_power8_compatible(struct kvm_vcpu *vcpu) 701static bool kvmppc_power8_compatible(struct kvm_vcpu *vcpu)
679{ 702{
680 if (vcpu->arch.vcore->arch_compat >= PVR_ARCH_207) 703 if (vcpu->arch.vcore->arch_compat >= PVR_ARCH_207)
@@ -926,6 +949,101 @@ static int kvmppc_emulate_debug_inst(struct kvm_run *run,
926 } 949 }
927} 950}
928 951
952static void do_nothing(void *x)
953{
954}
955
956static unsigned long kvmppc_read_dpdes(struct kvm_vcpu *vcpu)
957{
958 int thr, cpu, pcpu, nthreads;
959 struct kvm_vcpu *v;
960 unsigned long dpdes;
961
962 nthreads = vcpu->kvm->arch.emul_smt_mode;
963 dpdes = 0;
964 cpu = vcpu->vcpu_id & ~(nthreads - 1);
965 for (thr = 0; thr < nthreads; ++thr, ++cpu) {
966 v = kvmppc_find_vcpu(vcpu->kvm, cpu);
967 if (!v)
968 continue;
969 /*
970 * If the vcpu is currently running on a physical cpu thread,
971 * interrupt it in order to pull it out of the guest briefly,
972 * which will update its vcore->dpdes value.
973 */
974 pcpu = READ_ONCE(v->cpu);
975 if (pcpu >= 0)
976 smp_call_function_single(pcpu, do_nothing, NULL, 1);
977 if (kvmppc_doorbell_pending(v))
978 dpdes |= 1 << thr;
979 }
980 return dpdes;
981}
982
983/*
984 * On POWER9, emulate doorbell-related instructions in order to
985 * give the guest the illusion of running on a multi-threaded core.
986 * The instructions emulated are msgsndp, msgclrp, mfspr TIR,
987 * and mfspr DPDES.
988 */
989static int kvmppc_emulate_doorbell_instr(struct kvm_vcpu *vcpu)
990{
991 u32 inst, rb, thr;
992 unsigned long arg;
993 struct kvm *kvm = vcpu->kvm;
994 struct kvm_vcpu *tvcpu;
995
996 if (!cpu_has_feature(CPU_FTR_ARCH_300))
997 return EMULATE_FAIL;
998 if (kvmppc_get_last_inst(vcpu, INST_GENERIC, &inst) != EMULATE_DONE)
999 return RESUME_GUEST;
1000 if (get_op(inst) != 31)
1001 return EMULATE_FAIL;
1002 rb = get_rb(inst);
1003 thr = vcpu->vcpu_id & (kvm->arch.emul_smt_mode - 1);
1004 switch (get_xop(inst)) {
1005 case OP_31_XOP_MSGSNDP:
1006 arg = kvmppc_get_gpr(vcpu, rb);
1007 if (((arg >> 27) & 0xf) != PPC_DBELL_SERVER)
1008 break;
1009 arg &= 0x3f;
1010 if (arg >= kvm->arch.emul_smt_mode)
1011 break;
1012 tvcpu = kvmppc_find_vcpu(kvm, vcpu->vcpu_id - thr + arg);
1013 if (!tvcpu)
1014 break;
1015 if (!tvcpu->arch.doorbell_request) {
1016 tvcpu->arch.doorbell_request = 1;
1017 kvmppc_fast_vcpu_kick_hv(tvcpu);
1018 }
1019 break;
1020 case OP_31_XOP_MSGCLRP:
1021 arg = kvmppc_get_gpr(vcpu, rb);
1022 if (((arg >> 27) & 0xf) != PPC_DBELL_SERVER)
1023 break;
1024 vcpu->arch.vcore->dpdes = 0;
1025 vcpu->arch.doorbell_request = 0;
1026 break;
1027 case OP_31_XOP_MFSPR:
1028 switch (get_sprn(inst)) {
1029 case SPRN_TIR:
1030 arg = thr;
1031 break;
1032 case SPRN_DPDES:
1033 arg = kvmppc_read_dpdes(vcpu);
1034 break;
1035 default:
1036 return EMULATE_FAIL;
1037 }
1038 kvmppc_set_gpr(vcpu, get_rt(inst), arg);
1039 break;
1040 default:
1041 return EMULATE_FAIL;
1042 }
1043 kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
1044 return RESUME_GUEST;
1045}
1046
929static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu, 1047static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
930 struct task_struct *tsk) 1048 struct task_struct *tsk)
931{ 1049{
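
The msgsndp emulation above works entirely in terms of the emulated SMT geometry: the sender's thread number is the low bits of its vcpu_id, and the msgsndp operand names the destination thread within that faked core. A tiny worked example of the same arithmetic (illustration only, using the formulas from the code above):

    /* sketch: mapping a msgsndp thread operand to the target vcpu_id */
    #include <stdio.h>

    int main(void)
    {
            unsigned int emul_smt_mode = 4;   /* guest was told threads=4           */
            unsigned int vcpu_id       = 6;   /* sending vcpu                       */
            unsigned int arg           = 1;   /* low 6 bits of the msgsndp RB value */

            unsigned int thr    = vcpu_id & (emul_smt_mode - 1);  /* sender is thread 2 */
            unsigned int target = vcpu_id - thr + arg;            /* 6 - 2 + 1 = 5      */

            printf("vcpu %u (thread %u of its emulated core) raises a doorbell on vcpu %u\n",
                   vcpu_id, thr, target);
            /* kvmppc_emulate_doorbell_instr() then sets the target's doorbell_request
             * and kicks it, so kvmppc_doorbell_pending() reports it on next entry */
            return 0;
    }
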
@@ -971,15 +1089,20 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
971 r = RESUME_GUEST; 1089 r = RESUME_GUEST;
972 break; 1090 break;
973 case BOOK3S_INTERRUPT_MACHINE_CHECK: 1091 case BOOK3S_INTERRUPT_MACHINE_CHECK:
974 /* 1092 /* Exit to guest with KVM_EXIT_NMI as exit reason */
975 * Deliver a machine check interrupt to the guest. 1093 run->exit_reason = KVM_EXIT_NMI;
976 * We have to do this, even if the host has handled the 1094 run->hw.hardware_exit_reason = vcpu->arch.trap;
977 * machine check, because machine checks use SRR0/1 and 1095 /* Clear out the old NMI status from run->flags */
978 * the interrupt might have trashed guest state in them. 1096 run->flags &= ~KVM_RUN_PPC_NMI_DISP_MASK;
979 */ 1097 /* Now set the NMI status */
980 kvmppc_book3s_queue_irqprio(vcpu, 1098 if (vcpu->arch.mce_evt.disposition == MCE_DISPOSITION_RECOVERED)
981 BOOK3S_INTERRUPT_MACHINE_CHECK); 1099 run->flags |= KVM_RUN_PPC_NMI_DISP_FULLY_RECOV;
982 r = RESUME_GUEST; 1100 else
1101 run->flags |= KVM_RUN_PPC_NMI_DISP_NOT_RECOV;
1102
1103 r = RESUME_HOST;
1104 /* Print the MCE event to host console. */
1105 machine_check_print_event_info(&vcpu->arch.mce_evt, false);
983 break; 1106 break;
984 case BOOK3S_INTERRUPT_PROGRAM: 1107 case BOOK3S_INTERRUPT_PROGRAM:
985 { 1108 {
@@ -1048,12 +1171,19 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
1048 break; 1171 break;
1049 /* 1172 /*
1050 * This occurs if the guest (kernel or userspace), does something that 1173 * This occurs if the guest (kernel or userspace), does something that
1051 * is prohibited by HFSCR. We just generate a program interrupt to 1174 * is prohibited by HFSCR.
1052 * the guest. 1175 * On POWER9, this could be a doorbell instruction that we need
1176 * to emulate.
1177 * Otherwise, we just generate a program interrupt to the guest.
1053 */ 1178 */
1054 case BOOK3S_INTERRUPT_H_FAC_UNAVAIL: 1179 case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
1055 kvmppc_core_queue_program(vcpu, SRR1_PROGILL); 1180 r = EMULATE_FAIL;
1056 r = RESUME_GUEST; 1181 if ((vcpu->arch.hfscr >> 56) == FSCR_MSGP_LG)
1182 r = kvmppc_emulate_doorbell_instr(vcpu);
1183 if (r == EMULATE_FAIL) {
1184 kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
1185 r = RESUME_GUEST;
1186 }
1057 break; 1187 break;
1058 case BOOK3S_INTERRUPT_HV_RM_HARD: 1188 case BOOK3S_INTERRUPT_HV_RM_HARD:
1059 r = RESUME_PASSTHROUGH; 1189 r = RESUME_PASSTHROUGH;
@@ -1143,6 +1273,12 @@ static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr,
1143 mask = LPCR_DPFD | LPCR_ILE | LPCR_TC; 1273 mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
1144 if (cpu_has_feature(CPU_FTR_ARCH_207S)) 1274 if (cpu_has_feature(CPU_FTR_ARCH_207S))
1145 mask |= LPCR_AIL; 1275 mask |= LPCR_AIL;
1276 /*
1277 * On POWER9, allow userspace to enable large decrementer for the
1278 * guest, whether or not the host has it enabled.
1279 */
1280 if (cpu_has_feature(CPU_FTR_ARCH_300))
1281 mask |= LPCR_LD;
1146 1282
1147 /* Broken 32-bit version of LPCR must not clear top bits */ 1283 /* Broken 32-bit version of LPCR must not clear top bits */
1148 if (preserve_top32) 1284 if (preserve_top32)
@@ -1611,7 +1747,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
1611 init_swait_queue_head(&vcore->wq); 1747 init_swait_queue_head(&vcore->wq);
1612 vcore->preempt_tb = TB_NIL; 1748 vcore->preempt_tb = TB_NIL;
1613 vcore->lpcr = kvm->arch.lpcr; 1749 vcore->lpcr = kvm->arch.lpcr;
1614 vcore->first_vcpuid = core * threads_per_vcore(); 1750 vcore->first_vcpuid = core * kvm->arch.smt_mode;
1615 vcore->kvm = kvm; 1751 vcore->kvm = kvm;
1616 INIT_LIST_HEAD(&vcore->preempt_list); 1752 INIT_LIST_HEAD(&vcore->preempt_list);
1617 1753
@@ -1770,14 +1906,10 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
1770 unsigned int id) 1906 unsigned int id)
1771{ 1907{
1772 struct kvm_vcpu *vcpu; 1908 struct kvm_vcpu *vcpu;
1773 int err = -EINVAL; 1909 int err;
1774 int core; 1910 int core;
1775 struct kvmppc_vcore *vcore; 1911 struct kvmppc_vcore *vcore;
1776 1912
1777 core = id / threads_per_vcore();
1778 if (core >= KVM_MAX_VCORES)
1779 goto out;
1780
1781 err = -ENOMEM; 1913 err = -ENOMEM;
1782 vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); 1914 vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
1783 if (!vcpu) 1915 if (!vcpu)
@@ -1808,6 +1940,20 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
1808 vcpu->arch.busy_preempt = TB_NIL; 1940 vcpu->arch.busy_preempt = TB_NIL;
1809 vcpu->arch.intr_msr = MSR_SF | MSR_ME; 1941 vcpu->arch.intr_msr = MSR_SF | MSR_ME;
1810 1942
1943 /*
1944 * Set the default HFSCR for the guest from the host value.
1945 * This value is only used on POWER9.
1946 * On POWER9 DD1, TM doesn't work, so we make sure to
1947 * prevent the guest from using it.
1948 * On POWER9, we want to virtualize the doorbell facility, so we
1949 * turn off the HFSCR bit, which causes those instructions to trap.
1950 */
1951 vcpu->arch.hfscr = mfspr(SPRN_HFSCR);
1952 if (!cpu_has_feature(CPU_FTR_TM))
1953 vcpu->arch.hfscr &= ~HFSCR_TM;
1954 if (cpu_has_feature(CPU_FTR_ARCH_300))
1955 vcpu->arch.hfscr &= ~HFSCR_MSGP;
1956
1811 kvmppc_mmu_book3s_hv_init(vcpu); 1957 kvmppc_mmu_book3s_hv_init(vcpu);
1812 1958
1813 vcpu->arch.state = KVMPPC_VCPU_NOTREADY; 1959 vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
@@ -1815,11 +1961,17 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
1815 init_waitqueue_head(&vcpu->arch.cpu_run); 1961 init_waitqueue_head(&vcpu->arch.cpu_run);
1816 1962
1817 mutex_lock(&kvm->lock); 1963 mutex_lock(&kvm->lock);
1818 vcore = kvm->arch.vcores[core]; 1964 vcore = NULL;
1819 if (!vcore) { 1965 err = -EINVAL;
1820 vcore = kvmppc_vcore_create(kvm, core); 1966 core = id / kvm->arch.smt_mode;
1821 kvm->arch.vcores[core] = vcore; 1967 if (core < KVM_MAX_VCORES) {
1822 kvm->arch.online_vcores++; 1968 vcore = kvm->arch.vcores[core];
1969 if (!vcore) {
1970 err = -ENOMEM;
1971 vcore = kvmppc_vcore_create(kvm, core);
1972 kvm->arch.vcores[core] = vcore;
1973 kvm->arch.online_vcores++;
1974 }
1823 } 1975 }
1824 mutex_unlock(&kvm->lock); 1976 mutex_unlock(&kvm->lock);
1825 1977
@@ -1847,6 +1999,43 @@ out:
1847 return ERR_PTR(err); 1999 return ERR_PTR(err);
1848} 2000}
1849 2001
2002static int kvmhv_set_smt_mode(struct kvm *kvm, unsigned long smt_mode,
2003 unsigned long flags)
2004{
2005 int err;
2006 int esmt = 0;
2007
2008 if (flags)
2009 return -EINVAL;
2010 if (smt_mode > MAX_SMT_THREADS || !is_power_of_2(smt_mode))
2011 return -EINVAL;
2012 if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
2013 /*
2014 * On POWER8 (or POWER7), the threading mode is "strict",
2015 * so we pack smt_mode vcpus per vcore.
2016 */
2017 if (smt_mode > threads_per_subcore)
2018 return -EINVAL;
2019 } else {
2020 /*
2021 * On POWER9, the threading mode is "loose",
2022 * so each vcpu gets its own vcore.
2023 */
2024 esmt = smt_mode;
2025 smt_mode = 1;
2026 }
2027 mutex_lock(&kvm->lock);
2028 err = -EBUSY;
2029 if (!kvm->arch.online_vcores) {
2030 kvm->arch.smt_mode = smt_mode;
2031 kvm->arch.emul_smt_mode = esmt;
2032 err = 0;
2033 }
2034 mutex_unlock(&kvm->lock);
2035
2036 return err;
2037}
2038
1850static void unpin_vpa(struct kvm *kvm, struct kvmppc_vpa *vpa) 2039static void unpin_vpa(struct kvm *kvm, struct kvmppc_vpa *vpa)
1851{ 2040{
1852 if (vpa->pinned_addr) 2041 if (vpa->pinned_addr)
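
With kvmhv_set_smt_mode() in place, vcore membership is derived from the per-VM smt_mode rather than the host's threads_per_vcore(): core index = vcpu_id / smt_mode, and first_vcpuid = core * smt_mode as the vcpu-create hunk above shows. A small illustration of how POWER8 "strict" and POWER9 "loose" threading differ under that formula:

    /* sketch: vcpu_id -> vcore mapping for strict (P8) vs loose (P9) threading */
    #include <stdio.h>

    static void show(const char *label, unsigned int smt_mode)
    {
            unsigned int id;

            printf("%s (smt_mode=%u):\n", label, smt_mode);
            for (id = 0; id < 8; id++)
                    printf("  vcpu %u -> vcore %u (first_vcpuid %u)\n",
                           id, id / smt_mode, (id / smt_mode) * smt_mode);
    }

    int main(void)
    {
            show("POWER8 strict, threads=4", 4);   /* four vcpus share one real vcore        */
            show("POWER9 loose,  threads=4", 1);   /* each vcpu its own vcore, SMT emulated  */
            return 0;
    }
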
@@ -1897,7 +2086,7 @@ static void kvmppc_end_cede(struct kvm_vcpu *vcpu)
1897 } 2086 }
1898} 2087}
1899 2088
1900extern void __kvmppc_vcore_entry(void); 2089extern int __kvmppc_vcore_entry(void);
1901 2090
1902static void kvmppc_remove_runnable(struct kvmppc_vcore *vc, 2091static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
1903 struct kvm_vcpu *vcpu) 2092 struct kvm_vcpu *vcpu)
@@ -1962,10 +2151,6 @@ static void kvmppc_release_hwthread(int cpu)
1962 tpaca->kvm_hstate.kvm_split_mode = NULL; 2151 tpaca->kvm_hstate.kvm_split_mode = NULL;
1963} 2152}
1964 2153
1965static void do_nothing(void *x)
1966{
1967}
1968
1969static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu) 2154static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
1970{ 2155{
1971 int i; 2156 int i;
@@ -1983,11 +2168,35 @@ static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
1983 smp_call_function_single(cpu + i, do_nothing, NULL, 1); 2168 smp_call_function_single(cpu + i, do_nothing, NULL, 1);
1984} 2169}
1985 2170
2171static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
2172{
2173 struct kvm *kvm = vcpu->kvm;
2174
2175 /*
2176 * With radix, the guest can do TLB invalidations itself,
2177 * and it could choose to use the local form (tlbiel) if
2178 * it is invalidating a translation that has only ever been
2179 * used on one vcpu. However, that doesn't mean it has
2180 * only ever been used on one physical cpu, since vcpus
2181 * can move around between pcpus. To cope with this, when
2182 * a vcpu moves from one pcpu to another, we need to tell
2183 * any vcpus running on the same core as this vcpu previously
2184 * ran to flush the TLB. The TLB is shared between threads,
2185 * so we use a single bit in .need_tlb_flush for all 4 threads.
2186 */
2187 if (vcpu->arch.prev_cpu != pcpu) {
2188 if (vcpu->arch.prev_cpu >= 0 &&
2189 cpu_first_thread_sibling(vcpu->arch.prev_cpu) !=
2190 cpu_first_thread_sibling(pcpu))
2191 radix_flush_cpu(kvm, vcpu->arch.prev_cpu, vcpu);
2192 vcpu->arch.prev_cpu = pcpu;
2193 }
2194}
2195
1986static void kvmppc_start_thread(struct kvm_vcpu *vcpu, struct kvmppc_vcore *vc) 2196static void kvmppc_start_thread(struct kvm_vcpu *vcpu, struct kvmppc_vcore *vc)
1987{ 2197{
1988 int cpu; 2198 int cpu;
1989 struct paca_struct *tpaca; 2199 struct paca_struct *tpaca;
1990 struct kvmppc_vcore *mvc = vc->master_vcore;
1991 struct kvm *kvm = vc->kvm; 2200 struct kvm *kvm = vc->kvm;
1992 2201
1993 cpu = vc->pcpu; 2202 cpu = vc->pcpu;
@@ -1997,36 +2206,16 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, struct kvmppc_vcore *vc)
1997 vcpu->arch.timer_running = 0; 2206 vcpu->arch.timer_running = 0;
1998 } 2207 }
1999 cpu += vcpu->arch.ptid; 2208 cpu += vcpu->arch.ptid;
2000 vcpu->cpu = mvc->pcpu; 2209 vcpu->cpu = vc->pcpu;
2001 vcpu->arch.thread_cpu = cpu; 2210 vcpu->arch.thread_cpu = cpu;
2002
2003 /*
2004 * With radix, the guest can do TLB invalidations itself,
2005 * and it could choose to use the local form (tlbiel) if
2006 * it is invalidating a translation that has only ever been
2007 * used on one vcpu. However, that doesn't mean it has
2008 * only ever been used on one physical cpu, since vcpus
2009 * can move around between pcpus. To cope with this, when
2010 * a vcpu moves from one pcpu to another, we need to tell
2011 * any vcpus running on the same core as this vcpu previously
2012 * ran to flush the TLB. The TLB is shared between threads,
2013 * so we use a single bit in .need_tlb_flush for all 4 threads.
2014 */
2015 if (kvm_is_radix(kvm) && vcpu->arch.prev_cpu != cpu) {
2016 if (vcpu->arch.prev_cpu >= 0 &&
2017 cpu_first_thread_sibling(vcpu->arch.prev_cpu) !=
2018 cpu_first_thread_sibling(cpu))
2019 radix_flush_cpu(kvm, vcpu->arch.prev_cpu, vcpu);
2020 vcpu->arch.prev_cpu = cpu;
2021 }
2022 cpumask_set_cpu(cpu, &kvm->arch.cpu_in_guest); 2211 cpumask_set_cpu(cpu, &kvm->arch.cpu_in_guest);
2023 } 2212 }
2024 tpaca = &paca[cpu]; 2213 tpaca = &paca[cpu];
2025 tpaca->kvm_hstate.kvm_vcpu = vcpu; 2214 tpaca->kvm_hstate.kvm_vcpu = vcpu;
2026 tpaca->kvm_hstate.ptid = cpu - mvc->pcpu; 2215 tpaca->kvm_hstate.ptid = cpu - vc->pcpu;
2027 /* Order stores to hstate.kvm_vcpu etc. before store to kvm_vcore */ 2216 /* Order stores to hstate.kvm_vcpu etc. before store to kvm_vcore */
2028 smp_wmb(); 2217 smp_wmb();
2029 tpaca->kvm_hstate.kvm_vcore = mvc; 2218 tpaca->kvm_hstate.kvm_vcore = vc;
2030 if (cpu != smp_processor_id()) 2219 if (cpu != smp_processor_id())
2031 kvmppc_ipi_thread(cpu); 2220 kvmppc_ipi_thread(cpu);
2032} 2221}
@@ -2155,8 +2344,7 @@ struct core_info {
2155 int max_subcore_threads; 2344 int max_subcore_threads;
2156 int total_threads; 2345 int total_threads;
2157 int subcore_threads[MAX_SUBCORES]; 2346 int subcore_threads[MAX_SUBCORES];
2158 struct kvm *subcore_vm[MAX_SUBCORES]; 2347 struct kvmppc_vcore *vc[MAX_SUBCORES];
2159 struct list_head vcs[MAX_SUBCORES];
2160}; 2348};
2161 2349
2162/* 2350/*
@@ -2167,17 +2355,12 @@ static int subcore_thread_map[MAX_SUBCORES] = { 0, 4, 2, 6 };
2167 2355
2168static void init_core_info(struct core_info *cip, struct kvmppc_vcore *vc) 2356static void init_core_info(struct core_info *cip, struct kvmppc_vcore *vc)
2169{ 2357{
2170 int sub;
2171
2172 memset(cip, 0, sizeof(*cip)); 2358 memset(cip, 0, sizeof(*cip));
2173 cip->n_subcores = 1; 2359 cip->n_subcores = 1;
2174 cip->max_subcore_threads = vc->num_threads; 2360 cip->max_subcore_threads = vc->num_threads;
2175 cip->total_threads = vc->num_threads; 2361 cip->total_threads = vc->num_threads;
2176 cip->subcore_threads[0] = vc->num_threads; 2362 cip->subcore_threads[0] = vc->num_threads;
2177 cip->subcore_vm[0] = vc->kvm; 2363 cip->vc[0] = vc;
2178 for (sub = 0; sub < MAX_SUBCORES; ++sub)
2179 INIT_LIST_HEAD(&cip->vcs[sub]);
2180 list_add_tail(&vc->preempt_list, &cip->vcs[0]);
2181} 2364}
2182 2365
2183static bool subcore_config_ok(int n_subcores, int n_threads) 2366static bool subcore_config_ok(int n_subcores, int n_threads)
@@ -2197,9 +2380,8 @@ static bool subcore_config_ok(int n_subcores, int n_threads)
2197 return n_subcores * roundup_pow_of_two(n_threads) <= MAX_SMT_THREADS; 2380 return n_subcores * roundup_pow_of_two(n_threads) <= MAX_SMT_THREADS;
2198} 2381}
2199 2382
2200static void init_master_vcore(struct kvmppc_vcore *vc) 2383static void init_vcore_to_run(struct kvmppc_vcore *vc)
2201{ 2384{
2202 vc->master_vcore = vc;
2203 vc->entry_exit_map = 0; 2385 vc->entry_exit_map = 0;
2204 vc->in_guest = 0; 2386 vc->in_guest = 0;
2205 vc->napping_threads = 0; 2387 vc->napping_threads = 0;
@@ -2224,9 +2406,9 @@ static bool can_dynamic_split(struct kvmppc_vcore *vc, struct core_info *cip)
2224 ++cip->n_subcores; 2406 ++cip->n_subcores;
2225 cip->total_threads += vc->num_threads; 2407 cip->total_threads += vc->num_threads;
2226 cip->subcore_threads[sub] = vc->num_threads; 2408 cip->subcore_threads[sub] = vc->num_threads;
2227 cip->subcore_vm[sub] = vc->kvm; 2409 cip->vc[sub] = vc;
2228 init_master_vcore(vc); 2410 init_vcore_to_run(vc);
2229 list_move_tail(&vc->preempt_list, &cip->vcs[sub]); 2411 list_del_init(&vc->preempt_list);
2230 2412
2231 return true; 2413 return true;
2232} 2414}
@@ -2294,6 +2476,18 @@ static void collect_piggybacks(struct core_info *cip, int target_threads)
2294 spin_unlock(&lp->lock); 2476 spin_unlock(&lp->lock);
2295} 2477}
2296 2478
2479static bool recheck_signals(struct core_info *cip)
2480{
2481 int sub, i;
2482 struct kvm_vcpu *vcpu;
2483
2484 for (sub = 0; sub < cip->n_subcores; ++sub)
2485 for_each_runnable_thread(i, vcpu, cip->vc[sub])
2486 if (signal_pending(vcpu->arch.run_task))
2487 return true;
2488 return false;
2489}
2490
2297static void post_guest_process(struct kvmppc_vcore *vc, bool is_master) 2491static void post_guest_process(struct kvmppc_vcore *vc, bool is_master)
2298{ 2492{
2299 int still_running = 0, i; 2493 int still_running = 0, i;
@@ -2331,7 +2525,6 @@ static void post_guest_process(struct kvmppc_vcore *vc, bool is_master)
2331 wake_up(&vcpu->arch.cpu_run); 2525 wake_up(&vcpu->arch.cpu_run);
2332 } 2526 }
2333 } 2527 }
2334 list_del_init(&vc->preempt_list);
2335 if (!is_master) { 2528 if (!is_master) {
2336 if (still_running > 0) { 2529 if (still_running > 0) {
2337 kvmppc_vcore_preempt(vc); 2530 kvmppc_vcore_preempt(vc);
@@ -2393,6 +2586,21 @@ static inline int kvmppc_set_host_core(unsigned int cpu)
2393 return 0; 2586 return 0;
2394} 2587}
2395 2588
2589static void set_irq_happened(int trap)
2590{
2591 switch (trap) {
2592 case BOOK3S_INTERRUPT_EXTERNAL:
2593 local_paca->irq_happened |= PACA_IRQ_EE;
2594 break;
2595 case BOOK3S_INTERRUPT_H_DOORBELL:
2596 local_paca->irq_happened |= PACA_IRQ_DBELL;
2597 break;
2598 case BOOK3S_INTERRUPT_HMI:
2599 local_paca->irq_happened |= PACA_IRQ_HMI;
2600 break;
2601 }
2602}
2603
2396/* 2604/*
2397 * Run a set of guest threads on a physical core. 2605 * Run a set of guest threads on a physical core.
2398 * Called with vc->lock held. 2606 * Called with vc->lock held.
@@ -2403,7 +2611,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2403 int i; 2611 int i;
2404 int srcu_idx; 2612 int srcu_idx;
2405 struct core_info core_info; 2613 struct core_info core_info;
2406 struct kvmppc_vcore *pvc, *vcnext; 2614 struct kvmppc_vcore *pvc;
2407 struct kvm_split_mode split_info, *sip; 2615 struct kvm_split_mode split_info, *sip;
2408 int split, subcore_size, active; 2616 int split, subcore_size, active;
2409 int sub; 2617 int sub;
@@ -2412,6 +2620,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2412 int pcpu, thr; 2620 int pcpu, thr;
2413 int target_threads; 2621 int target_threads;
2414 int controlled_threads; 2622 int controlled_threads;
2623 int trap;
2415 2624
2416 /* 2625 /*
2417 * Remove from the list any threads that have a signal pending 2626 * Remove from the list any threads that have a signal pending
@@ -2426,7 +2635,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2426 /* 2635 /*
2427 * Initialize *vc. 2636 * Initialize *vc.
2428 */ 2637 */
2429 init_master_vcore(vc); 2638 init_vcore_to_run(vc);
2430 vc->preempt_tb = TB_NIL; 2639 vc->preempt_tb = TB_NIL;
2431 2640
2432 /* 2641 /*
@@ -2463,6 +2672,43 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2463 if (vc->num_threads < target_threads) 2672 if (vc->num_threads < target_threads)
2464 collect_piggybacks(&core_info, target_threads); 2673 collect_piggybacks(&core_info, target_threads);
2465 2674
2675 /*
2676 * On radix, arrange for TLB flushing if necessary.
2677 * This has to be done before disabling interrupts since
2678 * it uses smp_call_function().
2679 */
2680 pcpu = smp_processor_id();
2681 if (kvm_is_radix(vc->kvm)) {
2682 for (sub = 0; sub < core_info.n_subcores; ++sub)
2683 for_each_runnable_thread(i, vcpu, core_info.vc[sub])
2684 kvmppc_prepare_radix_vcpu(vcpu, pcpu);
2685 }
2686
2687 /*
2688 * Hard-disable interrupts, and check resched flag and signals.
2689 * If we need to reschedule or deliver a signal, clean up
2690 * and return without going into the guest(s).
2691 */
2692 local_irq_disable();
2693 hard_irq_disable();
2694 if (lazy_irq_pending() || need_resched() ||
2695 recheck_signals(&core_info)) {
2696 local_irq_enable();
2697 vc->vcore_state = VCORE_INACTIVE;
2698 /* Unlock all except the primary vcore */
2699 for (sub = 1; sub < core_info.n_subcores; ++sub) {
2700 pvc = core_info.vc[sub];
2701 /* Put back on to the preempted vcores list */
2702 kvmppc_vcore_preempt(pvc);
2703 spin_unlock(&pvc->lock);
2704 }
2705 for (i = 0; i < controlled_threads; ++i)
2706 kvmppc_release_hwthread(pcpu + i);
2707 return;
2708 }
2709
2710 kvmppc_clear_host_core(pcpu);
2711
2466 /* Decide on micro-threading (split-core) mode */ 2712 /* Decide on micro-threading (split-core) mode */
2467 subcore_size = threads_per_subcore; 2713 subcore_size = threads_per_subcore;
2468 cmd_bit = stat_bit = 0; 2714 cmd_bit = stat_bit = 0;
@@ -2486,13 +2732,10 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2486 split_info.ldbar = mfspr(SPRN_LDBAR); 2732 split_info.ldbar = mfspr(SPRN_LDBAR);
2487 split_info.subcore_size = subcore_size; 2733 split_info.subcore_size = subcore_size;
2488 for (sub = 0; sub < core_info.n_subcores; ++sub) 2734 for (sub = 0; sub < core_info.n_subcores; ++sub)
2489 split_info.master_vcs[sub] = 2735 split_info.vc[sub] = core_info.vc[sub];
2490 list_first_entry(&core_info.vcs[sub],
2491 struct kvmppc_vcore, preempt_list);
2492 /* order writes to split_info before kvm_split_mode pointer */ 2736 /* order writes to split_info before kvm_split_mode pointer */
2493 smp_wmb(); 2737 smp_wmb();
2494 } 2738 }
2495 pcpu = smp_processor_id();
2496 for (thr = 0; thr < controlled_threads; ++thr) 2739 for (thr = 0; thr < controlled_threads; ++thr)
2497 paca[pcpu + thr].kvm_hstate.kvm_split_mode = sip; 2740 paca[pcpu + thr].kvm_hstate.kvm_split_mode = sip;
2498 2741
@@ -2512,32 +2755,29 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2512 } 2755 }
2513 } 2756 }
2514 2757
2515 kvmppc_clear_host_core(pcpu);
2516
2517 /* Start all the threads */ 2758 /* Start all the threads */
2518 active = 0; 2759 active = 0;
2519 for (sub = 0; sub < core_info.n_subcores; ++sub) { 2760 for (sub = 0; sub < core_info.n_subcores; ++sub) {
2520 thr = subcore_thread_map[sub]; 2761 thr = subcore_thread_map[sub];
2521 thr0_done = false; 2762 thr0_done = false;
2522 active |= 1 << thr; 2763 active |= 1 << thr;
2523 list_for_each_entry(pvc, &core_info.vcs[sub], preempt_list) { 2764 pvc = core_info.vc[sub];
2524 pvc->pcpu = pcpu + thr; 2765 pvc->pcpu = pcpu + thr;
2525 for_each_runnable_thread(i, vcpu, pvc) { 2766 for_each_runnable_thread(i, vcpu, pvc) {
2526 kvmppc_start_thread(vcpu, pvc); 2767 kvmppc_start_thread(vcpu, pvc);
2527 kvmppc_create_dtl_entry(vcpu, pvc); 2768 kvmppc_create_dtl_entry(vcpu, pvc);
2528 trace_kvm_guest_enter(vcpu); 2769 trace_kvm_guest_enter(vcpu);
2529 if (!vcpu->arch.ptid) 2770 if (!vcpu->arch.ptid)
2530 thr0_done = true; 2771 thr0_done = true;
2531 active |= 1 << (thr + vcpu->arch.ptid); 2772 active |= 1 << (thr + vcpu->arch.ptid);
2532 }
2533 /*
2534 * We need to start the first thread of each subcore
2535 * even if it doesn't have a vcpu.
2536 */
2537 if (pvc->master_vcore == pvc && !thr0_done)
2538 kvmppc_start_thread(NULL, pvc);
2539 thr += pvc->num_threads;
2540 } 2773 }
2774 /*
2775 * We need to start the first thread of each subcore
2776 * even if it doesn't have a vcpu.
2777 */
2778 if (!thr0_done)
2779 kvmppc_start_thread(NULL, pvc);
2780 thr += pvc->num_threads;
2541 } 2781 }
2542 2782
2543 /* 2783 /*
@@ -2564,17 +2804,27 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2564 trace_kvmppc_run_core(vc, 0); 2804 trace_kvmppc_run_core(vc, 0);
2565 2805
2566 for (sub = 0; sub < core_info.n_subcores; ++sub) 2806 for (sub = 0; sub < core_info.n_subcores; ++sub)
2567 list_for_each_entry(pvc, &core_info.vcs[sub], preempt_list) 2807 spin_unlock(&core_info.vc[sub]->lock);
2568 spin_unlock(&pvc->lock); 2808
2809 /*
2810 * Interrupts will be enabled once we get into the guest,
2811 * so tell lockdep that we're about to enable interrupts.
2812 */
2813 trace_hardirqs_on();
2569 2814
2570 guest_enter(); 2815 guest_enter();
2571 2816
2572 srcu_idx = srcu_read_lock(&vc->kvm->srcu); 2817 srcu_idx = srcu_read_lock(&vc->kvm->srcu);
2573 2818
2574 __kvmppc_vcore_entry(); 2819 trap = __kvmppc_vcore_entry();
2575 2820
2576 srcu_read_unlock(&vc->kvm->srcu, srcu_idx); 2821 srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
2577 2822
2823 guest_exit();
2824
2825 trace_hardirqs_off();
2826 set_irq_happened(trap);
2827
2578 spin_lock(&vc->lock); 2828 spin_lock(&vc->lock);
2579 /* prevent other vcpu threads from doing kvmppc_start_thread() now */ 2829 /* prevent other vcpu threads from doing kvmppc_start_thread() now */
2580 vc->vcore_state = VCORE_EXITING; 2830 vc->vcore_state = VCORE_EXITING;
@@ -2602,6 +2852,10 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2602 split_info.do_nap = 0; 2852 split_info.do_nap = 0;
2603 } 2853 }
2604 2854
2855 kvmppc_set_host_core(pcpu);
2856
2857 local_irq_enable();
2858
2605 /* Let secondaries go back to the offline loop */ 2859 /* Let secondaries go back to the offline loop */
2606 for (i = 0; i < controlled_threads; ++i) { 2860 for (i = 0; i < controlled_threads; ++i) {
2607 kvmppc_release_hwthread(pcpu + i); 2861 kvmppc_release_hwthread(pcpu + i);
@@ -2610,18 +2864,15 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
2610 cpumask_clear_cpu(pcpu + i, &vc->kvm->arch.cpu_in_guest); 2864 cpumask_clear_cpu(pcpu + i, &vc->kvm->arch.cpu_in_guest);
2611 } 2865 }
2612 2866
2613 kvmppc_set_host_core(pcpu);
2614
2615 spin_unlock(&vc->lock); 2867 spin_unlock(&vc->lock);
2616 2868
2617 /* make sure updates to secondary vcpu structs are visible now */ 2869 /* make sure updates to secondary vcpu structs are visible now */
2618 smp_mb(); 2870 smp_mb();
2619 guest_exit();
2620 2871
2621 for (sub = 0; sub < core_info.n_subcores; ++sub) 2872 for (sub = 0; sub < core_info.n_subcores; ++sub) {
2622 list_for_each_entry_safe(pvc, vcnext, &core_info.vcs[sub], 2873 pvc = core_info.vc[sub];
2623 preempt_list) 2874 post_guest_process(pvc, pvc == vc);
2624 post_guest_process(pvc, pvc == vc); 2875 }
2625 2876
2626 spin_lock(&vc->lock); 2877 spin_lock(&vc->lock);
2627 preempt_enable(); 2878 preempt_enable();
@@ -2666,6 +2917,30 @@ static void shrink_halt_poll_ns(struct kvmppc_vcore *vc)
2666 vc->halt_poll_ns /= halt_poll_ns_shrink; 2917 vc->halt_poll_ns /= halt_poll_ns_shrink;
2667} 2918}
2668 2919
2920#ifdef CONFIG_KVM_XICS
2921static inline bool xive_interrupt_pending(struct kvm_vcpu *vcpu)
2922{
2923 if (!xive_enabled())
2924 return false;
2925 return vcpu->arch.xive_saved_state.pipr <
2926 vcpu->arch.xive_saved_state.cppr;
2927}
2928#else
2929static inline bool xive_interrupt_pending(struct kvm_vcpu *vcpu)
2930{
2931 return false;
2932}
2933#endif /* CONFIG_KVM_XICS */
2934
2935static bool kvmppc_vcpu_woken(struct kvm_vcpu *vcpu)
2936{
2937 if (vcpu->arch.pending_exceptions || vcpu->arch.prodded ||
2938 kvmppc_doorbell_pending(vcpu) || xive_interrupt_pending(vcpu))
2939 return true;
2940
2941 return false;
2942}
2943
2669/* 2944/*
2670 * Check to see if any of the runnable vcpus on the vcore have pending 2945 * Check to see if any of the runnable vcpus on the vcore have pending
2671 * exceptions or are no longer ceded 2946 * exceptions or are no longer ceded
@@ -2676,8 +2951,7 @@ static int kvmppc_vcore_check_block(struct kvmppc_vcore *vc)
2676 int i; 2951 int i;
2677 2952
2678 for_each_runnable_thread(i, vcpu, vc) { 2953 for_each_runnable_thread(i, vcpu, vc) {
2679 if (vcpu->arch.pending_exceptions || !vcpu->arch.ceded || 2954 if (!vcpu->arch.ceded || kvmppc_vcpu_woken(vcpu))
2680 vcpu->arch.prodded)
2681 return 1; 2955 return 1;
2682 } 2956 }
2683 2957
@@ -2819,15 +3093,14 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
2819 */ 3093 */
2820 if (!signal_pending(current)) { 3094 if (!signal_pending(current)) {
2821 if (vc->vcore_state == VCORE_PIGGYBACK) { 3095 if (vc->vcore_state == VCORE_PIGGYBACK) {
2822 struct kvmppc_vcore *mvc = vc->master_vcore; 3096 if (spin_trylock(&vc->lock)) {
2823 if (spin_trylock(&mvc->lock)) { 3097 if (vc->vcore_state == VCORE_RUNNING &&
2824 if (mvc->vcore_state == VCORE_RUNNING && 3098 !VCORE_IS_EXITING(vc)) {
2825 !VCORE_IS_EXITING(mvc)) {
2826 kvmppc_create_dtl_entry(vcpu, vc); 3099 kvmppc_create_dtl_entry(vcpu, vc);
2827 kvmppc_start_thread(vcpu, vc); 3100 kvmppc_start_thread(vcpu, vc);
2828 trace_kvm_guest_enter(vcpu); 3101 trace_kvm_guest_enter(vcpu);
2829 } 3102 }
2830 spin_unlock(&mvc->lock); 3103 spin_unlock(&vc->lock);
2831 } 3104 }
2832 } else if (vc->vcore_state == VCORE_RUNNING && 3105 } else if (vc->vcore_state == VCORE_RUNNING &&
2833 !VCORE_IS_EXITING(vc)) { 3106 !VCORE_IS_EXITING(vc)) {
@@ -2863,7 +3136,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
2863 break; 3136 break;
2864 n_ceded = 0; 3137 n_ceded = 0;
2865 for_each_runnable_thread(i, v, vc) { 3138 for_each_runnable_thread(i, v, vc) {
2866 if (!v->arch.pending_exceptions && !v->arch.prodded) 3139 if (!kvmppc_vcpu_woken(v))
2867 n_ceded += v->arch.ceded; 3140 n_ceded += v->arch.ceded;
2868 else 3141 else
2869 v->arch.ceded = 0; 3142 v->arch.ceded = 0;
@@ -3519,6 +3792,19 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
3519 kvm_hv_vm_activated(); 3792 kvm_hv_vm_activated();
3520 3793
3521 /* 3794 /*
3795 * Initialize smt_mode depending on processor.
3796 * POWER8 and earlier have to use "strict" threading, where
3797 * all vCPUs in a vcore have to run on the same (sub)core,
3798 * whereas on POWER9 the threads can each run a different
3799 * guest.
3800 */
3801 if (!cpu_has_feature(CPU_FTR_ARCH_300))
3802 kvm->arch.smt_mode = threads_per_subcore;
3803 else
3804 kvm->arch.smt_mode = 1;
3805 kvm->arch.emul_smt_mode = 1;
3806
3807 /*
3522 * Create a debugfs directory for the VM 3808 * Create a debugfs directory for the VM
3523 */ 3809 */
3524 snprintf(buf, sizeof(buf), "vm%d", current->pid); 3810 snprintf(buf, sizeof(buf), "vm%d", current->pid);
@@ -3947,6 +4233,7 @@ static struct kvmppc_ops kvm_ops_hv = {
3947#endif 4233#endif
3948 .configure_mmu = kvmhv_configure_mmu, 4234 .configure_mmu = kvmhv_configure_mmu,
3949 .get_rmmu_info = kvmhv_get_rmmu_info, 4235 .get_rmmu_info = kvmhv_get_rmmu_info,
4236 .set_smt_mode = kvmhv_set_smt_mode,
3950}; 4237};
3951 4238
3952static int kvm_init_subcore_bitmap(void) 4239static int kvm_init_subcore_bitmap(void)
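The xive_interrupt_pending() helper added above encodes the XIVE priority convention: numerically lower values are more favoured, so an interrupt is deliverable when the saved pending priority (PIPR) is strictly below the vCPU's current priority ceiling (CPPR). kvmppc_vcpu_woken() then folds that test into the single wakeup predicate used both when deciding whether a vcore may block and when counting ceded vCPUs in kvmppc_run_vcpu(). A minimal illustration of the comparison, with a hypothetical helper name and not taken from the patch:

/*
 * Illustrative sketch only -- mirrors the PIPR/CPPR comparison made by
 * xive_interrupt_pending().  In XIVE, priority 0 is the most favoured
 * and 0xff the least, so a pending priority strictly lower than the
 * current CPPR would be presented to the vCPU.
 */
static bool xive_prio_deliverable(unsigned char pipr, unsigned char cppr)
{
	return pipr < cppr;
}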
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index ee4c2558c305..90644db9d38e 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -307,7 +307,7 @@ void kvmhv_commence_exit(int trap)
307 return; 307 return;
308 308
309 for (i = 0; i < MAX_SUBCORES; ++i) { 309 for (i = 0; i < MAX_SUBCORES; ++i) {
310 vc = sip->master_vcs[i]; 310 vc = sip->vc[i];
311 if (!vc) 311 if (!vc)
312 break; 312 break;
313 do { 313 do {
diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 404deb512844..dc54373c8780 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -61,13 +61,6 @@ BEGIN_FTR_SECTION
61 std r3, HSTATE_DABR(r13) 61 std r3, HSTATE_DABR(r13)
62END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S) 62END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
63 63
64 /* Hard-disable interrupts */
65 mfmsr r10
66 std r10, HSTATE_HOST_MSR(r13)
67 rldicl r10,r10,48,1
68 rotldi r10,r10,16
69 mtmsrd r10,1
70
71 /* Save host PMU registers */ 64 /* Save host PMU registers */
72BEGIN_FTR_SECTION 65BEGIN_FTR_SECTION
73 /* Work around P8 PMAE bug */ 66 /* Work around P8 PMAE bug */
@@ -153,6 +146,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
153 * 146 *
154 * R1 = host R1 147 * R1 = host R1
155 * R2 = host R2 148 * R2 = host R2
149 * R3 = trap number on this thread
156 * R12 = exit handler id 150 * R12 = exit handler id
157 * R13 = PACA 151 * R13 = PACA
158 */ 152 */
diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
index 7ef0993214f3..c356f9a40b24 100644
--- a/arch/powerpc/kvm/book3s_hv_ras.c
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -130,12 +130,28 @@ static long kvmppc_realmode_mc_power7(struct kvm_vcpu *vcpu)
130 130
131out: 131out:
132 /* 132 /*
133 * For guest that supports FWNMI capability, hook the MCE event into
134 * vcpu structure. We are going to exit the guest with KVM_EXIT_NMI
135 * exit reason. On our way to exit we will pull this event from vcpu
136 * structure and print it from thread 0 of the core/subcore.
137 *
138 * For guest that does not support FWNMI capability (old QEMU):
133 * We are now going enter guest either through machine check 139 * We are now going enter guest either through machine check
134 * interrupt (for unhandled errors) or will continue from 140 * interrupt (for unhandled errors) or will continue from
135 * current HSRR0 (for handled errors) in guest. Hence 141 * current HSRR0 (for handled errors) in guest. Hence
136 * queue up the event so that we can log it from host console later. 142 * queue up the event so that we can log it from host console later.
137 */ 143 */
138 machine_check_queue_event(); 144 if (vcpu->kvm->arch.fwnmi_enabled) {
145 /*
146 * Hook up the mce event on to vcpu structure.
147 * First clear the old event.
148 */
149 memset(&vcpu->arch.mce_evt, 0, sizeof(vcpu->arch.mce_evt));
150 if (get_mce_event(&mce_evt, MCE_EVENT_RELEASE)) {
151 vcpu->arch.mce_evt = mce_evt;
152 }
153 } else
154 machine_check_queue_event();
139 155
140 return handled; 156 return handled;
141} 157}
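With the FWNMI path above, a capable guest no longer has its machine checks silently queued on the host: the event is attached to the vcpu and the run loop exits with KVM_EXIT_NMI so userspace can report it to the guest (as an RTAS event, per the comments elsewhere in this series). A rough userspace sketch of opting in and reacting, assuming the usual KVM file descriptors are already set up and eliding all error handling (names such as enable_fwnmi are illustrative, not from the patch):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch only: per-vCPU opt-in, matching kvm_vcpu_ioctl_enable_cap(). */
static void enable_fwnmi(int vcpu_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_PPC_FWNMI };

	ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}

static void handle_exit(struct kvm_run *run)
{
	switch (run->exit_reason) {
	case KVM_EXIT_NMI:
		/* Machine check forwarded from the guest; a real VMM would
		 * turn the vcpu state into an RTAS error log and inject it. */
		break;
	/* ... other exit reasons ... */
	}
}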
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 4888dd494604..6ea4b53f4b16 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -45,7 +45,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
45#define NAPPING_NOVCPU 2 45#define NAPPING_NOVCPU 2
46 46
47/* Stack frame offsets for kvmppc_hv_entry */ 47/* Stack frame offsets for kvmppc_hv_entry */
48#define SFS 144 48#define SFS 160
49#define STACK_SLOT_TRAP (SFS-4) 49#define STACK_SLOT_TRAP (SFS-4)
50#define STACK_SLOT_TID (SFS-16) 50#define STACK_SLOT_TID (SFS-16)
51#define STACK_SLOT_PSSCR (SFS-24) 51#define STACK_SLOT_PSSCR (SFS-24)
@@ -54,6 +54,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
54#define STACK_SLOT_CIABR (SFS-48) 54#define STACK_SLOT_CIABR (SFS-48)
55#define STACK_SLOT_DAWR (SFS-56) 55#define STACK_SLOT_DAWR (SFS-56)
56#define STACK_SLOT_DAWRX (SFS-64) 56#define STACK_SLOT_DAWRX (SFS-64)
57#define STACK_SLOT_HFSCR (SFS-72)
57 58
58/* 59/*
59 * Call kvmppc_hv_entry in real mode. 60 * Call kvmppc_hv_entry in real mode.
@@ -68,6 +69,7 @@ _GLOBAL_TOC(kvmppc_hv_entry_trampoline)
68 std r0, PPC_LR_STKOFF(r1) 69 std r0, PPC_LR_STKOFF(r1)
69 stdu r1, -112(r1) 70 stdu r1, -112(r1)
70 mfmsr r10 71 mfmsr r10
72 std r10, HSTATE_HOST_MSR(r13)
71 LOAD_REG_ADDR(r5, kvmppc_call_hv_entry) 73 LOAD_REG_ADDR(r5, kvmppc_call_hv_entry)
72 li r0,MSR_RI 74 li r0,MSR_RI
73 andc r0,r10,r0 75 andc r0,r10,r0
@@ -152,20 +154,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
152 stb r0, HSTATE_HWTHREAD_REQ(r13) 154 stb r0, HSTATE_HWTHREAD_REQ(r13)
153 155
154 /* 156 /*
155 * For external and machine check interrupts, we need 157 * For external interrupts we need to call the Linux
156 * to call the Linux handler to process the interrupt. 158 * handler to process the interrupt. We do that by jumping
157 * We do that by jumping to absolute address 0x500 for 159 * to absolute address 0x500 for external interrupts.
158 * external interrupts, or the machine_check_fwnmi label 160 * The [h]rfid at the end of the handler will return to
159 * for machine checks (since firmware might have patched 161 * the book3s_hv_interrupts.S code. For other interrupts
160 * the vector area at 0x200). The [h]rfid at the end of the 162 * we do the rfid to get back to the book3s_hv_interrupts.S
161 * handler will return to the book3s_hv_interrupts.S code. 163 * code here.
162 * For other interrupts we do the rfid to get back
163 * to the book3s_hv_interrupts.S code here.
164 */ 164 */
165 ld r8, 112+PPC_LR_STKOFF(r1) 165 ld r8, 112+PPC_LR_STKOFF(r1)
166 addi r1, r1, 112 166 addi r1, r1, 112
167 ld r7, HSTATE_HOST_MSR(r13) 167 ld r7, HSTATE_HOST_MSR(r13)
168 168
169 /* Return the trap number on this thread as the return value */
170 mr r3, r12
171
169 /* 172 /*
170 * If we came back from the guest via a relocation-on interrupt, 173 * If we came back from the guest via a relocation-on interrupt,
171 * we will be in virtual mode at this point, which makes it a 174 * we will be in virtual mode at this point, which makes it a
@@ -175,59 +178,20 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
175 andi. r0, r0, MSR_IR /* in real mode? */ 178 andi. r0, r0, MSR_IR /* in real mode? */
176 bne .Lvirt_return 179 bne .Lvirt_return
177 180
178 cmpwi cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK 181 /* RFI into the highmem handler */
179 cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL
180 beq 11f
181 cmpwi r12, BOOK3S_INTERRUPT_H_DOORBELL
182 beq 15f /* Invoke the H_DOORBELL handler */
183 cmpwi cr2, r12, BOOK3S_INTERRUPT_HMI
184 beq cr2, 14f /* HMI check */
185
186 /* RFI into the highmem handler, or branch to interrupt handler */
187 mfmsr r6 182 mfmsr r6
188 li r0, MSR_RI 183 li r0, MSR_RI
189 andc r6, r6, r0 184 andc r6, r6, r0
190 mtmsrd r6, 1 /* Clear RI in MSR */ 185 mtmsrd r6, 1 /* Clear RI in MSR */
191 mtsrr0 r8 186 mtsrr0 r8
192 mtsrr1 r7 187 mtsrr1 r7
193 beq cr1, 13f /* machine check */
194 RFI 188 RFI
195 189
196 /* On POWER7, we have external interrupts set to use HSRR0/1 */ 190 /* Virtual-mode return */
19711: mtspr SPRN_HSRR0, r8
198 mtspr SPRN_HSRR1, r7
199 ba 0x500
200
20113: b machine_check_fwnmi
202
20314: mtspr SPRN_HSRR0, r8
204 mtspr SPRN_HSRR1, r7
205 b hmi_exception_after_realmode
206
20715: mtspr SPRN_HSRR0, r8
208 mtspr SPRN_HSRR1, r7
209 ba 0xe80
210
211 /* Virtual-mode return - can't get here for HMI or machine check */
212.Lvirt_return: 191.Lvirt_return:
213 cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL 192 mtlr r8
214 beq 16f
215 cmpwi r12, BOOK3S_INTERRUPT_H_DOORBELL
216 beq 17f
217 andi. r0, r7, MSR_EE /* were interrupts hard-enabled? */
218 beq 18f
219 mtmsrd r7, 1 /* if so then re-enable them */
22018: mtlr r8
221 blr 193 blr
222 194
22316: mtspr SPRN_HSRR0, r8 /* jump to reloc-on external vector */
224 mtspr SPRN_HSRR1, r7
225 b exc_virt_0x4500_hardware_interrupt
226
22717: mtspr SPRN_HSRR0, r8
228 mtspr SPRN_HSRR1, r7
229 b exc_virt_0x4e80_h_doorbell
230
231kvmppc_primary_no_guest: 195kvmppc_primary_no_guest:
232 /* We handle this much like a ceded vcpu */ 196 /* We handle this much like a ceded vcpu */
233 /* put the HDEC into the DEC, since HDEC interrupts don't wake us */ 197 /* put the HDEC into the DEC, since HDEC interrupts don't wake us */
@@ -769,6 +733,8 @@ BEGIN_FTR_SECTION
769 std r6, STACK_SLOT_PSSCR(r1) 733 std r6, STACK_SLOT_PSSCR(r1)
770 std r7, STACK_SLOT_PID(r1) 734 std r7, STACK_SLOT_PID(r1)
771 std r8, STACK_SLOT_IAMR(r1) 735 std r8, STACK_SLOT_IAMR(r1)
736 mfspr r5, SPRN_HFSCR
737 std r5, STACK_SLOT_HFSCR(r1)
772END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) 738END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
773BEGIN_FTR_SECTION 739BEGIN_FTR_SECTION
774 mfspr r5, SPRN_CIABR 740 mfspr r5, SPRN_CIABR
@@ -920,8 +886,10 @@ FTR_SECTION_ELSE
920 ld r5, VCPU_TID(r4) 886 ld r5, VCPU_TID(r4)
921 ld r6, VCPU_PSSCR(r4) 887 ld r6, VCPU_PSSCR(r4)
922 oris r6, r6, PSSCR_EC@h /* This makes stop trap to HV */ 888 oris r6, r6, PSSCR_EC@h /* This makes stop trap to HV */
889 ld r7, VCPU_HFSCR(r4)
923 mtspr SPRN_TIDR, r5 890 mtspr SPRN_TIDR, r5
924 mtspr SPRN_PSSCR, r6 891 mtspr SPRN_PSSCR, r6
892 mtspr SPRN_HFSCR, r7
925ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300) 893ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
9268: 8948:
927 895
@@ -936,7 +904,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
936 mftb r7 904 mftb r7
937 subf r3,r7,r8 905 subf r3,r7,r8
938 mtspr SPRN_DEC,r3 906 mtspr SPRN_DEC,r3
939 stw r3,VCPU_DEC(r4) 907 std r3,VCPU_DEC(r4)
940 908
941 ld r5, VCPU_SPRG0(r4) 909 ld r5, VCPU_SPRG0(r4)
942 ld r6, VCPU_SPRG1(r4) 910 ld r6, VCPU_SPRG1(r4)
@@ -1048,7 +1016,13 @@ kvmppc_cede_reentry: /* r4 = vcpu, r13 = paca */
1048 li r0, BOOK3S_INTERRUPT_EXTERNAL 1016 li r0, BOOK3S_INTERRUPT_EXTERNAL
1049 bne cr1, 12f 1017 bne cr1, 12f
1050 mfspr r0, SPRN_DEC 1018 mfspr r0, SPRN_DEC
1051 cmpwi r0, 0 1019BEGIN_FTR_SECTION
1020 /* On POWER9 check whether the guest has large decrementer enabled */
1021 andis. r8, r8, LPCR_LD@h
1022 bne 15f
1023END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
1024 extsw r0, r0
102515: cmpdi r0, 0
1052 li r0, BOOK3S_INTERRUPT_DECREMENTER 1026 li r0, BOOK3S_INTERRUPT_DECREMENTER
1053 bge 5f 1027 bge 5f
1054 1028
@@ -1058,6 +1032,23 @@ kvmppc_cede_reentry: /* r4 = vcpu, r13 = paca */
1058 mr r9, r4 1032 mr r9, r4
1059 bl kvmppc_msr_interrupt 1033 bl kvmppc_msr_interrupt
10605: 10345:
1035BEGIN_FTR_SECTION
1036 b fast_guest_return
1037END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
1038 /* On POWER9, check for pending doorbell requests */
1039 lbz r0, VCPU_DBELL_REQ(r4)
1040 cmpwi r0, 0
1041 beq fast_guest_return
1042 ld r5, HSTATE_KVM_VCORE(r13)
1043 /* Set DPDES register so the CPU will take a doorbell interrupt */
1044 li r0, 1
1045 mtspr SPRN_DPDES, r0
1046 std r0, VCORE_DPDES(r5)
1047 /* Make sure other cpus see vcore->dpdes set before dbell req clear */
1048 lwsync
1049 /* Clear the pending doorbell request */
1050 li r0, 0
1051 stb r0, VCPU_DBELL_REQ(r4)
1061 1052
1062/* 1053/*
1063 * Required state: 1054 * Required state:
@@ -1232,6 +1223,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
1232 1223
1233 stw r12,VCPU_TRAP(r9) 1224 stw r12,VCPU_TRAP(r9)
1234 1225
1226 /*
1227 * Now that we have saved away SRR0/1 and HSRR0/1,
1228 * interrupts are recoverable in principle, so set MSR_RI.
1229 * This becomes important for relocation-on interrupts from
1230 * the guest, which we can get in radix mode on POWER9.
1231 */
1232 li r0, MSR_RI
1233 mtmsrd r0, 1
1234
1235#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING 1235#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
1236 addi r3, r9, VCPU_TB_RMINTR 1236 addi r3, r9, VCPU_TB_RMINTR
1237 mr r4, r9 1237 mr r4, r9
@@ -1288,6 +1288,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
1288 beq 4f 1288 beq 4f
1289 b guest_exit_cont 1289 b guest_exit_cont
12903: 12903:
1291 /* If it's a hypervisor facility unavailable interrupt, save HFSCR */
1292 cmpwi r12, BOOK3S_INTERRUPT_H_FAC_UNAVAIL
1293 bne 14f
1294 mfspr r3, SPRN_HFSCR
1295 std r3, VCPU_HFSCR(r9)
1296 b guest_exit_cont
129714:
1291 /* External interrupt ? */ 1298 /* External interrupt ? */
1292 cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL 1299 cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL
1293 bne+ guest_exit_cont 1300 bne+ guest_exit_cont
@@ -1475,12 +1482,18 @@ mc_cont:
1475 mtspr SPRN_SPURR,r4 1482 mtspr SPRN_SPURR,r4
1476 1483
1477 /* Save DEC */ 1484 /* Save DEC */
1485 ld r3, HSTATE_KVM_VCORE(r13)
1478 mfspr r5,SPRN_DEC 1486 mfspr r5,SPRN_DEC
1479 mftb r6 1487 mftb r6
1488 /* On P9, if the guest has large decr enabled, don't sign extend */
1489BEGIN_FTR_SECTION
1490 ld r4, VCORE_LPCR(r3)
1491 andis. r4, r4, LPCR_LD@h
1492 bne 16f
1493END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
1480 extsw r5,r5 1494 extsw r5,r5
1481 add r5,r5,r6 149516: add r5,r5,r6
1482 /* r5 is a guest timebase value here, convert to host TB */ 1496 /* r5 is a guest timebase value here, convert to host TB */
1483 ld r3,HSTATE_KVM_VCORE(r13)
1484 ld r4,VCORE_TB_OFFSET(r3) 1497 ld r4,VCORE_TB_OFFSET(r3)
1485 subf r5,r4,r5 1498 subf r5,r4,r5
1486 std r5,VCPU_DEC_EXPIRES(r9) 1499 std r5,VCPU_DEC_EXPIRES(r9)
@@ -1525,6 +1538,9 @@ FTR_SECTION_ELSE
1525 rldicl r6, r6, 4, 50 /* r6 &= PSSCR_GUEST_VIS */ 1538 rldicl r6, r6, 4, 50 /* r6 &= PSSCR_GUEST_VIS */
1526 rotldi r6, r6, 60 1539 rotldi r6, r6, 60
1527 std r6, VCPU_PSSCR(r9) 1540 std r6, VCPU_PSSCR(r9)
1541 /* Restore host HFSCR value */
1542 ld r7, STACK_SLOT_HFSCR(r1)
1543 mtspr SPRN_HFSCR, r7
1528ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300) 1544ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
1529 /* 1545 /*
1530 * Restore various registers to 0, where non-zero values 1546 * Restore various registers to 0, where non-zero values
@@ -2402,8 +2418,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
2402 mfspr r3, SPRN_DEC 2418 mfspr r3, SPRN_DEC
2403 mfspr r4, SPRN_HDEC 2419 mfspr r4, SPRN_HDEC
2404 mftb r5 2420 mftb r5
2421BEGIN_FTR_SECTION
2422 /* On P9 check whether the guest has large decrementer mode enabled */
2423 ld r6, HSTATE_KVM_VCORE(r13)
2424 ld r6, VCORE_LPCR(r6)
2425 andis. r6, r6, LPCR_LD@h
2426 bne 68f
2427END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
2405 extsw r3, r3 2428 extsw r3, r3
2406 EXTEND_HDEC(r4) 242968: EXTEND_HDEC(r4)
2407 cmpd r3, r4 2430 cmpd r3, r4
2408 ble 67f 2431 ble 67f
2409 mtspr SPRN_DEC, r4 2432 mtspr SPRN_DEC, r4
@@ -2589,22 +2612,32 @@ machine_check_realmode:
2589 ld r9, HSTATE_KVM_VCPU(r13) 2612 ld r9, HSTATE_KVM_VCPU(r13)
2590 li r12, BOOK3S_INTERRUPT_MACHINE_CHECK 2613 li r12, BOOK3S_INTERRUPT_MACHINE_CHECK
2591 /* 2614 /*
2592 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest through 2615 * For the guest that is FWNMI capable, deliver all the MCE errors
2593 * machine check interrupt (set HSRR0 to 0x200). And for handled 2616 * (handled/unhandled) by exiting the guest with KVM_EXIT_NMI exit
2594 * errors (no-fatal), just go back to guest execution with current 2617 * reason. This new approach injects machine check errors in guest
2595 * HSRR0 instead of exiting guest. This new approach will inject 2618 * address space to guest with additional information in the form
2596 * machine check to guest for fatal error causing guest to crash. 2619 * of RTAS event, thus enabling guest kernel to suitably handle
2597 * 2620 * such errors.
2598 * The old code used to return to host for unhandled errors which
2599 * was causing guest to hang with soft lockups inside guest and
2600 * makes it difficult to recover guest instance.
2601 * 2621 *
2622 * For the guest that is not FWNMI capable (old QEMU) fallback
2623 * to old behaviour for backward compatibility:
2624 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest either
2625 * through machine check interrupt (set HSRR0 to 0x200).
2626 * For handled errors (no-fatal), just go back to guest execution
2627 * with current HSRR0.
2602 * if we receive machine check with MSR(RI=0) then deliver it to 2628 * if we receive machine check with MSR(RI=0) then deliver it to
2603 * guest as machine check causing guest to crash. 2629 * guest as machine check causing guest to crash.
2604 */ 2630 */
2605 ld r11, VCPU_MSR(r9) 2631 ld r11, VCPU_MSR(r9)
2606 rldicl. r0, r11, 64-MSR_HV_LG, 63 /* check if it happened in HV mode */ 2632 rldicl. r0, r11, 64-MSR_HV_LG, 63 /* check if it happened in HV mode */
2607 bne mc_cont /* if so, exit to host */ 2633 bne mc_cont /* if so, exit to host */
2634 /* Check if guest is capable of handling NMI exit */
2635 ld r10, VCPU_KVM(r9)
2636 lbz r10, KVM_FWNMI(r10)
2637 cmpdi r10, 1 /* FWNMI capable? */
2638 beq mc_cont /* if so, exit with KVM_EXIT_NMI. */
2639
2640 /* if not, fall through for backward compatibility. */
2608 andi. r10, r11, MSR_RI /* check for unrecoverable exception */ 2641 andi. r10, r11, MSR_RI /* check for unrecoverable exception */
2609 beq 1f /* Deliver a machine check to guest */ 2642 beq 1f /* Deliver a machine check to guest */
2610 ld r10, VCPU_PC(r9) 2643 ld r10, VCPU_PC(r9)
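Several hunks above guard the old "extsw" with a test of LPCR[LD]: with the POWER9 large decrementer enabled, DEC is already a full-width signed quantity, while on earlier processors (or with LD clear) the 32-bit DEC value must be sign-extended before it can be compared with the HDEC or added to the timebase. The same rule in C, as a hedged sketch with a hypothetical helper:

#include <stdbool.h>
#include <stdint.h>

/* Sketch only: widen a DEC sample before 64-bit arithmetic, matching
 * the LPCR_LD checks added around the extsw instructions. */
static inline int64_t dec_to_signed(uint64_t dec, bool large_dec_enabled)
{
	if (large_dec_enabled)
		return (int64_t)dec;		/* already full width */
	return (int64_t)(int32_t)dec;		/* classic 32-bit DEC */
}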
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index ffe1da95033a..08b200a0bbce 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1257,8 +1257,8 @@ static void xive_pre_save_scan(struct kvmppc_xive *xive)
1257 if (!xc) 1257 if (!xc)
1258 continue; 1258 continue;
1259 for (j = 0; j < KVMPPC_XIVE_Q_COUNT; j++) { 1259 for (j = 0; j < KVMPPC_XIVE_Q_COUNT; j++) {
1260 if (xc->queues[i].qpage) 1260 if (xc->queues[j].qpage)
1261 xive_pre_save_queue(xive, &xc->queues[i]); 1261 xive_pre_save_queue(xive, &xc->queues[j]);
1262 } 1262 }
1263 } 1263 }
1264 1264
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 3eaac3809977..071b87ee682f 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -687,7 +687,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
687 687
688 kvmppc_core_check_exceptions(vcpu); 688 kvmppc_core_check_exceptions(vcpu);
689 689
690 if (vcpu->requests) { 690 if (kvm_request_pending(vcpu)) {
691 /* Exception delivery raised request; start over */ 691 /* Exception delivery raised request; start over */
692 return 1; 692 return 1;
693 } 693 }
diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
index c873ffe55362..4d8b4d6cebff 100644
--- a/arch/powerpc/kvm/emulate.c
+++ b/arch/powerpc/kvm/emulate.c
@@ -39,7 +39,7 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu)
39 unsigned long dec_nsec; 39 unsigned long dec_nsec;
40 unsigned long long dec_time; 40 unsigned long long dec_time;
41 41
42 pr_debug("mtDEC: %x\n", vcpu->arch.dec); 42 pr_debug("mtDEC: %lx\n", vcpu->arch.dec);
43 hrtimer_try_to_cancel(&vcpu->arch.dec_timer); 43 hrtimer_try_to_cancel(&vcpu->arch.dec_timer);
44 44
45#ifdef CONFIG_PPC_BOOK3S 45#ifdef CONFIG_PPC_BOOK3S
@@ -109,7 +109,7 @@ static int kvmppc_emulate_mtspr(struct kvm_vcpu *vcpu, int sprn, int rs)
109 case SPRN_TBWU: break; 109 case SPRN_TBWU: break;
110 110
111 case SPRN_DEC: 111 case SPRN_DEC:
112 vcpu->arch.dec = spr_val; 112 vcpu->arch.dec = (u32) spr_val;
113 kvmppc_emulate_dec(vcpu); 113 kvmppc_emulate_dec(vcpu);
114 break; 114 break;
115 115
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7f71ab5fcad1..1a75c0b5f4ca 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -55,8 +55,7 @@ EXPORT_SYMBOL_GPL(kvmppc_pr_ops);
55 55
56int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) 56int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
57{ 57{
58 return !!(v->arch.pending_exceptions) || 58 return !!(v->arch.pending_exceptions) || kvm_request_pending(v);
59 v->requests;
60} 59}
61 60
62int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) 61int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
@@ -108,7 +107,7 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
108 */ 107 */
109 smp_mb(); 108 smp_mb();
110 109
111 if (vcpu->requests) { 110 if (kvm_request_pending(vcpu)) {
112 /* Make sure we process requests preemptable */ 111 /* Make sure we process requests preemptable */
113 local_irq_enable(); 112 local_irq_enable();
114 trace_kvm_check_requests(vcpu); 113 trace_kvm_check_requests(vcpu);
@@ -554,13 +553,28 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
554#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE 553#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
555 case KVM_CAP_PPC_SMT: 554 case KVM_CAP_PPC_SMT:
556 r = 0; 555 r = 0;
557 if (hv_enabled) { 556 if (kvm) {
557 if (kvm->arch.emul_smt_mode > 1)
558 r = kvm->arch.emul_smt_mode;
559 else
560 r = kvm->arch.smt_mode;
561 } else if (hv_enabled) {
558 if (cpu_has_feature(CPU_FTR_ARCH_300)) 562 if (cpu_has_feature(CPU_FTR_ARCH_300))
559 r = 1; 563 r = 1;
560 else 564 else
561 r = threads_per_subcore; 565 r = threads_per_subcore;
562 } 566 }
563 break; 567 break;
568 case KVM_CAP_PPC_SMT_POSSIBLE:
569 r = 1;
570 if (hv_enabled) {
571 if (!cpu_has_feature(CPU_FTR_ARCH_300))
572 r = ((threads_per_subcore << 1) - 1);
573 else
574 /* P9 can emulate dbells, so allow any mode */
575 r = 8 | 4 | 2 | 1;
576 }
577 break;
564 case KVM_CAP_PPC_RMA: 578 case KVM_CAP_PPC_RMA:
565 r = 0; 579 r = 0;
566 break; 580 break;
@@ -619,6 +633,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
619 r = !!hv_enabled && !cpu_has_feature(CPU_FTR_ARCH_300); 633 r = !!hv_enabled && !cpu_has_feature(CPU_FTR_ARCH_300);
620 break; 634 break;
621#endif 635#endif
636#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
637 case KVM_CAP_PPC_FWNMI:
638 r = hv_enabled;
639 break;
640#endif
622 case KVM_CAP_PPC_HTM: 641 case KVM_CAP_PPC_HTM:
623 r = cpu_has_feature(CPU_FTR_TM_COMP) && 642 r = cpu_has_feature(CPU_FTR_TM_COMP) &&
624 is_kvmppc_hv_enabled(kvm); 643 is_kvmppc_hv_enabled(kvm);
@@ -1538,6 +1557,15 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
1538 break; 1557 break;
1539 } 1558 }
1540#endif /* CONFIG_KVM_XICS */ 1559#endif /* CONFIG_KVM_XICS */
1560#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
1561 case KVM_CAP_PPC_FWNMI:
1562 r = -EINVAL;
1563 if (!is_kvmppc_hv_enabled(vcpu->kvm))
1564 break;
1565 r = 0;
1566 vcpu->kvm->arch.fwnmi_enabled = true;
1567 break;
1568#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
1541 default: 1569 default:
1542 r = -EINVAL; 1570 r = -EINVAL;
1543 break; 1571 break;
@@ -1712,6 +1740,15 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
1712 r = 0; 1740 r = 0;
1713 break; 1741 break;
1714 } 1742 }
1743 case KVM_CAP_PPC_SMT: {
1744 unsigned long mode = cap->args[0];
1745 unsigned long flags = cap->args[1];
1746
1747 r = -EINVAL;
1748 if (kvm->arch.kvm_ops->set_smt_mode)
1749 r = kvm->arch.kvm_ops->set_smt_mode(kvm, mode, flags);
1750 break;
1751 }
1715#endif 1752#endif
1716 default: 1753 default:
1717 r = -EINVAL; 1754 r = -EINVAL;
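With the KVM_CAP_PPC_SMT handling above, the SMT mode becomes a per-VM property that userspace negotiates: KVM_CAP_PPC_SMT_POSSIBLE reports which modes the host can provide and KVM_ENABLE_CAP on the VM requests one of them. A hedged sketch of that sequence (vm_fd is assumed to be an open VM descriptor, error handling is elided, and the flags word in args[1] is left at zero since its semantics are not shown in this hunk):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch only, not from this patch. */
static int configure_smt(int vm_fd, unsigned long mode)
{
	struct kvm_enable_cap cap = {
		.cap  = KVM_CAP_PPC_SMT,
		.args = { mode, 0 },		/* args[0] = mode, args[1] = flags */
	};
	int possible;

	/* Bitmask of usable modes, e.g. 8|4|2|1 on POWER9 per the hunk above. */
	possible = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_SMT_POSSIBLE);
	if (possible <= 0 || !(possible & mode))
		return -1;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}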
diff --git a/arch/s390/include/asm/ctl_reg.h b/arch/s390/include/asm/ctl_reg.h
index d0441ad2a990..e508dff92535 100644
--- a/arch/s390/include/asm/ctl_reg.h
+++ b/arch/s390/include/asm/ctl_reg.h
@@ -59,7 +59,9 @@ union ctlreg0 {
59 unsigned long lap : 1; /* Low-address-protection control */ 59 unsigned long lap : 1; /* Low-address-protection control */
60 unsigned long : 4; 60 unsigned long : 4;
61 unsigned long edat : 1; /* Enhanced-DAT-enablement control */ 61 unsigned long edat : 1; /* Enhanced-DAT-enablement control */
62 unsigned long : 4; 62 unsigned long : 2;
63 unsigned long iep : 1; /* Instruction-Execution-Protection */
64 unsigned long : 1;
63 unsigned long afp : 1; /* AFP-register control */ 65 unsigned long afp : 1; /* AFP-register control */
64 unsigned long vx : 1; /* Vector enablement control */ 66 unsigned long vx : 1; /* Vector enablement control */
65 unsigned long : 7; 67 unsigned long : 7;
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 6baae236f461..a409d5991934 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -42,9 +42,11 @@
42#define KVM_HALT_POLL_NS_DEFAULT 80000 42#define KVM_HALT_POLL_NS_DEFAULT 80000
43 43
44/* s390-specific vcpu->requests bit members */ 44/* s390-specific vcpu->requests bit members */
45#define KVM_REQ_ENABLE_IBS 8 45#define KVM_REQ_ENABLE_IBS KVM_ARCH_REQ(0)
46#define KVM_REQ_DISABLE_IBS 9 46#define KVM_REQ_DISABLE_IBS KVM_ARCH_REQ(1)
47#define KVM_REQ_ICPT_OPEREXC 10 47#define KVM_REQ_ICPT_OPEREXC KVM_ARCH_REQ(2)
48#define KVM_REQ_START_MIGRATION KVM_ARCH_REQ(3)
49#define KVM_REQ_STOP_MIGRATION KVM_ARCH_REQ(4)
48 50
49#define SIGP_CTRL_C 0x80 51#define SIGP_CTRL_C 0x80
50#define SIGP_CTRL_SCN_MASK 0x3f 52#define SIGP_CTRL_SCN_MASK 0x3f
@@ -56,7 +58,7 @@ union bsca_sigp_ctrl {
56 __u8 r : 1; 58 __u8 r : 1;
57 __u8 scn : 6; 59 __u8 scn : 6;
58 }; 60 };
59} __packed; 61};
60 62
61union esca_sigp_ctrl { 63union esca_sigp_ctrl {
62 __u16 value; 64 __u16 value;
@@ -65,14 +67,14 @@ union esca_sigp_ctrl {
65 __u8 reserved: 7; 67 __u8 reserved: 7;
66 __u8 scn; 68 __u8 scn;
67 }; 69 };
68} __packed; 70};
69 71
70struct esca_entry { 72struct esca_entry {
71 union esca_sigp_ctrl sigp_ctrl; 73 union esca_sigp_ctrl sigp_ctrl;
72 __u16 reserved1[3]; 74 __u16 reserved1[3];
73 __u64 sda; 75 __u64 sda;
74 __u64 reserved2[6]; 76 __u64 reserved2[6];
75} __packed; 77};
76 78
77struct bsca_entry { 79struct bsca_entry {
78 __u8 reserved0; 80 __u8 reserved0;
@@ -80,7 +82,7 @@ struct bsca_entry {
80 __u16 reserved[3]; 82 __u16 reserved[3];
81 __u64 sda; 83 __u64 sda;
82 __u64 reserved2[2]; 84 __u64 reserved2[2];
83} __attribute__((packed)); 85};
84 86
85union ipte_control { 87union ipte_control {
86 unsigned long val; 88 unsigned long val;
@@ -97,7 +99,7 @@ struct bsca_block {
97 __u64 mcn; 99 __u64 mcn;
98 __u64 reserved2; 100 __u64 reserved2;
99 struct bsca_entry cpu[KVM_S390_BSCA_CPU_SLOTS]; 101 struct bsca_entry cpu[KVM_S390_BSCA_CPU_SLOTS];
100} __attribute__((packed)); 102};
101 103
102struct esca_block { 104struct esca_block {
103 union ipte_control ipte_control; 105 union ipte_control ipte_control;
@@ -105,7 +107,7 @@ struct esca_block {
105 __u64 mcn[4]; 107 __u64 mcn[4];
106 __u64 reserved2[20]; 108 __u64 reserved2[20];
107 struct esca_entry cpu[KVM_S390_ESCA_CPU_SLOTS]; 109 struct esca_entry cpu[KVM_S390_ESCA_CPU_SLOTS];
108} __packed; 110};
109 111
110/* 112/*
111 * This struct is used to store some machine check info from lowcore 113 * This struct is used to store some machine check info from lowcore
@@ -274,7 +276,7 @@ struct kvm_s390_sie_block {
274 276
275struct kvm_s390_itdb { 277struct kvm_s390_itdb {
276 __u8 data[256]; 278 __u8 data[256];
277} __packed; 279};
278 280
279struct sie_page { 281struct sie_page {
280 struct kvm_s390_sie_block sie_block; 282 struct kvm_s390_sie_block sie_block;
@@ -282,7 +284,7 @@ struct sie_page {
282 __u8 reserved218[1000]; /* 0x0218 */ 284 __u8 reserved218[1000]; /* 0x0218 */
283 struct kvm_s390_itdb itdb; /* 0x0600 */ 285 struct kvm_s390_itdb itdb; /* 0x0600 */
284 __u8 reserved700[2304]; /* 0x0700 */ 286 __u8 reserved700[2304]; /* 0x0700 */
285} __packed; 287};
286 288
287struct kvm_vcpu_stat { 289struct kvm_vcpu_stat {
288 u64 exit_userspace; 290 u64 exit_userspace;
@@ -695,7 +697,7 @@ struct sie_page2 {
695 __u64 fac_list[S390_ARCH_FAC_LIST_SIZE_U64]; /* 0x0000 */ 697 __u64 fac_list[S390_ARCH_FAC_LIST_SIZE_U64]; /* 0x0000 */
696 struct kvm_s390_crypto_cb crycb; /* 0x0800 */ 698 struct kvm_s390_crypto_cb crycb; /* 0x0800 */
697 u8 reserved900[0x1000 - 0x900]; /* 0x0900 */ 699 u8 reserved900[0x1000 - 0x900]; /* 0x0900 */
698} __packed; 700};
699 701
700struct kvm_s390_vsie { 702struct kvm_s390_vsie {
701 struct mutex mutex; 703 struct mutex mutex;
@@ -705,6 +707,12 @@ struct kvm_s390_vsie {
705 struct page *pages[KVM_MAX_VCPUS]; 707 struct page *pages[KVM_MAX_VCPUS];
706}; 708};
707 709
710struct kvm_s390_migration_state {
711 unsigned long bitmap_size; /* in bits (number of guest pages) */
712 atomic64_t dirty_pages; /* number of dirty pages */
713 unsigned long *pgste_bitmap;
714};
715
708struct kvm_arch{ 716struct kvm_arch{
709 void *sca; 717 void *sca;
710 int use_esca; 718 int use_esca;
@@ -732,6 +740,7 @@ struct kvm_arch{
732 struct kvm_s390_crypto crypto; 740 struct kvm_s390_crypto crypto;
733 struct kvm_s390_vsie vsie; 741 struct kvm_s390_vsie vsie;
734 u64 epoch; 742 u64 epoch;
743 struct kvm_s390_migration_state *migration_state;
735 /* subset of available cpu features enabled by user space */ 744 /* subset of available cpu features enabled by user space */
736 DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS); 745 DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS);
737}; 746};
diff --git a/arch/s390/include/asm/nmi.h b/arch/s390/include/asm/nmi.h
index 13623b9991d4..9d91cf3e427f 100644
--- a/arch/s390/include/asm/nmi.h
+++ b/arch/s390/include/asm/nmi.h
@@ -26,6 +26,12 @@
26#define MCCK_CODE_PSW_MWP_VALID _BITUL(63 - 20) 26#define MCCK_CODE_PSW_MWP_VALID _BITUL(63 - 20)
27#define MCCK_CODE_PSW_IA_VALID _BITUL(63 - 23) 27#define MCCK_CODE_PSW_IA_VALID _BITUL(63 - 23)
28 28
29#define MCCK_CR14_CR_PENDING_SUB_MASK (1 << 28)
30#define MCCK_CR14_RECOVERY_SUB_MASK (1 << 27)
31#define MCCK_CR14_DEGRAD_SUB_MASK (1 << 26)
32#define MCCK_CR14_EXT_DAMAGE_SUB_MASK (1 << 25)
33#define MCCK_CR14_WARN_SUB_MASK (1 << 24)
34
29#ifndef __ASSEMBLY__ 35#ifndef __ASSEMBLY__
30 36
31union mci { 37union mci {
diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
index 3dd2a1d308dd..69d09c39bbcd 100644
--- a/arch/s390/include/uapi/asm/kvm.h
+++ b/arch/s390/include/uapi/asm/kvm.h
@@ -28,6 +28,7 @@
28#define KVM_DEV_FLIC_CLEAR_IO_IRQ 8 28#define KVM_DEV_FLIC_CLEAR_IO_IRQ 8
29#define KVM_DEV_FLIC_AISM 9 29#define KVM_DEV_FLIC_AISM 9
30#define KVM_DEV_FLIC_AIRQ_INJECT 10 30#define KVM_DEV_FLIC_AIRQ_INJECT 10
31#define KVM_DEV_FLIC_AISM_ALL 11
31/* 32/*
32 * We can have up to 4*64k pending subchannels + 8 adapter interrupts, 33 * We can have up to 4*64k pending subchannels + 8 adapter interrupts,
33 * as well as up to ASYNC_PF_PER_VCPU*KVM_MAX_VCPUS pfault done interrupts. 34 * as well as up to ASYNC_PF_PER_VCPU*KVM_MAX_VCPUS pfault done interrupts.
@@ -53,6 +54,11 @@ struct kvm_s390_ais_req {
53 __u16 mode; 54 __u16 mode;
54}; 55};
55 56
57struct kvm_s390_ais_all {
58 __u8 simm;
59 __u8 nimm;
60};
61
56#define KVM_S390_IO_ADAPTER_MASK 1 62#define KVM_S390_IO_ADAPTER_MASK 1
57#define KVM_S390_IO_ADAPTER_MAP 2 63#define KVM_S390_IO_ADAPTER_MAP 2
58#define KVM_S390_IO_ADAPTER_UNMAP 3 64#define KVM_S390_IO_ADAPTER_UNMAP 3
@@ -70,6 +76,7 @@ struct kvm_s390_io_adapter_req {
70#define KVM_S390_VM_TOD 1 76#define KVM_S390_VM_TOD 1
71#define KVM_S390_VM_CRYPTO 2 77#define KVM_S390_VM_CRYPTO 2
72#define KVM_S390_VM_CPU_MODEL 3 78#define KVM_S390_VM_CPU_MODEL 3
79#define KVM_S390_VM_MIGRATION 4
73 80
74/* kvm attributes for mem_ctrl */ 81/* kvm attributes for mem_ctrl */
75#define KVM_S390_VM_MEM_ENABLE_CMMA 0 82#define KVM_S390_VM_MEM_ENABLE_CMMA 0
@@ -151,6 +158,11 @@ struct kvm_s390_vm_cpu_subfunc {
151#define KVM_S390_VM_CRYPTO_DISABLE_AES_KW 2 158#define KVM_S390_VM_CRYPTO_DISABLE_AES_KW 2
152#define KVM_S390_VM_CRYPTO_DISABLE_DEA_KW 3 159#define KVM_S390_VM_CRYPTO_DISABLE_DEA_KW 3
153 160
161/* kvm attributes for migration mode */
162#define KVM_S390_VM_MIGRATION_STOP 0
163#define KVM_S390_VM_MIGRATION_START 1
164#define KVM_S390_VM_MIGRATION_STATUS 2
165
154/* for KVM_GET_REGS and KVM_SET_REGS */ 166/* for KVM_GET_REGS and KVM_SET_REGS */
155struct kvm_regs { 167struct kvm_regs {
156 /* general purpose regs for s390 */ 168 /* general purpose regs for s390 */
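The KVM_S390_VM_MIGRATION group defined above gives userspace an on/off switch for migration mode plus a status query. Since KVM_S390_VM_* groups are driven through the generic VM device-attribute ioctls, a rough sketch (illustrative names, no error handling) could look like:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Sketch only: toggle and query migration mode on the VM fd. */
static void start_migration_mode(int vm_fd)
{
	struct kvm_device_attr attr = {
		.group = KVM_S390_VM_MIGRATION,
		.attr  = KVM_S390_VM_MIGRATION_START,
	};

	ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
}

static uint64_t migration_mode_active(int vm_fd)
{
	uint64_t mig = 0;
	struct kvm_device_attr attr = {
		.group = KVM_S390_VM_MIGRATION,
		.attr  = KVM_S390_VM_MIGRATION_STATUS,
		.addr  = (uint64_t)(uintptr_t)&mig,
	};

	ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &attr);
	return mig;	/* non-zero while migration mode is enabled */
}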
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 875f8bea8c67..653cae5e1ee1 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -89,7 +89,7 @@ struct region3_table_entry_fc1 {
89 unsigned long f : 1; /* Fetch-Protection Bit */ 89 unsigned long f : 1; /* Fetch-Protection Bit */
90 unsigned long fc : 1; /* Format-Control */ 90 unsigned long fc : 1; /* Format-Control */
91 unsigned long p : 1; /* DAT-Protection Bit */ 91 unsigned long p : 1; /* DAT-Protection Bit */
92 unsigned long co : 1; /* Change-Recording Override */ 92 unsigned long iep: 1; /* Instruction-Execution-Protection */
93 unsigned long : 2; 93 unsigned long : 2;
94 unsigned long i : 1; /* Region-Invalid Bit */ 94 unsigned long i : 1; /* Region-Invalid Bit */
95 unsigned long cr : 1; /* Common-Region Bit */ 95 unsigned long cr : 1; /* Common-Region Bit */
@@ -131,7 +131,7 @@ struct segment_entry_fc1 {
131 unsigned long f : 1; /* Fetch-Protection Bit */ 131 unsigned long f : 1; /* Fetch-Protection Bit */
132 unsigned long fc : 1; /* Format-Control */ 132 unsigned long fc : 1; /* Format-Control */
133 unsigned long p : 1; /* DAT-Protection Bit */ 133 unsigned long p : 1; /* DAT-Protection Bit */
134 unsigned long co : 1; /* Change-Recording Override */ 134 unsigned long iep: 1; /* Instruction-Execution-Protection */
135 unsigned long : 2; 135 unsigned long : 2;
136 unsigned long i : 1; /* Segment-Invalid Bit */ 136 unsigned long i : 1; /* Segment-Invalid Bit */
137 unsigned long cs : 1; /* Common-Segment Bit */ 137 unsigned long cs : 1; /* Common-Segment Bit */
@@ -168,7 +168,8 @@ union page_table_entry {
168 unsigned long z : 1; /* Zero Bit */ 168 unsigned long z : 1; /* Zero Bit */
169 unsigned long i : 1; /* Page-Invalid Bit */ 169 unsigned long i : 1; /* Page-Invalid Bit */
170 unsigned long p : 1; /* DAT-Protection Bit */ 170 unsigned long p : 1; /* DAT-Protection Bit */
171 unsigned long : 9; 171 unsigned long iep: 1; /* Instruction-Execution-Protection */
172 unsigned long : 8;
172 }; 173 };
173}; 174};
174 175
@@ -241,7 +242,7 @@ struct ale {
241 unsigned long asteo : 25; /* ASN-Second-Table-Entry Origin */ 242 unsigned long asteo : 25; /* ASN-Second-Table-Entry Origin */
242 unsigned long : 6; 243 unsigned long : 6;
243 unsigned long astesn : 32; /* ASTE Sequence Number */ 244 unsigned long astesn : 32; /* ASTE Sequence Number */
244} __packed; 245};
245 246
246struct aste { 247struct aste {
247 unsigned long i : 1; /* ASX-Invalid Bit */ 248 unsigned long i : 1; /* ASX-Invalid Bit */
@@ -257,7 +258,7 @@ struct aste {
257 unsigned long ald : 32; 258 unsigned long ald : 32;
258 unsigned long astesn : 32; 259 unsigned long astesn : 32;
259 /* .. more fields there */ 260 /* .. more fields there */
260} __packed; 261};
261 262
262int ipte_lock_held(struct kvm_vcpu *vcpu) 263int ipte_lock_held(struct kvm_vcpu *vcpu)
263{ 264{
@@ -485,6 +486,7 @@ enum prot_type {
485 PROT_TYPE_KEYC = 1, 486 PROT_TYPE_KEYC = 1,
486 PROT_TYPE_ALC = 2, 487 PROT_TYPE_ALC = 2,
487 PROT_TYPE_DAT = 3, 488 PROT_TYPE_DAT = 3,
489 PROT_TYPE_IEP = 4,
488}; 490};
489 491
490static int trans_exc(struct kvm_vcpu *vcpu, int code, unsigned long gva, 492static int trans_exc(struct kvm_vcpu *vcpu, int code, unsigned long gva,
@@ -500,6 +502,9 @@ static int trans_exc(struct kvm_vcpu *vcpu, int code, unsigned long gva,
500 switch (code) { 502 switch (code) {
501 case PGM_PROTECTION: 503 case PGM_PROTECTION:
502 switch (prot) { 504 switch (prot) {
505 case PROT_TYPE_IEP:
506 tec->b61 = 1;
507 /* FALL THROUGH */
503 case PROT_TYPE_LA: 508 case PROT_TYPE_LA:
504 tec->b56 = 1; 509 tec->b56 = 1;
505 break; 510 break;
@@ -591,6 +596,7 @@ static int deref_table(struct kvm *kvm, unsigned long gpa, unsigned long *val)
591 * @gpa: points to where guest physical (absolute) address should be stored 596 * @gpa: points to where guest physical (absolute) address should be stored
592 * @asce: effective asce 597 * @asce: effective asce
593 * @mode: indicates the access mode to be used 598 * @mode: indicates the access mode to be used
599 * @prot: returns the type for protection exceptions
594 * 600 *
595 * Translate a guest virtual address into a guest absolute address by means 601 * Translate a guest virtual address into a guest absolute address by means
596 * of dynamic address translation as specified by the architecture. 602 * of dynamic address translation as specified by the architecture.
@@ -606,19 +612,21 @@ static int deref_table(struct kvm *kvm, unsigned long gpa, unsigned long *val)
606 */ 612 */
607static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva, 613static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
608 unsigned long *gpa, const union asce asce, 614 unsigned long *gpa, const union asce asce,
609 enum gacc_mode mode) 615 enum gacc_mode mode, enum prot_type *prot)
610{ 616{
611 union vaddress vaddr = {.addr = gva}; 617 union vaddress vaddr = {.addr = gva};
612 union raddress raddr = {.addr = gva}; 618 union raddress raddr = {.addr = gva};
613 union page_table_entry pte; 619 union page_table_entry pte;
614 int dat_protection = 0; 620 int dat_protection = 0;
621 int iep_protection = 0;
615 union ctlreg0 ctlreg0; 622 union ctlreg0 ctlreg0;
616 unsigned long ptr; 623 unsigned long ptr;
617 int edat1, edat2; 624 int edat1, edat2, iep;
618 625
619 ctlreg0.val = vcpu->arch.sie_block->gcr[0]; 626 ctlreg0.val = vcpu->arch.sie_block->gcr[0];
620 edat1 = ctlreg0.edat && test_kvm_facility(vcpu->kvm, 8); 627 edat1 = ctlreg0.edat && test_kvm_facility(vcpu->kvm, 8);
621 edat2 = edat1 && test_kvm_facility(vcpu->kvm, 78); 628 edat2 = edat1 && test_kvm_facility(vcpu->kvm, 78);
629 iep = ctlreg0.iep && test_kvm_facility(vcpu->kvm, 130);
622 if (asce.r) 630 if (asce.r)
623 goto real_address; 631 goto real_address;
624 ptr = asce.origin * 4096; 632 ptr = asce.origin * 4096;
@@ -702,6 +710,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
702 return PGM_TRANSLATION_SPEC; 710 return PGM_TRANSLATION_SPEC;
703 if (rtte.fc && edat2) { 711 if (rtte.fc && edat2) {
704 dat_protection |= rtte.fc1.p; 712 dat_protection |= rtte.fc1.p;
713 iep_protection = rtte.fc1.iep;
705 raddr.rfaa = rtte.fc1.rfaa; 714 raddr.rfaa = rtte.fc1.rfaa;
706 goto absolute_address; 715 goto absolute_address;
707 } 716 }
@@ -729,6 +738,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
729 return PGM_TRANSLATION_SPEC; 738 return PGM_TRANSLATION_SPEC;
730 if (ste.fc && edat1) { 739 if (ste.fc && edat1) {
731 dat_protection |= ste.fc1.p; 740 dat_protection |= ste.fc1.p;
741 iep_protection = ste.fc1.iep;
732 raddr.sfaa = ste.fc1.sfaa; 742 raddr.sfaa = ste.fc1.sfaa;
733 goto absolute_address; 743 goto absolute_address;
734 } 744 }
@@ -745,12 +755,19 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
745 if (pte.z) 755 if (pte.z)
746 return PGM_TRANSLATION_SPEC; 756 return PGM_TRANSLATION_SPEC;
747 dat_protection |= pte.p; 757 dat_protection |= pte.p;
758 iep_protection = pte.iep;
748 raddr.pfra = pte.pfra; 759 raddr.pfra = pte.pfra;
749real_address: 760real_address:
750 raddr.addr = kvm_s390_real_to_abs(vcpu, raddr.addr); 761 raddr.addr = kvm_s390_real_to_abs(vcpu, raddr.addr);
751absolute_address: 762absolute_address:
752 if (mode == GACC_STORE && dat_protection) 763 if (mode == GACC_STORE && dat_protection) {
764 *prot = PROT_TYPE_DAT;
753 return PGM_PROTECTION; 765 return PGM_PROTECTION;
766 }
767 if (mode == GACC_IFETCH && iep_protection && iep) {
768 *prot = PROT_TYPE_IEP;
769 return PGM_PROTECTION;
770 }
754 if (kvm_is_error_gpa(vcpu->kvm, raddr.addr)) 771 if (kvm_is_error_gpa(vcpu->kvm, raddr.addr))
755 return PGM_ADDRESSING; 772 return PGM_ADDRESSING;
756 *gpa = raddr.addr; 773 *gpa = raddr.addr;
@@ -782,6 +799,7 @@ static int guest_page_range(struct kvm_vcpu *vcpu, unsigned long ga, u8 ar,
782{ 799{
783 psw_t *psw = &vcpu->arch.sie_block->gpsw; 800 psw_t *psw = &vcpu->arch.sie_block->gpsw;
784 int lap_enabled, rc = 0; 801 int lap_enabled, rc = 0;
802 enum prot_type prot;
785 803
786 lap_enabled = low_address_protection_enabled(vcpu, asce); 804 lap_enabled = low_address_protection_enabled(vcpu, asce);
787 while (nr_pages) { 805 while (nr_pages) {
@@ -791,7 +809,7 @@ static int guest_page_range(struct kvm_vcpu *vcpu, unsigned long ga, u8 ar,
791 PROT_TYPE_LA); 809 PROT_TYPE_LA);
792 ga &= PAGE_MASK; 810 ga &= PAGE_MASK;
793 if (psw_bits(*psw).dat) { 811 if (psw_bits(*psw).dat) {
794 rc = guest_translate(vcpu, ga, pages, asce, mode); 812 rc = guest_translate(vcpu, ga, pages, asce, mode, &prot);
795 if (rc < 0) 813 if (rc < 0)
796 return rc; 814 return rc;
797 } else { 815 } else {
@@ -800,7 +818,7 @@ static int guest_page_range(struct kvm_vcpu *vcpu, unsigned long ga, u8 ar,
800 rc = PGM_ADDRESSING; 818 rc = PGM_ADDRESSING;
801 } 819 }
802 if (rc) 820 if (rc)
803 return trans_exc(vcpu, rc, ga, ar, mode, PROT_TYPE_DAT); 821 return trans_exc(vcpu, rc, ga, ar, mode, prot);
804 ga += PAGE_SIZE; 822 ga += PAGE_SIZE;
805 pages++; 823 pages++;
806 nr_pages--; 824 nr_pages--;
@@ -886,6 +904,7 @@ int guest_translate_address(struct kvm_vcpu *vcpu, unsigned long gva, u8 ar,
886 unsigned long *gpa, enum gacc_mode mode) 904 unsigned long *gpa, enum gacc_mode mode)
887{ 905{
888 psw_t *psw = &vcpu->arch.sie_block->gpsw; 906 psw_t *psw = &vcpu->arch.sie_block->gpsw;
907 enum prot_type prot;
889 union asce asce; 908 union asce asce;
890 int rc; 909 int rc;
891 910
@@ -900,9 +919,9 @@ int guest_translate_address(struct kvm_vcpu *vcpu, unsigned long gva, u8 ar,
900 } 919 }
901 920
902 if (psw_bits(*psw).dat && !asce.r) { /* Use DAT? */ 921 if (psw_bits(*psw).dat && !asce.r) { /* Use DAT? */
903 rc = guest_translate(vcpu, gva, gpa, asce, mode); 922 rc = guest_translate(vcpu, gva, gpa, asce, mode, &prot);
904 if (rc > 0) 923 if (rc > 0)
905 return trans_exc(vcpu, rc, gva, 0, mode, PROT_TYPE_DAT); 924 return trans_exc(vcpu, rc, gva, 0, mode, prot);
906 } else { 925 } else {
907 *gpa = kvm_s390_real_to_abs(vcpu, gva); 926 *gpa = kvm_s390_real_to_abs(vcpu, gva);
908 if (kvm_is_error_gpa(vcpu->kvm, *gpa)) 927 if (kvm_is_error_gpa(vcpu->kvm, *gpa))
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 2d120fef7d90..a619ddae610d 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -251,8 +251,13 @@ static unsigned long deliverable_irqs(struct kvm_vcpu *vcpu)
251 __clear_bit(IRQ_PEND_EXT_SERVICE, &active_mask); 251 __clear_bit(IRQ_PEND_EXT_SERVICE, &active_mask);
252 if (psw_mchk_disabled(vcpu)) 252 if (psw_mchk_disabled(vcpu))
253 active_mask &= ~IRQ_PEND_MCHK_MASK; 253 active_mask &= ~IRQ_PEND_MCHK_MASK;
254 /*
255 * Check both floating and local interrupt's cr14 because
256 * bit IRQ_PEND_MCHK_REP could be set in both cases.
257 */
254 if (!(vcpu->arch.sie_block->gcr[14] & 258 if (!(vcpu->arch.sie_block->gcr[14] &
255 vcpu->kvm->arch.float_int.mchk.cr14)) 259 (vcpu->kvm->arch.float_int.mchk.cr14 |
260 vcpu->arch.local_int.irq.mchk.cr14)))
256 __clear_bit(IRQ_PEND_MCHK_REP, &active_mask); 261 __clear_bit(IRQ_PEND_MCHK_REP, &active_mask);
257 262
258 /* 263 /*
@@ -1876,6 +1881,28 @@ out:
1876 return ret < 0 ? ret : n; 1881 return ret < 0 ? ret : n;
1877} 1882}
1878 1883
1884static int flic_ais_mode_get_all(struct kvm *kvm, struct kvm_device_attr *attr)
1885{
1886 struct kvm_s390_float_interrupt *fi = &kvm->arch.float_int;
1887 struct kvm_s390_ais_all ais;
1888
1889 if (attr->attr < sizeof(ais))
1890 return -EINVAL;
1891
1892 if (!test_kvm_facility(kvm, 72))
1893 return -ENOTSUPP;
1894
1895 mutex_lock(&fi->ais_lock);
1896 ais.simm = fi->simm;
1897 ais.nimm = fi->nimm;
1898 mutex_unlock(&fi->ais_lock);
1899
1900 if (copy_to_user((void __user *)attr->addr, &ais, sizeof(ais)))
1901 return -EFAULT;
1902
1903 return 0;
1904}
1905
1879static int flic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr) 1906static int flic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
1880{ 1907{
1881 int r; 1908 int r;
@@ -1885,6 +1912,9 @@ static int flic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
1885 r = get_all_floating_irqs(dev->kvm, (u8 __user *) attr->addr, 1912 r = get_all_floating_irqs(dev->kvm, (u8 __user *) attr->addr,
1886 attr->attr); 1913 attr->attr);
1887 break; 1914 break;
1915 case KVM_DEV_FLIC_AISM_ALL:
1916 r = flic_ais_mode_get_all(dev->kvm, attr);
1917 break;
1888 default: 1918 default:
1889 r = -EINVAL; 1919 r = -EINVAL;
1890 } 1920 }
@@ -2235,6 +2265,25 @@ static int flic_inject_airq(struct kvm *kvm, struct kvm_device_attr *attr)
2235 return kvm_s390_inject_airq(kvm, adapter); 2265 return kvm_s390_inject_airq(kvm, adapter);
2236} 2266}
2237 2267
2268static int flic_ais_mode_set_all(struct kvm *kvm, struct kvm_device_attr *attr)
2269{
2270 struct kvm_s390_float_interrupt *fi = &kvm->arch.float_int;
2271 struct kvm_s390_ais_all ais;
2272
2273 if (!test_kvm_facility(kvm, 72))
2274 return -ENOTSUPP;
2275
2276 if (copy_from_user(&ais, (void __user *)attr->addr, sizeof(ais)))
2277 return -EFAULT;
2278
2279 mutex_lock(&fi->ais_lock);
2280 fi->simm = ais.simm;
2281 fi->nimm = ais.nimm;
2282 mutex_unlock(&fi->ais_lock);
2283
2284 return 0;
2285}
2286
2238static int flic_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) 2287static int flic_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
2239{ 2288{
2240 int r = 0; 2289 int r = 0;
@@ -2277,6 +2326,9 @@ static int flic_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
2277 case KVM_DEV_FLIC_AIRQ_INJECT: 2326 case KVM_DEV_FLIC_AIRQ_INJECT:
2278 r = flic_inject_airq(dev->kvm, attr); 2327 r = flic_inject_airq(dev->kvm, attr);
2279 break; 2328 break;
2329 case KVM_DEV_FLIC_AISM_ALL:
2330 r = flic_ais_mode_set_all(dev->kvm, attr);
2331 break;
2280 default: 2332 default:
2281 r = -EINVAL; 2333 r = -EINVAL;
2282 } 2334 }
@@ -2298,6 +2350,7 @@ static int flic_has_attr(struct kvm_device *dev,
2298 case KVM_DEV_FLIC_CLEAR_IO_IRQ: 2350 case KVM_DEV_FLIC_CLEAR_IO_IRQ:
2299 case KVM_DEV_FLIC_AISM: 2351 case KVM_DEV_FLIC_AISM:
2300 case KVM_DEV_FLIC_AIRQ_INJECT: 2352 case KVM_DEV_FLIC_AIRQ_INJECT:
2353 case KVM_DEV_FLIC_AISM_ALL:
2301 return 0; 2354 return 0;
2302 } 2355 }
2303 return -ENXIO; 2356 return -ENXIO;
@@ -2415,6 +2468,42 @@ static int set_adapter_int(struct kvm_kernel_irq_routing_entry *e,
2415 return ret; 2468 return ret;
2416} 2469}
2417 2470
2471/*
2472 * Inject the machine check to the guest.
2473 */
2474void kvm_s390_reinject_machine_check(struct kvm_vcpu *vcpu,
2475 struct mcck_volatile_info *mcck_info)
2476{
2477 struct kvm_s390_interrupt_info inti;
2478 struct kvm_s390_irq irq;
2479 struct kvm_s390_mchk_info *mchk;
2480 union mci mci;
2481 __u64 cr14 = 0; /* upper bits are not used */
2482
2483 mci.val = mcck_info->mcic;
2484 if (mci.sr)
2485 cr14 |= MCCK_CR14_RECOVERY_SUB_MASK;
2486 if (mci.dg)
2487 cr14 |= MCCK_CR14_DEGRAD_SUB_MASK;
2488 if (mci.w)
2489 cr14 |= MCCK_CR14_WARN_SUB_MASK;
2490
2491 mchk = mci.ck ? &inti.mchk : &irq.u.mchk;
2492 mchk->cr14 = cr14;
2493 mchk->mcic = mcck_info->mcic;
2494 mchk->ext_damage_code = mcck_info->ext_damage_code;
2495 mchk->failing_storage_address = mcck_info->failing_storage_address;
2496 if (mci.ck) {
2497 /* Inject the floating machine check */
2498 inti.type = KVM_S390_MCHK;
2499 WARN_ON_ONCE(__inject_vm(vcpu->kvm, &inti));
2500 } else {
2501 /* Inject the machine check to specified vcpu */
2502 irq.type = KVM_S390_MCHK;
2503 WARN_ON_ONCE(kvm_s390_inject_vcpu(vcpu, &irq));
2504 }
2505}
2506
2418int kvm_set_routing_entry(struct kvm *kvm, 2507int kvm_set_routing_entry(struct kvm *kvm,
2419 struct kvm_kernel_irq_routing_entry *e, 2508 struct kvm_kernel_irq_routing_entry *e,
2420 const struct kvm_irq_routing_entry *ue) 2509 const struct kvm_irq_routing_entry *ue)
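KVM_DEV_FLIC_AISM_ALL above exposes the whole adapter-interruption-suppression state (the simm and nimm masks) as one blob, which is what a migration needs to save and restore. A hedged sketch against the FLIC device fd (flic_fd assumed to come from KVM_CREATE_DEVICE; error handling elided); note that on the get side attr.attr carries the buffer size, matching the check in flic_ais_mode_get_all():

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Sketch only: save/restore the AIS masks in one call each. */
static void save_ais_state(int flic_fd, struct kvm_s390_ais_all *ais)
{
	struct kvm_device_attr attr = {
		.group = KVM_DEV_FLIC_AISM_ALL,
		.attr  = sizeof(*ais),
		.addr  = (uint64_t)(uintptr_t)ais,
	};

	ioctl(flic_fd, KVM_GET_DEVICE_ATTR, &attr);
}

static void restore_ais_state(int flic_fd, struct kvm_s390_ais_all *ais)
{
	struct kvm_device_attr attr = {
		.group = KVM_DEV_FLIC_AISM_ALL,
		.addr  = (uint64_t)(uintptr_t)ais,
	};

	ioctl(flic_fd, KVM_SET_DEVICE_ATTR, &attr);
}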
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index b0d7de5a533d..3f2884e99ed4 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -30,6 +30,7 @@
30#include <linux/vmalloc.h> 30#include <linux/vmalloc.h>
31#include <linux/bitmap.h> 31#include <linux/bitmap.h>
32#include <linux/sched/signal.h> 32#include <linux/sched/signal.h>
33#include <linux/string.h>
33 34
34#include <asm/asm-offsets.h> 35#include <asm/asm-offsets.h>
35#include <asm/lowcore.h> 36#include <asm/lowcore.h>
@@ -386,6 +387,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
386 case KVM_CAP_S390_SKEYS: 387 case KVM_CAP_S390_SKEYS:
387 case KVM_CAP_S390_IRQ_STATE: 388 case KVM_CAP_S390_IRQ_STATE:
388 case KVM_CAP_S390_USER_INSTR0: 389 case KVM_CAP_S390_USER_INSTR0:
390 case KVM_CAP_S390_CMMA_MIGRATION:
389 case KVM_CAP_S390_AIS: 391 case KVM_CAP_S390_AIS:
390 r = 1; 392 r = 1;
391 break; 393 break;
@@ -749,6 +751,129 @@ static int kvm_s390_vm_set_crypto(struct kvm *kvm, struct kvm_device_attr *attr)
749 return 0; 751 return 0;
750} 752}
751 753
754static void kvm_s390_sync_request_broadcast(struct kvm *kvm, int req)
755{
756 int cx;
757 struct kvm_vcpu *vcpu;
758
759 kvm_for_each_vcpu(cx, vcpu, kvm)
760 kvm_s390_sync_request(req, vcpu);
761}
762
763/*
764 * Must be called with kvm->srcu held to avoid races on memslots, and with
765 * kvm->lock to avoid races with ourselves and kvm_s390_vm_stop_migration.
766 */
767static int kvm_s390_vm_start_migration(struct kvm *kvm)
768{
769 struct kvm_s390_migration_state *mgs;
770 struct kvm_memory_slot *ms;
771 /* should be the only one */
772 struct kvm_memslots *slots;
773 unsigned long ram_pages;
774 int slotnr;
775
776 /* migration mode already enabled */
777 if (kvm->arch.migration_state)
778 return 0;
779
780 slots = kvm_memslots(kvm);
781 if (!slots || !slots->used_slots)
782 return -EINVAL;
783
784 mgs = kzalloc(sizeof(*mgs), GFP_KERNEL);
785 if (!mgs)
786 return -ENOMEM;
787 kvm->arch.migration_state = mgs;
788
789 if (kvm->arch.use_cmma) {
790 /*
791 * Get the last slot. They should be sorted by base_gfn, so the
792 * last slot is also the one at the end of the address space.
793 * We have verified above that at least one slot is present.
794 */
795 ms = slots->memslots + slots->used_slots - 1;
796 /* round up so we only use full longs */
797 ram_pages = roundup(ms->base_gfn + ms->npages, BITS_PER_LONG);
798 /* allocate enough bytes to store all the bits */
799 mgs->pgste_bitmap = vmalloc(ram_pages / 8);
800 if (!mgs->pgste_bitmap) {
801 kfree(mgs);
802 kvm->arch.migration_state = NULL;
803 return -ENOMEM;
804 }
805
806 mgs->bitmap_size = ram_pages;
807 atomic64_set(&mgs->dirty_pages, ram_pages);
808 /* mark all the pages in active slots as dirty */
809 for (slotnr = 0; slotnr < slots->used_slots; slotnr++) {
810 ms = slots->memslots + slotnr;
811 bitmap_set(mgs->pgste_bitmap, ms->base_gfn, ms->npages);
812 }
813
814 kvm_s390_sync_request_broadcast(kvm, KVM_REQ_START_MIGRATION);
815 }
816 return 0;
817}
818
819/*
820 * Must be called with kvm->lock to avoid races with ourselves and
821 * kvm_s390_vm_start_migration.
822 */
823static int kvm_s390_vm_stop_migration(struct kvm *kvm)
824{
825 struct kvm_s390_migration_state *mgs;
826
827 /* migration mode already disabled */
828 if (!kvm->arch.migration_state)
829 return 0;
830 mgs = kvm->arch.migration_state;
831 kvm->arch.migration_state = NULL;
832
833 if (kvm->arch.use_cmma) {
834 kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
835 vfree(mgs->pgste_bitmap);
836 }
837 kfree(mgs);
838 return 0;
839}
840
841static int kvm_s390_vm_set_migration(struct kvm *kvm,
842 struct kvm_device_attr *attr)
843{
844 int idx, res = -ENXIO;
845
846 mutex_lock(&kvm->lock);
847 switch (attr->attr) {
848 case KVM_S390_VM_MIGRATION_START:
849 idx = srcu_read_lock(&kvm->srcu);
850 res = kvm_s390_vm_start_migration(kvm);
851 srcu_read_unlock(&kvm->srcu, idx);
852 break;
853 case KVM_S390_VM_MIGRATION_STOP:
854 res = kvm_s390_vm_stop_migration(kvm);
855 break;
856 default:
857 break;
858 }
859 mutex_unlock(&kvm->lock);
860
861 return res;
862}
863
864static int kvm_s390_vm_get_migration(struct kvm *kvm,
865 struct kvm_device_attr *attr)
866{
867 u64 mig = (kvm->arch.migration_state != NULL);
868
869 if (attr->attr != KVM_S390_VM_MIGRATION_STATUS)
870 return -ENXIO;
871
872 if (copy_to_user((void __user *)attr->addr, &mig, sizeof(mig)))
873 return -EFAULT;
874 return 0;
875}
876
752static int kvm_s390_set_tod_high(struct kvm *kvm, struct kvm_device_attr *attr) 877static int kvm_s390_set_tod_high(struct kvm *kvm, struct kvm_device_attr *attr)
753{ 878{
754 u8 gtod_high; 879 u8 gtod_high;
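For a sense of scale, the dirty bitmap allocated in kvm_s390_vm_start_migration() above costs one bit per 4 KiB guest page, rounded up to a whole number of longs. A quick worked example under those assumptions: a guest whose last memslot ends at 16 GiB has base_gfn + npages = 16 GiB / 4 KiB = 4194304 pages, already a multiple of BITS_PER_LONG, so ram_pages = 4194304 bits and the vmalloc(ram_pages / 8) allocation is 524288 bytes, i.e. 512 KiB.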
@@ -1089,6 +1214,9 @@ static int kvm_s390_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
1089 case KVM_S390_VM_CRYPTO: 1214 case KVM_S390_VM_CRYPTO:
1090 ret = kvm_s390_vm_set_crypto(kvm, attr); 1215 ret = kvm_s390_vm_set_crypto(kvm, attr);
1091 break; 1216 break;
1217 case KVM_S390_VM_MIGRATION:
1218 ret = kvm_s390_vm_set_migration(kvm, attr);
1219 break;
1092 default: 1220 default:
1093 ret = -ENXIO; 1221 ret = -ENXIO;
1094 break; 1222 break;
@@ -1111,6 +1239,9 @@ static int kvm_s390_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
1111 case KVM_S390_VM_CPU_MODEL: 1239 case KVM_S390_VM_CPU_MODEL:
1112 ret = kvm_s390_get_cpu_model(kvm, attr); 1240 ret = kvm_s390_get_cpu_model(kvm, attr);
1113 break; 1241 break;
1242 case KVM_S390_VM_MIGRATION:
1243 ret = kvm_s390_vm_get_migration(kvm, attr);
1244 break;
1114 default: 1245 default:
1115 ret = -ENXIO; 1246 ret = -ENXIO;
1116 break; 1247 break;
@@ -1178,6 +1309,9 @@ static int kvm_s390_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
1178 break; 1309 break;
1179 } 1310 }
1180 break; 1311 break;
1312 case KVM_S390_VM_MIGRATION:
1313 ret = 0;
1314 break;
1181 default: 1315 default:
1182 ret = -ENXIO; 1316 ret = -ENXIO;
1183 break; 1317 break;
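
Taken together, the three attr hooks above expose migration mode through the KVM_S390_VM_MIGRATION attribute group on the VM file descriptor. A hedged sketch of how userspace might toggle it, assuming the usual kvm_device_attr layout and that <linux/kvm.h> pulls in the s390 attribute constants on the target system; error handling is kept minimal:

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	static int s390_set_migration_mode(int vm_fd, int on)
	{
		struct kvm_device_attr attr = {
			.group = KVM_S390_VM_MIGRATION,
			.attr  = on ? KVM_S390_VM_MIGRATION_START
				    : KVM_S390_VM_MIGRATION_STOP,
		};

		/* probe first: kernels without the group report -ENXIO here */
		if (ioctl(vm_fd, KVM_HAS_DEVICE_ATTR, &attr))
			return -1;
		return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
	}
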
@@ -1285,6 +1419,182 @@ out:
1285 return r; 1419 return r;
1286} 1420}
1287 1421
1422/*
1423 * Base address and length must be sent at the start of each block, therefore
1424 * it's cheaper to send some clean data, as long as it's less than the size of
1425 * two longs.
1426 */
1427#define KVM_S390_MAX_BIT_DISTANCE (2 * sizeof(void *))
1428/* for consistency */
1429#define KVM_S390_CMMA_SIZE_MAX ((u32)KVM_S390_SKEYS_MAX)
1430
1431/*
1432 * This function searches for the next page with dirty CMMA attributes, and
1433 * saves the attributes in the buffer up to either the end of the buffer or
1434 * until a block of at least KVM_S390_MAX_BIT_DISTANCE clean bits is found;
1435 * no trailing clean bytes are saved.
1436 * In case no dirty bits were found, or if CMMA was not enabled or used, the
1437 * output buffer will indicate 0 as length.
1438 */
1439static int kvm_s390_get_cmma_bits(struct kvm *kvm,
1440 struct kvm_s390_cmma_log *args)
1441{
1442 struct kvm_s390_migration_state *s = kvm->arch.migration_state;
1443 unsigned long bufsize, hva, pgstev, i, next, cur;
1444 int srcu_idx, peek, r = 0, rr;
1445 u8 *res;
1446
1447 cur = args->start_gfn;
1448 i = next = pgstev = 0;
1449
1450 if (unlikely(!kvm->arch.use_cmma))
1451 return -ENXIO;
1452 /* Invalid/unsupported flags were specified */
1453 if (args->flags & ~KVM_S390_CMMA_PEEK)
1454 return -EINVAL;
1455 /* Migration mode query, and we are not doing a migration */
1456 peek = !!(args->flags & KVM_S390_CMMA_PEEK);
1457 if (!peek && !s)
1458 return -EINVAL;
1459 /* CMMA is disabled or was not used, or the buffer has length zero */
1460 bufsize = min(args->count, KVM_S390_CMMA_SIZE_MAX);
1461 if (!bufsize || !kvm->mm->context.use_cmma) {
1462 memset(args, 0, sizeof(*args));
1463 return 0;
1464 }
1465
1466 if (!peek) {
1467 /* We are not peeking, and there are no dirty pages */
1468 if (!atomic64_read(&s->dirty_pages)) {
1469 memset(args, 0, sizeof(*args));
1470 return 0;
1471 }
1472 cur = find_next_bit(s->pgste_bitmap, s->bitmap_size,
1473 args->start_gfn);
1474 if (cur >= s->bitmap_size) /* nothing found, loop back */
1475 cur = find_next_bit(s->pgste_bitmap, s->bitmap_size, 0);
1476 if (cur >= s->bitmap_size) { /* again! (very unlikely) */
1477 memset(args, 0, sizeof(*args));
1478 return 0;
1479 }
1480 next = find_next_bit(s->pgste_bitmap, s->bitmap_size, cur + 1);
1481 }
1482
1483 res = vmalloc(bufsize);
1484 if (!res)
1485 return -ENOMEM;
1486
1487 args->start_gfn = cur;
1488
1489 down_read(&kvm->mm->mmap_sem);
1490 srcu_idx = srcu_read_lock(&kvm->srcu);
1491 while (i < bufsize) {
1492 hva = gfn_to_hva(kvm, cur);
1493 if (kvm_is_error_hva(hva)) {
1494 r = -EFAULT;
1495 break;
1496 }
1497 /* decrement only if we actually flipped the bit to 0 */
1498 if (!peek && test_and_clear_bit(cur, s->pgste_bitmap))
1499 atomic64_dec(&s->dirty_pages);
1500 r = get_pgste(kvm->mm, hva, &pgstev);
1501 if (r < 0)
1502 pgstev = 0;
1503 /* save the value */
1504 res[i++] = (pgstev >> 24) & 0x3;
1505 /*
1506 * if the next bit is too far away, stop.
1507 * if we reached the previous "next", find the next one
1508 */
1509 if (!peek) {
1510 if (next > cur + KVM_S390_MAX_BIT_DISTANCE)
1511 break;
1512 if (cur == next)
1513 next = find_next_bit(s->pgste_bitmap,
1514 s->bitmap_size, cur + 1);
1515 /* reached the end of the bitmap or of the buffer, stop */
1516 if ((next >= s->bitmap_size) ||
1517 (next >= args->start_gfn + bufsize))
1518 break;
1519 }
1520 cur++;
1521 }
1522 srcu_read_unlock(&kvm->srcu, srcu_idx);
1523 up_read(&kvm->mm->mmap_sem);
1524 args->count = i;
1525 args->remaining = s ? atomic64_read(&s->dirty_pages) : 0;
1526
1527 rr = copy_to_user((void __user *)args->values, res, args->count);
1528 if (rr)
1529 r = -EFAULT;
1530
1531 vfree(res);
1532 return r;
1533}
1534
1535/*
1536 * This function sets the CMMA attributes for the given pages. If the input
1537 * buffer has zero length, no action is taken, otherwise the attributes are
1538 * set and the mm->context.use_cmma flag is set.
1539 */
1540static int kvm_s390_set_cmma_bits(struct kvm *kvm,
1541 const struct kvm_s390_cmma_log *args)
1542{
1543 unsigned long hva, mask, pgstev, i;
1544 uint8_t *bits;
1545 int srcu_idx, r = 0;
1546
1547 mask = args->mask;
1548
1549 if (!kvm->arch.use_cmma)
1550 return -ENXIO;
1551 /* invalid/unsupported flags */
1552 if (args->flags != 0)
1553 return -EINVAL;
1554 /* Enforce sane limit on memory allocation */
1555 if (args->count > KVM_S390_CMMA_SIZE_MAX)
1556 return -EINVAL;
1557 /* Nothing to do */
1558 if (args->count == 0)
1559 return 0;
1560
1561 bits = vmalloc(sizeof(*bits) * args->count);
1562 if (!bits)
1563 return -ENOMEM;
1564
1565 r = copy_from_user(bits, (void __user *)args->values, args->count);
1566 if (r) {
1567 r = -EFAULT;
1568 goto out;
1569 }
1570
1571 down_read(&kvm->mm->mmap_sem);
1572 srcu_idx = srcu_read_lock(&kvm->srcu);
1573 for (i = 0; i < args->count; i++) {
1574 hva = gfn_to_hva(kvm, args->start_gfn + i);
1575 if (kvm_is_error_hva(hva)) {
1576 r = -EFAULT;
1577 break;
1578 }
1579
1580 pgstev = bits[i];
1581 pgstev = pgstev << 24;
1582 mask &= _PGSTE_GPS_USAGE_MASK;
1583 set_pgste_bits(kvm->mm, hva, mask, pgstev);
1584 }
1585 srcu_read_unlock(&kvm->srcu, srcu_idx);
1586 up_read(&kvm->mm->mmap_sem);
1587
1588 if (!kvm->mm->context.use_cmma) {
1589 down_write(&kvm->mm->mmap_sem);
1590 kvm->mm->context.use_cmma = 1;
1591 up_write(&kvm->mm->mmap_sem);
1592 }
1593out:
1594 vfree(bits);
1595 return r;
1596}
1597
1288long kvm_arch_vm_ioctl(struct file *filp, 1598long kvm_arch_vm_ioctl(struct file *filp,
1289 unsigned int ioctl, unsigned long arg) 1599 unsigned int ioctl, unsigned long arg)
1290{ 1600{
@@ -1363,6 +1673,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
1363 r = kvm_s390_set_skeys(kvm, &args); 1673 r = kvm_s390_set_skeys(kvm, &args);
1364 break; 1674 break;
1365 } 1675 }
1676 case KVM_S390_GET_CMMA_BITS: {
1677 struct kvm_s390_cmma_log args;
1678
1679 r = -EFAULT;
1680 if (copy_from_user(&args, argp, sizeof(args)))
1681 break;
1682 r = kvm_s390_get_cmma_bits(kvm, &args);
1683 if (!r) {
1684 r = copy_to_user(argp, &args, sizeof(args));
1685 if (r)
1686 r = -EFAULT;
1687 }
1688 break;
1689 }
1690 case KVM_S390_SET_CMMA_BITS: {
1691 struct kvm_s390_cmma_log args;
1692
1693 r = -EFAULT;
1694 if (copy_from_user(&args, argp, sizeof(args)))
1695 break;
1696 r = kvm_s390_set_cmma_bits(kvm, &args);
1697 break;
1698 }
1366 default: 1699 default:
1367 r = -ENOTTY; 1700 r = -ENOTTY;
1368 } 1701 }
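
For reference, the two new VM ioctls take a struct kvm_s390_cmma_log from the s390 uapi header (start_gfn, count, flags, a remaining/mask union, and a values pointer). A hedged userspace sketch of reading one batch of CMMA values; the helper name, buffer handling, and the assumption that <linux/kvm.h> provides the s390 definitions are ours:

	#include <linux/kvm.h>
	#include <stdint.h>
	#include <sys/ioctl.h>

	static int read_cmma_batch(int vm_fd, uint64_t start_gfn,
				   uint8_t *buf, uint32_t buf_len)
	{
		struct kvm_s390_cmma_log log = {
			.start_gfn = start_gfn,
			.count     = buf_len,
			.flags     = 0,	/* or KVM_S390_CMMA_PEEK to read without clearing */
			.values    = (uint64_t)(unsigned long)buf,
		};

		if (ioctl(vm_fd, KVM_S390_GET_CMMA_BITS, &log) < 0)
			return -1;
		/*
		 * log.count values were stored, describing pages starting at
		 * log.start_gfn; in migration mode, log.remaining reports how
		 * many dirty entries are still pending.
		 */
		return (int)log.count;
	}
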
@@ -1631,6 +1964,10 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
1631 kvm_s390_destroy_adapters(kvm); 1964 kvm_s390_destroy_adapters(kvm);
1632 kvm_s390_clear_float_irqs(kvm); 1965 kvm_s390_clear_float_irqs(kvm);
1633 kvm_s390_vsie_destroy(kvm); 1966 kvm_s390_vsie_destroy(kvm);
1967 if (kvm->arch.migration_state) {
1968 vfree(kvm->arch.migration_state->pgste_bitmap);
1969 kfree(kvm->arch.migration_state);
1970 }
1634 KVM_EVENT(3, "vm 0x%pK destroyed", kvm); 1971 KVM_EVENT(3, "vm 0x%pK destroyed", kvm);
1635} 1972}
1636 1973
@@ -1975,7 +2312,6 @@ int kvm_s390_vcpu_setup_cmma(struct kvm_vcpu *vcpu)
1975 if (!vcpu->arch.sie_block->cbrlo) 2312 if (!vcpu->arch.sie_block->cbrlo)
1976 return -ENOMEM; 2313 return -ENOMEM;
1977 2314
1978 vcpu->arch.sie_block->ecb2 |= ECB2_CMMA;
1979 vcpu->arch.sie_block->ecb2 &= ~ECB2_PFMFI; 2315 vcpu->arch.sie_block->ecb2 &= ~ECB2_PFMFI;
1980 return 0; 2316 return 0;
1981} 2317}
@@ -2439,7 +2775,7 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
2439{ 2775{
2440retry: 2776retry:
2441 kvm_s390_vcpu_request_handled(vcpu); 2777 kvm_s390_vcpu_request_handled(vcpu);
2442 if (!vcpu->requests) 2778 if (!kvm_request_pending(vcpu))
2443 return 0; 2779 return 0;
2444 /* 2780 /*
2445 * We use MMU_RELOAD just to re-arm the ipte notifier for the 2781 * We use MMU_RELOAD just to re-arm the ipte notifier for the
@@ -2488,6 +2824,27 @@ retry:
2488 goto retry; 2824 goto retry;
2489 } 2825 }
2490 2826
2827 if (kvm_check_request(KVM_REQ_START_MIGRATION, vcpu)) {
2828 /*
2829 * Disable CMMA virtualization; we will emulate the ESSA
2830 * instruction manually, in order to provide additional
2831 * functionalities needed for live migration.
2832 */
2833 vcpu->arch.sie_block->ecb2 &= ~ECB2_CMMA;
2834 goto retry;
2835 }
2836
2837 if (kvm_check_request(KVM_REQ_STOP_MIGRATION, vcpu)) {
2838 /*
2839 * Re-enable CMMA virtualization if CMMA is available and
2840 * was used.
2841 */
2842 if ((vcpu->kvm->arch.use_cmma) &&
2843 (vcpu->kvm->mm->context.use_cmma))
2844 vcpu->arch.sie_block->ecb2 |= ECB2_CMMA;
2845 goto retry;
2846 }
2847
2491 /* nothing to do, just clear the request */ 2848 /* nothing to do, just clear the request */
2492 kvm_clear_request(KVM_REQ_UNHALT, vcpu); 2849 kvm_clear_request(KVM_REQ_UNHALT, vcpu);
2493 2850
@@ -2682,6 +3039,9 @@ static int vcpu_post_run_fault_in_sie(struct kvm_vcpu *vcpu)
2682 3039
2683static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason) 3040static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
2684{ 3041{
3042 struct mcck_volatile_info *mcck_info;
3043 struct sie_page *sie_page;
3044
2685 VCPU_EVENT(vcpu, 6, "exit sie icptcode %d", 3045 VCPU_EVENT(vcpu, 6, "exit sie icptcode %d",
2686 vcpu->arch.sie_block->icptcode); 3046 vcpu->arch.sie_block->icptcode);
2687 trace_kvm_s390_sie_exit(vcpu, vcpu->arch.sie_block->icptcode); 3047 trace_kvm_s390_sie_exit(vcpu, vcpu->arch.sie_block->icptcode);
@@ -2692,6 +3052,15 @@ static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
2692 vcpu->run->s.regs.gprs[14] = vcpu->arch.sie_block->gg14; 3052 vcpu->run->s.regs.gprs[14] = vcpu->arch.sie_block->gg14;
2693 vcpu->run->s.regs.gprs[15] = vcpu->arch.sie_block->gg15; 3053 vcpu->run->s.regs.gprs[15] = vcpu->arch.sie_block->gg15;
2694 3054
3055 if (exit_reason == -EINTR) {
3056 VCPU_EVENT(vcpu, 3, "%s", "machine check");
3057 sie_page = container_of(vcpu->arch.sie_block,
3058 struct sie_page, sie_block);
3059 mcck_info = &sie_page->mcck_info;
3060 kvm_s390_reinject_machine_check(vcpu, mcck_info);
3061 return 0;
3062 }
3063
2695 if (vcpu->arch.sie_block->icptcode > 0) { 3064 if (vcpu->arch.sie_block->icptcode > 0) {
2696 int rc = kvm_handle_sie_intercept(vcpu); 3065 int rc = kvm_handle_sie_intercept(vcpu);
2697 3066
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 55f5c8457d6d..6fedc8bc7a37 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -397,4 +397,6 @@ static inline int kvm_s390_use_sca_entries(void)
397 */ 397 */
398 return sclp.has_sigpif; 398 return sclp.has_sigpif;
399} 399}
400void kvm_s390_reinject_machine_check(struct kvm_vcpu *vcpu,
401 struct mcck_volatile_info *mcck_info);
400#endif 402#endif
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index e53292a89257..8a1dac793d6b 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -24,6 +24,7 @@
24#include <asm/ebcdic.h> 24#include <asm/ebcdic.h>
25#include <asm/sysinfo.h> 25#include <asm/sysinfo.h>
26#include <asm/pgtable.h> 26#include <asm/pgtable.h>
27#include <asm/page-states.h>
27#include <asm/pgalloc.h> 28#include <asm/pgalloc.h>
28#include <asm/gmap.h> 29#include <asm/gmap.h>
29#include <asm/io.h> 30#include <asm/io.h>
@@ -949,13 +950,72 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
949 return 0; 950 return 0;
950} 951}
951 952
953static inline int do_essa(struct kvm_vcpu *vcpu, const int orc)
954{
955 struct kvm_s390_migration_state *ms = vcpu->kvm->arch.migration_state;
956 int r1, r2, nappended, entries;
957 unsigned long gfn, hva, res, pgstev, ptev;
958 unsigned long *cbrlo;
959
960 /*
961 * We don't need to set SD.FPF.SK to 1 here, because if we have a
962 * machine check here we either handle it or crash
963 */
964
965 kvm_s390_get_regs_rre(vcpu, &r1, &r2);
966 gfn = vcpu->run->s.regs.gprs[r2] >> PAGE_SHIFT;
967 hva = gfn_to_hva(vcpu->kvm, gfn);
968 entries = (vcpu->arch.sie_block->cbrlo & ~PAGE_MASK) >> 3;
969
970 if (kvm_is_error_hva(hva))
971 return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
972
973 nappended = pgste_perform_essa(vcpu->kvm->mm, hva, orc, &ptev, &pgstev);
974 if (nappended < 0) {
975 res = orc ? 0x10 : 0;
976 vcpu->run->s.regs.gprs[r1] = res; /* Exception Indication */
977 return 0;
978 }
979 res = (pgstev & _PGSTE_GPS_USAGE_MASK) >> 22;
980 /*
981 * Set the block-content state part of the result. 0 means resident, so
982 * nothing to do if the page is valid. 2 is for preserved pages
983 * (non-present and non-zero), and 3 for zero pages (non-present and
984 * zero).
985 */
986 if (ptev & _PAGE_INVALID) {
987 res |= 2;
988 if (pgstev & _PGSTE_GPS_ZERO)
989 res |= 1;
990 }
991 vcpu->run->s.regs.gprs[r1] = res;
992 /*
993 * It is possible that all the normal 511 slots were full, in which case
994 * we will now write in the 512th slot, which is reserved for host use.
995 * In both cases we let the normal essa handling code process all the
996 * slots, including the reserved one, if needed.
997 */
998 if (nappended > 0) {
999 cbrlo = phys_to_virt(vcpu->arch.sie_block->cbrlo & PAGE_MASK);
1000 cbrlo[entries] = gfn << PAGE_SHIFT;
1001 }
1002
1003 if (orc) {
1004 /* increment only if we are really flipping the bit to 1 */
1005 if (!test_and_set_bit(gfn, ms->pgste_bitmap))
1006 atomic64_inc(&ms->dirty_pages);
1007 }
1008
1009 return nappended;
1010}
1011
952static int handle_essa(struct kvm_vcpu *vcpu) 1012static int handle_essa(struct kvm_vcpu *vcpu)
953{ 1013{
954 /* entries expected to be 1FF */ 1014 /* entries expected to be 1FF */
955 int entries = (vcpu->arch.sie_block->cbrlo & ~PAGE_MASK) >> 3; 1015 int entries = (vcpu->arch.sie_block->cbrlo & ~PAGE_MASK) >> 3;
956 unsigned long *cbrlo; 1016 unsigned long *cbrlo;
957 struct gmap *gmap; 1017 struct gmap *gmap;
958 int i; 1018 int i, orc;
959 1019
960 VCPU_EVENT(vcpu, 4, "ESSA: release %d pages", entries); 1020 VCPU_EVENT(vcpu, 4, "ESSA: release %d pages", entries);
961 gmap = vcpu->arch.gmap; 1021 gmap = vcpu->arch.gmap;
@@ -965,12 +1025,45 @@ static int handle_essa(struct kvm_vcpu *vcpu)
965 1025
966 if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) 1026 if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
967 return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); 1027 return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
968 1028 /* Check for invalid operation request code */
969 if (((vcpu->arch.sie_block->ipb & 0xf0000000) >> 28) > 6) 1029 orc = (vcpu->arch.sie_block->ipb & 0xf0000000) >> 28;
1030 if (orc > ESSA_MAX)
970 return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION); 1031 return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
971 1032
972 /* Retry the ESSA instruction */ 1033 if (likely(!vcpu->kvm->arch.migration_state)) {
973 kvm_s390_retry_instr(vcpu); 1034 /*
1035 * CMMA is enabled in the KVM settings, but is disabled in
1036 * the SIE block and in the mm_context, and we are not doing
1037 * a migration. Enable CMMA in the mm_context.
1038 * Since we need to take a write lock to write to the context
1039 * to avoid races with storage keys handling, we check if the
1040 * value really needs to be written to; if the value is
1041 * already correct, we do nothing and avoid the lock.
1042 */
1043 if (vcpu->kvm->mm->context.use_cmma == 0) {
1044 down_write(&vcpu->kvm->mm->mmap_sem);
1045 vcpu->kvm->mm->context.use_cmma = 1;
1046 up_write(&vcpu->kvm->mm->mmap_sem);
1047 }
1048 /*
1049 * If we are here, we are supposed to have CMMA enabled in
1050 * the SIE block. Enabling CMMA works on a per-CPU basis,
1051 * while the context use_cmma flag is per process.
1052 * It's possible that the context flag is enabled and the
1053 * SIE flag is not, so we set the flag always; if it was
1054 * already set, nothing changes, otherwise we enable it
1055 * on this CPU too.
1056 */
1057 vcpu->arch.sie_block->ecb2 |= ECB2_CMMA;
1058 /* Retry the ESSA instruction */
1059 kvm_s390_retry_instr(vcpu);
1060 } else {
1061 /* Account for the possible extra cbrl entry */
1062 i = do_essa(vcpu, orc);
1063 if (i < 0)
1064 return i;
1065 entries += i;
1066 }
974 vcpu->arch.sie_block->cbrlo &= PAGE_MASK; /* reset nceo */ 1067 vcpu->arch.sie_block->cbrlo &= PAGE_MASK; /* reset nceo */
975 cbrlo = phys_to_virt(vcpu->arch.sie_block->cbrlo); 1068 cbrlo = phys_to_virt(vcpu->arch.sie_block->cbrlo);
976 down_read(&gmap->mm->mmap_sem); 1069 down_read(&gmap->mm->mmap_sem);
diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c
index 4719ecb9ab42..715c19c45d9a 100644
--- a/arch/s390/kvm/vsie.c
+++ b/arch/s390/kvm/vsie.c
@@ -26,16 +26,21 @@
26 26
27struct vsie_page { 27struct vsie_page {
28 struct kvm_s390_sie_block scb_s; /* 0x0000 */ 28 struct kvm_s390_sie_block scb_s; /* 0x0000 */
29 /*
30 * the backup info for machine check. ensure it's at
31 * the same offset as that in struct sie_page!
32 */
33 struct mcck_volatile_info mcck_info; /* 0x0200 */
29 /* the pinned original scb */ 34 /* the pinned original scb */
30 struct kvm_s390_sie_block *scb_o; /* 0x0200 */ 35 struct kvm_s390_sie_block *scb_o; /* 0x0218 */
31 /* the shadow gmap in use by the vsie_page */ 36 /* the shadow gmap in use by the vsie_page */
32 struct gmap *gmap; /* 0x0208 */ 37 struct gmap *gmap; /* 0x0220 */
33 /* address of the last reported fault to guest2 */ 38 /* address of the last reported fault to guest2 */
34 unsigned long fault_addr; /* 0x0210 */ 39 unsigned long fault_addr; /* 0x0228 */
35 __u8 reserved[0x0700 - 0x0218]; /* 0x0218 */ 40 __u8 reserved[0x0700 - 0x0230]; /* 0x0230 */
36 struct kvm_s390_crypto_cb crycb; /* 0x0700 */ 41 struct kvm_s390_crypto_cb crycb; /* 0x0700 */
37 __u8 fac[S390_ARCH_FAC_LIST_SIZE_BYTE]; /* 0x0800 */ 42 __u8 fac[S390_ARCH_FAC_LIST_SIZE_BYTE]; /* 0x0800 */
38} __packed; 43};
39 44
40/* trigger a validity icpt for the given scb */ 45/* trigger a validity icpt for the given scb */
41static int set_validity_icpt(struct kvm_s390_sie_block *scb, 46static int set_validity_icpt(struct kvm_s390_sie_block *scb,
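
The layout comment above is load-bearing: do_vsie_run() (a few hunks below) recovers mcck_info by container_of()-ing the shadow scb into a struct sie_page, so vsie_page and sie_page must keep the field at the same offset. A compile-time check along these lines could guard that assumption; this is a hedged suggestion, not something the patch adds:

	BUILD_BUG_ON(offsetof(struct vsie_page, mcck_info) !=
		     offsetof(struct sie_page, mcck_info));
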
@@ -801,6 +806,8 @@ static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
801{ 806{
802 struct kvm_s390_sie_block *scb_s = &vsie_page->scb_s; 807 struct kvm_s390_sie_block *scb_s = &vsie_page->scb_s;
803 struct kvm_s390_sie_block *scb_o = vsie_page->scb_o; 808 struct kvm_s390_sie_block *scb_o = vsie_page->scb_o;
809 struct mcck_volatile_info *mcck_info;
810 struct sie_page *sie_page;
804 int rc; 811 int rc;
805 812
806 handle_last_fault(vcpu, vsie_page); 813 handle_last_fault(vcpu, vsie_page);
@@ -822,6 +829,14 @@ static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
822 local_irq_enable(); 829 local_irq_enable();
823 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); 830 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
824 831
832 if (rc == -EINTR) {
833 VCPU_EVENT(vcpu, 3, "%s", "machine check");
834 sie_page = container_of(scb_s, struct sie_page, sie_block);
835 mcck_info = &sie_page->mcck_info;
836 kvm_s390_reinject_machine_check(vcpu, mcck_info);
837 return 0;
838 }
839
825 if (rc > 0) 840 if (rc > 0)
826 rc = 0; /* we could still have an icpt */ 841 rc = 0; /* we could still have an icpt */
827 else if (rc == -EFAULT) 842 else if (rc == -EFAULT)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 695605eb1dfb..1588e9e3dc01 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -48,28 +48,31 @@
48#define KVM_IRQCHIP_NUM_PINS KVM_IOAPIC_NUM_PINS 48#define KVM_IRQCHIP_NUM_PINS KVM_IOAPIC_NUM_PINS
49 49
50/* x86-specific vcpu->requests bit members */ 50/* x86-specific vcpu->requests bit members */
51#define KVM_REQ_MIGRATE_TIMER 8 51#define KVM_REQ_MIGRATE_TIMER KVM_ARCH_REQ(0)
52#define KVM_REQ_REPORT_TPR_ACCESS 9 52#define KVM_REQ_REPORT_TPR_ACCESS KVM_ARCH_REQ(1)
53#define KVM_REQ_TRIPLE_FAULT 10 53#define KVM_REQ_TRIPLE_FAULT KVM_ARCH_REQ(2)
54#define KVM_REQ_MMU_SYNC 11 54#define KVM_REQ_MMU_SYNC KVM_ARCH_REQ(3)
55#define KVM_REQ_CLOCK_UPDATE 12 55#define KVM_REQ_CLOCK_UPDATE KVM_ARCH_REQ(4)
56#define KVM_REQ_EVENT 14 56#define KVM_REQ_EVENT KVM_ARCH_REQ(6)
57#define KVM_REQ_APF_HALT 15 57#define KVM_REQ_APF_HALT KVM_ARCH_REQ(7)
58#define KVM_REQ_STEAL_UPDATE 16 58#define KVM_REQ_STEAL_UPDATE KVM_ARCH_REQ(8)
59#define KVM_REQ_NMI 17 59#define KVM_REQ_NMI KVM_ARCH_REQ(9)
60#define KVM_REQ_PMU 18 60#define KVM_REQ_PMU KVM_ARCH_REQ(10)
61#define KVM_REQ_PMI 19 61#define KVM_REQ_PMI KVM_ARCH_REQ(11)
62#define KVM_REQ_SMI 20 62#define KVM_REQ_SMI KVM_ARCH_REQ(12)
63#define KVM_REQ_MASTERCLOCK_UPDATE 21 63#define KVM_REQ_MASTERCLOCK_UPDATE KVM_ARCH_REQ(13)
64#define KVM_REQ_MCLOCK_INPROGRESS (22 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 64#define KVM_REQ_MCLOCK_INPROGRESS \
65#define KVM_REQ_SCAN_IOAPIC (23 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 65 KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
66#define KVM_REQ_GLOBAL_CLOCK_UPDATE 24 66#define KVM_REQ_SCAN_IOAPIC \
67#define KVM_REQ_APIC_PAGE_RELOAD (25 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 67 KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
68#define KVM_REQ_HV_CRASH 26 68#define KVM_REQ_GLOBAL_CLOCK_UPDATE KVM_ARCH_REQ(16)
69#define KVM_REQ_IOAPIC_EOI_EXIT 27 69#define KVM_REQ_APIC_PAGE_RELOAD \
70#define KVM_REQ_HV_RESET 28 70 KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
71#define KVM_REQ_HV_EXIT 29 71#define KVM_REQ_HV_CRASH KVM_ARCH_REQ(18)
72#define KVM_REQ_HV_STIMER 30 72#define KVM_REQ_IOAPIC_EOI_EXIT KVM_ARCH_REQ(19)
73#define KVM_REQ_HV_RESET KVM_ARCH_REQ(20)
74#define KVM_REQ_HV_EXIT KVM_ARCH_REQ(21)
75#define KVM_REQ_HV_STIMER KVM_ARCH_REQ(22)
73 76
74#define CR0_RESERVED_BITS \ 77#define CR0_RESERVED_BITS \
75 (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ 78 (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
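
The renumbering above leans on the generic helpers introduced by the VCPU-request overhaul. Paraphrased (the real macros in include/linux/kvm_host.h also BUILD_BUG_ON out-of-range numbers), they boil down to:

	#define KVM_REQUEST_ARCH_BASE		8
	#define KVM_ARCH_REQ_FLAGS(nr, flags)	(((nr) + KVM_REQUEST_ARCH_BASE) | (flags))
	#define KVM_ARCH_REQ(nr)		KVM_ARCH_REQ_FLAGS(nr, 0)

	/* so KVM_REQ_MIGRATE_TIMER == KVM_ARCH_REQ(0) still evaluates to bit 8,
	 * and KVM_REQ_EVENT == KVM_ARCH_REQ(6) to bit 14, matching the literal
	 * values on the old side of the hunk. */
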
@@ -254,7 +257,8 @@ union kvm_mmu_page_role {
254 unsigned cr0_wp:1; 257 unsigned cr0_wp:1;
255 unsigned smep_andnot_wp:1; 258 unsigned smep_andnot_wp:1;
256 unsigned smap_andnot_wp:1; 259 unsigned smap_andnot_wp:1;
257 unsigned :8; 260 unsigned ad_disabled:1;
261 unsigned :7;
258 262
259 /* 263 /*
260 * This is left at the top of the word so that 264 * This is left at the top of the word so that
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d406894cd9a2..5573c75f8e4c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -426,6 +426,8 @@
426#define MSR_IA32_TSC_ADJUST 0x0000003b 426#define MSR_IA32_TSC_ADJUST 0x0000003b
427#define MSR_IA32_BNDCFGS 0x00000d90 427#define MSR_IA32_BNDCFGS 0x00000d90
428 428
429#define MSR_IA32_BNDCFGS_RSVD 0x00000ffc
430
429#define MSR_IA32_XSS 0x00000da0 431#define MSR_IA32_XSS 0x00000da0
430 432
431#define FEATURE_CONTROL_LOCKED (1<<0) 433#define FEATURE_CONTROL_LOCKED (1<<0)
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a6fd40aade7c..da6728383052 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -144,6 +144,14 @@ static inline bool guest_cpuid_has_rtm(struct kvm_vcpu *vcpu)
144 return best && (best->ebx & bit(X86_FEATURE_RTM)); 144 return best && (best->ebx & bit(X86_FEATURE_RTM));
145} 145}
146 146
147static inline bool guest_cpuid_has_mpx(struct kvm_vcpu *vcpu)
148{
149 struct kvm_cpuid_entry2 *best;
150
151 best = kvm_find_cpuid_entry(vcpu, 7, 0);
152 return best && (best->ebx & bit(X86_FEATURE_MPX));
153}
154
147static inline bool guest_cpuid_has_rdtscp(struct kvm_vcpu *vcpu) 155static inline bool guest_cpuid_has_rdtscp(struct kvm_vcpu *vcpu)
148{ 156{
149 struct kvm_cpuid_entry2 *best; 157 struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 80890dee66ce..fb0055953fbc 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -900,7 +900,7 @@ static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
900 if (rc != X86EMUL_CONTINUE) \ 900 if (rc != X86EMUL_CONTINUE) \
901 goto done; \ 901 goto done; \
902 ctxt->_eip += sizeof(_type); \ 902 ctxt->_eip += sizeof(_type); \
903 _x = *(_type __aligned(1) *) ctxt->fetch.ptr; \ 903 memcpy(&_x, ctxt->fetch.ptr, sizeof(_type)); \
904 ctxt->fetch.ptr += sizeof(_type); \ 904 ctxt->fetch.ptr += sizeof(_type); \
905 _x; \ 905 _x; \
906}) 906})
@@ -3942,6 +3942,25 @@ static int check_fxsr(struct x86_emulate_ctxt *ctxt)
3942} 3942}
3943 3943
3944/* 3944/*
3945 * Hardware doesn't save and restore XMM 0-7 without CR4.OSFXSR, but does save
3946 * and restore MXCSR.
3947 */
3948static size_t __fxstate_size(int nregs)
3949{
3950 return offsetof(struct fxregs_state, xmm_space[0]) + nregs * 16;
3951}
3952
3953static inline size_t fxstate_size(struct x86_emulate_ctxt *ctxt)
3954{
3955 bool cr4_osfxsr;
3956 if (ctxt->mode == X86EMUL_MODE_PROT64)
3957 return __fxstate_size(16);
3958
3959 cr4_osfxsr = ctxt->ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR;
3960 return __fxstate_size(cr4_osfxsr ? 8 : 0);
3961}
3962
3963/*
3945 * FXSAVE and FXRSTOR have 4 different formats depending on execution mode, 3964 * FXSAVE and FXRSTOR have 4 different formats depending on execution mode,
3946 * 1) 16 bit mode 3965 * 1) 16 bit mode
3947 * 2) 32 bit mode 3966 * 2) 32 bit mode
@@ -3962,7 +3981,6 @@ static int check_fxsr(struct x86_emulate_ctxt *ctxt)
3962static int em_fxsave(struct x86_emulate_ctxt *ctxt) 3981static int em_fxsave(struct x86_emulate_ctxt *ctxt)
3963{ 3982{
3964 struct fxregs_state fx_state; 3983 struct fxregs_state fx_state;
3965 size_t size;
3966 int rc; 3984 int rc;
3967 3985
3968 rc = check_fxsr(ctxt); 3986 rc = check_fxsr(ctxt);
@@ -3978,68 +3996,42 @@ static int em_fxsave(struct x86_emulate_ctxt *ctxt)
3978 if (rc != X86EMUL_CONTINUE) 3996 if (rc != X86EMUL_CONTINUE)
3979 return rc; 3997 return rc;
3980 3998
3981 if (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR) 3999 return segmented_write_std(ctxt, ctxt->memop.addr.mem, &fx_state,
3982 size = offsetof(struct fxregs_state, xmm_space[8 * 16/4]); 4000 fxstate_size(ctxt));
3983 else
3984 size = offsetof(struct fxregs_state, xmm_space[0]);
3985
3986 return segmented_write_std(ctxt, ctxt->memop.addr.mem, &fx_state, size);
3987}
3988
3989static int fxrstor_fixup(struct x86_emulate_ctxt *ctxt,
3990 struct fxregs_state *new)
3991{
3992 int rc = X86EMUL_CONTINUE;
3993 struct fxregs_state old;
3994
3995 rc = asm_safe("fxsave %[fx]", , [fx] "+m"(old));
3996 if (rc != X86EMUL_CONTINUE)
3997 return rc;
3998
3999 /*
4000 * 64 bit host will restore XMM 8-15, which is not correct on non-64
4001 * bit guests. Load the current values in order to preserve 64 bit
4002 * XMMs after fxrstor.
4003 */
4004#ifdef CONFIG_X86_64
4005 /* XXX: accessing XMM 8-15 very awkwardly */
4006 memcpy(&new->xmm_space[8 * 16/4], &old.xmm_space[8 * 16/4], 8 * 16);
4007#endif
4008
4009 /*
4010 * Hardware doesn't save and restore XMM 0-7 without CR4.OSFXSR, but
4011 * does save and restore MXCSR.
4012 */
4013 if (!(ctxt->ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR))
4014 memcpy(new->xmm_space, old.xmm_space, 8 * 16);
4015
4016 return rc;
4017} 4001}
4018 4002
4019static int em_fxrstor(struct x86_emulate_ctxt *ctxt) 4003static int em_fxrstor(struct x86_emulate_ctxt *ctxt)
4020{ 4004{
4021 struct fxregs_state fx_state; 4005 struct fxregs_state fx_state;
4022 int rc; 4006 int rc;
4007 size_t size;
4023 4008
4024 rc = check_fxsr(ctxt); 4009 rc = check_fxsr(ctxt);
4025 if (rc != X86EMUL_CONTINUE) 4010 if (rc != X86EMUL_CONTINUE)
4026 return rc; 4011 return rc;
4027 4012
4028 rc = segmented_read_std(ctxt, ctxt->memop.addr.mem, &fx_state, 512); 4013 ctxt->ops->get_fpu(ctxt);
4029 if (rc != X86EMUL_CONTINUE)
4030 return rc;
4031 4014
4032 if (fx_state.mxcsr >> 16) 4015 size = fxstate_size(ctxt);
4033 return emulate_gp(ctxt, 0); 4016 if (size < __fxstate_size(16)) {
4017 rc = asm_safe("fxsave %[fx]", , [fx] "+m"(fx_state));
4018 if (rc != X86EMUL_CONTINUE)
4019 goto out;
4020 }
4034 4021
4035 ctxt->ops->get_fpu(ctxt); 4022 rc = segmented_read_std(ctxt, ctxt->memop.addr.mem, &fx_state, size);
4023 if (rc != X86EMUL_CONTINUE)
4024 goto out;
4036 4025
4037 if (ctxt->mode < X86EMUL_MODE_PROT64) 4026 if (fx_state.mxcsr >> 16) {
4038 rc = fxrstor_fixup(ctxt, &fx_state); 4027 rc = emulate_gp(ctxt, 0);
4028 goto out;
4029 }
4039 4030
4040 if (rc == X86EMUL_CONTINUE) 4031 if (rc == X86EMUL_CONTINUE)
4041 rc = asm_safe("fxrstor %[fx]", : [fx] "m"(fx_state)); 4032 rc = asm_safe("fxrstor %[fx]", : [fx] "m"(fx_state));
4042 4033
4034out:
4043 ctxt->ops->put_fpu(ctxt); 4035 ctxt->ops->put_fpu(ctxt);
4044 4036
4045 return rc; 4037 return rc;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d24c8742d9b0..2819d4c123eb 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1495,6 +1495,7 @@ EXPORT_SYMBOL_GPL(kvm_lapic_hv_timer_in_use);
1495 1495
1496static void cancel_hv_timer(struct kvm_lapic *apic) 1496static void cancel_hv_timer(struct kvm_lapic *apic)
1497{ 1497{
1498 WARN_ON(!apic->lapic_timer.hv_timer_in_use);
1498 preempt_disable(); 1499 preempt_disable();
1499 kvm_x86_ops->cancel_hv_timer(apic->vcpu); 1500 kvm_x86_ops->cancel_hv_timer(apic->vcpu);
1500 apic->lapic_timer.hv_timer_in_use = false; 1501 apic->lapic_timer.hv_timer_in_use = false;
@@ -1503,25 +1504,56 @@ static void cancel_hv_timer(struct kvm_lapic *apic)
1503 1504
1504static bool start_hv_timer(struct kvm_lapic *apic) 1505static bool start_hv_timer(struct kvm_lapic *apic)
1505{ 1506{
1506 u64 tscdeadline = apic->lapic_timer.tscdeadline; 1507 struct kvm_timer *ktimer = &apic->lapic_timer;
1508 int r;
1507 1509
1508 if ((atomic_read(&apic->lapic_timer.pending) && 1510 if (!kvm_x86_ops->set_hv_timer)
1509 !apic_lvtt_period(apic)) || 1511 return false;
1510 kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) { 1512
1511 if (apic->lapic_timer.hv_timer_in_use) 1513 if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
1512 cancel_hv_timer(apic); 1514 return false;
1513 } else {
1514 apic->lapic_timer.hv_timer_in_use = true;
1515 hrtimer_cancel(&apic->lapic_timer.timer);
1516 1515
1517 /* In case the sw timer triggered in the window */ 1516 r = kvm_x86_ops->set_hv_timer(apic->vcpu, ktimer->tscdeadline);
1518 if (atomic_read(&apic->lapic_timer.pending) && 1517 if (r < 0)
1519 !apic_lvtt_period(apic)) 1518 return false;
1520 cancel_hv_timer(apic); 1519
1520 ktimer->hv_timer_in_use = true;
1521 hrtimer_cancel(&ktimer->timer);
1522
1523 /*
1524 * Also recheck ktimer->pending, in case the sw timer triggered in
1525 * the window. For periodic timer, leave the hv timer running for
1526 * simplicity, and the deadline will be recomputed on the next vmexit.
1527 */
1528 if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
1529 if (r)
1530 apic_timer_expired(apic);
1531 return false;
1521 } 1532 }
1522 trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, 1533
1523 apic->lapic_timer.hv_timer_in_use); 1534 trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, true);
1524 return apic->lapic_timer.hv_timer_in_use; 1535 return true;
1536}
1537
1538static void start_sw_timer(struct kvm_lapic *apic)
1539{
1540 struct kvm_timer *ktimer = &apic->lapic_timer;
1541 if (apic->lapic_timer.hv_timer_in_use)
1542 cancel_hv_timer(apic);
1543 if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
1544 return;
1545
1546 if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic))
1547 start_sw_period(apic);
1548 else if (apic_lvtt_tscdeadline(apic))
1549 start_sw_tscdeadline(apic);
1550 trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, false);
1551}
1552
1553static void restart_apic_timer(struct kvm_lapic *apic)
1554{
1555 if (!start_hv_timer(apic))
1556 start_sw_timer(apic);
1525} 1557}
1526 1558
1527void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu) 1559void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
@@ -1535,19 +1567,14 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
1535 1567
1536 if (apic_lvtt_period(apic) && apic->lapic_timer.period) { 1568 if (apic_lvtt_period(apic) && apic->lapic_timer.period) {
1537 advance_periodic_target_expiration(apic); 1569 advance_periodic_target_expiration(apic);
1538 if (!start_hv_timer(apic)) 1570 restart_apic_timer(apic);
1539 start_sw_period(apic);
1540 } 1571 }
1541} 1572}
1542EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer); 1573EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
1543 1574
1544void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu) 1575void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu)
1545{ 1576{
1546 struct kvm_lapic *apic = vcpu->arch.apic; 1577 restart_apic_timer(vcpu->arch.apic);
1547
1548 WARN_ON(apic->lapic_timer.hv_timer_in_use);
1549
1550 start_hv_timer(apic);
1551} 1578}
1552EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer); 1579EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer);
1553 1580
@@ -1556,33 +1583,28 @@ void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
1556 struct kvm_lapic *apic = vcpu->arch.apic; 1583 struct kvm_lapic *apic = vcpu->arch.apic;
1557 1584
1558 /* Possibly the TSC deadline timer is not enabled yet */ 1585 /* Possibly the TSC deadline timer is not enabled yet */
1559 if (!apic->lapic_timer.hv_timer_in_use) 1586 if (apic->lapic_timer.hv_timer_in_use)
1560 return; 1587 start_sw_timer(apic);
1561 1588}
1562 cancel_hv_timer(apic); 1589EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_sw_timer);
1563 1590
1564 if (atomic_read(&apic->lapic_timer.pending)) 1591void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu)
1565 return; 1592{
1593 struct kvm_lapic *apic = vcpu->arch.apic;
1566 1594
1567 if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) 1595 WARN_ON(!apic->lapic_timer.hv_timer_in_use);
1568 start_sw_period(apic); 1596 restart_apic_timer(apic);
1569 else if (apic_lvtt_tscdeadline(apic))
1570 start_sw_tscdeadline(apic);
1571} 1597}
1572EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_sw_timer);
1573 1598
1574static void start_apic_timer(struct kvm_lapic *apic) 1599static void start_apic_timer(struct kvm_lapic *apic)
1575{ 1600{
1576 atomic_set(&apic->lapic_timer.pending, 0); 1601 atomic_set(&apic->lapic_timer.pending, 0);
1577 1602
1578 if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) { 1603 if ((apic_lvtt_period(apic) || apic_lvtt_oneshot(apic))
1579 if (set_target_expiration(apic) && 1604 && !set_target_expiration(apic))
1580 !(kvm_x86_ops->set_hv_timer && start_hv_timer(apic))) 1605 return;
1581 start_sw_period(apic); 1606
1582 } else if (apic_lvtt_tscdeadline(apic)) { 1607 restart_apic_timer(apic);
1583 if (!(kvm_x86_ops->set_hv_timer && start_hv_timer(apic)))
1584 start_sw_tscdeadline(apic);
1585 }
1586} 1608}
1587 1609
1588static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val) 1610static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
@@ -1813,16 +1835,6 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu)
1813 * LAPIC interface 1835 * LAPIC interface
1814 *---------------------------------------------------------------------- 1836 *----------------------------------------------------------------------
1815 */ 1837 */
1816u64 kvm_get_lapic_target_expiration_tsc(struct kvm_vcpu *vcpu)
1817{
1818 struct kvm_lapic *apic = vcpu->arch.apic;
1819
1820 if (!lapic_in_kernel(vcpu))
1821 return 0;
1822
1823 return apic->lapic_timer.tscdeadline;
1824}
1825
1826u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu) 1838u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu)
1827{ 1839{
1828 struct kvm_lapic *apic = vcpu->arch.apic; 1840 struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index bcbe811f3b97..29caa2c3dff9 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -87,7 +87,6 @@ int kvm_apic_get_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
87int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s); 87int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
88int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu); 88int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
89 89
90u64 kvm_get_lapic_target_expiration_tsc(struct kvm_vcpu *vcpu);
91u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu); 90u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
92void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data); 91void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
93 92
@@ -216,4 +215,5 @@ void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu);
216void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu); 215void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu);
217void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu); 216void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
218bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu); 217bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
218void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
219#endif 219#endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cb8225969255..aafd399cf8c6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -183,13 +183,13 @@ static u64 __read_mostly shadow_user_mask;
183static u64 __read_mostly shadow_accessed_mask; 183static u64 __read_mostly shadow_accessed_mask;
184static u64 __read_mostly shadow_dirty_mask; 184static u64 __read_mostly shadow_dirty_mask;
185static u64 __read_mostly shadow_mmio_mask; 185static u64 __read_mostly shadow_mmio_mask;
186static u64 __read_mostly shadow_mmio_value;
186static u64 __read_mostly shadow_present_mask; 187static u64 __read_mostly shadow_present_mask;
187 188
188/* 189/*
189 * The mask/value to distinguish a PTE that has been marked not-present for 190 * SPTEs used by MMUs without A/D bits are marked with shadow_acc_track_value.
190 * access tracking purposes. 191 * Non-present SPTEs with shadow_acc_track_value set are in place for access
191 * The mask would be either 0 if access tracking is disabled, or 192 * tracking.
192 * SPTE_SPECIAL_MASK|VMX_EPT_RWX_MASK if access tracking is enabled.
193 */ 193 */
194static u64 __read_mostly shadow_acc_track_mask; 194static u64 __read_mostly shadow_acc_track_mask;
195static const u64 shadow_acc_track_value = SPTE_SPECIAL_MASK; 195static const u64 shadow_acc_track_value = SPTE_SPECIAL_MASK;
@@ -207,16 +207,40 @@ static const u64 shadow_acc_track_saved_bits_shift = PT64_SECOND_AVAIL_BITS_SHIF
207static void mmu_spte_set(u64 *sptep, u64 spte); 207static void mmu_spte_set(u64 *sptep, u64 spte);
208static void mmu_free_roots(struct kvm_vcpu *vcpu); 208static void mmu_free_roots(struct kvm_vcpu *vcpu);
209 209
210void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask) 210void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
211{ 211{
212 BUG_ON((mmio_mask & mmio_value) != mmio_value);
213 shadow_mmio_value = mmio_value | SPTE_SPECIAL_MASK;
212 shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK; 214 shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
213} 215}
214EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask); 216EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
215 217
218static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
219{
220 return sp->role.ad_disabled;
221}
222
223static inline bool spte_ad_enabled(u64 spte)
224{
225 MMU_WARN_ON((spte & shadow_mmio_mask) == shadow_mmio_value);
226 return !(spte & shadow_acc_track_value);
227}
228
229static inline u64 spte_shadow_accessed_mask(u64 spte)
230{
231 MMU_WARN_ON((spte & shadow_mmio_mask) == shadow_mmio_value);
232 return spte_ad_enabled(spte) ? shadow_accessed_mask : 0;
233}
234
235static inline u64 spte_shadow_dirty_mask(u64 spte)
236{
237 MMU_WARN_ON((spte & shadow_mmio_mask) == shadow_mmio_value);
238 return spte_ad_enabled(spte) ? shadow_dirty_mask : 0;
239}
240
216static inline bool is_access_track_spte(u64 spte) 241static inline bool is_access_track_spte(u64 spte)
217{ 242{
218 /* Always false if shadow_acc_track_mask is zero. */ 243 return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
219 return (spte & shadow_acc_track_mask) == shadow_acc_track_value;
220} 244}
221 245
222/* 246/*
@@ -270,7 +294,7 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
270 u64 mask = generation_mmio_spte_mask(gen); 294 u64 mask = generation_mmio_spte_mask(gen);
271 295
272 access &= ACC_WRITE_MASK | ACC_USER_MASK; 296 access &= ACC_WRITE_MASK | ACC_USER_MASK;
273 mask |= shadow_mmio_mask | access | gfn << PAGE_SHIFT; 297 mask |= shadow_mmio_value | access | gfn << PAGE_SHIFT;
274 298
275 trace_mark_mmio_spte(sptep, gfn, access, gen); 299 trace_mark_mmio_spte(sptep, gfn, access, gen);
276 mmu_spte_set(sptep, mask); 300 mmu_spte_set(sptep, mask);
@@ -278,7 +302,7 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
278 302
279static bool is_mmio_spte(u64 spte) 303static bool is_mmio_spte(u64 spte)
280{ 304{
281 return (spte & shadow_mmio_mask) == shadow_mmio_mask; 305 return (spte & shadow_mmio_mask) == shadow_mmio_value;
282} 306}
283 307
284static gfn_t get_mmio_spte_gfn(u64 spte) 308static gfn_t get_mmio_spte_gfn(u64 spte)
@@ -315,12 +339,20 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
315 return likely(kvm_gen == spte_gen); 339 return likely(kvm_gen == spte_gen);
316} 340}
317 341
342/*
343 * Sets the shadow PTE masks used by the MMU.
344 *
345 * Assumptions:
346 * - Setting either @accessed_mask or @dirty_mask requires setting both
347 * - At least one of @accessed_mask or @acc_track_mask must be set
348 */
318void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, 349void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
319 u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask, 350 u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
320 u64 acc_track_mask) 351 u64 acc_track_mask)
321{ 352{
322 if (acc_track_mask != 0) 353 BUG_ON(!dirty_mask != !accessed_mask);
323 acc_track_mask |= SPTE_SPECIAL_MASK; 354 BUG_ON(!accessed_mask && !acc_track_mask);
355 BUG_ON(acc_track_mask & shadow_acc_track_value);
324 356
325 shadow_user_mask = user_mask; 357 shadow_user_mask = user_mask;
326 shadow_accessed_mask = accessed_mask; 358 shadow_accessed_mask = accessed_mask;
@@ -329,7 +361,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
329 shadow_x_mask = x_mask; 361 shadow_x_mask = x_mask;
330 shadow_present_mask = p_mask; 362 shadow_present_mask = p_mask;
331 shadow_acc_track_mask = acc_track_mask; 363 shadow_acc_track_mask = acc_track_mask;
332 WARN_ON(shadow_accessed_mask != 0 && shadow_acc_track_mask != 0);
333} 364}
334EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes); 365EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
335 366
@@ -549,7 +580,7 @@ static bool spte_has_volatile_bits(u64 spte)
549 is_access_track_spte(spte)) 580 is_access_track_spte(spte))
550 return true; 581 return true;
551 582
552 if (shadow_accessed_mask) { 583 if (spte_ad_enabled(spte)) {
553 if ((spte & shadow_accessed_mask) == 0 || 584 if ((spte & shadow_accessed_mask) == 0 ||
554 (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0)) 585 (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0))
555 return true; 586 return true;
@@ -560,14 +591,17 @@ static bool spte_has_volatile_bits(u64 spte)
560 591
561static bool is_accessed_spte(u64 spte) 592static bool is_accessed_spte(u64 spte)
562{ 593{
563 return shadow_accessed_mask ? spte & shadow_accessed_mask 594 u64 accessed_mask = spte_shadow_accessed_mask(spte);
564 : !is_access_track_spte(spte); 595
596 return accessed_mask ? spte & accessed_mask
597 : !is_access_track_spte(spte);
565} 598}
566 599
567static bool is_dirty_spte(u64 spte) 600static bool is_dirty_spte(u64 spte)
568{ 601{
569 return shadow_dirty_mask ? spte & shadow_dirty_mask 602 u64 dirty_mask = spte_shadow_dirty_mask(spte);
570 : spte & PT_WRITABLE_MASK; 603
604 return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
571} 605}
572 606
573/* Rules for using mmu_spte_set: 607/* Rules for using mmu_spte_set:
@@ -707,10 +741,10 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
707 741
708static u64 mark_spte_for_access_track(u64 spte) 742static u64 mark_spte_for_access_track(u64 spte)
709{ 743{
710 if (shadow_accessed_mask != 0) 744 if (spte_ad_enabled(spte))
711 return spte & ~shadow_accessed_mask; 745 return spte & ~shadow_accessed_mask;
712 746
713 if (shadow_acc_track_mask == 0 || is_access_track_spte(spte)) 747 if (is_access_track_spte(spte))
714 return spte; 748 return spte;
715 749
716 /* 750 /*
@@ -729,7 +763,6 @@ static u64 mark_spte_for_access_track(u64 spte)
729 spte |= (spte & shadow_acc_track_saved_bits_mask) << 763 spte |= (spte & shadow_acc_track_saved_bits_mask) <<
730 shadow_acc_track_saved_bits_shift; 764 shadow_acc_track_saved_bits_shift;
731 spte &= ~shadow_acc_track_mask; 765 spte &= ~shadow_acc_track_mask;
732 spte |= shadow_acc_track_value;
733 766
734 return spte; 767 return spte;
735} 768}
@@ -741,6 +774,7 @@ static u64 restore_acc_track_spte(u64 spte)
741 u64 saved_bits = (spte >> shadow_acc_track_saved_bits_shift) 774 u64 saved_bits = (spte >> shadow_acc_track_saved_bits_shift)
742 & shadow_acc_track_saved_bits_mask; 775 & shadow_acc_track_saved_bits_mask;
743 776
777 WARN_ON_ONCE(spte_ad_enabled(spte));
744 WARN_ON_ONCE(!is_access_track_spte(spte)); 778 WARN_ON_ONCE(!is_access_track_spte(spte));
745 779
746 new_spte &= ~shadow_acc_track_mask; 780 new_spte &= ~shadow_acc_track_mask;
@@ -759,7 +793,7 @@ static bool mmu_spte_age(u64 *sptep)
759 if (!is_accessed_spte(spte)) 793 if (!is_accessed_spte(spte))
760 return false; 794 return false;
761 795
762 if (shadow_accessed_mask) { 796 if (spte_ad_enabled(spte)) {
763 clear_bit((ffs(shadow_accessed_mask) - 1), 797 clear_bit((ffs(shadow_accessed_mask) - 1),
764 (unsigned long *)sptep); 798 (unsigned long *)sptep);
765 } else { 799 } else {
@@ -1390,6 +1424,22 @@ static bool spte_clear_dirty(u64 *sptep)
1390 return mmu_spte_update(sptep, spte); 1424 return mmu_spte_update(sptep, spte);
1391} 1425}
1392 1426
1427static bool wrprot_ad_disabled_spte(u64 *sptep)
1428{
1429 bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
1430 (unsigned long *)sptep);
1431 if (was_writable)
1432 kvm_set_pfn_dirty(spte_to_pfn(*sptep));
1433
1434 return was_writable;
1435}
1436
1437/*
1438 * Gets the GFN ready for another round of dirty logging by clearing the
1439 * - D bit on ad-enabled SPTEs, and
1440 * - W bit on ad-disabled SPTEs.
1441 * Returns true iff any D or W bits were cleared.
1442 */
1393static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head) 1443static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
1394{ 1444{
1395 u64 *sptep; 1445 u64 *sptep;
@@ -1397,7 +1447,10 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
1397 bool flush = false; 1447 bool flush = false;
1398 1448
1399 for_each_rmap_spte(rmap_head, &iter, sptep) 1449 for_each_rmap_spte(rmap_head, &iter, sptep)
1400 flush |= spte_clear_dirty(sptep); 1450 if (spte_ad_enabled(*sptep))
1451 flush |= spte_clear_dirty(sptep);
1452 else
1453 flush |= wrprot_ad_disabled_spte(sptep);
1401 1454
1402 return flush; 1455 return flush;
1403} 1456}
@@ -1420,7 +1473,8 @@ static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
1420 bool flush = false; 1473 bool flush = false;
1421 1474
1422 for_each_rmap_spte(rmap_head, &iter, sptep) 1475 for_each_rmap_spte(rmap_head, &iter, sptep)
1423 flush |= spte_set_dirty(sptep); 1476 if (spte_ad_enabled(*sptep))
1477 flush |= spte_set_dirty(sptep);
1424 1478
1425 return flush; 1479 return flush;
1426} 1480}
@@ -1452,7 +1506,8 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
1452} 1506}
1453 1507
1454/** 1508/**
1455 * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages 1509 * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages, or write
1510 * protect the page if the D-bit isn't supported.
1456 * @kvm: kvm instance 1511 * @kvm: kvm instance
1457 * @slot: slot to clear D-bit 1512 * @slot: slot to clear D-bit
1458 * @gfn_offset: start of the BITS_PER_LONG pages we care about 1513 * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1766,18 +1821,9 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
1766 u64 *sptep; 1821 u64 *sptep;
1767 struct rmap_iterator iter; 1822 struct rmap_iterator iter;
1768 1823
1769 /*
1770 * If there's no access bit in the secondary pte set by the hardware and
1771 * fast access tracking is also not enabled, it's up to gup-fast/gup to
1772 * set the access bit in the primary pte or in the page structure.
1773 */
1774 if (!shadow_accessed_mask && !shadow_acc_track_mask)
1775 goto out;
1776
1777 for_each_rmap_spte(rmap_head, &iter, sptep) 1824 for_each_rmap_spte(rmap_head, &iter, sptep)
1778 if (is_accessed_spte(*sptep)) 1825 if (is_accessed_spte(*sptep))
1779 return 1; 1826 return 1;
1780out:
1781 return 0; 1827 return 0;
1782} 1828}
1783 1829
@@ -1798,18 +1844,6 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
1798 1844
1799int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end) 1845int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
1800{ 1846{
1801 /*
1802 * In case of absence of EPT Access and Dirty Bits supports,
1803 * emulate the accessed bit for EPT, by checking if this page has
1804 * an EPT mapping, and clearing it if it does. On the next access,
1805 * a new EPT mapping will be established.
1806 * This has some overhead, but not as much as the cost of swapping
1807 * out actively used pages or breaking up actively used hugepages.
1808 */
1809 if (!shadow_accessed_mask && !shadow_acc_track_mask)
1810 return kvm_handle_hva_range(kvm, start, end, 0,
1811 kvm_unmap_rmapp);
1812
1813 return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp); 1847 return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
1814} 1848}
1815 1849
@@ -2398,7 +2432,12 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
2398 BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK); 2432 BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
2399 2433
2400 spte = __pa(sp->spt) | shadow_present_mask | PT_WRITABLE_MASK | 2434 spte = __pa(sp->spt) | shadow_present_mask | PT_WRITABLE_MASK |
2401 shadow_user_mask | shadow_x_mask | shadow_accessed_mask; 2435 shadow_user_mask | shadow_x_mask;
2436
2437 if (sp_ad_disabled(sp))
2438 spte |= shadow_acc_track_value;
2439 else
2440 spte |= shadow_accessed_mask;
2402 2441
2403 mmu_spte_set(sptep, spte); 2442 mmu_spte_set(sptep, spte);
2404 2443
@@ -2666,10 +2705,15 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
2666{ 2705{
2667 u64 spte = 0; 2706 u64 spte = 0;
2668 int ret = 0; 2707 int ret = 0;
2708 struct kvm_mmu_page *sp;
2669 2709
2670 if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access)) 2710 if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
2671 return 0; 2711 return 0;
2672 2712
2713 sp = page_header(__pa(sptep));
2714 if (sp_ad_disabled(sp))
2715 spte |= shadow_acc_track_value;
2716
2673 /* 2717 /*
2674 * For the EPT case, shadow_present_mask is 0 if hardware 2718 * For the EPT case, shadow_present_mask is 0 if hardware
2675 * supports exec-only page table entries. In that case, 2719 * supports exec-only page table entries. In that case,
@@ -2678,7 +2722,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
2678 */ 2722 */
2679 spte |= shadow_present_mask; 2723 spte |= shadow_present_mask;
2680 if (!speculative) 2724 if (!speculative)
2681 spte |= shadow_accessed_mask; 2725 spte |= spte_shadow_accessed_mask(spte);
2682 2726
2683 if (pte_access & ACC_EXEC_MASK) 2727 if (pte_access & ACC_EXEC_MASK)
2684 spte |= shadow_x_mask; 2728 spte |= shadow_x_mask;
@@ -2735,7 +2779,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
2735 2779
2736 if (pte_access & ACC_WRITE_MASK) { 2780 if (pte_access & ACC_WRITE_MASK) {
2737 kvm_vcpu_mark_page_dirty(vcpu, gfn); 2781 kvm_vcpu_mark_page_dirty(vcpu, gfn);
2738 spte |= shadow_dirty_mask; 2782 spte |= spte_shadow_dirty_mask(spte);
2739 } 2783 }
2740 2784
2741 if (speculative) 2785 if (speculative)
@@ -2877,16 +2921,16 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
2877{ 2921{
2878 struct kvm_mmu_page *sp; 2922 struct kvm_mmu_page *sp;
2879 2923
2924 sp = page_header(__pa(sptep));
2925
2880 /* 2926 /*
2881 * Since it's no accessed bit on EPT, it's no way to 2927 * Without accessed bits, there's no way to distinguish between
2882 * distinguish between actually accessed translations 2928 * actually accessed translations and prefetched, so disable pte
2883 * and prefetched, so disable pte prefetch if EPT is 2929 * prefetch if accessed bits aren't available.
2884 * enabled.
2885 */ 2930 */
2886 if (!shadow_accessed_mask) 2931 if (sp_ad_disabled(sp))
2887 return; 2932 return;
2888 2933
2889 sp = page_header(__pa(sptep));
2890 if (sp->role.level > PT_PAGE_TABLE_LEVEL) 2934 if (sp->role.level > PT_PAGE_TABLE_LEVEL)
2891 return; 2935 return;
2892 2936
@@ -4290,6 +4334,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
4290 4334
4291 context->base_role.word = 0; 4335 context->base_role.word = 0;
4292 context->base_role.smm = is_smm(vcpu); 4336 context->base_role.smm = is_smm(vcpu);
4337 context->base_role.ad_disabled = (shadow_accessed_mask == 0);
4293 context->page_fault = tdp_page_fault; 4338 context->page_fault = tdp_page_fault;
4294 context->sync_page = nonpaging_sync_page; 4339 context->sync_page = nonpaging_sync_page;
4295 context->invlpg = nonpaging_invlpg; 4340 context->invlpg = nonpaging_invlpg;
@@ -4377,6 +4422,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
4377 context->root_level = context->shadow_root_level; 4422 context->root_level = context->shadow_root_level;
4378 context->root_hpa = INVALID_PAGE; 4423 context->root_hpa = INVALID_PAGE;
4379 context->direct_map = false; 4424 context->direct_map = false;
4425 context->base_role.ad_disabled = !accessed_dirty;
4380 4426
4381 update_permission_bitmask(vcpu, context, true); 4427 update_permission_bitmask(vcpu, context, true);
4382 update_pkru_bitmask(vcpu, context, true); 4428 update_pkru_bitmask(vcpu, context, true);
@@ -4636,6 +4682,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
4636 mask.smep_andnot_wp = 1; 4682 mask.smep_andnot_wp = 1;
4637 mask.smap_andnot_wp = 1; 4683 mask.smap_andnot_wp = 1;
4638 mask.smm = 1; 4684 mask.smm = 1;
4685 mask.ad_disabled = 1;
4639 4686
4640 /* 4687 /*
4641 * If we don't have indirect shadow pages, it means no page is 4688 * If we don't have indirect shadow pages, it means no page is
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 330bf3a811fb..a276834950c1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -51,7 +51,7 @@ static inline u64 rsvd_bits(int s, int e)
51 return ((1ULL << (e - s + 1)) - 1) << s; 51 return ((1ULL << (e - s + 1)) - 1) << s;
52} 52}
53 53
54void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask); 54void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value);
55 55
56void 56void
57reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context); 57reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 5a24b846a1cb..8b97a6cba8d1 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -30,8 +30,9 @@
30 \ 30 \
31 role.word = __entry->role; \ 31 role.word = __entry->role; \
32 \ 32 \
33 trace_seq_printf(p, "sp gen %lx gfn %llx %u%s q%u%s %s%s" \ 33 trace_seq_printf(p, "sp gen %lx gfn %llx l%u%s q%u%s %s%s" \
34 " %snxe root %u %s%c", __entry->mmu_valid_gen, \ 34 " %snxe %sad root %u %s%c", \
35 __entry->mmu_valid_gen, \
35 __entry->gfn, role.level, \ 36 __entry->gfn, role.level, \
36 role.cr4_pae ? " pae" : "", \ 37 role.cr4_pae ? " pae" : "", \
37 role.quadrant, \ 38 role.quadrant, \
@@ -39,6 +40,7 @@
39 access_str[role.access], \ 40 access_str[role.access], \
40 role.invalid ? " invalid" : "", \ 41 role.invalid ? " invalid" : "", \
41 role.nxe ? "" : "!", \ 42 role.nxe ? "" : "!", \
43 role.ad_disabled ? "!" : "", \
42 __entry->root_count, \ 44 __entry->root_count, \
43 __entry->unsync ? "unsync" : "sync", 0); \ 45 __entry->unsync ? "unsync" : "sync", 0); \
44 saved_ptr; \ 46 saved_ptr; \
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 33460fcdeef9..905ea6052517 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -190,6 +190,7 @@ struct vcpu_svm {
190 struct nested_state nested; 190 struct nested_state nested;
191 191
192 bool nmi_singlestep; 192 bool nmi_singlestep;
193 u64 nmi_singlestep_guest_rflags;
193 194
194 unsigned int3_injected; 195 unsigned int3_injected;
195 unsigned long int3_rip; 196 unsigned long int3_rip;
@@ -964,6 +965,18 @@ static void svm_disable_lbrv(struct vcpu_svm *svm)
964 set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0); 965 set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
965} 966}
966 967
968static void disable_nmi_singlestep(struct vcpu_svm *svm)
969{
970 svm->nmi_singlestep = false;
971 if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP)) {
972 /* Clear our flags if they were not set by the guest */
973 if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF))
974 svm->vmcb->save.rflags &= ~X86_EFLAGS_TF;
975 if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF))
976 svm->vmcb->save.rflags &= ~X86_EFLAGS_RF;
977 }
978}
979
967/* Note: 980/* Note:
968 * This hash table is used to map VM_ID to a struct kvm_arch, 981 * This hash table is used to map VM_ID to a struct kvm_arch,
969 * when handling AMD IOMMU GALOG notification to schedule in 982 * when handling AMD IOMMU GALOG notification to schedule in
@@ -1713,11 +1726,24 @@ static void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
1713 1726
1714static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu) 1727static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
1715{ 1728{
1716 return to_svm(vcpu)->vmcb->save.rflags; 1729 struct vcpu_svm *svm = to_svm(vcpu);
1730 unsigned long rflags = svm->vmcb->save.rflags;
1731
1732 if (svm->nmi_singlestep) {
1733 /* Hide our flags if they were not set by the guest */
1734 if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF))
1735 rflags &= ~X86_EFLAGS_TF;
1736 if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF))
1737 rflags &= ~X86_EFLAGS_RF;
1738 }
1739 return rflags;
1717} 1740}
1718 1741
1719static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) 1742static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
1720{ 1743{
1744 if (to_svm(vcpu)->nmi_singlestep)
1745 rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
1746
1721 /* 1747 /*
1722 * Any change of EFLAGS.VM is accompanied by a reload of SS 1748 * Any change of EFLAGS.VM is accompanied by a reload of SS
1723 * (caused by either a task switch or an inter-privilege IRET), 1749 * (caused by either a task switch or an inter-privilege IRET),
@@ -2112,10 +2138,7 @@ static int db_interception(struct vcpu_svm *svm)
2112 } 2138 }
2113 2139
2114 if (svm->nmi_singlestep) { 2140 if (svm->nmi_singlestep) {
2115 svm->nmi_singlestep = false; 2141 disable_nmi_singlestep(svm);
2116 if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP))
2117 svm->vmcb->save.rflags &=
2118 ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
2119 } 2142 }
2120 2143
2121 if (svm->vcpu.guest_debug & 2144 if (svm->vcpu.guest_debug &
@@ -2370,8 +2393,8 @@ static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu)
2370 2393
2371static int nested_svm_check_permissions(struct vcpu_svm *svm) 2394static int nested_svm_check_permissions(struct vcpu_svm *svm)
2372{ 2395{
2373 if (!(svm->vcpu.arch.efer & EFER_SVME) 2396 if (!(svm->vcpu.arch.efer & EFER_SVME) ||
2374 || !is_paging(&svm->vcpu)) { 2397 !is_paging(&svm->vcpu)) {
2375 kvm_queue_exception(&svm->vcpu, UD_VECTOR); 2398 kvm_queue_exception(&svm->vcpu, UD_VECTOR);
2376 return 1; 2399 return 1;
2377 } 2400 }
@@ -2381,7 +2404,7 @@ static int nested_svm_check_permissions(struct vcpu_svm *svm)
2381 return 1; 2404 return 1;
2382 } 2405 }
2383 2406
2384 return 0; 2407 return 0;
2385} 2408}
2386 2409
2387static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, 2410static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
@@ -2534,6 +2557,31 @@ static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
2534 return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; 2557 return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
2535} 2558}
2536 2559
2560/* DB exceptions for our internal use must not cause vmexit */
2561static int nested_svm_intercept_db(struct vcpu_svm *svm)
2562{
2563 unsigned long dr6;
2564
2565 /* if we're not singlestepping, it's not ours */
2566 if (!svm->nmi_singlestep)
2567 return NESTED_EXIT_DONE;
2568
2569 /* if it's not a singlestep exception, it's not ours */
2570 if (kvm_get_dr(&svm->vcpu, 6, &dr6))
2571 return NESTED_EXIT_DONE;
2572 if (!(dr6 & DR6_BS))
2573 return NESTED_EXIT_DONE;
2574
2575 /* if the guest is singlestepping, it should get the vmexit */
2576 if (svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF) {
2577 disable_nmi_singlestep(svm);
2578 return NESTED_EXIT_DONE;
2579 }
2580
2581 /* it's ours, the nested hypervisor must not see this one */
2582 return NESTED_EXIT_HOST;
2583}
2584
2537static int nested_svm_exit_special(struct vcpu_svm *svm) 2585static int nested_svm_exit_special(struct vcpu_svm *svm)
2538{ 2586{
2539 u32 exit_code = svm->vmcb->control.exit_code; 2587 u32 exit_code = svm->vmcb->control.exit_code;
@@ -2589,8 +2637,12 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
2589 } 2637 }
2590 case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: { 2638 case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
2591 u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE); 2639 u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
2592 if (svm->nested.intercept_exceptions & excp_bits) 2640 if (svm->nested.intercept_exceptions & excp_bits) {
2593 vmexit = NESTED_EXIT_DONE; 2641 if (exit_code == SVM_EXIT_EXCP_BASE + DB_VECTOR)
2642 vmexit = nested_svm_intercept_db(svm);
2643 else
2644 vmexit = NESTED_EXIT_DONE;
2645 }
2594 /* async page fault always cause vmexit */ 2646 /* async page fault always cause vmexit */
2595 else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) && 2647 else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) &&
2596 svm->apf_reason != 0) 2648 svm->apf_reason != 0)
@@ -4627,10 +4679,17 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
4627 == HF_NMI_MASK) 4679 == HF_NMI_MASK)
4628 return; /* IRET will cause a vm exit */ 4680 return; /* IRET will cause a vm exit */
4629 4681
4682 if ((svm->vcpu.arch.hflags & HF_GIF_MASK) == 0)
4683 return; /* STGI will cause a vm exit */
4684
4685 if (svm->nested.exit_required)
4686 return; /* we're not going to run the guest yet */
4687
4630 /* 4688 /*
4631 * Something prevents NMI from being injected. Single step over possible 4689 * Something prevents NMI from being injected. Single step over possible
4632 * problem (IRET or exception injection or interrupt shadow) 4690 * problem (IRET or exception injection or interrupt shadow)
4633 */ 4691 */
4692 svm->nmi_singlestep_guest_rflags = svm_get_rflags(vcpu);
4634 svm->nmi_singlestep = true; 4693 svm->nmi_singlestep = true;
4635 svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); 4694 svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
4636} 4695}
@@ -4771,6 +4830,22 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
4771 if (unlikely(svm->nested.exit_required)) 4830 if (unlikely(svm->nested.exit_required))
4772 return; 4831 return;
4773 4832
4833 /*
4834 * Disable singlestep if we're injecting an interrupt/exception.
4835 * We don't want our modified rflags to be pushed on the stack where
4836 * we might not be able to easily reset them if we disabled NMI
4837 * singlestep later.
4838 */
4839 if (svm->nmi_singlestep && svm->vmcb->control.event_inj) {
4840 /*
4841 * Event injection happens before external interrupts cause a
4842 * vmexit and interrupts are disabled here, so smp_send_reschedule
4843 * is enough to force an immediate vmexit.
4844 */
4845 disable_nmi_singlestep(svm);
4846 smp_send_reschedule(vcpu->cpu);
4847 }
4848
4774 pre_svm_run(svm); 4849 pre_svm_run(svm);
4775 4850
4776 sync_lapic_to_cr8(vcpu); 4851 sync_lapic_to_cr8(vcpu);
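The svm.c changes above all protect one invariant: TF and RF that KVM sets in order to single-step over a blocked NMI window must never become visible to the guest (or to a nested hypervisor), while TF/RF the guest had set itself must survive. A compact model of the "hide on read" half, built around the RFLAGS snapshot taken when single-stepping is armed; the bit positions are the architectural ones:

    #include <stdbool.h>
    #include <stdint.h>

    #define X86_EFLAGS_TF (1ull << 8)   /* trap flag */
    #define X86_EFLAGS_RF (1ull << 16)  /* resume flag */

    /* Return the RFLAGS value the guest should see: strip TF/RF that KVM
     * added for NMI single-stepping, keep them if the guest owned them at
     * the time single-stepping was armed. */
    static uint64_t guest_visible_rflags(uint64_t hw_rflags,
                                         uint64_t rflags_at_arm,
                                         bool nmi_singlestep)
    {
        if (!nmi_singlestep)
            return hw_rflags;
        if (!(rflags_at_arm & X86_EFLAGS_TF))
            hw_rflags &= ~X86_EFLAGS_TF;
        if (!(rflags_at_arm & X86_EFLAGS_RF))
            hw_rflags &= ~X86_EFLAGS_RF;
        return hw_rflags;
    }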
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6dcc4873e435..f76efad248ab 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -913,8 +913,9 @@ static void nested_release_page_clean(struct page *page)
913 kvm_release_page_clean(page); 913 kvm_release_page_clean(page);
914} 914}
915 915
916static bool nested_ept_ad_enabled(struct kvm_vcpu *vcpu);
916static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu); 917static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu);
917static u64 construct_eptp(unsigned long root_hpa); 918static u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa);
918static bool vmx_xsaves_supported(void); 919static bool vmx_xsaves_supported(void);
919static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); 920static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
920static void vmx_set_segment(struct kvm_vcpu *vcpu, 921static void vmx_set_segment(struct kvm_vcpu *vcpu,
@@ -2772,7 +2773,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
2772 if (enable_ept_ad_bits) { 2773 if (enable_ept_ad_bits) {
2773 vmx->nested.nested_vmx_secondary_ctls_high |= 2774 vmx->nested.nested_vmx_secondary_ctls_high |=
2774 SECONDARY_EXEC_ENABLE_PML; 2775 SECONDARY_EXEC_ENABLE_PML;
2775 vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT; 2776 vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT;
2776 } 2777 }
2777 } else 2778 } else
2778 vmx->nested.nested_vmx_ept_caps = 0; 2779 vmx->nested.nested_vmx_ept_caps = 0;
@@ -3198,7 +3199,8 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3198 msr_info->data = vmcs_readl(GUEST_SYSENTER_ESP); 3199 msr_info->data = vmcs_readl(GUEST_SYSENTER_ESP);
3199 break; 3200 break;
3200 case MSR_IA32_BNDCFGS: 3201 case MSR_IA32_BNDCFGS:
3201 if (!kvm_mpx_supported()) 3202 if (!kvm_mpx_supported() ||
3203 (!msr_info->host_initiated && !guest_cpuid_has_mpx(vcpu)))
3202 return 1; 3204 return 1;
3203 msr_info->data = vmcs_read64(GUEST_BNDCFGS); 3205 msr_info->data = vmcs_read64(GUEST_BNDCFGS);
3204 break; 3206 break;
@@ -3280,7 +3282,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3280 vmcs_writel(GUEST_SYSENTER_ESP, data); 3282 vmcs_writel(GUEST_SYSENTER_ESP, data);
3281 break; 3283 break;
3282 case MSR_IA32_BNDCFGS: 3284 case MSR_IA32_BNDCFGS:
3283 if (!kvm_mpx_supported()) 3285 if (!kvm_mpx_supported() ||
3286 (!msr_info->host_initiated && !guest_cpuid_has_mpx(vcpu)))
3287 return 1;
3288 if (is_noncanonical_address(data & PAGE_MASK) ||
3289 (data & MSR_IA32_BNDCFGS_RSVD))
3284 return 1; 3290 return 1;
3285 vmcs_write64(GUEST_BNDCFGS, data); 3291 vmcs_write64(GUEST_BNDCFGS, data);
3286 break; 3292 break;
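The new BNDCFGS write checks reject values the CPU itself would refuse: the bound-directory base must be a canonical linear address, and the reserved bits between the enable flags and the 4 KiB-aligned base must be zero. A sketch of a canonicality test, assuming 48-bit linear addresses (the in-kernel is_noncanonical_address() helper is the authoritative version):

    #include <stdbool.h>
    #include <stdint.h>

    /* Canonical for 48-bit linear addresses: bits 63:47 must all equal
     * bit 47, i.e. sign-extending from bit 47 reproduces the value. */
    static bool is_canonical_48(uint64_t va)
    {
        return (uint64_t)((int64_t)(va << 16) >> 16) == va;
    }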
@@ -4013,7 +4019,7 @@ static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid)
4013 if (enable_ept) { 4019 if (enable_ept) {
4014 if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) 4020 if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
4015 return; 4021 return;
4016 ept_sync_context(construct_eptp(vcpu->arch.mmu.root_hpa)); 4022 ept_sync_context(construct_eptp(vcpu, vcpu->arch.mmu.root_hpa));
4017 } else { 4023 } else {
4018 vpid_sync_context(vpid); 4024 vpid_sync_context(vpid);
4019 } 4025 }
@@ -4188,14 +4194,15 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
4188 vmx->emulation_required = emulation_required(vcpu); 4194 vmx->emulation_required = emulation_required(vcpu);
4189} 4195}
4190 4196
4191static u64 construct_eptp(unsigned long root_hpa) 4197static u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
4192{ 4198{
4193 u64 eptp; 4199 u64 eptp;
4194 4200
4195 /* TODO write the value reading from MSR */ 4201 /* TODO write the value reading from MSR */
4196 eptp = VMX_EPT_DEFAULT_MT | 4202 eptp = VMX_EPT_DEFAULT_MT |
4197 VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT; 4203 VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT;
4198 if (enable_ept_ad_bits) 4204 if (enable_ept_ad_bits &&
4205 (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu)))
4199 eptp |= VMX_EPT_AD_ENABLE_BIT; 4206 eptp |= VMX_EPT_AD_ENABLE_BIT;
4200 eptp |= (root_hpa & PAGE_MASK); 4207 eptp |= (root_hpa & PAGE_MASK);
4201 4208
@@ -4209,7 +4216,7 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
4209 4216
4210 guest_cr3 = cr3; 4217 guest_cr3 = cr3;
4211 if (enable_ept) { 4218 if (enable_ept) {
4212 eptp = construct_eptp(cr3); 4219 eptp = construct_eptp(vcpu, cr3);
4213 vmcs_write64(EPT_POINTER, eptp); 4220 vmcs_write64(EPT_POINTER, eptp);
4214 if (is_paging(vcpu) || is_guest_mode(vcpu)) 4221 if (is_paging(vcpu) || is_guest_mode(vcpu))
4215 guest_cr3 = kvm_read_cr3(vcpu); 4222 guest_cr3 = kvm_read_cr3(vcpu);
@@ -5170,7 +5177,8 @@ static void ept_set_mmio_spte_mask(void)
5170 * EPT Misconfigurations can be generated if the value of bits 2:0 5177 * EPT Misconfigurations can be generated if the value of bits 2:0
5171 * of an EPT paging-structure entry is 110b (write/execute). 5178 * of an EPT paging-structure entry is 110b (write/execute).
5172 */ 5179 */
5173 kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE); 5180 kvm_mmu_set_mmio_spte_mask(VMX_EPT_RWX_MASK,
5181 VMX_EPT_MISCONFIG_WX_VALUE);
5174} 5182}
5175 5183
5176#define VMX_XSS_EXIT_BITMAP 0 5184#define VMX_XSS_EXIT_BITMAP 0
@@ -6220,17 +6228,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
6220 6228
6221 exit_qualification = vmcs_readl(EXIT_QUALIFICATION); 6229 exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
6222 6230
6223 if (is_guest_mode(vcpu)
6224 && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
6225 /*
6226 * Fix up exit_qualification according to whether guest
6227 * page table accesses are reads or writes.
6228 */
6229 u64 eptp = nested_ept_get_cr3(vcpu);
6230 if (!(eptp & VMX_EPT_AD_ENABLE_BIT))
6231 exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
6232 }
6233
6234 /* 6231 /*
6235 * EPT violation happened while executing iret from NMI, 6232 * EPT violation happened while executing iret from NMI,
6236 * "blocked by NMI" bit has to be set before next VM entry. 6233 * "blocked by NMI" bit has to be set before next VM entry.
@@ -6453,7 +6450,7 @@ void vmx_enable_tdp(void)
6453 enable_ept_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull, 6450 enable_ept_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull,
6454 0ull, VMX_EPT_EXECUTABLE_MASK, 6451 0ull, VMX_EPT_EXECUTABLE_MASK,
6455 cpu_has_vmx_ept_execute_only() ? 0ull : VMX_EPT_READABLE_MASK, 6452 cpu_has_vmx_ept_execute_only() ? 0ull : VMX_EPT_READABLE_MASK,
6456 enable_ept_ad_bits ? 0ull : VMX_EPT_RWX_MASK); 6453 VMX_EPT_RWX_MASK);
6457 6454
6458 ept_set_mmio_spte_mask(); 6455 ept_set_mmio_spte_mask();
6459 kvm_enable_tdp(); 6456 kvm_enable_tdp();
@@ -6557,7 +6554,6 @@ static __init int hardware_setup(void)
6557 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false); 6554 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
6558 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false); 6555 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
6559 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false); 6556 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
6560 vmx_disable_intercept_for_msr(MSR_IA32_BNDCFGS, true);
6561 6557
6562 memcpy(vmx_msr_bitmap_legacy_x2apic_apicv, 6558 memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
6563 vmx_msr_bitmap_legacy, PAGE_SIZE); 6559 vmx_msr_bitmap_legacy, PAGE_SIZE);
@@ -7661,7 +7657,10 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
7661 unsigned long type, types; 7657 unsigned long type, types;
7662 gva_t gva; 7658 gva_t gva;
7663 struct x86_exception e; 7659 struct x86_exception e;
7664 int vpid; 7660 struct {
7661 u64 vpid;
7662 u64 gla;
7663 } operand;
7665 7664
7666 if (!(vmx->nested.nested_vmx_secondary_ctls_high & 7665 if (!(vmx->nested.nested_vmx_secondary_ctls_high &
7667 SECONDARY_EXEC_ENABLE_VPID) || 7666 SECONDARY_EXEC_ENABLE_VPID) ||
@@ -7691,17 +7690,28 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
7691 if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), 7690 if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
7692 vmx_instruction_info, false, &gva)) 7691 vmx_instruction_info, false, &gva))
7693 return 1; 7692 return 1;
7694 if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vpid, 7693 if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
7695 sizeof(u32), &e)) { 7694 sizeof(operand), &e)) {
7696 kvm_inject_page_fault(vcpu, &e); 7695 kvm_inject_page_fault(vcpu, &e);
7697 return 1; 7696 return 1;
7698 } 7697 }
7698 if (operand.vpid >> 16) {
7699 nested_vmx_failValid(vcpu,
7700 VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
7701 return kvm_skip_emulated_instruction(vcpu);
7702 }
7699 7703
7700 switch (type) { 7704 switch (type) {
7701 case VMX_VPID_EXTENT_INDIVIDUAL_ADDR: 7705 case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
7706 if (is_noncanonical_address(operand.gla)) {
7707 nested_vmx_failValid(vcpu,
7708 VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
7709 return kvm_skip_emulated_instruction(vcpu);
7710 }
7711 /* fall through */
7702 case VMX_VPID_EXTENT_SINGLE_CONTEXT: 7712 case VMX_VPID_EXTENT_SINGLE_CONTEXT:
7703 case VMX_VPID_EXTENT_SINGLE_NON_GLOBAL: 7713 case VMX_VPID_EXTENT_SINGLE_NON_GLOBAL:
7704 if (!vpid) { 7714 if (!operand.vpid) {
7705 nested_vmx_failValid(vcpu, 7715 nested_vmx_failValid(vcpu,
7706 VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); 7716 VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
7707 return kvm_skip_emulated_instruction(vcpu); 7717 return kvm_skip_emulated_instruction(vcpu);
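The INVVPID rework reads the full 128-bit descriptor instead of a lone 32-bit VPID, which is what makes the new reserved-bit and canonical-address checks possible. The layout being decoded (per the SDM) and the reserved-bit test, as a sketch:

    #include <stdbool.h>
    #include <stdint.h>

    /* 128-bit INVVPID descriptor: VPID in bits 15:0, bits 63:16 reserved
     * and must be zero, linear address (individual-address type only) in
     * bits 127:64. */
    struct invvpid_desc {
        uint64_t vpid;  /* only bits 15:0 carry the VPID */
        uint64_t gla;
    };

    static bool invvpid_reserved_bits_clear(const struct invvpid_desc *d)
    {
        return (d->vpid >> 16) == 0;
    }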
@@ -9394,6 +9404,11 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
9394 vmcs12->guest_physical_address = fault->address; 9404 vmcs12->guest_physical_address = fault->address;
9395} 9405}
9396 9406
9407static bool nested_ept_ad_enabled(struct kvm_vcpu *vcpu)
9408{
9409 return nested_ept_get_cr3(vcpu) & VMX_EPT_AD_ENABLE_BIT;
9410}
9411
9397/* Callbacks for nested_ept_init_mmu_context: */ 9412/* Callbacks for nested_ept_init_mmu_context: */
9398 9413
9399static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu) 9414static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
@@ -9404,18 +9419,18 @@ static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
9404 9419
9405static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) 9420static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
9406{ 9421{
9407 u64 eptp; 9422 bool wants_ad;
9408 9423
9409 WARN_ON(mmu_is_nested(vcpu)); 9424 WARN_ON(mmu_is_nested(vcpu));
9410 eptp = nested_ept_get_cr3(vcpu); 9425 wants_ad = nested_ept_ad_enabled(vcpu);
9411 if ((eptp & VMX_EPT_AD_ENABLE_BIT) && !enable_ept_ad_bits) 9426 if (wants_ad && !enable_ept_ad_bits)
9412 return 1; 9427 return 1;
9413 9428
9414 kvm_mmu_unload(vcpu); 9429 kvm_mmu_unload(vcpu);
9415 kvm_init_shadow_ept_mmu(vcpu, 9430 kvm_init_shadow_ept_mmu(vcpu,
9416 to_vmx(vcpu)->nested.nested_vmx_ept_caps & 9431 to_vmx(vcpu)->nested.nested_vmx_ept_caps &
9417 VMX_EPT_EXECUTE_ONLY_BIT, 9432 VMX_EPT_EXECUTE_ONLY_BIT,
9418 eptp & VMX_EPT_AD_ENABLE_BIT); 9433 wants_ad);
9419 vcpu->arch.mmu.set_cr3 = vmx_set_cr3; 9434 vcpu->arch.mmu.set_cr3 = vmx_set_cr3;
9420 vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3; 9435 vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3;
9421 vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault; 9436 vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
@@ -10728,8 +10743,7 @@ static void sync_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
10728 vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3); 10743 vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
10729 } 10744 }
10730 10745
10731 if (nested_cpu_has_ept(vmcs12)) 10746 vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
10732 vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
10733 10747
10734 if (nested_cpu_has_vid(vmcs12)) 10748 if (nested_cpu_has_vid(vmcs12))
10735 vmcs12->guest_intr_status = vmcs_read16(GUEST_INTR_STATUS); 10749 vmcs12->guest_intr_status = vmcs_read16(GUEST_INTR_STATUS);
@@ -10754,8 +10768,6 @@ static void sync_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
10754 vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP); 10768 vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
10755 if (kvm_mpx_supported()) 10769 if (kvm_mpx_supported())
10756 vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS); 10770 vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
10757 if (nested_cpu_has_xsaves(vmcs12))
10758 vmcs12->xss_exit_bitmap = vmcs_read64(XSS_EXIT_BITMAP);
10759} 10771}
10760 10772
10761/* 10773/*
@@ -11152,7 +11164,8 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
11152 vmx->hv_deadline_tsc = tscl + delta_tsc; 11164 vmx->hv_deadline_tsc = tscl + delta_tsc;
11153 vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL, 11165 vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
11154 PIN_BASED_VMX_PREEMPTION_TIMER); 11166 PIN_BASED_VMX_PREEMPTION_TIMER);
11155 return 0; 11167
11168 return delta_tsc == 0;
11156} 11169}
11157 11170
11158static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu) 11171static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0e846f0cb83b..6c7266f7766d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2841,10 +2841,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
2841 kvm_vcpu_write_tsc_offset(vcpu, offset); 2841 kvm_vcpu_write_tsc_offset(vcpu, offset);
2842 vcpu->arch.tsc_catchup = 1; 2842 vcpu->arch.tsc_catchup = 1;
2843 } 2843 }
2844 if (kvm_lapic_hv_timer_in_use(vcpu) && 2844
2845 kvm_x86_ops->set_hv_timer(vcpu, 2845 if (kvm_lapic_hv_timer_in_use(vcpu))
2846 kvm_get_lapic_target_expiration_tsc(vcpu))) 2846 kvm_lapic_restart_hv_timer(vcpu);
2847 kvm_lapic_switch_to_sw_timer(vcpu); 2847
2848 /* 2848 /*
2849 * On a host with synchronized TSC, there is no need to update 2849 * On a host with synchronized TSC, there is no need to update
2850 * kvmclock on vcpu->cpu migration 2850 * kvmclock on vcpu->cpu migration
@@ -6011,7 +6011,7 @@ static void kvm_set_mmio_spte_mask(void)
6011 mask &= ~1ull; 6011 mask &= ~1ull;
6012#endif 6012#endif
6013 6013
6014 kvm_mmu_set_mmio_spte_mask(mask); 6014 kvm_mmu_set_mmio_spte_mask(mask, mask);
6015} 6015}
6016 6016
6017#ifdef CONFIG_X86_64 6017#ifdef CONFIG_X86_64
@@ -6733,7 +6733,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
6733 6733
6734 bool req_immediate_exit = false; 6734 bool req_immediate_exit = false;
6735 6735
6736 if (vcpu->requests) { 6736 if (kvm_request_pending(vcpu)) {
6737 if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu)) 6737 if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
6738 kvm_mmu_unload(vcpu); 6738 kvm_mmu_unload(vcpu);
6739 if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu)) 6739 if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
@@ -6897,7 +6897,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
6897 kvm_x86_ops->sync_pir_to_irr(vcpu); 6897 kvm_x86_ops->sync_pir_to_irr(vcpu);
6898 } 6898 }
6899 6899
6900 if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests 6900 if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
6901 || need_resched() || signal_pending(current)) { 6901 || need_resched() || signal_pending(current)) {
6902 vcpu->mode = OUTSIDE_GUEST_MODE; 6902 vcpu->mode = OUTSIDE_GUEST_MODE;
6903 smp_wmb(); 6903 smp_wmb();
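Both vcpu_enter_guest() call sites now go through kvm_request_pending(), i.e. one plain load of the whole request word, before any per-request test-and-clear runs. A self-contained model of that two-step pattern; struct vcpu and the request number here are stand-ins, not kernel definitions:

    #include <stdbool.h>
    #include <stdint.h>

    struct vcpu { uint64_t requests; };

    /* Cheap fast path: is any request bit set at all? */
    static bool request_pending(const struct vcpu *v)
    {
        return v->requests != 0;    /* the kernel wraps this in READ_ONCE() */
    }

    /* Slow path, run only when something is pending: test and clear one bit. */
    static bool check_request(struct vcpu *v, unsigned int req)
    {
        uint64_t bit = 1ull << req;

        if (!(v->requests & bit))
            return false;
        v->requests &= ~bit;        /* cleared atomically in the kernel */
        return true;
    }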
diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index 295584f31a4e..f0053f884b4a 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -57,9 +57,7 @@ struct arch_timer_cpu {
57 57
58int kvm_timer_hyp_init(void); 58int kvm_timer_hyp_init(void);
59int kvm_timer_enable(struct kvm_vcpu *vcpu); 59int kvm_timer_enable(struct kvm_vcpu *vcpu);
60int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu, 60int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu);
61 const struct kvm_irq_level *virt_irq,
62 const struct kvm_irq_level *phys_irq);
63void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu); 61void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu);
64void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu); 62void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu);
65void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu); 63void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu);
@@ -70,6 +68,10 @@ void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu);
70u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid); 68u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
71int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); 69int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
72 70
71int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
72int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
73int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
74
73bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx); 75bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx);
74void kvm_timer_schedule(struct kvm_vcpu *vcpu); 76void kvm_timer_schedule(struct kvm_vcpu *vcpu);
75void kvm_timer_unschedule(struct kvm_vcpu *vcpu); 77void kvm_timer_unschedule(struct kvm_vcpu *vcpu);
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 1ab4633adf4f..f6e030617467 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -35,6 +35,7 @@ struct kvm_pmu {
35 int irq_num; 35 int irq_num;
36 struct kvm_pmc pmc[ARMV8_PMU_MAX_COUNTERS]; 36 struct kvm_pmc pmc[ARMV8_PMU_MAX_COUNTERS];
37 bool ready; 37 bool ready;
38 bool created;
38 bool irq_level; 39 bool irq_level;
39}; 40};
40 41
@@ -63,6 +64,7 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu,
63 struct kvm_device_attr *attr); 64 struct kvm_device_attr *attr);
64int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, 65int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu,
65 struct kvm_device_attr *attr); 66 struct kvm_device_attr *attr);
67int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu);
66#else 68#else
67struct kvm_pmu { 69struct kvm_pmu {
68}; 70};
@@ -112,6 +114,10 @@ static inline int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu,
112{ 114{
113 return -ENXIO; 115 return -ENXIO;
114} 116}
117static inline int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu)
118{
119 return 0;
120}
115#endif 121#endif
116 122
117#endif 123#endif
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index ef718586321c..34dba516ef24 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -38,6 +38,10 @@
38#define VGIC_MIN_LPI 8192 38#define VGIC_MIN_LPI 8192
39#define KVM_IRQCHIP_NUM_PINS (1020 - 32) 39#define KVM_IRQCHIP_NUM_PINS (1020 - 32)
40 40
41#define irq_is_ppi(irq) ((irq) >= VGIC_NR_SGIS && (irq) < VGIC_NR_PRIVATE_IRQS)
42#define irq_is_spi(irq) ((irq) >= VGIC_NR_PRIVATE_IRQS && \
43 (irq) <= VGIC_MAX_SPI)
44
41enum vgic_type { 45enum vgic_type {
42 VGIC_V2, /* Good ol' GICv2 */ 46 VGIC_V2, /* Good ol' GICv2 */
43 VGIC_V3, /* New fancy GICv3 */ 47 VGIC_V3, /* New fancy GICv3 */
@@ -119,6 +123,9 @@ struct vgic_irq {
119 u8 source; /* GICv2 SGIs only */ 123 u8 source; /* GICv2 SGIs only */
120 u8 priority; 124 u8 priority;
121 enum vgic_irq_config config; /* Level or edge */ 125 enum vgic_irq_config config; /* Level or edge */
126
127 void *owner; /* Opaque pointer to reserve an interrupt
128 for in-kernel devices. */
122}; 129};
123 130
124struct vgic_register_region; 131struct vgic_register_region;
@@ -285,6 +292,7 @@ struct vgic_cpu {
285}; 292};
286 293
287extern struct static_key_false vgic_v2_cpuif_trap; 294extern struct static_key_false vgic_v2_cpuif_trap;
295extern struct static_key_false vgic_v3_cpuif_trap;
288 296
289int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr, bool write); 297int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr, bool write);
290void kvm_vgic_early_init(struct kvm *kvm); 298void kvm_vgic_early_init(struct kvm *kvm);
@@ -298,9 +306,7 @@ int kvm_vgic_hyp_init(void);
298void kvm_vgic_init_cpu_hardware(void); 306void kvm_vgic_init_cpu_hardware(void);
299 307
300int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int intid, 308int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int intid,
301 bool level); 309 bool level, void *owner);
302int kvm_vgic_inject_mapped_irq(struct kvm *kvm, int cpuid, unsigned int intid,
303 bool level);
304int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, u32 virt_irq, u32 phys_irq); 310int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, u32 virt_irq, u32 phys_irq);
305int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, unsigned int virt_irq); 311int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, unsigned int virt_irq);
306bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int virt_irq); 312bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int virt_irq);
@@ -341,4 +347,6 @@ int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi);
341 */ 347 */
342int kvm_vgic_setup_default_irq_routing(struct kvm *kvm); 348int kvm_vgic_setup_default_irq_routing(struct kvm *kvm);
343 349
350int kvm_vgic_set_owner(struct kvm_vcpu *vcpu, unsigned int intid, void *owner);
351
344#endif /* __KVM_ARM_VGIC_H */ 352#endif /* __KVM_ARM_VGIC_H */
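kvm_vgic_inject_irq() growing an owner argument pairs with the new kvm_vgic_set_owner(): an in-kernel user claims an interrupt once with an opaque cookie and must present the same cookie on every injection, so the vgic can refuse other users of that line. A kernel-context sketch of that usage from a hypothetical in-kernel device; the intid 27 and the ctx cookie are made up for illustration:

    #include <linux/kvm_host.h>
    #include <kvm/arm_vgic.h>

    /* Claim a private interrupt for an in-kernel device, then raise it while
     * proving ownership with the same opaque cookie. */
    static int claim_and_raise_ppi(struct kvm_vcpu *vcpu, void *ctx)
    {
        int ret = kvm_vgic_set_owner(vcpu, 27, ctx);

        if (ret)
            return ret;    /* interrupt already owned by someone else */
        return kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, 27, true, ctx);
    }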
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index 1fa293a37f4a..6a1f87ff94e2 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -405,6 +405,7 @@
405#define ICH_LR_PHYS_ID_SHIFT 32 405#define ICH_LR_PHYS_ID_SHIFT 32
406#define ICH_LR_PHYS_ID_MASK (0x3ffULL << ICH_LR_PHYS_ID_SHIFT) 406#define ICH_LR_PHYS_ID_MASK (0x3ffULL << ICH_LR_PHYS_ID_SHIFT)
407#define ICH_LR_PRIORITY_SHIFT 48 407#define ICH_LR_PRIORITY_SHIFT 48
408#define ICH_LR_PRIORITY_MASK (0xffULL << ICH_LR_PRIORITY_SHIFT)
408 409
409/* These are for GICv2 emulation only */ 410/* These are for GICv2 emulation only */
410#define GICH_LR_VIRTUALID (0x3ffUL << 0) 411#define GICH_LR_VIRTUALID (0x3ffUL << 0)
@@ -416,6 +417,11 @@
416 417
417#define ICH_HCR_EN (1 << 0) 418#define ICH_HCR_EN (1 << 0)
418#define ICH_HCR_UIE (1 << 1) 419#define ICH_HCR_UIE (1 << 1)
420#define ICH_HCR_TC (1 << 10)
421#define ICH_HCR_TALL0 (1 << 11)
422#define ICH_HCR_TALL1 (1 << 12)
423#define ICH_HCR_EOIcount_SHIFT 27
424#define ICH_HCR_EOIcount_MASK (0x1f << ICH_HCR_EOIcount_SHIFT)
419 425
420#define ICH_VMCR_ACK_CTL_SHIFT 2 426#define ICH_VMCR_ACK_CTL_SHIFT 2
421#define ICH_VMCR_ACK_CTL_MASK (1 << ICH_VMCR_ACK_CTL_SHIFT) 427#define ICH_VMCR_ACK_CTL_MASK (1 << ICH_VMCR_ACK_CTL_SHIFT)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8c0664309815..0b50e7b35ed4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -126,6 +126,13 @@ static inline bool is_error_page(struct page *page)
126#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 126#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
127#define KVM_REQ_PENDING_TIMER 2 127#define KVM_REQ_PENDING_TIMER 2
128#define KVM_REQ_UNHALT 3 128#define KVM_REQ_UNHALT 3
129#define KVM_REQUEST_ARCH_BASE 8
130
131#define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
132 BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
133 (unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
134})
135#define KVM_ARCH_REQ(nr) KVM_ARCH_REQ_FLAGS(nr, 0)
129 136
130#define KVM_USERSPACE_IRQ_SOURCE_ID 0 137#define KVM_USERSPACE_IRQ_SOURCE_ID 0
131#define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1 138#define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1
@@ -1098,6 +1105,11 @@ static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
1098 set_bit(req & KVM_REQUEST_MASK, &vcpu->requests); 1105 set_bit(req & KVM_REQUEST_MASK, &vcpu->requests);
1099} 1106}
1100 1107
1108static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
1109{
1110 return READ_ONCE(vcpu->requests);
1111}
1112
1101static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu) 1113static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
1102{ 1114{
1103 return test_bit(req & KVM_REQUEST_MASK, &vcpu->requests); 1115 return test_bit(req & KVM_REQUEST_MASK, &vcpu->requests);
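KVM_ARCH_REQ_FLAGS() gives each architecture its own request number space starting at KVM_REQUEST_ARCH_BASE, with the wait/no-wakeup behaviour chosen per request, and the BUILD_BUG_ON keeps arch request numbers below bit 32. How an architecture would mint requests with it -- the request names here are hypothetical:

    #include <linux/kvm_host.h>

    /* Hypothetical arch-private requests layered on the generic helpers. */
    #define KVM_REQ_EXAMPLE_SLEEP   KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT)
    #define KVM_REQ_EXAMPLE_RELOAD  KVM_ARCH_REQ(1)

    /* Typical producer side, using kvm_make_request() from this header. */
    static inline void example_kick(struct kvm_vcpu *vcpu)
    {
        kvm_make_request(KVM_REQ_EXAMPLE_RELOAD, vcpu);
    }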
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 577429a95ad8..c0b6dfec5f87 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -155,6 +155,35 @@ struct kvm_s390_skeys {
155 __u32 reserved[9]; 155 __u32 reserved[9];
156}; 156};
157 157
158#define KVM_S390_CMMA_PEEK (1 << 0)
159
160/**
161 * kvm_s390_cmma_log - Used for CMMA migration.
162 *
163 * Used both for input and output.
164 *
165 * @start_gfn: Guest page number to start from.
166 * @count: Size of the result buffer.
167 * @flags: Control operation mode via KVM_S390_CMMA_* flags
168 * @remaining: Used with KVM_S390_GET_CMMA_BITS. Indicates how many dirty
169 * pages are still remaining.
170 * @mask: Used with KVM_S390_SET_CMMA_BITS. Bitmap of bits to actually set
171 * in the PGSTE.
172 * @values: Pointer to the values buffer.
173 *
174 * Used in KVM_S390_{G,S}ET_CMMA_BITS ioctls.
175 */
176struct kvm_s390_cmma_log {
177 __u64 start_gfn;
178 __u32 count;
179 __u32 flags;
180 union {
181 __u64 remaining;
182 __u64 mask;
183 };
184 __u64 values;
185};
186
158struct kvm_hyperv_exit { 187struct kvm_hyperv_exit {
159#define KVM_EXIT_HYPERV_SYNIC 1 188#define KVM_EXIT_HYPERV_SYNIC 1
160#define KVM_EXIT_HYPERV_HCALL 2 189#define KVM_EXIT_HYPERV_HCALL 2
@@ -895,6 +924,9 @@ struct kvm_ppc_resize_hpt {
895#define KVM_CAP_SPAPR_TCE_VFIO 142 924#define KVM_CAP_SPAPR_TCE_VFIO 142
896#define KVM_CAP_X86_GUEST_MWAIT 143 925#define KVM_CAP_X86_GUEST_MWAIT 143
897#define KVM_CAP_ARM_USER_IRQ 144 926#define KVM_CAP_ARM_USER_IRQ 144
927#define KVM_CAP_S390_CMMA_MIGRATION 145
928#define KVM_CAP_PPC_FWNMI 146
929#define KVM_CAP_PPC_SMT_POSSIBLE 147
898 930
899#ifdef KVM_CAP_IRQ_ROUTING 931#ifdef KVM_CAP_IRQ_ROUTING
900 932
@@ -1318,6 +1350,9 @@ struct kvm_s390_ucas_mapping {
1318#define KVM_S390_GET_IRQ_STATE _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state) 1350#define KVM_S390_GET_IRQ_STATE _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
1319/* Available with KVM_CAP_X86_SMM */ 1351/* Available with KVM_CAP_X86_SMM */
1320#define KVM_SMI _IO(KVMIO, 0xb7) 1352#define KVM_SMI _IO(KVMIO, 0xb7)
1353/* Available with KVM_CAP_S390_CMMA_MIGRATION */
1354#define KVM_S390_GET_CMMA_BITS _IOW(KVMIO, 0xb8, struct kvm_s390_cmma_log)
1355#define KVM_S390_SET_CMMA_BITS _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
1321 1356
1322#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0) 1357#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
1323#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) 1358#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
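The two new ioctls round out the CMMA migration interface declared above. A userspace sketch of pulling one chunk of page-hinting values from a VM file descriptor; error handling is minimal, the installed headers are assumed to carry the new definitions, and KVM_CAP_S390_CMMA_MIGRATION is assumed to have been enabled on the VM beforehand:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Fetch up to 'count' CMMA values starting at start_gfn into buf.
     * On success, *remaining reports how many dirty pages are still left. */
    static int get_cmma_chunk(int vm_fd, uint64_t start_gfn, uint8_t *buf,
                              uint32_t count, uint64_t *remaining)
    {
        struct kvm_s390_cmma_log log;

        memset(&log, 0, sizeof(log));
        log.start_gfn = start_gfn;
        log.count = count;
        log.flags = 0;    /* see KVM_S390_CMMA_PEEK above for peek mode */
        log.values = (uint64_t)(uintptr_t)buf;

        if (ioctl(vm_fd, KVM_S390_GET_CMMA_BITS, &log) < 0)
            return -1;

        *remaining = log.remaining;
        return 0;
    }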
diff --git a/tools/kvm/kvm_stat/kvm_stat b/tools/kvm/kvm_stat/kvm_stat
index 8f74ed8e7237..dd8f00cfb8b4 100755
--- a/tools/kvm/kvm_stat/kvm_stat
+++ b/tools/kvm/kvm_stat/kvm_stat
@@ -295,114 +295,6 @@ class ArchS390(Arch):
295ARCH = Arch.get_arch() 295ARCH = Arch.get_arch()
296 296
297 297
298def walkdir(path):
299 """Returns os.walk() data for specified directory.
300
301 As it is only a wrapper it returns the same 3-tuple of (dirpath,
302 dirnames, filenames).
303 """
304 return next(os.walk(path))
305
306
307def parse_int_list(list_string):
308 """Returns an int list from a string of comma separated integers and
309 integer ranges."""
310 integers = []
311 members = list_string.split(',')
312
313 for member in members:
314 if '-' not in member:
315 integers.append(int(member))
316 else:
317 int_range = member.split('-')
318 integers.extend(range(int(int_range[0]),
319 int(int_range[1]) + 1))
320
321 return integers
322
323
324def get_pid_from_gname(gname):
325 """Fuzzy function to convert guest name to QEMU process pid.
326
327 Returns a list of potential pids, can be empty if no match found.
328 Throws an exception on processing errors.
329
330 """
331 pids = []
332 try:
333 child = subprocess.Popen(['ps', '-A', '--format', 'pid,args'],
334 stdout=subprocess.PIPE)
335 except:
336 raise Exception
337 for line in child.stdout:
338 line = line.lstrip().split(' ', 1)
339 # perform a sanity check before calling the more expensive
340 # function to possibly extract the guest name
341 if ' -name ' in line[1] and gname == get_gname_from_pid(line[0]):
342 pids.append(int(line[0]))
343 child.stdout.close()
344
345 return pids
346
347
348def get_gname_from_pid(pid):
349 """Returns the guest name for a QEMU process pid.
350
351 Extracts the guest name from the QEMU comma line by processing the '-name'
352 option. Will also handle names specified out of sequence.
353
354 """
355 name = ''
356 try:
357 line = open('/proc/{}/cmdline'.format(pid), 'rb').read().split('\0')
358 parms = line[line.index('-name') + 1].split(',')
359 while '' in parms:
360 # commas are escaped (i.e. ',,'), hence e.g. 'foo,bar' results in
361 # ['foo', '', 'bar'], which we revert here
362 idx = parms.index('')
363 parms[idx - 1] += ',' + parms[idx + 1]
364 del parms[idx:idx+2]
365 # the '-name' switch allows for two ways to specify the guest name,
366 # where the plain name overrides the name specified via 'guest='
367 for arg in parms:
368 if '=' not in arg:
369 name = arg
370 break
371 if arg[:6] == 'guest=':
372 name = arg[6:]
373 except (ValueError, IOError, IndexError):
374 pass
375
376 return name
377
378
379def get_online_cpus():
380 """Returns a list of cpu id integers."""
381 with open('/sys/devices/system/cpu/online') as cpu_list:
382 cpu_string = cpu_list.readline()
383 return parse_int_list(cpu_string)
384
385
386def get_filters():
387 """Returns a dict of trace events, their filter ids and
388 the values that can be filtered.
389
390 Trace events can be filtered for special values by setting a
391 filter string via an ioctl. The string normally has the format
392 identifier==value. For each filter a new event will be created, to
393 be able to distinguish the events.
394
395 """
396 filters = {}
397 filters['kvm_userspace_exit'] = ('reason', USERSPACE_EXIT_REASONS)
398 if ARCH.exit_reasons:
399 filters['kvm_exit'] = ('exit_reason', ARCH.exit_reasons)
400 return filters
401
402libc = ctypes.CDLL('libc.so.6', use_errno=True)
403syscall = libc.syscall
404
405
406class perf_event_attr(ctypes.Structure): 298class perf_event_attr(ctypes.Structure):
407 """Struct that holds the necessary data to set up a trace event. 299 """Struct that holds the necessary data to set up a trace event.
408 300
@@ -432,25 +324,6 @@ class perf_event_attr(ctypes.Structure):
432 self.read_format = PERF_FORMAT_GROUP 324 self.read_format = PERF_FORMAT_GROUP
433 325
434 326
435def perf_event_open(attr, pid, cpu, group_fd, flags):
436 """Wrapper for the sys_perf_evt_open() syscall.
437
438 Used to set up performance events, returns a file descriptor or -1
439 on error.
440
441 Attributes are:
442 - syscall number
443 - struct perf_event_attr *
444 - pid or -1 to monitor all pids
445 - cpu number or -1 to monitor all cpus
446 - The file descriptor of the group leader or -1 to create a group.
447 - flags
448
449 """
450 return syscall(ARCH.sc_perf_evt_open, ctypes.pointer(attr),
451 ctypes.c_int(pid), ctypes.c_int(cpu),
452 ctypes.c_int(group_fd), ctypes.c_long(flags))
453
454PERF_TYPE_TRACEPOINT = 2 327PERF_TYPE_TRACEPOINT = 2
455PERF_FORMAT_GROUP = 1 << 3 328PERF_FORMAT_GROUP = 1 << 3
456 329
@@ -495,6 +368,8 @@ class Event(object):
495 """Represents a performance event and manages its life cycle.""" 368 """Represents a performance event and manages its life cycle."""
496 def __init__(self, name, group, trace_cpu, trace_pid, trace_point, 369 def __init__(self, name, group, trace_cpu, trace_pid, trace_point,
497 trace_filter, trace_set='kvm'): 370 trace_filter, trace_set='kvm'):
371 self.libc = ctypes.CDLL('libc.so.6', use_errno=True)
372 self.syscall = self.libc.syscall
498 self.name = name 373 self.name = name
499 self.fd = None 374 self.fd = None
500 self.setup_event(group, trace_cpu, trace_pid, trace_point, 375 self.setup_event(group, trace_cpu, trace_pid, trace_point,
@@ -511,6 +386,25 @@ class Event(object):
511 if self.fd: 386 if self.fd:
512 os.close(self.fd) 387 os.close(self.fd)
513 388
389 def perf_event_open(self, attr, pid, cpu, group_fd, flags):
390 """Wrapper for the sys_perf_evt_open() syscall.
391
392 Used to set up performance events, returns a file descriptor or -1
393 on error.
394
395 Attributes are:
396 - syscall number
397 - struct perf_event_attr *
398 - pid or -1 to monitor all pids
399 - cpu number or -1 to monitor all cpus
400 - The file descriptor of the group leader or -1 to create a group.
401 - flags
402
403 """
404 return self.syscall(ARCH.sc_perf_evt_open, ctypes.pointer(attr),
405 ctypes.c_int(pid), ctypes.c_int(cpu),
406 ctypes.c_int(group_fd), ctypes.c_long(flags))
407
514 def setup_event_attribute(self, trace_set, trace_point): 408 def setup_event_attribute(self, trace_set, trace_point):
515 """Returns an initialized ctype perf_event_attr struct.""" 409 """Returns an initialized ctype perf_event_attr struct."""
516 410
@@ -539,8 +433,8 @@ class Event(object):
539 if group.events: 433 if group.events:
540 group_leader = group.events[0].fd 434 group_leader = group.events[0].fd
541 435
542 fd = perf_event_open(event_attr, trace_pid, 436 fd = self.perf_event_open(event_attr, trace_pid,
543 trace_cpu, group_leader, 0) 437 trace_cpu, group_leader, 0)
544 if fd == -1: 438 if fd == -1:
545 err = ctypes.get_errno() 439 err = ctypes.get_errno()
546 raise OSError(err, os.strerror(err), 440 raise OSError(err, os.strerror(err),
@@ -575,17 +469,53 @@ class Event(object):
575 fcntl.ioctl(self.fd, ARCH.ioctl_numbers['RESET'], 0) 469 fcntl.ioctl(self.fd, ARCH.ioctl_numbers['RESET'], 0)
576 470
577 471
578class TracepointProvider(object): 472class Provider(object):
473 """Encapsulates functionalities used by all providers."""
474 @staticmethod
475 def is_field_wanted(fields_filter, field):
476 """Indicate whether field is valid according to fields_filter."""
477 if not fields_filter:
478 return True
479 return re.match(fields_filter, field) is not None
480
481 @staticmethod
482 def walkdir(path):
483 """Returns os.walk() data for specified directory.
484
485 As it is only a wrapper it returns the same 3-tuple of (dirpath,
486 dirnames, filenames).
487 """
488 return next(os.walk(path))
489
490
491class TracepointProvider(Provider):
579 """Data provider for the stats class. 492 """Data provider for the stats class.
580 493
581 Manages the events/groups from which it acquires its data. 494 Manages the events/groups from which it acquires its data.
582 495
583 """ 496 """
584 def __init__(self): 497 def __init__(self, pid, fields_filter):
585 self.group_leaders = [] 498 self.group_leaders = []
586 self.filters = get_filters() 499 self.filters = self.get_filters()
587 self._fields = self.get_available_fields() 500 self.update_fields(fields_filter)
588 self._pid = 0 501 self.pid = pid
502
503 @staticmethod
504 def get_filters():
505 """Returns a dict of trace events, their filter ids and
506 the values that can be filtered.
507
508 Trace events can be filtered for special values by setting a
509 filter string via an ioctl. The string normally has the format
510 identifier==value. For each filter a new event will be created, to
511 be able to distinguish the events.
512
513 """
514 filters = {}
515 filters['kvm_userspace_exit'] = ('reason', USERSPACE_EXIT_REASONS)
516 if ARCH.exit_reasons:
517 filters['kvm_exit'] = ('exit_reason', ARCH.exit_reasons)
518 return filters
589 519
590 def get_available_fields(self): 520 def get_available_fields(self):
591 """Returns a list of available event's of format 'event name(filter 521 """Returns a list of available event's of format 'event name(filter
@@ -603,7 +533,7 @@ class TracepointProvider(object):
603 533
604 """ 534 """
605 path = os.path.join(PATH_DEBUGFS_TRACING, 'events', 'kvm') 535 path = os.path.join(PATH_DEBUGFS_TRACING, 'events', 'kvm')
606 fields = walkdir(path)[1] 536 fields = self.walkdir(path)[1]
607 extra = [] 537 extra = []
608 for field in fields: 538 for field in fields:
609 if field in self.filters: 539 if field in self.filters:
@@ -613,6 +543,34 @@ class TracepointProvider(object):
613 fields += extra 543 fields += extra
614 return fields 544 return fields
615 545
546 def update_fields(self, fields_filter):
547 """Refresh fields, applying fields_filter"""
548 self._fields = [field for field in self.get_available_fields()
549 if self.is_field_wanted(fields_filter, field)]
550
551 @staticmethod
552 def get_online_cpus():
553 """Returns a list of cpu id integers."""
554 def parse_int_list(list_string):
555 """Returns an int list from a string of comma separated integers and
556 integer ranges."""
557 integers = []
558 members = list_string.split(',')
559
560 for member in members:
561 if '-' not in member:
562 integers.append(int(member))
563 else:
564 int_range = member.split('-')
565 integers.extend(range(int(int_range[0]),
566 int(int_range[1]) + 1))
567
568 return integers
569
570 with open('/sys/devices/system/cpu/online') as cpu_list:
571 cpu_string = cpu_list.readline()
572 return parse_int_list(cpu_string)
573
616 def setup_traces(self): 574 def setup_traces(self):
617 """Creates all event and group objects needed to be able to retrieve 575 """Creates all event and group objects needed to be able to retrieve
618 data.""" 576 data."""
@@ -621,9 +579,9 @@ class TracepointProvider(object):
621 # Fetch list of all threads of the monitored pid, as qemu 579 # Fetch list of all threads of the monitored pid, as qemu
622 # starts a thread for each vcpu. 580 # starts a thread for each vcpu.
623 path = os.path.join('/proc', str(self._pid), 'task') 581 path = os.path.join('/proc', str(self._pid), 'task')
624 groupids = walkdir(path)[1] 582 groupids = self.walkdir(path)[1]
625 else: 583 else:
626 groupids = get_online_cpus() 584 groupids = self.get_online_cpus()
627 585
628 # The constant is needed as a buffer for python libs, std 586 # The constant is needed as a buffer for python libs, std
629 # streams and other files that the script opens. 587 # streams and other files that the script opens.
@@ -671,9 +629,6 @@ class TracepointProvider(object):
671 629
672 self.group_leaders.append(group) 630 self.group_leaders.append(group)
673 631
674 def available_fields(self):
675 return self.get_available_fields()
676
677 @property 632 @property
678 def fields(self): 633 def fields(self):
679 return self._fields 634 return self._fields
@@ -707,7 +662,7 @@ class TracepointProvider(object):
707 self.setup_traces() 662 self.setup_traces()
708 self.fields = self._fields 663 self.fields = self._fields
709 664
710 def read(self): 665 def read(self, by_guest=0):
711 """Returns 'event name: current value' for all enabled events.""" 666 """Returns 'event name: current value' for all enabled events."""
712 ret = defaultdict(int) 667 ret = defaultdict(int)
713 for group in self.group_leaders: 668 for group in self.group_leaders:
@@ -723,16 +678,17 @@ class TracepointProvider(object):
723 event.reset() 678 event.reset()
724 679
725 680
726class DebugfsProvider(object): 681class DebugfsProvider(Provider):
727 """Provides data from the files that KVM creates in the kvm debugfs 682 """Provides data from the files that KVM creates in the kvm debugfs
728 folder.""" 683 folder."""
729 def __init__(self): 684 def __init__(self, pid, fields_filter, include_past):
730 self._fields = self.get_available_fields() 685 self.update_fields(fields_filter)
731 self._baseline = {} 686 self._baseline = {}
732 self._pid = 0
733 self.do_read = True 687 self.do_read = True
734 self.paths = [] 688 self.paths = []
735 self.reset() 689 self.pid = pid
690 if include_past:
691 self.restore()
736 692
737 def get_available_fields(self): 693 def get_available_fields(self):
738 """"Returns a list of available fields. 694 """"Returns a list of available fields.
@@ -740,7 +696,12 @@ class DebugfsProvider(object):
740 The fields are all available KVM debugfs files 696 The fields are all available KVM debugfs files
741 697
742 """ 698 """
743 return walkdir(PATH_DEBUGFS_KVM)[2] 699 return self.walkdir(PATH_DEBUGFS_KVM)[2]
700
701 def update_fields(self, fields_filter):
702 """Refresh fields, applying fields_filter"""
703 self._fields = [field for field in self.get_available_fields()
704 if self.is_field_wanted(fields_filter, field)]
744 705
745 @property 706 @property
746 def fields(self): 707 def fields(self):
@@ -757,10 +718,9 @@ class DebugfsProvider(object):
757 718
758 @pid.setter 719 @pid.setter
759 def pid(self, pid): 720 def pid(self, pid):
721 self._pid = pid
760 if pid != 0: 722 if pid != 0:
761 self._pid = pid 723 vms = self.walkdir(PATH_DEBUGFS_KVM)[1]
762
763 vms = walkdir(PATH_DEBUGFS_KVM)[1]
764 if len(vms) == 0: 724 if len(vms) == 0:
765 self.do_read = False 725 self.do_read = False
766 726
@@ -771,8 +731,15 @@ class DebugfsProvider(object):
771 self.do_read = True 731 self.do_read = True
772 self.reset() 732 self.reset()
773 733
774 def read(self, reset=0): 734 def read(self, reset=0, by_guest=0):
775 """Returns a dict with format:'file name / field -> current value'.""" 735 """Returns a dict with format:'file name / field -> current value'.
736
737 Parameter 'reset':
738 0 plain read
739 1 reset field counts to 0
740 2 restore the original field counts
741
742 """
776 results = {} 743 results = {}
777 744
778 # If no debugfs filtering support is available, then don't read. 745 # If no debugfs filtering support is available, then don't read.
@@ -789,12 +756,22 @@ class DebugfsProvider(object):
789 for field in self._fields: 756 for field in self._fields:
790 value = self.read_field(field, path) 757 value = self.read_field(field, path)
791 key = path + field 758 key = path + field
792 if reset: 759 if reset == 1:
793 self._baseline[key] = value 760 self._baseline[key] = value
761 if reset == 2:
762 self._baseline[key] = 0
794 if self._baseline.get(key, -1) == -1: 763 if self._baseline.get(key, -1) == -1:
795 self._baseline[key] = value 764 self._baseline[key] = value
796 results[field] = (results.get(field, 0) + value - 765 increment = (results.get(field, 0) + value -
797 self._baseline.get(key, 0)) 766 self._baseline.get(key, 0))
767 if by_guest:
768 pid = key.split('-')[0]
769 if pid in results:
770 results[pid] += increment
771 else:
772 results[pid] = increment
773 else:
774 results[field] = increment
798 775
799 return results 776 return results
800 777
@@ -813,6 +790,11 @@ class DebugfsProvider(object):
813 self._baseline = {} 790 self._baseline = {}
814 self.read(1) 791 self.read(1)
815 792
793 def restore(self):
794 """Reset field counters"""
795 self._baseline = {}
796 self.read(2)
797
816 798
817class Stats(object): 799class Stats(object):
818 """Manages the data providers and the data they provide. 800 """Manages the data providers and the data they provide.
@@ -821,33 +803,32 @@ class Stats(object):
821 provider data. 803 provider data.
822 804
823 """ 805 """
824 def __init__(self, providers, pid, fields=None): 806 def __init__(self, options):
825 self.providers = providers 807 self.providers = self.get_providers(options)
826 self._pid_filter = pid 808 self._pid_filter = options.pid
827 self._fields_filter = fields 809 self._fields_filter = options.fields
828 self.values = {} 810 self.values = {}
829 self.update_provider_pid() 811
830 self.update_provider_filters() 812 @staticmethod
813 def get_providers(options):
814 """Returns a list of data providers depending on the passed options."""
815 providers = []
816
817 if options.debugfs:
818 providers.append(DebugfsProvider(options.pid, options.fields,
819 options.dbgfs_include_past))
820 if options.tracepoints or not providers:
821 providers.append(TracepointProvider(options.pid, options.fields))
822
823 return providers
831 824
832 def update_provider_filters(self): 825 def update_provider_filters(self):
833 """Propagates fields filters to providers.""" 826 """Propagates fields filters to providers."""
834 def wanted(key):
835 if not self._fields_filter:
836 return True
837 return re.match(self._fields_filter, key) is not None
838
839 # As we reset the counters when updating the fields we can 827 # As we reset the counters when updating the fields we can
840 # also clear the cache of old values. 828 # also clear the cache of old values.
841 self.values = {} 829 self.values = {}
842 for provider in self.providers: 830 for provider in self.providers:
843 provider_fields = [key for key in provider.get_available_fields() 831 provider.update_fields(self._fields_filter)
844 if wanted(key)]
845 provider.fields = provider_fields
846
847 def update_provider_pid(self):
848 """Propagates pid filters to providers."""
849 for provider in self.providers:
850 provider.pid = self._pid_filter
851 832
852 def reset(self): 833 def reset(self):
853 self.values = {} 834 self.values = {}
@@ -873,27 +854,52 @@ class Stats(object):
873 if pid != self._pid_filter: 854 if pid != self._pid_filter:
874 self._pid_filter = pid 855 self._pid_filter = pid
875 self.values = {} 856 self.values = {}
876 self.update_provider_pid() 857 for provider in self.providers:
858 provider.pid = self._pid_filter
877 859
878 def get(self): 860 def get(self, by_guest=0):
879 """Returns a dict with field -> (value, delta to last value) of all 861 """Returns a dict with field -> (value, delta to last value) of all
880 provider data.""" 862 provider data."""
881 for provider in self.providers: 863 for provider in self.providers:
882 new = provider.read() 864 new = provider.read(by_guest=by_guest)
883 for key in provider.fields: 865 for key in new if by_guest else provider.fields:
884 oldval = self.values.get(key, (0, 0))[0] 866 oldval = self.values.get(key, (0, 0))[0]
885 newval = new.get(key, 0) 867 newval = new.get(key, 0)
886 newdelta = newval - oldval 868 newdelta = newval - oldval
887 self.values[key] = (newval, newdelta) 869 self.values[key] = (newval, newdelta)
888 return self.values 870 return self.values
889 871
890LABEL_WIDTH = 40 872 def toggle_display_guests(self, to_pid):
891NUMBER_WIDTH = 10 873 """Toggle between collection of stats by individual event and by
892DELAY_INITIAL = 0.25 874 guest pid
893DELAY_REGULAR = 3.0 875
876 Events reported by DebugfsProvider change when switching to/from
877 reading by guest values. Hence we have to remove the excess event
878 names from self.values.
879
880 """
881 if any(isinstance(ins, TracepointProvider) for ins in self.providers):
882 return 1
883 if to_pid:
884 for provider in self.providers:
885 if isinstance(provider, DebugfsProvider):
886 for key in provider.fields:
887 if key in self.values.keys():
888 del self.values[key]
889 else:
890 oldvals = self.values.copy()
891 for key in oldvals:
892 if key.isdigit():
893 del self.values[key]
894 # Update oldval (see get())
895 self.get(to_pid)
896 return 0
897
898DELAY_DEFAULT = 3.0
894MAX_GUEST_NAME_LEN = 48 899MAX_GUEST_NAME_LEN = 48
895MAX_REGEX_LEN = 44 900MAX_REGEX_LEN = 44
896DEFAULT_REGEX = r'^[^\(]*$' 901DEFAULT_REGEX = r'^[^\(]*$'
902SORT_DEFAULT = 0
897 903
898 904
899class Tui(object): 905class Tui(object):
@@ -901,7 +907,10 @@ class Tui(object):
901 def __init__(self, stats): 907 def __init__(self, stats):
902 self.stats = stats 908 self.stats = stats
903 self.screen = None 909 self.screen = None
904 self.update_drilldown() 910 self._delay_initial = 0.25
911 self._delay_regular = DELAY_DEFAULT
912 self._sorting = SORT_DEFAULT
913 self._display_guests = 0
905 914
906 def __enter__(self): 915 def __enter__(self):
907 """Initialises curses for later use. Based on curses.wrapper 916 """Initialises curses for later use. Based on curses.wrapper
@@ -929,7 +938,7 @@ class Tui(object):
929 return self 938 return self
930 939
931 def __exit__(self, *exception): 940 def __exit__(self, *exception):
932 """Resets the terminal to its normal state. Based on curses.wrappre 941 """Resets the terminal to its normal state. Based on curses.wrapper
933 implementation from the Python standard library.""" 942 implementation from the Python standard library."""
934 if self.screen: 943 if self.screen:
935 self.screen.keypad(0) 944 self.screen.keypad(0)
@@ -937,6 +946,86 @@ class Tui(object):
937 curses.nocbreak() 946 curses.nocbreak()
938 curses.endwin() 947 curses.endwin()
939 948
949 def get_all_gnames(self):
950 """Returns a list of (pid, gname) tuples of all running guests"""
951 res = []
952 try:
953 child = subprocess.Popen(['ps', '-A', '--format', 'pid,args'],
954 stdout=subprocess.PIPE)
955 except:
956 raise Exception
957 for line in child.stdout:
958 line = line.lstrip().split(' ', 1)
959 # perform a sanity check before calling the more expensive
960 # function to possibly extract the guest name
961 if ' -name ' in line[1]:
962 res.append((line[0], self.get_gname_from_pid(line[0])))
963 child.stdout.close()
964
965 return res
966
967 def print_all_gnames(self, row):
968 """Print a list of all running guests along with their pids."""
969 self.screen.addstr(row, 2, '%8s %-60s' %
970 ('Pid', 'Guest Name (fuzzy list, might be '
971 'inaccurate!)'),
972 curses.A_UNDERLINE)
973 row += 1
974 try:
975 for line in self.get_all_gnames():
976 self.screen.addstr(row, 2, '%8s %-60s' % (line[0], line[1]))
977 row += 1
978 if row >= self.screen.getmaxyx()[0]:
979 break
980 except Exception:
981 self.screen.addstr(row + 1, 2, 'Not available')
982
983 def get_pid_from_gname(self, gname):
984 """Fuzzy function to convert guest name to QEMU process pid.
985
986 Returns a list of potential pids, can be empty if no match found.
987 Throws an exception on processing errors.
988
989 """
990 pids = []
991 for line in self.get_all_gnames():
992 if gname == line[1]:
993 pids.append(int(line[0]))
994
995 return pids
996
997 @staticmethod
998 def get_gname_from_pid(pid):
999 """Returns the guest name for a QEMU process pid.
1000
 1001 Extracts the guest name from the QEMU command line by processing the
1002 '-name' option. Will also handle names specified out of sequence.
1003
1004 """
1005 name = ''
1006 try:
1007 line = open('/proc/{}/cmdline'
1008 .format(pid), 'rb').read().split('\0')
1009 parms = line[line.index('-name') + 1].split(',')
1010 while '' in parms:
 1011 # a literal comma in the name is escaped as ',,', hence e.g. 'foo,bar'
 1012 # results in ['foo', '', 'bar'] after the split, which we revert here
1013 idx = parms.index('')
1014 parms[idx - 1] += ',' + parms[idx + 1]
1015 del parms[idx:idx+2]
1016 # the '-name' switch allows for two ways to specify the guest name,
1017 # where the plain name overrides the name specified via 'guest='
1018 for arg in parms:
1019 if '=' not in arg:
1020 name = arg
1021 break
1022 if arg[:6] == 'guest=':
1023 name = arg[6:]
1024 except (ValueError, IOError, IndexError):
1025 pass
1026
1027 return name
1028
940 def update_drilldown(self): 1029 def update_drilldown(self):
941 """Sets or removes a filter that only allows fields without braces.""" 1030 """Sets or removes a filter that only allows fields without braces."""
942 if not self.stats.fields_filter: 1031 if not self.stats.fields_filter:
@@ -954,7 +1043,7 @@ class Tui(object):
954 if pid is None: 1043 if pid is None:
955 pid = self.stats.pid_filter 1044 pid = self.stats.pid_filter
956 self.screen.erase() 1045 self.screen.erase()
957 gname = get_gname_from_pid(pid) 1046 gname = self.get_gname_from_pid(pid)
958 if gname: 1047 if gname:
959 gname = ('({})'.format(gname[:MAX_GUEST_NAME_LEN] + '...' 1048 gname = ('({})'.format(gname[:MAX_GUEST_NAME_LEN] + '...'
960 if len(gname) > MAX_GUEST_NAME_LEN 1049 if len(gname) > MAX_GUEST_NAME_LEN
@@ -970,13 +1059,13 @@ class Tui(object):
970 if len(regex) > MAX_REGEX_LEN: 1059 if len(regex) > MAX_REGEX_LEN:
971 regex = regex[:MAX_REGEX_LEN] + '...' 1060 regex = regex[:MAX_REGEX_LEN] + '...'
972 self.screen.addstr(1, 17, 'regex filter: {0}'.format(regex)) 1061 self.screen.addstr(1, 17, 'regex filter: {0}'.format(regex))
973 self.screen.addstr(2, 1, 'Event') 1062 if self._display_guests:
974 self.screen.addstr(2, 1 + LABEL_WIDTH + NUMBER_WIDTH - 1063 col_name = 'Guest Name'
975 len('Total'), 'Total') 1064 else:
976 self.screen.addstr(2, 1 + LABEL_WIDTH + NUMBER_WIDTH + 7 - 1065 col_name = 'Event'
977 len('%Total'), '%Total') 1066 self.screen.addstr(2, 1, '%-40s %10s%7s %8s' %
978 self.screen.addstr(2, 1 + LABEL_WIDTH + NUMBER_WIDTH + 7 + 8 - 1067 (col_name, 'Total', '%Total', 'CurAvg/s'),
979 len('Current'), 'Current') 1068 curses.A_STANDOUT)
980 self.screen.addstr(4, 1, 'Collecting data...') 1069 self.screen.addstr(4, 1, 'Collecting data...')
981 self.screen.refresh() 1070 self.screen.refresh()
982 1071
@@ -984,16 +1073,25 @@ class Tui(object):
984 row = 3 1073 row = 3
985 self.screen.move(row, 0) 1074 self.screen.move(row, 0)
986 self.screen.clrtobot() 1075 self.screen.clrtobot()
987 stats = self.stats.get() 1076 stats = self.stats.get(self._display_guests)
988 1077
989 def sortkey(x): 1078 def sortCurAvg(x):
1079 # sort by current events if available
990 if stats[x][1]: 1080 if stats[x][1]:
991 return (-stats[x][1], -stats[x][0]) 1081 return (-stats[x][1], -stats[x][0])
992 else: 1082 else:
993 return (0, -stats[x][0]) 1083 return (0, -stats[x][0])
1084
1085 def sortTotal(x):
1086 # sort by totals
1087 return (0, -stats[x][0])
994 total = 0. 1088 total = 0.
995 for val in stats.values(): 1089 for val in stats.values():
996 total += val[0] 1090 total += val[0]
1091 if self._sorting == SORT_DEFAULT:
1092 sortkey = sortCurAvg
1093 else:
1094 sortkey = sortTotal
997 for key in sorted(stats.keys(), key=sortkey): 1095 for key in sorted(stats.keys(), key=sortkey):
998 1096
999 if row >= self.screen.getmaxyx()[0]: 1097 if row >= self.screen.getmaxyx()[0]:
@@ -1001,18 +1099,61 @@ class Tui(object):
1001 values = stats[key] 1099 values = stats[key]
1002 if not values[0] and not values[1]: 1100 if not values[0] and not values[1]:
1003 break 1101 break
1004 col = 1 1102 if values[0] is not None:
1005 self.screen.addstr(row, col, key) 1103 cur = int(round(values[1] / sleeptime)) if values[1] else ''
1006 col += LABEL_WIDTH 1104 if self._display_guests:
1007 self.screen.addstr(row, col, '%10d' % (values[0],)) 1105 key = self.get_gname_from_pid(key)
1008 col += NUMBER_WIDTH 1106 self.screen.addstr(row, 1, '%-40s %10d%7.1f %8s' %
1009 self.screen.addstr(row, col, '%7.1f' % (values[0] * 100 / total,)) 1107 (key, values[0], values[0] * 100 / total,
1010 col += 7 1108 cur))
1011 if values[1] is not None:
1012 self.screen.addstr(row, col, '%8d' % (values[1] / sleeptime,))
1013 row += 1 1109 row += 1
1110 if row == 3:
1111 self.screen.addstr(4, 1, 'No matching events reported yet')
1014 self.screen.refresh() 1112 self.screen.refresh()
1015 1113
1114 def show_msg(self, text):
 1115 """Display a centered message and wait for a key press."""
1116 hint = 'Press any key to continue'
1117 curses.cbreak()
1118 self.screen.erase()
1119 (x, term_width) = self.screen.getmaxyx()
1120 row = 2
1121 for line in text:
1122 start = (term_width - len(line)) / 2
1123 self.screen.addstr(row, start, line)
1124 row += 1
1125 self.screen.addstr(row + 1, (term_width - len(hint)) / 2, hint,
1126 curses.A_STANDOUT)
1127 self.screen.getkey()
1128
1129 def show_help_interactive(self):
1130 """Display help with list of interactive commands"""
1131 msg = (' b toggle events by guests (debugfs only, honors'
1132 ' filters)',
1133 ' c clear filter',
1134 ' f filter by regular expression',
1135 ' g filter by guest name',
1136 ' h display interactive commands reference',
1137 ' o toggle sorting order (Total vs CurAvg/s)',
1138 ' p filter by PID',
1139 ' q quit',
1140 ' r reset stats',
1141 ' s set update interval',
1142 ' x toggle reporting of stats for individual child trace'
1143 ' events',
1144 'Any other key refreshes statistics immediately')
1145 curses.cbreak()
1146 self.screen.erase()
1147 self.screen.addstr(0, 0, "Interactive commands reference",
1148 curses.A_BOLD)
1149 self.screen.addstr(2, 0, "Press any key to exit", curses.A_STANDOUT)
1150 row = 4
1151 for line in msg:
1152 self.screen.addstr(row, 0, line)
1153 row += 1
1154 self.screen.getkey()
1155 self.refresh_header()
1156
1016 def show_filter_selection(self): 1157 def show_filter_selection(self):
1017 """Draws filter selection mask. 1158 """Draws filter selection mask.
1018 1159
@@ -1059,6 +1200,7 @@ class Tui(object):
1059 'This might limit the shown data to the trace ' 1200 'This might limit the shown data to the trace '
1060 'statistics.') 1201 'statistics.')
1061 self.screen.addstr(5, 0, msg) 1202 self.screen.addstr(5, 0, msg)
1203 self.print_all_gnames(7)
1062 1204
1063 curses.echo() 1205 curses.echo()
1064 self.screen.addstr(3, 0, "Pid [0 or pid]: ") 1206 self.screen.addstr(3, 0, "Pid [0 or pid]: ")
@@ -1077,10 +1219,40 @@ class Tui(object):
1077 self.refresh_header(pid) 1219 self.refresh_header(pid)
1078 self.update_pid(pid) 1220 self.update_pid(pid)
1079 break 1221 break
1080
1081 except ValueError: 1222 except ValueError:
1082 msg = '"' + str(pid) + '": Not a valid pid' 1223 msg = '"' + str(pid) + '": Not a valid pid'
1083 continue 1224
1225 def show_set_update_interval(self):
1226 """Draws update interval selection mask."""
1227 msg = ''
1228 while True:
1229 self.screen.erase()
1230 self.screen.addstr(0, 0, 'Set update interval (defaults to %fs).' %
1231 DELAY_DEFAULT, curses.A_BOLD)
1232 self.screen.addstr(4, 0, msg)
1233 self.screen.addstr(2, 0, 'Change delay from %.1fs to ' %
1234 self._delay_regular)
1235 curses.echo()
1236 val = self.screen.getstr()
1237 curses.noecho()
1238
1239 try:
1240 if len(val) > 0:
1241 delay = float(val)
1242 if delay < 0.1:
1243 msg = '"' + str(val) + '": Value must be >=0.1'
1244 continue
1245 if delay > 25.5:
1246 msg = '"' + str(val) + '": Value must be <=25.5'
1247 continue
1248 else:
1249 delay = DELAY_DEFAULT
1250 self._delay_regular = delay
1251 break
1252
1253 except ValueError:
1254 msg = '"' + str(val) + '": Invalid value'
1255 self.refresh_header()
1084 1256
1085 def show_vm_selection_by_guest_name(self): 1257 def show_vm_selection_by_guest_name(self):
1086 """Draws guest selection mask. 1258 """Draws guest selection mask.
@@ -1098,6 +1270,7 @@ class Tui(object):
1098 'This might limit the shown data to the trace ' 1270 'This might limit the shown data to the trace '
1099 'statistics.') 1271 'statistics.')
1100 self.screen.addstr(5, 0, msg) 1272 self.screen.addstr(5, 0, msg)
1273 self.print_all_gnames(7)
1101 curses.echo() 1274 curses.echo()
1102 self.screen.addstr(3, 0, "Guest [ENTER or guest]: ") 1275 self.screen.addstr(3, 0, "Guest [ENTER or guest]: ")
1103 gname = self.screen.getstr() 1276 gname = self.screen.getstr()
@@ -1110,7 +1283,7 @@ class Tui(object):
1110 else: 1283 else:
1111 pids = [] 1284 pids = []
1112 try: 1285 try:
1113 pids = get_pid_from_gname(gname) 1286 pids = self.get_pid_from_gname(gname)
1114 except: 1287 except:
1115 msg = '"' + gname + '": Internal error while searching, ' \ 1288 msg = '"' + gname + '": Internal error while searching, ' \
1116 'use pid filter instead' 1289 'use pid filter instead'
@@ -1128,38 +1301,60 @@ class Tui(object):
1128 1301
1129 def show_stats(self): 1302 def show_stats(self):
1130 """Refreshes the screen and processes user input.""" 1303 """Refreshes the screen and processes user input."""
1131 sleeptime = DELAY_INITIAL 1304 sleeptime = self._delay_initial
1132 self.refresh_header() 1305 self.refresh_header()
1306 start = 0.0 # result based on init value never appears on screen
1133 while True: 1307 while True:
1134 self.refresh_body(sleeptime) 1308 self.refresh_body(time.time() - start)
1135 curses.halfdelay(int(sleeptime * 10)) 1309 curses.halfdelay(int(sleeptime * 10))
1136 sleeptime = DELAY_REGULAR 1310 start = time.time()
1311 sleeptime = self._delay_regular
1137 try: 1312 try:
1138 char = self.screen.getkey() 1313 char = self.screen.getkey()
1139 if char == 'x': 1314 if char == 'b':
1315 self._display_guests = not self._display_guests
1316 if self.stats.toggle_display_guests(self._display_guests):
1317 self.show_msg(['Command not available with tracepoints'
1318 ' enabled', 'Restart with debugfs only '
1319 '(see option \'-d\') and try again!'])
1320 self._display_guests = not self._display_guests
1140 self.refresh_header() 1321 self.refresh_header()
1141 self.update_drilldown()
1142 sleeptime = DELAY_INITIAL
1143 if char == 'q':
1144 break
1145 if char == 'c': 1322 if char == 'c':
1146 self.stats.fields_filter = DEFAULT_REGEX 1323 self.stats.fields_filter = DEFAULT_REGEX
1147 self.refresh_header(0) 1324 self.refresh_header(0)
1148 self.update_pid(0) 1325 self.update_pid(0)
1149 sleeptime = DELAY_INITIAL
1150 if char == 'f': 1326 if char == 'f':
1327 curses.curs_set(1)
1151 self.show_filter_selection() 1328 self.show_filter_selection()
1152 sleeptime = DELAY_INITIAL 1329 curses.curs_set(0)
1330 sleeptime = self._delay_initial
1153 if char == 'g': 1331 if char == 'g':
1332 curses.curs_set(1)
1154 self.show_vm_selection_by_guest_name() 1333 self.show_vm_selection_by_guest_name()
1155 sleeptime = DELAY_INITIAL 1334 curses.curs_set(0)
1335 sleeptime = self._delay_initial
1336 if char == 'h':
1337 self.show_help_interactive()
1338 if char == 'o':
1339 self._sorting = not self._sorting
1156 if char == 'p': 1340 if char == 'p':
1341 curses.curs_set(1)
1157 self.show_vm_selection_by_pid() 1342 self.show_vm_selection_by_pid()
1158 sleeptime = DELAY_INITIAL 1343 curses.curs_set(0)
1344 sleeptime = self._delay_initial
1345 if char == 'q':
1346 break
1159 if char == 'r': 1347 if char == 'r':
1160 self.refresh_header()
1161 self.stats.reset() 1348 self.stats.reset()
1162 sleeptime = DELAY_INITIAL 1349 if char == 's':
1350 curses.curs_set(1)
1351 self.show_set_update_interval()
1352 curses.curs_set(0)
1353 sleeptime = self._delay_initial
1354 if char == 'x':
1355 self.update_drilldown()
1356 # prevents display of current values on next refresh
1357 self.stats.get()
1163 except KeyboardInterrupt: 1358 except KeyboardInterrupt:
1164 break 1359 break
1165 except curses.error: 1360 except curses.error:
@@ -1227,13 +1422,17 @@ Requirements:
1227 the large number of files that are possibly opened. 1422 the large number of files that are possibly opened.
1228 1423
1229Interactive Commands: 1424Interactive Commands:
1425 b toggle events by guests (debugfs only, honors filters)
1230 c clear filter 1426 c clear filter
1231 f filter by regular expression 1427 f filter by regular expression
1232 g filter by guest name 1428 g filter by guest name
1429 h display interactive commands reference
1430 o toggle sorting order (Total vs CurAvg/s)
1233 p filter by PID 1431 p filter by PID
1234 q quit 1432 q quit
1235 x toggle reporting of stats for individual child trace events
1236 r reset stats 1433 r reset stats
1434 s set update interval
1435 x toggle reporting of stats for individual child trace events
1237Press any other key to refresh statistics immediately. 1436Press any other key to refresh statistics immediately.
1238""" 1437"""
1239 1438
@@ -1246,7 +1445,7 @@ Press any other key to refresh statistics immediately.
1246 1445
1247 def cb_guest_to_pid(option, opt, val, parser): 1446 def cb_guest_to_pid(option, opt, val, parser):
1248 try: 1447 try:
1249 pids = get_pid_from_gname(val) 1448 pids = Tui.get_pid_from_gname(val)
1250 except: 1449 except:
1251 raise optparse.OptionValueError('Error while searching for guest ' 1450 raise optparse.OptionValueError('Error while searching for guest '
1252 '"{}", use "-p" to specify a pid ' 1451 '"{}", use "-p" to specify a pid '
@@ -1268,6 +1467,13 @@ Press any other key to refresh statistics immediately.
1268 dest='once', 1467 dest='once',
1269 help='run in batch mode for one second', 1468 help='run in batch mode for one second',
1270 ) 1469 )
1470 optparser.add_option('-i', '--debugfs-include-past',
1471 action='store_true',
1472 default=False,
1473 dest='dbgfs_include_past',
1474 help='include all available data on past events for '
1475 'debugfs',
1476 )
1271 optparser.add_option('-l', '--log', 1477 optparser.add_option('-l', '--log',
1272 action='store_true', 1478 action='store_true',
1273 default=False, 1479 default=False,
@@ -1288,7 +1494,7 @@ Press any other key to refresh statistics immediately.
1288 ) 1494 )
1289 optparser.add_option('-f', '--fields', 1495 optparser.add_option('-f', '--fields',
1290 action='store', 1496 action='store',
1291 default=None, 1497 default=DEFAULT_REGEX,
1292 dest='fields', 1498 dest='fields',
1293 help='fields to display (regex)', 1499 help='fields to display (regex)',
1294 ) 1500 )
@@ -1311,20 +1517,6 @@ Press any other key to refresh statistics immediately.
1311 return options 1517 return options
1312 1518
1313 1519
1314def get_providers(options):
1315 """Returns a list of data providers depending on the passed options."""
1316 providers = []
1317
1318 if options.tracepoints:
1319 providers.append(TracepointProvider())
1320 if options.debugfs:
1321 providers.append(DebugfsProvider())
1322 if len(providers) == 0:
1323 providers.append(TracepointProvider())
1324
1325 return providers
1326
1327
1328def check_access(options): 1520def check_access(options):
1329 """Exits if the current user can't access all needed directories.""" 1521 """Exits if the current user can't access all needed directories."""
1330 if not os.path.exists('/sys/kernel/debug'): 1522 if not os.path.exists('/sys/kernel/debug'):
@@ -1365,8 +1557,7 @@ def main():
1365 sys.stderr.write('Did you use a (unsupported) tid instead of a pid?\n') 1557 sys.stderr.write('Did you use a (unsupported) tid instead of a pid?\n')
1366 sys.exit('Specified pid does not exist.') 1558 sys.exit('Specified pid does not exist.')
1367 1559
1368 providers = get_providers(options) 1560 stats = Stats(options)
1369 stats = Stats(providers, options.pid, fields=options.fields)
1370 1561
1371 if options.log: 1562 if options.log:
1372 log(stats) 1563 log(stats)
diff --git a/tools/kvm/kvm_stat/kvm_stat.txt b/tools/kvm/kvm_stat/kvm_stat.txt
index 109431bdc63c..e5cf836be8a1 100644
--- a/tools/kvm/kvm_stat/kvm_stat.txt
+++ b/tools/kvm/kvm_stat/kvm_stat.txt
@@ -29,18 +29,26 @@ meaning of events.
29INTERACTIVE COMMANDS 29INTERACTIVE COMMANDS
30-------------------- 30--------------------
31[horizontal] 31[horizontal]
32*b*:: toggle events by guests (debugfs only, honors filters)
33
32*c*:: clear filter 34*c*:: clear filter
33 35
34*f*:: filter by regular expression 36*f*:: filter by regular expression
35 37
36*g*:: filter by guest name 38*g*:: filter by guest name
37 39
40*h*:: display interactive commands reference
41
42*o*:: toggle sorting order (Total vs CurAvg/s)
43
38*p*:: filter by PID 44*p*:: filter by PID
39 45
40*q*:: quit 46*q*:: quit
41 47
42*r*:: reset stats 48*r*:: reset stats
43 49
50*s*:: set update interval
51
44*x*:: toggle reporting of stats for child trace events 52*x*:: toggle reporting of stats for child trace events
45 53
46Press any other key to refresh statistics immediately. 54Press any other key to refresh statistics immediately.
@@ -64,6 +72,10 @@ OPTIONS
64--debugfs:: 72--debugfs::
65 retrieve statistics from debugfs 73 retrieve statistics from debugfs
66 74
75-i::
76--debugfs-include-past::
77 include all available data on past events for debugfs
78
67-p<pid>:: 79-p<pid>::
68--pid=<pid>:: 80--pid=<pid>::
69 limit statistics to one virtual machine (pid) 81 limit statistics to one virtual machine (pid)
diff --git a/virt/kvm/arm/aarch32.c b/virt/kvm/arm/aarch32.c
index 528af4b2d09e..79c7c357804b 100644
--- a/virt/kvm/arm/aarch32.c
+++ b/virt/kvm/arm/aarch32.c
@@ -60,7 +60,7 @@ static const unsigned short cc_map[16] = {
60/* 60/*
61 * Check if a trapped instruction should have been executed or not. 61 * Check if a trapped instruction should have been executed or not.
62 */ 62 */
63bool kvm_condition_valid32(const struct kvm_vcpu *vcpu) 63bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu)
64{ 64{
65 unsigned long cpsr; 65 unsigned long cpsr;
66 u32 cpsr_cond; 66 u32 cpsr_cond;
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 5976609ef27c..8e89d63005c7 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -21,6 +21,7 @@
21#include <linux/kvm_host.h> 21#include <linux/kvm_host.h>
22#include <linux/interrupt.h> 22#include <linux/interrupt.h>
23#include <linux/irq.h> 23#include <linux/irq.h>
24#include <linux/uaccess.h>
24 25
25#include <clocksource/arm_arch_timer.h> 26#include <clocksource/arm_arch_timer.h>
26#include <asm/arch_timer.h> 27#include <asm/arch_timer.h>
@@ -35,6 +36,16 @@ static struct timecounter *timecounter;
35static unsigned int host_vtimer_irq; 36static unsigned int host_vtimer_irq;
36static u32 host_vtimer_irq_flags; 37static u32 host_vtimer_irq_flags;
37 38
39static const struct kvm_irq_level default_ptimer_irq = {
40 .irq = 30,
41 .level = 1,
42};
43
44static const struct kvm_irq_level default_vtimer_irq = {
45 .irq = 27,
46 .level = 1,
47};
48
38void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu) 49void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
39{ 50{
40 vcpu_vtimer(vcpu)->active_cleared_last = false; 51 vcpu_vtimer(vcpu)->active_cleared_last = false;
@@ -95,7 +106,7 @@ static void kvm_timer_inject_irq_work(struct work_struct *work)
95 * If the vcpu is blocked we want to wake it up so that it will see 106 * If the vcpu is blocked we want to wake it up so that it will see
96 * the timer has expired when entering the guest. 107 * the timer has expired when entering the guest.
97 */ 108 */
98 kvm_vcpu_kick(vcpu); 109 kvm_vcpu_wake_up(vcpu);
99} 110}
100 111
101static u64 kvm_timer_compute_delta(struct arch_timer_context *timer_ctx) 112static u64 kvm_timer_compute_delta(struct arch_timer_context *timer_ctx)
@@ -215,7 +226,8 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
215 if (likely(irqchip_in_kernel(vcpu->kvm))) { 226 if (likely(irqchip_in_kernel(vcpu->kvm))) {
216 ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, 227 ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
217 timer_ctx->irq.irq, 228 timer_ctx->irq.irq,
218 timer_ctx->irq.level); 229 timer_ctx->irq.level,
230 timer_ctx);
219 WARN_ON(ret); 231 WARN_ON(ret);
220 } 232 }
221} 233}
@@ -445,23 +457,12 @@ void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu)
445 kvm_timer_update_state(vcpu); 457 kvm_timer_update_state(vcpu);
446} 458}
447 459
448int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu, 460int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu)
449 const struct kvm_irq_level *virt_irq,
450 const struct kvm_irq_level *phys_irq)
451{ 461{
452 struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); 462 struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
453 struct arch_timer_context *ptimer = vcpu_ptimer(vcpu); 463 struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
454 464
455 /* 465 /*
456 * The vcpu timer irq number cannot be determined in
457 * kvm_timer_vcpu_init() because it is called much before
458 * kvm_vcpu_set_target(). To handle this, we determine
459 * vcpu timer irq number when the vcpu is reset.
460 */
461 vtimer->irq.irq = virt_irq->irq;
462 ptimer->irq.irq = phys_irq->irq;
463
464 /*
465 * The bits in CNTV_CTL are architecturally reset to UNKNOWN for ARMv8 466 * The bits in CNTV_CTL are architecturally reset to UNKNOWN for ARMv8
466 * and to 0 for ARMv7. We provide an implementation that always 467 * and to 0 for ARMv7. We provide an implementation that always
467 * resets the timer to be disabled and unmasked and is compliant with 468 * resets the timer to be disabled and unmasked and is compliant with
@@ -496,6 +497,8 @@ static void update_vtimer_cntvoff(struct kvm_vcpu *vcpu, u64 cntvoff)
496void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu) 497void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
497{ 498{
498 struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 499 struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
500 struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
501 struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
499 502
500 /* Synchronize cntvoff across all vtimers of a VM. */ 503 /* Synchronize cntvoff across all vtimers of a VM. */
501 update_vtimer_cntvoff(vcpu, kvm_phys_timer_read()); 504 update_vtimer_cntvoff(vcpu, kvm_phys_timer_read());
@@ -504,6 +507,9 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
504 INIT_WORK(&timer->expired, kvm_timer_inject_irq_work); 507 INIT_WORK(&timer->expired, kvm_timer_inject_irq_work);
505 hrtimer_init(&timer->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); 508 hrtimer_init(&timer->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
506 timer->timer.function = kvm_timer_expire; 509 timer->timer.function = kvm_timer_expire;
510
511 vtimer->irq.irq = default_vtimer_irq.irq;
512 ptimer->irq.irq = default_ptimer_irq.irq;
507} 513}
508 514
509static void kvm_timer_init_interrupt(void *info) 515static void kvm_timer_init_interrupt(void *info)
@@ -613,6 +619,30 @@ void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu)
613 kvm_vgic_unmap_phys_irq(vcpu, vtimer->irq.irq); 619 kvm_vgic_unmap_phys_irq(vcpu, vtimer->irq.irq);
614} 620}
615 621
622static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu)
623{
624 int vtimer_irq, ptimer_irq;
625 int i, ret;
626
627 vtimer_irq = vcpu_vtimer(vcpu)->irq.irq;
628 ret = kvm_vgic_set_owner(vcpu, vtimer_irq, vcpu_vtimer(vcpu));
629 if (ret)
630 return false;
631
632 ptimer_irq = vcpu_ptimer(vcpu)->irq.irq;
633 ret = kvm_vgic_set_owner(vcpu, ptimer_irq, vcpu_ptimer(vcpu));
634 if (ret)
635 return false;
636
637 kvm_for_each_vcpu(i, vcpu, vcpu->kvm) {
638 if (vcpu_vtimer(vcpu)->irq.irq != vtimer_irq ||
639 vcpu_ptimer(vcpu)->irq.irq != ptimer_irq)
640 return false;
641 }
642
643 return true;
644}
645
616int kvm_timer_enable(struct kvm_vcpu *vcpu) 646int kvm_timer_enable(struct kvm_vcpu *vcpu)
617{ 647{
618 struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 648 struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
@@ -632,6 +662,11 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
632 if (!vgic_initialized(vcpu->kvm)) 662 if (!vgic_initialized(vcpu->kvm))
633 return -ENODEV; 663 return -ENODEV;
634 664
665 if (!timer_irqs_are_valid(vcpu)) {
666 kvm_debug("incorrectly configured timer irqs\n");
667 return -EINVAL;
668 }
669
635 /* 670 /*
636 * Find the physical IRQ number corresponding to the host_vtimer_irq 671 * Find the physical IRQ number corresponding to the host_vtimer_irq
637 */ 672 */
@@ -681,3 +716,79 @@ void kvm_timer_init_vhe(void)
681 val |= (CNTHCTL_EL1PCTEN << cnthctl_shift); 716 val |= (CNTHCTL_EL1PCTEN << cnthctl_shift);
682 write_sysreg(val, cnthctl_el2); 717 write_sysreg(val, cnthctl_el2);
683} 718}
719
720static void set_timer_irqs(struct kvm *kvm, int vtimer_irq, int ptimer_irq)
721{
722 struct kvm_vcpu *vcpu;
723 int i;
724
725 kvm_for_each_vcpu(i, vcpu, kvm) {
726 vcpu_vtimer(vcpu)->irq.irq = vtimer_irq;
727 vcpu_ptimer(vcpu)->irq.irq = ptimer_irq;
728 }
729}
730
731int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
732{
733 int __user *uaddr = (int __user *)(long)attr->addr;
734 struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
735 struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
736 int irq;
737
738 if (!irqchip_in_kernel(vcpu->kvm))
739 return -EINVAL;
740
741 if (get_user(irq, uaddr))
742 return -EFAULT;
743
744 if (!(irq_is_ppi(irq)))
745 return -EINVAL;
746
747 if (vcpu->arch.timer_cpu.enabled)
748 return -EBUSY;
749
750 switch (attr->attr) {
751 case KVM_ARM_VCPU_TIMER_IRQ_VTIMER:
752 set_timer_irqs(vcpu->kvm, irq, ptimer->irq.irq);
753 break;
754 case KVM_ARM_VCPU_TIMER_IRQ_PTIMER:
755 set_timer_irqs(vcpu->kvm, vtimer->irq.irq, irq);
756 break;
757 default:
758 return -ENXIO;
759 }
760
761 return 0;
762}
763
764int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
765{
766 int __user *uaddr = (int __user *)(long)attr->addr;
767 struct arch_timer_context *timer;
768 int irq;
769
770 switch (attr->attr) {
771 case KVM_ARM_VCPU_TIMER_IRQ_VTIMER:
772 timer = vcpu_vtimer(vcpu);
773 break;
774 case KVM_ARM_VCPU_TIMER_IRQ_PTIMER:
775 timer = vcpu_ptimer(vcpu);
776 break;
777 default:
778 return -ENXIO;
779 }
780
781 irq = timer->irq.irq;
782 return put_user(irq, uaddr);
783}
784
785int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
786{
787 switch (attr->attr) {
788 case KVM_ARM_VCPU_TIMER_IRQ_VTIMER:
789 case KVM_ARM_VCPU_TIMER_IRQ_PTIMER:
790 return 0;
791 }
792
793 return -ENXIO;
794}
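
The three kvm_arm_timer_*_attr() handlers above expose the per-VCPU timer interrupt numbers to userspace. As a rough sketch (not part of this series) of how a VMM might drive them: the attribute group name KVM_ARM_VCPU_TIMER_CTRL and issuing KVM_SET_DEVICE_ATTR on the vcpu fd are assumptions here, see Documentation/virtual/kvm/devices/vcpu.txt for the authoritative interface.

/*
 * Hypothetical userspace sketch: select the guest's virtual or physical
 * timer PPI via the per-VCPU device attributes backed by
 * kvm_arm_timer_set_attr(). KVM_ARM_VCPU_TIMER_CTRL is an assumed group
 * name; the attr values are the ones added by this patch.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_timer_irq(int vcpu_fd, uint64_t which, uint32_t ppi)
{
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_TIMER_CTRL,	/* assumed group name */
		.attr  = which,		/* KVM_ARM_VCPU_TIMER_IRQ_VTIMER or ..._PTIMER */
		.addr  = (uint64_t)(unsigned long)&ppi,	/* pointer to a PPI number (16..31) */
	};

	/* Must happen before the first KVM_RUN; kvm_arm_timer_set_attr()
	 * returns -EBUSY once the timer has been enabled. */
	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}

Because set_timer_irqs() propagates the chosen PPI to every VCPU of the VM, configuring a single VCPU before any of them has run is sufficient; kvm_timer_enable() later rejects mismatched or non-PPI values via timer_irqs_are_valid().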
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 3417e184c8e1..a39a1e161e63 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -368,6 +368,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
368 kvm_timer_vcpu_put(vcpu); 368 kvm_timer_vcpu_put(vcpu);
369} 369}
370 370
371static void vcpu_power_off(struct kvm_vcpu *vcpu)
372{
373 vcpu->arch.power_off = true;
374 kvm_make_request(KVM_REQ_SLEEP, vcpu);
375 kvm_vcpu_kick(vcpu);
376}
377
371int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu, 378int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
372 struct kvm_mp_state *mp_state) 379 struct kvm_mp_state *mp_state)
373{ 380{
@@ -387,7 +394,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
387 vcpu->arch.power_off = false; 394 vcpu->arch.power_off = false;
388 break; 395 break;
389 case KVM_MP_STATE_STOPPED: 396 case KVM_MP_STATE_STOPPED:
390 vcpu->arch.power_off = true; 397 vcpu_power_off(vcpu);
391 break; 398 break;
392 default: 399 default:
393 return -EINVAL; 400 return -EINVAL;
@@ -520,6 +527,10 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
520 } 527 }
521 528
522 ret = kvm_timer_enable(vcpu); 529 ret = kvm_timer_enable(vcpu);
530 if (ret)
531 return ret;
532
533 ret = kvm_arm_pmu_v3_enable(vcpu);
523 534
524 return ret; 535 return ret;
525} 536}
@@ -536,21 +547,7 @@ void kvm_arm_halt_guest(struct kvm *kvm)
536 547
537 kvm_for_each_vcpu(i, vcpu, kvm) 548 kvm_for_each_vcpu(i, vcpu, kvm)
538 vcpu->arch.pause = true; 549 vcpu->arch.pause = true;
539 kvm_make_all_cpus_request(kvm, KVM_REQ_VCPU_EXIT); 550 kvm_make_all_cpus_request(kvm, KVM_REQ_SLEEP);
540}
541
542void kvm_arm_halt_vcpu(struct kvm_vcpu *vcpu)
543{
544 vcpu->arch.pause = true;
545 kvm_vcpu_kick(vcpu);
546}
547
548void kvm_arm_resume_vcpu(struct kvm_vcpu *vcpu)
549{
550 struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu);
551
552 vcpu->arch.pause = false;
553 swake_up(wq);
554} 551}
555 552
556void kvm_arm_resume_guest(struct kvm *kvm) 553void kvm_arm_resume_guest(struct kvm *kvm)
@@ -558,16 +555,23 @@ void kvm_arm_resume_guest(struct kvm *kvm)
558 int i; 555 int i;
559 struct kvm_vcpu *vcpu; 556 struct kvm_vcpu *vcpu;
560 557
561 kvm_for_each_vcpu(i, vcpu, kvm) 558 kvm_for_each_vcpu(i, vcpu, kvm) {
562 kvm_arm_resume_vcpu(vcpu); 559 vcpu->arch.pause = false;
560 swake_up(kvm_arch_vcpu_wq(vcpu));
561 }
563} 562}
564 563
565static void vcpu_sleep(struct kvm_vcpu *vcpu) 564static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
566{ 565{
567 struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu); 566 struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu);
568 567
569 swait_event_interruptible(*wq, ((!vcpu->arch.power_off) && 568 swait_event_interruptible(*wq, ((!vcpu->arch.power_off) &&
570 (!vcpu->arch.pause))); 569 (!vcpu->arch.pause)));
570
571 if (vcpu->arch.power_off || vcpu->arch.pause) {
572 /* Awaken to handle a signal, request we sleep again later. */
573 kvm_make_request(KVM_REQ_SLEEP, vcpu);
574 }
571} 575}
572 576
573static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu) 577static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
@@ -575,6 +579,20 @@ static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
575 return vcpu->arch.target >= 0; 579 return vcpu->arch.target >= 0;
576} 580}
577 581
582static void check_vcpu_requests(struct kvm_vcpu *vcpu)
583{
584 if (kvm_request_pending(vcpu)) {
585 if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
586 vcpu_req_sleep(vcpu);
587
588 /*
589 * Clear IRQ_PENDING requests that were made to guarantee
590 * that a VCPU sees new virtual interrupts.
591 */
592 kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
593 }
594}
595
578/** 596/**
579 * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code 597 * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
580 * @vcpu: The VCPU pointer 598 * @vcpu: The VCPU pointer
@@ -620,8 +638,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
620 638
621 update_vttbr(vcpu->kvm); 639 update_vttbr(vcpu->kvm);
622 640
623 if (vcpu->arch.power_off || vcpu->arch.pause) 641 check_vcpu_requests(vcpu);
624 vcpu_sleep(vcpu);
625 642
626 /* 643 /*
627 * Preparing the interrupts to be injected also 644 * Preparing the interrupts to be injected also
@@ -650,8 +667,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
650 run->exit_reason = KVM_EXIT_INTR; 667 run->exit_reason = KVM_EXIT_INTR;
651 } 668 }
652 669
670 /*
671 * Ensure we set mode to IN_GUEST_MODE after we disable
672 * interrupts and before the final VCPU requests check.
673 * See the comment in kvm_vcpu_exiting_guest_mode() and
674 * Documentation/virtual/kvm/vcpu-requests.rst
675 */
676 smp_store_mb(vcpu->mode, IN_GUEST_MODE);
677
653 if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) || 678 if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) ||
654 vcpu->arch.power_off || vcpu->arch.pause) { 679 kvm_request_pending(vcpu)) {
680 vcpu->mode = OUTSIDE_GUEST_MODE;
655 local_irq_enable(); 681 local_irq_enable();
656 kvm_pmu_sync_hwstate(vcpu); 682 kvm_pmu_sync_hwstate(vcpu);
657 kvm_timer_sync_hwstate(vcpu); 683 kvm_timer_sync_hwstate(vcpu);
@@ -667,7 +693,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
667 */ 693 */
668 trace_kvm_entry(*vcpu_pc(vcpu)); 694 trace_kvm_entry(*vcpu_pc(vcpu));
669 guest_enter_irqoff(); 695 guest_enter_irqoff();
670 vcpu->mode = IN_GUEST_MODE;
671 696
672 ret = kvm_call_hyp(__kvm_vcpu_run, vcpu); 697 ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
673 698
@@ -756,6 +781,7 @@ static int vcpu_interrupt_line(struct kvm_vcpu *vcpu, int number, bool level)
756 * trigger a world-switch round on the running physical CPU to set the 781 * trigger a world-switch round on the running physical CPU to set the
757 * virtual IRQ/FIQ fields in the HCR appropriately. 782 * virtual IRQ/FIQ fields in the HCR appropriately.
758 */ 783 */
784 kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
759 kvm_vcpu_kick(vcpu); 785 kvm_vcpu_kick(vcpu);
760 786
761 return 0; 787 return 0;
@@ -806,7 +832,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
806 if (irq_num < VGIC_NR_SGIS || irq_num >= VGIC_NR_PRIVATE_IRQS) 832 if (irq_num < VGIC_NR_SGIS || irq_num >= VGIC_NR_PRIVATE_IRQS)
807 return -EINVAL; 833 return -EINVAL;
808 834
809 return kvm_vgic_inject_irq(kvm, vcpu->vcpu_id, irq_num, level); 835 return kvm_vgic_inject_irq(kvm, vcpu->vcpu_id, irq_num, level, NULL);
810 case KVM_ARM_IRQ_TYPE_SPI: 836 case KVM_ARM_IRQ_TYPE_SPI:
811 if (!irqchip_in_kernel(kvm)) 837 if (!irqchip_in_kernel(kvm))
812 return -ENXIO; 838 return -ENXIO;
@@ -814,7 +840,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
814 if (irq_num < VGIC_NR_PRIVATE_IRQS) 840 if (irq_num < VGIC_NR_PRIVATE_IRQS)
815 return -EINVAL; 841 return -EINVAL;
816 842
817 return kvm_vgic_inject_irq(kvm, 0, irq_num, level); 843 return kvm_vgic_inject_irq(kvm, 0, irq_num, level, NULL);
818 } 844 }
819 845
820 return -EINVAL; 846 return -EINVAL;
@@ -884,7 +910,7 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
884 * Handle the "start in power-off" case. 910 * Handle the "start in power-off" case.
885 */ 911 */
886 if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features)) 912 if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features))
887 vcpu->arch.power_off = true; 913 vcpu_power_off(vcpu);
888 else 914 else
889 vcpu->arch.power_off = false; 915 vcpu->arch.power_off = false;
890 916
@@ -1115,9 +1141,6 @@ static void cpu_init_hyp_mode(void *dummy)
1115 __cpu_init_hyp_mode(pgd_ptr, hyp_stack_ptr, vector_ptr); 1141 __cpu_init_hyp_mode(pgd_ptr, hyp_stack_ptr, vector_ptr);
1116 __cpu_init_stage2(); 1142 __cpu_init_stage2();
1117 1143
1118 if (is_kernel_in_hyp_mode())
1119 kvm_timer_init_vhe();
1120
1121 kvm_arm_init_debug(); 1144 kvm_arm_init_debug();
1122} 1145}
1123 1146
@@ -1137,6 +1160,7 @@ static void cpu_hyp_reinit(void)
1137 * event was cancelled before the CPU was reset. 1160 * event was cancelled before the CPU was reset.
1138 */ 1161 */
1139 __cpu_init_stage2(); 1162 __cpu_init_stage2();
1163 kvm_timer_init_vhe();
1140 } else { 1164 } else {
1141 cpu_init_hyp_mode(NULL); 1165 cpu_init_hyp_mode(NULL);
1142 } 1166 }
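
The arm.c hunks above replace the open-coded power_off/pause checks with the generic VCPU request machinery: a requester calls kvm_make_request() and kvm_vcpu_kick(), while the VCPU thread publishes IN_GUEST_MODE with smp_store_mb() and only then performs the final kvm_request_pending() check. A minimal user-space analogue of that ordering, using C11 seq_cst atomics and purely illustrative names (this is a model, not kernel code), looks like:

/* User-space model of the request/kick pairing; the atomics stand in for
 * the kernel's barriers and the counter for the wake-up IPI. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

static atomic_int mode = OUTSIDE_GUEST_MODE;
static atomic_int requests;	/* plays the role of vcpu->requests */
static atomic_int kicks;	/* plays the role of kvm_vcpu_kick() */

static void *vcpu_thread(void *unused)
{
	(void)unused;
	/* smp_store_mb(vcpu->mode, IN_GUEST_MODE): publish the mode, then
	 * do the final request check before "entering the guest". */
	atomic_store(&mode, IN_GUEST_MODE);
	if (atomic_load(&requests))			/* kvm_request_pending() */
		atomic_store(&mode, OUTSIDE_GUEST_MODE);	/* abort guest entry */
	return NULL;
}

static void *requester_thread(void *unused)
{
	(void)unused;
	atomic_fetch_or(&requests, 1);			/* kvm_make_request() */
	if (atomic_load(&mode) == IN_GUEST_MODE)
		atomic_fetch_add(&kicks, 1);		/* send the kick/IPI */
	return NULL;
}

int main(void)
{
	pthread_t vcpu, req;

	pthread_create(&vcpu, NULL, vcpu_thread, NULL);
	pthread_create(&req, NULL, requester_thread, NULL);
	pthread_join(vcpu, NULL);
	pthread_join(req, NULL);
	/* Invariant: either the VCPU thread saw the request and aborted
	 * entry, or the requester saw IN_GUEST_MODE and sent a kick. */
	printf("requests=%d kicks=%d mode=%d\n", atomic_load(&requests),
	       atomic_load(&kicks), atomic_load(&mode));
	return 0;
}

Either the VCPU-side load observes the request and bails out before entering the guest, or the requester observes IN_GUEST_MODE and issues the kick; with the mode store placed after the checks, as in the pre-patch code, both sides could miss each other, which is what moving the store before the final check and turning it into smp_store_mb() prevents.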
diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
index 87940364570b..91728faa13fd 100644
--- a/virt/kvm/arm/hyp/vgic-v3-sr.c
+++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
@@ -19,10 +19,12 @@
19#include <linux/irqchip/arm-gic-v3.h> 19#include <linux/irqchip/arm-gic-v3.h>
20#include <linux/kvm_host.h> 20#include <linux/kvm_host.h>
21 21
22#include <asm/kvm_emulate.h>
22#include <asm/kvm_hyp.h> 23#include <asm/kvm_hyp.h>
23 24
24#define vtr_to_max_lr_idx(v) ((v) & 0xf) 25#define vtr_to_max_lr_idx(v) ((v) & 0xf)
25#define vtr_to_nr_pre_bits(v) ((((u32)(v) >> 26) & 7) + 1) 26#define vtr_to_nr_pre_bits(v) ((((u32)(v) >> 26) & 7) + 1)
27#define vtr_to_nr_apr_regs(v) (1 << (vtr_to_nr_pre_bits(v) - 5))
26 28
27static u64 __hyp_text __gic_v3_get_lr(unsigned int lr) 29static u64 __hyp_text __gic_v3_get_lr(unsigned int lr)
28{ 30{
@@ -118,6 +120,90 @@ static void __hyp_text __gic_v3_set_lr(u64 val, int lr)
118 } 120 }
119} 121}
120 122
123static void __hyp_text __vgic_v3_write_ap0rn(u32 val, int n)
124{
125 switch (n) {
126 case 0:
127 write_gicreg(val, ICH_AP0R0_EL2);
128 break;
129 case 1:
130 write_gicreg(val, ICH_AP0R1_EL2);
131 break;
132 case 2:
133 write_gicreg(val, ICH_AP0R2_EL2);
134 break;
135 case 3:
136 write_gicreg(val, ICH_AP0R3_EL2);
137 break;
138 }
139}
140
141static void __hyp_text __vgic_v3_write_ap1rn(u32 val, int n)
142{
143 switch (n) {
144 case 0:
145 write_gicreg(val, ICH_AP1R0_EL2);
146 break;
147 case 1:
148 write_gicreg(val, ICH_AP1R1_EL2);
149 break;
150 case 2:
151 write_gicreg(val, ICH_AP1R2_EL2);
152 break;
153 case 3:
154 write_gicreg(val, ICH_AP1R3_EL2);
155 break;
156 }
157}
158
159static u32 __hyp_text __vgic_v3_read_ap0rn(int n)
160{
161 u32 val;
162
163 switch (n) {
164 case 0:
165 val = read_gicreg(ICH_AP0R0_EL2);
166 break;
167 case 1:
168 val = read_gicreg(ICH_AP0R1_EL2);
169 break;
170 case 2:
171 val = read_gicreg(ICH_AP0R2_EL2);
172 break;
173 case 3:
174 val = read_gicreg(ICH_AP0R3_EL2);
175 break;
176 default:
177 unreachable();
178 }
179
180 return val;
181}
182
183static u32 __hyp_text __vgic_v3_read_ap1rn(int n)
184{
185 u32 val;
186
187 switch (n) {
188 case 0:
189 val = read_gicreg(ICH_AP1R0_EL2);
190 break;
191 case 1:
192 val = read_gicreg(ICH_AP1R1_EL2);
193 break;
194 case 2:
195 val = read_gicreg(ICH_AP1R2_EL2);
196 break;
197 case 3:
198 val = read_gicreg(ICH_AP1R3_EL2);
199 break;
200 default:
201 unreachable();
202 }
203
204 return val;
205}
206
121void __hyp_text __vgic_v3_save_state(struct kvm_vcpu *vcpu) 207void __hyp_text __vgic_v3_save_state(struct kvm_vcpu *vcpu)
122{ 208{
123 struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3; 209 struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;
@@ -154,24 +240,27 @@ void __hyp_text __vgic_v3_save_state(struct kvm_vcpu *vcpu)
154 240
155 switch (nr_pre_bits) { 241 switch (nr_pre_bits) {
156 case 7: 242 case 7:
157 cpu_if->vgic_ap0r[3] = read_gicreg(ICH_AP0R3_EL2); 243 cpu_if->vgic_ap0r[3] = __vgic_v3_read_ap0rn(3);
158 cpu_if->vgic_ap0r[2] = read_gicreg(ICH_AP0R2_EL2); 244 cpu_if->vgic_ap0r[2] = __vgic_v3_read_ap0rn(2);
159 case 6: 245 case 6:
160 cpu_if->vgic_ap0r[1] = read_gicreg(ICH_AP0R1_EL2); 246 cpu_if->vgic_ap0r[1] = __vgic_v3_read_ap0rn(1);
161 default: 247 default:
162 cpu_if->vgic_ap0r[0] = read_gicreg(ICH_AP0R0_EL2); 248 cpu_if->vgic_ap0r[0] = __vgic_v3_read_ap0rn(0);
163 } 249 }
164 250
165 switch (nr_pre_bits) { 251 switch (nr_pre_bits) {
166 case 7: 252 case 7:
167 cpu_if->vgic_ap1r[3] = read_gicreg(ICH_AP1R3_EL2); 253 cpu_if->vgic_ap1r[3] = __vgic_v3_read_ap1rn(3);
168 cpu_if->vgic_ap1r[2] = read_gicreg(ICH_AP1R2_EL2); 254 cpu_if->vgic_ap1r[2] = __vgic_v3_read_ap1rn(2);
169 case 6: 255 case 6:
170 cpu_if->vgic_ap1r[1] = read_gicreg(ICH_AP1R1_EL2); 256 cpu_if->vgic_ap1r[1] = __vgic_v3_read_ap1rn(1);
171 default: 257 default:
172 cpu_if->vgic_ap1r[0] = read_gicreg(ICH_AP1R0_EL2); 258 cpu_if->vgic_ap1r[0] = __vgic_v3_read_ap1rn(0);
173 } 259 }
174 } else { 260 } else {
261 if (static_branch_unlikely(&vgic_v3_cpuif_trap))
262 write_gicreg(0, ICH_HCR_EL2);
263
175 cpu_if->vgic_elrsr = 0xffff; 264 cpu_if->vgic_elrsr = 0xffff;
176 cpu_if->vgic_ap0r[0] = 0; 265 cpu_if->vgic_ap0r[0] = 0;
177 cpu_if->vgic_ap0r[1] = 0; 266 cpu_if->vgic_ap0r[1] = 0;
@@ -224,26 +313,34 @@ void __hyp_text __vgic_v3_restore_state(struct kvm_vcpu *vcpu)
224 313
225 switch (nr_pre_bits) { 314 switch (nr_pre_bits) {
226 case 7: 315 case 7:
227 write_gicreg(cpu_if->vgic_ap0r[3], ICH_AP0R3_EL2); 316 __vgic_v3_write_ap0rn(cpu_if->vgic_ap0r[3], 3);
228 write_gicreg(cpu_if->vgic_ap0r[2], ICH_AP0R2_EL2); 317 __vgic_v3_write_ap0rn(cpu_if->vgic_ap0r[2], 2);
229 case 6: 318 case 6:
230 write_gicreg(cpu_if->vgic_ap0r[1], ICH_AP0R1_EL2); 319 __vgic_v3_write_ap0rn(cpu_if->vgic_ap0r[1], 1);
231 default: 320 default:
232 write_gicreg(cpu_if->vgic_ap0r[0], ICH_AP0R0_EL2); 321 __vgic_v3_write_ap0rn(cpu_if->vgic_ap0r[0], 0);
233 } 322 }
234 323
235 switch (nr_pre_bits) { 324 switch (nr_pre_bits) {
236 case 7: 325 case 7:
237 write_gicreg(cpu_if->vgic_ap1r[3], ICH_AP1R3_EL2); 326 __vgic_v3_write_ap1rn(cpu_if->vgic_ap1r[3], 3);
238 write_gicreg(cpu_if->vgic_ap1r[2], ICH_AP1R2_EL2); 327 __vgic_v3_write_ap1rn(cpu_if->vgic_ap1r[2], 2);
239 case 6: 328 case 6:
240 write_gicreg(cpu_if->vgic_ap1r[1], ICH_AP1R1_EL2); 329 __vgic_v3_write_ap1rn(cpu_if->vgic_ap1r[1], 1);
241 default: 330 default:
242 write_gicreg(cpu_if->vgic_ap1r[0], ICH_AP1R0_EL2); 331 __vgic_v3_write_ap1rn(cpu_if->vgic_ap1r[0], 0);
243 } 332 }
244 333
245 for (i = 0; i < used_lrs; i++) 334 for (i = 0; i < used_lrs; i++)
246 __gic_v3_set_lr(cpu_if->vgic_lr[i], i); 335 __gic_v3_set_lr(cpu_if->vgic_lr[i], i);
336 } else {
337 /*
338 * If we need to trap system registers, we must write
339 * ICH_HCR_EL2 anyway, even if no interrupts are being
 340 * injected.
341 */
342 if (static_branch_unlikely(&vgic_v3_cpuif_trap))
343 write_gicreg(cpu_if->vgic_hcr, ICH_HCR_EL2);
247 } 344 }
248 345
249 /* 346 /*
@@ -287,3 +384,697 @@ void __hyp_text __vgic_v3_write_vmcr(u32 vmcr)
287{ 384{
288 write_gicreg(vmcr, ICH_VMCR_EL2); 385 write_gicreg(vmcr, ICH_VMCR_EL2);
289} 386}
387
388#ifdef CONFIG_ARM64
389
390static int __hyp_text __vgic_v3_bpr_min(void)
391{
392 /* See Pseudocode for VPriorityGroup */
393 return 8 - vtr_to_nr_pre_bits(read_gicreg(ICH_VTR_EL2));
394}
395
396static int __hyp_text __vgic_v3_get_group(struct kvm_vcpu *vcpu)
397{
398 u32 esr = kvm_vcpu_get_hsr(vcpu);
399 u8 crm = (esr & ESR_ELx_SYS64_ISS_CRM_MASK) >> ESR_ELx_SYS64_ISS_CRM_SHIFT;
400
401 return crm != 8;
402}
403
404#define GICv3_IDLE_PRIORITY 0xff
405
406static int __hyp_text __vgic_v3_highest_priority_lr(struct kvm_vcpu *vcpu,
407 u32 vmcr,
408 u64 *lr_val)
409{
410 unsigned int used_lrs = vcpu->arch.vgic_cpu.used_lrs;
411 u8 priority = GICv3_IDLE_PRIORITY;
412 int i, lr = -1;
413
414 for (i = 0; i < used_lrs; i++) {
415 u64 val = __gic_v3_get_lr(i);
416 u8 lr_prio = (val & ICH_LR_PRIORITY_MASK) >> ICH_LR_PRIORITY_SHIFT;
417
418 /* Not pending in the state? */
419 if ((val & ICH_LR_STATE) != ICH_LR_PENDING_BIT)
420 continue;
421
422 /* Group-0 interrupt, but Group-0 disabled? */
423 if (!(val & ICH_LR_GROUP) && !(vmcr & ICH_VMCR_ENG0_MASK))
424 continue;
425
426 /* Group-1 interrupt, but Group-1 disabled? */
427 if ((val & ICH_LR_GROUP) && !(vmcr & ICH_VMCR_ENG1_MASK))
428 continue;
429
430 /* Not the highest priority? */
431 if (lr_prio >= priority)
432 continue;
433
434 /* This is a candidate */
435 priority = lr_prio;
436 *lr_val = val;
437 lr = i;
438 }
439
440 if (lr == -1)
441 *lr_val = ICC_IAR1_EL1_SPURIOUS;
442
443 return lr;
444}
445
446static int __hyp_text __vgic_v3_find_active_lr(struct kvm_vcpu *vcpu,
447 int intid, u64 *lr_val)
448{
449 unsigned int used_lrs = vcpu->arch.vgic_cpu.used_lrs;
450 int i;
451
452 for (i = 0; i < used_lrs; i++) {
453 u64 val = __gic_v3_get_lr(i);
454
455 if ((val & ICH_LR_VIRTUAL_ID_MASK) == intid &&
456 (val & ICH_LR_ACTIVE_BIT)) {
457 *lr_val = val;
458 return i;
459 }
460 }
461
462 *lr_val = ICC_IAR1_EL1_SPURIOUS;
463 return -1;
464}
465
466static int __hyp_text __vgic_v3_get_highest_active_priority(void)
467{
468 u8 nr_apr_regs = vtr_to_nr_apr_regs(read_gicreg(ICH_VTR_EL2));
469 u32 hap = 0;
470 int i;
471
472 for (i = 0; i < nr_apr_regs; i++) {
473 u32 val;
474
475 /*
476 * The ICH_AP0Rn_EL2 and ICH_AP1Rn_EL2 registers
477 * contain the active priority levels for this VCPU
478 * for the maximum number of supported priority
479 * levels, and we return the full priority level only
480 * if the BPR is programmed to its minimum, otherwise
481 * we return a combination of the priority level and
482 * subpriority, as determined by the setting of the
483 * BPR, but without the full subpriority.
484 */
485 val = __vgic_v3_read_ap0rn(i);
486 val |= __vgic_v3_read_ap1rn(i);
487 if (!val) {
488 hap += 32;
489 continue;
490 }
491
492 return (hap + __ffs(val)) << __vgic_v3_bpr_min();
493 }
494
495 return GICv3_IDLE_PRIORITY;
496}
497
498static unsigned int __hyp_text __vgic_v3_get_bpr0(u32 vmcr)
499{
500 return (vmcr & ICH_VMCR_BPR0_MASK) >> ICH_VMCR_BPR0_SHIFT;
501}
502
503static unsigned int __hyp_text __vgic_v3_get_bpr1(u32 vmcr)
504{
505 unsigned int bpr;
506
507 if (vmcr & ICH_VMCR_CBPR_MASK) {
508 bpr = __vgic_v3_get_bpr0(vmcr);
509 if (bpr < 7)
510 bpr++;
511 } else {
512 bpr = (vmcr & ICH_VMCR_BPR1_MASK) >> ICH_VMCR_BPR1_SHIFT;
513 }
514
515 return bpr;
516}
517
518/*
519 * Convert a priority to a preemption level, taking the relevant BPR
520 * into account by zeroing the sub-priority bits.
521 */
522static u8 __hyp_text __vgic_v3_pri_to_pre(u8 pri, u32 vmcr, int grp)
523{
524 unsigned int bpr;
525
526 if (!grp)
527 bpr = __vgic_v3_get_bpr0(vmcr) + 1;
528 else
529 bpr = __vgic_v3_get_bpr1(vmcr);
530
531 return pri & (GENMASK(7, 0) << bpr);
532}
533
534/*
535 * The priority value is independent of any of the BPR values, so we
 536 * normalize it using the minimal BPR value. This guarantees that no
537 * matter what the guest does with its BPR, we can always set/get the
538 * same value of a priority.
539 */
540static void __hyp_text __vgic_v3_set_active_priority(u8 pri, u32 vmcr, int grp)
541{
542 u8 pre, ap;
543 u32 val;
544 int apr;
545
546 pre = __vgic_v3_pri_to_pre(pri, vmcr, grp);
547 ap = pre >> __vgic_v3_bpr_min();
548 apr = ap / 32;
549
550 if (!grp) {
551 val = __vgic_v3_read_ap0rn(apr);
552 __vgic_v3_write_ap0rn(val | BIT(ap % 32), apr);
553 } else {
554 val = __vgic_v3_read_ap1rn(apr);
555 __vgic_v3_write_ap1rn(val | BIT(ap % 32), apr);
556 }
557}
558
559static int __hyp_text __vgic_v3_clear_highest_active_priority(void)
560{
561 u8 nr_apr_regs = vtr_to_nr_apr_regs(read_gicreg(ICH_VTR_EL2));
562 u32 hap = 0;
563 int i;
564
565 for (i = 0; i < nr_apr_regs; i++) {
566 u32 ap0, ap1;
567 int c0, c1;
568
569 ap0 = __vgic_v3_read_ap0rn(i);
570 ap1 = __vgic_v3_read_ap1rn(i);
571 if (!ap0 && !ap1) {
572 hap += 32;
573 continue;
574 }
575
576 c0 = ap0 ? __ffs(ap0) : 32;
577 c1 = ap1 ? __ffs(ap1) : 32;
578
579 /* Always clear the LSB, which is the highest priority */
580 if (c0 < c1) {
581 ap0 &= ~BIT(c0);
582 __vgic_v3_write_ap0rn(ap0, i);
583 hap += c0;
584 } else {
585 ap1 &= ~BIT(c1);
586 __vgic_v3_write_ap1rn(ap1, i);
587 hap += c1;
588 }
589
590 /* Rescale to 8 bits of priority */
591 return hap << __vgic_v3_bpr_min();
592 }
593
594 return GICv3_IDLE_PRIORITY;
595}
596
597static void __hyp_text __vgic_v3_read_iar(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
598{
599 u64 lr_val;
600 u8 lr_prio, pmr;
601 int lr, grp;
602
603 grp = __vgic_v3_get_group(vcpu);
604
605 lr = __vgic_v3_highest_priority_lr(vcpu, vmcr, &lr_val);
606 if (lr < 0)
607 goto spurious;
608
609 if (grp != !!(lr_val & ICH_LR_GROUP))
610 goto spurious;
611
612 pmr = (vmcr & ICH_VMCR_PMR_MASK) >> ICH_VMCR_PMR_SHIFT;
613 lr_prio = (lr_val & ICH_LR_PRIORITY_MASK) >> ICH_LR_PRIORITY_SHIFT;
614 if (pmr <= lr_prio)
615 goto spurious;
616
617 if (__vgic_v3_get_highest_active_priority() <= __vgic_v3_pri_to_pre(lr_prio, vmcr, grp))
618 goto spurious;
619
620 lr_val &= ~ICH_LR_STATE;
621 /* No active state for LPIs */
622 if ((lr_val & ICH_LR_VIRTUAL_ID_MASK) <= VGIC_MAX_SPI)
623 lr_val |= ICH_LR_ACTIVE_BIT;
624 __gic_v3_set_lr(lr_val, lr);
625 __vgic_v3_set_active_priority(lr_prio, vmcr, grp);
626 vcpu_set_reg(vcpu, rt, lr_val & ICH_LR_VIRTUAL_ID_MASK);
627 return;
628
629spurious:
630 vcpu_set_reg(vcpu, rt, ICC_IAR1_EL1_SPURIOUS);
631}
632
633static void __hyp_text __vgic_v3_clear_active_lr(int lr, u64 lr_val)
634{
635 lr_val &= ~ICH_LR_ACTIVE_BIT;
636 if (lr_val & ICH_LR_HW) {
637 u32 pid;
638
639 pid = (lr_val & ICH_LR_PHYS_ID_MASK) >> ICH_LR_PHYS_ID_SHIFT;
640 gic_write_dir(pid);
641 }
642
643 __gic_v3_set_lr(lr_val, lr);
644}
645
646static void __hyp_text __vgic_v3_bump_eoicount(void)
647{
648 u32 hcr;
649
650 hcr = read_gicreg(ICH_HCR_EL2);
651 hcr += 1 << ICH_HCR_EOIcount_SHIFT;
652 write_gicreg(hcr, ICH_HCR_EL2);
653}
654
655static void __hyp_text __vgic_v3_write_dir(struct kvm_vcpu *vcpu,
656 u32 vmcr, int rt)
657{
658 u32 vid = vcpu_get_reg(vcpu, rt);
659 u64 lr_val;
660 int lr;
661
662 /* EOImode == 0, nothing to be done here */
663 if (!(vmcr & ICH_VMCR_EOIM_MASK))
664 return;
665
666 /* No deactivate to be performed on an LPI */
667 if (vid >= VGIC_MIN_LPI)
668 return;
669
670 lr = __vgic_v3_find_active_lr(vcpu, vid, &lr_val);
671 if (lr == -1) {
672 __vgic_v3_bump_eoicount();
673 return;
674 }
675
676 __vgic_v3_clear_active_lr(lr, lr_val);
677}
678
679static void __hyp_text __vgic_v3_write_eoir(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
680{
681 u32 vid = vcpu_get_reg(vcpu, rt);
682 u64 lr_val;
683 u8 lr_prio, act_prio;
684 int lr, grp;
685
686 grp = __vgic_v3_get_group(vcpu);
687
688 /* Drop priority in any case */
689 act_prio = __vgic_v3_clear_highest_active_priority();
690
691 /* If EOIing an LPI, no deactivate to be performed */
692 if (vid >= VGIC_MIN_LPI)
693 return;
694
695 /* EOImode == 1, nothing to be done here */
696 if (vmcr & ICH_VMCR_EOIM_MASK)
697 return;
698
699 lr = __vgic_v3_find_active_lr(vcpu, vid, &lr_val);
700 if (lr == -1) {
701 __vgic_v3_bump_eoicount();
702 return;
703 }
704
705 lr_prio = (lr_val & ICH_LR_PRIORITY_MASK) >> ICH_LR_PRIORITY_SHIFT;
706
707 /* If priorities or group do not match, the guest has fscked-up. */
708 if (grp != !!(lr_val & ICH_LR_GROUP) ||
709 __vgic_v3_pri_to_pre(lr_prio, vmcr, grp) != act_prio)
710 return;
711
712 /* Let's now perform the deactivation */
713 __vgic_v3_clear_active_lr(lr, lr_val);
714}
715
716static void __hyp_text __vgic_v3_read_igrpen0(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
717{
718 vcpu_set_reg(vcpu, rt, !!(vmcr & ICH_VMCR_ENG0_MASK));
719}
720
721static void __hyp_text __vgic_v3_read_igrpen1(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
722{
723 vcpu_set_reg(vcpu, rt, !!(vmcr & ICH_VMCR_ENG1_MASK));
724}
725
726static void __hyp_text __vgic_v3_write_igrpen0(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
727{
728 u64 val = vcpu_get_reg(vcpu, rt);
729
730 if (val & 1)
731 vmcr |= ICH_VMCR_ENG0_MASK;
732 else
733 vmcr &= ~ICH_VMCR_ENG0_MASK;
734
735 __vgic_v3_write_vmcr(vmcr);
736}
737
738static void __hyp_text __vgic_v3_write_igrpen1(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
739{
740 u64 val = vcpu_get_reg(vcpu, rt);
741
742 if (val & 1)
743 vmcr |= ICH_VMCR_ENG1_MASK;
744 else
745 vmcr &= ~ICH_VMCR_ENG1_MASK;
746
747 __vgic_v3_write_vmcr(vmcr);
748}
749
750static void __hyp_text __vgic_v3_read_bpr0(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
751{
752 vcpu_set_reg(vcpu, rt, __vgic_v3_get_bpr0(vmcr));
753}
754
755static void __hyp_text __vgic_v3_read_bpr1(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
756{
757 vcpu_set_reg(vcpu, rt, __vgic_v3_get_bpr1(vmcr));
758}
759
760static void __hyp_text __vgic_v3_write_bpr0(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
761{
762 u64 val = vcpu_get_reg(vcpu, rt);
763 u8 bpr_min = __vgic_v3_bpr_min() - 1;
764
765 /* Enforce BPR limiting */
766 if (val < bpr_min)
767 val = bpr_min;
768
769 val <<= ICH_VMCR_BPR0_SHIFT;
770 val &= ICH_VMCR_BPR0_MASK;
771 vmcr &= ~ICH_VMCR_BPR0_MASK;
772 vmcr |= val;
773
774 __vgic_v3_write_vmcr(vmcr);
775}
776
777static void __hyp_text __vgic_v3_write_bpr1(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
778{
779 u64 val = vcpu_get_reg(vcpu, rt);
780 u8 bpr_min = __vgic_v3_bpr_min();
781
782 if (vmcr & ICH_VMCR_CBPR_MASK)
783 return;
784
785 /* Enforce BPR limiting */
786 if (val < bpr_min)
787 val = bpr_min;
788
789 val <<= ICH_VMCR_BPR1_SHIFT;
790 val &= ICH_VMCR_BPR1_MASK;
791 vmcr &= ~ICH_VMCR_BPR1_MASK;
792 vmcr |= val;
793
794 __vgic_v3_write_vmcr(vmcr);
795}
796
797static void __hyp_text __vgic_v3_read_apxrn(struct kvm_vcpu *vcpu, int rt, int n)
798{
799 u32 val;
800
801 if (!__vgic_v3_get_group(vcpu))
802 val = __vgic_v3_read_ap0rn(n);
803 else
804 val = __vgic_v3_read_ap1rn(n);
805
806 vcpu_set_reg(vcpu, rt, val);
807}
808
809static void __hyp_text __vgic_v3_write_apxrn(struct kvm_vcpu *vcpu, int rt, int n)
810{
811 u32 val = vcpu_get_reg(vcpu, rt);
812
813 if (!__vgic_v3_get_group(vcpu))
814 __vgic_v3_write_ap0rn(val, n);
815 else
816 __vgic_v3_write_ap1rn(val, n);
817}
818
819static void __hyp_text __vgic_v3_read_apxr0(struct kvm_vcpu *vcpu,
820 u32 vmcr, int rt)
821{
822 __vgic_v3_read_apxrn(vcpu, rt, 0);
823}
824
825static void __hyp_text __vgic_v3_read_apxr1(struct kvm_vcpu *vcpu,
826 u32 vmcr, int rt)
827{
828 __vgic_v3_read_apxrn(vcpu, rt, 1);
829}
830
831static void __hyp_text __vgic_v3_read_apxr2(struct kvm_vcpu *vcpu,
832 u32 vmcr, int rt)
833{
834 __vgic_v3_read_apxrn(vcpu, rt, 2);
835}
836
837static void __hyp_text __vgic_v3_read_apxr3(struct kvm_vcpu *vcpu,
838 u32 vmcr, int rt)
839{
840 __vgic_v3_read_apxrn(vcpu, rt, 3);
841}
842
843static void __hyp_text __vgic_v3_write_apxr0(struct kvm_vcpu *vcpu,
844 u32 vmcr, int rt)
845{
846 __vgic_v3_write_apxrn(vcpu, rt, 0);
847}
848
849static void __hyp_text __vgic_v3_write_apxr1(struct kvm_vcpu *vcpu,
850 u32 vmcr, int rt)
851{
852 __vgic_v3_write_apxrn(vcpu, rt, 1);
853}
854
855static void __hyp_text __vgic_v3_write_apxr2(struct kvm_vcpu *vcpu,
856 u32 vmcr, int rt)
857{
858 __vgic_v3_write_apxrn(vcpu, rt, 2);
859}
860
861static void __hyp_text __vgic_v3_write_apxr3(struct kvm_vcpu *vcpu,
862 u32 vmcr, int rt)
863{
864 __vgic_v3_write_apxrn(vcpu, rt, 3);
865}
866
867static void __hyp_text __vgic_v3_read_hppir(struct kvm_vcpu *vcpu,
868 u32 vmcr, int rt)
869{
870 u64 lr_val;
871 int lr, lr_grp, grp;
872
873 grp = __vgic_v3_get_group(vcpu);
874
875 lr = __vgic_v3_highest_priority_lr(vcpu, vmcr, &lr_val);
876 if (lr == -1)
877 goto spurious;
878
879 lr_grp = !!(lr_val & ICH_LR_GROUP);
880 if (lr_grp != grp)
881 lr_val = ICC_IAR1_EL1_SPURIOUS;
882
883spurious:
884 vcpu_set_reg(vcpu, rt, lr_val & ICH_LR_VIRTUAL_ID_MASK);
885}
886
887static void __hyp_text __vgic_v3_read_pmr(struct kvm_vcpu *vcpu,
888 u32 vmcr, int rt)
889{
890 vmcr &= ICH_VMCR_PMR_MASK;
891 vmcr >>= ICH_VMCR_PMR_SHIFT;
892 vcpu_set_reg(vcpu, rt, vmcr);
893}
894
895static void __hyp_text __vgic_v3_write_pmr(struct kvm_vcpu *vcpu,
896 u32 vmcr, int rt)
897{
898 u32 val = vcpu_get_reg(vcpu, rt);
899
900 val <<= ICH_VMCR_PMR_SHIFT;
901 val &= ICH_VMCR_PMR_MASK;
902 vmcr &= ~ICH_VMCR_PMR_MASK;
903 vmcr |= val;
904
905 write_gicreg(vmcr, ICH_VMCR_EL2);
906}
907
908static void __hyp_text __vgic_v3_read_rpr(struct kvm_vcpu *vcpu,
909 u32 vmcr, int rt)
910{
911 u32 val = __vgic_v3_get_highest_active_priority();
912 vcpu_set_reg(vcpu, rt, val);
913}
914
915static void __hyp_text __vgic_v3_read_ctlr(struct kvm_vcpu *vcpu,
916 u32 vmcr, int rt)
917{
918 u32 vtr, val;
919
920 vtr = read_gicreg(ICH_VTR_EL2);
921 /* PRIbits */
922 val = ((vtr >> 29) & 7) << ICC_CTLR_EL1_PRI_BITS_SHIFT;
923 /* IDbits */
924 val |= ((vtr >> 23) & 7) << ICC_CTLR_EL1_ID_BITS_SHIFT;
925 /* SEIS */
926 val |= ((vtr >> 22) & 1) << ICC_CTLR_EL1_SEIS_SHIFT;
927 /* A3V */
928 val |= ((vtr >> 21) & 1) << ICC_CTLR_EL1_A3V_SHIFT;
929 /* EOImode */
930 val |= ((vmcr & ICH_VMCR_EOIM_MASK) >> ICH_VMCR_EOIM_SHIFT) << ICC_CTLR_EL1_EOImode_SHIFT;
931 /* CBPR */
932 val |= (vmcr & ICH_VMCR_CBPR_MASK) >> ICH_VMCR_CBPR_SHIFT;
933
934 vcpu_set_reg(vcpu, rt, val);
935}
936
937static void __hyp_text __vgic_v3_write_ctlr(struct kvm_vcpu *vcpu,
938 u32 vmcr, int rt)
939{
940 u32 val = vcpu_get_reg(vcpu, rt);
941
942 if (val & ICC_CTLR_EL1_CBPR_MASK)
943 vmcr |= ICH_VMCR_CBPR_MASK;
944 else
945 vmcr &= ~ICH_VMCR_CBPR_MASK;
946
947 if (val & ICC_CTLR_EL1_EOImode_MASK)
948 vmcr |= ICH_VMCR_EOIM_MASK;
949 else
950 vmcr &= ~ICH_VMCR_EOIM_MASK;
951
952 write_gicreg(vmcr, ICH_VMCR_EL2);
953}
954
955int __hyp_text __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu)
956{
957 int rt;
958 u32 esr;
959 u32 vmcr;
960 void (*fn)(struct kvm_vcpu *, u32, int);
961 bool is_read;
962 u32 sysreg;
963
964 esr = kvm_vcpu_get_hsr(vcpu);
965 if (vcpu_mode_is_32bit(vcpu)) {
966 if (!kvm_condition_valid(vcpu))
967 return 1;
968
969 sysreg = esr_cp15_to_sysreg(esr);
970 } else {
971 sysreg = esr_sys64_to_sysreg(esr);
972 }
973
974 is_read = (esr & ESR_ELx_SYS64_ISS_DIR_MASK) == ESR_ELx_SYS64_ISS_DIR_READ;
975
976 switch (sysreg) {
977 case SYS_ICC_IAR0_EL1:
978 case SYS_ICC_IAR1_EL1:
979 if (unlikely(!is_read))
980 return 0;
981 fn = __vgic_v3_read_iar;
982 break;
983 case SYS_ICC_EOIR0_EL1:
984 case SYS_ICC_EOIR1_EL1:
985 if (unlikely(is_read))
986 return 0;
987 fn = __vgic_v3_write_eoir;
988 break;
989 case SYS_ICC_IGRPEN1_EL1:
990 if (is_read)
991 fn = __vgic_v3_read_igrpen1;
992 else
993 fn = __vgic_v3_write_igrpen1;
994 break;
995 case SYS_ICC_BPR1_EL1:
996 if (is_read)
997 fn = __vgic_v3_read_bpr1;
998 else
999 fn = __vgic_v3_write_bpr1;
1000 break;
1001 case SYS_ICC_AP0Rn_EL1(0):
1002 case SYS_ICC_AP1Rn_EL1(0):
1003 if (is_read)
1004 fn = __vgic_v3_read_apxr0;
1005 else
1006 fn = __vgic_v3_write_apxr0;
1007 break;
1008 case SYS_ICC_AP0Rn_EL1(1):
1009 case SYS_ICC_AP1Rn_EL1(1):
1010 if (is_read)
1011 fn = __vgic_v3_read_apxr1;
1012 else
1013 fn = __vgic_v3_write_apxr1;
1014 break;
1015 case SYS_ICC_AP0Rn_EL1(2):
1016 case SYS_ICC_AP1Rn_EL1(2):
1017 if (is_read)
1018 fn = __vgic_v3_read_apxr2;
1019 else
1020 fn = __vgic_v3_write_apxr2;
1021 break;
1022 case SYS_ICC_AP0Rn_EL1(3):
1023 case SYS_ICC_AP1Rn_EL1(3):
1024 if (is_read)
1025 fn = __vgic_v3_read_apxr3;
1026 else
1027 fn = __vgic_v3_write_apxr3;
1028 break;
1029 case SYS_ICC_HPPIR0_EL1:
1030 case SYS_ICC_HPPIR1_EL1:
1031 if (unlikely(!is_read))
1032 return 0;
1033 fn = __vgic_v3_read_hppir;
1034 break;
1035 case SYS_ICC_IGRPEN0_EL1:
1036 if (is_read)
1037 fn = __vgic_v3_read_igrpen0;
1038 else
1039 fn = __vgic_v3_write_igrpen0;
1040 break;
1041 case SYS_ICC_BPR0_EL1:
1042 if (is_read)
1043 fn = __vgic_v3_read_bpr0;
1044 else
1045 fn = __vgic_v3_write_bpr0;
1046 break;
1047 case SYS_ICC_DIR_EL1:
1048 if (unlikely(is_read))
1049 return 0;
1050 fn = __vgic_v3_write_dir;
1051 break;
1052 case SYS_ICC_RPR_EL1:
1053 if (unlikely(!is_read))
1054 return 0;
1055 fn = __vgic_v3_read_rpr;
1056 break;
1057 case SYS_ICC_CTLR_EL1:
1058 if (is_read)
1059 fn = __vgic_v3_read_ctlr;
1060 else
1061 fn = __vgic_v3_write_ctlr;
1062 break;
1063 case SYS_ICC_PMR_EL1:
1064 if (is_read)
1065 fn = __vgic_v3_read_pmr;
1066 else
1067 fn = __vgic_v3_write_pmr;
1068 break;
1069 default:
1070 return 0;
1071 }
1072
1073 vmcr = __vgic_v3_read_vmcr();
1074 rt = kvm_vcpu_sys_get_rt(vcpu);
1075 fn(vcpu, vmcr, rt);
1076
1077 return 1;
1078}
1079
1080#endif
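
The __vgic_v3_perform_cpuif_access() handler above is the entry point for the new GICv3 sysreg trapping: it decodes the trapped ICC_* access from the ESR, checks the access direction, dispatches to one of the read/write emulation helpers that operate on ICH_VMCR_EL2 and the list registers, and returns 1 when the access was emulated (0 hands it back to the normal exit path). A minimal userspace model of that decode-and-dispatch shape, with invented encodings (fake_esr, FAKE_SYS_*) rather than the kernel's:

  /*
   * Minimal model of the decode-and-dispatch shape used by
   * __vgic_v3_perform_cpuif_access() above.  All names and bit layouts here
   * are illustrative only, not the kernel's.
   */
  #include <stdio.h>
  #include <stdint.h>

  #define DIR_READ     1u          /* direction bit: 1 = read (MRS), 0 = write (MSR) */
  #define SYSREG_SHIFT 1

  enum { FAKE_SYS_PMR = 0x10 };

  static uint64_t regfile[32];     /* stand-in for the guest GPRs */
  static uint32_t vmcr = 0x00f0;   /* stand-in for ICH_VMCR_EL2 */

  static void read_pmr(int rt)  { regfile[rt] = (vmcr >> 4) & 0xff; }
  static void write_pmr(int rt) { vmcr = (vmcr & ~0xff0u) | ((regfile[rt] & 0xff) << 4); }

  /* Returns 1 if the access was emulated, 0 if it should be handled elsewhere. */
  static int perform_cpuif_access(uint32_t fake_esr, int rt)
  {
          int is_read = fake_esr & DIR_READ;
          uint32_t sysreg = fake_esr >> SYSREG_SHIFT;
          void (*fn)(int);

          switch (sysreg) {
          case FAKE_SYS_PMR:
                  fn = is_read ? read_pmr : write_pmr;
                  break;
          default:
                  return 0;
          }

          fn(rt);
          return 1;
  }

  int main(void)
  {
          regfile[3] = 0x80;
          perform_cpuif_access((FAKE_SYS_PMR << SYSREG_SHIFT) | 0, 3); /* write PMR from r3 */
          perform_cpuif_access((FAKE_SYS_PMR << SYSREG_SHIFT) | 1, 5); /* read PMR into r5 */
          printf("vmcr=%#x r5=%#llx\n", vmcr, (unsigned long long)regfile[5]);
          return 0;
  }
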
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 1c44aa35f909..0e1fc75f3585 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -20,6 +20,7 @@
20#include <linux/kvm_host.h> 20#include <linux/kvm_host.h>
21#include <linux/io.h> 21#include <linux/io.h>
22#include <linux/hugetlb.h> 22#include <linux/hugetlb.h>
23#include <linux/sched/signal.h>
23#include <trace/events/kvm.h> 24#include <trace/events/kvm.h>
24#include <asm/pgalloc.h> 25#include <asm/pgalloc.h>
25#include <asm/cacheflush.h> 26#include <asm/cacheflush.h>
@@ -1262,6 +1263,24 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
1262 __coherent_cache_guest_page(vcpu, pfn, size); 1263 __coherent_cache_guest_page(vcpu, pfn, size);
1263} 1264}
1264 1265
1266static void kvm_send_hwpoison_signal(unsigned long address,
1267 struct vm_area_struct *vma)
1268{
1269 siginfo_t info;
1270
1271 info.si_signo = SIGBUS;
1272 info.si_errno = 0;
1273 info.si_code = BUS_MCEERR_AR;
1274 info.si_addr = (void __user *)address;
1275
1276 if (is_vm_hugetlb_page(vma))
1277 info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
1278 else
1279 info.si_addr_lsb = PAGE_SHIFT;
1280
1281 send_sig_info(SIGBUS, &info, current);
1282}
1283
1265static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, 1284static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
1266 struct kvm_memory_slot *memslot, unsigned long hva, 1285 struct kvm_memory_slot *memslot, unsigned long hva,
1267 unsigned long fault_status) 1286 unsigned long fault_status)
@@ -1331,6 +1350,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
1331 smp_rmb(); 1350 smp_rmb();
1332 1351
1333 pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable); 1352 pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
1353 if (pfn == KVM_PFN_ERR_HWPOISON) {
1354 kvm_send_hwpoison_signal(hva, vma);
1355 return 0;
1356 }
1334 if (is_error_noslot_pfn(pfn)) 1357 if (is_error_noslot_pfn(pfn))
1335 return -EFAULT; 1358 return -EFAULT;
1336 1359
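
kvm_send_hwpoison_signal() above is how a hardware-poisoned guest page is now reported: instead of failing the stage-2 fault, KVM sends the faulting VMM thread a SIGBUS with si_code BUS_MCEERR_AR, si_addr set to the faulting userspace address and si_addr_lsb giving the poisoned granule (PAGE_SHIFT, or the hugepage shift for hugetlb mappings). A VMM that wants to react needs an SA_SIGINFO handler; a minimal sketch follows (may need _GNU_SOURCE depending on the libc, and a real VMM would record the address and emulate a memory error for the guest rather than exit):

  /*
   * Minimal sketch of a VMM-side SIGBUS handler for the BUS_MCEERR_AR
   * signals sent by kvm_send_hwpoison_signal() above.
   */
  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
  {
          (void)sig; (void)ctx;

          if (si->si_code == BUS_MCEERR_AR) {
                  /* si_addr_lsb is PAGE_SHIFT, or the huge page shift for hugetlb. */
                  fprintf(stderr, "hwpoison at %p, granule 2^%d bytes\n",
                          si->si_addr, (int)si->si_addr_lsb);
                  /* A real VMM would translate this back to a GPA and inject an error. */
                  _exit(1);
          }
  }

  int main(void)
  {
          struct sigaction sa = {
                  .sa_sigaction = sigbus_handler,
                  .sa_flags = SA_SIGINFO,
          };

          sigaction(SIGBUS, &sa, NULL);
          pause();        /* a VMM would be sitting in KVM_RUN here */
          return 0;
  }
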
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 4b43e7f3b158..fc8a723ff387 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -203,6 +203,24 @@ static u64 kvm_pmu_overflow_status(struct kvm_vcpu *vcpu)
203 return reg; 203 return reg;
204} 204}
205 205
206static void kvm_pmu_check_overflow(struct kvm_vcpu *vcpu)
207{
208 struct kvm_pmu *pmu = &vcpu->arch.pmu;
209 bool overflow = !!kvm_pmu_overflow_status(vcpu);
210
211 if (pmu->irq_level == overflow)
212 return;
213
214 pmu->irq_level = overflow;
215
216 if (likely(irqchip_in_kernel(vcpu->kvm))) {
217 int ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
218 pmu->irq_num, overflow,
219 &vcpu->arch.pmu);
220 WARN_ON(ret);
221 }
222}
223
206/** 224/**
207 * kvm_pmu_overflow_set - set PMU overflow interrupt 225 * kvm_pmu_overflow_set - set PMU overflow interrupt
208 * @vcpu: The vcpu pointer 226 * @vcpu: The vcpu pointer
@@ -210,37 +228,18 @@ static u64 kvm_pmu_overflow_status(struct kvm_vcpu *vcpu)
210 */ 228 */
211void kvm_pmu_overflow_set(struct kvm_vcpu *vcpu, u64 val) 229void kvm_pmu_overflow_set(struct kvm_vcpu *vcpu, u64 val)
212{ 230{
213 u64 reg;
214
215 if (val == 0) 231 if (val == 0)
216 return; 232 return;
217 233
218 vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= val; 234 vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= val;
219 reg = kvm_pmu_overflow_status(vcpu); 235 kvm_pmu_check_overflow(vcpu);
220 if (reg != 0)
221 kvm_vcpu_kick(vcpu);
222} 236}
223 237
224static void kvm_pmu_update_state(struct kvm_vcpu *vcpu) 238static void kvm_pmu_update_state(struct kvm_vcpu *vcpu)
225{ 239{
226 struct kvm_pmu *pmu = &vcpu->arch.pmu;
227 bool overflow;
228
229 if (!kvm_arm_pmu_v3_ready(vcpu)) 240 if (!kvm_arm_pmu_v3_ready(vcpu))
230 return; 241 return;
231 242 kvm_pmu_check_overflow(vcpu);
232 overflow = !!kvm_pmu_overflow_status(vcpu);
233 if (pmu->irq_level == overflow)
234 return;
235
236 pmu->irq_level = overflow;
237
238 if (likely(irqchip_in_kernel(vcpu->kvm))) {
239 int ret;
240 ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
241 pmu->irq_num, overflow);
242 WARN_ON(ret);
243 }
244} 243}
245 244
246bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu) 245bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu)
@@ -451,34 +450,74 @@ bool kvm_arm_support_pmu_v3(void)
451 return (perf_num_counters() > 0); 450 return (perf_num_counters() > 0);
452} 451}
453 452
454static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu) 453int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu)
455{ 454{
456 if (!kvm_arm_support_pmu_v3()) 455 if (!vcpu->arch.pmu.created)
457 return -ENODEV; 456 return 0;
458 457
459 /* 458 /*
460 * We currently require an in-kernel VGIC to use the PMU emulation, 459 * A valid interrupt configuration for the PMU is either to have a
461 * because we do not support forwarding PMU overflow interrupts to 460 * properly configured interrupt number and using an in-kernel
462 * userspace yet. 461 * irqchip, or to not have an in-kernel GIC and not set an IRQ.
463 */ 462 */
464 if (!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm)) 463 if (irqchip_in_kernel(vcpu->kvm)) {
464 int irq = vcpu->arch.pmu.irq_num;
465 if (!kvm_arm_pmu_irq_initialized(vcpu))
466 return -EINVAL;
467
468 /*
469 * If we are using an in-kernel vgic, at this point we know
470 * the vgic will be initialized, so we can check the PMU irq
471 * number against the dimensions of the vgic and make sure
472 * it's valid.
473 */
474 if (!irq_is_ppi(irq) && !vgic_valid_spi(vcpu->kvm, irq))
475 return -EINVAL;
476 } else if (kvm_arm_pmu_irq_initialized(vcpu)) {
477 return -EINVAL;
478 }
479
480 kvm_pmu_vcpu_reset(vcpu);
481 vcpu->arch.pmu.ready = true;
482
483 return 0;
484}
485
486static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu)
487{
488 if (!kvm_arm_support_pmu_v3())
465 return -ENODEV; 489 return -ENODEV;
466 490
467 if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features) || 491 if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
468 !kvm_arm_pmu_irq_initialized(vcpu))
469 return -ENXIO; 492 return -ENXIO;
470 493
471 if (kvm_arm_pmu_v3_ready(vcpu)) 494 if (vcpu->arch.pmu.created)
472 return -EBUSY; 495 return -EBUSY;
473 496
474 kvm_pmu_vcpu_reset(vcpu); 497 if (irqchip_in_kernel(vcpu->kvm)) {
475 vcpu->arch.pmu.ready = true; 498 int ret;
499
500 /*
501 * If using the PMU with an in-kernel virtual GIC
502 * implementation, we require the GIC to be already
503 * initialized when initializing the PMU.
504 */
505 if (!vgic_initialized(vcpu->kvm))
506 return -ENODEV;
507
508 if (!kvm_arm_pmu_irq_initialized(vcpu))
509 return -ENXIO;
476 510
511 ret = kvm_vgic_set_owner(vcpu, vcpu->arch.pmu.irq_num,
512 &vcpu->arch.pmu);
513 if (ret)
514 return ret;
515 }
516
517 vcpu->arch.pmu.created = true;
477 return 0; 518 return 0;
478} 519}
479 520
480#define irq_is_ppi(irq) ((irq) >= VGIC_NR_SGIS && (irq) < VGIC_NR_PRIVATE_IRQS)
481
482/* 521/*
483 * For one VM the interrupt type must be same for each vcpu. 522 * For one VM the interrupt type must be same for each vcpu.
484 * As a PPI, the interrupt number is the same for all vcpus, 523 * As a PPI, the interrupt number is the same for all vcpus,
@@ -512,6 +551,9 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
512 int __user *uaddr = (int __user *)(long)attr->addr; 551 int __user *uaddr = (int __user *)(long)attr->addr;
513 int irq; 552 int irq;
514 553
554 if (!irqchip_in_kernel(vcpu->kvm))
555 return -EINVAL;
556
515 if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features)) 557 if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
516 return -ENODEV; 558 return -ENODEV;
517 559
@@ -519,7 +561,7 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
519 return -EFAULT; 561 return -EFAULT;
520 562
521 /* The PMU overflow interrupt can be a PPI or a valid SPI. */ 563 /* The PMU overflow interrupt can be a PPI or a valid SPI. */
522 if (!(irq_is_ppi(irq) || vgic_valid_spi(vcpu->kvm, irq))) 564 if (!(irq_is_ppi(irq) || irq_is_spi(irq)))
523 return -EINVAL; 565 return -EINVAL;
524 566
525 if (!pmu_irq_is_valid(vcpu->kvm, irq)) 567 if (!pmu_irq_is_valid(vcpu->kvm, irq))
@@ -546,6 +588,9 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
546 int __user *uaddr = (int __user *)(long)attr->addr; 588 int __user *uaddr = (int __user *)(long)attr->addr;
547 int irq; 589 int irq;
548 590
591 if (!irqchip_in_kernel(vcpu->kvm))
592 return -EINVAL;
593
549 if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features)) 594 if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
550 return -ENODEV; 595 return -ENODEV;
551 596
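
On the userspace side the PMU is still configured through the vcpu device-attribute ioctls; what changes above is the validation: the IRQ attribute is now rejected without an in-kernel irqchip, the SPI range check no longer depends on the vgic being initialized, and kvm_arm_pmu_v3_enable() re-checks the whole configuration before the vcpu first runs. A hedged sketch of the usual sequence, assuming vcpu_fd was created with the KVM_ARM_VCPU_PMU_V3 feature and that an in-kernel GIC exists:

  /*
   * Sketch of the userspace side: pick the PMU overflow interrupt and then
   * latch the PMU configuration via the vcpu device-attribute ioctls.
   */
  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int arm_pmu_setup(int vcpu_fd, int overflow_irq)
  {
          struct kvm_device_attr attr = {
                  .group = KVM_ARM_VCPU_PMU_V3_CTRL,
                  .attr  = KVM_ARM_VCPU_PMU_V3_IRQ,
                  .addr  = (__u64)(unsigned long)&overflow_irq,
          };

          /* Tell KVM which PPI/SPI the virtual PMU should raise. */
          if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
                  return -1;

          /* Then finalize; after this the interrupt number cannot change. */
          attr.attr = KVM_ARM_VCPU_PMU_V3_INIT;
          attr.addr = 0;
          if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
                  return -1;

          return 0;
  }
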
diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
index a08d7a93aebb..f1e363bab5e8 100644
--- a/virt/kvm/arm/psci.c
+++ b/virt/kvm/arm/psci.c
@@ -57,6 +57,7 @@ static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu)
57 * for KVM will preserve the register state. 57 * for KVM will preserve the register state.
58 */ 58 */
59 kvm_vcpu_block(vcpu); 59 kvm_vcpu_block(vcpu);
60 kvm_clear_request(KVM_REQ_UNHALT, vcpu);
60 61
61 return PSCI_RET_SUCCESS; 62 return PSCI_RET_SUCCESS;
62} 63}
@@ -64,6 +65,8 @@ static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu)
64static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu) 65static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu)
65{ 66{
66 vcpu->arch.power_off = true; 67 vcpu->arch.power_off = true;
68 kvm_make_request(KVM_REQ_SLEEP, vcpu);
69 kvm_vcpu_kick(vcpu);
67} 70}
68 71
69static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) 72static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
@@ -178,10 +181,9 @@ static void kvm_prepare_system_event(struct kvm_vcpu *vcpu, u32 type)
178 * after this call is handled and before the VCPUs have been 181 * after this call is handled and before the VCPUs have been
179 * re-initialized. 182 * re-initialized.
180 */ 183 */
181 kvm_for_each_vcpu(i, tmp, vcpu->kvm) { 184 kvm_for_each_vcpu(i, tmp, vcpu->kvm)
182 tmp->arch.power_off = true; 185 tmp->arch.power_off = true;
183 kvm_vcpu_kick(tmp); 186 kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP);
184 }
185 187
186 memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event)); 188 memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event));
187 vcpu->run->system_event.type = type; 189 vcpu->run->system_event.type = type;
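
The psci.c hunks convert PSCI CPU_OFF and SYSTEM_OFF/RESET to the VCPU-request pattern: instead of only kicking the target, the caller raises KVM_REQ_SLEEP so the request cannot be lost in a race with guest entry (the suspend path also drops the KVM_REQ_UNHALT left behind by kvm_vcpu_block()), and the vcpu loop checks and clears the bit before running the guest. A tiny userspace model of that make-request/check-request hand-off, with made-up names:

  /*
   * Tiny userspace model of the make-request / check-request hand-off that
   * KVM_REQ_SLEEP relies on; the kernel uses kvm_make_request(),
   * kvm_check_request() and a kick, these names are made up.
   */
  #include <stdatomic.h>
  #include <stdio.h>
  #include <pthread.h>
  #include <unistd.h>

  #define REQ_SLEEP 0

  static atomic_uint requests;

  static void make_request(int req) { atomic_fetch_or(&requests, 1u << req); }

  static int check_request(int req) /* test and clear, like kvm_check_request() */
  {
          unsigned int bit = 1u << req;

          if (!(atomic_load(&requests) & bit))
                  return 0;
          atomic_fetch_and(&requests, ~bit);
          return 1;
  }

  static void *vcpu_thread(void *arg)
  {
          (void)arg;
          for (int i = 0; i < 5; i++) {
                  if (check_request(REQ_SLEEP))
                          printf("vcpu: sleeping instead of entering the guest\n");
                  else
                          printf("vcpu: entering guest\n");
                  usleep(100 * 1000);
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t t;

          pthread_create(&t, NULL, vcpu_thread, NULL);
          usleep(150 * 1000);
          make_request(REQ_SLEEP);        /* what kvm_psci_vcpu_off() now does */
          pthread_join(t, NULL);
          return 0;
  }
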
diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c
index f138ed2e9c63..b7baf581611a 100644
--- a/virt/kvm/arm/vgic/vgic-irqfd.c
+++ b/virt/kvm/arm/vgic/vgic-irqfd.c
@@ -34,7 +34,7 @@ static int vgic_irqfd_set_irq(struct kvm_kernel_irq_routing_entry *e,
34 34
35 if (!vgic_valid_spi(kvm, spi_id)) 35 if (!vgic_valid_spi(kvm, spi_id))
36 return -EINVAL; 36 return -EINVAL;
37 return kvm_vgic_inject_irq(kvm, 0, spi_id, level); 37 return kvm_vgic_inject_irq(kvm, 0, spi_id, level, NULL);
38} 38}
39 39
40/** 40/**
diff --git a/virt/kvm/arm/vgic/vgic-mmio-v2.c b/virt/kvm/arm/vgic/vgic-mmio-v2.c
index 63e0bbdcddcc..37522e65eb53 100644
--- a/virt/kvm/arm/vgic/vgic-mmio-v2.c
+++ b/virt/kvm/arm/vgic/vgic-mmio-v2.c
@@ -308,34 +308,36 @@ static const struct vgic_register_region vgic_v2_dist_registers[] = {
308 vgic_mmio_read_v2_misc, vgic_mmio_write_v2_misc, 12, 308 vgic_mmio_read_v2_misc, vgic_mmio_write_v2_misc, 12,
309 VGIC_ACCESS_32bit), 309 VGIC_ACCESS_32bit),
310 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_IGROUP, 310 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_IGROUP,
311 vgic_mmio_read_rao, vgic_mmio_write_wi, 1, 311 vgic_mmio_read_rao, vgic_mmio_write_wi, NULL, NULL, 1,
312 VGIC_ACCESS_32bit), 312 VGIC_ACCESS_32bit),
313 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ENABLE_SET, 313 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ENABLE_SET,
314 vgic_mmio_read_enable, vgic_mmio_write_senable, 1, 314 vgic_mmio_read_enable, vgic_mmio_write_senable, NULL, NULL, 1,
315 VGIC_ACCESS_32bit), 315 VGIC_ACCESS_32bit),
316 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ENABLE_CLEAR, 316 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ENABLE_CLEAR,
317 vgic_mmio_read_enable, vgic_mmio_write_cenable, 1, 317 vgic_mmio_read_enable, vgic_mmio_write_cenable, NULL, NULL, 1,
318 VGIC_ACCESS_32bit), 318 VGIC_ACCESS_32bit),
319 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_SET, 319 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_SET,
320 vgic_mmio_read_pending, vgic_mmio_write_spending, 1, 320 vgic_mmio_read_pending, vgic_mmio_write_spending, NULL, NULL, 1,
321 VGIC_ACCESS_32bit), 321 VGIC_ACCESS_32bit),
322 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_CLEAR, 322 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_CLEAR,
323 vgic_mmio_read_pending, vgic_mmio_write_cpending, 1, 323 vgic_mmio_read_pending, vgic_mmio_write_cpending, NULL, NULL, 1,
324 VGIC_ACCESS_32bit), 324 VGIC_ACCESS_32bit),
325 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ACTIVE_SET, 325 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ACTIVE_SET,
326 vgic_mmio_read_active, vgic_mmio_write_sactive, 1, 326 vgic_mmio_read_active, vgic_mmio_write_sactive,
327 NULL, vgic_mmio_uaccess_write_sactive, 1,
327 VGIC_ACCESS_32bit), 328 VGIC_ACCESS_32bit),
328 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ACTIVE_CLEAR, 329 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ACTIVE_CLEAR,
329 vgic_mmio_read_active, vgic_mmio_write_cactive, 1, 330 vgic_mmio_read_active, vgic_mmio_write_cactive,
331 NULL, vgic_mmio_uaccess_write_cactive, 1,
330 VGIC_ACCESS_32bit), 332 VGIC_ACCESS_32bit),
331 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PRI, 333 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PRI,
332 vgic_mmio_read_priority, vgic_mmio_write_priority, 8, 334 vgic_mmio_read_priority, vgic_mmio_write_priority, NULL, NULL,
333 VGIC_ACCESS_32bit | VGIC_ACCESS_8bit), 335 8, VGIC_ACCESS_32bit | VGIC_ACCESS_8bit),
334 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_TARGET, 336 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_TARGET,
335 vgic_mmio_read_target, vgic_mmio_write_target, 8, 337 vgic_mmio_read_target, vgic_mmio_write_target, NULL, NULL, 8,
336 VGIC_ACCESS_32bit | VGIC_ACCESS_8bit), 338 VGIC_ACCESS_32bit | VGIC_ACCESS_8bit),
337 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_CONFIG, 339 REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_CONFIG,
338 vgic_mmio_read_config, vgic_mmio_write_config, 2, 340 vgic_mmio_read_config, vgic_mmio_write_config, NULL, NULL, 2,
339 VGIC_ACCESS_32bit), 341 VGIC_ACCESS_32bit),
340 REGISTER_DESC_WITH_LENGTH(GIC_DIST_SOFTINT, 342 REGISTER_DESC_WITH_LENGTH(GIC_DIST_SOFTINT,
341 vgic_mmio_read_raz, vgic_mmio_write_sgir, 4, 343 vgic_mmio_read_raz, vgic_mmio_write_sgir, 4,
diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
index 201d5e2e973d..714fa3933546 100644
--- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
+++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
@@ -456,11 +456,13 @@ static const struct vgic_register_region vgic_v3_dist_registers[] = {
456 vgic_mmio_read_raz, vgic_mmio_write_wi, 1, 456 vgic_mmio_read_raz, vgic_mmio_write_wi, 1,
457 VGIC_ACCESS_32bit), 457 VGIC_ACCESS_32bit),
458 REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ISACTIVER, 458 REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ISACTIVER,
459 vgic_mmio_read_active, vgic_mmio_write_sactive, NULL, NULL, 1, 459 vgic_mmio_read_active, vgic_mmio_write_sactive,
460 NULL, vgic_mmio_uaccess_write_sactive, 1,
460 VGIC_ACCESS_32bit), 461 VGIC_ACCESS_32bit),
461 REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ICACTIVER, 462 REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ICACTIVER,
462 vgic_mmio_read_active, vgic_mmio_write_cactive, NULL, NULL, 1, 463 vgic_mmio_read_active, vgic_mmio_write_cactive,
463 VGIC_ACCESS_32bit), 464 NULL, vgic_mmio_uaccess_write_cactive,
465 1, VGIC_ACCESS_32bit),
464 REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_IPRIORITYR, 466 REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_IPRIORITYR,
465 vgic_mmio_read_priority, vgic_mmio_write_priority, NULL, NULL, 467 vgic_mmio_read_priority, vgic_mmio_write_priority, NULL, NULL,
466 8, VGIC_ACCESS_32bit | VGIC_ACCESS_8bit), 468 8, VGIC_ACCESS_32bit | VGIC_ACCESS_8bit),
@@ -526,12 +528,14 @@ static const struct vgic_register_region vgic_v3_sgibase_registers[] = {
526 vgic_mmio_read_pending, vgic_mmio_write_cpending, 528 vgic_mmio_read_pending, vgic_mmio_write_cpending,
527 vgic_mmio_read_raz, vgic_mmio_write_wi, 4, 529 vgic_mmio_read_raz, vgic_mmio_write_wi, 4,
528 VGIC_ACCESS_32bit), 530 VGIC_ACCESS_32bit),
529 REGISTER_DESC_WITH_LENGTH(GICR_ISACTIVER0, 531 REGISTER_DESC_WITH_LENGTH_UACCESS(GICR_ISACTIVER0,
530 vgic_mmio_read_active, vgic_mmio_write_sactive, 4, 532 vgic_mmio_read_active, vgic_mmio_write_sactive,
531 VGIC_ACCESS_32bit), 533 NULL, vgic_mmio_uaccess_write_sactive,
532 REGISTER_DESC_WITH_LENGTH(GICR_ICACTIVER0, 534 4, VGIC_ACCESS_32bit),
533 vgic_mmio_read_active, vgic_mmio_write_cactive, 4, 535 REGISTER_DESC_WITH_LENGTH_UACCESS(GICR_ICACTIVER0,
534 VGIC_ACCESS_32bit), 536 vgic_mmio_read_active, vgic_mmio_write_cactive,
537 NULL, vgic_mmio_uaccess_write_cactive,
538 4, VGIC_ACCESS_32bit),
535 REGISTER_DESC_WITH_LENGTH(GICR_IPRIORITYR0, 539 REGISTER_DESC_WITH_LENGTH(GICR_IPRIORITYR0,
536 vgic_mmio_read_priority, vgic_mmio_write_priority, 32, 540 vgic_mmio_read_priority, vgic_mmio_write_priority, 32,
537 VGIC_ACCESS_32bit | VGIC_ACCESS_8bit), 541 VGIC_ACCESS_32bit | VGIC_ACCESS_8bit),
diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
index 1c17b2a2f105..c1e4bdd66131 100644
--- a/virt/kvm/arm/vgic/vgic-mmio.c
+++ b/virt/kvm/arm/vgic/vgic-mmio.c
@@ -231,56 +231,94 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
231 * be migrated while we don't hold the IRQ locks and we don't want to be 231 * be migrated while we don't hold the IRQ locks and we don't want to be
232 * chasing moving targets. 232 * chasing moving targets.
233 * 233 *
234 * For private interrupts, we only have to make sure the single and only VCPU 234 * For private interrupts we don't have to do anything because userspace
235 * that can potentially queue the IRQ is stopped. 235 * accesses to the VGIC state already require all VCPUs to be stopped, and
236 * only the VCPU itself can modify its private interrupts active state, which
237 * guarantees that the VCPU is not running.
236 */ 238 */
237static void vgic_change_active_prepare(struct kvm_vcpu *vcpu, u32 intid) 239static void vgic_change_active_prepare(struct kvm_vcpu *vcpu, u32 intid)
238{ 240{
239 if (intid < VGIC_NR_PRIVATE_IRQS) 241 if (intid > VGIC_NR_PRIVATE_IRQS)
240 kvm_arm_halt_vcpu(vcpu);
241 else
242 kvm_arm_halt_guest(vcpu->kvm); 242 kvm_arm_halt_guest(vcpu->kvm);
243} 243}
244 244
245/* See vgic_change_active_prepare */ 245/* See vgic_change_active_prepare */
246static void vgic_change_active_finish(struct kvm_vcpu *vcpu, u32 intid) 246static void vgic_change_active_finish(struct kvm_vcpu *vcpu, u32 intid)
247{ 247{
248 if (intid < VGIC_NR_PRIVATE_IRQS) 248 if (intid > VGIC_NR_PRIVATE_IRQS)
249 kvm_arm_resume_vcpu(vcpu);
250 else
251 kvm_arm_resume_guest(vcpu->kvm); 249 kvm_arm_resume_guest(vcpu->kvm);
252} 250}
253 251
254void vgic_mmio_write_cactive(struct kvm_vcpu *vcpu, 252static void __vgic_mmio_write_cactive(struct kvm_vcpu *vcpu,
255 gpa_t addr, unsigned int len, 253 gpa_t addr, unsigned int len,
256 unsigned long val) 254 unsigned long val)
257{ 255{
258 u32 intid = VGIC_ADDR_TO_INTID(addr, 1); 256 u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
259 int i; 257 int i;
260 258
261 vgic_change_active_prepare(vcpu, intid);
262 for_each_set_bit(i, &val, len * 8) { 259 for_each_set_bit(i, &val, len * 8) {
263 struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); 260 struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i);
264 vgic_mmio_change_active(vcpu, irq, false); 261 vgic_mmio_change_active(vcpu, irq, false);
265 vgic_put_irq(vcpu->kvm, irq); 262 vgic_put_irq(vcpu->kvm, irq);
266 } 263 }
267 vgic_change_active_finish(vcpu, intid);
268} 264}
269 265
270void vgic_mmio_write_sactive(struct kvm_vcpu *vcpu, 266void vgic_mmio_write_cactive(struct kvm_vcpu *vcpu,
271 gpa_t addr, unsigned int len, 267 gpa_t addr, unsigned int len,
272 unsigned long val) 268 unsigned long val)
273{ 269{
274 u32 intid = VGIC_ADDR_TO_INTID(addr, 1); 270 u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
275 int i;
276 271
272 mutex_lock(&vcpu->kvm->lock);
277 vgic_change_active_prepare(vcpu, intid); 273 vgic_change_active_prepare(vcpu, intid);
274
275 __vgic_mmio_write_cactive(vcpu, addr, len, val);
276
277 vgic_change_active_finish(vcpu, intid);
278 mutex_unlock(&vcpu->kvm->lock);
279}
280
281void vgic_mmio_uaccess_write_cactive(struct kvm_vcpu *vcpu,
282 gpa_t addr, unsigned int len,
283 unsigned long val)
284{
285 __vgic_mmio_write_cactive(vcpu, addr, len, val);
286}
287
288static void __vgic_mmio_write_sactive(struct kvm_vcpu *vcpu,
289 gpa_t addr, unsigned int len,
290 unsigned long val)
291{
292 u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
293 int i;
294
278 for_each_set_bit(i, &val, len * 8) { 295 for_each_set_bit(i, &val, len * 8) {
279 struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); 296 struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i);
280 vgic_mmio_change_active(vcpu, irq, true); 297 vgic_mmio_change_active(vcpu, irq, true);
281 vgic_put_irq(vcpu->kvm, irq); 298 vgic_put_irq(vcpu->kvm, irq);
282 } 299 }
300}
301
302void vgic_mmio_write_sactive(struct kvm_vcpu *vcpu,
303 gpa_t addr, unsigned int len,
304 unsigned long val)
305{
306 u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
307
308 mutex_lock(&vcpu->kvm->lock);
309 vgic_change_active_prepare(vcpu, intid);
310
311 __vgic_mmio_write_sactive(vcpu, addr, len, val);
312
283 vgic_change_active_finish(vcpu, intid); 313 vgic_change_active_finish(vcpu, intid);
314 mutex_unlock(&vcpu->kvm->lock);
315}
316
317void vgic_mmio_uaccess_write_sactive(struct kvm_vcpu *vcpu,
318 gpa_t addr, unsigned int len,
319 unsigned long val)
320{
321 __vgic_mmio_write_sactive(vcpu, addr, len, val);
284} 322}
285 323
286unsigned long vgic_mmio_read_priority(struct kvm_vcpu *vcpu, 324unsigned long vgic_mmio_read_priority(struct kvm_vcpu *vcpu,
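
The rework above splits each active-state handler into a shared worker plus two entry points: the guest MMIO path now serializes on kvm->lock (which becomes the top of the locking order, see the vgic.c hunk later in this diff) and stops the whole guest before touching active state, while the new uaccess_* path is used for userspace accesses, which already run with every VCPU stopped, so it calls the worker directly. A stripped-down model of that shape, names invented:

  /*
   * Shape of the split introduced above, modeled with invented names: one
   * shared worker, a guest-MMIO wrapper that serializes and stops vcpus,
   * and a uaccess wrapper that relies on the caller having stopped them.
   */
  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;
  static bool irq_active[64];

  static void change_active(unsigned int intid, bool active)
  {
          irq_active[intid] = active;     /* the real helper walks each set bit */
  }

  static void mmio_write_sactive(unsigned int intid)
  {
          pthread_mutex_lock(&vm_lock);   /* like mutex_lock(&vcpu->kvm->lock) */
          /* vgic_change_active_prepare(): halt the guest for shared interrupts */
          change_active(intid, true);
          /* vgic_change_active_finish(): resume the guest */
          pthread_mutex_unlock(&vm_lock);
  }

  static void uaccess_write_sactive(unsigned int intid)
  {
          /* Userspace accesses already run with every VCPU stopped. */
          change_active(intid, true);
  }

  int main(void)
  {
          mmio_write_sactive(35);
          uaccess_write_sactive(17);
          printf("irq 35 active=%d, irq 17 active=%d\n", irq_active[35], irq_active[17]);
          return 0;
  }
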
diff --git a/virt/kvm/arm/vgic/vgic-mmio.h b/virt/kvm/arm/vgic/vgic-mmio.h
index ea4171acdef3..5693f6df45ec 100644
--- a/virt/kvm/arm/vgic/vgic-mmio.h
+++ b/virt/kvm/arm/vgic/vgic-mmio.h
@@ -75,7 +75,7 @@ extern struct kvm_io_device_ops kvm_io_gic_ops;
75 * The _WITH_LENGTH version instantiates registers with a fixed length 75 * The _WITH_LENGTH version instantiates registers with a fixed length
76 * and is mutually exclusive with the _PER_IRQ version. 76 * and is mutually exclusive with the _PER_IRQ version.
77 */ 77 */
78#define REGISTER_DESC_WITH_BITS_PER_IRQ(off, rd, wr, bpi, acc) \ 78#define REGISTER_DESC_WITH_BITS_PER_IRQ(off, rd, wr, ur, uw, bpi, acc) \
79 { \ 79 { \
80 .reg_offset = off, \ 80 .reg_offset = off, \
81 .bits_per_irq = bpi, \ 81 .bits_per_irq = bpi, \
@@ -83,6 +83,8 @@ extern struct kvm_io_device_ops kvm_io_gic_ops;
83 .access_flags = acc, \ 83 .access_flags = acc, \
84 .read = rd, \ 84 .read = rd, \
85 .write = wr, \ 85 .write = wr, \
86 .uaccess_read = ur, \
87 .uaccess_write = uw, \
86 } 88 }
87 89
88#define REGISTER_DESC_WITH_LENGTH(off, rd, wr, length, acc) \ 90#define REGISTER_DESC_WITH_LENGTH(off, rd, wr, length, acc) \
@@ -165,6 +167,14 @@ void vgic_mmio_write_sactive(struct kvm_vcpu *vcpu,
165 gpa_t addr, unsigned int len, 167 gpa_t addr, unsigned int len,
166 unsigned long val); 168 unsigned long val);
167 169
170void vgic_mmio_uaccess_write_cactive(struct kvm_vcpu *vcpu,
171 gpa_t addr, unsigned int len,
172 unsigned long val);
173
174void vgic_mmio_uaccess_write_sactive(struct kvm_vcpu *vcpu,
175 gpa_t addr, unsigned int len,
176 unsigned long val);
177
168unsigned long vgic_mmio_read_priority(struct kvm_vcpu *vcpu, 178unsigned long vgic_mmio_read_priority(struct kvm_vcpu *vcpu,
169 gpa_t addr, unsigned int len); 179 gpa_t addr, unsigned int len);
170 180
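
To route those userspace accesses, the REGISTER_DESC_* macros grow uaccess_read/uaccess_write slots, and the tables in vgic-mmio-v2.c and vgic-mmio-v3.c pass either NULL (keep the guest handler) or a dedicated vgic_mmio_uaccess_write_* callback. A stripped-down model of a descriptor table with a NULL fallback; the names are invented and the fallback behaviour is an assumption, not taken from this hunk:

  /*
   * Stripped-down model of a register descriptor that carries separate
   * guest-MMIO and userspace-access callbacks; a NULL uaccess callback
   * falls back to the guest handler.
   */
  #include <stddef.h>
  #include <stdio.h>

  struct reg_desc {
          unsigned int offset;
          void (*write)(unsigned int offset, unsigned long val);
          void (*uaccess_write)(unsigned int offset, unsigned long val);
  };

  static void guest_write_active(unsigned int off, unsigned long val)
  {
          printf("guest write %#x <- %#lx (stops vcpus first)\n", off, val);
  }

  static void uaccess_write_active(unsigned int off, unsigned long val)
  {
          printf("uaccess write %#x <- %#lx (vcpus already stopped)\n", off, val);
  }

  static const struct reg_desc regs[] = {
          { .offset = 0x300, .write = guest_write_active,
            .uaccess_write = uaccess_write_active },
          { .offset = 0x400, .write = guest_write_active }, /* NULL uaccess: fall back */
  };

  static void uaccess_write(const struct reg_desc *r, unsigned long val)
  {
          if (r->uaccess_write)
                  r->uaccess_write(r->offset, val);
          else
                  r->write(r->offset, val);
  }

  int main(void)
  {
          uaccess_write(&regs[0], 1);
          uaccess_write(&regs[1], 1);
          return 0;
  }
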
diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
index 030248e669f6..96ea597db0e7 100644
--- a/virt/kvm/arm/vgic/vgic-v3.c
+++ b/virt/kvm/arm/vgic/vgic-v3.c
@@ -21,6 +21,10 @@
21 21
22#include "vgic.h" 22#include "vgic.h"
23 23
24static bool group0_trap;
25static bool group1_trap;
26static bool common_trap;
27
24void vgic_v3_set_underflow(struct kvm_vcpu *vcpu) 28void vgic_v3_set_underflow(struct kvm_vcpu *vcpu)
25{ 29{
26 struct vgic_v3_cpu_if *cpuif = &vcpu->arch.vgic_cpu.vgic_v3; 30 struct vgic_v3_cpu_if *cpuif = &vcpu->arch.vgic_cpu.vgic_v3;
@@ -258,6 +262,12 @@ void vgic_v3_enable(struct kvm_vcpu *vcpu)
258 262
259 /* Get the show on the road... */ 263 /* Get the show on the road... */
260 vgic_v3->vgic_hcr = ICH_HCR_EN; 264 vgic_v3->vgic_hcr = ICH_HCR_EN;
265 if (group0_trap)
266 vgic_v3->vgic_hcr |= ICH_HCR_TALL0;
267 if (group1_trap)
268 vgic_v3->vgic_hcr |= ICH_HCR_TALL1;
269 if (common_trap)
270 vgic_v3->vgic_hcr |= ICH_HCR_TC;
261} 271}
262 272
263int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq) 273int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
@@ -429,6 +439,26 @@ out:
429 return ret; 439 return ret;
430} 440}
431 441
442DEFINE_STATIC_KEY_FALSE(vgic_v3_cpuif_trap);
443
444static int __init early_group0_trap_cfg(char *buf)
445{
446 return strtobool(buf, &group0_trap);
447}
448early_param("kvm-arm.vgic_v3_group0_trap", early_group0_trap_cfg);
449
450static int __init early_group1_trap_cfg(char *buf)
451{
452 return strtobool(buf, &group1_trap);
453}
454early_param("kvm-arm.vgic_v3_group1_trap", early_group1_trap_cfg);
455
456static int __init early_common_trap_cfg(char *buf)
457{
458 return strtobool(buf, &common_trap);
459}
460early_param("kvm-arm.vgic_v3_common_trap", early_common_trap_cfg);
461
432/** 462/**
433 * vgic_v3_probe - probe for a GICv3 compatible interrupt controller in DT 463 * vgic_v3_probe - probe for a GICv3 compatible interrupt controller in DT
434 * @node: pointer to the DT node 464 * @node: pointer to the DT node
@@ -480,6 +510,21 @@ int vgic_v3_probe(const struct gic_kvm_info *info)
480 if (kvm_vgic_global_state.vcpu_base == 0) 510 if (kvm_vgic_global_state.vcpu_base == 0)
481 kvm_info("disabling GICv2 emulation\n"); 511 kvm_info("disabling GICv2 emulation\n");
482 512
513#ifdef CONFIG_ARM64
514 if (cpus_have_const_cap(ARM64_WORKAROUND_CAVIUM_30115)) {
515 group0_trap = true;
516 group1_trap = true;
517 }
518#endif
519
520 if (group0_trap || group1_trap || common_trap) {
521 kvm_info("GICv3 sysreg trapping enabled ([%s%s%s], reduced performance)\n",
522 group0_trap ? "G0" : "",
523 group1_trap ? "G1" : "",
524 common_trap ? "C" : "");
525 static_branch_enable(&vgic_v3_cpuif_trap);
526 }
527
483 kvm_vgic_global_state.vctrl_base = NULL; 528 kvm_vgic_global_state.vctrl_base = NULL;
484 kvm_vgic_global_state.type = VGIC_V3; 529 kvm_vgic_global_state.type = VGIC_V3;
485 kvm_vgic_global_state.max_gic_vcpus = VGIC_V3_MAX_CPUS; 530 kvm_vgic_global_state.max_gic_vcpus = VGIC_V3_MAX_CPUS;
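
The early_param() hooks above make the GICv3 sysreg traps controllable from the host kernel command line (values are parsed by strtobool(), so 0/1/y/n work), and the Cavium erratum 30115 workaround forces the group-0 and group-1 traps on by itself. For example, booting the host with

  kvm-arm.vgic_v3_group0_trap=1 kvm-arm.vgic_v3_group1_trap=1 kvm-arm.vgic_v3_common_trap=1

enables all three; whenever any of them is active, KVM prints the "GICv3 sysreg trapping enabled" line above and enables the vgic_v3_cpuif_trap static key so the hyp-mode handler earlier in this diff gets a chance to emulate the trapped accesses.
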
diff --git a/virt/kvm/arm/vgic/vgic.c b/virt/kvm/arm/vgic/vgic.c
index 83b24d20ff8f..fed717e07938 100644
--- a/virt/kvm/arm/vgic/vgic.c
+++ b/virt/kvm/arm/vgic/vgic.c
@@ -35,11 +35,12 @@ struct vgic_global kvm_vgic_global_state __ro_after_init = {
35 35
36/* 36/*
37 * Locking order is always: 37 * Locking order is always:
38 * its->cmd_lock (mutex) 38 * kvm->lock (mutex)
39 * its->its_lock (mutex) 39 * its->cmd_lock (mutex)
40 * vgic_cpu->ap_list_lock 40 * its->its_lock (mutex)
41 * kvm->lpi_list_lock 41 * vgic_cpu->ap_list_lock
42 * vgic_irq->irq_lock 42 * kvm->lpi_list_lock
43 * vgic_irq->irq_lock
43 * 44 *
44 * If you need to take multiple locks, always take the upper lock first, 45 * If you need to take multiple locks, always take the upper lock first,
45 * then the lower ones, e.g. first take the its_lock, then the irq_lock. 46 * then the lower ones, e.g. first take the its_lock, then the irq_lock.
@@ -234,10 +235,14 @@ static void vgic_sort_ap_list(struct kvm_vcpu *vcpu)
234 235
235/* 236/*
236 * Only valid injection if changing level for level-triggered IRQs or for a 237 * Only valid injection if changing level for level-triggered IRQs or for a
237 * rising edge. 238 * rising edge, and in-kernel connected IRQ lines can only be controlled by
239 * their owner.
238 */ 240 */
239static bool vgic_validate_injection(struct vgic_irq *irq, bool level) 241static bool vgic_validate_injection(struct vgic_irq *irq, bool level, void *owner)
240{ 242{
243 if (irq->owner != owner)
244 return false;
245
241 switch (irq->config) { 246 switch (irq->config) {
242 case VGIC_CONFIG_LEVEL: 247 case VGIC_CONFIG_LEVEL:
243 return irq->line_level != level; 248 return irq->line_level != level;
@@ -285,8 +290,10 @@ retry:
285 * won't see this one until it exits for some other 290 * won't see this one until it exits for some other
286 * reason. 291 * reason.
287 */ 292 */
288 if (vcpu) 293 if (vcpu) {
294 kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
289 kvm_vcpu_kick(vcpu); 295 kvm_vcpu_kick(vcpu);
296 }
290 return false; 297 return false;
291 } 298 }
292 299
@@ -332,6 +339,7 @@ retry:
332 spin_unlock(&irq->irq_lock); 339 spin_unlock(&irq->irq_lock);
333 spin_unlock(&vcpu->arch.vgic_cpu.ap_list_lock); 340 spin_unlock(&vcpu->arch.vgic_cpu.ap_list_lock);
334 341
342 kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
335 kvm_vcpu_kick(vcpu); 343 kvm_vcpu_kick(vcpu);
336 344
337 return true; 345 return true;
@@ -346,13 +354,16 @@ retry:
346 * false: to ignore the call 354 * false: to ignore the call
347 * Level-sensitive true: raise the input signal 355 * Level-sensitive true: raise the input signal
348 * false: lower the input signal 356 * false: lower the input signal
357 * @owner: The opaque pointer to the owner of the IRQ being raised to verify
358 * that the caller is allowed to inject this IRQ. Userspace
359 * injections will have owner == NULL.
349 * 360 *
350 * The VGIC is not concerned with devices being active-LOW or active-HIGH for 361 * The VGIC is not concerned with devices being active-LOW or active-HIGH for
351 * level-sensitive interrupts. You can think of the level parameter as 1 362 * level-sensitive interrupts. You can think of the level parameter as 1
352 * being HIGH and 0 being LOW and all devices being active-HIGH. 363 * being HIGH and 0 being LOW and all devices being active-HIGH.
353 */ 364 */
354int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int intid, 365int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int intid,
355 bool level) 366 bool level, void *owner)
356{ 367{
357 struct kvm_vcpu *vcpu; 368 struct kvm_vcpu *vcpu;
358 struct vgic_irq *irq; 369 struct vgic_irq *irq;
@@ -374,7 +385,7 @@ int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int intid,
374 385
375 spin_lock(&irq->irq_lock); 386 spin_lock(&irq->irq_lock);
376 387
377 if (!vgic_validate_injection(irq, level)) { 388 if (!vgic_validate_injection(irq, level, owner)) {
378 /* Nothing to see here, move along... */ 389 /* Nothing to see here, move along... */
379 spin_unlock(&irq->irq_lock); 390 spin_unlock(&irq->irq_lock);
380 vgic_put_irq(kvm, irq); 391 vgic_put_irq(kvm, irq);
@@ -431,6 +442,39 @@ int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, unsigned int virt_irq)
431} 442}
432 443
433/** 444/**
445 * kvm_vgic_set_owner - Set the owner of an interrupt for a VM
446 *
447 * @vcpu: Pointer to the VCPU (used for PPIs)
448 * @intid: The virtual INTID identifying the interrupt (PPI or SPI)
449 * @owner: Opaque pointer to the owner
450 *
451 * Returns 0 if intid is not already used by another in-kernel device and the
452 * owner is set, otherwise returns an error code.
453 */
454int kvm_vgic_set_owner(struct kvm_vcpu *vcpu, unsigned int intid, void *owner)
455{
456 struct vgic_irq *irq;
457 int ret = 0;
458
459 if (!vgic_initialized(vcpu->kvm))
460 return -EAGAIN;
461
462 /* SGIs and LPIs cannot be wired up to any device */
463 if (!irq_is_ppi(intid) && !vgic_valid_spi(vcpu->kvm, intid))
464 return -EINVAL;
465
466 irq = vgic_get_irq(vcpu->kvm, vcpu, intid);
467 spin_lock(&irq->irq_lock);
468 if (irq->owner && irq->owner != owner)
469 ret = -EEXIST;
470 else
471 irq->owner = owner;
472 spin_unlock(&irq->irq_lock);
473
474 return ret;
475}
476
477/**
434 * vgic_prune_ap_list - Remove non-relevant interrupts from the list 478 * vgic_prune_ap_list - Remove non-relevant interrupts from the list
435 * 479 *
436 * @vcpu: The VCPU pointer 480 * @vcpu: The VCPU pointer
@@ -721,8 +765,10 @@ void vgic_kick_vcpus(struct kvm *kvm)
721 * a good kick... 765 * a good kick...
722 */ 766 */
723 kvm_for_each_vcpu(c, vcpu, kvm) { 767 kvm_for_each_vcpu(c, vcpu, kvm) {
724 if (kvm_vgic_vcpu_pending_irq(vcpu)) 768 if (kvm_vgic_vcpu_pending_irq(vcpu)) {
769 kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
725 kvm_vcpu_kick(vcpu); 770 kvm_vcpu_kick(vcpu);
771 }
726 } 772 }
727} 773}
728 774
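
kvm_vgic_set_owner() and the new owner argument to kvm_vgic_inject_irq() close a hole where two in-kernel users (or userspace) could drive the same interrupt line: a device claims an INTID once, and every later injection must present the same opaque cookie or vgic_validate_injection() silently ignores it (userspace injections pass NULL). The PMU hunk earlier in this diff is the first user, claiming its overflow IRQ with &vcpu->arch.pmu. A small standalone model of the validation rule for a level-triggered line:

  /*
   * Standalone model of the ownership rule enforced by
   * vgic_validate_injection() above: an injection is only accepted when the
   * caller's cookie matches the registered owner (NULL == userspace).
   */
  #include <stdbool.h>
  #include <stdio.h>

  struct virq {
          void *owner;
          bool line_level;
  };

  static int set_owner(struct virq *irq, void *owner)
  {
          if (irq->owner && irq->owner != owner)
                  return -1;                      /* -EEXIST in the kernel */
          irq->owner = owner;
          return 0;
  }

  static bool validate_injection(struct virq *irq, bool level, void *owner)
  {
          if (irq->owner != owner)
                  return false;                   /* wrong owner: ignore the call */
          return irq->line_level != level;        /* level-triggered: only changes count */
  }

  int main(void)
  {
          struct virq overflow = { 0 };
          int pmu_cookie, other_cookie;

          set_owner(&overflow, &pmu_cookie);

          printf("pmu raises line:    %d\n", validate_injection(&overflow, true, &pmu_cookie));
          printf("other device tries: %d\n", validate_injection(&overflow, true, &other_cookie));
          printf("userspace (NULL):   %d\n", validate_injection(&overflow, true, NULL));
          return 0;
  }
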
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f0fe9d02f6bb..19f0ecb9b93e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -73,17 +73,17 @@ MODULE_LICENSE("GPL");
73 73
74/* Architectures should define their poll value according to the halt latency */ 74/* Architectures should define their poll value according to the halt latency */
75unsigned int halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT; 75unsigned int halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT;
76module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR); 76module_param(halt_poll_ns, uint, 0644);
77EXPORT_SYMBOL_GPL(halt_poll_ns); 77EXPORT_SYMBOL_GPL(halt_poll_ns);
78 78
79/* Default doubles per-vcpu halt_poll_ns. */ 79/* Default doubles per-vcpu halt_poll_ns. */
80unsigned int halt_poll_ns_grow = 2; 80unsigned int halt_poll_ns_grow = 2;
81module_param(halt_poll_ns_grow, uint, S_IRUGO | S_IWUSR); 81module_param(halt_poll_ns_grow, uint, 0644);
82EXPORT_SYMBOL_GPL(halt_poll_ns_grow); 82EXPORT_SYMBOL_GPL(halt_poll_ns_grow);
83 83
84/* Default resets per-vcpu halt_poll_ns . */ 84/* Default resets per-vcpu halt_poll_ns . */
85unsigned int halt_poll_ns_shrink; 85unsigned int halt_poll_ns_shrink;
86module_param(halt_poll_ns_shrink, uint, S_IRUGO | S_IWUSR); 86module_param(halt_poll_ns_shrink, uint, 0644);
87EXPORT_SYMBOL_GPL(halt_poll_ns_shrink); 87EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
88 88
89/* 89/*
@@ -3191,6 +3191,12 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
3191 return PTR_ERR(file); 3191 return PTR_ERR(file);
3192 } 3192 }
3193 3193
3194 /*
3195 * Don't call kvm_put_kvm anymore at this point; file->f_op is
3196 * already set, with ->release() being kvm_vm_release(). In error
3197 * cases it will be called by the final fput(file) and will take
3198 * care of doing kvm_put_kvm(kvm).
3199 */
3194 if (kvm_create_vm_debugfs(kvm, r) < 0) { 3200 if (kvm_create_vm_debugfs(kvm, r) < 0) {
3195 put_unused_fd(r); 3201 put_unused_fd(r);
3196 fput(file); 3202 fput(file);