aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm/x86.c
Commit message (Collapse)AuthorAge
...
| * | x86: kvm: Make kvm_get_time_and_clockread() nanoseconds basedThomas Gleixner2014-07-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert the relevant base data right away to nanoseconds instead of doing the conversion on every readout. Reduces text size by 160 bytes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Gleb Natapov <gleb@kernel.org> Cc: kvm@vger.kernel.org Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
| * | x86: kvm: Use ktime_get_boot_ns()Thomas Gleixner2014-07-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the new nanoseconds based interface and get rid of the timespec conversion dance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Gleb Natapov <gleb@kernel.org> Cc: kvm@vger.kernel.org Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
* | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2014-08-04
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM changes from Paolo Bonzini: "These are the x86, MIPS and s390 changes; PPC and ARM will come in a few days. MIPS and s390 have little going on this release; just bugfixes, some small, some larger. The highlights for x86 are nested VMX improvements (Jan Kiszka), optimizations for old processor (up to Nehalem, by me and Bandan Das), and a lot of x86 emulator bugfixes (Nadav Amit). Stephen Rothwell reported a trivial conflict with the tracing branch" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (104 commits) x86/kvm: Resolve shadow warnings in macro expansion KVM: s390: rework broken SIGP STOP interrupt handling KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table KVM: vmx: remove duplicate vmx_mpx_supported() prototype KVM: s390: Fix memory leak on busy SIGP stop x86/kvm: Resolve shadow warning from min macro kvm: Resolve missing-field-initializers warnings Replace NR_VMX_MSR with its definition KVM: x86: Assertions to check no overrun in MSR lists KVM: x86: set rflags.rf during fault injection KVM: x86: Setting rflags.rf during rep-string emulation KVM: x86: DR6/7.RTM cannot be written KVM: nVMX: clean up nested_release_vmcs12 and code around it KVM: nVMX: fix lifetime issues for vmcs02 KVM: x86: Defining missing x86 vectors KVM: x86: emulator injects #DB when RFLAGS.RF is set KVM: x86: Cleanup of rflags.rf cleaning KVM: x86: Clear rflags.rf on emulated instructions KVM: x86: popf emulation should not change RF KVM: x86: Clearing rflags.rf upon skipped emulated instruction ...
| * | | KVM: x86: Assertions to check no overrun in MSR listsNadav Amit2014-07-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently there is no check whether shared MSRs list overrun the allocated size which can results in bugs. In addition there is no check that vmx->guest_msrs has sufficient space to accommodate all the VMX msrs. This patch adds the assertions. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: set rflags.rf during fault injectionNadav Amit2014-07-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | x86 does not automatically set rflags.rf during event injection. This patch does partial job, setting rflags.rf upon fault injection. It does not handle the setting of RF upon interrupt injection on rep-string instruction. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: DR6/7.RTM cannot be writtenNadav Amit2014-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Haswell and newer Intel CPUs have support for RTM, and in that case DR6.RTM is not fixed to 1 and DR7.RTM is not fixed to zero. That is not the case in the current KVM implementation. This bug is apparent only if the MOV-DR instruction is emulated or the host also debugs the guest. This patch is a partial fix which enables DR6.RTM and DR7.RTM to be cleared and set respectively. It also sets DR6.RTM upon every debug exception. Obviously, it is not a complete fix, as debugging of RTM is still unsupported. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: emulator injects #DB when RFLAGS.RF is setNadav Amit2014-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the RFLAGS.RF is set, then no #DB should occur on instruction breakpoints. However, the KVM emulator injects #DB regardless to RFLAGS.RF. This patch fixes this behavior. KVM, however, still appears not to update RFLAGS.RF correctly, regardless of this patch. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: Clearing rflags.rf upon skipped emulated instructionNadav Amit2014-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When skipping an emulated instruction, rflags.rf should be cleared as it would be on real x86 CPU. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: use kvm_read_guest_page for emulator accessesPaolo Bonzini2014-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Emulator accesses are always done a page at a time, either by the emulator itself (for fetches) or because we need to query the MMU for address translations. Speed up these accesses by using kvm_read_guest_page and, in the case of fetches, by inlining kvm_read_guest_virt_helper and dropping the loop around kvm_read_guest_page. This final tweak saves 30-100 more clock cycles (4-10%), bringing the count (as measured by kvm-unit-tests) down to 720-1100 clock cycles on a Sandy Bridge Xeon host, compared to 2300-3200 before the whole series and 925-1700 after the first two low-hanging fruit changes. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: emulate: move init_decode_cache to emulate.cBandan Das2014-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Core emulator functions all belong in emulator.c, x86 should have no knowledge of emulator internals Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: avoid useless set of KVM_REQ_EVENT after emulationPaolo Bonzini2014-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Despite the provisions to emulate up to 130 consecutive instructions, in practice KVM will emulate just one before exiting handle_invalid_guest_state, because x86_emulate_instruction always sets KVM_REQ_EVENT. However, we only need to do this if an interrupt could be injected, which happens a) if an interrupt shadow bit (STI or MOV SS) has gone away; b) if the interrupt flag has just been set (other instructions than STI can set it without enabling an interrupt shadow). This cuts another 700-900 cycles from the cost of emulating an instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles before the patch on kvm-unit-tests, 925-1700 afterwards). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: return all bits from get_interrupt_shadowPaolo Bonzini2014-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the next patch we will need to know the full state of the interrupt shadow; we will then set KVM_REQ_EVENT when one bit is cleared. However, right now get_interrupt_shadow only returns the one corresponding to the emulated instruction, or an unconditional 0 if the emulated instruction does not have an interrupt shadow. This is confusing and does not allow us to check for cleared bits as mentioned above. Clean the callback up, and modify toggle_interruptibility to match the comment above the call. As a small result, the call to set_interrupt_shadow will be skipped in the common case where int_shadow == 0 && mask == 0. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: svm: writes to MSR_K7_HWCR generates GPE in guestMatthias Lange2014-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 575203 the MCE subsystem in the Linux kernel for AMD sets bit 18 in MSR_K7_HWCR. Running such a kernel as a guest in KVM on an AMD host results in a GPE injected into the guest because kvm_set_msr_common returns 1. This patch fixes this by masking bit 18 from the MSR value desired by the guest. Signed-off-by: Matthias Lange <matthias.lange@kernkonzept.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: Pending interrupt may be delivered after INITNadav Amit2014-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We encountered a scenario in which after an INIT is delivered, a pending interrupt is delivered, although it was sent before the INIT. As the SDM states in section 10.4.7.1, the ISR and the IRR should be cleared after INIT as KVM does. This also means that pending interrupts should be cleared. This patch clears upon reset (and INIT) the pending interrupts; and at the same occassion clears the pending exceptions, since they may cause a similar issue. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: fix TSC matchingTomasz Grabiec2014-07-09
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I've observed kvmclock being marked as unstable on a modern single-socket system with a stable TSC and qemu-1.6.2 or qemu-2.0.0. The culprit was failure in TSC matching because of overflow of kvm_arch::nr_vcpus_matched_tsc in case there were multiple TSC writes in a single synchronization cycle. Turns out that qemu does multiple TSC writes during init, below is the evidence of that (qemu-2.0.0): The first one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm] 0xffffffffa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm] The second one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel] 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm] 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm] 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm] 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm] #0 kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780 #1 kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270 #2 kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909 #3 kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641 #4 cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330 #5 cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521 #6 main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390 The third one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel] 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm] 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm] 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm] 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm] #0 kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780 #1 kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270 #2 kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909 #3 kvm_cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635 #4 cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323 #5 cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512 #6 main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482 The fix is to count each vCPU only once when matched, so that nr_vcpus_matched_tsc holds the size of the matched set. This is achieved by reusing generation counters. Every vCPU with this_tsc_generation == cur_tsc_generation is in the matched set. The match set is cleared by setting cur_tsc_generation to a value which no other vCPU is set to (by incrementing it). I needed to bump up the counter size form u8 to u64 to ensure it never overflows. Otherwise in cases TSC is not written the same number of times on each vCPU the counter could overflow and incorrectly indicate some vCPUs as being in the matched set. This scenario seems unlikely but I'm not sure if it can be disregarded. Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | arch: x86: kvm: x86.c: Cleaning up variable is set more than onceRickard Strandqvist2014-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | A struct member variable is set to the same value more than once This was found using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: Hypercall handling does not considers opsize correctlyNadav Amit2014-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the hypercall handling routine only considers LME as an indication to whether the guest uses 32/64-bit mode. This is incosistent with hyperv hypercalls handling and against the common sense of considering cs.l as well. This patch uses is_64_bit_mode instead of is_long_mode for that matter. In addition, the result is masked in respect to the guest execution mode. Last, it changes kvm_hv_hypercall to use is_64_bit_mode as well to simplify the code. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: rdpmc emulation checks the counter incorrectlyNadav Amit2014-06-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rdpmc emulation checks that the counter (ECX) is not higher than 2, without taking into considerations bits 30:31 role (e.g., bit 30 marks whether the counter is fixed). The fix uses the pmu information for checking the validity of the pmu counter. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: Check for nested events if there is an injectable interruptBandan Das2014-07-08
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit b6b8a1451fc40412c57d1 that introduced vmx_check_nested_events, checks for injectable interrupts happen at different points in time for L1 and L2 that could potentially cause a race. The regression occurs because KVM_REQ_EVENT is always set when nested_run_pending is set even if there's no pending interrupt. Consequently, there could be a small window when check_nested_events returns without exiting to L1, but an interrupt comes through soon after and it incorrectly, gets injected to L2 by inject_pending_event Fix this by adding a call to check for nested events too when a check for injectable interrupt returns true Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm: fix wrong address when writing Hyper-V tsc pageXiaoming Gao2014-06-19
|/ | | | | | | | | | | | | When kvm_write_guest writes the tsc_ref structure to the guest, or it will lead the low HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT bits of the TSC page address must be cleared, or the guest can see a non-zero sequence number. Otherwise Windows guests would not be able to get a correct clocksource (QueryPerformanceCounter will always return 0) which causes serious chaos. Signed-off-by: Xiaoming Gao <newtongao@tencnet.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into nextLinus Torvalds2014-06-04
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "At over 200 commits, covering almost all supported architectures, this was a pretty active cycle for KVM. Changes include: - a lot of s390 changes: optimizations, support for migration, GDB support and more - ARM changes are pretty small: support for the PSCI 0.2 hypercall interface on both the guest and the host (the latter acked by Catalin) - initial POWER8 and little-endian host support - support for running u-boot on embedded POWER targets - pretty large changes to MIPS too, completing the userspace interface and improving the handling of virtualized timer hardware - for x86, a larger set of changes is scheduled for 3.17. Still, we have a few emulator bugfixes and support for running nested fully-virtualized Xen guests (para-virtualized Xen guests have always worked). And some optimizations too. The only missing architecture here is ia64. It's not a coincidence that support for KVM on ia64 is scheduled for removal in 3.17" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits) KVM: add missing cleanup_srcu_struct KVM: PPC: Book3S PR: Rework SLB switching code KVM: PPC: Book3S PR: Use SLB entry 0 KVM: PPC: Book3S HV: Fix machine check delivery to guest KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs KVM: PPC: Book3S HV: Make sure we don't miss dirty pages KVM: PPC: Book3S HV: Fix dirty map for hugepages KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number KVM: PPC: Book3S: Add ONE_REG register names that were missed KVM: PPC: Add CAP to indicate hcall fixes KVM: PPC: MPIC: Reset IRQ source private members KVM: PPC: Graciously fail broken LE hypercalls PPC: ePAPR: Fix hypercall on LE guest KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler KVM: PPC: BOOK3S: Always use the saved DAR value PPC: KVM: Make NX bit available with magic page KVM: PPC: Disable NX for old magic page using guests KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest ...
| * KVM: x86: drop set_rflags callbackPaolo Bonzini2014-05-22
| | | | | | | | | | | | | | Not needed anymore now that the CPL is computed directly during task switch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: Fix CR3 reserved bits check in long modeJan Kiszka2014-05-12
| | | | | | | | | | | | | | | | Regression of 346874c9: PAE is set in long mode, but that does not mean we have valid PDPTRs. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm/x86: implement hv EOI assistMichael S. Tsirkin2014-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems that it's easy to implement the EOI assist on top of the PV EOI feature: simply convert the page address to the format expected by PV EOI. Notes: -"No EOI required" is set only if interrupt injected is edge triggered; this is true because level interrupts are going through IOAPIC which disables PV EOI. In any case, if guest triggers EOI the bit will get cleared on exit. -For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE seems sufficient In any case, bit is cleared on exit so worst case it's never re-enabled -no handling of PV EOI data is performed at HV_X64_MSR_EOI write; HV_X64_MSR_EOI is a separate optimization - it's an X2APIC replacement that lets you do EOI with an MSR and not IO. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: improve the usability of the 'kvm_pio' tracepointUlrich Obergfell2014-05-05
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch moves the 'kvm_pio' tracepoint to emulator_pio_in_emulated() and emulator_pio_out_emulated(), and it adds an argument (a pointer to the 'pio_data'). A single 8-bit or 16-bit or 32-bit data item is fetched from 'pio_data' (depending on 'size'), and the value is included in the trace record ('val'). If 'count' is greater than one, this is indicated by the string "(...)" in the trace output. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: MMU: flush tlb out of mmu lock when write-protect the sptesXiao Guangrong2014-04-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we can flush all the TLBs out of the mmu lock without TLB corruption when write-proect the sptes, it is because: - we have marked large sptes readonly instead of dropping them that means we just change the spte from writable to readonly so that we only need to care the case of changing spte from present to present (changing the spte from present to nonpresent will flush all the TLBs immediately), in other words, the only case we need to care is mmu_spte_update() - in mmu_spte_update(), we haved checked SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that means it does not depend on PT_WRITABLE_MASK anymore Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * KVM: MMU: lazily drop large spteXiao Guangrong2014-04-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, kvm zaps the large spte if write-protected is needed, the later read can fault on that spte. Actually, we can make the large spte readonly instead of making them un-present, the page fault caused by read access can be avoided The idea is from Avi: | As I mentioned before, write-protecting a large spte is a good idea, | since it moves some work from protect-time to fault-time, so it reduces | jitter. This removes the need for the return value. This version has fixed the issue reported in 6b73a9606, the reason of that issue is that fast_page_fault() directly sets the readonly large spte to writable but only dirty the first page into the dirty-bitmap that means other pages are missed. Fixed it by only the normal sptes (on the PT_PAGE_TABLE_LEVEL level) can be fast fixed Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * KVM: x86: Processor mode may be determined incorrectlyNadav Amit2014-04-23
| | | | | | | | | | | | | | | | If EFER.LMA is off, cs.l does not determine execution mode. Currently, the emulation engine assumes differently. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * KVM: x86: Fix CR3 reserved bitsNadav Amit2014-04-23
| | | | | | | | | | | | | | | | | | According to Intel specifications, PAE and non-PAE does not have any reserved bits. In long-mode, regardless to PCIDE, only the high bits (above the physical address) are reserved. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * KVM: support any-length wildcard ioeventfdMichael S. Tsirkin2014-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is sometimes benefitial to ignore IO size, and only match on address. In hindsight this would have been a better default than matching length when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind of access can be optimized on VMX: there no need to do page lookups. This can currently be done with many ioeventfds but in a suboptimal way. However we can't change kernel/userspace ABI without risk of breaking some applications. Use len = 0 to mean "ignore length for matching" in a more optimal way. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2014-05-28
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "Small fixes for x86, slightly larger fixes for PPC, and a forgotten s390 patch. The PPC fixes are important because they fix breakage that is new in 3.15" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: s390: announce irqfd capability KVM: x86: disable master clock if TSC is reset during suspend KVM: vmx: disable APIC virtualization in nested guests KVM guest: Make pv trampoline code executable KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bit KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit KVM: PPC: Book3S: HV: make _PAGE_NUMA take effect
| * | KVM: x86: disable master clock if TSC is reset during suspendMarcelo Tosatti2014-05-14
| |/ | | | | | | | | | | | | | | | | | | | | | | Updating system_time from the kernel clock once master clock has been enabled can result in time backwards event, in case kernel clock frequency is lower than TSC frequency. Disable master clock in case it is necessary to update it from the resume path. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* / asmlinkage, x86: Add explicit __visible to arch/x86/*Andi Kleen2014-05-05
|/ | | | | | | | | | | As requested by Linus add explicit __visible to the asmlinkage users. This marks all functions visible to assembler. Tree sweep for arch/x86/* Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1398984278-29319-3-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* Merge git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2014-04-14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Marcelo Tosatti: - Fix for guest triggerable BUG_ON (CVE-2014-0155) - CR4.SMAP support - Spurious WARN_ON() fix * git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: remove WARN_ON from get_kernel_ns() KVM: Rename variable smep to cr4_smep KVM: expose SMAP feature to guest KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode KVM: Add SMAP support when setting CR4 KVM: Remove SMAP bit from CR4_RESERVED_BITS KVM: ioapic: try to recover if pending_eoi goes out of range KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)
| * KVM: x86: remove WARN_ON from get_kernel_ns()Marcelo Tosatti2014-04-14
| | | | | | | | | | | | | | | | | | Function and callers can be preempted. https://bugzilla.kernel.org/show_bug.cgi?id=73721 Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: Add SMAP support when setting CR4Feng Wu2014-04-14
| | | | | | | | | | | | | | | | | | | | This patch adds SMAP handling logic when setting CR4 for guests Thanks a lot to Paolo Bonzini for his suggestion to use the branchless way to detect SMAP violation. Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | Merge tag 'cpu-hotplug-3.15-rc1' of ↵Linus Torvalds2014-04-07
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull CPU hotplug notifiers registration fixes from Rafael Wysocki: "The purpose of this single series of commits from Srivatsa S Bhat (with a small piece from Gautham R Shenoy) touching multiple subsystems that use CPU hotplug notifiers is to provide a way to register them that will not lead to deadlocks with CPU online/offline operations as described in the changelog of commit 93ae4f978ca7f ("CPU hotplug: Provide lockless versions of callback registration functions"). The first three commits in the series introduce the API and document it and the rest simply goes through the users of CPU hotplug notifiers and converts them to using the new method" * tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits) net/iucv/iucv.c: Fix CPU hotplug callback registration net/core/flow.c: Fix CPU hotplug callback registration mm, zswap: Fix CPU hotplug callback registration mm, vmstat: Fix CPU hotplug callback registration profile: Fix CPU hotplug callback registration trace, ring-buffer: Fix CPU hotplug callback registration xen, balloon: Fix CPU hotplug callback registration hwmon, via-cputemp: Fix CPU hotplug callback registration hwmon, coretemp: Fix CPU hotplug callback registration thermal, x86-pkg-temp: Fix CPU hotplug callback registration octeon, watchdog: Fix CPU hotplug callback registration oprofile, nmi-timer: Fix CPU hotplug callback registration intel-idle: Fix CPU hotplug callback registration clocksource, dummy-timer: Fix CPU hotplug callback registration drivers/base/topology.c: Fix CPU hotplug callback registration acpi-cpufreq: Fix CPU hotplug callback registration zsmalloc: Fix CPU hotplug callback registration scsi, fcoe: Fix CPU hotplug callback registration scsi, bnx2fc: Fix CPU hotplug callback registration scsi, bnx2i: Fix CPU hotplug callback registration ...
| * x86, kvm: Fix CPU hotplug callback registrationSrivatsa S. Bhat2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Instead, the correct and race-free way of performing the callback registration is: cpu_notifier_register_begin(); for_each_online_cpu(cpu) init_cpu(cpu); /* Note the use of the double underscored version of the API */ __register_cpu_notifier(&foobar_cpu_notifier); cpu_notifier_register_done(); Fix the kvm code in x86 by using this latter form of callback registration. Cc: Gleb Natapov <gleb@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | KVM: vmx: fix MPX detectionPaolo Bonzini2014-03-27
| | | | | | | | | | | | | | | | | | | | | | kvm_x86_ops is still NULL at this point. Since kvm_init_msr_list cannot fail, it is safe to initialize it before the call. Fixes: 93c4adc7afedf9b0ec190066d45b6d67db5270da Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Jet Chen <jet.chen@intel.com> Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: handle missing MPX in nested virtualizationPaolo Bonzini2014-03-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing nested virtualization, we may be able to read BNDCFGS but still not be allowed to write to GUEST_BNDCFGS in the VMCS. Guard writes to the field with vmx_mpx_supported(), and similarly hide the MSR from userspace if the processor does not support the field. We could work around this with the generic MSR save/load machinery, but there is only a limited number of MSR save/load slots and it is not really worthwhile to waste one for a scenario that should not happen except in the nested virtualization case. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: introduce kvm_supported_xcr0()Paolo Bonzini2014-03-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XSAVE support for KVM is already using host_xcr0 & KVM_SUPPORTED_XCR0 as a "dynamic" version of KVM_SUPPORTED_XCR0. However, this is not enough because the MPX bits should not be presented to the guest unless kvm_x86_ops confirms the support. So, replace all instances of host_xcr0 & KVM_SUPPORTED_XCR0 with a new function kvm_supported_xcr0() that also has this check. Note that here: if (xstate_bv & ~KVM_SUPPORTED_XCR0) return -EINVAL; if (xstate_bv & ~host_cr0) return -EINVAL; the code is equivalent to if ((xstate_bv & ~KVM_SUPPORTED_XCR0) || (xstate_bv & ~host_cr0) return -EINVAL; i.e. "xstate_bv & (~KVM_SUPPORTED_XCR0 | ~host_cr0)" which is in turn equal to "xstate_bv & ~(KVM_SUPPORTED_XCR0 & host_cr0)". So we should also use the new function there. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm: x86: ignore ioapic polarityGabriel L. Somlo2014-03-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both QEMU and KVM have already accumulated a significant number of optimizations based on the hard-coded assumption that ioapic polarity will always use the ActiveHigh convention, where the logical and physical states of level-triggered irq lines always match (i.e., active(asserted) == high == 1, inactive == low == 0). QEMU guests are expected to follow directions given via ACPI and configure the ioapic with polarity 0 (ActiveHigh). However, even when misbehaving guests (e.g. OS X <= 10.9) set the ioapic polarity to 1 (ActiveLow), QEMU will still use the ActiveHigh signaling convention when interfacing with KVM. This patch modifies KVM to completely ignore ioapic polarity as set by the guest OS, enabling misbehaving guests to work alongside those which comply with the ActiveHigh polarity specified by QEMU's ACPI tables. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu> [Move documentation to KVM_IRQ_LINE, add ia64. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Allow the guest to run with dirty debug registersPaolo Bonzini2014-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When not running in guest-debug mode, the guest controls the debug registers and having to take an exit for each DR access is a waste of time. If the guest gets into a state where each context switch causes DR to be saved and restored, this can take away as much as 40% of the execution time from the guest. After this patch, VMX- and SVM-specific code can set a flag in switch_db_regs, telling vcpu_enter_guest that on the next exit the debug registers might be dirty and need to be reloaded (syncing will be taken care of by a new callback in kvm_x86_ops). This flag can be set on the first access to a debug registers, so that multiple accesses to the debug registers only cause one vmexit. Note that since the guest will be able to read debug registers and enable breakpoints in DR7, we need to ensure that they are synchronized on entry to the guest---including DR6 that was not synced before. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: change vcpu->arch.switch_db_regs to a bit maskPaolo Bonzini2014-03-11
| | | | | | | | | | | | | | The next patch will add another bit that we can test with the same "if". Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Remove return code from enable_irq/nmi_windowJan Kiszka2014-03-11
| | | | | | | | | | | | | | | | | | | | | | It's no longer possible to enter enable_irq_window in guest mode when L1 intercepts external interrupts and we are entering L2. This is now caught in vcpu_enter_guest. So we can remove the check from the VMX version of enable_irq_window, thus the need to return an error code from both enable_irq_window and enable_nmi_window. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: nVMX: Rework interception of IRQs and NMIsJan Kiszka2014-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the check for leaving L2 on pending and intercepted IRQs or NMIs from the *_allowed handler into a dedicated callback. Invoke this callback at the relevant points before KVM checks if IRQs/NMIs can be injected. The callback has the task to switch from L2 to L1 if needed and inject the proper vmexit events. The rework fixes L2 wakeups from HLT and provides the foundation for preemption timer emulation. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | Merge tag 'kvm-for-3.15-1' of ↵Paolo Bonzini2014-03-04
|\| | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into kvm-next
| * kvm: x86: fix emulator buffer overflow (CVE-2014-0049)Andrew Honig2014-02-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem occurs when the guest performs a pusha with the stack address pointing to an mmio address (or an invalid guest physical address) to start with, but then extending into an ordinary guest physical address. When doing repeated emulated pushes emulator_read_write sets mmio_needed to 1 on the first one. On a later push when the stack points to regular memory, mmio_nr_fragments is set to 0, but mmio_is_needed is not set to 0. As a result, KVM exits to userspace, and then returns to complete_emulated_mmio. In complete_emulated_mmio vcpu->mmio_cur_fragment is incremented. The termination condition of vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments is never achieved. The code bounces back and fourth to userspace incrementing mmio_cur_fragment past it's buffer. If the guest does nothing else it eventually leads to a a crash on a memcpy from invalid memory address. However if a guest code can cause the vm to be destroyed in another vcpu with excellent timing, then kvm_clear_async_pf_completion_queue can be used by the guest to control the data that's pointed to by the call to cancel_work_item, which can be used to gain execution. Fixes: f78146b0f9230765c6315b2e14f56112513389ad Signed-off-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org (3.5+) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | x86: kvm: introduce periodic global clock updatesAndrew Jones2014-03-04
| | | | | | | | | | | | | | | | | | | | | | commit 0061d53daf26f introduced a mechanism to execute a global clock update for a vm. We can apply this periodically in order to propagate host NTP corrections. Also, if all vcpus of a vm are pinned, then without an additional trigger, no guest NTP corrections can propagate either, as the current trigger is only vcpu cpu migration. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | x86: kvm: rate-limit global clock updatesAndrew Jones2014-03-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we update a vcpu's local clock it may pick up an NTP correction. We can't wait an indeterminate amount of time for other vcpus to pick up that correction, so commit 0061d53daf26f introduced a global clock update. However, we can't request a global clock update on every vcpu load either (which is what happens if the tsc is marked as unstable). The solution is to rate-limit the global clock updates. Marcelo calculated that we should delay the global clock updates no more than 0.1s as follows: Assume an NTP correction c is applied to one vcpu, but not the other, then in n seconds the delta of the vcpu system_timestamps will be c * n. If we assume a correction of 500ppm (worst-case), then the two vcpus will diverge 50us in 0.1s, which is a considerable amount. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>