aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm
Commit message (Collapse)AuthorAge
...
| * | | kvm: x86: i8259: return initialized data on invalid-size readPetr Matousek2015-03-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If data is read from PIC with invalid access size, the return data stays uninitialized even though success is returned. Fix this by always initializing the data. Signed-off-by: Petr Matousek <pmatouse@redhat.com> Reported-by: Nadav Amit <nadav.amit@gmail.com> Message-Id: <20150311111609.GG8544@dhcp-25-225.brq.redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: nVMX: Add support for rdtscpJan Kiszka2015-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the guest CPU is supposed to support rdtscp and the host has rdtscp enabled in the secondary execution controls, we can also expose this feature to L1. Just extend nested_vmx_exit_handled to properly route EXIT_REASON_RDTSCP. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: x86: inline kvm_ioapic_handles_vector()Radim Krčmář2015-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An overhead from function call is not appropriate for its size and frequency of execution. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: SVM: Fix confusing message if no exit handlers are installedBandan Das2015-03-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I hit this path on a AMD box and thought someone was playing a April Fool's joke on me. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: x86: For the symbols used locally only should be static typeXiubo Li2015-03-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fix the following sparse warnings: for arch/x86/kvm/x86.c: warning: symbol 'emulator_read_write' was not declared. Should it be static? warning: symbol 'emulator_write_emulated' was not declared. Should it be static? warning: symbol 'emulator_get_dr' was not declared. Should it be static? warning: symbol 'emulator_set_dr' was not declared. Should it be static? for arch/x86/kvm/pmu.c: warning: symbol 'fixed_pmc_events' was not declared. Should it be static? Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: x86: Avoid using plain integer as NULL pointer warningXiubo Li2015-03-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fix the following sparse warning: for file arch/x86/kvm/x86.c: warning: Using plain integer as NULL pointer Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: nVMX: Do not emulate #UD while in guest modeJan Kiszka2015-03-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While in L2, leave all #UD to L2 and do not try to emulate it. If L1 is interested in doing this, it reports its interest via the exception bitmap, and we never get into handle_exception of L0 anyway. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: x86: Fix re-execution of patched vmmcallJan Kiszka2015-03-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a very long time (since 2b3d2a20), the path handling a vmmcall instruction of the guest on an Intel host only applied the patch but no longer handled the hypercall. The reverse case, vmcall on AMD hosts, is fine. As both em_vmcall and em_vmmcall actually have to do the same, we can fix the issue by consolidating both into the same handler. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | x86: svm: use cr_interception for SVM_EXIT_CR0_SEL_WRITEDavid Kaplan2015-03-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Another patch in my war on emulate_on_interception() use as a svm exit handler. These were pulled out of a larger patch at the suggestion of Radim Krcmar, see https://lkml.org/lkml/2015/2/25/559 Changes since v1: * fixed typo introduced after test, retested Signed-off-by: David Kaplan <david.kaplan@amd.com> [separated out just cr_interception part from larger removal of INTERCEPT_CR0_WRITE, forward ported, tested] Signed-off-by: Joel Schopp <joel.schopp@amd.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | kvm: svm: make wbinvd fasterDavid Kaplan2015-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No need to re-decode WBINVD since we know what it is from the intercept. Signed-off-by: David Kaplan <David.Kaplan@amd.com> [extracted from larger unlrelated patch, forward ported, tested,style cleanup] Signed-off-by: Joel Schopp <joel.schopp@amd.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | kvm: x86: make kvm_emulate_* consistantJoel Schopp2015-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently kvm_emulate() skips the instruction but kvm_emulate_* sometimes don't. The end reult is the caller ends up doing the skip themselves. Let's make them consistant. Signed-off-by: Joel Schopp <joel.schopp@amd.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: Get rid of kvm_kvfree()Thomas Huth2015-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_kvfree() provides exactly the same functionality as the new common kvfree() function - so let's simply replace the kvm function with the common function. Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: ioapic: Record edge-triggered interrupts delivery statusWincy Van2015-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the bug discussed in https://www.mail-archive.com/kvm@vger.kernel.org/msg109813.html This patch uses a new field named irr_delivered to record the delivery status of edge-triggered interrupts, and clears the delivered interrupts in kvm_get_ioapic. So it has the same effect of commit 0bc830b05c667218d703f2026ec866c49df974fc ("KVM: ioapic: clear IRR for edge-triggered interrupts at delivery") while avoids the bug of Windows guests. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | KVM: SVM: use kvm_register_write()/read()David Kaplan2015-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM has nice wrappers to access the register values, clean up a few places that should use them but currently do not. Signed-off-by: David Kaplan <david.kaplan@amd.com> [forward port and testing] Signed-off-by: Joel Schopp <joel.schopp@amd.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | | | KVM: x86: call irq notifiers with directed EOIRadim Krčmář2015-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled. We need to do that for irq notifiers. (Like with edge interrupts.) Fix it by skipping EOI broadcast only. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211 Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | | | KVM: nVMX: mask unrestricted_guest if disabled on L0Radim Krčmář2015-03-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If EPT was enabled, unrestricted_guest was allowed in L1 regardless of L0. L1 triple faulted when running L2 guest that required emulation. Another side effect was 'WARN_ON_ONCE(vmx->nested.nested_run_pending)' in L0's dmesg: WARNING: CPU: 0 PID: 0 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel] () Prevent this scenario by masking SECONDARY_EXEC_UNRESTRICTED_GUEST when the host doesn't have it enabled. Fixes: 78051e3b7e35 ("KVM: nVMX: Disable unrestricted mode if ept=0") Cc: stable@vger.kernel.org Tested-By: Kashyap Chamarthy <kchamart@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | | | KVM: VMX: Set msr bitmap correctly if vcpu is in guest modeWincy Van2015-03-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 3af18d9c5fe9 ("KVM: nVMX: Prepare for using hardware MSR bitmap"), we are setting MSR_BITMAP in prepare_vmcs02 if we should use hardware. This is not enough since the field will be modified by following vmx_set_efer. Fix this by setting vmx_msr_bitmap_nested in vmx_set_msr_bitmap if vcpu is in guest mode. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | | | kvm: x86: i8259: return initialized data on invalid-size readPetr Matousek2015-03-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If data is read from PIC with invalid access size, the return data stays uninitialized even though success is returned. Fix this by always initializing the data. Signed-off-by: Petr Matousek <pmatouse@redhat.com> Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | | | kvm: move advertising of KVM_CAP_IRQFD to common codePaolo Bonzini2015-03-10
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | POWER supports irqfds but forgot to advertise them. Some userspace does not check for the capability, but others check it---thus they work on x86 and s390 but not POWER. To avoid that other architectures in the future make the same mistake, let common code handle KVM_CAP_IRQFD the same way as KVM_CAP_IRQFD_RESAMPLE. Reported-and-tested-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Fixes: 297e21053a52f060944e9f0de4c64fad9bcd72fc Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* / / KVM: SVM: fix interrupt injection (apic->isr_count always 0)Radim Krčmář2015-03-02
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit b4eef9b36db4, we started to use hwapic_isr_update() != NULL instead of kvm_apic_vid_enabled(vcpu->kvm). This didn't work because SVM had it defined and "apicv" path in apic_{set,clear}_isr() does not change apic->isr_count, because it should always be 1. The initial value of apic->isr_count was based on kvm_apic_vid_enabled(vcpu->kvm), which is always 0 for SVM, so KVM could have injected interrupts when it shouldn't. Fix it by implicitly setting SVM's hwapic_isr_update to NULL and make the initial isr_count depend on hwapic_isr_update() for good measure. Fixes: b4eef9b36db4 ("kvm: x86: vmx: NULL out hwapic_isr_update() in case of !enable_apicv") Reported-and-tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | KVM: emulate: fix CMPXCHG8B on 32-bit hostsPaolo Bonzini2015-02-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This has been broken for a long time: it broke first in 2.6.35, then was almost fixed in 2.6.36 but this one-liner slipped through the cracks. The bug shows up as an infinite loop in Windows 7 (and newer) boot on 32-bit hosts without EPT. Windows uses CMPXCHG8B to write to page tables, which causes a page fault if running without EPT; the emulator is then called from kvm_mmu_page_fault. The loop then happens if the higher 4 bytes are not 0; the common case for this is that the NX bit (bit 63) is 1. Fixes: 6550e1f165f384f3a46b60a1be9aba4bc3c2adad Fixes: 16518d5ada690643453eb0aef3cc7841d3623c2d Cc: stable@vger.kernel.org # 2.6.35+ Reported-by: Erik Rull <erik.rull@rdsoftware.de> Tested-by: Erik Rull <erik.rull@rdsoftware.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: VMX: fix build without CONFIG_SMPRadim Krčmář2015-02-23
|/ | | | | | | | | | 'apic' is not defined if !CONFIG_X86_64 && !CONFIG_X86_LOCAL_APIC. Posted interrupt makes no sense without CONFIG_SMP, and CONFIG_X86_LOCAL_APIC will be set with it. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2015-02-16
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf updates from Ingo Molnar: "This series tightens up RDPMC permissions: currently even highly sandboxed x86 execution environments (such as seccomp) have permission to execute RDPMC, which may leak various perf events / PMU state such as timing information and other CPU execution details. This 'all is allowed' RDPMC mode is still preserved as the (non-default) /sys/devices/cpu/rdpmc=2 setting. The new default is that RDPMC access is only allowed if a perf event is mmap-ed (which is needed to correctly interpret RDPMC counter values in any case). As a side effect of these changes CR4 handling is cleaned up in the x86 code and a shadow copy of the CR4 value is added. The extra CR4 manipulation adds ~ <50ns to the context switch cost between rdpmc-capable and rdpmc-non-capable mms" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks perf/x86: Only allow rdpmc if a perf_event is mapped perf: Pass the event to arch_perf_update_userpage() perf: Add pmu callbacks to track event mapping and unmapping x86: Add a comment clarifying LDT context switching x86: Store a per-cpu shadow copy of CR4 x86: Clean up cr4 manipulation
| * x86: Store a per-cpu shadow copy of CR4Andy Lutomirski2015-02-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Context switches and TLB flushes can change individual bits of CR4. CR4 reads take several cycles, so store a shadow copy of CR4 in a per-cpu variable. To avoid wasting a cache line, I added the CR4 shadow to cpu_tlbstate, which is already touched in switch_mm. The heaviest users of the cr4 shadow will be switch_mm and __switch_to_xtra, and __switch_to_xtra is called shortly after switch_mm during context switch, so the cacheline is likely to be hot. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86: Clean up cr4 manipulationAndy Lutomirski2015-02-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CR4 manipulation was split, seemingly at random, between direct (write_cr4) and using a helper (set/clear_in_cr4). Unfortunately, the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code, which only a small subset of users actually wanted. This patch replaces all cr4 access in functions that don't leave cr4 exactly the way they found it with new helpers cr4_set_bits, cr4_clear_bits, and cr4_set_bits_and_update_boot. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2015-02-13
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM update from Paolo Bonzini: "Fairly small update, but there are some interesting new features. Common: Optional support for adding a small amount of polling on each HLT instruction executed in the guest (or equivalent for other architectures). This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This also has to be enabled manually for now, but the plan is to auto-tune this in the future. ARM/ARM64: The highlights are support for GICv3 emulation and dirty page tracking s390: Several optimizations and bugfixes. Also a first: a feature exposed by KVM (UUID and long guest name in /proc/sysinfo) before it is available in IBM's hypervisor! :) MIPS: Bugfixes. x86: Support for PML (page modification logging, a new feature in Broadwell Xeons that speeds up dirty page tracking), nested virtualization improvements (nested APICv---a nice optimization), usual round of emulation fixes. There is also a new option to reduce latency of the TSC deadline timer in the guest; this needs to be tuned manually. Some commits are common between this pull and Catalin's; I see you have already included his tree. Powerpc: Nothing yet. The KVM/PPC changes will come in through the PPC maintainers, because I haven't received them yet and I might end up being offline for some part of next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: ia64: drop kvm.h from installed user headers KVM: x86: fix build with !CONFIG_SMP KVM: x86: emulate: correct page fault error code for NoWrite instructions KVM: Disable compat ioctl for s390 KVM: s390: add cpu model support KVM: s390: use facilities and cpu_id per KVM KVM: s390/CPACF: Choose crypto control block format s390/kernel: Update /proc/sysinfo file with Extended Name and UUID KVM: s390: reenable LPP facility KVM: s390: floating irqs: fix user triggerable endless loop kvm: add halt_poll_ns module parameter kvm: remove KVM_MMIO_SIZE KVM: MIPS: Don't leak FPU/DSP to guest KVM: MIPS: Disable HTW while in guest KVM: nVMX: Enable nested posted interrupt processing KVM: nVMX: Enable nested virtual interrupt delivery KVM: nVMX: Enable nested apic register virtualization KVM: nVMX: Make nested control MSRs per-cpu KVM: nVMX: Enable nested virtualize x2apic mode KVM: nVMX: Prepare for using hardware MSR bitmap ...
| * | KVM: x86: fix build with !CONFIG_SMPRadim Krčmář2015-02-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | <asm/apic.h> isn't included directly and without CONFIG_SMP, an option that automagically pulls it can't be enabled. Reported-by: Jim Davis <jim.epost@gmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: emulate: correct page fault error code for NoWrite instructionsPaolo Bonzini2015-02-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NoWrite instructions (e.g. cmp or test) never set the "write access" bit in the error code, even if one of the operands is treated as a destination. Fixes: c205fb7d7d4f81e46fc577b707ceb9e356af1456 Cc: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | kvm: add halt_poll_ns module parameterPaolo Bonzini2015-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces a new module parameter for the KVM module; when it is present, KVM attempts a bit of polling on every HLT before scheduling itself out via kvm_vcpu_block. This parameter helps a lot for latency-bound workloads---in particular I tested it with O_DSYNC writes with a battery-backed disk in the host. In this case, writes are fast (because the data doesn't have to go all the way to the platters) but they cannot be merged by either the host or the guest. KVM's performance here is usually around 30% of bare metal, or 50% if you use cache=directsync or cache=writethrough (these parameters avoid that the guest sends pointless flush requests, and at the same time they are not slow because of the battery-backed cache). The bad performance happens because on every halt the host CPU decides to halt itself too. When the interrupt comes, the vCPU thread is then migrated to a new physical CPU, and in general the latency is horrible because the vCPU thread has to be scheduled back in. With this patch performance reaches 60-65% of bare metal and, more important, 99% of what you get if you use idle=poll in the guest. This means that the tunable gets rid of this particular bottleneck, and more work can be done to improve performance in the kernel or QEMU. Of course there is some price to pay; every time an otherwise idle vCPUs is interrupted by an interrupt, it will poll unnecessarily and thus impose a little load on the host. The above results were obtained with a mostly random value of the parameter (500000), and the load was around 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU. The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune the parameter. It counts how many HLT instructions received an interrupt during the polling period; each successful poll avoids that Linux schedules the VCPU thread out and back in, and may also avoid a likely trip to C1 and back for the physical CPU. While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second. Of these halts, almost all are failed polls. During the benchmark, instead, basically all halts end within the polling period, except a more or less constant stream of 50 per second coming from vCPUs that are not running the benchmark. The wasted time is thus very low. Things may be slightly different for Windows VMs, which have a ~10 ms timer tick. The effect is also visible on Marcelo's recently-introduced latency test for the TSC deadline timer. Though of course a non-RT kernel has awful latency bounds, the latency of the timer is around 8000-10000 clock cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC deadline timer, thus, the effect is both a smaller average latency and a smaller variance. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Enable nested posted interrupt processingWincy Van2015-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If vcpu has a interrupt in vmx non-root mode, injecting that interrupt requires a vmexit. With posted interrupt processing, the vmexit is not needed, and interrupts are fully taken care of by hardware. In nested vmx, this feature avoids much more vmexits than non-nested vmx. When L1 asks L0 to deliver L1's posted interrupt vector, and the target VCPU is in non-root mode, we use a physical ipi to deliver POSTED_INTR_NV to the target vCPU. Using POSTED_INTR_NV avoids unexpected interrupts if a concurrent vmexit happens and L1's vector is different with L0's. The IPI triggers posted interrupt processing in the target physical CPU. In case the target vCPU was not in guest mode, complete the posted interrupt delivery on the next entry to L2. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Enable nested virtual interrupt deliveryWincy Van2015-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With virtual interrupt delivery, the hardware lets KVM use a more efficient mechanism for interrupt injection. This is an important feature for nested VMX, because it reduces vmexits substantially and they are much more expensive with nested virtualization. This is especially important for throughput-bound scenarios. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Enable nested apic register virtualizationWincy Van2015-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | We can reduce apic register virtualization cost with this feature, it is also a requirement for virtual interrupt delivery and posted interrupt processing. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Make nested control MSRs per-cpuWincy Van2015-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To enable nested apicv support, we need per-cpu vmx control MSRs: 1. If in-kernel irqchip is enabled, we can enable nested posted interrupt, we should set posted intr bit in the nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, we can not enable nested posted interrupt, the posted intr bit in the nested_vmx_pinbased_ctls_high will be cleared. Since there would be different settings about in-kernel irqchip between VMs, different nested control MSRs are needed. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Enable nested virtualize x2apic modeWincy Van2015-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Prepare for using hardware MSR bitmapWincy Van2015-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, if L1 enables MSR_BITMAP, we will emulate this feature, all of L2's msr access is intercepted by L0. Features like "virtualize x2apic mode" require that the MSR bitmap is enabled, or the hardware will exit and for example not virtualize the x2apic MSRs. In order to let L1 use these features, we need to build a merged bitmap that only not cause a VMEXIT if 1) L1 requires that 2) the bit is not required by the processor for APIC virtualization. For now the guests are still run with MSR bitmap disabled, but this patch already introduces nested_vmx_merge_msr_bitmap for future use. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: revert "add method to test PIR bitmap vector"Marcelo Tosatti2015-02-02
| | | | | | | | | | | | | | | | | | | | | | | | Revert 7c6a98dfa1ba9dc64a62e73624ecea9995736bbd, given that testing PIR is not necessary anymore. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: fix lapic_timer_int_injected with APIC-vMarcelo Tosatti2015-02-02
| | | | | | | | | | | | | | | | | | | | | | | | With APICv, LAPIC timer interrupt is always delivered via IRR: apic_find_highest_irr syncs PIR to IRR. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | kvm: vmx: fix oops with explicit flexpriority=0 optionPaolo Bonzini2015-01-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A function pointer was not NULLed, causing kvm_vcpu_reload_apic_access_page to go down the wrong path and OOPS when doing put_page(NULL). This did not happen on old processors, only when setting the module option explicitly. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: fix x2apic logical address matchingRadim Krčmář2015-01-30
| | | | | | | | | | | | | | | | | | | | | We cannot hit the bug now, but future patches will expose this path. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: replace 0 with APIC_DEST_PHYSICALRadim Krčmář2015-01-30
| | | | | | | | | | | | | | | | | | | | | To make the code self-documenting. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: cleanup kvm_apic_match_*()Radim Krčmář2015-01-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The majority of this patch turns result = 0; if (CODE) result = 1; return result; into return CODE; because we return bool now. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: return bool from kvm_apic_match*()Radim Krčmář2015-01-30
| | | | | | | | | | | | | | | | | | | | | And don't export the internal ones while at it. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: VMX: Add PML support in VMXKai Huang2015-01-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds PML support in VMX. A new module parameter 'enable_pml' is added to allow user to enable/disable it manually. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: Add new dirty logging kvm_x86_ops for PMLKai Huang2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty logging for particular memory slot, and to flush potentially logged dirty GPAs before reporting slot->dirty_bitmap to userspace. kvm x86 common code calls these hooks when they are available so PML logic can be hidden to VMX specific. SVM won't be impacted as these hooks remain NULL there. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: Change parameter of kvm_mmu_slot_remove_write_accessKai Huang2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes the second parameter of kvm_mmu_slot_remove_write_access from 'slot id' to 'struct kvm_memory_slot *' to align with kvm_x86_ops dirty logging hooks, which will be introduced in further patch. Better way is to change second parameter of kvm_arch_commit_memory_region from 'struct kvm_userspace_memory_region *' to 'struct kvm_memory_slot * new', but it requires changes on other non-x86 ARCH too, so avoid it now. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: MMU: Explicitly set D-bit for writable spte.Kai Huang2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch avoids unnecessary dirty GPA logging to PML buffer in EPT violation path by setting D-bit manually prior to the occurrence of the write from guest. We only set D-bit manually in set_spte, and leave fast_page_fault path unchanged, as fast_page_fault is very unlikely to happen in case of PML. For the hva <-> pa change case, the spte is updated to either read-only (host pte is read-only) or be dropped (host pte is writeable), and both cases will be handled by above changes, therefore no change is necessary. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: MMU: Add mmu help functions to support PMLKai Huang2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds new mmu layer functions to clear/set D-bit for memory slot, and to write protect superpages for memory slot. In case of PML, CPU logs the dirty GPA automatically to PML buffer when CPU updates D-bit from 0 to 1, therefore we don't have to write protect 4K pages, instead, we only need to clear D-bit in order to log that GPA. For superpages, we still write protect it and let page fault code to handle dirty page logging, as we still need to split superpage to 4K pages in PML. As PML is always enabled during guest's lifetime, to eliminate unnecessary PML GPA logging, we set D-bit manually for the slot with dirty logging disabled. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log ↵Kai Huang2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dirty We don't have to write protect guest memory for dirty logging if architecture supports hardware dirty logging, such as PML on VMX, so rename it to be more generic. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | kvm: iommu: Add cond_resched to legacy device assignment codeJoerg Roedel2015-01-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When assigning devices to large memory guests (>=128GB guest memory in the failure case) the functions to create the IOMMU page-tables for the whole guest might run for a very long time. On non-preemptible kernels this might cause Soft-Lockup warnings. Fix these by adding a cond_resched() to the mapping and unmapping loops. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: Emulation of call may use incorrect stack sizeNadav Amit2015-01-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On long-mode, when far call that changes cs.l takes place, the stack size is determined by the new mode. For instance, if we go from 32-bit mode to 64-bit mode, the stack-size if 64. KVM uses the old stack size. Fix it. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>