path: root/arch/x86/kvm
Commit message (Author, Age)
...
* | KVM: x86: PREFETCH and HINT_NOP should have SrcMem flag (Nadav Amit, 2014-10-24)
| |   The decode phase of the x86 emulator assumes that every instruction with the ModRM flag, and which can be used with RIP-relative addressing, has either SrcMem or DstMem. This is not the case for several instructions - prefetch, hint-nop and clflush.
| |
| |   Adding SrcMem|NoAccess for prefetch and hint-nop and SrcMem for clflush.
| |
| |   This fixes CVE-2014-8480.
| |
| |   Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Emulator does not decode clflush well (Nadav Amit, 2014-10-24)
| |   Currently, all group15 instructions are decoded as clflush (e.g., mfence, xsave). In addition, the clflush instruction requires that no prefix (66/f2/f3) be present. If a prefix is present, it may encode a different instruction (e.g., clflushopt).
| |
| |   Creating a group for clflush, and a different group for each prefix.
| |
| |   This has been the case forever, but the next patch needs the clflush group in order to fix a bug introduced in 3.17.
| |
| |   Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: emulate: avoid accessing NULL ctxt->memopp (Paolo Bonzini, 2014-10-24)
| |   A failure to decode the instruction can cause a NULL pointer access. This is fixed simply by moving the "done" label as close as possible to the return.
| |
| |   This fixes CVE-2014-8481.
| |
| |   Reported-by: Andy Lutomirski <luto@amacapital.net>
| |   Cc: stable@vger.kernel.org
| |   Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Decoding guest instructions which cross page boundary may fail (Nadav Amit, 2014-10-24)
| |   Once an instruction crosses a page boundary, the size read from the second page disregards the common case that part of the operand resides on the first page. As a result, the fetch of long instructions may fail, and thereby cause the decoding to fail as well.
| |
| |   Cc: stable@vger.kernel.org
| |   Fixes: 5cfc7e0f5e5e1adf998df94f8e36edaf5d30d38e
| |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
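The fetch-size problem described in the commit above can be sketched in C as follows. This is not the kernel's code: the helper names (`bytes_to_page_end`, `next_fetch_size`), the parameter layout and the 4 KiB page size are assumptions made for illustration. The point is that the check after choosing a fetch size should compare against the bytes still missing, not against the full operand size, because part of the operand may already be buffered from the first page.

```c
#include <stdint.h>

#define PAGE_SIZE    4096u  /* assumed page size */
#define MAX_INSN_LEN 15u    /* architectural x86 instruction length limit */

/* Bytes readable starting at addr before the end of its page. */
static unsigned bytes_to_page_end(uint64_t addr)
{
    return PAGE_SIZE - (unsigned)(addr & (PAGE_SIZE - 1));
}

/*
 * Decide how many bytes the next fetch may read (hypothetical helper).
 *   fetch_addr: linear address of the next unfetched byte
 *   fetched:    instruction bytes fetched so far (must be <= MAX_INSN_LEN)
 *   buffered:   fetched bytes not yet consumed by the decoder
 *   op_size:    bytes the decoder needs for the current operand
 * Returns the fetch size, or -1 if the operand cannot be completed.
 */
static int next_fetch_size(uint64_t fetch_addr, unsigned fetched,
                           unsigned buffered, unsigned op_size)
{
    unsigned room = MAX_INSN_LEN - fetched;
    unsigned avail = bytes_to_page_end(fetch_addr);
    unsigned size = room < avail ? room : avail;
    /* Only the part of the operand that is not already buffered
     * has to come from this fetch. */
    unsigned need = op_size > buffered ? op_size - buffered : 0;

    return size < need ? -1 : (int)size;
}
```

With 12 bytes already fetched and 3 of them still buffered, a 4-byte operand needs only one more byte, so a 3-byte fetch from the second page is enough; comparing the fetch size against the full operand size would reject it.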
* | kvm: x86: don't kill guest on unknown exit reason (Michael S. Tsirkin, 2014-10-24)
| |   KVM_EXIT_UNKNOWN is a kvm bug; we don't really know whether it was triggered by a privileged application. Let's not kill the guest: WARN and inject #UD instead.
| |
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm: vmx: handle invvpid vm exit gracefully (Petr Matousek, 2014-10-24)
| |   On systems with invvpid instruction support (corresponding bit in IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid causes vm exit, which is currently not handled and results in propagation of unknown exit to userspace.
| |
| |   Fix this by installing an invvpid vm exit handler.
| |
| |   This is CVE-2014-3646.
| |
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Petr Matousek <pmatouse@redhat.com>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Handle errors when RIP is set during far jumps (Nadav Amit, 2014-10-24)
| |   Far jmp/call/ret may fault while loading a new RIP. Currently KVM does not handle this case, which may result in a failed vm-entry once the assignment is done. The tricky part of handling it is that loading the new CS affects the VMCS/VMCB state, so if we fail while loading the new RIP, we are left in an inconsistent state. Therefore, on 64-bit, this patch saves the old CS descriptor and restores it if loading RIP failed.
| |
| |   This fixes CVE-2014-3647.
| |
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Emulator fixes for eip canonical checks on near branches (Nadav Amit, 2014-10-24)
| |   Before changing rip (during jmp, call, ret, etc.) the target should be checked to be canonical, as real CPUs do. During sysret, both target rsp and rip should be canonical. If any of these values is noncanonical, a #GP exception should occur. The exceptions to this rule are the syscall and sysenter instructions, in which the assigned rip is checked during the assignment to the relevant MSRs.
| |
| |   This patch fixes the emulator to behave as real CPUs do for near branches. Far branches are handled by the next patch.
| |
| |   This fixes CVE-2014-3647.
| |
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Fix wrong masking on relative jump/call (Nadav Amit, 2014-10-24)
| |   Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today.
| |
| |   This patch fixes KVM behavior.
| |
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
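The difference between masking by operand size and masking by address size can be shown with a small C sketch. The function names (`mask_by_size`, `rel_branch_target`) are hypothetical and do not come from the emulator source; the sketch only illustrates that the branch target is truncated to the operand width.

```c
#include <stdint.h>

/* Keep only the low `bytes` bytes of value (bytes is 2, 4 or 8). */
static uint64_t mask_by_size(uint64_t value, unsigned bytes)
{
    return bytes >= 8 ? value : value & ((1ULL << (bytes * 8)) - 1);
}

/*
 * Target of a relative jump/call: the displacement is added to the
 * address of the next instruction, and the result is truncated to the
 * operand size (not the address size).
 */
static uint64_t rel_branch_target(uint64_t next_rip, int64_t rel,
                                  unsigned op_bytes)
{
    return mask_by_size(next_rip + (uint64_t)rel, op_bytes);
}
```

For example, with a 16-bit operand size a jump from 0xFFF0 with displacement 0x20 wraps to 0x0010, whereas truncating to a 32-bit width would give 0x10010.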
* | KVM: x86: Improve thread safety in pit (Andy Honig, 2014-10-24)
| |   There's a race condition in the PIT emulation code in KVM. In __kvm_migrate_pit_timer the pit_timer object is accessed without synchronization. If the race condition occurs at the wrong time this can crash the host kernel.
| |
| |   This fixes CVE-2014-3611.
| |
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Andrew Honig <ahonig@google.com>
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Prevent host from panicking on shared MSR writes. (Andy Honig, 2014-10-24)
| |   The previous patch blocked invalid writes directly when the MSR is written. As a precaution, prevent future similar mistakes by gracefully handling #GPs caused by writes to shared MSRs.
| |
| |   Cc: stable@vger.kernel.org
| |   Signed-off-by: Andrew Honig <ahonig@google.com>
| |   [Remove parts obsoleted by Nadav's patch. - Paolo]
| |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Check non-canonical addresses upon WRMSR (Nadav Amit, 2014-10-24)
|/    Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is written to certain MSRs. The behavior is "almost" identical for AMD and Intel (ignoring MSRs that are not implemented in either architecture since they would #GP anyhow). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if a non-canonical address is written on Intel but not on AMD (which ignores the top 32 bits).
|
|     Accordingly, this patch injects a #GP on the MSRs which behave identically on Intel and AMD. To eliminate the differences between the architectures, the value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned into a canonical value before writing instead of injecting a #GP.
|
|     Some references from Intel and AMD manuals:
|
|     According to the Intel SDM description of the WRMSR instruction, #GP is expected on WRMSR "If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
|
|     According to the AMD instruction manual:
|     LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical form, a general-protection exception (#GP) occurs."
|     IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the base field must be in canonical form or a #GP fault will occur."
|     IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must be in canonical form."
|
|     This patch fixes CVE-2014-3610.
|
|     Cc: stable@vger.kernel.org
|     Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
|     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
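The canonical-address rule that this commit and the branch-check commits rely on can be sketched in C. The helper names below are hypothetical, and the sketch assumes a 48-bit virtual address space, in which an address is canonical when bits 63..48 are copies of bit 47.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend bit 47 into bits 63..48 (48-bit address space assumed). */
static uint64_t make_canonical(uint64_t la)
{
    uint64_t low = la & 0x0000FFFFFFFFFFFFULL;

    return (la & (1ULL << 47)) ? (low | 0xFFFF000000000000ULL) : low;
}

/* An address is non-canonical if sign extension would change it. */
static bool is_noncanonical(uint64_t la)
{
    return make_canonical(la) != la;
}
```

Following the commit text, a write of a non-canonical value to an MSR such as IA32_LSTAR would be answered with #GP, while a value destined for IA32_SYSENTER_ESP or IA32_SYSENTER_EIP would first be passed through something like `make_canonical()`.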
* x86,kvm,vmx: Preserve CR4 across VM entry (Andy Lutomirski, 2014-10-18)
|   CR4 isn't constant; at least the TSD and PCE bits can vary.
|
|   TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks like it's correct.
|
|   This adds a branch and a read from cr4 to each vm entry. Because it is extremely likely that consecutive entries into the same vcpu will have the same host cr4 value, this fixes up the vmcs instead of restoring cr4 after the fact. A subsequent patch will add a kernel-wide cr4 shadow, reducing the overhead in the common case to just two memory reads and a branch.
|
|   Signed-off-by: Andy Lutomirski <luto@amacapital.net>
|   Acked-by: Paolo Bonzini <pbonzini@redhat.com>
|   Cc: stable@vger.kernel.org
|   Cc: Petr Matousek <pmatouse@redhat.com>
|   Cc: Gleb Natapov <gleb@kernel.org>
|   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu (Linus Torvalds, 2014-10-15)
|\
| |   Pull percpu consistent-ops changes from Tejun Heo:
| |
| |   "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_*() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations.
| |
| |   This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
| |
| |   Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr().
| |
| |   This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors"
| |
| |   * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
| |     irqchip: Properly fetch the per cpu offset
| |     percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
| |     ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
| |     percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
| |     Revert "powerpc: Replace __get_cpu_var uses"
| |     percpu: Remove __this_cpu_ptr
| |     clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
| |     sparc: Replace __get_cpu_var uses
| |     avr32: Replace __get_cpu_var with __this_cpu_write
| |     blackfin: Replace __get_cpu_var uses
| |     tile: Use this_cpu_ptr() for hardware counters
| |     tile: Replace __get_cpu_var uses
| |     powerpc: Replace __get_cpu_var uses
| |     alpha: Replace __get_cpu_var
| |     ia64: Replace __get_cpu_var uses
| |     s390: cio driver &__get_cpu_var replacements
| |     s390: Replace __get_cpu_var uses
| |     mips: Replace __get_cpu_var uses
| |     MIPS: Replace __get_cpu_var uses in FPU emulator.
| |     arm: Replace __this_cpu_ptr with raw_cpu_ptr
| |     ...
| * x86: Replace __get_cpu_var uses (Christoph Lameter, 2014-08-26)
| |   __get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset.
| |
| |   Other use cases are for storing and retrieving data from the current processor's percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment.
| |
| |   __get_cpu_var() is defined as:
| |
| |       #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
| |
| |   __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation.
| |
| |   this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables.
| |
| |   This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and fewer registers are used when code is generated.
| |
| |   Transformations done to __get_cpu_var()
| |
| |   1. Determine the address of the percpu instance of the current processor.
| |
| |       DEFINE_PER_CPU(int, y);
| |       int *x = &__get_cpu_var(y);
| |
| |      Converts to
| |
| |       int *x = this_cpu_ptr(&y);
| |
| |   2. Same as #1 but this time an array structure is involved.
| |
| |       DEFINE_PER_CPU(int, y[20]);
| |       int *x = __get_cpu_var(y);
| |
| |      Converts to
| |
| |       int *x = this_cpu_ptr(y);
| |
| |   3. Retrieve the content of the current processor's instance of a per cpu variable.
| |
| |       DEFINE_PER_CPU(int, y);
| |       int x = __get_cpu_var(y);
| |
| |      Converts to
| |
| |       int x = __this_cpu_read(y);
| |
| |   4. Retrieve the content of a percpu struct
| |
| |       DEFINE_PER_CPU(struct mystruct, y);
| |       struct mystruct x = __get_cpu_var(y);
| |
| |      Converts to
| |
| |       memcpy(&x, this_cpu_ptr(&y), sizeof(x));
| |
| |   5. Assignment to a per cpu variable
| |
| |       DEFINE_PER_CPU(int, y);
| |       __get_cpu_var(y) = x;
| |
| |      Converts to
| |
| |       __this_cpu_write(y, x);
| |
| |   6. Increment/Decrement etc of a per cpu variable
| |
| |       DEFINE_PER_CPU(int, y);
| |       __get_cpu_var(y)++;
| |
| |      Converts to
| |
| |       __this_cpu_inc(y);
| |
| |   Cc: Thomas Gleixner <tglx@linutronix.de>
| |   Cc: x86@kernel.org
| |   Acked-by: H. Peter Anvin <hpa@linux.intel.com>
| |   Acked-by: Ingo Molnar <mingo@kernel.org>
| |   Signed-off-by: Christoph Lameter <cl@linux.com>
| |   Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu (Linus Torvalds, 2014-10-10)
|\ \
| | |   Pull percpu updates from Tejun Heo:
| | |
| | |   "A lot of activities on percpu front. Notable changes are...
| | |
| | |   - percpu allocator now can take @gfp. If @gfp doesn't contain GFP_KERNEL, it tries to allocate from what's already available to the allocator and a work item tries to keep the reserve around certain level so that these atomic allocations usually succeed.
| | |
| | |     This will replace the ad-hoc percpu memory pool used by blk-throttle and also be used by the planned blkcg support for writeback IOs.
| | |
| | |     Please note that I noticed a bug in how @gfp is interpreted while preparing this pull request and applied the fix 6ae833c7fe0c ("percpu: fix how @gfp is interpreted by the percpu allocator") just now.
| | |
| | |   - percpu_ref now uses longs for percpu and global counters instead of ints. It leads to more sparse packing of the percpu counters on 64bit machines but the overhead should be negligible and this allows using percpu_ref for refcnting pages and in-memory objects directly.
| | |
| | |   - The switching between percpu and single counter modes of a percpu_ref is made independent of putting the base ref and a percpu_ref can now optionally be initialized in single or killed mode. This allows avoiding percpu shutdown latency for cases where the refcounted objects may be synchronously created and destroyed in rapid succession with only a fraction of them reaching fully operational status (SCSI probing does this when combined with blk-mq support). It's also planned to be used to implement forced single mode to detect underflow more timely for debugging.
| | |
| | |   There's a separate branch percpu/for-3.18-consistent-ops which cleans up the duplicate percpu accessors. That branch causes a number of conflicts with s390 and other trees. I'll send a separate pull request w/ resolutions once other branches are merged"
| | |
| | |   * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
| | |     percpu: fix how @gfp is interpreted by the percpu allocator
| | |     blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
| | |     percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
| | |     percpu_ref: add PERCPU_REF_INIT_* flags
| | |     percpu_ref: decouple switching to percpu mode and reinit
| | |     percpu_ref: decouple switching to atomic mode and killing
| | |     percpu_ref: add PCPU_REF_DEAD
| | |     percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
| | |     percpu_ref: replace pcpu_ prefix with percpu_
| | |     percpu_ref: minor code and comment updates
| | |     percpu_ref: relocate percpu_ref_reinit()
| | |     Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
| | |     Revert "percpu: free percpu allocation info for uniprocessor system"
| | |     percpu-refcount: make percpu_ref based on longs instead of ints
| | |     percpu-refcount: improve WARN messages
| | |     percpu: fix locking regression in the failure path of pcpu_alloc()
| | |     percpu-refcount: add @gfp to percpu_ref_init()
| | |     proportions: add @gfp to init functions
| | |     percpu_counter: add @gfp to percpu_counter_init()
| | |     percpu_counter: make percpu_counters_lock irq-safe
| | |     ...
| * \ Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 (Tejun Heo, 2014-09-24)
| |\ \
| | | |   This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit will be reverted and patches to implement a proper fix will be added.
| | | |
| | | |   Signed-off-by: Tejun Heo <tj@kernel.org>
| | | |   Cc: Kent Overstreet <kmo@daterainc.com>
| | | |   Cc: Jens Axboe <axboe@kernel.dk>
| | | |   Cc: Christoph Hellwig <hch@lst.de>
| * | | percpu_counter: add @gfp to percpu_counter_init() (Tejun Heo, 2014-09-07)
| | |/    Percpu allocator now supports allocation mask. Add @gfp to percpu_counter_init() so that !GFP_KERNEL allocation masks can be used with percpu_counters too.
| |/|
| | |   We could have left percpu_counter_init() alone and added percpu_counter_init_gfp(); however, the number of users isn't that high and introducing _gfp variants to all percpu data structures would be quite ugly, so let's just do the conversion. This is the one with the most users. Other percpu data structures are a lot easier to convert.
| | |
| | |   This patch doesn't make any functional difference.
| | |
| | |   Signed-off-by: Tejun Heo <tj@kernel.org>
| | |   Acked-by: Jan Kara <jack@suse.cz>
| | |   Acked-by: "David S. Miller" <davem@davemloft.net>
| | |   Cc: x86@kernel.org
| | |   Cc: Jens Axboe <axboe@kernel.dk>
| | |   Cc: "Theodore Ts'o" <tytso@mit.edu>
| | |   Cc: Alexander Viro <viro@zeniv.linux.org.uk>
| | |   Cc: Andrew Morton <akpm@linux-foundation.org>
* | | kvm: do not handle APIC access page if in-kernel irqchip is not in use (Paolo Bonzini, 2014-10-02)
| | |   This fixes the following OOPS:
| | |
| | |     loaded kvm module (v3.17-rc1-168-gcec26bc)
| | |     BUG: unable to handle kernel paging request at fffffffffffffffe
| | |     IP: [<ffffffff81168449>] put_page+0x9/0x30
| | |     PGD 1e15067 PUD 1e17067 PMD 0
| | |     Oops: 0000 [#1] PREEMPT SMP
| | |      [<ffffffffa063271d>] ? kvm_vcpu_reload_apic_access_page+0x5d/0x70 [kvm]
| | |      [<ffffffffa013b6db>] vmx_vcpu_reset+0x21b/0x470 [kvm_intel]
| | |      [<ffffffffa0658816>] ? kvm_pmu_reset+0x76/0xb0 [kvm]
| | |      [<ffffffffa064032a>] kvm_vcpu_reset+0x15a/0x1b0 [kvm]
| | |      [<ffffffffa06403ac>] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
| | |      [<ffffffffa062e540>] kvm_vm_ioctl+0x200/0x780 [kvm]
| | |      [<ffffffff81212170>] do_vfs_ioctl+0x2d0/0x4b0
| | |      [<ffffffff8108bd99>] ? __mmdrop+0x69/0xb0
| | |      [<ffffffff812123d1>] SyS_ioctl+0x81/0xa0
| | |      [<ffffffff8112a6f6>] ? __audit_syscall_exit+0x1f6/0x2a0
| | |      [<ffffffff817229e9>] system_call_fastpath+0x16/0x1b
| | |     Code: c6 78 ce a3 81 4c 89 e7 e8 d9 80 ff ff 0f 0b 4c 89 e7 e8 8f f6 ff ff e9 fa fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 1e 8b 47 1c 85 c0 74 27 f0
| | |     RIP [<ffffffff81193045>] put_page+0x5/0x50
| | |
| | |   when not using the in-kernel irqchip ("-machine kernel_irqchip=off" with QEMU). The fix is to make the same check in kvm_vcpu_reload_apic_access_page that we already have in vmx.c's vm_need_virtualize_apic_accesses().
| | |
| | |   Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
| | |   Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
| | |   Fixes: 4256f43f9fab91e1c17b5846a240cf4b66a768a8
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: x86: Unpin and remove kvm_arch->apic_access_page (Tang Chen, 2014-09-24)
| | |   In order to make the APIC access page migratable, stop pinning it in memory.
| | |
| | |   And because the APIC access page is not pinned in memory, we can remove kvm_arch->apic_access_page. When we need to write its physical address into vmcs, we use gfn_to_page() to get its page struct, which is needed to call page_to_phys(); the page is then immediately unpinned.
| | |
| | |   Suggested-by: Gleb Natapov <gleb@kernel.org>
| | |   Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: vmx: Implement set_apic_access_page_addr (Tang Chen, 2014-09-24)
| | |   Currently, the APIC access page is pinned by KVM for the entire life of the guest. We want to make it migratable in order to make memory hot-unplug available for machines that run KVM.
| | |
| | |   This patch prepares to handle this for the case where there is no nested virtualization, or where the nested guest does not have an APIC page of its own. All accesses to kvm->arch.apic_access_page are changed to go through kvm_vcpu_reload_apic_access_page.
| | |
| | |   If the APIC access page is invalidated when the host is running, we update the VMCS in the next guest entry.
| | |
| | |   If it is invalidated when the guest is running, the MMU notifier will force an exit, after which we will handle everything as in the previous case.
| | |
| | |   If it is invalidated when a nested guest is running, the request will update either the VMCS01 or the VMCS02. Updating the VMCS01 is done at the next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02.
| | |
| | |   Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: x86: Add request bit to reload APIC access page address (Tang Chen, 2014-09-24)
| | |   Currently, the APIC access page is pinned by KVM for the entire life of the guest. We want to make it migratable in order to make memory hot-unplug available for machines that run KVM.
| | |
| | |   This patch prepares to handle this in generic code, through a new request bit (that will be set by the MMU notifier) and a new hook that is called whenever the request bit is processed.
| | |
| | |   Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: Add arch specific mmu notifier for page invalidation (Tang Chen, 2014-09-24)
| | |   This will be used to let the guest run while the APIC access page is not pinned. Because subsequent patches will fill in the function for x86, place the (still empty) x86 implementation in the x86.c file instead of adding an inline function in kvm_host.h.
| | |
| | |   Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: Fix page ageing bugs (Andres Lagar-Cavilla, 2014-09-24)
| | |   1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.
| | |
| | |   2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.
| | |
| | |   3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.
| | |
| | |   Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
| | |   Acked-by: Rik van Riel <riel@redhat.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm/x86/mmu: Pass gfn and level to rmapp callback. (Andres Lagar-Cavilla, 2014-09-24)
| | |   Callbacks don't have to do extra computation to learn what the caller (kvm_handle_hva_range()) knows very well. Useful for debugging/tracing/printk/future.
| | |
| | |   Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: x86: use macros to compute bank MSRs (Chen Yucong, 2014-09-24)
| | |   Avoid open coded calculations for bank MSRs by using well-defined macros that hide the index of higher bank MSRs.
| | |
| | |   No semantic changes.
| | |
| | |   Signed-off-by: Chen Yucong <slaoub@gmail.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
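As an illustration of what such bank macros look like, here is a minimal C sketch. The macro names and base values follow the x86 machine-check MSR layout (four MSRs per bank, starting at 0x400) and are believed to match the kernel headers, but they are reproduced from memory and should be treated as assumptions; the two decoding helpers are hypothetical.

```c
#include <stdint.h>

/* Bank 0 machine-check MSRs; each bank occupies four consecutive MSRs. */
#define MSR_IA32_MC0_CTL     0x00000400
#define MSR_IA32_MC0_STATUS  0x00000401
#define MSR_IA32_MC0_ADDR    0x00000402
#define MSR_IA32_MC0_MISC    0x00000403

/* MSR index of a given register in bank x, hiding the "4 * x" arithmetic. */
#define MSR_IA32_MCx_CTL(x)     (MSR_IA32_MC0_CTL + 4 * (x))
#define MSR_IA32_MCx_STATUS(x)  (MSR_IA32_MC0_STATUS + 4 * (x))
#define MSR_IA32_MCx_ADDR(x)    (MSR_IA32_MC0_ADDR + 4 * (x))
#define MSR_IA32_MCx_MISC(x)    (MSR_IA32_MC0_MISC + 4 * (x))

/* Hypothetical helper: which bank does a machine-check MSR belong to? */
static unsigned mce_bank_of(uint32_t msr)
{
    return (msr - MSR_IA32_MC0_CTL) / 4;
}

/* Hypothetical helper: register offset within the bank (0 = CTL ... 3 = MISC). */
static unsigned mce_reg_of(uint32_t msr)
{
    return (msr - MSR_IA32_MC0_CTL) & 3;
}
```

A range check such as `msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(bank_count)` then reads more clearly than the open-coded `MSR_IA32_MC0_CTL + 4 * bank_count`.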
* | | KVM: x86: Remove debug assertion of non-PAE reserved bits (Nadav Amit, 2014-09-24)
| | |   Commit 346874c9507a ("KVM: x86: Fix CR3 reserved bits") removed non-PAE reserved bits which were not according to Intel SDM. However, residue was left in a debug assertion (CR3_NONPAE_RESERVED_BITS). Remove it.
| | |
| | |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: x86: fix two typos in comment (Tiejun Chen, 2014-09-24)
| | |   s/drity/dirty and s/vmsc01/vmcs01
| | |
| | |   Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: vmx: Inject #GP on invalid PAT CR (Nadav Amit, 2014-09-24)
| | |   A guest which sets the PAT CR to an invalid value should get a #GP. Currently, if vmx supports loading PAT CR during entry, then the value is not checked. This patch makes the required check in that case.
| | |
| | |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
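What an "invalid value" means for the PAT can be sketched in C. The PAT holds eight one-byte entries, and the architecturally defined memory types are 0, 1, 4, 5, 6 and 7; types 2 and 3 and anything above 7 are reserved. The function names below are hypothetical, and the sketch is not taken from the patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Valid PAT memory types: 0 (UC), 1 (WC), 4 (WT), 5 (WP), 6 (WB), 7 (UC-). */
static bool valid_pat_type(unsigned t)
{
    return t < 8 && ((1u << t) & 0xf3);  /* 0xf3 = bits 0, 1, 4, 5, 6, 7 */
}

/* A PAT value is valid only if each of its eight byte entries is valid. */
static bool valid_pat(uint64_t pat)
{
    for (unsigned i = 0; i < 8; i++) {
        if (!valid_pat_type((pat >> (i * 8)) & 0xff))
            return false;
    }
    return true;
}
```

A guest write for which `valid_pat()` returns false would be answered with #GP instead of being loaded on the next entry.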
* | | KVM: x86: emulating descriptor load misses long-mode case (Nadav Amit, 2014-09-24)
| | |   In 64-bit mode a #GP should be delivered to the guest "if the code segment descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit set and the D-bit clear." - Intel SDM "Interrupt 13 - General Protection Exception (#GP)".
| | |
| | |   This patch fixes the behavior of CS loading emulation code. Although the comment says that segment loading is not supported in long mode, this function is executed in long mode, so the fix is necessary.
| | |
| | |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: directly use kvm_make_request again (Liang Chen, 2014-09-24)
| | |   A one-line wrapper around kvm_make_request is not particularly useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().
| | |
| | |   Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: count actual tlb flushes (Radim Krčmář, 2014-09-24)
| | |   - we count KVM_REQ_TLB_FLUSH requests, not actual flushes (KVM can have multiple requests for one flush)
| | |   - flushes from kvm_flush_remote_tlbs aren't counted
| | |   - it's easy to make a direct request by mistake
| | |
| | |   Solve these by postponing the counting to kvm_check_request().
| | |
| | |   Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| | |   Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
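Why counting at the point where the request is consumed yields the number of actual flushes can be shown with a simplified C sketch. The structure and function names are hypothetical stand-ins, not KVM's, and the sketch is single-threaded, whereas real request bits are manipulated atomically.

```c
#include <stdbool.h>

#define REQ_TLB_FLUSH 0  /* hypothetical request bit number */

struct vcpu {
    unsigned long requests;        /* pending request bits */
    unsigned long tlb_flush_count; /* statistic: flushes actually performed */
};

/* Raising a request only sets a bit; repeated requests collapse into one. */
static void make_request(struct vcpu *v, int req)
{
    v->requests |= 1UL << req;
}

/*
 * Consume a pending request. The flush statistic is updated here, so
 * every requester is covered and each flush is counted exactly once.
 */
static bool check_request(struct vcpu *v, int req)
{
    unsigned long bit = 1UL << req;

    if (!(v->requests & bit))
        return false;
    v->requests &= ~bit;
    if (req == REQ_TLB_FLUSH)
        v->tlb_flush_count++;
    return true;
}
```

Two `make_request()` calls followed by one `check_request()` leave the counter at 1, which is the behaviour the commit message asks for.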
* | | KVM: nested VMX: disable perf cpuid reporting (Marcelo Tosatti, 2014-09-24)
| | |   Initialization of L2 guest with -cpu host, on L1 guest with -cpu host triggers:
| | |
| | |     (qemu) KVM: entry failed, hardware error 0x7
| | |     ...
| | |     nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported
| | |
| | |   Nested VMX MSR load/store support is not sufficient to allow perf for L2 guest.
| | |
| | |   Until properly fixed, trap CPUID and disable function 0xA.
| | |
| | |   Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: Don't report guest userspace emulation error to userspace (Nadav Amit, 2014-09-24)
| | |   Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to user-space") disabled the reporting of L2 (nested guest) emulation failures to userspace due to a race condition between a vmexit and the instruction emulator. The same rationale also applies to userspace applications that are permitted by the guest OS to access an MMIO area or perform PIO.
| | |
| | |   This patch extends the current behavior - of injecting a #UD instead of reporting it to userspace - also for guest userspace code.
| | |
| | |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: Make init_rmode_tss() return 0 on success. (Paolo Bonzini, 2014-09-24)
| | |   In init_rmode_tss(), there are two variables indicating the return value, r and ret, and it returns 0 on error, 1 on success. The function is only called by vmx_set_tss_addr(), and ret is redundant.
| | |
| | |   This patch removes the redundant variable, by making init_rmode_tss() return 0 on success, -errno on failure.
| | |
| | |   Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: Warn if guest virtual address space is not 48-bits (Nadav Amit, 2014-09-24)
| | |   The KVM emulator code assumes that the guest virtual address space (in 64-bit) is 48-bits wide. Fail the KVM_SET_CPUID and KVM_SET_CPUID2 ioctl if userspace tries to create a guest that does not obey this restriction.
| | |
| | |   Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
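The restriction can be sketched as a check on CPUID leaf 0x80000008, whose EAX register reports the linear (virtual) address width in bits 15:8. The helper names are hypothetical, and accepting a width of 0 (leaf absent or not populated) is an assumption of this sketch rather than something the commit message states.

```c
#include <errno.h>
#include <stdint.h>

/* CPUID.80000008H:EAX[15:8] = number of linear (virtual) address bits. */
static unsigned virt_addr_bits(uint32_t eax)
{
    return (eax >> 8) & 0xff;
}

/*
 * Hypothetical validation of a userspace-supplied CPUID entry: reject
 * any guest whose virtual address width is not 48 bits. A width of 0 is
 * tolerated here on the assumption that it means "not reported".
 */
static int check_guest_virt_bits(uint32_t eax)
{
    unsigned bits = virt_addr_bits(eax);

    if (bits != 48 && bits != 0)
        return -EINVAL;
    return 0;
}
```

For instance, EAX = 0x3028 (48 virtual bits, 40 physical bits) passes, while EAX = 0x3928 (57 virtual bits) is rejected.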
* | | kvm: Make init_rmode_identity_map() return 0 on success. (Tang Chen, 2014-09-17)
| | |   In init_rmode_identity_map(), there are two variables indicating the return value, r and ret, and it returns 0 on error, 1 on success. The function is only called by vmx_create_vcpu(), and ret is redundant.
| | |
| | |   This patch removes the redundant variable, and makes init_rmode_identity_map() return 0 on success, -errno on failure.
| | |
| | |   Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
| | |   Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: Remove ept_identity_pagetable from struct kvm_arch.Tang Chen2014-09-17
| | | kvm_arch->ept_identity_pagetable holds the EPT identity pagetable page, but it is never used to refer to the page at all. During vcpu initialization it indicates two things:
| | | 1. whether the EPT page is allocated
| | | 2. whether a memory slot for the identity page is initialized
| | | In fact, kvm_arch->ept_identity_pagetable_done is enough to tell whether the EPT identity pagetable is initialized, so ept_identity_pagetable can be removed.
| | | NOTE: In the original code, the EPT identity pagetable page is pinned in memory, so it cannot be migrated or hot-removed. After this patch, since kvm_arch->ept_identity_pagetable is removed, the page is no longer pinned in memory and can be migrated or hot-removed.
| | | Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
| | | Reviewed-by: Gleb Natapov <gleb@kernel.org>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: Use kvm_make_request when applicableGuo Hui Liu2014-09-16
| | | This patch replaces the set_bit method with kvm_make_request to make the code more readable and consistent.
| | | Signed-off-by: Guo Hui Liu <liuguohui@gmail.com>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: make apic_accept_irq tracepoint more genericPaolo Bonzini2014-09-11
| | | Initially the tracepoint was added only to the APIC_DM_FIXED case, also because it reported coalesced interrupts that only made sense for that case. However, the coalesced argument is not used anymore and tracing other delivery modes is useful, so hoist the call out of the switch statement.
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.Tang Chen2014-09-11
| | | We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of the APIC access page. So use this macro.
| | | Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
| | | Reviewed-by: Gleb Natapov <gleb@kernel.org>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: propagate exception from permission checks on the nested page faultPaolo Bonzini2014-09-05
| | | Currently, if a permission error happens during the translation of the final GPA to HPA, walk_addr_generic returns 0 but does not fill in walker->fault. To avoid this, add an x86_exception* argument to the translate_gpa function, and let it fill in walker->fault. The nested_page_fault field will be true, since the walk_mmu is the nested_mmu and translate_gpa instead operates on the "outer" (NPT) instance.
| | | Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: skip writeback on injection of nested exceptionPaolo Bonzini2014-09-05
| | | If a nested page fault happens during emulation, we will inject a vmexit, not a page fault. However, because writeback happens after the injection, we will write ctxt->eip from L2 into the L1 EIP. We do not write back if an instruction caused an interception vmexit; do the same for page faults.
| | | Suggested-by: Gleb Natapov <gleb@kernel.org>
| | | Reviewed-by: Gleb Natapov <gleb@kernel.org>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: nSVM: propagate the NPF EXITINFO to the guestPaolo Bonzini2014-09-03
| | | This is similar to what the EPT code does with the exit qualification. This allows the guest to see a valid value for bits 33:32.
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMDPaolo Bonzini2014-09-03
| | | Bit 8 would be the "global" bit, which does not quite make sense for non-leaf page table entries. Intel ignores it; AMD ignores it in PDEs, but reserves it in PDPEs and PML4Es. The SVM test is relying on this behavior, so enforce it.
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: mmio: cleanup kvm_set_mmio_spte_maskTiejun Chen2014-09-03
| | | Just reuse rsvd_bits() inside kvm_set_mmio_spte_mask() for slightly better code.
| | | Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
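For readers unfamiliar with the helper, a reserved-bit mask builder in the spirit of rsvd_bits() can be sketched as below. This is an illustration under assumptions: `rsvd_bits_sketch` is a hypothetical name, the inclusive (start, end) argument order is assumed, and the full-width special case is added here only to keep the shift well defined in portable C.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with bits s..e (inclusive) set; requires 0 <= s <= e <= 63. */
static uint64_t rsvd_bits_sketch(int s, int e)
{
    assert(s >= 0 && s <= e && e <= 63);

    /* Shifting a 64-bit value by 64 is undefined, so handle that case. */
    if (e - s + 1 == 64)
        return ~0ULL;

    return ((1ULL << (e - s + 1)) - 1) << s;
}

/* Nonzero if any reserved bit is set in a page-table entry. */
static int has_rsvd_bits_set(uint64_t entry, uint64_t rsvd_mask)
{
    return (entry & rsvd_mask) != 0;
}
```

A single reserved bit, such as bit 8 of a non-leaf entry, is `rsvd_bits_sketch(8, 8)`; the physical-address bits above a 40-bit limit are `rsvd_bits_sketch(40, 51)`.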
* | | kvm: x86: fix stale mmio cache bugDavid Matlack2014-09-03
| | | The following events can lead to an incorrect KVM_EXIT_MMIO bubbling up to userspace:
| | | (1) Guest accesses gpa X without a memory slot. The gfn is cached in struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets the SPTE write-execute-noread so that future accesses cause EPT_MISCONFIGs.
| | | (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION covering the page just accessed.
| | | (3) Guest attempts to read or write to gpa X again. On Intel, this generates an EPT_MISCONFIG. The memory slot generation number that was incremented in (2) would normally take care of this but we fast path mmio faults through quickly_check_mmio_pf(), which only checks the per-vcpu mmio cache. Since we hit the cache, KVM passes a KVM_EXIT_MMIO up to userspace.
| | | This patch fixes the issue by using the memslot generation number to validate the mmio cache.
| | | Cc: stable@vger.kernel.org
| | | Signed-off-by: David Matlack <dmatlack@google.com>
| | | [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
| | | Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
| | | Reviewed-by: David Matlack <dmatlack@google.com>
| | | Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
| | | Tested-by: David Matlack <dmatlack@google.com>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
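The idea of validating a per-vcpu cache against the memslot generation can be sketched in plain C. This is a simplified model, not the kernel's data structures: `struct mmio_cache_sketch` and both functions are hypothetical, and the generation is passed in as a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vcpu cache entry, tagged with the generation it saw. */
struct mmio_cache_sketch {
    uint64_t gfn;
    uint64_t gen;
    bool valid;
};

/* Remember that gfn was MMIO under the given memslot generation. */
static void cache_mmio(struct mmio_cache_sketch *c, uint64_t gfn,
                       uint64_t cur_gen)
{
    c->gfn = gfn;
    c->gen = cur_gen;
    c->valid = true;
}

/* A hit counts only if the memslots have not changed since caching. */
static bool cache_hit(const struct mmio_cache_sketch *c, uint64_t gfn,
                      uint64_t cur_gen)
{
    return c->valid && c->gen == cur_gen && c->gfn == gfn;
}
```

In the scenario from the commit message, step (1) calls `cache_mmio`, step (2) bumps the generation, and the lookup in step (3) then misses because the stored generation is stale, so the fault takes the slow path instead of exiting to userspace.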
* | | kvm: fix potentially corrupt mmio cacheDavid Matlack2014-09-03
| | | vcpu exits and memslot mutations can run concurrently as long as the vcpu does not acquire the slots mutex. Thus it is theoretically possible for memslots to change underneath a vcpu that is handling an exit.
| | | If we increment the memslot generation number again after synchronize_srcu_expedited(), vcpus can safely cache memslot generation without maintaining a single rcu_dereference through an entire vm exit. And much of the x86/kvm code does not maintain a single rcu_dereference of the current memslots during each exit.
| | | We can prevent the following case:
| | |
| | |    vcpu (CPU 0)                             | thread (CPU 1)
| | | --------------------------------------------+--------------------------
| | | 1  vm exit                                  |
| | | 2  srcu_read_unlock(&kvm->srcu)             |
| | | 3  decide to cache something based on       |
| | |      old memslots                           |
| | | 4                                           | change memslots
| | |                                             | (increments generation)
| | | 5                                           | synchronize_srcu(&kvm->srcu);
| | | 6  retrieve generation # from new memslots  |
| | | 7  tag cache with new memslot generation    |
| | | 8  srcu_read_unlock(&kvm->srcu)             |
| | | ...                                         |
| | |    <action based on cache occurs even       |
| | |     though the caching decision was based   |
| | |     on the old memslots>                    |
| | | ...                                         |
| | |    <action *continues* to occur until next  |
| | |     memslot generation change, which may    |
| | |     be never>                               |
| | |
| | | By incrementing the generation after synchronizing with kvm->srcu readers, we ensure that the generation retrieved in (6) will become invalid soon after (8).
| | | Keeping the existing increment is not strictly necessary, but we do keep it and just move it for consistency from update_memslots to install_new_memslots. It invalidates old cached MMIOs immediately, instead of having to wait for the end of synchronize_srcu_expedited, which makes the code more clearly correct in case CPU 1 is preempted right after synchronize_srcu() returns.
| | | To avoid halving the generation space in SPTEs, always presume that the low bit of the generation is zero when reconstructing a generation number out of an SPTE. This effectively disables MMIO caching in SPTEs during the call to synchronize_srcu_expedited. Using the low bit this way is somewhat like a seqcount, where the protected thing is a cache, and instead of retrying we can simply punt if we observe the low bit to be 1.
| | | Cc: stable@vger.kernel.org
| | | Signed-off-by: David Matlack <dmatlack@google.com>
| | | Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
| | | Reviewed-by: David Matlack <dmatlack@google.com>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
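The seqcount-like use of the low generation bit can be modeled as follows. This is a sketch under assumptions: the helper names are hypothetical, the SPTE encoding is reduced to "store the generation without its low bit", and the two increments (one before and one after the SRCU synchronization) are represented simply by the caller passing different `cur_gen` values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Store a generation in an SPTE-like tag, dropping the low bit. */
static uint64_t spte_store_gen(uint64_t gen)
{
    return gen >> 1;
}

/* Reconstruct the generation, presuming the low bit is zero. */
static uint64_t spte_load_gen(uint64_t stored)
{
    return stored << 1;
}

/*
 * A tag is usable only if it matches the current generation exactly.
 * While an update is in flight the current generation is odd, so no
 * reconstructed (always even) value can match and the lookup punts.
 */
static bool spte_gen_matches(uint64_t stored, uint64_t cur_gen)
{
    return spte_load_gen(stored) == cur_gen;
}
```

Starting from generation 6, an update moves it to 7 before synchronizing and to 8 afterwards. A tag written at 6 or during 7 reconstructs to 6, which matches neither 7 nor 8, so anything cached around the update is discarded; only tags written at 8 match once the update is complete.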
* | | KVM: do not bias the generation number in kvm_current_mmio_generationPaolo Bonzini2014-09-03
| | | The next patch will give a meaning (a la seqcount) to the low bit of the generation number. Ensure that it matches between kvm->memslots->generation and kvm_current_mmio_generation().
| | | Cc: stable@vger.kernel.org
| | | Reviewed-by: David Matlack <dmatlack@google.com>
| | | Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: x86: use guest maxphyaddr to check MTRR valuesPaolo Bonzini2014-08-29
| | | The check introduced in commit d7a2a246a1b5 (KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs, 2014-08-19) will break if the guest maxphyaddr is higher than the host's (which sometimes happens depending on your hardware and how QEMU is configured). To fix this, use cpuid_maxphyaddr similar to how the APIC_BASE MSR does already.
| | | Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
| | | Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
| | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
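Why the choice of maxphyaddr matters can be seen in a small sketch. It assumes the architectural layout of a variable-range MTRR base register (type in bits 7:0, bits 11:8 reserved, base address up to MAXPHYADDR-1, everything above reserved) and deliberately ignores memory-type validation; `mtrr_physbase_valid` is a hypothetical name, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if no reserved bit is set in a variable-range MTRR base
 * value, given the physical-address width used for the check.
 * Assumption: only bits 11:8 and bits >= maxphyaddr are treated as
 * reserved here; memory-type validity is not checked.
 */
static bool mtrr_physbase_valid(uint64_t data, unsigned int maxphyaddr)
{
    uint64_t rsvd;

    assert(maxphyaddr >= 12 && maxphyaddr < 64);

    rsvd = (~0ULL << maxphyaddr) | 0xf00ULL;
    return (data & rsvd) == 0;
}
```

A base value with bit 41 set is legal for a guest that advertises 46 physical address bits, yet it would be rejected if the check used a host width of 40 bits, which is the failure mode the commit describes.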