path: root/arch/x86/kvm/x86.c
Commit message | Author | Age
...
| * KVM: SVM: Fix reading of DR6Jan Kiszka2014-01-17
| | | | | | | | | | | | | | | | | | | | | | | | | | In contrast to VMX, SVM does not automatically transfer DR6 into the VCPU's arch.dr6. So if we face a DR6 read, we must consult a new vendor hook to obtain the current value. And as SVM now picks the DR6 state from its VMCB, we also need a set callback in order to write updates of DR6 back. Fixes a regression of 020df0794f. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: Sync DR7 on KVM_SET_DEBUGREGSJan Kiszka2014-01-17
| | | | | | | | | | | | | | | | | | Whenever we change arch.dr7, we also have to call kvm_update_dr7. In case guest debugging is off, this will synchronize the new state into hardware. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * add support for Hyper-V reference time counterVadim Rozenfeld2014-01-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off: Peter Lieven <pl@kamp.de> Signed-off: Gleb Natapov Signed-off: Vadim Rozenfeld <vrozenfe@redhat.com> After some consideration I decided to submit only Hyper-V reference counters support this time. I will submit iTSC support as a separate patch as soon as it is ready. v1 -> v2 1. mark TSC page dirty as suggested by Eric Northup <digitaleric@google.com> and Gleb 2. disable local irq when calling get_kernel_ns, as it was done by Peter Lieven <pl@amp.de> 3. move check for TSC page enable from second patch to this one. v3 -> v4 Get rid of ref counter offset. v4 -> v5 replace __copy_to_user with kvm_write_guest when updating iTSC page. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
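For context, the guest derives the Hyper-V reference time (100 ns units) from its own TSC via a scale/offset pair the hypervisor publishes in the TSC page. A standalone sketch of that arithmetic, with field names taken from the Hyper-V specification rather than from this patch, and assuming gcc's unsigned __int128:

    #include <stdint.h>
    #include <stdio.h>

    /* Layout as described by the Hyper-V spec; illustrative only. */
    struct hv_reference_tsc_page {
        uint32_t tsc_sequence;   /* 0 means: fall back to the MSR-based clock */
        uint32_t reserved1;
        uint64_t tsc_scale;      /* 64.64 fixed-point multiplier */
        int64_t  tsc_offset;     /* added after scaling */
    };

    /* reference time = ((tsc * scale) >> 64) + offset */
    static uint64_t hv_read_reference_time(const struct hv_reference_tsc_page *p,
                                           uint64_t tsc)
    {
        unsigned __int128 scaled = (unsigned __int128)tsc * p->tsc_scale;
        return (uint64_t)(scaled >> 64) + p->tsc_offset;
    }

    int main(void)
    {
        /* Example: 2 GHz guest TSC, 10 MHz (100 ns) reference clock. */
        struct hv_reference_tsc_page page = {
            .tsc_sequence = 1,
            .tsc_scale = (uint64_t)((((unsigned __int128)10000000u) << 64) / 2000000000u),
            .tsc_offset = 0,
        };
        /* Prints 20000000, i.e. 2 seconds expressed in 100 ns units. */
        printf("%llu\n",
               (unsigned long long)hv_read_reference_time(&page, 4000000000ull));
        return 0;
    }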
| * KVM: remove useless write to vcpu->hv_clock.tsc_timestampPaolo Bonzini2014-01-15
| | | | | | | | | | | | | | | | | | | | | | | | | | After the previous patch from Marcelo, the comment before this write became obsolete. In fact, the write is unnecessary. The calls to kvm_write_tsc ultimately result in a master clock update as soon as all TSCs agree and the master clock is re-enabled. This master clock update will rewrite tsc_timestamp. So, together with the comment, delete the dead write too. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: fix tsc catchup issue with tsc scalingMarcelo Tosatti2014-01-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To fix a problem related to the different resolutions of the TSC and the system clock, the offset in TSC units is approximated by delta = vcpu->hv_clock.tsc_timestamp - vcpu->last_guest_tsc, where vcpu->hv_clock.tsc_timestamp is the guest TSC value at the last kvm_guest_time_update call and vcpu->last_guest_tsc is the guest TSC value at the last VM-exit. Delta is then later scaled using the mult,shift pair found in the hv_clock structure (which is correct against tsc_timestamp in that structure). However, if a frequency change is performed between these two points, this delta is measured using different TSC frequencies, but scaled using the mult,shift pair for one frequency only. The end result is an incorrect delta. The bug which this code works around is not the only cause of clock-backwards events. The global accumulator is still necessary, so remove the max_kernel_ns fix and rely on the global accumulator to prevent clock-backwards events. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
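A standalone sketch of the pvclock-style scaling this message refers to; the (mult, shift) pair is only meaningful for the one TSC frequency it was computed from, which is why a delta whose two samples straddle a frequency change cannot be scaled correctly. Names and values here are illustrative, not the kernel's helper:

    #include <stdint.h>
    #include <stdio.h>

    /* Convert a TSC delta to nanoseconds using a (mult, shift) pair that
     * was derived for one specific TSC frequency. */
    static uint64_t scale_delta(uint64_t delta, uint32_t mult, int shift)
    {
        if (shift < 0)
            delta >>= -shift;
        else
            delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mult) >> 32);
    }

    int main(void)
    {
        /* mult/shift for a 2 GHz TSC: 1 ns per 2 cycles. */
        uint32_t mult = 2147483648u;   /* 0.5 * 2^32 */
        /* Prints 1000000 (1 ms) for a 2,000,000-cycle delta. If the two
         * samples forming the delta were taken at different frequencies,
         * no single (mult, shift) pair converts the difference correctly. */
        printf("%llu\n", (unsigned long long)scale_delta(2000000, mult, 0));
        return 0;
    }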
| * KVM: x86: limit PIT timer frequencyMarcelo Tosatti2014-01-15
| | | | | | | | | | | | | | | | | | | | Limit PIT timer frequency similarly to the limit applied by LAPIC timer. Cc: stable@kernel.org Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: Add comment on vcpu_enter_guest()'s return valueTakuya Yoshikawa2013-12-13
| | | | | | | | | | | | | | | | | | Giving proper names to the 0 and 1 was once suggested. But since 0 is returned to the userspace, giving it another name can introduce extra confusion. This patch just explains the meanings instead. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: Use cond_resched() directly and remove useless kvm_resched()Takuya Yoshikawa2013-12-13
| | | | | | | | | | | | | | | | | | | | Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers to make kvm preemptible"), the remaining stuff in this function is a simple cond_resched() call with an extra need_resched() check which was there to avoid dropping VCPUs unnecessarily. Now it is meaningless. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: Convert vapic synchronization to _cached functions (CVE-2013-6368)Andy Honig2013-12-12
|/ | | | | | | | | | | | | | | | | | | | | In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the potential to corrupt kernel memory if userspace provides an address that is at the end of a page. This patch converts those functions to use kvm_write_guest_cached and kvm_read_guest_cached. It also checks the vapic_address specified by userspace during ioctl processing and returns an error to userspace if the address is not a valid GPA. This is generally not guest triggerable, because the required write is done by firmware that runs before the guest. Also, it only affects AMD processors and oldish Intel that do not have the FlexPriority feature (unless you disable FlexPriority, of course; then newer processors are also affected). Fixes: b93463aa59d6 ('KVM: Accelerated apic support') Reported-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
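Roughly, the conversion follows the usual cached guest-access pattern: validate and cache the translation once at ioctl time, then read and write through the cache on the sync paths. A kernel-context sketch, with the cache field name assumed rather than quoted from the patch, and error handling abridged:

    /* At KVM_SET_VAPIC_ADDR time: reject anything that is not a valid GPA
     * and remember the translation. */
    if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apic->vapic_cache,
                                  vapic_addr, sizeof(u32)))
        return -EINVAL;

    /* kvm_lapic_sync_from_vapic(): read through the cache. */
    u32 data;
    if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apic->vapic_cache,
                              &data, sizeof(u32)))
        return;

    /* kvm_lapic_sync_to_vapic(): write through the cache. */
    kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apic->vapic_cache,
                           &data, sizeof(u32));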
* kvm: optimize out smp_mb after srcu_read_unlockMichael S. Tsirkin2013-11-06
| | | | | | | | | | | | | | | | I noticed that srcu_read_lock/unlock both have a memory barrier, so just by moving srcu_read_unlock earlier we can get rid of one call to smp_mb() using smp_mb__after_srcu_read_unlock instead. Unsurprisingly, the gain is small but measurable using the unit test microbenchmark: before: vmcall in the ballpark of 1410 cycles; after: vmcall in the ballpark of 1360 cycles. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
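A sketch of the reordering on the vmentry path; smp_mb__after_srcu_read_unlock() was introduced by the same series, and the surrounding vcpu_enter_guest() code is abridged here:

    /* srcu_read_unlock() already implies a full barrier, so dropping the
     * SRCU read lock before the IN_GUEST_MODE/requests ordering point lets
     * smp_mb__after_srcu_read_unlock() stand in for the explicit smp_mb()
     * that used to be needed. */
    vcpu->mode = IN_GUEST_MODE;

    srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

    /* Pairs with the barrier inside srcu_read_unlock(); no extra smp_mb(). */
    smp_mb__after_srcu_read_unlock();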
* Merge branch 'kvm-ppc-queue' of git://github.com/agraf/linux-2.6 into queueGleb Natapov2013-11-04
|\ | | | | | | | | Conflicts: arch/powerpc/include/asm/processor.h
| * kvm: Add struct kvm arg to memslot APIsAneesh Kumar K.V2013-10-17
| | | | | | | | | | | | | | We will use that in the later patch to find the kvm ops handler Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* | KVM: x86: fix KVM_SET_XCRS loopPaolo Bonzini2013-10-31
| | | | | | | | | | | | | | | | | | The loop was always using 0 as the index. This means that any rubbish after the first element of the array went undetected. It seems reasonable to assume that no KVM userspace did that. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
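The bug class, reduced to a standalone illustration (structure and field names are made up for the example, not KVM's):

    #include <stdio.h>
    #include <stddef.h>

    struct xcr { unsigned int index; unsigned long long value; };

    static int set_xcrs(const struct xcr *xcrs, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            /* The pre-fix loop effectively tested xcrs[0] on every
             * iteration, so garbage in later elements was never rejected. */
            if (xcrs[i].index != 0)
                return -1;
        }
        return 0;
    }

    int main(void)
    {
        struct xcr good[2] = { { 0, 7 }, { 0, 7 } };
        struct xcr bad[2]  = { { 0, 7 }, { 5, 7 } };
        printf("%d %d\n", set_xcrs(good, 2), set_xcrs(bad, 2)); /* prints: 0 -1 */
        return 0;
    }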
* | KVM: x86: fix KVM_SET_XCRS for CPUs that do not support XSAVEPaolo Bonzini2013-10-31
| | | | | | | | | | | | | | | | | | | | | | | | The KVM_SET_XCRS ioctl must accept anything that KVM_GET_XCRS could return. XCR0's bit 0 is always 1 in real processors with XSAVE, and KVM_GET_XCRS will always leave bit 0 set even if the emulated processor does not have XSAVE. So, KVM_SET_XCRS must ignore that bit when checking for attempts to enable unsupported save states. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
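A standalone sketch of the resulting check; XSTATE_FP here stands for XCR0 bit 0 (x87 state), and the exact macro and variable names in KVM may differ:

    #include <stdint.h>
    #include <stdio.h>

    #define XSTATE_FP   (1ull << 0)   /* x87: architecturally always set in XCR0 */
    #define XSTATE_SSE  (1ull << 1)
    #define XSTATE_YMM  (1ull << 2)

    /* Accept any XCR0 whose bits are a subset of what the vCPU supports,
     * with bit 0 always allowed, because KVM_GET_XCRS reports it as set
     * even when the emulated CPU has no XSAVE. */
    static int xcr0_is_valid(uint64_t xcr0, uint64_t guest_supported_xcr0)
    {
        uint64_t valid_bits = guest_supported_xcr0 | XSTATE_FP;
        return (xcr0 & ~valid_bits) == 0;
    }

    int main(void)
    {
        /* Old emulated CPU, only bit 0 set: accepted. */
        printf("%d\n", xcr0_is_valid(XSTATE_FP, 0));
        /* Attempt to enable an unsupported state (YMM): rejected. */
        printf("%d\n", xcr0_is_valid(XSTATE_FP | XSTATE_YMM,
                                     XSTATE_FP | XSTATE_SSE));
        return 0;
    }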
* | kvm: Create non-coherent DMA registrationAlex Williamson2013-10-30
| | | | | | | | | | | | | | | | | | | | | | | | | | We currently use some ad-hoc arch variables tied to legacy KVM device assignment to manage emulation of instructions that depend on whether non-coherent DMA is present. Create an interface for this, adapting legacy KVM device assignment and adding VFIO via the KVM-VFIO device. For now we assume that non-coherent DMA is possible any time we have a VFIO group. Eventually an interface can be developed as part of the VFIO external user interface to query the coherency of a group. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm/x86: Convert iommu_flags to iommu_noncoherentAlex Williamson2013-10-30
| | | | | | | | | | | | | | | | | | Default to operating in coherent mode. This simplifies the logic when we switch to a model of registering and unregistering noncoherent I/O with KVM. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm, emulator: Rename VendorSpecific flagBorislav Petkov2013-10-30
| | | | | | | | | | | | | | | | | | | | Call it EmulateOnUD which is exactly what we're trying to do with vendor-specific instructions. Rename ->only_vendor_specific_insn to something shorter, while at it. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm, emulator: Use opcode lengthBorislav Petkov2013-10-30
| | | | | | | | | | | | | | | | | | Add a field to the current emulation context which contains the instruction opcode length. This will streamline handling of opcodes of different length. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm: Add KVM_GET_EMULATED_CPUIDBorislav Petkov2013-10-30
|/ | | | | | | | | | | | | | | | Add a kvm ioctl which states which system functionality kvm emulates. The format used is that of CPUID and we return the corresponding CPUID bits set for which we do emulate functionality. Make sure ->padding is being passed on clean from userspace so that we can use it for something in the future, after the ioctl gets cast in stone. s/kvm_dev_ioctl_get_supported_cpuid/kvm_dev_ioctl_get_cpuid/ while at it. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
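From userspace the new ioctl is used like KVM_GET_SUPPORTED_CPUID, issued against the /dev/kvm fd with a struct kvm_cpuid2. A sketch assuming a kernel and headers new enough to define KVM_GET_EMULATED_CPUID:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR);
        if (kvm < 0) { perror("open /dev/kvm"); return 1; }

        int nent = 64;
        struct kvm_cpuid2 *cpuid =
            calloc(1, sizeof(*cpuid) + nent * sizeof(struct kvm_cpuid_entry2));
        if (!cpuid) return 1;
        cpuid->nent = nent;

        if (ioctl(kvm, KVM_GET_EMULATED_CPUID, cpuid) < 0) {
            perror("KVM_GET_EMULATED_CPUID");   /* fails on kernels without it */
            return 1;
        }
        for (unsigned i = 0; i < cpuid->nent; i++)
            printf("leaf %#x index %#x: eax=%#x ebx=%#x ecx=%#x edx=%#x\n",
                   cpuid->entries[i].function, cpuid->entries[i].index,
                   cpuid->entries[i].eax, cpuid->entries[i].ebx,
                   cpuid->entries[i].ecx, cpuid->entries[i].edx);
        free(cpuid);
        return 0;
    }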
* KVM: Drop FOLL_GET in GUP when doing async page faultchai wen2013-10-15
| | | | | | | | | | | | Page pinning is not mandatory in kvm async page fault processing since after async page fault event is delivered to a guest it accesses page once again and does its own GUP. Drop the FOLL_GET flag in GUP in async_pf code, and do some simplifying in check/clear processing. Suggested-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Gu zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: mmu: change useless int return types to voidPaolo Bonzini2013-10-03
| | | | | | | | kvm_mmu initialization is mostly filling in function pointers, there is no way for it to fail. Clean up unused return values. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: mmu: remove uninteresting MMU "new_cr3" callbacksPaolo Bonzini2013-10-03
| | | | | | | | | | | | | | | The new_cr3 MMU callback has been a wrapper for mmu_free_roots since commit e676505 (KVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation, 2012-07-08). The commit message mentioned that "mmu_free_roots() is somewhat of an overkill, but fixing that is more complicated and will be done after this minimal fix". One year has passed, and no one really felt the need to do a different fix. Wrap the call with a kvm_mmu_new_cr3 function for clarity, but remove the callback. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: x86: only copy XSAVE state for the supported featuresPaolo Bonzini2013-10-03
| | | | | | | | | | This makes the interface more deterministic for userspace, which can expect (after configuring only the features it supports) to get exactly the same state from the kernel, independent of the host CPU and kernel version. Suggested-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: x86: prevent setting unsupported XSAVE statesPaolo Bonzini2013-10-03
| | | | | | | | | | | | | | | | A guest can still attempt to save and restore XSAVE states even if they have been masked in CPUID leaf 0Dh. This usually is not visible to the guest, but is still wrong: "Any attempt to set a reserved bit (as determined by the contents of EAX and EDX after executing CPUID with EAX=0DH, ECX= 0H) in XCR0 for a given processor will result in a #GP exception". The patch also performs the same checks as __kvm_set_xcr in KVM_SET_XSAVE. This catches migration from newer to older kernel/processor before the guest starts running. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: Convert kvm_lock back to non-raw spinlockPaolo Bonzini2013-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"), the kvm_lock was made a raw lock. However, the kvm mmu_shrink() function tries to grab the (non-raw) mmu_lock within the scope of the raw locked kvm_lock being held. This leads to the following: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0 Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm] Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt Call Trace: [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160 [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50 [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm] [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0 [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260 [<ffffffff8111824a>] balance_pgdat+0x54a/0x730 [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0 [<ffffffff811185bf>] kswapd+0x18f/0x490 [<ffffffff81070961>] ? get_parent_ip+0x11/0x50 [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50 [<ffffffff81118430>] ? balance_pgdat+0x730/0x730 [<ffffffff81060d2b>] kthread+0xdb/0xe0 [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100 [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10 [<ffffffff81060c50>] ? __init_kthread_worker+0x After the previous patch, kvm_lock need not be a raw spinlock anymore, so change it back. Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: kvm@vger.kernel.org Cc: gleb@redhat.com Cc: jan.kiszka@siemens.com Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: update masterclock when kvmclock_offset is calculated (v2)Marcelo Tosatti2013-08-28
| | | | | | | | | | | | The offset to add to the hosts monotonic time, kvmclock_offset, is calculated against the monotonic time at KVM_SET_CLOCK ioctl time. Request a master clock update at this time, to reduce a potentially unbounded difference between the values of the masterclock and the clock value used to calculate kvmclock_offset. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: x86: add comments where MMIO does not return to the emulatorPaolo Bonzini2013-08-28
| | | | | | | | | | Support for single-step in the emulator (new in 3.12) does not work for MMIO or PIO writes, because they are completed without returning to the emulator. This is not worse than what we had in 3.11; still, add comments so that the issue is not forgotten. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* kvm hypervisor: Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apicRaghavendra K T2013-08-26
| | | | | | | | | | | | Note that we are using APIC_DM_REMRD which has reserved usage. In future if APIC_DM_REMRD usage is standardized, then we should find some other way or go back to old method. Suggested-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Gleb Natapov <gleb@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>
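A kernel-context sketch of what the simplified kick path looks like after this change; the flags argument and error handling are elided, and the exact field initialisation is abridged:

    /* Build a lapic irq with the (reserved) REMRD delivery mode and let the
     * generic APIC delivery code locate and wake the destination vcpu,
     * instead of open-coding a kvm_for_each_vcpu() search. */
    static void kvm_pv_kick_cpu(struct kvm *kvm, int apicid)
    {
        struct kvm_lapic_irq lapic_irq = {
            .delivery_mode = APIC_DM_REMRD,   /* reserved mode, reused here */
            .dest_mode     = 0,               /* physical destination */
            .shorthand     = 0,
            .dest_id       = apicid,
        };

        kvm_irq_delivery_to_apic(kvm, NULL, &lapic_irq, NULL);
    }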
* kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocksSrivatsa Vaddagiri2013-08-26
| | | | | | | | | | | | | | | | | | | | kvm_hc_kick_cpu allows the calling vcpu to kick another vcpu out of halt state. The presence of these hypercalls is indicated to the guest via kvm_feature_pv_unhalt. Fold the pv_unhalt flag into the GET_MP_STATE ioctl to aid migration. During migration, any vcpu that got kicked but did not become runnable (still in halted state) should be runnable after migration. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com> [Raghu: Apic related changes, folding pvunhalted into vcpu_runnable Added flags for future use (suggested by Gleb)] [ Raghu: fold pv_unhalt flag as suggested by Eric Northup] Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Gleb Natapov <gleb@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>
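On the guest side, the unlock path issues the new hypercall with the waiting CPU's APIC id. A sketch of that call as it might look on an Intel guest (AMD guests use vmmcall instead of vmcall); KVM_HC_KICK_CPU is the hypercall number from the KVM ABI, everything else is illustrative and only works when run inside a KVM guest:

    #define KVM_HC_KICK_CPU  5   /* KVM hypercall ABI */

    /* vmcall with the hypercall nr in RAX and args in RBX/RCX. */
    static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
                                      unsigned long p2)
    {
        long ret;
        asm volatile("vmcall"
                     : "=a"(ret)
                     : "a"(nr), "b"(p1), "c"(p2)
                     : "memory");
        return ret;
    }

    static void kvm_kick_cpu(int apicid)
    {
        /* First argument (flags) is currently unused/reserved. */
        kvm_hypercall2(KVM_HC_KICK_CPU, 0, apicid);
    }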
* nEPT: Fix wrong test in kvm_set_cr3Nadav Har'El2013-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical address. The problem is that with nested EPT, cr3 is an *L2* physical address, not an L1 physical address as this test expects. As the comment above this test explains, it isn't necessary, and doesn't correspond to anything a real processor would do. So this patch removes it. Note that this wrong test could have also theoretically caused problems in nested NPT, not just in nested EPT. However, in practice, the problem was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus circumventing the problem. Additional potential calls to the buggy function are avoided in that we don't trap cr3 modifications when nested NPT is enabled. However, because in nested VMX we did want to use kvm_set_cr3() (as requested in Avi Kivity's review of the original nested VMX patches), we can't avoid this problem and need to fix it. Reviewed-by: Orit Wasserman <owasserm@redhat.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: handle singlestep during emulationPaolo Bonzini2013-07-29
| | | | | | | | | | | | | | | This lets debugging work better during emulation of invalid guest state. This time the check is done after emulation, but before writeback of the flags; we need to check the flags *before* execution of the instruction, we cannot check singlestep_rip because the CS base may have already been modified. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Conflicts: arch/x86/kvm/x86.c
* KVM: x86: handle hardware breakpoints during emulationPaolo Bonzini2013-07-29
| | | | | | | | | | This lets debugging work better during emulation of invalid guest state. The check is done before emulating the instruction, and (in the case of guest debugging) reuses EMULATE_DO_MMIO to exit with KVM_EXIT_DEBUG. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: rename EMULATE_DO_MMIOPaolo Bonzini2013-07-29
| | | | | | | The next patch will reuse it for other userspace exits than MMIO, namely debug events. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: nVMX: Fix read/write to MSR_IA32_FEATURE_CONTROLNadav Har'El2013-07-18
| | | | | | | | | | | | | | | | | | | Fix read/write to the IA32_FEATURE_CONTROL MSR in a nested environment. This patch simulates this MSR in nested_vmx; the default value is 0x0. The BIOS should set it to 0x5 before VMXON. After the lock bit is set, writes to it will cause #GP(0). Another QEMU patch is also needed to handle emulation of reset and migration. A vCPU reset should clear this MSR and migration should preserve its value. This patch is based on Nadav's previous commit. http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/88478 Signed-off-by: Nadav Har'El <nyh@math.technion.ac.il> Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: x86: Drop useless castMathias Krause2013-07-18
| | | | | | | | Void pointers don't need no casting, drop it. Signed-off-by: Mathias Krause <minipli@googlemail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: x86: Avoid zapping mmio sptes twice for generation wraparoundTakuya Yoshikawa2013-07-18
| | | | | | | | | | Now that kvm_arch_memslots_updated() catches every increment of the memslots->generation, checking if the mmio generation has reached its maximum value is enough. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Introduce kvm_arch_memslots_updated()Takuya Yoshikawa2013-07-18
| | | | | | | | | | | | | | | | This is called right after the memslots is updated, i.e. when the result of update_memslots() gets installed in install_new_memslots(). Since the memslots needs to be updated twice when we delete or move a memslot, kvm_arch_commit_memory_region() does not correspond to this exactly. In the following patch, x86 will use this new API to check if the mmio generation has reached its maximum value, in which case mmio sptes need to be flushed out. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Acked-by: Alexander Graf <agraf@suse.de> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-07-03
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "On the x86 side, there are some optimizations and documentation updates. The big ARM/KVM change for 3.11, support for AArch64, will come through Catalin Marinas's tree. s390 and PPC have misc cleanups and bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits) KVM: PPC: Ignore PIR writes KVM: PPC: Book3S PR: Invalidate SLB entries properly KVM: PPC: Book3S PR: Allow guest to use 1TB segments KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry KVM: PPC: Book3S PR: Fix proto-VSID calculations KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL KVM: Fix RTC interrupt coalescing tracking kvm: Add a tracepoint write_tsc_offset KVM: MMU: Inform users of mmio generation wraparound KVM: MMU: document fast invalidate all mmio sptes KVM: MMU: document fast invalidate all pages KVM: MMU: document fast page fault KVM: MMU: document mmio page fault KVM: MMU: document write_flooding_count KVM: MMU: document clear_spte_count KVM: MMU: drop kvm_mmu_zap_mmio_sptes KVM: MMU: init kvm generation close to mmio wrap-around value KVM: MMU: add tracepoint for check_mmio_spte KVM: MMU: fast invalidate all mmio sptes ...
| * kvm: Add a tracepoint write_tsc_offsetYoshihiro YUNOMAE2013-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a tracepoint write_tsc_offset for tracing TSC offset change. We want to merge ftrace's trace data of guest OSs and the host OS using TSC for timestamp in chronological order. We need "TSC offset" values for each guest when merge those because the TSC value on a guest is always the host TSC plus guest's TSC offset. If we get the TSC offset values, we can calculate the host TSC value for each guest events from the TSC offset and the event TSC value. The host TSC values of the guest events are used when we want to merge trace data of guests and the host in chronological order. (Note: the trace_clock of both the host and the guest must be set x86-tsc in this case) This tracepoint also records vcpu_id which can be used to merge trace data for SMP guests. A merge tool will read TSC offset for each vcpu, then the tool converts guest TSC values to host TSC values for each vcpu. TSC offset is stored in the VMCS by vmx_write_tsc_offset() or vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots. The latter function is executed when kvm clock is updated. Only host can read TSC offset value from VMCS, so a host needs to output TSC offset value when TSC offset is changed. Since the TSC offset is not often changed, it could be overwritten by other frequent events while tracing. To avoid that, I recommend to use a special instance for getting this event: 1. set a instance before booting a guest # cd /sys/kernel/debug/tracing/instances # mkdir tsc_offset # cd tsc_offset # echo x86-tsc > trace_clock # echo 1 > events/kvm/kvm_write_tsc_offset/enable 2. boot a guest Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * KVM: MMU: fast invalidate all mmio sptesXiao Guangrong2013-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch tries to introduce a very simple and scalable way to invalidate all mmio sptes - it need not walk any shadow pages or hold the mmu-lock. KVM maintains a global mmio valid generation-number which is stored in kvm->memslots.generation, and every mmio spte stores the current global generation-number into its available bits when it is created. When KVM needs to zap all mmio sptes, it simply increases the global generation-number. When a guest does an mmio access, KVM intercepts the MMIO #PF, then walks the shadow page table and gets the mmio spte. If the generation-number on the spte does not equal the global generation-number, it goes to the normal #PF handler to update the mmio spte. Since 19 bits are used to store the generation-number on an mmio spte, we zap all mmio sptes when the number wraps around. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
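The tagging scheme reduces to a few bit operations; a standalone sketch follows (the placement of the generation bits inside a real spte is more involved, and the names are made up for the example):

    #include <stdint.h>
    #include <stdio.h>

    #define MMIO_GEN_BITS   19
    #define MMIO_GEN_MASK   ((1u << MMIO_GEN_BITS) - 1)

    /* Tag an mmio spte with the current global generation at creation time. */
    static uint64_t make_mmio_spte(uint64_t gfn, unsigned int global_gen)
    {
        return (gfn << MMIO_GEN_BITS) | (global_gen & MMIO_GEN_MASK);
    }

    /* On an mmio #PF: a stale generation means "treat the spte as gone" and
     * take the slow path, which re-creates it with the new generation. */
    static int mmio_spte_is_stale(uint64_t spte, unsigned int global_gen)
    {
        return (spte & MMIO_GEN_MASK) != (global_gen & MMIO_GEN_MASK);
    }

    int main(void)
    {
        unsigned int gen = 7;
        uint64_t spte = make_mmio_spte(0x1234, gen);
        /* "Zapping all mmio sptes" is just gen++: prints "0 1". */
        printf("%d %d\n", mmio_spte_is_stale(spte, gen),
                          mmio_spte_is_stale(spte, gen + 1));
        return 0;
    }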
| * KVM: MMU: retain more available bits on mmio spteXiao Guangrong2013-06-20
| | | | | | | | | | | | | | | | | | | | Let mmio spte only use bit62 and bit63 on upper 32 bits, then bit 52 ~ bit 61 can be used for other purposes Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: handle idiv overflow at kvm_write_tscMarcelo Tosatti2013-06-12
| | | | | | | | | | | | | | It's possible that idivl overflows (due to a large delta stored in usdiff, a valid scenario). Create an exception handler to catch the overflow exception (division by zero is protected by the vcpu->arch.virtual_tsc_khz check), and interpret it accordingly (delta is larger than USEC_PER_SEC). Fixes https://bugzilla.redhat.com/show_bug.cgi?id=969644 Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * KVM: MMU: reclaim the zapped-obsolete page firstXiao Guangrong2013-06-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As Marcelo pointed out that | "(retention of large number of pages while zapping) | can be fatal, it can lead to OOM and host crash" We introduce a list, kvm->arch.zapped_obsolete_pages, to link all the pages which are deleted from the mmu cache but not actually freed. When page reclaiming is needed, we always zap this kind of pages first. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * KVM: x86: use the fast way to invalidate all pagesXiao Guangrong2013-06-05
| | | | | | | | | | | | | | | | Replace kvm_mmu_zap_all by kvm_mmu_invalidate_zap_all_pages Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * KVM: MMU: drop unnecessary kvm_reload_remote_mmusXiao Guangrong2013-06-05
| | | | | | | | | | | It is the responsibility of kvm_mmu_zap_all to keep the mmu and tlbs consistent. And it is also unnecessary after zapping all mmio sptes, since no mmio spte exists on the root shadow page and it cannot be cached into the tlb. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercallXiao Guangrong2013-06-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quote Gleb's mail: | Back then kvm->lock protected memslot access so code like: | | mutex_lock(&vcpu->kvm->lock); | kvm_mmu_zap_all(vcpu->kvm); | mutex_unlock(&vcpu->kvm->lock); | | which is what 7aa81cc0 does was enough to guaranty that no vcpu will | run while code is patched. This is no longer the case and | mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago, | so now kvm_mmu_zap_all() there is useless and the code is incorrect. So we drop it and it will be fixed later Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * KVM: x86: limit difference between kvmclock updatesMarcelo Tosatti2013-05-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | kvmclock updates which are isolated to a given vcpu, such as vcpu->cpu migration, should not allow system_timestamp from the rest of the vcpus to remain static. Otherwise ntp frequency correction applies to one vcpu's system_timestamp but not the others. So in those cases, request a kvmclock update for all vcpus. The worst case for a remote vcpu to update its kvmclock is then bounded by maximum nohz sleep latency. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
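A kernel-context sketch of the "update everyone" request loop described above; the function name is illustrative, not necessarily the one the patch introduces:

    /* Instead of queueing KVM_REQ_CLOCK_UPDATE only on the vcpu whose
     * kvmclock is being refreshed, queue it for every vcpu so that all
     * system_timestamps move together. */
    static void request_kvmclock_update_all(struct kvm *kvm)
    {
        struct kvm_vcpu *vcpu;
        int i;

        kvm_for_each_vcpu(i, vcpu, kvm) {
            kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
            kvm_vcpu_kick(vcpu);   /* make sleeping vcpus notice the request */
        }
    }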
* | Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds2013-07-02
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull asm/x86 changes from Ingo Molnar: "Misc changes, with a bigger processor-flags cleanup/reorganization by Peter Anvin" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, asm, cleanup: Replace open-coded control register values with symbolic x86, processor-flags: Fix the datatypes and add bit number defines x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED linux/const.h: Add _BITUL() and _BITULL() x86/vdso: Convert use of typedef ctl_table to struct ctl_table x86: __force_order doesn't need to be an actual variable
| * | x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASEH. Peter Anvin2013-06-25
| |/ | | | | | | | | | | | | | | | | Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE to match the SDM. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Link: http://lkml.kernel.org/n/tip-buq1evi5dpykxx7ak6amaam0@git.kernel.org
* / KVM: x86: remove vcpu's CPL check in host-invoked XCR setZhanghaoyu (A)2013-06-18
|/ | | | | | | | | | | | | | | | | | | | | The __kvm_set_xcr function does the CPL check when setting an xcr. __kvm_set_xcr is called in two flows. One is invoked by the guest, with the call stack handle_xsetbv (or xsetbv_interception) -> kvm_set_xcr -> __kvm_set_xcr; the other is invoked by the host, for example during system reset: kvm_arch_vcpu_ioctl -> kvm_vcpu_ioctl_x86_set_xcrs -> __kvm_set_xcr. The former does need the CPL check, but the latter does not. Cc: stable@vger.kernel.org Signed-off-by: Zhang Haoyu <haoyu.zhang@huawei.com> [Tweaks to commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
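A sketch of the resulting split, close to the shape described above (kernel context, simplified):

    /* Reachable from the guest via XSETBV exits: privilege-checked. */
    int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
    {
        if (kvm_x86_ops->get_cpl(vcpu) != 0 ||
            __kvm_set_xcr(vcpu, index, xcr)) {
            kvm_inject_gp(vcpu, 0);
            return 1;
        }
        return 0;
    }

    /* The host-initiated KVM_SET_XCRS path (e.g. reset or migration) calls
     * __kvm_set_xcr() directly, so no CPL check applies there. */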