path: root/drivers/kvm
...
* KVM: Add physical memory aliasing feature (Avi Kivity, 2007-05-03)
  With this, we can specify that accesses to one physical memory range will be remapped to another. This is useful for the vga window at 0xa0000 which is used as a movable window into the (much larger) framebuffer.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
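  The remapping idea lends itself to a short sketch. The following standalone C fragment (hypothetical types and names, not the kvm code) shows how a guest frame number that falls inside an alias window, such as the vga window at 0xa0000, could be redirected to its target range before the usual memory-slot lookup:

    #include <stdint.h>

    /* One alias window: accesses to [base_gfn, base_gfn + npages) are
     * redirected to the range starting at target_gfn. */
    struct mem_alias {
            uint64_t base_gfn;
            uint64_t npages;
            uint64_t target_gfn;
    };

    /* Translate an aliased gfn to its target; unaliased gfns pass through. */
    static uint64_t unalias_gfn(const struct mem_alias *aliases, int naliases,
                                uint64_t gfn)
    {
            for (int i = 0; i < naliases; i++) {
                    const struct mem_alias *a = &aliases[i];

                    if (gfn >= a->base_gfn && gfn < a->base_gfn + a->npages)
                            return a->target_gfn + (gfn - a->base_gfn);
            }
            return gfn;
    }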
* KVM: Simplify gfn_to_page() (Avi Kivity, 2007-05-03)
  Mapping a guest page to a host page is a common operation. Currently, one has first to find the memory slot where the page belongs (gfn_to_memslot), then locate the page itself (gfn_to_page()). This is clumsy, and also won't work well with memory aliases. So simplify gfn_to_page() not to require memory slot translation first, and instead do it internally.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
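  A sketch of the resulting call shape (hypothetical types; the real kvm structures differ): folding the slot lookup into gfn_to_page() leaves callers with a single call, and gives one place to hook alias translation later.

    #include <stddef.h>

    struct page;                            /* opaque host page */

    struct memslot {
            unsigned long base_gfn;
            unsigned long npages;
            struct page **pages;            /* host pages backing the slot */
    };

    static struct memslot *gfn_to_memslot(struct memslot *slots, int nslots,
                                          unsigned long gfn)
    {
            for (int i = 0; i < nslots; i++)
                    if (gfn >= slots[i].base_gfn &&
                        gfn < slots[i].base_gfn + slots[i].npages)
                            return &slots[i];
            return NULL;
    }

    /* After the change: the slot lookup happens inside gfn_to_page(). */
    static struct page *gfn_to_page(struct memslot *slots, int nslots,
                                    unsigned long gfn)
    {
            struct memslot *slot = gfn_to_memslot(slots, nslots, gfn);

            return slot ? slot->pages[gfn - slot->base_gfn] : NULL;
    }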
* KVM: Add mmu cache clear function (Dor Laor, 2007-05-03)
  Functions that play around with the physical memory map need a way to clear mappings to possibly nonexistent or invalid memory. Both the mmu cache and the processor tlb are cleared.
  Signed-off-by: Dor Laor <dor.laor@qumranet.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: x86 emulator: fix bit string operations operand size (Avi Kivity, 2007-05-03)
  On x86, bit operations operate on a string of bits that can reside in multiple words. For example, 'btsl %eax, (blah)' will touch the word at blah+4 if %eax is between 32 and 63. The x86 emulator compensates for that by advancing the operand address by (bit offset / BITS_PER_LONG) and truncating the bit offset to the range (0..BITS_PER_LONG-1). This has a side effect of forcing the operand size to 8 bytes on 64-bit hosts.

  Now, a 32-bit guest goes and fork()s a process. It write protects a stack page at 0xbffff000 using the 'btr' instruction, at offset 0xffc in the page table, with bit offset 1 (for the write permission bit). The emulator now forces the operand size to 8 bytes as previously described, and an innocent page table update turns into a cross-page-boundary write, which is assumed by the mmu code not to be a page table, so it doesn't actually clear the corresponding shadow page table entry. The guest and host permissions are out of sync and guest memory is corrupted soon afterwards, leading to guest failure.

  Fix by not using BITS_PER_LONG as the word size; instead use the actual operand size, so we get a 32-bit write in that case. Note we still have to teach the mmu to handle cross-page-boundary writes to guest page tables; but for now this allows Damn Small Linux 0.4 (2.4.20) to boot.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
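  The address arithmetic can be made concrete with a small standalone program (hypothetical helper name, not the emulator's code); the point is that the word size used for the division and modulo should be the instruction's operand size, not the host's BITS_PER_LONG:

    #include <stdint.h>
    #include <stdio.h>

    /* For bt/bts/btr/btc with a register bit offset, the touched word is
     * addr + (bit_offset / bits_per_word) * op_bytes and the in-word bit
     * is bit_offset % bits_per_word.  Using the instruction's operand
     * size keeps a 32-bit guest access 4 bytes wide; using the host's
     * BITS_PER_LONG would force an 8-byte access on a 64-bit host. */
    static void bit_op_address(uint64_t addr, uint64_t bit_offset,
                               unsigned op_bytes,
                               uint64_t *word_addr, unsigned *bit)
    {
            unsigned bits_per_word = op_bytes * 8;

            *word_addr = addr + (bit_offset / bits_per_word) * op_bytes;
            *bit = (unsigned)(bit_offset % bits_per_word);
    }

    int main(void)
    {
            uint64_t word;
            unsigned bit;

            /* 'btrl $1, 0xffc(%edx)': a 4-byte access that stays on the page. */
            bit_op_address(0xffc, 1, 4, &word, &bit);
            printf("word at offset %#llx, bit %u\n",
                   (unsigned long long)word, bit);
            return 0;
    }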
* KVM: Remove debug message (Avi Kivity, 2007-05-03)
  No longer interesting.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Use list_move() (Avi Kivity, 2007-05-03)
  Use list_move() where possible. Noticed by Dor Laor.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Remove unused function (Michal Piotrowski, 2007-05-03)
  Remove unused function

    CC drivers/kvm/svm.o
    drivers/kvm/svm.c:207: warning: ‘inject_db’ defined but not used

  Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: SVM: Ensure timestamp counter monotonicity (Avi Kivity, 2007-05-03)
  When a vcpu is migrated from one cpu to another, its timestamp counter may lose its monotonic property if the host has unsynced timestamp counters. This can confuse the guest, sometimes to the point of refusing to boot.

  As the rdtsc instruction is rather fast on AMD processors (7-10 cycles), we can simply record the last host tsc when we drop the cpu, and adjust the vcpu tsc offset when we detect that we've migrated to a different cpu.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
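  A minimal sketch of the bookkeeping (field and function names are made up): since the guest's tsc is the host tsc plus a per-vcpu offset, remembering the host tsc at vcpu-put time lets the load path bump the offset so the guest value continues from where it left off.

    #include <stdint.h>

    struct vcpu_tsc_state {
            uint64_t tsc_offset;    /* added to the host tsc to form the guest tsc */
            uint64_t host_tsc;      /* host tsc recorded when the vcpu was last put */
    };

    /* Called when the vcpu is descheduled from a physical cpu. */
    static void vcpu_put_tsc(struct vcpu_tsc_state *v, uint64_t host_tsc_now)
    {
            v->host_tsc = host_tsc_now;
    }

    /* Called when the vcpu is scheduled in again, possibly on another cpu. */
    static void vcpu_load_tsc(struct vcpu_tsc_state *v, uint64_t host_tsc_now,
                              int migrated)
    {
            if (migrated)
                    /* unsigned wraparound keeps this correct even if the new
                     * cpu's tsc is ahead of the old one's */
                    v->tsc_offset += v->host_tsc - host_tsc_now;
    }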
* KVM: MMU: Fix hugepage pdes mapping same physical address with different access (Avi Kivity, 2007-05-03)
  The kvm mmu keeps a shadow page for hugepage pdes; if several such pdes map the same physical address, they share the same shadow page. This is a fairly common case (kernel mappings on i386 nonpae Linux, for example).

  However, if the two pdes map the same memory but with different permissions, kvm will happily use the cached shadow page. If the access through the more permissive pde occurs after the access through the stricter pde, an endless pagefault loop is generated and the guest makes no progress.

  Fix by making the access permissions part of the cache lookup key.

  The fix allows Xen pae to boot on kvm and run guest domains. Thanks to Jeremy Fitzhardinge for reporting the bug and testing the fix.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
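  Illustrating the lookup-key change (hypothetical structure, not kvm's shadow page role word): once the access bits are part of the key, a writable and a read-only huge pde that map the same frame no longer share a shadow page.

    #include <stdbool.h>
    #include <stdint.h>

    struct shadow_key {
            uint64_t gfn;           /* guest frame mapped by the huge pde */
            unsigned access;        /* permission bits, e.g. write/user/nx */
    };

    /* Two huge pdes hit the same cached shadow page only if both the
     * frame and the access permissions match. */
    static bool shadow_key_equal(struct shadow_key a, struct shadow_key b)
    {
            return a.gfn == b.gfn && a.access == b.access;
    }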
* KVM: SVM: forbid guest to execute monitor/mwait (Joerg Roedel, 2007-05-03)
  This patch forbids the guest from executing the monitor/mwait instructions on SVM. This is necessary because the guest can execute these instructions if they are available, even if the kvm cpuid doesn't report their existence.
  Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Handle writes to MCG_STATUS msr (Sergey Kiselev, 2007-05-03)
  Some older (~2.6.7) kernels write the MCG_STATUS register during kernel boot (mce_clear_all() function, called from mce_init()). This is not currently handled by kvm and causes it to inject a GPF. This patch adds a "nop" handler for the write.
  Signed-off-by: Sergey Kiselev <sergey.kiselev@intel.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
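  A sketch of such a "nop" handler (hypothetical function; only the MSR number is architectural): the write is accepted and discarded instead of being turned into a #GP.

    #include <errno.h>
    #include <stdint.h>

    #define MSR_IA32_MCG_STATUS 0x17a       /* architectural machine-check MSR */

    static int set_msr_sketch(uint32_t msr, uint64_t data)
    {
            (void)data;                     /* value is simply discarded */

            switch (msr) {
            case MSR_IA32_MCG_STATUS:
                    /* older kernels clear this in mce_init(); just ignore it */
                    return 0;
            default:
                    return -EINVAL;         /* unhandled: caller injects #GP */
            }
    }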
* KVM: Remove unused and write-only variables (Avi Kivity, 2007-05-03)
  Trivial cleanup.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Don't allow the guest to turn off the cpu cache (Avi Kivity, 2007-05-03)
  The cpu cache is a host resource; the guest should not be able to turn it off (even for itself).
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Hack real-mode segments on vmx from KVM_SET_SREGS (Avi Kivity, 2007-05-03)
  As usual, we need to mangle segment registers when emulating real mode as vm86, which has specific constraints. We special case the reset segment base, and set the "access rights" (or descriptor flags) to vm86 compatible values.

  This fixes reboot on vmx.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Modify guest segments after potentially switching modes (Avi Kivity, 2007-05-03)
  The SET_SREGS ioctl modifies both cr0.pe (real mode/protected mode) and the guest segment registers. Since segment handling is modified by the mode on Intel processors, update the segment registers after the mode switch has taken place.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Remove set_cr0_no_modeswitch() arch op (Avi Kivity, 2007-05-03)
  set_cr0_no_modeswitch() was a hack to avoid corrupting segment registers. As we now cache the protected mode values on entry to real mode, this isn't an issue anymore, and it interferes with reboot (which usually _is_ a modeswitch).
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Workaround vmx inability to virtualize the reset state (Avi Kivity, 2007-05-03)
  The reset state has cs.selector == 0xf000 and cs.base == 0xffff0000, which aren't compatible with vm86 mode, which is used for real mode virtualization. When we create a vcpu, we set cs.base to 0xf0000, but if we get there by way of a reset, the values are inconsistent and vmx refuses to enter guest mode.

  Workaround by detecting the state and munging it appropriately.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: MMU: Remove global pte tracking (Avi Kivity, 2007-05-03)
  The initial, noncaching, version of the kvm mmu flushed all the nonglobal shadow page table translations (much like a native tlb flush). The new implementation flushes translations only when they change, rendering global pte tracking superfluous.

  This removes the unused tracking mechanism and storage space.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: MMU: Remove unnecessary check for pdptr access (Avi Kivity, 2007-05-03)
  We already special case the pdptr access, so no need to check it again.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Avoid guest virtual addresses in string pio userspace interface (Avi Kivity, 2007-05-03)
  The current string pio interface communicates using guest virtual addresses, relying on userspace to translate addresses and to check permissions. This interface cannot fully support guest smp, as the check needs to take into account two pages at once in case an unaligned string transfer straddles a page boundary.

  Change the interface not to communicate guest addresses at all; instead use a buffer page (mmaped by userspace) and do transfers there. The kernel manages the virtual to physical translation and can perform the checks atomically by taking the appropriate locks.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
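  The abstract description that replaces guest virtual addresses might look roughly like the following (hypothetical field names, loosely modeled on the io member of today's struct kvm_run): userspace sees only the direction, port, element size, count, and where in the shared buffer the data lives.

    #include <stdint.h>

    struct pio_request {
            uint8_t  direction;     /* 0 = in, 1 = out */
            uint8_t  size;          /* bytes per element: 1, 2 or 4 */
            uint16_t port;
            uint32_t count;         /* > 1 for string instructions */
            uint64_t data_offset;   /* offset of the data in the mmaped buffer */
    };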
* KVM: Future-proof argument-less ioctls (Avi Kivity, 2007-05-03)
  Some ioctls ignore their arguments. By requiring them to be zero now, we allow a nonzero value to have some special meaning in the future.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
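  The convention amounts to a one-line check in each handler; a sketch (hypothetical handler name):

    #include <errno.h>

    static long ioctl_no_arg_sketch(unsigned long arg)
    {
            if (arg)
                    return -EINVAL; /* nonzero is reserved for future use */

            /* ... perform the argument-less operation ... */
            return 0;
    }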
* KVM: Allow kernel to select size of mmap() buffer (Avi Kivity, 2007-05-03)
  This allows us to store offsets in the kernel/user kvm_run area, and be sure that userspace has them mapped. As offsets can be outside the kvm_run struct, userspace has no way of knowing how much to mmap.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Add guest mode signal mask (Avi Kivity, 2007-05-03)
  Allow a special signal mask to be used while executing in guest mode. This allows signals to be used to interrupt a vcpu without requiring signal delivery to a userspace handler, which is quite expensive. Userspace still receives -EINTR and can get the signal via sigwait().
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Initialize the apic_base msr on svm too (Avi Kivity, 2007-05-03)
  Older userspace didn't care, but newer userspace (with the cpuid changes) does.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Add a special exit reason when exiting due to an interrupt (Avi Kivity, 2007-05-03)
  This is redundant, as we also return -EINTR from the ioctl, but it allows us to examine the exit_reason field on resume without seeing old data.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Fold kvm_run::exit_type into kvm_run::exit_reason (Avi Kivity, 2007-05-03)
  Currently, userspace is told about the nature of the last exit from the guest using two fields, exit_type and exit_reason, where exit_type has just two enumerations (and no need for more). So fold exit_type into exit_reason, reducing the complexity of determining what really happened.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Allow userspace to process hypercalls which have no kernel handler (Avi Kivity, 2007-05-03)
  This is useful for paravirtualized graphics devices, for example.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Add method to check for backwards-compatible API extensions (Avi Kivity, 2007-05-03)
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Remove the 'emulated' field from the userspace interface (Avi Kivity, 2007-05-03)
  We no longer emulate single instructions in userspace. Instead, we service mmio or pio requests.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Handle cpuid in the kernel instead of punting to userspace (Avi Kivity, 2007-05-03)
  KVM used to handle cpuid by letting userspace decide what values to return to the guest. We now handle cpuid completely in the kernel. We still let userspace decide which values the guest will see by having userspace set up the value table beforehand (this is necessary to allow management software to set the cpu features to the least common denominator, so that live migration can work).

  The motivation for the change is that kvm kernel code can be impacted by cpuid features, for example the x86 emulator.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
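  A sketch of the split (hypothetical structures): userspace populates a table of leaves once, and the kernel answers guest CPUID instructions from it without exiting.

    #include <stddef.h>
    #include <stdint.h>

    struct cpuid_entry {
            uint32_t function;              /* leaf number requested in eax */
            uint32_t eax, ebx, ecx, edx;    /* values to report to the guest */
    };

    /* On a guest CPUID exit, the kernel consults the userspace-provided
     * table instead of bouncing out to userspace. */
    static const struct cpuid_entry *
    find_cpuid_entry(const struct cpuid_entry *table, size_t n, uint32_t function)
    {
            for (size_t i = 0; i < n; i++)
                    if (table[i].function == function)
                            return &table[i];
            return NULL;
    }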
* KVM: Do not communicate to userspace through cpu registers during PIO (Avi Kivity, 2007-05-03)
  Currently, when passing a PIO emulation request to userspace, we rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx (on string instructions). This (a) requires two extra ioctls for getting and setting the registers and (b) is unfriendly to non-x86 archs, when they get kvm ports.

  So fix by doing the register fixups in the kernel and passing to userspace only an abstract description of the PIO to be done.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Use a shared page for kernel/user communication when running a vcpu (Avi Kivity, 2007-05-03)
  Instead of passing a 'struct kvm_run' back and forth between the kernel and userspace, allocate a page and allow the user to mmap() it. This reduces needless copying and makes the interface expandable by providing lots of free space.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
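  For flavor, a minimal userspace sketch of the mmap-based interface, written against today's KVM ioctl names (KVM_GET_VCPU_MMAP_SIZE and friends), which may differ in detail from the 2007 interface; guest memory and register setup are omitted and error handling is minimal.

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int kvm  = open("/dev/kvm", O_RDWR);
            int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
            int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

            /* The kernel, not userspace, decides how much to map. */
            int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);

            struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, vcpu, 0);
            if (run == MAP_FAILED)
                    return 1;

            /* After KVM_RUN returns, exit information is read straight from
             * the shared page instead of being copied out by an ioctl. */
            ioctl(vcpu, KVM_RUN, 0);
            printf("exit_reason = %u\n", run->exit_reason);
            return 0;
    }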
* KVM: Fix bogus sign extension in mmu mapping audit (Avi Kivity, 2007-05-03)
  When auditing a 32-bit guest on a 64-bit host, sign extension of the page table directory pointer table index caused bogus addresses to be shown on audit errors.

  Fix by declaring the index unsigned.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
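  The class of bug is easy to demonstrate in isolation: when a 32-bit signed intermediate with its top bit set is widened to 64 bits it is sign-extended, while an unsigned one is zero-extended.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* A 32-bit value with the top bit set, as an index-derived
             * offset might be. */
            int32_t  s = INT32_MIN;         /* 0x80000000 */
            uint32_t u = 0x80000000u;

            printf("signed   widened: %#llx\n", (unsigned long long)(int64_t)s);
            printf("unsigned widened: %#llx\n", (unsigned long long)(uint64_t)u);
            return 0;
    }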
* KVM: Use own minor number (Avi Kivity, 2007-05-03)
  Use the minor number (232) allocated to kvm by lanana.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Use the generic skip_emulated_instruction() in hypercall code (Dor Laor, 2007-05-03)
  Instead of twiddling the rip registers directly, use the skip_emulated_instruction() function to do that for us.
  Signed-off-by: Dor Laor <dor.laor@qumranet.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Fix guest register corruption on paravirt hypercall (Dor Laor, 2007-05-03)
  The hypercall code mixes up the ->cache_regs() and ->decache_regs() callbacks, resulting in guest register corruption.
  Signed-off-by: Dor Laor <dor.laor@qumranet.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Fix off-by-one when writing to a nonpae guest pde (Avi Kivity, 2007-04-19)
  Nonpae guest pdes are shadowed by two pae ptes, so we double the offset twice: once to account for the pte size difference, and once because we need two shadow pdes for a single guest pde.

  But when writing to the upper guest pde we also need to truncate the lower bits, otherwise the multiply shifts these bits into the pde index and causes an access to the wrong shadow pde. If we're at the end of the page (accessing the very last guest pde) we can even overflow into the next host page and oops.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: always reload segment selectors (Ingo Molnar, 2007-03-27)
  A failed VM entry on VMX might still change %fs or %gs, thus make sure that KVM always reloads the segment selectors. This is crucial on both x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like 'current' depend, and x86_64 has 0 there and needs MSR_GS_BASE to work.
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* KVM: Prevent system selectors leaking into guest on real->protected mode transition on vmx (Avi Kivity, 2007-03-27)
  Intel virtualization extensions do not support virtualizing real mode. So kvm uses virtualized vm86 mode to run real mode code. Unfortunately, this virtualized vm86 mode does not support the so-called "big real" mode, where the segment selector and base do not agree with each other according to the real mode rules (base == selector << 4).

  To work around this, kvm checks whether a selector/base pair violates the virtualized vm86 rules, and if so, forces it into conformance. On a transition back to protected mode, if we see that the guest did not touch a forced segment, we restore it back to the original protected mode value.

  This pile of hacks breaks down if the gdt has changed in real mode, as it can cause a segment selector to point to a system descriptor instead of a normal data segment. In fact, this happens with the Windows bootloader and the qemu acpi bios, where a protected mode memcpy routine issues an innocent 'pop %es' and traps on an attempt to load a system descriptor.

  "Fix" by checking if the to-be-restored selector points at a system segment, and if so, coercing it into a normal data segment. The long term solution, of course, is to abandon vm86 mode and use emulation for big real mode.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
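  The vm86 rule being enforced is base == selector << 4; a sketch of the conformance check (hypothetical helper; kvm's actual fixup and restore logic is more involved than this):

    #include <stdbool.h>
    #include <stdint.h>

    struct rmode_seg {
            uint16_t selector;
            uint32_t base;
    };

    /* vm86 requires that the base and selector agree. */
    static bool vm86_conformant(const struct rmode_seg *s)
    {
            return s->base == (uint32_t)s->selector << 4;
    }

    /* One possible fixup: derive the selector from the base, since the
     * base is what determines where memory accesses actually go. */
    static void force_conformant(struct rmode_seg *s)
    {
            if (!vm86_conformant(s))
                    s->selector = (uint16_t)(s->base >> 4);
    }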
* KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram (Avi Kivity, 2007-03-18)
  PAGE_MASK is an unsigned long, so using it to mask physical addresses on i386 (which are 64-bit wide) leads to truncation. This can result in page->private of unrelated memory pages being modified, with disastrous results.

  Fix by not using PAGE_MASK for physical addresses; instead calculate the correct value directly from PAGE_SIZE. Also fix a similar BUG_ON().
  Acked-by: Ingo Molnar <mingo@elte.hu>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
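  The truncation is easy to reproduce in a standalone program by modelling i386's 32-bit unsigned long with uint32_t:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u

    int main(void)
    {
            /* On i386, PAGE_MASK is a 32-bit unsigned long: 0xfffff000. */
            uint32_t page_mask_32 = ~(uint32_t)(PAGE_SIZE - 1);

            uint64_t phys = 0x123456000ull;  /* a physical address above 4GB */

            /* Zero-extending the 32-bit mask wipes the high address bits... */
            uint64_t truncated = phys & page_mask_32;

            /* ...while a mask computed at 64-bit width keeps them. */
            uint64_t correct = phys & ~((uint64_t)PAGE_SIZE - 1);

            printf("truncated: %#llx\n", (unsigned long long)truncated);
            printf("correct:   %#llx\n", (unsigned long long)correct);
            return 0;
    }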
* KVM: MMU: Fix guest writes to nonpae pde (Avi Kivity, 2007-03-18)
  KVM shadow page tables are always in pae mode, regardless of the guest setting. This means that a guest pde (mapping 4MB of memory) is mapped to two shadow pdes (mapping 2MB each).

  When the guest writes to a pte or pde, we intercept the write and emulate it. We also remove any shadowed mappings corresponding to the write. Since the mmu did not account for the doubling in the number of pdes, it removed the wrong entry, resulting in a mismatch between shadow page tables and guest page tables, followed shortly by guest memory corruption.

  This patch fixes the problem by detecting the special case of writing to a non-pae pde and adjusting the address and number of shadow pdes zapped accordingly.
  Acked-by: Ingo Molnar <mingo@elte.hu>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Fix guest sysenter on vmx (Avi Kivity, 2007-03-18)
  The vmx code currently treats the guest's sysenter support msrs as 32-bit values, which breaks 32-bit compat mode userspace on 64-bit guests. Fix by using the native word width of the machine.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Unset kvm_arch_ops if arch module loading failed (Avi Kivity, 2007-03-18)
  Otherwise, the core module thinks the arch module is loaded, and won't let you reload it after you've fixed the bug.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Move kvmfs magic number to <linux/magic.h> (Andrew Morton, 2007-03-04)
  Use the standard magic.h for kvmfs.
  Cc: Avi Kivity <avi@qumranet.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Fix bogus failure in kvm.ko module initialization (Avi Kivity, 2007-03-04)
  A bogus 'return r' can cause an otherwise successful module load to fail. This both denies users the use of kvm, and it also denies them the use of their machine, as it leaves a filesystem registered with its callbacks pointing into now-freed module memory.

  Fix by returning a zero like a good module.

  Thanks to Richard Lucassen <mailinglists@lucassen.org> (?) for reporting the problem and for providing access to a machine which exhibited it.
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Remove write access permissions when dirty-page-logging is enabled (Uri Lublin, 2007-03-04)
  Enabling dirty page logging is done using KVM_SET_MEMORY_REGION ioctl. If the memory region already exists, we need to remove write accesses, so writes will be caught, and dirty pages will be logged.
  Signed-off-by: Uri Lublin <uril@qumranet.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* kvm: move do_remove_write_access() up (Uri Lublin, 2007-03-04)
  To be called from kvm_vm_ioctl_set_memory_region().
  Signed-off-by: Uri Lublin <uril@qumranet.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Fix dirty page log bitmap size/access calculation (Uri Lublin, 2007-03-04)
  Since dirty_bitmap is an unsigned long array, the alignment and size need to take that into account.
  Signed-off-by: Uri Lublin <uril@qumranet.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
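  The size rule can be written as a small helper (a sketch, not the kernel's exact expression): one bit per page, rounded up to whole unsigned longs rather than whole bytes.

    #include <stdio.h>

    static unsigned long dirty_bitmap_bytes(unsigned long npages)
    {
            unsigned long bits_per_long = 8 * sizeof(unsigned long);
            unsigned long nlongs = (npages + bits_per_long - 1) / bits_per_long;

            return nlongs * sizeof(unsigned long);
    }

    int main(void)
    {
            /* 100 pages need 13 bytes of bits, but 16 bytes on a 64-bit
             * host so the unsigned long array is fully backed. */
            printf("%lu\n", dirty_bitmap_bytes(100));
            return 0;
    }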
* KVM: Add missing calls to mark_page_dirty() (Uri Lublin, 2007-03-04)
  A few places where we modify guest memory fail to call mark_page_dirty(), causing live migration to fail. This adds the missing calls.
  Signed-off-by: Uri Lublin <uril@qumranet.com>
  Signed-off-by: Avi Kivity <avi@qumranet.com>
* KVM: Per-vcpu inodesAvi Kivity2007-03-04
| | | | | | | | | | | | | | | | | | | Allocate a distinct inode for every vcpu in a VM. This has the following benefits: - the filp cachelines are no longer bounced when f_count is incremented on every ioctl() - the API and internal code are distinctly clearer; for example, on the KVM_GET_REGS ioctl, there is no need to copy the vcpu number from userspace and then copy the registers back; the vcpu identity is derived from the fd used to make the call Right now the performance benefits are completely theoretical since (a) we don't support more than one vcpu per VM and (b) virtualization hardware inefficiencies completely everwhelm any cacheline bouncing effects. But both of these will change, and we need to prepare the API today. Signed-off-by: Avi Kivity <avi@qumranet.com>