author Glenn Elliott <gelliott@cs.unc.edu> 2012-03-04 19:47:13 -0500
committer Glenn Elliott <gelliott@cs.unc.edu> 2012-03-04 19:47:13 -0500
commit c71c03bda1e86c9d5198c5d83f712e695c4f2a1e
tree ecb166cb3e2b7e2adb3b5e292245fefd23381ac8 /Documentation/kvm
parent ea53c912f8a86a8567697115b6a0d8152beee5c8
parent 6a00f206debf8a5c8899055726ad127dbeeed098
Merge branch 'mpi-master' into wip-k-fmlp
Conflicts: litmus/sched_cedf.c
Diffstat (limited to 'Documentation/kvm')
-rw-r--r--  Documentation/kvm/api.txt               1220
-rw-r--r--  Documentation/kvm/cpuid.txt               42
-rw-r--r--  Documentation/kvm/mmu.txt                348
-rw-r--r--  Documentation/kvm/msr.txt                153
-rw-r--r--  Documentation/kvm/review-checklist.txt    38
5 files changed, 0 insertions, 1801 deletions
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
deleted file mode 100644
index 5f5b64982b1a..000000000000
--- a/Documentation/kvm/api.txt
+++ /dev/null
@@ -1,1220 +0,0 @@
1The Definitive KVM (Kernel-based Virtual Machine) API Documentation
2===================================================================
3
41. General description
5
6The kvm API is a set of ioctls that are issued to control various aspects
7of a virtual machine. The ioctls belong to three classes:
8
9 - System ioctls: These query and set global attributes which affect the
10 whole kvm subsystem. In addition a system ioctl is used to create
11 virtual machines.
12
13 - VM ioctls: These query and set attributes that affect an entire virtual
14 machine, for example memory layout. In addition a VM ioctl is used to
15 create virtual cpus (vcpus).
16
17 Only run VM ioctls from the same process (address space) that was used
18 to create the VM.
19
20 - vcpu ioctls: These query and set attributes that control the operation
21 of a single virtual cpu.
22
23 Only run vcpu ioctls from the same thread that was used to create the
24 vcpu.
25
262. File descriptors
27
28The kvm API is centered around file descriptors. An initial
29open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
30can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
31handle will create a VM file descriptor which can be used to issue VM
32ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
33and return a file descriptor pointing to it. Finally, ioctls on a vcpu
34fd can be used to control the vcpu, including the important task of
35actually running guest code.
36
37In general file descriptors can be migrated among processes by means
38of fork() and the SCM_RIGHTS facility of unix domain sockets. These
39kinds of tricks are explicitly not supported by kvm. While they will
40not cause harm to the host, their actual behavior is not guaranteed by
41the API. The only supported use is one virtual machine per process,
42and one vcpu per thread.
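
The descriptor chain described above can be sketched as follows. This is
a minimal illustration, assuming a Linux host that ships <linux/kvm.h>;
struct kvm_handles and both function names are hypothetical helpers, not
part of the kvm API. The sketch also performs the KVM_GET_API_VERSION
check that section 4.1 asks for, and uses vcpu id 0.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical container for the three descriptors; -1 means "not open". */
struct kvm_handles {
    int kvm_fd;   /* system handle from open("/dev/kvm") */
    int vm_fd;    /* VM fd returned by KVM_CREATE_VM */
    int vcpu_fd;  /* vcpu fd returned by KVM_CREATE_VCPU */
};

/* Close whatever is open and reset every field to -1. */
void kvm_handles_close(struct kvm_handles *h)
{
    assert(h != NULL);
    if (h->vcpu_fd >= 0)
        close(h->vcpu_fd);
    if (h->vm_fd >= 0)
        close(h->vm_fd);
    if (h->kvm_fd >= 0)
        close(h->kvm_fd);
    h->kvm_fd = h->vm_fd = h->vcpu_fd = -1;
}

/* Returns 0 on success, -1 on failure; on failure nothing is left open. */
int kvm_handles_open(struct kvm_handles *h, const char *path)
{
    assert(h != NULL && path != NULL);
    h->kvm_fd = h->vm_fd = h->vcpu_fd = -1;

    h->kvm_fd = open(path, O_RDWR);
    if (h->kvm_fd < 0)
        return -1;

    /* Refuse to continue unless the stable API (version 12) is reported. */
    if (ioctl(h->kvm_fd, KVM_GET_API_VERSION, 0) != 12)
        goto fail;

    /* System ioctl on the /dev/kvm handle yields the VM fd. */
    h->vm_fd = ioctl(h->kvm_fd, KVM_CREATE_VM, 0);
    if (h->vm_fd < 0)
        goto fail;

    /* VM ioctl yields the vcpu fd; the argument is the vcpu id. */
    h->vcpu_fd = ioctl(h->vm_fd, KVM_CREATE_VCPU, 0);
    if (h->vcpu_fd < 0)
        goto fail;

    return 0;
fail:
    kvm_handles_close(h);
    return -1;
}
```

A caller would normally pass "/dev/kvm" as the path and keep all later
vcpu ioctls on the thread that called kvm_handles_open(), in line with
the one-vcpu-per-thread rule above.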
43
443. Extensions
45
46As of Linux 2.6.22, the KVM ABI has been stabilized: no backward
47incompatible changes are allowed. However, there is an extension
48facility that allows backward-compatible extensions to the API to be
49queried and used.
50
51The extension mechanism is not based on the Linux version number.
52Instead, kvm defines extension identifiers and a facility to query
53whether a particular extension identifier is available. If it is, a
54set of ioctls is available for application use.
55
564. API description
57
58This section describes ioctls that can be used to control kvm guests.
59For each ioctl, the following information is provided along with a
60description:
61
62 Capability: which KVM extension provides this ioctl. Can be 'basic',
63 which means that it will be provided by any kernel that supports
64 API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which
65 means availability needs to be checked with KVM_CHECK_EXTENSION
66 (see section 4.4).
67
68 Architectures: which instruction set architectures provide this ioctl.
69 x86 includes both i386 and x86_64.
70
71 Type: system, vm, or vcpu.
72
73 Parameters: what parameters are accepted by the ioctl.
74
75 Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL)
76 are not detailed, but errors with specific meanings are.
77
784.1 KVM_GET_API_VERSION
79
80Capability: basic
81Architectures: all
82Type: system ioctl
83Parameters: none
84Returns: the constant KVM_API_VERSION (=12)
85
86This identifies the API version as the stable kvm API. It is not
87expected that this number will change. However, Linux 2.6.20 and
882.6.21 report earlier versions; these are not documented and not
89supported. Applications should refuse to run if KVM_GET_API_VERSION
90returns a value other than 12. If this check passes, all ioctls
91described as 'basic' will be available.
92
934.2 KVM_CREATE_VM
94
95Capability: basic
96Architectures: all
97Type: system ioctl
98Parameters: none
99Returns: a VM fd that can be used to control the new virtual machine.
100
101The new VM has no virtual cpus and no memory. An mmap() of a VM fd
102will access the virtual machine's physical address space; offset zero
103corresponds to guest physical address zero. Use of mmap() on a VM fd
104is discouraged if userspace memory allocation (KVM_CAP_USER_MEMORY) is
105available.
106
1074.3 KVM_GET_MSR_INDEX_LIST
108
109Capability: basic
110Architectures: x86
111Type: system
112Parameters: struct kvm_msr_list (in/out)
113Returns: 0 on success; -1 on error
114Errors:
115 E2BIG: the msr index list is too big to fit in the array specified by
116 the user.
117
118struct kvm_msr_list {
119 __u32 nmsrs; /* number of msrs in entries */
120 __u32 indices[0];
121};
122
123This ioctl returns the guest msrs that are supported. The list varies
124by kvm version and host processor, but does not change otherwise. The
125user fills in the size of the indices array in nmsrs, and in return
126kvm adjusts nmsrs to reflect the actual number of msrs and fills in
127the indices array with their numbers.
128
129Note: if kvm indicates support for MCE (KVM_CAP_MCE), then the MCE bank MSRs are
130not returned in the MSR list, as different vcpus can have a different number
131of banks, as set via the KVM_X86_SETUP_MCE ioctl.
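
One common way to size the indices array is a two-pass call, sketched
below for an x86 host. The sketch assumes the kernel writes the required
count back into nmsrs even when it fails with E2BIG; the text above only
says that kvm "adjusts nmsrs", so treat that as an assumption. The name
kvm_fetch_msr_list is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * On success returns 0 and stores a malloc'ed list in *out (caller frees).
 * On failure returns -1 and leaves *out set to NULL.
 * kvm_fd is the system handle from open("/dev/kvm").
 */
int kvm_fetch_msr_list(int kvm_fd, struct kvm_msr_list **out)
{
    struct kvm_msr_list probe;
    struct kvm_msr_list *list;

    assert(out != NULL);
    *out = NULL;

    /* Pass 1: nmsrs = 0, so a non-empty list is expected to give E2BIG. */
    memset(&probe, 0, sizeof(probe));
    if (ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, &probe) < 0 && errno != E2BIG)
        return -1;

    /* Pass 2: allocate room for the count reported in probe.nmsrs. */
    list = calloc(1, sizeof(*list) + probe.nmsrs * sizeof(list->indices[0]));
    if (list == NULL)
        return -1;
    list->nmsrs = probe.nmsrs;

    if (ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, list) < 0) {
        free(list);
        return -1;
    }

    *out = list;
    return 0;
}
```

After a successful call, list->indices[0 .. list->nmsrs - 1] hold the
supported msr numbers.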
132
1334.4 KVM_CHECK_EXTENSION
134
135Capability: basic
136Architectures: all
137Type: system ioctl
138Parameters: extension identifier (KVM_CAP_*)
139Returns: 0 if unsupported; 1 (or some other positive integer) if supported
140
141The API allows the application to query about extensions to the core
142kvm API. Userspace passes an extension identifier (an integer) and
143receives an integer that describes the extension availability.
144Generally 0 means no and 1 means yes, but some extensions may report
145additional information in the integer return value.
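
A small wrapper makes the "0 means no, positive means yes" convention
explicit. This is a sketch, assuming <linux/kvm.h> is available; the two
function names are illustrative only.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Raw result of KVM_CHECK_EXTENSION for `cap`: 0 if unsupported, a
 * positive integer if supported, -1 if the ioctl itself failed.
 * kvm_fd is the system handle from open("/dev/kvm").
 */
int kvm_extension_value(int kvm_fd, long cap)
{
    assert(cap >= 0);  /* extension identifiers are non-negative */
    return ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap);
}

/* Yes/no view of the same query; an ioctl error counts as "no". */
int kvm_has_extension(int kvm_fd, long cap)
{
    return kvm_extension_value(kvm_fd, cap) > 0;
}
```

Callers that need the extra information some extensions encode in the
return value should use kvm_extension_value() rather than the boolean
wrapper.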
146
1474.5 KVM_GET_VCPU_MMAP_SIZE
148
149Capability: basic
150Architectures: all
151Type: system ioctl
152Parameters: none
153Returns: size of vcpu mmap area, in bytes
154
155The KVM_RUN ioctl (see section 4.9) communicates with userspace via a shared
156memory region. This ioctl returns the size of that region. See the
157KVM_RUN documentation for details.
158
1594.6 KVM_SET_MEMORY_REGION
160
161Capability: basic
162Architectures: all
163Type: vm ioctl
164Parameters: struct kvm_memory_region (in)
165Returns: 0 on success, -1 on error
166
167This ioctl is obsolete and has been removed.
168
1694.6 KVM_CREATE_VCPU
170
171Capability: basic
172Architectures: all
173Type: vm ioctl
174Parameters: vcpu id (apic id on x86)
175Returns: vcpu fd on success, -1 on error
176
177This API adds a vcpu to a virtual machine. The vcpu id is a small integer
178in the range [0, max_vcpus).
179
1804.7 KVM_GET_DIRTY_LOG (vm ioctl)
181
182Capability: basic
183Architectures: x86
184Type: vm ioctl
185Parameters: struct kvm_dirty_log (in/out)
186Returns: 0 on success, -1 on error
187
188/* for KVM_GET_DIRTY_LOG */
189struct kvm_dirty_log {
190 __u32 slot;
191 __u32 padding;
192 union {
193 void __user *dirty_bitmap; /* one bit per page */
194 __u64 padding;
195 };
196};
197
198Given a memory slot, return a bitmap containing any pages dirtied
199since the last call to this ioctl. Bit 0 is the first page in the
200memory slot. Ensure the entire structure is cleared to avoid padding
201issues.
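
Walking the returned bitmap can be sketched as below. The helper is
illustrative and does not issue the ioctl itself. It reads the bitmap
byte by byte, least significant bit first, which is an assumption that
holds on little-endian hosts; the text above only promises one bit per
page with bit 0 as the first page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the pages marked dirty among the first `npages` bits of `bitmap`.
 * Page N of the slot is bit (N % 8) of byte (N / 8).
 */
size_t count_dirty_pages(const uint8_t *bitmap, size_t npages)
{
    size_t page, count = 0;

    assert(bitmap != NULL || npages == 0);
    for (page = 0; page < npages; page++) {
        if (bitmap[page / 8] & (1u << (page % 8)))
            count++;
    }
    return count;
}
```

The same loop shape works for acting on each dirty page, for example
resending it during migration, by replacing the counter with a callback.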
202
2034.8 KVM_SET_MEMORY_ALIAS
204
205Capability: basic
206Architectures: x86
207Type: vm ioctl
208Parameters: struct kvm_memory_alias (in)
209Returns: 0 (success), -1 (error)
210
211This ioctl is obsolete and has been removed.
212
2134.9 KVM_RUN
214
215Capability: basic
216Architectures: all
217Type: vcpu ioctl
218Parameters: none
219Returns: 0 on success, -1 on error
220Errors:
221 EINTR: an unmasked signal is pending
222
223This ioctl is used to run a guest virtual cpu. While there are no
224explicit parameters, there is an implicit parameter block that can be
225obtained by mmap()ing the vcpu fd at offset 0, with the size given by
226KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct
227kvm_run' (see below).
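
Obtaining the implicit parameter block can be sketched as follows,
assuming <linux/kvm.h> is available. kvm_map_run is a hypothetical
helper; the MAP_SHARED and read/write protection flags are the usual
choice for a region shared with the kernel rather than something the
text above specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Map the kvm_run block of a vcpu. kvm_fd is the /dev/kvm handle (used
 * for KVM_GET_VCPU_MMAP_SIZE), vcpu_fd the vcpu descriptor.
 * Returns NULL on failure; on success *size_out holds the mapping size.
 */
struct kvm_run *kvm_map_run(int kvm_fd, int vcpu_fd, size_t *size_out)
{
    int size;
    void *p;

    assert(size_out != NULL);

    size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (size < (int)sizeof(struct kvm_run))
        return NULL;  /* covers both -1 and an implausibly small size */

    p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED,
             vcpu_fd, 0);
    if (p == MAP_FAILED)
        return NULL;

    *size_out = (size_t)size;
    return p;
}
```

The mapping should be released with munmap() using the size stored in
*size_out once the vcpu is no longer needed.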
228
2294.10 KVM_GET_REGS
230
231Capability: basic
232Architectures: all
233Type: vcpu ioctl
234Parameters: struct kvm_regs (out)
235Returns: 0 on success, -1 on error
236
237Reads the general purpose registers from the vcpu.
238
239/* x86 */
240struct kvm_regs {
241 /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
242 __u64 rax, rbx, rcx, rdx;
243 __u64 rsi, rdi, rsp, rbp;
244 __u64 r8, r9, r10, r11;
245 __u64 r12, r13, r14, r15;
246 __u64 rip, rflags;
247};
248
2494.11 KVM_SET_REGS
250
251Capability: basic
252Architectures: all
253Type: vcpu ioctl
254Parameters: struct kvm_regs (in)
255Returns: 0 on success, -1 on error
256
257Writes the general purpose registers into the vcpu.
258
259See KVM_GET_REGS for the data structure.
260
2614.12 KVM_GET_SREGS
262
263Capability: basic
264Architectures: x86
265Type: vcpu ioctl
266Parameters: struct kvm_sregs (out)
267Returns: 0 on success, -1 on error
268
269Reads special registers from the vcpu.
270
271/* x86 */
272struct kvm_sregs {
273 struct kvm_segment cs, ds, es, fs, gs, ss;
274 struct kvm_segment tr, ldt;
275 struct kvm_dtable gdt, idt;
276 __u64 cr0, cr2, cr3, cr4, cr8;
277 __u64 efer;
278 __u64 apic_base;
279 __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
280};
281
282interrupt_bitmap is a bitmap of pending external interrupts. At most
283one bit may be set. This interrupt has been acknowledged by the APIC
284but not yet injected into the cpu core.
285
2864.13 KVM_SET_SREGS
287
288Capability: basic
289Architectures: x86
290Type: vcpu ioctl
291Parameters: struct kvm_sregs (in)
292Returns: 0 on success, -1 on error
293
294Writes special registers into the vcpu. See KVM_GET_SREGS for the
295data structures.
296
2974.14 KVM_TRANSLATE
298
299Capability: basic
300Architectures: x86
301Type: vcpu ioctl
302Parameters: struct kvm_translation (in/out)
303Returns: 0 on success, -1 on error
304
305Translates a virtual address according to the vcpu's current address
306translation mode.
307
308struct kvm_translation {
309 /* in */
310 __u64 linear_address;
311
312 /* out */
313 __u64 physical_address;
314 __u8 valid;
315 __u8 writeable;
316 __u8 usermode;
317 __u8 pad[5];
318};
319
3204.15 KVM_INTERRUPT
321
322Capability: basic
323Architectures: x86
324Type: vcpu ioctl
325Parameters: struct kvm_interrupt (in)
326Returns: 0 on success, -1 on error
327
328Queues a hardware interrupt vector to be injected. This is only
329useful if in-kernel local APIC is not used.
330
331/* for KVM_INTERRUPT */
332struct kvm_interrupt {
333 /* in */
334 __u32 irq;
335};
336
337Note 'irq' is an interrupt vector, not an interrupt pin or line.
338
3394.16 KVM_DEBUG_GUEST
340
341Capability: basic
342Architectures: none
343Type: vcpu ioctl
344Parameters: none
345Returns: -1 on error
346
347Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
348
3494.17 KVM_GET_MSRS
350
351Capability: basic
352Architectures: x86
353Type: vcpu ioctl
354Parameters: struct kvm_msrs (in/out)
355Returns: 0 on success, -1 on error
356
357Reads model-specific registers from the vcpu. Supported msr indices can
358be obtained using KVM_GET_MSR_INDEX_LIST.
359
360struct kvm_msrs {
361 __u32 nmsrs; /* number of msrs in entries */
362 __u32 pad;
363
364 struct kvm_msr_entry entries[0];
365};
366
367struct kvm_msr_entry {
368 __u32 index;
369 __u32 reserved;
370 __u64 data;
371};
372
373Application code should set the 'nmsrs' member (which indicates the
374size of the entries array) and the 'index' member of each array entry.
375kvm will fill in the 'data' member.
376
3774.18 KVM_SET_MSRS
378
379Capability: basic
380Architectures: x86
381Type: vcpu ioctl
382Parameters: struct kvm_msrs (in)
383Returns: 0 on success, -1 on error
384
385Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the
386data structures.
387
388Application code should set the 'nmsrs' member (which indicates the
389size of the entries array), and the 'index' and 'data' members of each
390array entry.
391
3924.19 KVM_SET_CPUID
393
394Capability: basic
395Architectures: x86
396Type: vcpu ioctl
397Parameters: struct kvm_cpuid (in)
398Returns: 0 on success, -1 on error
399
400Defines the vcpu responses to the cpuid instruction. Applications
401should use the KVM_SET_CPUID2 ioctl if available.
402
403
404struct kvm_cpuid_entry {
405 __u32 function;
406 __u32 eax;
407 __u32 ebx;
408 __u32 ecx;
409 __u32 edx;
410 __u32 padding;
411};
412
413/* for KVM_SET_CPUID */
414struct kvm_cpuid {
415 __u32 nent;
416 __u32 padding;
417 struct kvm_cpuid_entry entries[0];
418};
419
4204.20 KVM_SET_SIGNAL_MASK
421
422Capability: basic
423Architectures: x86
424Type: vcpu ioctl
425Parameters: struct kvm_signal_mask (in)
426Returns: 0 on success, -1 on error
427
428Defines which signals are blocked during execution of KVM_RUN. This
429signal mask temporarily overrides the thread's signal mask. Any
430unblocked signal received (except SIGKILL and SIGSTOP, which retain
431their traditional behaviour) will cause KVM_RUN to return with -EINTR.
432
433Note the signal will only be delivered if not blocked by the original
434signal mask.
435
436/* for KVM_SET_SIGNAL_MASK */
437struct kvm_signal_mask {
438 __u32 len;
439 __u8 sigset[0];
440};
441
4424.21 KVM_GET_FPU
443
444Capability: basic
445Architectures: x86
446Type: vcpu ioctl
447Parameters: struct kvm_fpu (out)
448Returns: 0 on success, -1 on error
449
450Reads the floating point state from the vcpu.
451
452/* for KVM_GET_FPU and KVM_SET_FPU */
453struct kvm_fpu {
454 __u8 fpr[8][16];
455 __u16 fcw;
456 __u16 fsw;
457 __u8 ftwx; /* in fxsave format */
458 __u8 pad1;
459 __u16 last_opcode;
460 __u64 last_ip;
461 __u64 last_dp;
462 __u8 xmm[16][16];
463 __u32 mxcsr;
464 __u32 pad2;
465};
466
4674.22 KVM_SET_FPU
468
469Capability: basic
470Architectures: x86
471Type: vcpu ioctl
472Parameters: struct kvm_fpu (in)
473Returns: 0 on success, -1 on error
474
475Writes the floating point state to the vcpu.
476
477/* for KVM_GET_FPU and KVM_SET_FPU */
478struct kvm_fpu {
479 __u8 fpr[8][16];
480 __u16 fcw;
481 __u16 fsw;
482 __u8 ftwx; /* in fxsave format */
483 __u8 pad1;
484 __u16 last_opcode;
485 __u64 last_ip;
486 __u64 last_dp;
487 __u8 xmm[16][16];
488 __u32 mxcsr;
489 __u32 pad2;
490};
491
4924.23 KVM_CREATE_IRQCHIP
493
494Capability: KVM_CAP_IRQCHIP
495Architectures: x86, ia64
496Type: vm ioctl
497Parameters: none
498Returns: 0 on success, -1 on error
499
500Creates an interrupt controller model in the kernel. On x86, creates a virtual
501ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a
502local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23
503only go to the IOAPIC. On ia64, an IOSAPIC is created.
504
5054.24 KVM_IRQ_LINE
506
507Capability: KVM_CAP_IRQCHIP
508Architectures: x86, ia64
509Type: vm ioctl
510Parameters: struct kvm_irq_level
511Returns: 0 on success, -1 on error
512
513Sets the level of a GSI input to the interrupt controller model in the kernel.
514Requires that an interrupt controller model has been previously created with
515KVM_CREATE_IRQCHIP. Note that edge-triggered interrupts require the level
516to be set to 1 and then back to 0.
517
518struct kvm_irq_level {
519 union {
520 __u32 irq; /* GSI */
521 __s32 status; /* not used for KVM_IRQ_LINE */
522 };
523 __u32 level; /* 0 or 1 */
524};
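
The raise-then-lower sequence for an edge-triggered interrupt can be
sketched as below, assuming <linux/kvm.h> is available and that
KVM_CREATE_IRQCHIP has already succeeded on the VM. Both function names
are illustrative.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Set one GSI input to `level` (0 or 1). Returns 0 or -1 like the ioctl. */
int kvm_set_irq_level(int vm_fd, unsigned int gsi, unsigned int level)
{
    struct kvm_irq_level req;

    assert(level <= 1);
    memset(&req, 0, sizeof(req));
    req.irq = gsi;
    req.level = level;
    return ioctl(vm_fd, KVM_IRQ_LINE, &req);
}

/* Edge-triggered delivery: drive the line to 1, then back to 0. */
int kvm_pulse_irq(int vm_fd, unsigned int gsi)
{
    if (kvm_set_irq_level(vm_fd, gsi, 1) < 0)
        return -1;
    return kvm_set_irq_level(vm_fd, gsi, 0);
}
```

A level-triggered source would instead call kvm_set_irq_level() with 1
when the device asserts the line and with 0 when it deasserts it.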
525
5264.25 KVM_GET_IRQCHIP
527
528Capability: KVM_CAP_IRQCHIP
529Architectures: x86, ia64
530Type: vm ioctl
531Parameters: struct kvm_irqchip (in/out)
532Returns: 0 on success, -1 on error
533
534Reads the state of a kernel interrupt controller created with
535KVM_CREATE_IRQCHIP into a buffer provided by the caller.
536
537struct kvm_irqchip {
538 __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
539 __u32 pad;
540 union {
541 char dummy[512]; /* reserving space */
542 struct kvm_pic_state pic;
543 struct kvm_ioapic_state ioapic;
544 } chip;
545};
546
5474.26 KVM_SET_IRQCHIP
548
549Capability: KVM_CAP_IRQCHIP
550Architectures: x86, ia64
551Type: vm ioctl
552Parameters: struct kvm_irqchip (in)
553Returns: 0 on success, -1 on error
554
555Sets the state of a kernel interrupt controller created with
556KVM_CREATE_IRQCHIP from a buffer provided by the caller.
557
558struct kvm_irqchip {
559 __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
560 __u32 pad;
561 union {
562 char dummy[512]; /* reserving space */
563 struct kvm_pic_state pic;
564 struct kvm_ioapic_state ioapic;
565 } chip;
566};
567
5684.27 KVM_XEN_HVM_CONFIG
569
570Capability: KVM_CAP_XEN_HVM
571Architectures: x86
572Type: vm ioctl
573Parameters: struct kvm_xen_hvm_config (in)
574Returns: 0 on success, -1 on error
575
576Sets the MSR that the Xen HVM guest uses to initialize its hypercall
577page, and provides the starting address and size of the hypercall
578blobs in userspace. When the guest writes the MSR, kvm copies one
579page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
580memory.
581
582struct kvm_xen_hvm_config {
583 __u32 flags;
584 __u32 msr;
585 __u64 blob_addr_32;
586 __u64 blob_addr_64;
587 __u8 blob_size_32;
588 __u8 blob_size_64;
589 __u8 pad2[30];
590};
591
5924.27 KVM_GET_CLOCK
593
594Capability: KVM_CAP_ADJUST_CLOCK
595Architectures: x86
596Type: vm ioctl
597Parameters: struct kvm_clock_data (out)
598Returns: 0 on success, -1 on error
599
600Gets the current timestamp of kvmclock as seen by the current guest. In
601conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios
602such as migration.
603
604struct kvm_clock_data {
605 __u64 clock; /* kvmclock current value */
606 __u32 flags;
607 __u32 pad[9];
608};
609
6104.28 KVM_SET_CLOCK
611
612Capability: KVM_CAP_ADJUST_CLOCK
613Architectures: x86
614Type: vm ioctl
615Parameters: struct kvm_clock_data (in)
616Returns: 0 on success, -1 on error
617
618Sets the current timestamp of kvmclock to the value specified in its parameter.
619In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
620such as migration.
621
622struct kvm_clock_data {
623 __u64 clock; /* kvmclock current value */
624 __u32 flags;
625 __u32 pad[9];
626};
627
6284.29 KVM_GET_VCPU_EVENTS
629
630Capability: KVM_CAP_VCPU_EVENTS
631Extended by: KVM_CAP_INTR_SHADOW
632Architectures: x86
633Type: vcpu ioctl
634Parameters: struct kvm_vcpu_events (out)
635Returns: 0 on success, -1 on error
636
637Gets currently pending exceptions, interrupts, and NMIs as well as related
638states of the vcpu.
639
640struct kvm_vcpu_events {
641 struct {
642 __u8 injected;
643 __u8 nr;
644 __u8 has_error_code;
645 __u8 pad;
646 __u32 error_code;
647 } exception;
648 struct {
649 __u8 injected;
650 __u8 nr;
651 __u8 soft;
652 __u8 shadow;
653 } interrupt;
654 struct {
655 __u8 injected;
656 __u8 pending;
657 __u8 masked;
658 __u8 pad;
659 } nmi;
660 __u32 sipi_vector;
661 __u32 flags;
662};
663
664KVM_VCPUEVENT_VALID_SHADOW may be set in the flags field to signal that
665interrupt.shadow contains a valid state. Otherwise, this field is undefined.
666
6674.30 KVM_SET_VCPU_EVENTS
668
669Capability: KVM_CAP_VCPU_EVENTS
670Extended by: KVM_CAP_INTR_SHADOW
671Architectures: x86
672Type: vcpu ioctl
673Parameters: struct kvm_vcpu_events (in)
674Returns: 0 on success, -1 on error
675
676Set pending exceptions, interrupts, and NMIs as well as related states of the
677vcpu.
678
679See KVM_GET_VCPU_EVENTS for the data structure.
680
681Fields that may be modified asynchronously by running VCPUs can be excluded
682from the update. These fields are nmi.pending and sipi_vector. Keep the
683corresponding bits in the flags field cleared to suppress overwriting the
684current in-kernel state. The bits are:
685
686KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel
687KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector
688
689If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in
690the flags field to signal that interrupt.shadow contains a valid state and
691shall be written into the VCPU.
692
6934.32 KVM_GET_DEBUGREGS
694
695Capability: KVM_CAP_DEBUGREGS
696Architectures: x86
697Type: vcpu ioctl
698Parameters: struct kvm_debugregs (out)
699Returns: 0 on success, -1 on error
700
701Reads debug registers from the vcpu.
702
703struct kvm_debugregs {
704 __u64 db[4];
705 __u64 dr6;
706 __u64 dr7;
707 __u64 flags;
708 __u64 reserved[9];
709};
710
7114.33 KVM_SET_DEBUGREGS
712
713Capability: KVM_CAP_DEBUGREGS
714Architectures: x86
715Type: vcpu ioctl
716Parameters: struct kvm_debugregs (in)
717Returns: 0 on success, -1 on error
718
719Writes debug registers into the vcpu.
720
721See KVM_GET_DEBUGREGS for the data structure. The flags field is not yet
722used and must be cleared on entry.
723
7244.34 KVM_SET_USER_MEMORY_REGION
725
726Capability: KVM_CAP_USER_MEMORY
727Architectures: all
728Type: vm ioctl
729Parameters: struct kvm_userspace_memory_region (in)
730Returns: 0 on success, -1 on error
731
732struct kvm_userspace_memory_region {
733 __u32 slot;
734 __u32 flags;
735 __u64 guest_phys_addr;
736 __u64 memory_size; /* bytes */
737 __u64 userspace_addr; /* start of the userspace allocated memory */
738};
739
740/* for kvm_memory_region::flags */
741#define KVM_MEM_LOG_DIRTY_PAGES 1UL
742
743This ioctl allows the user to create or modify a guest physical memory
744slot. When changing an existing slot, it may be moved in the guest
745physical memory space, or its flags may be modified. It may not be
746resized. Slots may not overlap in guest physical address space.
747
748Memory for the region is taken starting at the address denoted by the
749field userspace_addr, which must point at user addressable memory for
750the entire memory slot size. Any object may back this memory, including
751anonymous memory, ordinary files, and hugetlbfs.
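
Registering a slot backed by anonymous memory can be sketched as
follows, assuming <linux/kvm.h> is available. kvm_add_anon_slot is a
hypothetical helper, and the caller is assumed to pass a page-aligned
guest_phys_addr and size.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Allocate `size` bytes of anonymous memory and install them as memory
 * slot `slot` at guest physical address `guest_phys_addr`.
 * Returns 0 and stores the host mapping in *mem_out on success,
 * -1 on failure (the mapping is released again in that case).
 */
int kvm_add_anon_slot(int vm_fd, uint32_t slot, uint64_t guest_phys_addr,
                      size_t size, void **mem_out)
{
    struct kvm_userspace_memory_region region;
    void *mem;

    assert(mem_out != NULL && size > 0);

    mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    memset(&region, 0, sizeof(region));
    region.slot = slot;
    region.flags = 0;  /* set KVM_MEM_LOG_DIRTY_PAGES to track writes */
    region.guest_phys_addr = guest_phys_addr;
    region.memory_size = size;
    region.userspace_addr = (uint64_t)(uintptr_t)mem;

    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        munmap(mem, size);
        return -1;
    }

    *mem_out = mem;
    return 0;
}
```

File-backed or hugetlbfs-backed slots follow the same pattern with a
different mmap() call supplying userspace_addr.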
752
753It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
754be identical. This allows large pages in the guest to be backed by large
755pages in the host.
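
The recommendation amounts to both addresses having the same offset
within a 2MB (1 << 21) region. The sketch below checks that property and
picks a matching address inside an over-sized allocation; all names are
illustrative and the code is plain arithmetic with no kvm calls.

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_PAGE_MASK ((UINT64_C(1) << 21) - 1)  /* the lower 21 bits */

/* Nonzero if the lower 21 bits of both addresses are identical. */
int slot_is_large_page_friendly(uint64_t guest_phys_addr,
                                uint64_t userspace_addr)
{
    return ((guest_phys_addr ^ userspace_addr) & LARGE_PAGE_MASK) == 0;
}

/*
 * Smallest address >= base whose lower 21 bits match guest_phys_addr.
 * Useful when the backing allocation is up to 2MB larger than the slot.
 */
uint64_t match_low_bits(uint64_t base, uint64_t guest_phys_addr)
{
    uint64_t want = guest_phys_addr & LARGE_PAGE_MASK;
    uint64_t candidate = (base & ~LARGE_PAGE_MASK) | want;

    if (candidate < base)
        candidate += LARGE_PAGE_MASK + 1;
    assert(candidate >= base);  /* would fail only on address wrap-around */
    return candidate;
}
```

The result of match_low_bits() is a candidate for userspace_addr; the
allocation must extend at least memory_size bytes past it.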
756
757The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which
758instructs kvm to keep track of writes to memory within the slot. See
759the KVM_GET_DIRTY_LOG ioctl.
760
761When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory
762region are automatically reflected into the guest. For example, an mmap()
763that affects the region will be made visible immediately. Another example
764is madvise(MADV_DROP).
765
766It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
767The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
768allocation and is deprecated.
769
7704.35 KVM_SET_TSS_ADDR
771
772Capability: KVM_CAP_SET_TSS_ADDR
773Architectures: x86
774Type: vm ioctl
775Parameters: unsigned long tss_address (in)
776Returns: 0 on success, -1 on error
777
778This ioctl defines the physical address of a three-page region in the guest
779physical address space. The region must be within the first 4GB of the
780guest physical address space and must not conflict with any memory slot
781or any mmio address. The guest may malfunction if it accesses this memory
782region.
783
784This ioctl is required on Intel-based hosts. This is needed on Intel hardware
785because of a quirk in the virtualization implementation (see the internals
786documentation when it pops into existence).
787
7884.36 KVM_ENABLE_CAP
789
790Capability: KVM_CAP_ENABLE_CAP
791Architectures: ppc
792Type: vcpu ioctl
793Parameters: struct kvm_enable_cap (in)
794Returns: 0 on success; -1 on error
795
796Not all extensions are enabled by default. Using this ioctl the application
797can enable an extension, making it available to the guest.
798
799On systems that do not support this ioctl, it always fails. On systems that
800do support it, it only works for extensions that are supported for enablement.
801
802To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should
803be used.
804
805struct kvm_enable_cap {
806 /* in */
807 __u32 cap;
808
809The capability that is supposed to get enabled.
810
811 __u32 flags;
812
813A bitfield indicating future enhancements. Has to be 0 for now.
814
815 __u64 args[4];
816
817Arguments for enabling a feature. If a feature needs initial values to
818function properly, this is the place to put them.
819
820 __u8 pad[64];
821};
822
8234.37 KVM_GET_MP_STATE
824
825Capability: KVM_CAP_MP_STATE
826Architectures: x86, ia64
827Type: vcpu ioctl
828Parameters: struct kvm_mp_state (out)
829Returns: 0 on success; -1 on error
830
831struct kvm_mp_state {
832 __u32 mp_state;
833};
834
835Returns the vcpu's current "multiprocessing state" (though also valid on
836uniprocessor guests).
837
838Possible values are:
839
840 - KVM_MP_STATE_RUNNABLE: the vcpu is currently running
841 - KVM_MP_STATE_UNINITIALIZED: the vcpu is an application processor (AP)
842 which has not yet received an INIT signal
843 - KVM_MP_STATE_INIT_RECEIVED: the vcpu has received an INIT signal, and is
844 now ready for a SIPI
845 - KVM_MP_STATE_HALTED: the vcpu has executed a HLT instruction and
846 is waiting for an interrupt
847 - KVM_MP_STATE_SIPI_RECEIVED: the vcpu has just received a SIPI (vector
848 accessible via KVM_GET_VCPU_EVENTS)
849
850This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
851irqchip, the multiprocessing state must be maintained by userspace.
852
8534.38 KVM_SET_MP_STATE
854
855Capability: KVM_CAP_MP_STATE
856Architectures: x86, ia64
857Type: vcpu ioctl
858Parameters: struct kvm_mp_state (in)
859Returns: 0 on success; -1 on error
860
861Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for
862arguments.
863
864This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
865irqchip, the multiprocessing state must be maintained by userspace.
866
8674.39 KVM_SET_IDENTITY_MAP_ADDR
868
869Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
870Architectures: x86
871Type: vm ioctl
872Parameters: unsigned long identity (in)
873Returns: 0 on success, -1 on error
874
875This ioctl defines the physical address of a one-page region in the guest
876physical address space. The region must be within the first 4GB of the
877guest physical address space and must not conflict with any memory slot
878or any mmio address. The guest may malfunction if it accesses this memory
879region.
880
881This ioctl is required on Intel-based hosts. This is needed on Intel hardware
882because of a quirk in the virtualization implementation (see the internals
883documentation when it pops into existence).
884
8854.40 KVM_SET_BOOT_CPU_ID
886
887Capability: KVM_CAP_SET_BOOT_CPU_ID
888Architectures: x86, ia64
889Type: vm ioctl
890Parameters: unsigned long vcpu_id
891Returns: 0 on success, -1 on error
892
893Define which vcpu is the Bootstrap Processor (BSP). Values are the same
894as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default
895is vcpu 0.
896
8974.41 KVM_GET_XSAVE
898
899Capability: KVM_CAP_XSAVE
900Architectures: x86
901Type: vcpu ioctl
902Parameters: struct kvm_xsave (out)
903Returns: 0 on success, -1 on error
904
905struct kvm_xsave {
906 __u32 region[1024];
907};
908
909This ioctl copies the current vcpu's xsave struct to userspace.
910
9114.42 KVM_SET_XSAVE
912
913Capability: KVM_CAP_XSAVE
914Architectures: x86
915Type: vcpu ioctl
916Parameters: struct kvm_xsave (in)
917Returns: 0 on success, -1 on error
918
919struct kvm_xsave {
920 __u32 region[1024];
921};
922
923This ioctl copies userspace's xsave struct to the kernel.
924
9254.43 KVM_GET_XCRS
926
927Capability: KVM_CAP_XCRS
928Architectures: x86
929Type: vcpu ioctl
930Parameters: struct kvm_xcrs (out)
931Returns: 0 on success, -1 on error
932
933struct kvm_xcr {
934 __u32 xcr;
935 __u32 reserved;
936 __u64 value;
937};
938
939struct kvm_xcrs {
940 __u32 nr_xcrs;
941 __u32 flags;
942 struct kvm_xcr xcrs[KVM_MAX_XCRS];
943 __u64 padding[16];
944};
945
946This ioctl copies the current vcpu's xcrs to userspace.
947
9484.44 KVM_SET_XCRS
949
950Capability: KVM_CAP_XCRS
951Architectures: x86
952Type: vcpu ioctl
953Parameters: struct kvm_xcrs (in)
954Returns: 0 on success, -1 on error
955
956struct kvm_xcr {
957 __u32 xcr;
958 __u32 reserved;
959 __u64 value;
960};
961
962struct kvm_xcrs {
963 __u32 nr_xcrs;
964 __u32 flags;
965 struct kvm_xcr xcrs[KVM_MAX_XCRS];
966 __u64 padding[16];
967};
968
969This ioctl sets the vcpu's xcrs to the values userspace specified.
970
9714.45 KVM_GET_SUPPORTED_CPUID
972
973Capability: KVM_CAP_EXT_CPUID
974Architectures: x86
975Type: system ioctl
976Parameters: struct kvm_cpuid2 (in/out)
977Returns: 0 on success, -1 on error
978
979struct kvm_cpuid2 {
980 __u32 nent;
981 __u32 padding;
982 struct kvm_cpuid_entry2 entries[0];
983};
984
985#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1
986#define KVM_CPUID_FLAG_STATEFUL_FUNC 2
987#define KVM_CPUID_FLAG_STATE_READ_NEXT 4
988
989struct kvm_cpuid_entry2 {
990 __u32 function;
991 __u32 index;
992 __u32 flags;
993 __u32 eax;
994 __u32 ebx;
995 __u32 ecx;
996 __u32 edx;
997 __u32 padding[3];
998};
999
1000This ioctl returns x86 cpuid features which are supported by both the hardware
1001and kvm. Userspace can use the information returned by this ioctl to
1002construct cpuid information (for KVM_SET_CPUID2) that is consistent with
1003hardware, kernel, and userspace capabilities, and with user requirements (for
1004example, the user may wish to constrain cpuid to emulate older hardware,
1005or for feature consistency across a cluster).
1006
1007Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
1008with the 'nent' field indicating the number of entries in the variable-size
1009array 'entries'. If the number of entries is too low to describe the cpu
1010capabilities, an error (E2BIG) is returned. If the number is too high,
1011the 'nent' field is adjusted and an error (ENOMEM) is returned. If the
1012number is just right, the 'nent' field is adjusted to the number of valid
1013entries in the 'entries' array, which is then filled.
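
The sizing rules above suggest a grow-and-retry loop, sketched here for
an x86 host. The starting size and the give-up limit are arbitrary
assumptions, kvm_fetch_supported_cpuid is a hypothetical helper, and the
sketch only reacts to E2BIG; any other error, including ENOMEM, is
treated as fatal.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define CPUID_NENT_START 16u    /* arbitrary first guess (assumption) */
#define CPUID_NENT_LIMIT 4096u  /* arbitrary give-up point (assumption) */

/*
 * On success returns 0 and stores a malloc'ed table in *out (caller
 * frees); (*out)->nent is the number of valid entries.
 * On failure returns -1 and leaves *out set to NULL.
 */
int kvm_fetch_supported_cpuid(int kvm_fd, struct kvm_cpuid2 **out)
{
    unsigned int nent;

    assert(out != NULL);
    *out = NULL;

    for (nent = CPUID_NENT_START; nent <= CPUID_NENT_LIMIT; nent *= 2) {
        struct kvm_cpuid2 *cpuid;
        int saved_errno;

        cpuid = calloc(1, sizeof(*cpuid) + nent * sizeof(cpuid->entries[0]));
        if (cpuid == NULL)
            return -1;
        cpuid->nent = nent;

        if (ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid) == 0) {
            *out = cpuid;
            return 0;
        }

        saved_errno = errno;  /* free() may change errno */
        free(cpuid);
        if (saved_errno != E2BIG)
            return -1;        /* only "array too small" is retried */
    }
    return -1;
}
```

The returned entries can then be filtered and passed on to
KVM_SET_CPUID2, as described in the paragraph before this sketch.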
1014
1015The entries returned are the host cpuid as returned by the cpuid instruction,
1016with unknown or unsupported features masked out. The fields in each entry
1017are defined as follows:
1018
1019 function: the eax value used to obtain the entry
1020 index: the ecx value used to obtain the entry (for entries that are
1021 affected by ecx)
1022 flags: an OR of zero or more of the following:
1023 KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
1024 if the index field is valid
1025 KVM_CPUID_FLAG_STATEFUL_FUNC:
1026 if cpuid for this function returns different values for successive
1027 invocations; there will be several entries with the same function,
1028 all with this flag set
1029 KVM_CPUID_FLAG_STATE_READ_NEXT:
1030 for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
1031 the first entry to be read by a cpu
1032 eax, ebx, ecx, edx: the values returned by the cpuid instruction for
1033 this function/index combination
1034
10355. The kvm_run structure
1036
1037Application code obtains a pointer to the kvm_run structure by
1038mmap()ing a vcpu fd. From that point, application code can control
1039execution by changing fields in kvm_run prior to calling the KVM_RUN
1040ioctl, and obtain information about the reason KVM_RUN returned by
1041looking up structure members.
1042
1043struct kvm_run {
1044 /* in */
1045 __u8 request_interrupt_window;
1046
1047Request that KVM_RUN return when it becomes possible to inject external
1048interrupts into the guest. Useful in conjunction with KVM_INTERRUPT.
1049
1050 __u8 padding1[7];
1051
1052 /* out */
1053 __u32 exit_reason;
1054
1055When KVM_RUN has returned successfully (return value 0), this informs
1056application code why KVM_RUN has returned. Allowable values for this
1057field are detailed below.
1058
1059 __u8 ready_for_interrupt_injection;
1060
1061If request_interrupt_window has been specified, this field indicates
1062an interrupt can be injected now with KVM_INTERRUPT.
1063
1064 __u8 if_flag;
1065
1066The value of the current interrupt flag. Only valid if in-kernel
1067local APIC is not used.
1068
1069 __u8 padding2[2];
1070
1071 /* in (pre_kvm_run), out (post_kvm_run) */
1072 __u64 cr8;
1073
1074The value of the cr8 register. Only valid if in-kernel local APIC is
1075not used. Both input and output.
1076
1077 __u64 apic_base;
1078
1079The value of the APIC BASE msr. Only valid if in-kernel local
1080APIC is not used. Both input and output.
1081
1082 union {
1083 /* KVM_EXIT_UNKNOWN */
1084 struct {
1085 __u64 hardware_exit_reason;
1086 } hw;
1087
1088If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown
1089reasons. Further architecture-specific information is available in
1090hardware_exit_reason.
1091
1092 /* KVM_EXIT_FAIL_ENTRY */
1093 struct {
1094 __u64 hardware_entry_failure_reason;
1095 } fail_entry;
1096
1097If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due
1098to unknown reasons. Further architecture-specific information is
1099available in hardware_entry_failure_reason.
1100
1101 /* KVM_EXIT_EXCEPTION */
1102 struct {
1103 __u32 exception;
1104 __u32 error_code;
1105 } ex;
1106
1107Unused.
1108
1109 /* KVM_EXIT_IO */
1110 struct {
1111#define KVM_EXIT_IO_IN 0
1112#define KVM_EXIT_IO_OUT 1
1113 __u8 direction;
1114 __u8 size; /* bytes */
1115 __u16 port;
1116 __u32 count;
1117 __u64 data_offset; /* relative to kvm_run start */
1118 } io;
1119
1120If exit_reason is KVM_EXIT_IO, then the vcpu has
1121executed a port I/O instruction which could not be satisfied by kvm.
1122data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
1123where kvm expects application code to place the data for the next
1124KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array.
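
The data_offset arithmetic can be sketched as follows. The struct is a
hypothetical copy of the io member shown above, and the helper names are
illustrative only. The sketch assumes the packed array holds 'count' elements
of 'size' bytes each and decodes them as little-endian, as on x86.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the io member shown above, for illustration only. */
struct ex_io {
	uint8_t  direction;   /* 0 = in, 1 = out */
	uint8_t  size;        /* bytes per element */
	uint16_t port;
	uint32_t count;       /* number of elements */
	uint64_t data_offset; /* relative to the start of kvm_run */
};

/* The packed array starts data_offset bytes after the start of kvm_run. */
static uint8_t *ex_io_data(void *run_base, const struct ex_io *io)
{
	return (uint8_t *)run_base + io->data_offset;
}

/* Read element i of the packed array, treating it as little-endian (x86). */
static uint32_t ex_io_element(void *run_base, const struct ex_io *io, uint32_t i)
{
	const uint8_t *p = ex_io_data(run_base, io) + (size_t)i * io->size;
	uint32_t v = 0;

	for (uint8_t b = 0; b < io->size && b < 4; b++)
		v |= (uint32_t)p[b] << (8 * b);
	return v;
}
```

For KVM_EXIT_IO_IN the same pointer is where application code would store the
data before the next KVM_RUN.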
1125
1126 struct {
1127 struct kvm_debug_exit_arch arch;
1128 } debug;
1129
1130Unused.
1131
1132 /* KVM_EXIT_MMIO */
1133 struct {
1134 __u64 phys_addr;
1135 __u8 data[8];
1136 __u32 len;
1137 __u8 is_write;
1138 } mmio;
1139
1140If exit_reason is KVM_EXIT_MMIO, then the vcpu has
1141executed a memory-mapped I/O instruction which could not be satisfied
1142by kvm. The 'data' member contains the written data if 'is_write' is
1143true, and should be filled by application code otherwise.
1144
1145NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
1146operations are complete (and guest state is consistent) only after userspace
1147has re-entered the kernel with KVM_RUN. The kernel side will first finish
1148incomplete operations and then check for pending signals. Userspace
1149can re-enter the guest with an unmasked signal pending to complete
1150pending operations.
1151
1152 /* KVM_EXIT_HYPERCALL */
1153 struct {
1154 __u64 nr;
1155 __u64 args[6];
1156 __u64 ret;
1157 __u32 longmode;
1158 __u32 pad;
1159 } hypercall;
1160
1161Unused. This was once used for 'hypercall to userspace'. To implement
1162such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
1163Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
1164
1165 /* KVM_EXIT_TPR_ACCESS */
1166 struct {
1167 __u64 rip;
1168 __u32 is_write;
1169 __u32 pad;
1170 } tpr_access;
1171
1172To be documented (KVM_TPR_ACCESS_REPORTING).
1173
1174 /* KVM_EXIT_S390_SIEIC */
1175 struct {
1176 __u8 icptcode;
1177 __u64 mask; /* psw upper half */
1178 __u64 addr; /* psw lower half */
1179 __u16 ipa;
1180 __u32 ipb;
1181 } s390_sieic;
1182
1183s390 specific.
1184
1185 /* KVM_EXIT_S390_RESET */
1186#define KVM_S390_RESET_POR 1
1187#define KVM_S390_RESET_CLEAR 2
1188#define KVM_S390_RESET_SUBSYSTEM 4
1189#define KVM_S390_RESET_CPU_INIT 8
1190#define KVM_S390_RESET_IPL 16
1191 __u64 s390_reset_flags;
1192
1193s390 specific.
1194
1195 /* KVM_EXIT_DCR */
1196 struct {
1197 __u32 dcrn;
1198 __u32 data;
1199 __u8 is_write;
1200 } dcr;
1201
1202powerpc specific.
1203
1204 /* KVM_EXIT_OSI */
1205 struct {
1206 __u64 gprs[32];
1207 } osi;
1208
1209MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch
1210hypercalls and exit with this exit struct that contains all the guest gprs.
1211
1212If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall.
1213Userspace can now handle the hypercall and when it's done modify the gprs as
1214necessary. Upon guest entry all guest GPRs will then be replaced by the values
1215in this struct.
1216
1217 /* Fix the size of the union. */
1218 char padding[256];
1219 };
1220};
diff --git a/Documentation/kvm/cpuid.txt b/Documentation/kvm/cpuid.txt
deleted file mode 100644
index 14a12ea92b7f..000000000000
--- a/Documentation/kvm/cpuid.txt
+++ /dev/null
@@ -1,42 +0,0 @@
1KVM CPUID bits
2Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
3=====================================================
4
5A guest running on a kvm host can check some of its features using
6cpuid. This is not always guaranteed to work, since userspace can
7mask out some, or even all KVM-related cpuid features before launching
8a guest.
9
10KVM cpuid functions are:
11
12function: KVM_CPUID_SIGNATURE (0x40000000)
13returns : eax = 0,
14 ebx = 0x4b4d564b,
15 ecx = 0x564b4d56,
16 edx = 0x4d.
17Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
18This function queries the presence of KVM cpuid leafs.
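
To see how the three register values spell the signature, the sketch below
unpacks ebx, ecx and edx byte by byte, least significant byte first, and
compares the result with "KVMKVMKVM". The function names are illustrative
and not part of any kernel interface; obtaining the register values through
the cpuid instruction is left to the caller.

```c
#include <stdint.h>
#include <string.h>

/* Unpack ebx, ecx, edx (in that order) into a NUL-terminated 12-character buffer. */
static void ex_signature(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
	const uint32_t regs[3] = { ebx, ecx, edx };

	for (int r = 0; r < 3; r++)
		for (int b = 0; b < 4; b++)
			out[4 * r + b] = (char)((regs[r] >> (8 * b)) & 0xff);
	out[12] = '\0';
}

/* Return 1 if the three registers spell the KVM signature, 0 otherwise. */
static int ex_is_kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	char sig[13];

	ex_signature(ebx, ecx, edx, sig);
	return strcmp(sig, "KVMKVMKVM") == 0;
}
```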
19
20
21function: KVM_CPUID_FEATURES (0x40000001)
22returns : ebx, ecx, edx = 0
23 eax = an OR'ed group of (1 << flag), where each flag is:
24
25
26flag || value || meaning
27=============================================================================
28KVM_FEATURE_CLOCKSOURCE || 0 || kvmclock available at msrs
29 || || 0x11 and 0x12.
30------------------------------------------------------------------------------
31KVM_FEATURE_NOP_IO_DELAY || 1 || not necessary to perform delays
32 || || on PIO operations.
33------------------------------------------------------------------------------
34KVM_FEATURE_MMU_OP || 2 || deprecated.
35------------------------------------------------------------------------------
36KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs
37 || || 0x4b564d00 and 0x4b564d01
38------------------------------------------------------------------------------
39KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
40 || || per-cpu warps are expected in
41 || || kvmclock.
42------------------------------------------------------------------------------
diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt
deleted file mode 100644
index 142cc5136650..000000000000
--- a/Documentation/kvm/mmu.txt
+++ /dev/null
@@ -1,348 +0,0 @@
1The x86 kvm shadow mmu
2======================
3
4The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
5for presenting a standard x86 mmu to the guest, while translating guest
6physical addresses to host physical addresses.
7
8The mmu code attempts to satisfy the following requirements:
9
10- correctness: the guest should not be able to determine that it is running
11 on an emulated mmu except for timing (we attempt to comply
12 with the specification, not emulate the characteristics of
13 a particular implementation such as tlb size)
14- security: the guest must not be able to touch host memory not assigned
15 to it
16- performance: minimize the performance penalty imposed by the mmu
17- scaling: need to scale to large memory and large vcpu guests
18- hardware: support the full range of x86 virtualization hardware
19- integration: Linux memory management code must be in control of guest memory
20 so that swapping, page migration, page merging, transparent
21 hugepages, and similar features work without change
22- dirty tracking: report writes to guest memory to enable live migration
23 and framebuffer-based displays
24- footprint: keep the amount of pinned kernel memory low (most memory
25 should be shrinkable)
26- reliability: avoid multipage or GFP_ATOMIC allocations
27
28Acronyms
29========
30
31pfn host page frame number
32hpa host physical address
33hva host virtual address
34gfn guest frame number
35gpa guest physical address
36gva guest virtual address
37ngpa nested guest physical address
38ngva nested guest virtual address
39pte page table entry (used also to refer generically to paging structure
40 entries)
41gpte guest pte (referring to gfns)
42spte shadow pte (referring to pfns)
43tdp two dimensional paging (vendor neutral term for NPT and EPT)
44
45Virtual and real hardware supported
46===================================
47
48The mmu supports first-generation mmu hardware, which allows an atomic switch
49of the current paging mode and cr3 during guest entry, as well as
50two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware
51it exposes is the traditional 2/3/4 level x86 mmu, with support for global
52pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support
53exposing NPT capable hardware on NPT capable hosts.
54
55Translation
56===========
57
58The primary job of the mmu is to program the processor's mmu to translate
59addresses for the guest. Different translations are required at different
60times:
61
62- when guest paging is disabled, we translate guest physical addresses to
63 host physical addresses (gpa->hpa)
64- when guest paging is enabled, we translate guest virtual addresses, to
65 guest physical addresses, to host physical addresses (gva->gpa->hpa)
66- when the guest launches a guest of its own, we translate nested guest
67 virtual addresses, to nested guest physical addresses, to guest physical
68 addresses, to host physical addresses (ngva->ngpa->gpa->hpa)
69
70The primary challenge is to encode between 1 and 3 translations into hardware
71that supports only 1 (traditional) or 2 (tdp) translations. When the
72number of required translations matches the hardware, the mmu operates in
73direct mode; otherwise it operates in shadow mode (see below).
74
75Memory
76======
77
78Guest memory (gpa) is part of the user address space of the process that is
79using kvm. Userspace defines the translation between guest addresses and user
80addresses (gpa->hva); note that two gpas may alias to the same hva, but not
81vice versa.
82
83These hvas may be backed using any method available to the host: anonymous
84memory, file backed memory, and device memory. Memory might be paged by the
85host at any time.
86
87Events
88======
89
90The mmu is driven by events, some from the guest, some from the host.
91
92Guest generated events:
93- writes to control registers (especially cr3)
94- invlpg/invlpga instruction execution
95- access to missing or protected translations
96
97Host generated events:
98- changes in the gpa->hpa translation (either through gpa->hva changes or
99 through hva->hpa changes)
100- memory pressure (the shrinker)
101
102Shadow pages
103============
104
105The principal data structure is the shadow page, 'struct kvm_mmu_page'. A
106shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A
107shadow page may contain a mix of leaf and nonleaf sptes.
108
109A nonleaf spte allows the hardware mmu to reach the leaf pages and
110is not related to a translation directly. It points to other shadow pages.
111
112A leaf spte corresponds to either one or two translations encoded into
113one paging structure entry. These are always the lowest level of the
114translation stack, with optional higher level translations left to NPT/EPT.
115Leaf ptes point at guest pages.
116
117The following table shows translations encoded by leaf ptes, with higher-level
118translations in parentheses:
119
120 Non-nested guests:
121 nonpaging: gpa->hpa
122 paging: gva->gpa->hpa
123 paging, tdp: (gva->)gpa->hpa
124 Nested guests:
125 non-tdp: ngva->gpa->hpa (*)
126 tdp: (ngva->)ngpa->gpa->hpa
127
128(*) the guest hypervisor will encode the ngva->gpa translation into its page
129 tables if npt is not present
130
131Shadow pages contain the following information:
132 role.level:
133 The level in the shadow paging hierarchy that this shadow page belongs to.
134 1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
135 role.direct:
136 If set, leaf sptes reachable from this page are for a linear range.
137 Examples include real mode translation, large guest pages backed by small
138 host pages, and gpa->hpa translations when NPT or EPT is active.
139 The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
140 by role.level (2MB for first level, 1GB for second level, 0.5TB for third
141 level, 256TB for fourth level)
142 If clear, this page corresponds to a guest page table denoted by the gfn
143 field.
144 role.quadrant:
145 When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
146 sptes. That means a guest page table contains more ptes than the host,
147 so multiple shadow pages are needed to shadow one guest page.
148 For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
149 first or second 512-gpte block in the guest page table. For second-level
150 page tables, each 32-bit gpte is converted to two 64-bit sptes
151 (since each first-level guest page is shadowed by two first-level
152 shadow pages) so role.quadrant takes values in the range 0..3. Each
153 quadrant maps 1GB virtual address space.
154 role.access:
155 Inherited guest access permissions in the form uwx. Note execute
156 permission is positive, not negative.
157 role.invalid:
158 The page is invalid and should not be used. It is a root page that is
159 currently pinned (by a cpu hardware register pointing to it); once it is
160 unpinned it will be destroyed.
161 role.cr4_pae:
162 Contains the value of cr4.pae for which the page is valid (e.g. whether
163 32-bit or 64-bit gptes are in use).
164 role.nxe:
165 Contains the value of efer.nxe for which the page is valid.
166 role.cr0_wp:
167 Contains the value of cr0.wp for which the page is valid.
168 gfn:
169 Either the guest page table containing the translations shadowed by this
170 page, or the base page frame for linear translations. See role.direct.
171 spt:
172 A pageful of 64-bit sptes containing the translations for this page.
173 Accessed by both kvm and hardware.
174 The page pointed to by spt will have its page->private pointing back
175 at the shadow page structure.
176 sptes in spt point either at guest pages, or at lower-level shadow pages.
177 Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
178 at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte.
179 The spt array forms a DAG structure with the shadow page as a node, and
180 guest pages as leaves.
181 gfns:
182 An array of 512 guest frame numbers, one for each present pte. Used to
183 perform a reverse map from a pte to a gfn. When role.direct is set, any
184 element of this array can be calculated from the gfn field when used; in
185 this case, the array of gfns is not allocated. See role.direct and gfn.
186 slot_bitmap:
187 A bitmap containing one bit per memory slot. If the page contains a pte
188 mapping a page from memory slot n, then bit n of slot_bitmap will be set
189 (if a page is aliased among several slots, then it is not guaranteed that
190 all slots will be marked).
191 Used during dirty logging to avoid scanning a shadow page if none of its
192 pages need tracking.
193 root_count:
194 A counter keeping track of how many hardware registers (guest cr3 or
195 pdptrs) are now pointing at the page. While this counter is nonzero, the
196 page cannot be destroyed. See role.invalid.
197 multimapped:
198 Whether there exist multiple sptes pointing at this page.
199 parent_pte/parent_ptes:
200 If multimapped is zero, parent_pte points at the single spte that points at
201 this page's spt. Otherwise, parent_ptes points at a data structure
202 with a list of parent_ptes.
203 unsync:
204 If true, then the translations in this page may not match the guest's
205 translation. This is equivalent to the state of the tlb when a pte is
206 changed but before the tlb entry is flushed. Accordingly, unsync ptes
207 are synchronized when the guest executes invlpg or flushes its tlb by
208 other means. Valid for leaf pages.
209 unsync_children:
210 How many sptes in the page point at pages that are unsync (or have
211 unsynchronized children).
212 unsync_child_bitmap:
213 A bitmap indicating which sptes in spt point (directly or indirectly) at
214 pages that may be unsynchronized. Used to quickly locate all unsynchronized
215 pages reachable from a given page.
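
The linear range sizes quoted under role.direct follow from each shadow page
holding 512 sptes over 4k base pages. A small sketch of that arithmetic, with
hypothetical names that do not come from the mmu code:

```c
#include <stdint.h>

#define EX_PAGE_SHIFT 12 /* 4k base pages */
#define EX_LEVEL_BITS 9  /* 512 sptes per shadow page */

/* Bytes covered by one shadow page at the given level (1..4). */
static uint64_t ex_direct_range_size(unsigned int level)
{
	return (uint64_t)1 << (EX_PAGE_SHIFT + EX_LEVEL_BITS * level);
}

/* First guest physical address of the linear range for a direct page. */
static uint64_t ex_direct_range_start(uint64_t gfn)
{
	return gfn << EX_PAGE_SHIFT;
}
```

Levels 1 through 4 give 2MB, 1GB, 0.5TB and 256TB respectively, matching the
figures listed for role.direct.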
216
217Reverse map
218===========
219
220The mmu maintains a reverse mapping whereby all ptes mapping a page can be
221reached given its gfn. This is used, for example, when swapping out a page.
222
223Synchronized and unsynchronized pages
224=====================================
225
226The guest uses two events to synchronize its tlb and page tables: tlb flushes
227and page invalidations (invlpg).
228
229A tlb flush means that we need to synchronize all sptes reachable from the
230guest's cr3. This is expensive, so we keep all guest page tables write
231protected, and synchronize sptes to gptes when a gpte is written.
232
233A special case is when a guest page table is reachable from the current
234guest cr3. In this case, the guest is obliged to issue an invlpg instruction
235before using the translation. We take advantage of that by removing write
236protection from the guest page, and allowing the guest to modify it freely.
237We synchronize modified gptes when the guest invokes invlpg. This reduces
238the amount of emulation we have to do when the guest modifies multiple gptes,
239or when a guest page is no longer used as a page table and is used for
240random guest data.
241
242As a side effect we have to resynchronize all reachable unsynchronized shadow
243pages on a tlb flush.
244
245
246Reaction to events
247==================
248
249- guest page fault (or npt page fault, or ept violation)
250
251This is the most complicated event. The cause of a page fault can be:
252
253 - a true guest fault (the guest translation won't allow the access) (*)
254 - access to a missing translation
255 - access to a protected translation
256 - when logging dirty pages, memory is write protected
257 - synchronized shadow pages are write protected (*)
258 - access to untranslatable memory (mmio)
259
260 (*) not applicable in direct mode
261
262Handling a page fault is performed as follows:
263
264 - if needed, walk the guest page tables to determine the guest translation
265 (gva->gpa or ngpa->gpa)
266 - if permissions are insufficient, reflect the fault back to the guest
267 - determine the host page
268 - if this is an mmio request, there is no host page; call the emulator
269 to emulate the instruction instead
270 - walk the shadow page table to find the spte for the translation,
271 instantiating missing intermediate page tables as necessary
272 - try to unsynchronize the page
273 - if successful, we can let the guest continue and modify the gpte
274 - emulate the instruction
275 - if failed, unshadow the page and let the guest continue
276 - update any translations that were modified by the instruction
277
278invlpg handling:
279
280 - walk the shadow page hierarchy and drop affected translations
281 - try to reinstantiate the indicated translation in the hope that the
282 guest will use it in the near future
283
284Guest control register updates:
285
286- mov to cr3
287 - look up new shadow roots
288 - synchronize newly reachable shadow pages
289
290- mov to cr0/cr4/efer
291 - set up mmu context for new paging mode
292 - look up new shadow roots
293 - synchronize newly reachable shadow pages
294
295Host translation updates:
296
297 - mmu notifier called with updated hva
298 - look up affected sptes through reverse map
299 - drop (or update) translations
300
301Emulating cr0.wp
302================
303
304If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
305works for the guest kernel, not just guest userspace. When the guest
306cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
307we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
308semantics require allowing any guest kernel access plus user read access).
309
310We handle this by mapping the permissions to two possible sptes, depending
311on fault type:
312
313- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
314 disallows user access)
315- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
316 write access)
317
318(user write faults generate a #PF)
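
The choice between the two sptes can be sketched as a small decision
function. It covers only the case discussed above (gpte.u=1, gpte.w=0 with
guest cr0.wp=0); the type and function names are hypothetical and are not
taken from the mmu code.

```c
#include <stdbool.h>

struct ex_spte_perm {
	bool u; /* user accessible */
	bool w; /* writable */
};

/*
 * Pick spte permissions for a gpte with u=1, w=0 while guest cr0.wp=0.
 * Returns false when the access is a user write, which must be reflected
 * to the guest as a #PF instead of being mapped.
 */
static bool ex_wp0_spte(bool user_fault, bool write_fault, struct ex_spte_perm *out)
{
	if (write_fault && user_fault)
		return false;
	if (write_fault) {
		/* kernel write fault: full kernel access, no user access */
		out->u = false;
		out->w = true;
	} else {
		/* read fault: read access for everyone, no writes */
		out->u = true;
		out->w = false;
	}
	return true;
}
```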
319
320Large pages
321===========
322
323The mmu supports all combinations of large and small guest and host pages.
324Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as
325two separate 2M pages, on both guest and host, since the mmu always uses PAE
326paging.
327
328To instantiate a large spte, four constraints must be satisfied:
329
330- the spte must point to a large host page
331- the guest pte must be a large pte of at least equivalent size (if tdp is
332 enabled, there is no guest pte and this condition is satisfied)
333- if the spte will be writeable, the large page frame may not overlap any
334 write-protected pages
335- the guest page must be wholly contained by a single memory slot
336
337To check the last two conditions, the mmu maintains a ->write_count set of
338arrays for each memory slot and large page size. Every write protected page
339causes its write_count to be incremented, thus preventing instantiation of
340a large spte. The frames at the end of an unaligned memory slot have
341artificially inflated ->write_counts so they can never be instantiated.
342
343Further reading
344===============
345
346- NPT presentation from KVM Forum 2008
347 http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf
348
diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
deleted file mode 100644
index 8ddcfe84c09a..000000000000
--- a/Documentation/kvm/msr.txt
+++ /dev/null
@@ -1,153 +0,0 @@
1KVM-specific MSRs.
2Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
3=====================================================
4
5KVM makes use of some custom MSRs to service some requests.
6At present, this facility is only used by kvmclock.
7
8Custom MSRs have a range reserved for them, that goes from
90x4b564d00 to 0x4b564dff. There are MSRs outside this area,
10but they are deprecated and their use is discouraged.
11
12Custom MSR list
13--------
14
15The current supported Custom MSR list is:
16
17MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00
18
19 data: 4-byte aligned physical address of a memory area which must be
20 in guest RAM. This memory is expected to hold a copy of the following
21 structure:
22
23 struct pvclock_wall_clock {
24 u32 version;
25 u32 sec;
26 u32 nsec;
27 } __attribute__((__packed__));
28
29 whose data will be filled in by the hypervisor. The hypervisor is only
30 guaranteed to update this data at the moment of MSR write.
31 Users that want to reliably query this information more than once have
32 to write more than once to this MSR. Fields have the following meanings:
33
34 version: guest has to check version before and after grabbing
35 time information and check that they are both equal and even.
36 An odd version indicates an in-progress update.
37
38 sec: number of seconds for wallclock.
39
40 nsec: number of nanoseconds for wallclock.
41
42 Note that although MSRs are per-CPU entities, the effect of this
43 particular MSR is global.
44
45 Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
46 leaf prior to usage.
47
48MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
49
50 data: 4-byte aligned physical address of a memory area which must be in
51 guest RAM, plus an enable bit in bit 0. This memory is expected to hold
52 a copy of the following structure:
53
54 struct pvclock_vcpu_time_info {
55 u32 version;
56 u32 pad0;
57 u64 tsc_timestamp;
58 u64 system_time;
59 u32 tsc_to_system_mul;
60 s8 tsc_shift;
61 u8 flags;
62 u8 pad[2];
63 } __attribute__((__packed__)); /* 32 bytes */
64
65 whose data will be filled in by the hypervisor periodically. Only one
66 write, or registration, is needed for each VCPU. The interval between
67 updates of this structure is arbitrary and implementation-dependent.
68 The hypervisor may update this structure at any time it sees fit until
69 anything with bit0 == 0 is written to it.
70
71 Fields have the following meanings:
72
73 version: guest has to check version before and after grabbing
74 time information and check that they are both equal and even.
75 An odd version indicates an in-progress update.
76
77 tsc_timestamp: the tsc value at the current VCPU at the time
78 of the update of this structure. Guests can subtract this value
79 from current tsc to derive a notion of elapsed time since the
80 structure update.
81
82 system_time: a host notion of monotonic time, including sleep
83 time at the time this structure was last updated. Unit is
84 nanoseconds.
85
86 tsc_to_system_mul: a function of the tsc frequency. One has
87 to multiply any tsc-related quantity by this value to get
88 a value in nanoseconds, besides dividing by 2^tsc_shift
89
90 tsc_shift: cycle to nanosecond divider, as a power of two, to
91 allow for shift rights. One has to shift right any tsc-related
92 quantity by this value to get a value in nanoseconds, besides
93 multiplying by tsc_to_system_mul.
94
95 With this information, guests can derive per-CPU time by
96 doing:
97
98 time = (current_tsc - tsc_timestamp)
99 time = (time * tsc_to_system_mul) >> tsc_shift
100 time = time + system_time
101
102 flags: bits in this field indicate extended capabilities
103 coordinated between the guest and the hypervisor. Availability
104 of specific flags has to be checked in 0x40000001 cpuid leaf.
105 Current flags are:
106
107 flag bit | cpuid bit | meaning
108 -------------------------------------------------------------
109 | | time measures taken across
110 0 | 24 | multiple cpus are guaranteed to
111 | | be monotonic
112 -------------------------------------------------------------
113
114 Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
115 leaf prior to usage.
116
117
118MSR_KVM_WALL_CLOCK: 0x11
119
120 data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
121
122 This MSR falls outside the reserved KVM range and may be removed in the
123 future. Its usage is deprecated.
124
125 Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
126 leaf prior to usage.
127
128MSR_KVM_SYSTEM_TIME: 0x12
129
130 data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
131
132 This MSR falls outside the reserved KVM range and may be removed in the
133 future. Its usage is deprecated.
134
135 Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
136 leaf prior to usage.
137
138 The suggested algorithm for detecting kvmclock presence is then:
139
140 if (!kvm_para_available()) /* refer to cpuid.txt */
141 return NON_PRESENT;
142
143 flags = cpuid_eax(0x40000001);
144 if (flags & (1 << 3)) { /* bit 3: KVM_FEATURE_CLOCKSOURCE2 */
145 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
146 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
147 return PRESENT;
148 } else if (flags & (1 << 0)) { /* bit 0: KVM_FEATURE_CLOCKSOURCE */
149 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
150 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
151 return PRESENT;
152 } else
153 return NON_PRESENT;
diff --git a/Documentation/kvm/review-checklist.txt b/Documentation/kvm/review-checklist.txt
deleted file mode 100644
index 730475ae1b8d..000000000000
--- a/Documentation/kvm/review-checklist.txt
+++ /dev/null
@@ -1,38 +0,0 @@
1Review checklist for kvm patches
2================================
3
41. The patch must follow Documentation/CodingStyle and
5 Documentation/SubmittingPatches.
6
72. Patches should be against kvm.git master branch.
8
93. If the patch introduces or modifies a new userspace API:
10 - the API must be documented in Documentation/kvm/api.txt
11 - the API must be discoverable using KVM_CHECK_EXTENSION
12
134. New state must include support for save/restore.
14
155. New features must default to off (userspace should explicitly request them).
16 Performance improvements can and should default to on.
17
186. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID.
19
207. Emulator changes should be accompanied by unit tests for qemu-kvm.git
21 kvm/test directory.
22
238. Changes should be vendor neutral when possible. Changes to common code
24 are better than duplicating changes to vendor code.
25
269. Similarly, prefer changes to arch independent code over changes to arch
27 dependent code.
28
2910. User/kernel interfaces and guest/host interfaces must be 64-bit clean
30 (all variables and sizes naturally aligned on 64-bit; use specific types
31 only - u64 rather than ulong).
32
3311. New guest visible features must either be documented in a hardware manual
34 or be accompanied by documentation.
35
3612. Features must be robust against reset and kexec - for example, shared
37 host/guest memory must be unshared to prevent the host from writing to
38 guest memory that the guest has not reserved for this purpose.