Diffstat (limited to 'Documentation/kvm')
-rw-r--r-- | Documentation/kvm/api.txt | 1220 | ||||
-rw-r--r-- | Documentation/kvm/cpuid.txt | 42 | ||||
-rw-r--r-- | Documentation/kvm/mmu.txt | 348 | ||||
-rw-r--r-- | Documentation/kvm/msr.txt | 153 | ||||
-rw-r--r-- | Documentation/kvm/review-checklist.txt | 38 |
5 files changed, 0 insertions, 1801 deletions
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt deleted file mode 100644 index 5f5b64982b1a..000000000000 --- a/Documentation/kvm/api.txt +++ /dev/null | |||
@@ -1,1220 +0,0 @@ | |||
1 | The Definitive KVM (Kernel-based Virtual Machine) API Documentation | ||
2 | =================================================================== | ||
3 | |||
4 | 1. General description | ||
5 | |||
6 | The kvm API is a set of ioctls that are issued to control various aspects | ||
7 | of a virtual machine. The ioctls belong to three classes: | ||
8 | |||
9 | - System ioctls: These query and set global attributes which affect the | ||
10 | whole kvm subsystem. In addition a system ioctl is used to create | ||
11 | virtual machines. | ||
12 | |||
13 | - VM ioctls: These query and set attributes that affect an entire virtual | ||
14 | machine, for example memory layout. In addition a VM ioctl is used to | ||
15 | create virtual cpus (vcpus). | ||
16 | |||
17 | Only run VM ioctls from the same process (address space) that was used | ||
18 | to create the VM. | ||
19 | |||
20 | - vcpu ioctls: These query and set attributes that control the operation | ||
21 | of a single virtual cpu. | ||
22 | |||
23 | Only run vcpu ioctls from the same thread that was used to create the | ||
24 | vcpu. | ||
25 | |||
26 | 2. File descriptors | ||
27 | |||
28 | The kvm API is centered around file descriptors. An initial | ||
29 | open("/dev/kvm") obtains a handle to the kvm subsystem; this handle | ||
30 | can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this | ||
31 | handle will create a VM file descriptor which can be used to issue VM | ||
32 | ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu | ||
33 | and return a file descriptor pointing to it. Finally, ioctls on a vcpu | ||
34 | fd can be used to control the vcpu, including the important task of | ||
35 | actually running guest code. | ||
36 | |||
37 | In general file descriptors can be migrated among processes by means | ||
38 | of fork() and the SCM_RIGHTS facility of unix domain sockets. These | ||
39 | kinds of tricks are explicitly not supported by kvm. While they will | ||
40 | not cause harm to the host, their actual behavior is not guaranteed by | ||
41 | the API. The only supported use is one virtual machine per process, | ||
42 | and one vcpu per thread. | ||
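
The descriptor chain described above can be sketched in C as follows. This is
a minimal illustration, not a reference implementation: struct kvm_fds and the
two helper functions are hypothetical names introduced here, and the sketch
assumes a Linux host whose headers provide <linux/kvm.h>.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical helper type: the three descriptors described in section 2. */
struct kvm_fds {
    int kvm;   /* from open("/dev/kvm"): accepts system ioctls */
    int vm;    /* from KVM_CREATE_VM: accepts VM ioctls */
    int vcpu;  /* from KVM_CREATE_VCPU: accepts vcpu ioctls */
};

/* Close whatever was opened and reset every descriptor to -1. */
void kvm_close_chain(struct kvm_fds *fds)
{
    if (fds->vcpu >= 0)
        close(fds->vcpu);
    if (fds->vm >= 0)
        close(fds->vm);
    if (fds->kvm >= 0)
        close(fds->kvm);
    fds->kvm = fds->vm = fds->vcpu = -1;
}

/* Returns 0 on success, -1 if any step fails (for example, no /dev/kvm). */
int kvm_open_chain(struct kvm_fds *fds)
{
    fds->vm = fds->vcpu = -1;

    fds->kvm = open("/dev/kvm", O_RDWR);
    if (fds->kvm < 0)
        return -1;

    fds->vm = ioctl(fds->kvm, KVM_CREATE_VM, 0UL);
    if (fds->vm < 0)
        goto fail;

    fds->vcpu = ioctl(fds->vm, KVM_CREATE_VCPU, 0UL);  /* vcpu id 0 */
    if (fds->vcpu < 0)
        goto fail;

    return 0;

fail:
    kvm_close_chain(fds);
    return -1;
}
```

Following the one-VM-per-process, one-vcpu-per-thread rule, a real application
would keep the vm descriptor in the creating process and use each vcpu
descriptor only from the thread that created it.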
43 | |||
44 | 3. Extensions | ||
45 | |||
46 | As of Linux 2.6.22, the KVM ABI has been stabilized: no backward | ||
47 | incompatible changes are allowed. However, there is an extension | ||
48 | facility that allows backward-compatible extensions to the API to be | ||
49 | queried and used. | ||
50 | |||
51 | The extension mechanism is not based on the Linux version number. | ||
52 | Instead, kvm defines extension identifiers and a facility to query | ||
53 | whether a particular extension identifier is available. If it is, a | ||
54 | set of ioctls is available for application use. | ||
55 | |||
56 | 4. API description | ||
57 | |||
58 | This section describes ioctls that can be used to control kvm guests. | ||
59 | For each ioctl, the following information is provided along with a | ||
60 | description: | ||
61 | |||
62 | Capability: which KVM extension provides this ioctl. Can be 'basic', | ||
63 | which means that it will be provided by any kernel that supports | ||
64 | API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which | ||
65 | means availability needs to be checked with KVM_CHECK_EXTENSION | ||
66 | (see section 4.4). | ||
67 | |||
68 | Architectures: which instruction set architectures provide this ioctl. | ||
69 | x86 includes both i386 and x86_64. | ||
70 | |||
71 | Type: system, vm, or vcpu. | ||
72 | |||
73 | Parameters: what parameters are accepted by the ioctl. | ||
74 | |||
75 | Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL) | ||
76 | are not detailed, but errors with specific meanings are. | ||
77 | |||
78 | 4.1 KVM_GET_API_VERSION | ||
79 | |||
80 | Capability: basic | ||
81 | Architectures: all | ||
82 | Type: system ioctl | ||
83 | Parameters: none | ||
84 | Returns: the constant KVM_API_VERSION (=12) | ||
85 | |||
86 | This identifies the API version as the stable kvm API. It is not | ||
87 | expected that this number will change. However, Linux 2.6.20 and | ||
88 | 2.6.21 report earlier versions; these are not documented and not | ||
89 | supported. Applications should refuse to run if KVM_GET_API_VERSION | ||
90 | returns a value other than 12. If this check passes, all ioctls | ||
91 | described as 'basic' will be available. | ||
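
A possible startup check, written as a sketch: the function names and the
STABLE_KVM_API_VERSION macro are hypothetical, and the comparison is split out
so that it can be exercised without access to /dev/kvm.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* The only API version this document describes (see 4.1). */
#define STABLE_KVM_API_VERSION 12

/* Pure comparison: nonzero only for the stable API version. */
int kvm_version_is_supported(int version)
{
    return version == STABLE_KVM_API_VERSION;
}

/*
 * Returns 1 if kvm_fd reports the stable API version, 0 otherwise.
 * An ioctl failure yields -1, which is also rejected.
 */
int kvm_check_api(int kvm_fd)
{
    int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0UL);

    return kvm_version_is_supported(version);
}
```

An application would typically call kvm_check_api() right after opening
/dev/kvm and exit with an error message if it returns 0.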
92 | |||
93 | 4.2 KVM_CREATE_VM | ||
94 | |||
95 | Capability: basic | ||
96 | Architectures: all | ||
97 | Type: system ioctl | ||
98 | Parameters: none | ||
99 | Returns: a VM fd that can be used to control the new virtual machine. | ||
100 | |||
101 | The new VM has no virtual cpus and no memory. An mmap() of a VM fd | ||
102 | will access the virtual machine's physical address space; offset zero | ||
103 | corresponds to guest physical address zero. Use of mmap() on a VM fd | ||
104 | is discouraged if userspace memory allocation (KVM_CAP_USER_MEMORY) is | ||
105 | available. | ||
106 | |||
107 | 4.3 KVM_GET_MSR_INDEX_LIST | ||
108 | |||
109 | Capability: basic | ||
110 | Architectures: x86 | ||
111 | Type: system | ||
112 | Parameters: struct kvm_msr_list (in/out) | ||
113 | Returns: 0 on success; -1 on error | ||
114 | Errors: | ||
115 | E2BIG: the msr index list is too big to fit in the array specified by | ||
116 | the user. | ||
117 | |||
118 | struct kvm_msr_list { | ||
119 | __u32 nmsrs; /* number of msrs in entries */ | ||
120 | __u32 indices[0]; | ||
121 | }; | ||
122 | |||
123 | This ioctl returns the guest msrs that are supported. The list varies | ||
124 | by kvm version and host processor, but does not change otherwise. The | ||
125 | user fills in the size of the indices array in nmsrs, and in return | ||
126 | kvm adjusts nmsrs to reflect the actual number of msrs and fills in | ||
127 | the indices array with their numbers. | ||
128 | |||
129 | Note: if kvm indicates that it supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are | ||
130 | not returned in the MSR list, as different vcpus can have a different number | ||
131 | of banks, as set via the KVM_X86_SETUP_MCE ioctl. | ||
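
One common way to size the indices array is a two-call pattern: probe with
nmsrs set to 0, expect E2BIG, then allocate using the count that kvm wrote
back. The sketch below assumes an x86 host with <linux/kvm.h>; the function
name is hypothetical, and it relies on the behavior described above, namely
that nmsrs is adjusted even when the call fails with E2BIG.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Returns a heap-allocated list of supported MSR indices, or NULL on
 * failure. The caller releases it with free(). x86 only.
 */
struct kvm_msr_list *kvm_get_msr_index_list(int kvm_fd)
{
    struct kvm_msr_list probe = { .nmsrs = 0 };
    struct kvm_msr_list *list;

    /* First call: expected to fail with E2BIG and report the real count. */
    if (ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, &probe) < 0 && errno != E2BIG)
        return NULL;

    list = calloc(1, sizeof(*list) + probe.nmsrs * sizeof(list->indices[0]));
    if (list == NULL)
        return NULL;

    /* Second call: the array is now large enough to hold every index. */
    list->nmsrs = probe.nmsrs;
    if (ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, list) < 0) {
        free(list);
        return NULL;
    }
    return list;
}
```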
132 | |||
133 | 4.4 KVM_CHECK_EXTENSION | ||
134 | |||
135 | Capability: basic | ||
136 | Architectures: all | ||
137 | Type: system ioctl | ||
138 | Parameters: extension identifier (KVM_CAP_*) | ||
139 | Returns: 0 if unsupported; 1 (or some other positive integer) if supported | ||
140 | |||
141 | The API allows the application to query about extensions to the core | ||
142 | kvm API. Userspace passes an extension identifier (an integer) and | ||
143 | receives an integer that describes the extension availability. | ||
144 | Generally 0 means no and 1 means yes, but some extensions may report | ||
145 | additional information in the integer return value. | ||
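
Because the return value may carry more than a yes/no answer, a wrapper can
preserve the integer rather than collapse it to a boolean. A minimal sketch,
assuming <linux/kvm.h> is available; the function name is hypothetical.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Returns 0 if the extension is unsupported (or the query fails), and the
 * positive value reported by kvm otherwise. For most extensions that value
 * is simply 1.
 */
int kvm_extension_value(int kvm_fd, int cap)
{
    int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, (unsigned long)cap);

    return r > 0 ? r : 0;
}
```

For example, kvm_extension_value(kvm_fd, KVM_CAP_USER_MEMORY) can be tested
before choosing between mmap() on the VM fd and userspace memory allocation.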
146 | |||
147 | 4.5 KVM_GET_VCPU_MMAP_SIZE | ||
148 | |||
149 | Capability: basic | ||
150 | Architectures: all | ||
151 | Type: system ioctl | ||
152 | Parameters: none | ||
153 | Returns: size of vcpu mmap area, in bytes | ||
154 | |||
155 | The KVM_RUN ioctl (see section 4.9) communicates with userspace via a shared | ||
156 | memory region. This ioctl returns the size of that region. See the | ||
157 | KVM_RUN documentation for details. | ||
158 | |||
159 | 4.6 KVM_SET_MEMORY_REGION | ||
160 | |||
161 | Capability: basic | ||
162 | Architectures: all | ||
163 | Type: vm ioctl | ||
164 | Parameters: struct kvm_memory_region (in) | ||
165 | Returns: 0 on success, -1 on error | ||
166 | |||
167 | This ioctl is obsolete and has been removed. | ||
168 | |||
169 | 4.6 KVM_CREATE_VCPU | ||
170 | |||
171 | Capability: basic | ||
172 | Architectures: all | ||
173 | Type: vm ioctl | ||
174 | Parameters: vcpu id (apic id on x86) | ||
175 | Returns: vcpu fd on success, -1 on error | ||
176 | |||
177 | This API adds a vcpu to a virtual machine. The vcpu id is a small integer | ||
178 | in the range [0, max_vcpus). | ||
179 | |||
180 | 4.7 KVM_GET_DIRTY_LOG (vm ioctl) | ||
181 | |||
182 | Capability: basic | ||
183 | Architectures: x86 | ||
184 | Type: vm ioctl | ||
185 | Parameters: struct kvm_dirty_log (in/out) | ||
186 | Returns: 0 on success, -1 on error | ||
187 | |||
188 | /* for KVM_GET_DIRTY_LOG */ | ||
189 | struct kvm_dirty_log { | ||
190 | __u32 slot; | ||
191 | __u32 padding; | ||
192 | union { | ||
193 | void __user *dirty_bitmap; /* one bit per page */ | ||
194 | __u64 padding; | ||
195 | }; | ||
196 | }; | ||
197 | |||
198 | Given a memory slot, return a bitmap containing any pages dirtied | ||
199 | since the last call to this ioctl. Bit 0 is the first page in the | ||
200 | memory slot. Ensure the entire structure is cleared to avoid padding | ||
201 | issues. | ||
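
The sketch below shows one way to fetch and walk the bitmap. The helper names
are hypothetical, and the code assumes that the bitmap is an array of unsigned
long words in host bit order, with bit 0 of word 0 standing for the first page
of the slot. The memset() follows the advice above about clearing the whole
structure.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Fetch the dirty bitmap of one slot into a caller-provided buffer. */
int kvm_fetch_dirty_log(int vm_fd, unsigned int slot, void *bitmap)
{
    struct kvm_dirty_log log;

    memset(&log, 0, sizeof(log));  /* clears both padding fields */
    log.slot = slot;
    log.dirty_bitmap = bitmap;
    return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}

/* Returns nonzero if page n of the slot is marked dirty. */
int page_is_dirty(const unsigned long *bitmap, size_t n)
{
    return (int)((bitmap[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1UL);
}

/* Counts the dirty pages among the first npages pages of the slot. */
size_t count_dirty_pages(const unsigned long *bitmap, size_t npages)
{
    size_t n, count = 0;

    for (n = 0; n < npages; n++)
        if (page_is_dirty(bitmap, n))
            count++;
    return count;
}
```

The buffer passed to kvm_fetch_dirty_log() needs at least one bit per page in
the slot, rounded up to a whole number of words.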
202 | |||
203 | 4.8 KVM_SET_MEMORY_ALIAS | ||
204 | |||
205 | Capability: basic | ||
206 | Architectures: x86 | ||
207 | Type: vm ioctl | ||
208 | Parameters: struct kvm_memory_alias (in) | ||
209 | Returns: 0 (success), -1 (error) | ||
210 | |||
211 | This ioctl is obsolete and has been removed. | ||
212 | |||
213 | 4.9 KVM_RUN | ||
214 | |||
215 | Capability: basic | ||
216 | Architectures: all | ||
217 | Type: vcpu ioctl | ||
218 | Parameters: none | ||
219 | Returns: 0 on success, -1 on error | ||
220 | Errors: | ||
221 | EINTR: an unmasked signal is pending | ||
222 | |||
223 | This ioctl is used to run a guest virtual cpu. While there are no | ||
224 | explicit parameters, there is an implicit parameter block that can be | ||
225 | obtained by mmap()ing the vcpu fd at offset 0, with the size given by | ||
226 | KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct | ||
227 | kvm_run' (see below). | ||
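
Obtaining the implicit parameter block might look like the sketch below. The
function name is hypothetical; note that the size query goes to the /dev/kvm
descriptor (a system ioctl) while the mmap() targets the vcpu descriptor.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Maps the vcpu's shared kvm_run block. Returns NULL on failure; on success
 * stores the mapping size in *size_out so the caller can munmap() it later.
 */
struct kvm_run *kvm_map_run(int kvm_fd, int vcpu_fd, size_t *size_out)
{
    int size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0UL);
    void *p;

    if (size <= 0)
        return NULL;

    p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED,
             vcpu_fd, 0);
    if (p == MAP_FAILED)
        return NULL;

    *size_out = (size_t)size;
    return p;
}
```

After each KVM_RUN returns, userspace would inspect the mapped structure to
learn why the guest stopped.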
228 | |||
229 | 4.10 KVM_GET_REGS | ||
230 | |||
231 | Capability: basic | ||
232 | Architectures: all | ||
233 | Type: vcpu ioctl | ||
234 | Parameters: struct kvm_regs (out) | ||
235 | Returns: 0 on success, -1 on error | ||
236 | |||
237 | Reads the general purpose registers from the vcpu. | ||
238 | |||
239 | /* x86 */ | ||
240 | struct kvm_regs { | ||
241 | /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */ | ||
242 | __u64 rax, rbx, rcx, rdx; | ||
243 | __u64 rsi, rdi, rsp, rbp; | ||
244 | __u64 r8, r9, r10, r11; | ||
245 | __u64 r12, r13, r14, r15; | ||
246 | __u64 rip, rflags; | ||
247 | }; | ||
248 | |||
249 | 4.11 KVM_SET_REGS | ||
250 | |||
251 | Capability: basic | ||
252 | Architectures: all | ||
253 | Type: vcpu ioctl | ||
254 | Parameters: struct kvm_regs (in) | ||
255 | Returns: 0 on success, -1 on error | ||
256 | |||
257 | Writes the general purpose registers into the vcpu. | ||
258 | |||
259 | See KVM_GET_REGS for the data structure. | ||
260 | |||
261 | 4.12 KVM_GET_SREGS | ||
262 | |||
263 | Capability: basic | ||
264 | Architectures: x86 | ||
265 | Type: vcpu ioctl | ||
266 | Parameters: struct kvm_sregs (out) | ||
267 | Returns: 0 on success, -1 on error | ||
268 | |||
269 | Reads special registers from the vcpu. | ||
270 | |||
271 | /* x86 */ | ||
272 | struct kvm_sregs { | ||
273 | struct kvm_segment cs, ds, es, fs, gs, ss; | ||
274 | struct kvm_segment tr, ldt; | ||
275 | struct kvm_dtable gdt, idt; | ||
276 | __u64 cr0, cr2, cr3, cr4, cr8; | ||
277 | __u64 efer; | ||
278 | __u64 apic_base; | ||
279 | __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64]; | ||
280 | }; | ||
281 | |||
282 | interrupt_bitmap is a bitmap of pending external interrupts. At most | ||
283 | one bit may be set. This interrupt has been acknowledged by the APIC | ||
284 | but not yet injected into the cpu core. | ||
285 | |||
286 | 4.13 KVM_SET_SREGS | ||
287 | |||
288 | Capability: basic | ||
289 | Architectures: x86 | ||
290 | Type: vcpu ioctl | ||
291 | Parameters: struct kvm_sregs (in) | ||
292 | Returns: 0 on success, -1 on error | ||
293 | |||
294 | Writes special registers into the vcpu. See KVM_GET_SREGS for the | ||
295 | data structures. | ||
296 | |||
297 | 4.14 KVM_TRANSLATE | ||
298 | |||
299 | Capability: basic | ||
300 | Architectures: x86 | ||
301 | Type: vcpu ioctl | ||
302 | Parameters: struct kvm_translation (in/out) | ||
303 | Returns: 0 on success, -1 on error | ||
304 | |||
305 | Translates a virtual address according to the vcpu's current address | ||
306 | translation mode. | ||
307 | |||
308 | struct kvm_translation { | ||
309 | /* in */ | ||
310 | __u64 linear_address; | ||
311 | |||
312 | /* out */ | ||
313 | __u64 physical_address; | ||
314 | __u8 valid; | ||
315 | __u8 writeable; | ||
316 | __u8 usermode; | ||
317 | __u8 pad[5]; | ||
318 | }; | ||
319 | |||
320 | 4.15 KVM_INTERRUPT | ||
321 | |||
322 | Capability: basic | ||
323 | Architectures: x86 | ||
324 | Type: vcpu ioctl | ||
325 | Parameters: struct kvm_interrupt (in) | ||
326 | Returns: 0 on success, -1 on error | ||
327 | |||
328 | Queues a hardware interrupt vector to be injected. This is only | ||
329 | useful if in-kernel local APIC is not used. | ||
330 | |||
331 | /* for KVM_INTERRUPT */ | ||
332 | struct kvm_interrupt { | ||
333 | /* in */ | ||
334 | __u32 irq; | ||
335 | }; | ||
336 | |||
337 | Note 'irq' is an interrupt vector, not an interrupt pin or line. | ||
338 | |||
339 | 4.16 KVM_DEBUG_GUEST | ||
340 | |||
341 | Capability: basic | ||
342 | Architectures: none | ||
343 | Type: vcpu ioctl | ||
344 | Parameters: none | ||
345 | Returns: -1 on error | ||
346 | |||
347 | Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead. | ||
348 | |||
349 | 4.17 KVM_GET_MSRS | ||
350 | |||
351 | Capability: basic | ||
352 | Architectures: x86 | ||
353 | Type: vcpu ioctl | ||
354 | Parameters: struct kvm_msrs (in/out) | ||
355 | Returns: 0 on success, -1 on error | ||
356 | |||
357 | Reads model-specific registers from the vcpu. Supported msr indices can | ||
358 | be obtained using KVM_GET_MSR_INDEX_LIST. | ||
359 | |||
360 | struct kvm_msrs { | ||
361 | __u32 nmsrs; /* number of msrs in entries */ | ||
362 | __u32 pad; | ||
363 | |||
364 | struct kvm_msr_entry entries[0]; | ||
365 | }; | ||
366 | |||
367 | struct kvm_msr_entry { | ||
368 | __u32 index; | ||
369 | __u32 reserved; | ||
370 | __u64 data; | ||
371 | }; | ||
372 | |||
373 | Application code should set the 'nmsrs' member (which indicates the | ||
374 | size of the entries array) and the 'index' member of each array entry. | ||
375 | kvm will fill in the 'data' member. | ||
376 | |||
377 | 4.18 KVM_SET_MSRS | ||
378 | |||
379 | Capability: basic | ||
380 | Architectures: x86 | ||
381 | Type: vcpu ioctl | ||
382 | Parameters: struct kvm_msrs (in) | ||
383 | Returns: 0 on success, -1 on error | ||
384 | |||
385 | Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the | ||
386 | data structures. | ||
387 | |||
388 | Application code should set the 'nmsrs' member (which indicates the | ||
389 | size of the entries array), and the 'index' and 'data' members of each | ||
390 | array entry. | ||
391 | |||
392 | 4.19 KVM_SET_CPUID | ||
393 | |||
394 | Capability: basic | ||
395 | Architectures: x86 | ||
396 | Type: vcpu ioctl | ||
397 | Parameters: struct kvm_cpuid (in) | ||
398 | Returns: 0 on success, -1 on error | ||
399 | |||
400 | Defines the vcpu responses to the cpuid instruction. Applications | ||
401 | should use the KVM_SET_CPUID2 ioctl if available. | ||
402 | |||
403 | |||
404 | struct kvm_cpuid_entry { | ||
405 | __u32 function; | ||
406 | __u32 eax; | ||
407 | __u32 ebx; | ||
408 | __u32 ecx; | ||
409 | __u32 edx; | ||
410 | __u32 padding; | ||
411 | }; | ||
412 | |||
413 | /* for KVM_SET_CPUID */ | ||
414 | struct kvm_cpuid { | ||
415 | __u32 nent; | ||
416 | __u32 padding; | ||
417 | struct kvm_cpuid_entry entries[0]; | ||
418 | }; | ||
419 | |||
420 | 4.20 KVM_SET_SIGNAL_MASK | ||
421 | |||
422 | Capability: basic | ||
423 | Architectures: x86 | ||
424 | Type: vcpu ioctl | ||
425 | Parameters: struct kvm_signal_mask (in) | ||
426 | Returns: 0 on success, -1 on error | ||
427 | |||
428 | Defines which signals are blocked during execution of KVM_RUN. This | ||
429 | signal mask temporarily overrides the thread's signal mask. Any | ||
430 | unblocked signal received (except SIGKILL and SIGSTOP, which retain | ||
431 | their traditional behaviour) will cause KVM_RUN to return with -EINTR. | ||
432 | |||
433 | Note the signal will only be delivered if not blocked by the original | ||
434 | signal mask. | ||
435 | |||
436 | /* for KVM_SET_SIGNAL_MASK */ | ||
437 | struct kvm_signal_mask { | ||
438 | __u32 len; | ||
439 | __u8 sigset[0]; | ||
440 | }; | ||
441 | |||
442 | 4.21 KVM_GET_FPU | ||
443 | |||
444 | Capability: basic | ||
445 | Architectures: x86 | ||
446 | Type: vcpu ioctl | ||
447 | Parameters: struct kvm_fpu (out) | ||
448 | Returns: 0 on success, -1 on error | ||
449 | |||
450 | Reads the floating point state from the vcpu. | ||
451 | |||
452 | /* for KVM_GET_FPU and KVM_SET_FPU */ | ||
453 | struct kvm_fpu { | ||
454 | __u8 fpr[8][16]; | ||
455 | __u16 fcw; | ||
456 | __u16 fsw; | ||
457 | __u8 ftwx; /* in fxsave format */ | ||
458 | __u8 pad1; | ||
459 | __u16 last_opcode; | ||
460 | __u64 last_ip; | ||
461 | __u64 last_dp; | ||
462 | __u8 xmm[16][16]; | ||
463 | __u32 mxcsr; | ||
464 | __u32 pad2; | ||
465 | }; | ||
466 | |||
467 | 4.22 KVM_SET_FPU | ||
468 | |||
469 | Capability: basic | ||
470 | Architectures: x86 | ||
471 | Type: vcpu ioctl | ||
472 | Parameters: struct kvm_fpu (in) | ||
473 | Returns: 0 on success, -1 on error | ||
474 | |||
475 | Writes the floating point state to the vcpu. | ||
476 | |||
477 | /* for KVM_GET_FPU and KVM_SET_FPU */ | ||
478 | struct kvm_fpu { | ||
479 | __u8 fpr[8][16]; | ||
480 | __u16 fcw; | ||
481 | __u16 fsw; | ||
482 | __u8 ftwx; /* in fxsave format */ | ||
483 | __u8 pad1; | ||
484 | __u16 last_opcode; | ||
485 | __u64 last_ip; | ||
486 | __u64 last_dp; | ||
487 | __u8 xmm[16][16]; | ||
488 | __u32 mxcsr; | ||
489 | __u32 pad2; | ||
490 | }; | ||
491 | |||
492 | 4.23 KVM_CREATE_IRQCHIP | ||
493 | |||
494 | Capability: KVM_CAP_IRQCHIP | ||
495 | Architectures: x86, ia64 | ||
496 | Type: vm ioctl | ||
497 | Parameters: none | ||
498 | Returns: 0 on success, -1 on error | ||
499 | |||
500 | Creates an interrupt controller model in the kernel. On x86, creates a virtual | ||
501 | ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a | ||
502 | local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23 | ||
503 | only go to the IOAPIC. On ia64, an IOSAPIC is created. | ||
504 | |||
505 | 4.24 KVM_IRQ_LINE | ||
506 | |||
507 | Capability: KVM_CAP_IRQCHIP | ||
508 | Architectures: x86, ia64 | ||
509 | Type: vm ioctl | ||
510 | Parameters: struct kvm_irq_level | ||
511 | Returns: 0 on success, -1 on error | ||
512 | |||
513 | Sets the level of a GSI input to the interrupt controller model in the kernel. | ||
514 | Requires that an interrupt controller model has been previously created with | ||
515 | KVM_CREATE_IRQCHIP. Note that edge-triggered interrupts require the level | ||
516 | to be set to 1 and then back to 0. | ||
517 | |||
518 | struct kvm_irq_level { | ||
519 | union { | ||
520 | __u32 irq; /* GSI */ | ||
521 | __s32 status; /* not used for KVM_IRQ_LINE */ | ||
522 | }; | ||
523 | __u32 level; /* 0 or 1 */ | ||
524 | }; | ||
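
The edge-triggered case mentioned above (level set to 1, then back to 0) can
be sketched as a small pulse helper. Both function names are hypothetical, and
the code assumes an in-kernel interrupt controller was already created with
KVM_CREATE_IRQCHIP.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Drive one GSI input to the given level (0 or 1). */
static int kvm_set_irq_level(int vm_fd, unsigned int gsi, unsigned int level)
{
    struct kvm_irq_level irq;

    memset(&irq, 0, sizeof(irq));
    irq.irq = gsi;
    irq.level = level;
    return ioctl(vm_fd, KVM_IRQ_LINE, &irq);
}

/* Raise and then lower the line, as an edge-triggered interrupt requires. */
int kvm_pulse_edge_irq(int vm_fd, unsigned int gsi)
{
    if (kvm_set_irq_level(vm_fd, gsi, 1) < 0)
        return -1;
    return kvm_set_irq_level(vm_fd, gsi, 0);
}
```

A level-triggered source would instead call the first helper alone and keep
the line asserted until the device condition clears.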
525 | |||
526 | 4.25 KVM_GET_IRQCHIP | ||
527 | |||
528 | Capability: KVM_CAP_IRQCHIP | ||
529 | Architectures: x86, ia64 | ||
530 | Type: vm ioctl | ||
531 | Parameters: struct kvm_irqchip (in/out) | ||
532 | Returns: 0 on success, -1 on error | ||
533 | |||
534 | Reads the state of a kernel interrupt controller created with | ||
535 | KVM_CREATE_IRQCHIP into a buffer provided by the caller. | ||
536 | |||
537 | struct kvm_irqchip { | ||
538 | __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ | ||
539 | __u32 pad; | ||
540 | union { | ||
541 | char dummy[512]; /* reserving space */ | ||
542 | struct kvm_pic_state pic; | ||
543 | struct kvm_ioapic_state ioapic; | ||
544 | } chip; | ||
545 | }; | ||
546 | |||
547 | 4.26 KVM_SET_IRQCHIP | ||
548 | |||
549 | Capability: KVM_CAP_IRQCHIP | ||
550 | Architectures: x86, ia64 | ||
551 | Type: vm ioctl | ||
552 | Parameters: struct kvm_irqchip (in) | ||
553 | Returns: 0 on success, -1 on error | ||
554 | |||
555 | Sets the state of a kernel interrupt controller created with | ||
556 | KVM_CREATE_IRQCHIP from a buffer provided by the caller. | ||
557 | |||
558 | struct kvm_irqchip { | ||
559 | __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ | ||
560 | __u32 pad; | ||
561 | union { | ||
562 | char dummy[512]; /* reserving space */ | ||
563 | struct kvm_pic_state pic; | ||
564 | struct kvm_ioapic_state ioapic; | ||
565 | } chip; | ||
566 | }; | ||
567 | |||
568 | 4.27 KVM_XEN_HVM_CONFIG | ||
569 | |||
570 | Capability: KVM_CAP_XEN_HVM | ||
571 | Architectures: x86 | ||
572 | Type: vm ioctl | ||
573 | Parameters: struct kvm_xen_hvm_config (in) | ||
574 | Returns: 0 on success, -1 on error | ||
575 | |||
576 | Sets the MSR that the Xen HVM guest uses to initialize its hypercall | ||
577 | page, and provides the starting address and size of the hypercall | ||
578 | blobs in userspace. When the guest writes the MSR, kvm copies one | ||
579 | page of a blob (32- or 64-bit, depending on the vcpu mode) to guest | ||
580 | memory. | ||
581 | |||
582 | struct kvm_xen_hvm_config { | ||
583 | __u32 flags; | ||
584 | __u32 msr; | ||
585 | __u64 blob_addr_32; | ||
586 | __u64 blob_addr_64; | ||
587 | __u8 blob_size_32; | ||
588 | __u8 blob_size_64; | ||
589 | __u8 pad2[30]; | ||
590 | }; | ||
591 | |||
592 | 4.27 KVM_GET_CLOCK | ||
593 | |||
594 | Capability: KVM_CAP_ADJUST_CLOCK | ||
595 | Architectures: x86 | ||
596 | Type: vm ioctl | ||
597 | Parameters: struct kvm_clock_data (out) | ||
598 | Returns: 0 on success, -1 on error | ||
599 | |||
600 | Gets the current timestamp of kvmclock as seen by the current guest. In | ||
601 | conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity in scenarios | ||
602 | such as migration. | ||
603 | |||
604 | struct kvm_clock_data { | ||
605 | __u64 clock; /* kvmclock current value */ | ||
606 | __u32 flags; | ||
607 | __u32 pad[9]; | ||
608 | }; | ||
609 | |||
610 | 4.28 KVM_SET_CLOCK | ||
611 | |||
612 | Capability: KVM_CAP_ADJUST_CLOCK | ||
613 | Architectures: x86 | ||
614 | Type: vm ioctl | ||
615 | Parameters: struct kvm_clock_data (in) | ||
616 | Returns: 0 on success, -1 on error | ||
617 | |||
618 | Sets the current timestamp of kvmclock to the value specified in its parameter. | ||
619 | In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity in scenarios | ||
620 | such as migration. | ||
621 | |||
622 | struct kvm_clock_data { | ||
623 | __u64 clock; /* kvmclock current value */ | ||
624 | __u32 flags; | ||
625 | __u32 pad[9]; | ||
626 | }; | ||
627 | |||
628 | 4.29 KVM_GET_VCPU_EVENTS | ||
629 | |||
630 | Capability: KVM_CAP_VCPU_EVENTS | ||
631 | Extended by: KVM_CAP_INTR_SHADOW | ||
632 | Architectures: x86 | ||
633 | Type: vcpu ioctl | ||
634 | Parameters: struct kvm_vcpu_events (out) | ||
635 | Returns: 0 on success, -1 on error | ||
636 | |||
637 | Gets currently pending exceptions, interrupts, and NMIs as well as related | ||
638 | states of the vcpu. | ||
639 | |||
640 | struct kvm_vcpu_events { | ||
641 | struct { | ||
642 | __u8 injected; | ||
643 | __u8 nr; | ||
644 | __u8 has_error_code; | ||
645 | __u8 pad; | ||
646 | __u32 error_code; | ||
647 | } exception; | ||
648 | struct { | ||
649 | __u8 injected; | ||
650 | __u8 nr; | ||
651 | __u8 soft; | ||
652 | __u8 shadow; | ||
653 | } interrupt; | ||
654 | struct { | ||
655 | __u8 injected; | ||
656 | __u8 pending; | ||
657 | __u8 masked; | ||
658 | __u8 pad; | ||
659 | } nmi; | ||
660 | __u32 sipi_vector; | ||
661 | __u32 flags; | ||
662 | }; | ||
663 | |||
664 | KVM_VCPUEVENT_VALID_SHADOW may be set in the flags field to signal that | ||
665 | interrupt.shadow contains a valid state. Otherwise, this field is undefined. | ||
666 | |||
667 | 4.30 KVM_SET_VCPU_EVENTS | ||
668 | |||
669 | Capability: KVM_CAP_VCPU_EVENTS | ||
670 | Extended by: KVM_CAP_INTR_SHADOW | ||
671 | Architectures: x86 | ||
672 | Type: vcpu ioctl | ||
673 | Parameters: struct kvm_vcpu_events (in) | ||
674 | Returns: 0 on success, -1 on error | ||
675 | |||
676 | Set pending exceptions, interrupts, and NMIs as well as related states of the | ||
677 | vcpu. | ||
678 | |||
679 | See KVM_GET_VCPU_EVENTS for the data structure. | ||
680 | |||
681 | Fields that may be modified asynchronously by running VCPUs can be excluded | ||
682 | from the update. These fields are nmi.pending and sipi_vector. Keep the | ||
683 | corresponding bits in the flags field cleared to suppress overwriting the | ||
684 | current in-kernel state. The bits are: | ||
685 | |||
686 | KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel | ||
687 | KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector | ||
688 | |||
689 | If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in | ||
690 | the flags field to signal that interrupt.shadow contains a valid state and | ||
691 | shall be written into the VCPU. | ||
692 | |||
693 | 4.32 KVM_GET_DEBUGREGS | ||
694 | |||
695 | Capability: KVM_CAP_DEBUGREGS | ||
696 | Architectures: x86 | ||
697 | Type: vcpu ioctl | ||
698 | Parameters: struct kvm_debugregs (out) | ||
699 | Returns: 0 on success, -1 on error | ||
700 | |||
701 | Reads debug registers from the vcpu. | ||
702 | |||
703 | struct kvm_debugregs { | ||
704 | __u64 db[4]; | ||
705 | __u64 dr6; | ||
706 | __u64 dr7; | ||
707 | __u64 flags; | ||
708 | __u64 reserved[9]; | ||
709 | }; | ||
710 | |||
711 | 4.33 KVM_SET_DEBUGREGS | ||
712 | |||
713 | Capability: KVM_CAP_DEBUGREGS | ||
714 | Architectures: x86 | ||
715 | Type: vcpu ioctl | ||
716 | Parameters: struct kvm_debugregs (in) | ||
717 | Returns: 0 on success, -1 on error | ||
718 | |||
719 | Writes debug registers into the vcpu. | ||
720 | |||
721 | See KVM_GET_DEBUGREGS for the data structure. The flags field is not | ||
722 | yet used and must be cleared on entry. | ||
723 | |||
724 | 4.34 KVM_SET_USER_MEMORY_REGION | ||
725 | |||
726 | Capability: KVM_CAP_USER_MEMORY | ||
727 | Architectures: all | ||
728 | Type: vm ioctl | ||
729 | Parameters: struct kvm_userspace_memory_region (in) | ||
730 | Returns: 0 on success, -1 on error | ||
731 | |||
732 | struct kvm_userspace_memory_region { | ||
733 | __u32 slot; | ||
734 | __u32 flags; | ||
735 | __u64 guest_phys_addr; | ||
736 | __u64 memory_size; /* bytes */ | ||
737 | __u64 userspace_addr; /* start of the userspace allocated memory */ | ||
738 | }; | ||
739 | |||
740 | /* for kvm_memory_region::flags */ | ||
741 | #define KVM_MEM_LOG_DIRTY_PAGES 1UL | ||
742 | |||
743 | This ioctl allows the user to create or modify a guest physical memory | ||
744 | slot. When changing an existing slot, it may be moved in the guest | ||
745 | physical memory space, or its flags may be modified. It may not be | ||
746 | resized. Slots may not overlap in guest physical address space. | ||
747 | |||
748 | Memory for the region is taken starting at the address denoted by the | ||
749 | field userspace_addr, which must point at user addressable memory for | ||
750 | the entire memory slot size. Any object may back this memory, including | ||
751 | anonymous memory, ordinary files, and hugetlbfs. | ||
752 | |||
753 | It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr | ||
754 | be identical. This allows large pages in the guest to be backed by large | ||
755 | pages in the host. | ||
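
The alignment recommendation can be checked before the slot is registered.
The sketch below is illustrative only: LARGE_PAGE_MASK and both function names
are hypothetical, and the structure is cleared first so that no field is left
uninitialized.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Mask covering the lower 21 bits (a 2 MB large page). */
#define LARGE_PAGE_MASK ((UINT64_C(1) << 21) - 1)

/* Nonzero if both addresses share the same lower 21 bits. */
int low_bits_match(uint64_t guest_phys_addr, uint64_t userspace_addr)
{
    return ((guest_phys_addr ^ userspace_addr) & LARGE_PAGE_MASK) == 0;
}

/* Create or modify a slot backed by memory starting at host. */
int kvm_add_memory_slot(int vm_fd, uint32_t slot, uint64_t guest_phys_addr,
                        void *host, uint64_t size, int log_dirty)
{
    struct kvm_userspace_memory_region region;

    memset(&region, 0, sizeof(region));
    region.slot = slot;
    region.flags = log_dirty ? KVM_MEM_LOG_DIRTY_PAGES : 0;
    region.guest_phys_addr = guest_phys_addr;
    region.memory_size = size;  /* bytes */
    region.userspace_addr = (uint64_t)(uintptr_t)host;
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}
```

A caller that wants large-page backing could allocate the host buffer with a
2 MB alignment and verify the pairing with low_bits_match() before calling
kvm_add_memory_slot().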
756 | |||
757 | The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which | ||
758 | instructs kvm to keep track of writes to memory within the slot. See | ||
759 | the KVM_GET_DIRTY_LOG ioctl. | ||
760 | |||
761 | When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory | ||
762 | region are automatically reflected into the guest. For example, an mmap() | ||
763 | that affects the region will be made visible immediately. Another example | ||
764 | is madvise(MADV_DROP). | ||
765 | |||
766 | It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl. | ||
767 | The KVM_SET_MEMORY_REGION does not allow fine grained control over memory | ||
768 | allocation and is deprecated. | ||
769 | |||
770 | 4.35 KVM_SET_TSS_ADDR | ||
771 | |||
772 | Capability: KVM_CAP_SET_TSS_ADDR | ||
773 | Architectures: x86 | ||
774 | Type: vm ioctl | ||
775 | Parameters: unsigned long tss_address (in) | ||
776 | Returns: 0 on success, -1 on error | ||
777 | |||
778 | This ioctl defines the physical address of a three-page region in the guest | ||
779 | physical address space. The region must be within the first 4GB of the | ||
780 | guest physical address space and must not conflict with any memory slot | ||
781 | or any mmio address. The guest may malfunction if it accesses this memory | ||
782 | region. | ||
783 | |||
784 | This ioctl is required on Intel-based hosts. This is needed on Intel hardware | ||
785 | because of a quirk in the virtualization implementation (see the internals | ||
786 | documentation when it pops into existence). | ||
787 | |||
788 | 4.36 KVM_ENABLE_CAP | ||
789 | |||
790 | Capability: KVM_CAP_ENABLE_CAP | ||
791 | Architectures: ppc | ||
792 | Type: vcpu ioctl | ||
793 | Parameters: struct kvm_enable_cap (in) | ||
794 | Returns: 0 on success; -1 on error | ||
795 | |||
796 | Not all extensions are enabled by default. Using this ioctl the application | ||
797 | can enable an extension, making it available to the guest. | ||
798 | |||
799 | On systems that do not support this ioctl, it always fails. On systems that | ||
800 | do support it, it only works for extensions that are supported for enablement. | ||
801 | |||
802 | To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should | ||
803 | be used. | ||
804 | |||
805 | struct kvm_enable_cap { | ||
806 | /* in */ | ||
807 | __u32 cap; | ||
808 | |||
809 | The capability that is supposed to get enabled. | ||
810 | |||
811 | __u32 flags; | ||
812 | |||
813 | A bitfield indicating future enhancements. Has to be 0 for now. | ||
814 | |||
815 | __u64 args[4]; | ||
816 | |||
817 | Arguments for enabling a feature. If a feature needs initial values to | ||
818 | function properly, this is the place to put them. | ||
819 | |||
820 | __u8 pad[64]; | ||
821 | }; | ||
822 | |||
823 | 4.37 KVM_GET_MP_STATE | ||
824 | |||
825 | Capability: KVM_CAP_MP_STATE | ||
826 | Architectures: x86, ia64 | ||
827 | Type: vcpu ioctl | ||
828 | Parameters: struct kvm_mp_state (out) | ||
829 | Returns: 0 on success; -1 on error | ||
830 | |||
831 | struct kvm_mp_state { | ||
832 | __u32 mp_state; | ||
833 | }; | ||
834 | |||
835 | Returns the vcpu's current "multiprocessing state" (though also valid on | ||
836 | uniprocessor guests). | ||
837 | |||
838 | Possible values are: | ||
839 | |||
840 | - KVM_MP_STATE_RUNNABLE: the vcpu is currently running | ||
841 | - KVM_MP_STATE_UNINITIALIZED: the vcpu is an application processor (AP) | ||
842 | which has not yet received an INIT signal | ||
843 | - KVM_MP_STATE_INIT_RECEIVED: the vcpu has received an INIT signal, and is | ||
844 | now ready for a SIPI | ||
845 | - KVM_MP_STATE_HALTED: the vcpu has executed a HLT instruction and | ||
846 | is waiting for an interrupt | ||
847 | - KVM_MP_STATE_SIPI_RECEIVED: the vcpu has just received a SIPI (vector | ||
848 | accessible via KVM_GET_VCPU_EVENTS) | ||
849 | |||
850 | This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel | ||
851 | irqchip, the multiprocessing state must be maintained by userspace. | ||
852 | |||
853 | 4.38 KVM_SET_MP_STATE | ||
854 | |||
855 | Capability: KVM_CAP_MP_STATE | ||
856 | Architectures: x86, ia64 | ||
857 | Type: vcpu ioctl | ||
858 | Parameters: struct kvm_mp_state (in) | ||
859 | Returns: 0 on success; -1 on error | ||
860 | |||
861 | Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for | ||
862 | arguments. | ||
863 | |||
864 | This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel | ||
865 | irqchip, the multiprocessing state must be maintained by userspace. | ||
866 | |||
867 | 4.39 KVM_SET_IDENTITY_MAP_ADDR | ||
868 | |||
869 | Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR | ||
870 | Architectures: x86 | ||
871 | Type: vm ioctl | ||
872 | Parameters: unsigned long identity (in) | ||
873 | Returns: 0 on success, -1 on error | ||
874 | |||
875 | This ioctl defines the physical address of a one-page region in the guest | ||
876 | physical address space. The region must be within the first 4GB of the | ||
877 | guest physical address space and must not conflict with any memory slot | ||
878 | or any mmio address. The guest may malfunction if it accesses this memory | ||
879 | region. | ||
880 | |||
881 | This ioctl is required on Intel-based hosts. This is needed on Intel hardware | ||
882 | because of a quirk in the virtualization implementation (see the internals | ||
883 | documentation when it pops into existence). | ||
884 | |||
885 | 4.40 KVM_SET_BOOT_CPU_ID | ||
886 | |||
887 | Capability: KVM_CAP_SET_BOOT_CPU_ID | ||
888 | Architectures: x86, ia64 | ||
889 | Type: vm ioctl | ||
890 | Parameters: unsigned long vcpu_id | ||
891 | Returns: 0 on success, -1 on error | ||
892 | |||
893 | Define which vcpu is the Bootstrap Processor (BSP). Values are the same | ||
894 | as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default | ||
895 | is vcpu 0. | ||
896 | |||
897 | 4.41 KVM_GET_XSAVE | ||
898 | |||
899 | Capability: KVM_CAP_XSAVE | ||
900 | Architectures: x86 | ||
901 | Type: vcpu ioctl | ||
902 | Parameters: struct kvm_xsave (out) | ||
903 | Returns: 0 on success, -1 on error | ||
904 | |||
905 | struct kvm_xsave { | ||
906 | __u32 region[1024]; | ||
907 | }; | ||
908 | |||
909 | This ioctl copies the current vcpu's xsave struct to userspace. | ||
910 | |||
911 | 4.42 KVM_SET_XSAVE | ||
912 | |||
913 | Capability: KVM_CAP_XSAVE | ||
914 | Architectures: x86 | ||
915 | Type: vcpu ioctl | ||
916 | Parameters: struct kvm_xsave (in) | ||
917 | Returns: 0 on success, -1 on error | ||
918 | |||
919 | struct kvm_xsave { | ||
920 | __u32 region[1024]; | ||
921 | }; | ||
922 | |||
923 | This ioctl copies userspace's xsave struct to the kernel. | ||
924 | |||
925 | 4.43 KVM_GET_XCRS | ||
926 | |||
927 | Capability: KVM_CAP_XCRS | ||
928 | Architectures: x86 | ||
929 | Type: vcpu ioctl | ||
930 | Parameters: struct kvm_xcrs (out) | ||
931 | Returns: 0 on success, -1 on error | ||
932 | |||
933 | struct kvm_xcr { | ||
934 | __u32 xcr; | ||
935 | __u32 reserved; | ||
936 | __u64 value; | ||
937 | }; | ||
938 | |||
939 | struct kvm_xcrs { | ||
940 | __u32 nr_xcrs; | ||
941 | __u32 flags; | ||
942 | struct kvm_xcr xcrs[KVM_MAX_XCRS]; | ||
943 | __u64 padding[16]; | ||
944 | }; | ||
945 | |||
946 | This ioctl copies the current vcpu's xcrs to userspace. | ||
947 | |||
948 | 4.44 KVM_SET_XCRS | ||
949 | |||
950 | Capability: KVM_CAP_XCRS | ||
951 | Architectures: x86 | ||
952 | Type: vcpu ioctl | ||
953 | Parameters: struct kvm_xcrs (in) | ||
954 | Returns: 0 on success, -1 on error | ||
955 | |||
956 | struct kvm_xcr { | ||
957 | __u32 xcr; | ||
958 | __u32 reserved; | ||
959 | __u64 value; | ||
960 | }; | ||
961 | |||
962 | struct kvm_xcrs { | ||
963 | __u32 nr_xcrs; | ||
964 | __u32 flags; | ||
965 | struct kvm_xcr xcrs[KVM_MAX_XCRS]; | ||
966 | __u64 padding[16]; | ||
967 | }; | ||
968 | |||
969 | This ioctl sets the vcpu's xcrs to the values userspace specified. | ||
970 | |||
971 | 4.45 KVM_GET_SUPPORTED_CPUID | ||
972 | |||
973 | Capability: KVM_CAP_EXT_CPUID | ||
974 | Architectures: x86 | ||
975 | Type: system ioctl | ||
976 | Parameters: struct kvm_cpuid2 (in/out) | ||
977 | Returns: 0 on success, -1 on error | ||
978 | |||
979 | struct kvm_cpuid2 { | ||
980 | __u32 nent; | ||
981 | __u32 padding; | ||
982 | struct kvm_cpuid_entry2 entries[0]; | ||
983 | }; | ||
984 | |||
985 | #define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1 | ||
986 | #define KVM_CPUID_FLAG_STATEFUL_FUNC 2 | ||
987 | #define KVM_CPUID_FLAG_STATE_READ_NEXT 4 | ||
988 | |||
989 | struct kvm_cpuid_entry2 { | ||
990 | __u32 function; | ||
991 | __u32 index; | ||
992 | __u32 flags; | ||
993 | __u32 eax; | ||
994 | __u32 ebx; | ||
995 | __u32 ecx; | ||
996 | __u32 edx; | ||
997 | __u32 padding[3]; | ||
998 | }; | ||
999 | |||
1000 | This ioctl returns x86 cpuid features which are supported by both the hardware | ||
1001 | and kvm. Userspace can use the information returned by this ioctl to | ||
1002 | construct cpuid information (for KVM_SET_CPUID2) that is consistent with | ||
1003 | hardware, kernel, and userspace capabilities, and with user requirements (for | ||
1004 | example, the user may wish to constrain cpuid to emulate older hardware, | ||
1005 | or for feature consistency across a cluster). | ||
1006 | |||
1007 | Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure | ||
1008 | with the 'nent' field indicating the number of entries in the variable-size | ||
1009 | array 'entries'. If the number of entries is too low to describe the cpu | ||
1010 | capabilities, an error (E2BIG) is returned. If the number is too high, | ||
1011 | the 'nent' field is adjusted and an error (ENOMEM) is returned. If the | ||
1012 | number is just right, the 'nent' field is adjusted to the number of valid | ||
1013 | entries in the 'entries' array, which is then filled. | ||
1014 | |||
1015 | The entries returned are the host cpuid as returned by the cpuid instruction, | ||
1016 | with unknown or unsupported features masked out. The fields in each entry | ||
1017 | are defined as follows: | ||
1018 | |||
1019 | function: the eax value used to obtain the entry | ||
1020 | index: the ecx value used to obtain the entry (for entries that are | ||
1021 | affected by ecx) | ||
1022 | flags: an OR of zero or more of the following: | ||
1023 | KVM_CPUID_FLAG_SIGNIFCANT_INDEX: | ||
1024 | if the index field is valid | ||
1025 | KVM_CPUID_FLAG_STATEFUL_FUNC: | ||
1026 | if cpuid for this function returns different values for successive | ||
1027 | invocations; there will be several entries with the same function, | ||
1028 | all with this flag set | ||
1029 | KVM_CPUID_FLAG_STATE_READ_NEXT: | ||
1030 | for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is | ||
1031 | the first entry to be read by a cpu | ||
1032 | eax, ebx, ecx, edx: the values returned by the cpuid instruction for | ||
1033 | this function/index combination | ||
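The 'nent' negotiation described above lends itself to a retry loop. The sketch below is one possible shape, not the reference implementation: the ioctl is replaced by a hypothetical callback (get_cpuid_fn) so the example runs without /dev/kvm, the structures are local mirrors, and fake_get_cpuid is a mock that pretends the host needs exactly five entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirrors of the structures listed above (illustration only). */
struct cpuid_entry2 {
    uint32_t function, index, flags;
    uint32_t eax, ebx, ecx, edx;
    uint32_t padding[3];
};

struct cpuid2 {
    uint32_t nent;
    uint32_t padding;
    struct cpuid_entry2 entries[];
};

/* Hypothetical stand-in for ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, buf):
 * returns 0 on success, or -1 with errno set. */
typedef int (*get_cpuid_fn)(struct cpuid2 *buf);

/* Retry until 'nent' is accepted. The caller frees the returned buffer. */
struct cpuid2 *fetch_supported_cpuid(get_cpuid_fn get, uint32_t first_guess)
{
    uint32_t nent = first_guess ? first_guess : 1;

    assert(get != NULL);
    for (int attempt = 0; attempt < 32; attempt++) {
        struct cpuid2 *buf =
            calloc(1, sizeof(*buf) + nent * sizeof(buf->entries[0]));
        if (!buf)
            return NULL;
        buf->nent = nent;
        if (get(buf) == 0)
            return buf;          /* nent now holds the number of valid entries */

        int err = errno;
        uint32_t adjusted = buf->nent;
        free(buf);

        if (err == E2BIG && nent <= UINT32_MAX / 2) {
            nent *= 2;           /* too few entries: grow and retry */
        } else if (err == ENOMEM && adjusted > 0 && adjusted < nent) {
            nent = adjusted;     /* too many: retry with the adjusted count */
        } else {
            errno = err;
            return NULL;
        }
    }
    errno = E2BIG;
    return NULL;
}

/* Mock backend: behaves as the text describes for a host needing 5 entries. */
int fake_get_cpuid(struct cpuid2 *buf)
{
    const uint32_t needed = 5;

    if (buf->nent < needed) { errno = E2BIG; return -1; }
    if (buf->nent > needed) { buf->nent = needed; errno = ENOMEM; return -1; }
    for (uint32_t i = 0; i < needed; i++)
        buf->entries[i].function = i;
    return 0;
}
```

Starting from a guess of 1, the loop doubles on E2BIG until it overshoots, then falls back to the count the mock reports through 'nent'.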
1034 | |||
1035 | 5. The kvm_run structure | ||
1036 | |||
1037 | Application code obtains a pointer to the kvm_run structure by | ||
1038 | mmap()ing a vcpu fd. From that point, application code can control | ||
1039 | execution by changing fields in kvm_run prior to calling the KVM_RUN | ||
1040 | ioctl, and obtain information about the reason KVM_RUN returned by | ||
1041 | looking up structure members. | ||
1042 | |||
1043 | struct kvm_run { | ||
1044 | /* in */ | ||
1045 | __u8 request_interrupt_window; | ||
1046 | |||
1047 | Request that KVM_RUN return when it becomes possible to inject external | ||
1048 | interrupts into the guest. Useful in conjunction with KVM_INTERRUPT. | ||
1049 | |||
1050 | __u8 padding1[7]; | ||
1051 | |||
1052 | /* out */ | ||
1053 | __u32 exit_reason; | ||
1054 | |||
1055 | When KVM_RUN has returned successfully (return value 0), this informs | ||
1056 | application code why KVM_RUN has returned. Allowable values for this | ||
1057 | field are detailed below. | ||
1058 | |||
1059 | __u8 ready_for_interrupt_injection; | ||
1060 | |||
1061 | If request_interrupt_window has been specified, this field indicates | ||
1062 | an interrupt can be injected now with KVM_INTERRUPT. | ||
1063 | |||
1064 | __u8 if_flag; | ||
1065 | |||
1066 | The value of the current interrupt flag. Only valid if in-kernel | ||
1067 | local APIC is not used. | ||
1068 | |||
1069 | __u8 padding2[2]; | ||
1070 | |||
1071 | /* in (pre_kvm_run), out (post_kvm_run) */ | ||
1072 | __u64 cr8; | ||
1073 | |||
1074 | The value of the cr8 register. Only valid if in-kernel local APIC is | ||
1075 | not used. Both input and output. | ||
1076 | |||
1077 | __u64 apic_base; | ||
1078 | |||
1079 | The value of the APIC BASE msr. Only valid if in-kernel local | ||
1080 | APIC is not used. Both input and output. | ||
1081 | |||
1082 | union { | ||
1083 | /* KVM_EXIT_UNKNOWN */ | ||
1084 | struct { | ||
1085 | __u64 hardware_exit_reason; | ||
1086 | } hw; | ||
1087 | |||
1088 | If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown | ||
1089 | reasons. Further architecture-specific information is available in | ||
1090 | hardware_exit_reason. | ||
1091 | |||
1092 | /* KVM_EXIT_FAIL_ENTRY */ | ||
1093 | struct { | ||
1094 | __u64 hardware_entry_failure_reason; | ||
1095 | } fail_entry; | ||
1096 | |||
1097 | If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due | ||
1098 | to unknown reasons. Further architecture-specific information is | ||
1099 | available in hardware_entry_failure_reason. | ||
1100 | |||
1101 | /* KVM_EXIT_EXCEPTION */ | ||
1102 | struct { | ||
1103 | __u32 exception; | ||
1104 | __u32 error_code; | ||
1105 | } ex; | ||
1106 | |||
1107 | Unused. | ||
1108 | |||
1109 | /* KVM_EXIT_IO */ | ||
1110 | struct { | ||
1111 | #define KVM_EXIT_IO_IN 0 | ||
1112 | #define KVM_EXIT_IO_OUT 1 | ||
1113 | __u8 direction; | ||
1114 | __u8 size; /* bytes */ | ||
1115 | __u16 port; | ||
1116 | __u32 count; | ||
1117 | __u64 data_offset; /* relative to kvm_run start */ | ||
1118 | } io; | ||
1119 | |||
1120 | If exit_reason is KVM_EXIT_IO, then the vcpu has | ||
1121 | executed a port I/O instruction which could not be satisfied by kvm. | ||
1122 | data_offset describes where the data is located (KVM_EXIT_IO_OUT) or | ||
1123 | where kvm expects application code to place the data for the next | ||
1124 | KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array. | ||
1125 | |||
1126 | struct { | ||
1127 | struct kvm_debug_exit_arch arch; | ||
1128 | } debug; | ||
1129 | |||
1130 | Unused. | ||
1131 | |||
1132 | /* KVM_EXIT_MMIO */ | ||
1133 | struct { | ||
1134 | __u64 phys_addr; | ||
1135 | __u8 data[8]; | ||
1136 | __u32 len; | ||
1137 | __u8 is_write; | ||
1138 | } mmio; | ||
1139 | |||
1140 | If exit_reason is KVM_EXIT_MMIO, then the vcpu has | ||
1141 | executed a memory-mapped I/O instruction which could not be satisfied | ||
1142 | by kvm. The 'data' member contains the written data if 'is_write' is | ||
1143 | true, and should be filled by application code otherwise. | ||
1144 | |||
1145 | NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding | ||
1146 | operations are complete (and guest state is consistent) only after userspace | ||
1147 | has re-entered the kernel with KVM_RUN. The kernel side will first finish | ||
1148 | incomplete operations and then check for pending signals. Userspace | ||
1149 | can re-enter the guest with an unmasked signal pending to complete | ||
1150 | pending operations. | ||
1151 | |||
1152 | /* KVM_EXIT_HYPERCALL */ | ||
1153 | struct { | ||
1154 | __u64 nr; | ||
1155 | __u64 args[6]; | ||
1156 | __u64 ret; | ||
1157 | __u32 longmode; | ||
1158 | __u32 pad; | ||
1159 | } hypercall; | ||
1160 | |||
1161 | Unused. This was once used for 'hypercall to userspace'. To implement | ||
1162 | such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390). | ||
1163 | Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO. | ||
1164 | |||
1165 | /* KVM_EXIT_TPR_ACCESS */ | ||
1166 | struct { | ||
1167 | __u64 rip; | ||
1168 | __u32 is_write; | ||
1169 | __u32 pad; | ||
1170 | } tpr_access; | ||
1171 | |||
1172 | To be documented (KVM_TPR_ACCESS_REPORTING). | ||
1173 | |||
1174 | /* KVM_EXIT_S390_SIEIC */ | ||
1175 | struct { | ||
1176 | __u8 icptcode; | ||
1177 | __u64 mask; /* psw upper half */ | ||
1178 | __u64 addr; /* psw lower half */ | ||
1179 | __u16 ipa; | ||
1180 | __u32 ipb; | ||
1181 | } s390_sieic; | ||
1182 | |||
1183 | s390 specific. | ||
1184 | |||
1185 | /* KVM_EXIT_S390_RESET */ | ||
1186 | #define KVM_S390_RESET_POR 1 | ||
1187 | #define KVM_S390_RESET_CLEAR 2 | ||
1188 | #define KVM_S390_RESET_SUBSYSTEM 4 | ||
1189 | #define KVM_S390_RESET_CPU_INIT 8 | ||
1190 | #define KVM_S390_RESET_IPL 16 | ||
1191 | __u64 s390_reset_flags; | ||
1192 | |||
1193 | s390 specific. | ||
1194 | |||
1195 | /* KVM_EXIT_DCR */ | ||
1196 | struct { | ||
1197 | __u32 dcrn; | ||
1198 | __u32 data; | ||
1199 | __u8 is_write; | ||
1200 | } dcr; | ||
1201 | |||
1202 | powerpc specific. | ||
1203 | |||
1204 | /* KVM_EXIT_OSI */ | ||
1205 | struct { | ||
1206 | __u64 gprs[32]; | ||
1207 | } osi; | ||
1208 | |||
1209 | MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch | ||
1210 | hypercalls and exit with this exit struct that contains all the guest gprs. | ||
1211 | |||
1212 | If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall. | ||
1213 | Userspace can now handle the hypercall and when it's done modify the gprs as | ||
1214 | necessary. Upon guest entry all guest GPRs will then be replaced by the values | ||
1215 | in this struct. | ||
1216 | |||
1217 | /* Fix the size of the union. */ | ||
1218 | char padding[256]; | ||
1219 | }; | ||
1220 | }; | ||
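To make the io and mmio exit descriptions above more concrete, here is a small sketch of how application code might locate and move the data. The structures are local mirrors of the members shown above, and io_data, io_data_len and mmio_complete are hypothetical helper names; 'run_base' stands for the address obtained by mmap()ing the vcpu fd.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the io and mmio members shown above (illustration only). */
struct io_exit {
    uint8_t  direction;      /* 0 = KVM_EXIT_IO_IN, 1 = KVM_EXIT_IO_OUT */
    uint8_t  size;           /* bytes per element */
    uint16_t port;
    uint32_t count;
    uint64_t data_offset;    /* relative to the start of kvm_run */
};

struct mmio_exit {
    uint64_t phys_addr;
    uint8_t  data[8];
    uint32_t len;
    uint8_t  is_write;
};

/* Locate the packed data array of a port I/O exit inside the mapping. */
uint8_t *io_data(void *run_base, const struct io_exit *io)
{
    return (uint8_t *)run_base + io->data_offset;
}

/* Total bytes transferred: 'count' elements of 'size' bytes each. */
size_t io_data_len(const struct io_exit *io)
{
    return (size_t)io->size * io->count;
}

/* Move bytes between the exit record and a device model's buffer:
 * a guest write is copied out of 'data', a guest read is filled into it. */
void mmio_complete(struct mmio_exit *mmio, uint8_t *device_bytes)
{
    assert(mmio->len <= sizeof(mmio->data));
    if (mmio->is_write)
        memcpy(device_bytes, mmio->data, mmio->len);
    else
        memcpy(mmio->data, device_bytes, mmio->len);
}
```

As the NOTE above says, the guest only observes the result of such an operation once userspace has re-entered the kernel with KVM_RUN.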
diff --git a/Documentation/kvm/cpuid.txt b/Documentation/kvm/cpuid.txt deleted file mode 100644 index 14a12ea92b7f..000000000000 --- a/Documentation/kvm/cpuid.txt +++ /dev/null | |||
@@ -1,42 +0,0 @@ | |||
1 | KVM CPUID bits | ||
2 | Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 | ||
3 | ===================================================== | ||
4 | |||
5 | A guest running on a kvm host can check some of its features using | ||
6 | cpuid. This is not always guaranteed to work, since userspace can | ||
7 | mask-out some, or even all KVM-related cpuid features before launching | ||
8 | a guest. | ||
9 | |||
10 | KVM cpuid functions are: | ||
11 | |||
12 | function: KVM_CPUID_SIGNATURE (0x40000000) | ||
13 | returns : eax = 0, | ||
14 | ebx = 0x4b4d564b, | ||
15 | ecx = 0x564b4d56, | ||
16 | edx = 0x4d. | ||
17 | Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM". | ||
18 | This function queries the presence of KVM cpuid leaves. | ||
19 | |||
20 | |||
21 | function: KVM_CPUID_FEATURES (0x40000001) | ||
22 | returns : ebx, ecx, edx = 0 | ||
23 | eax = an OR'ed group of (1 << flag), where each flag is: | ||
24 | |||
25 | |||
26 | flag || value || meaning | ||
27 | ============================================================================= | ||
28 | KVM_FEATURE_CLOCKSOURCE || 0 || kvmclock available at msrs | ||
29 | || || 0x11 and 0x12. | ||
30 | ------------------------------------------------------------------------------ | ||
31 | KVM_FEATURE_NOP_IO_DELAY || 1 || not necessary to perform delays | ||
32 | || || on PIO operations. | ||
33 | ------------------------------------------------------------------------------ | ||
34 | KVM_FEATURE_MMU_OP || 2 || deprecated. | ||
35 | ------------------------------------------------------------------------------ | ||
36 | KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs | ||
37 | || || 0x4b564d00 and 0x4b564d01 | ||
38 | ------------------------------------------------------------------------------ | ||
39 | KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side | ||
40 | || || per-cpu warps are expected in | ||
41 | || || kvmclock. | ||
42 | ------------------------------------------------------------------------------ | ||
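The two leaves above can be decoded with a few lines of guest-side code. This sketch takes the register values as plain arguments (how they are obtained, for example via the cpuid instruction, is left out), and the FEAT_* names and both helpers are hypothetical; the bit numbers are the ones from the table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit numbers copied from the table above (local names, illustration only). */
enum {
    FEAT_CLOCKSOURCE            = 0,
    FEAT_NOP_IO_DELAY           = 1,
    FEAT_MMU_OP                 = 2,
    FEAT_CLOCKSOURCE2           = 3,
    FEAT_CLOCKSOURCE_STABLE_BIT = 24,
};

/* Rebuild the signature string from the ebx, ecx and edx values returned
 * for function 0x40000000. Each register carries four characters, lowest
 * byte first. 'out' must hold at least 13 bytes. */
void kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
    const uint32_t regs[3] = { ebx, ecx, edx };

    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
    out[12] = '\0';
}

/* Test one flag in the eax value returned for function 0x40000001. */
int kvm_has_feature(uint32_t features_eax, unsigned bit)
{
    assert(bit < 32);
    return (features_eax >> bit) & 1u;
}
```

With the values quoted above (ebx = 0x4b4d564b, ecx = 0x564b4d56, edx = 0x4d), kvm_signature yields "KVMKVMKVM".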
diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt deleted file mode 100644 index 142cc5136650..000000000000 --- a/Documentation/kvm/mmu.txt +++ /dev/null | |||
@@ -1,348 +0,0 @@ | |||
1 | The x86 kvm shadow mmu | ||
2 | ====================== | ||
3 | |||
4 | The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible | ||
5 | for presenting a standard x86 mmu to the guest, while translating guest | ||
6 | physical addresses to host physical addresses. | ||
7 | |||
8 | The mmu code attempts to satisfy the following requirements: | ||
9 | |||
10 | - correctness: the guest should not be able to determine that it is running | ||
11 | on an emulated mmu except for timing (we attempt to comply | ||
12 | with the specification, not emulate the characteristics of | ||
13 | a particular implementation such as tlb size) | ||
14 | - security: the guest must not be able to touch host memory not assigned | ||
15 | to it | ||
16 | - performance: minimize the performance penalty imposed by the mmu | ||
17 | - scaling: need to scale to large memory and large vcpu guests | ||
18 | - hardware: support the full range of x86 virtualization hardware | ||
19 | - integration: Linux memory management code must be in control of guest memory | ||
20 | so that swapping, page migration, page merging, transparent | ||
21 | hugepages, and similar features work without change | ||
22 | - dirty tracking: report writes to guest memory to enable live migration | ||
23 | and framebuffer-based displays | ||
24 | - footprint: keep the amount of pinned kernel memory low (most memory | ||
25 | should be shrinkable) | ||
26 | - reliability: avoid multipage or GFP_ATOMIC allocations | ||
27 | |||
28 | Acronyms | ||
29 | ======== | ||
30 | |||
31 | pfn host page frame number | ||
32 | hpa host physical address | ||
33 | hva host virtual address | ||
34 | gfn guest frame number | ||
35 | gpa guest physical address | ||
36 | gva guest virtual address | ||
37 | ngpa nested guest physical address | ||
38 | ngva nested guest virtual address | ||
39 | pte page table entry (used also to refer generically to paging structure | ||
40 | entries) | ||
41 | gpte guest pte (referring to gfns) | ||
42 | spte shadow pte (referring to pfns) | ||
43 | tdp two dimensional paging (vendor neutral term for NPT and EPT) | ||
44 | |||
45 | Virtual and real hardware supported | ||
46 | =================================== | ||
47 | |||
48 | The mmu supports first-generation mmu hardware, which allows an atomic switch | ||
49 | of the current paging mode and cr3 during guest entry, as well as | ||
50 | two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware | ||
51 | it exposes is the traditional 2/3/4 level x86 mmu, with support for global | ||
52 | pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support | ||
53 | exposing NPT capable hardware on NPT capable hosts. | ||
54 | |||
55 | Translation | ||
56 | =========== | ||
57 | |||
58 | The primary job of the mmu is to program the processor's mmu to translate | ||
59 | addresses for the guest. Different translations are required at different | ||
60 | times: | ||
61 | |||
62 | - when guest paging is disabled, we translate guest physical addresses to | ||
63 | host physical addresses (gpa->hpa) | ||
64 | - when guest paging is enabled, we translate guest virtual addresses, to | ||
65 | guest physical addresses, to host physical addresses (gva->gpa->hpa) | ||
66 | - when the guest launches a guest of its own, we translate nested guest | ||
67 | virtual addresses, to nested guest physical addresses, to guest physical | ||
68 | addresses, to host physical addresses (ngva->ngpa->gpa->hpa) | ||
69 | |||
70 | The primary challenge is to encode between 1 and 3 translations into hardware | ||
71 | that supports only 1 (traditional) or 2 (tdp) translations. When the | ||
72 | number of required translations matches the hardware, the mmu operates in | ||
73 | direct mode; otherwise it operates in shadow mode (see below). | ||
74 | |||
75 | Memory | ||
76 | ====== | ||
77 | |||
78 | Guest memory (gpa) is part of the user address space of the process that is | ||
79 | using kvm. Userspace defines the translation between guest addresses and user | ||
80 | addresses (gpa->hva); note that two gpas may alias to the same hva, but not | ||
81 | vice versa. | ||
82 | |||
83 | These hvas may be backed using any method available to the host: anonymous | ||
84 | memory, file backed memory, and device memory. Memory might be paged by the | ||
85 | host at any time. | ||
86 | |||
87 | Events | ||
88 | ====== | ||
89 | |||
90 | The mmu is driven by events, some from the guest, some from the host. | ||
91 | |||
92 | Guest generated events: | ||
93 | - writes to control registers (especially cr3) | ||
94 | - invlpg/invlpga instruction execution | ||
95 | - access to missing or protected translations | ||
96 | |||
97 | Host generated events: | ||
98 | - changes in the gpa->hpa translation (either through gpa->hva changes or | ||
99 | through hva->hpa changes) | ||
100 | - memory pressure (the shrinker) | ||
101 | |||
102 | Shadow pages | ||
103 | ============ | ||
104 | |||
105 | The principal data structure is the shadow page, 'struct kvm_mmu_page'. A | ||
106 | shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A | ||
107 | shadow page may contain a mix of leaf and nonleaf sptes. | ||
108 | |||
109 | A nonleaf spte allows the hardware mmu to reach the leaf pages and | ||
110 | is not related to a translation directly. It points to other shadow pages. | ||
111 | |||
112 | A leaf spte corresponds to either one or two translations encoded into | ||
113 | one paging structure entry. These are always the lowest level of the | ||
114 | translation stack, with optional higher level translations left to NPT/EPT. | ||
115 | Leaf ptes point at guest pages. | ||
116 | |||
117 | The following table shows translations encoded by leaf ptes, with higher-level | ||
118 | translations in parentheses: | ||
119 | |||
120 | Non-nested guests: | ||
121 | nonpaging: gpa->hpa | ||
122 | paging: gva->gpa->hpa | ||
123 | paging, tdp: (gva->)gpa->hpa | ||
124 | Nested guests: | ||
125 | non-tdp: ngva->gpa->hpa (*) | ||
126 | tdp: (ngva->)ngpa->gpa->hpa | ||
127 | |||
128 | (*) the guest hypervisor will encode the ngva->gpa translation into its page | ||
129 | tables if npt is not present | ||
130 | |||
131 | Shadow pages contain the following information: | ||
132 | role.level: | ||
133 | The level in the shadow paging hierarchy that this shadow page belongs to. | ||
134 | 1=4k sptes, 2=2M sptes, 3=1G sptes, etc. | ||
135 | role.direct: | ||
136 | If set, leaf sptes reachable from this page are for a linear range. | ||
137 | Examples include real mode translation, large guest pages backed by small | ||
138 | host pages, and gpa->hpa translations when NPT or EPT is active. | ||
139 | The linear range starts at (gfn << PAGE_SHIFT) and its size is determined | ||
140 | by role.level (2MB for first level, 1GB for second level, 0.5TB for third | ||
141 | level, 256TB for fourth level) | ||
142 | If clear, this page corresponds to a guest page table denoted by the gfn | ||
143 | field. | ||
144 | role.quadrant: | ||
145 | When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit | ||
146 | sptes. That means a guest page table contains more ptes than the host, | ||
147 | so multiple shadow pages are needed to shadow one guest page. | ||
148 | For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the | ||
149 | first or second 512-gpte block in the guest page table. For second-level | ||
150 | page tables, each 32-bit gpte is converted to two 64-bit sptes | ||
151 | (since each first-level guest page is shadowed by two first-level | ||
152 | shadow pages) so role.quadrant takes values in the range 0..3. Each | ||
153 | quadrant maps 1GB virtual address space. | ||
154 | role.access: | ||
155 | Inherited guest access permissions in the form uwx. Note execute | ||
156 | permission is positive, not negative. | ||
157 | role.invalid: | ||
158 | The page is invalid and should not be used. It is a root page that is | ||
159 | currently pinned (by a cpu hardware register pointing to it); once it is | ||
160 | unpinned it will be destroyed. | ||
161 | role.cr4_pae: | ||
162 | Contains the value of cr4.pae for which the page is valid (e.g. whether | ||
163 | 32-bit or 64-bit gptes are in use). | ||
164 | role.nxe: | ||
165 | Contains the value of efer.nxe for which the page is valid. | ||
166 | role.cr0_wp: | ||
167 | Contains the value of cr0.wp for which the page is valid. | ||
168 | gfn: | ||
169 | Either the guest page table containing the translations shadowed by this | ||
170 | page, or the base page frame for linear translations. See role.direct. | ||
171 | spt: | ||
172 | A pageful of 64-bit sptes containing the translations for this page. | ||
173 | Accessed by both kvm and hardware. | ||
174 | The page pointed to by spt will have its page->private pointing back | ||
175 | at the shadow page structure. | ||
176 | sptes in spt point either at guest pages, or at lower-level shadow pages. | ||
177 | Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point | ||
178 | at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte. | ||
179 | The spt array forms a DAG structure with the shadow page as a node, and | ||
180 | guest pages as leaves. | ||
181 | gfns: | ||
182 | An array of 512 guest frame numbers, one for each present pte. Used to | ||
183 | perform a reverse map from a pte to a gfn. When role.direct is set, any | ||
184 | element of this array can be calculated from the gfn field when used; in | ||
185 | this case, the array of gfns is not allocated. See role.direct and gfn. | ||
186 | slot_bitmap: | ||
187 | A bitmap containing one bit per memory slot. If the page contains a pte | ||
188 | mapping a page from memory slot n, then bit n of slot_bitmap will be set | ||
189 | (if a page is aliased among several slots, then it is not guaranteed that | ||
190 | all slots will be marked). | ||
191 | Used during dirty logging to avoid scanning a shadow page if none of its | ||
192 | pages need tracking. | ||
193 | root_count: | ||
194 | A counter keeping track of how many hardware registers (guest cr3 or | ||
195 | pdptrs) are now pointing at the page. While this counter is nonzero, the | ||
196 | page cannot be destroyed. See role.invalid. | ||
197 | multimapped: | ||
198 | Whether there exist multiple sptes pointing at this page. | ||
199 | parent_pte/parent_ptes: | ||
200 | If multimapped is zero, parent_pte points at the single spte that points at | ||
201 | this page's spt. Otherwise, parent_ptes points at a data structure | ||
202 | with a list of parent_ptes. | ||
203 | unsync: | ||
204 | If true, then the translations in this page may not match the guest's | ||
205 | translation. This is equivalent to the state of the tlb when a pte is | ||
206 | changed but before the tlb entry is flushed. Accordingly, unsync ptes | ||
207 | are synchronized when the guest executes invlpg or flushes its tlb by | ||
208 | other means. Valid for leaf pages. | ||
209 | unsync_children: | ||
210 | How many sptes in the page point at pages that are unsync (or have | ||
211 | unsynchronized children). | ||
212 | unsync_child_bitmap: | ||
213 | A bitmap indicating which sptes in spt point (directly or indirectly) at | ||
214 | pages that may be unsynchronized. Used to quickly locate all unsynchronized | ||
215 | pages reachable from a given page. | ||
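The linear range sizes quoted under role.direct follow from 4k pages and 512 sptes per shadow page. A minimal sketch of that arithmetic, with hypothetical macro and function names that are not taken from the kvm sources:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT  12   /* 4k pages */
#define SKETCH_INDEX_BITS   9   /* 512 sptes per shadow page */

/* Bytes covered by one direct shadow page at the given role.level (1..4):
 * each level multiplies the reach of the level below by 512. */
uint64_t direct_range_size(unsigned level)
{
    assert(level >= 1 && level <= 4);
    return 1ULL << (SKETCH_PAGE_SHIFT + SKETCH_INDEX_BITS * level);
}

/* First guest physical address of the linear range (gfn << PAGE_SHIFT). */
uint64_t direct_range_start(uint64_t gfn)
{
    return gfn << SKETCH_PAGE_SHIFT;
}
```

Level 1 gives 2^21 bytes (2MB), level 2 gives 2^30 (1GB), level 3 gives 2^39 (0.5TB) and level 4 gives 2^48 (256TB), matching the figures in the role.direct entry.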
216 | |||
217 | Reverse map | ||
218 | =========== | ||
219 | |||
220 | The mmu maintains a reverse mapping whereby all ptes mapping a page can be | ||
221 | reached given its gfn. This is used, for example, when swapping out a page. | ||
222 | |||
223 | Synchronized and unsynchronized pages | ||
224 | ===================================== | ||
225 | |||
226 | The guest uses two events to synchronize its tlb and page tables: tlb flushes | ||
227 | and page invalidations (invlpg). | ||
228 | |||
229 | A tlb flush means that we need to synchronize all sptes reachable from the | ||
230 | guest's cr3. This is expensive, so we keep all guest page tables write | ||
231 | protected, and synchronize sptes to gptes when a gpte is written. | ||
232 | |||
233 | A special case is when a guest page table is reachable from the current | ||
234 | guest cr3. In this case, the guest is obliged to issue an invlpg instruction | ||
235 | before using the translation. We take advantage of that by removing write | ||
236 | protection from the guest page, and allowing the guest to modify it freely. | ||
237 | We synchronize modified gptes when the guest invokes invlpg. This reduces | ||
238 | the amount of emulation we have to do when the guest modifies multiple gptes, | ||
239 | or when a guest page is no longer used as a page table and is used for | ||
240 | random guest data. | ||
241 | |||
242 | As a side effect we have to resynchronize all reachable unsynchronized shadow | ||
243 | pages on a tlb flush. | ||
244 | |||
245 | |||
246 | Reaction to events | ||
247 | ================== | ||
248 | |||
249 | - guest page fault (or npt page fault, or ept violation) | ||
250 | |||
251 | This is the most complicated event. The cause of a page fault can be: | ||
252 | |||
253 | - a true guest fault (the guest translation won't allow the access) (*) | ||
254 | - access to a missing translation | ||
255 | - access to a protected translation | ||
256 | - when logging dirty pages, memory is write protected | ||
257 | - synchronized shadow pages are write protected (*) | ||
258 | - access to untranslatable memory (mmio) | ||
259 | |||
260 | (*) not applicable in direct mode | ||
261 | |||
262 | Handling a page fault is performed as follows: | ||
263 | |||
264 | - if needed, walk the guest page tables to determine the guest translation | ||
265 | (gva->gpa or ngpa->gpa) | ||
266 | - if permissions are insufficient, reflect the fault back to the guest | ||
267 | - determine the host page | ||
268 | - if this is an mmio request, there is no host page; call the emulator | ||
269 | to emulate the instruction instead | ||
270 | - walk the shadow page table to find the spte for the translation, | ||
271 | instantiating missing intermediate page tables as necessary | ||
272 | - try to unsynchronize the page | ||
273 | - if successful, we can let the guest continue and modify the gpte | ||
274 | - emulate the instruction | ||
275 | - if failed, unshadow the page and let the guest continue | ||
276 | - update any translations that were modified by the instruction | ||
277 | |||
278 | invlpg handling: | ||
279 | |||
280 | - walk the shadow page hierarchy and drop affected translations | ||
281 | - try to reinstantiate the indicated translation in the hope that the | ||
282 | guest will use it in the near future | ||
283 | |||
284 | Guest control register updates: | ||
285 | |||
286 | - mov to cr3 | ||
287 | - look up new shadow roots | ||
288 | - synchronize newly reachable shadow pages | ||
289 | |||
290 | - mov to cr0/cr4/efer | ||
291 | - set up mmu context for new paging mode | ||
292 | - look up new shadow roots | ||
293 | - synchronize newly reachable shadow pages | ||
294 | |||
295 | Host translation updates: | ||
296 | |||
297 | - mmu notifier called with updated hva | ||
298 | - look up affected sptes through reverse map | ||
299 | - drop (or update) translations | ||
300 | |||
301 | Emulating cr0.wp | ||
302 | ================ | ||
303 | |||
304 | If tdp is not enabled, the host must keep cr0.wp=1 so page write protection | ||
305 | works for the guest kernel, not only guest userspace. When the guest | ||
306 | cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0, | ||
307 | we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the | ||
308 | semantics require allowing any guest kernel access plus user read access). | ||
309 | |||
310 | We handle this by mapping the permissions to two possible sptes, depending | ||
311 | on fault type: | ||
312 | |||
313 | - kernel write fault: spte.u=0, spte.w=1 (allows full kernel access, | ||
314 | disallows user access) | ||
315 | - read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel | ||
316 | write access) | ||
317 | |||
318 | (user write faults generate a #PF) | ||
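The mapping above can be sketched in C. This is an illustrative model only, not the kernel's code: the names `spte_perms` and `wp0_spte_perms` are hypothetical, and the sketch covers just the gpte.u=1, gpte.w=0 case while the guest runs with cr0.wp=0.

```c
#include <stdbool.h>

/* Hypothetical model of the two spte permission bits discussed above. */
struct spte_perms {
    bool u; /* user-accessible */
    bool w; /* writable */
};

/*
 * Choose spte permissions for a gpte with u=1, w=0 while the guest has
 * cr0.wp=0.  Returns false when the fault has to be reflected to the
 * guest as a #PF (a user-mode write).
 */
static bool wp0_spte_perms(bool write_fault, bool user_fault,
                           struct spte_perms *out)
{
    if (write_fault && user_fault)
        return false;   /* user write: the guest gets a #PF */

    if (write_fault) {
        /* kernel write: full kernel access, no user access */
        out->u = false;
        out->w = true;
    } else {
        /* read (kernel or user): read-only, user-accessible */
        out->u = true;
        out->w = false;
    }
    return true;
}
```

Because each spte grants only part of the guest-visible permissions, a later access of the other kind faults again and the spte is replaced with the other variant.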
319 | |||
320 | Large pages | ||
321 | =========== | ||
322 | |||
323 | The mmu supports all combinations of large and small guest and host pages. | ||
324 | Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as | ||
325 | two separate 2M pages, on both guest and host, since the mmu always uses PAE | ||
326 | paging. | ||
327 | |||
328 | To instantiate a large spte, four constraints must be satisfied: | ||
329 | |||
330 | - the spte must point to a large host page | ||
331 | - the guest pte must be a large pte of at least equivalent size (if tdp is | ||
332 | enabled, there is no guest pte and this condition is satisfied) | ||
333 | - if the spte will be writeable, the large page frame may not overlap any | ||
334 | write-protected pages | ||
335 | - the guest page must be wholly contained by a single memory slot | ||
336 | |||
337 | To check the last two conditions, the mmu maintains a ->write_count set of | ||
338 | arrays for each memory slot and large page size. Every write protected page | ||
339 | causes its write_count to be incremented, thus preventing instantiation of | ||
340 | a large spte. The frames at the end of an unaligned memory slot have | ||
341 | artificially inflated ->write_counts so they can never be instantiated. | ||
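A simplified model of the write_count bookkeeping, for one large page size (2M), might look like the following. The structure and function names are hypothetical, and the sketch assumes the slot's base frame number is 2M-aligned; it is not the kernel's data layout.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_LARGE 512 /* 4k pages per 2M frame */

/* Hypothetical, simplified memory slot: one counter per 2M frame. */
struct slot_model {
    unsigned long base_gfn; /* first guest frame number, assumed 2M-aligned */
    size_t nr_large;        /* number of 2M frames tracked */
    int *write_count;       /* nr_large entries */
};

/* Index of the 2M frame that contains the given 4k guest frame. */
static size_t large_index(const struct slot_model *s, unsigned long gfn)
{
    return (gfn - s->base_gfn) / PAGES_PER_LARGE;
}

/* Called when a 4k page becomes write-protected. */
static void account_write_protect(struct slot_model *s, unsigned long gfn)
{
    s->write_count[large_index(s, gfn)]++;
}

/* Called when the write protection on a 4k page is dropped again. */
static void unaccount_write_protect(struct slot_model *s, unsigned long gfn)
{
    s->write_count[large_index(s, gfn)]--;
}

/* A large spte may be instantiated only if the counter is zero. */
static bool can_map_large(const struct slot_model *s, unsigned long gfn)
{
    return s->write_count[large_index(s, gfn)] == 0;
}
```

Pre-loading a non-zero count into the frames at the ragged ends of an unaligned slot lets the same zero check reject them, with no separate boundary test.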
342 | |||
343 | Further reading | ||
344 | =============== | ||
345 | |||
346 | - NPT presentation from KVM Forum 2008 | ||
347 | http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf | ||
348 | |||
diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt deleted file mode 100644 index 8ddcfe84c09a..000000000000 --- a/Documentation/kvm/msr.txt +++ /dev/null | |||
@@ -1,153 +0,0 @@ | |||
1 | KVM-specific MSRs. | ||
2 | Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 | ||
3 | ===================================================== | ||
4 | |||
5 | KVM makes use of some custom MSRs to service some requests. | ||
6 | At present, this facility is only used by kvmclock. | ||
7 | |||
8 | Custom MSRs have a range reserved for them, which goes from | ||
9 | 0x4b564d00 to 0x4b564dff. There are MSRs outside this area, | ||
10 | but they are deprecated and their use is discouraged. | ||
11 | |||
12 | Custom MSR list | ||
13 | --------------- | ||
14 | |||
15 | The currently supported custom MSRs are: | ||
16 | |||
17 | MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 | ||
18 | |||
19 | data: 4-byte aligned physical address of a memory area which must be | ||
20 | in guest RAM. This memory is expected to hold a copy of the following | ||
21 | structure: | ||
22 | |||
23 | struct pvclock_wall_clock { | ||
24 | u32 version; | ||
25 | u32 sec; | ||
26 | u32 nsec; | ||
27 | } __attribute__((__packed__)); | ||
28 | |||
29 | whose data will be filled in by the hypervisor. The hypervisor is only | ||
30 | guaranteed to update this data at the moment of MSR write. | ||
31 | Users that want to reliably query this information more than once have | ||
32 | to write more than once to this MSR. Fields have the following meanings: | ||
33 | |||
34 | version: guest has to check version before and after grabbing | ||
35 | time information and check that they are both equal and even. | ||
36 | An odd version indicates an in-progress update. | ||
37 | |||
38 | sec: number of seconds for wallclock. | ||
39 | |||
40 | nsec: number of nanoseconds for wallclock. | ||
41 | |||
42 | Note that although MSRs are per-CPU entities, the effect of this | ||
43 | particular MSR is global. | ||
44 | |||
45 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | ||
46 | leaf prior to usage. | ||
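The version check described above can be sketched as a retry loop. This is a minimal illustration, not the guest kernel's implementation: `read_wall_clock` is a hypothetical name, fixed-width types from `<stdint.h>` stand in for u32, and the read barriers a real guest would need between the loads are omitted.

```c
#include <stdint.h>

struct pvclock_wall_clock {
    uint32_t version;
    uint32_t sec;
    uint32_t nsec;
} __attribute__((__packed__));

/*
 * Take a consistent snapshot of the wall clock area.  The loop retries
 * while the version changed during the read or is odd (update in
 * progress).  Memory barriers are omitted in this sketch.
 */
static void read_wall_clock(const volatile struct pvclock_wall_clock *wc,
                            uint32_t *sec, uint32_t *nsec)
{
    uint32_t before, after;

    do {
        before = wc->version;
        *sec = wc->sec;
        *nsec = wc->nsec;
        after = wc->version;
    } while (before != after || (before & 1));
}
```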
47 | |||
48 | MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 | ||
49 | |||
50 | data: 4-byte aligned physical address of a memory area which must be in | ||
51 | guest RAM, plus an enable bit in bit 0. This memory is expected to hold | ||
52 | a copy of the following structure: | ||
53 | |||
54 | struct pvclock_vcpu_time_info { | ||
55 | u32 version; | ||
56 | u32 pad0; | ||
57 | u64 tsc_timestamp; | ||
58 | u64 system_time; | ||
59 | u32 tsc_to_system_mul; | ||
60 | s8 tsc_shift; | ||
61 | u8 flags; | ||
62 | u8 pad[2]; | ||
63 | } __attribute__((__packed__)); /* 32 bytes */ | ||
64 | |||
65 | whose data will be filled in by the hypervisor periodically. Only one | ||
66 | write, or registration, is needed for each VCPU. The interval between | ||
67 | updates of this structure is arbitrary and implementation-dependent. | ||
68 | The hypervisor may update this structure at any time it sees fit until | ||
69 | a value with bit 0 clear is written to this MSR. | ||
70 | |||
71 | Fields have the following meanings: | ||
72 | |||
73 | version: guest has to check version before and after grabbing | ||
74 | time information and check that they are both equal and even. | ||
75 | An odd version indicates an in-progress update. | ||
76 | |||
77 | tsc_timestamp: the tsc value on the current VCPU at the time | ||
78 | of the update of this structure. Guests can subtract this value | ||
79 | from current tsc to derive a notion of elapsed time since the | ||
80 | structure update. | ||
81 | |||
82 | system_time: a host notion of monotonic time, including sleep | ||
83 | time, at the time this structure was last updated. Unit is | ||
84 | nanoseconds. | ||
85 | |||
86 | tsc_to_system_mul: multiplier used when converting a tsc-related | ||
87 | quantity to nanoseconds. The product must additionally be | ||
88 | shifted right by 32 bits. | ||
89 | |||
90 | tsc_shift: shift applied to a tsc-related quantity before the | ||
91 | multiplication by tsc_to_system_mul, so that the multiplication | ||
92 | does not overflow. A positive value denotes a left shift, a | ||
93 | negative value a right shift. | ||
94 | |||
95 | With this information, guests can derive per-CPU time by | ||
96 | doing: | ||
97 | |||
98 | time = (current_tsc - tsc_timestamp) | ||
99 | time = (tsc_shift >= 0) ? time << tsc_shift : time >> -tsc_shift | ||
100 | time = ((time * tsc_to_system_mul) >> 32) + system_time | ||
101 | |||
102 | flags: bits in this field indicate extended capabilities | ||
103 | coordinated between the guest and the hypervisor. Availability | ||
104 | of specific flags has to be checked in 0x40000001 cpuid leaf. | ||
105 | Current flags are: | ||
106 | |||
107 | flag bit | cpuid bit | meaning | ||
108 | ------------------------------------------------------------- | ||
109 | | | time measures taken across | ||
110 | 0 | 24 | multiple cpus are guaranteed to | ||
111 | | | be monotonic | ||
112 | ------------------------------------------------------------- | ||
113 | |||
114 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | ||
115 | leaf prior to usage. | ||
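A guest-side time computation could be sketched as below. The function names are hypothetical. The scaling convention used here is stated as an assumption taken from the Linux guest implementation: tsc_shift is applied first as a signed shift, and tsc_to_system_mul is treated as a 32.32 fixed-point factor, so the product is shifted right by 32 bits.

```c
#include <stdint.h>

/*
 * Scale a tsc delta to nanoseconds.  Assumed convention: signed shift
 * first (positive = left, negative = right), then multiply by a 32.32
 * fixed-point factor.
 */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;

    /* (delta * mul) >> 32, split so no 128-bit type is needed */
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Per-CPU time in nanoseconds from one snapshot of the structure. */
static uint64_t pvclock_ns(uint64_t tsc_now, uint64_t tsc_timestamp,
                           uint64_t system_time, uint32_t mul, int8_t shift)
{
    return system_time + scale_tsc_delta(tsc_now - tsc_timestamp, mul, shift);
}
```

The fields passed in should come from a snapshot validated with the version check, otherwise values from two different updates may be mixed.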
116 | |||
117 | |||
118 | MSR_KVM_WALL_CLOCK: 0x11 | ||
119 | |||
120 | data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. | ||
121 | |||
122 | This MSR falls outside the reserved KVM range and may be removed in the | ||
123 | future. Its usage is deprecated. | ||
124 | |||
125 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | ||
126 | leaf prior to usage. | ||
127 | |||
128 | MSR_KVM_SYSTEM_TIME: 0x12 | ||
129 | |||
130 | data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. | ||
131 | |||
132 | This MSR falls outside the reserved KVM range and may be removed in the | ||
133 | future. Its usage is deprecated. | ||
134 | |||
135 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | ||
136 | leaf prior to usage. | ||
137 | |||
138 | The suggested algorithm for detecting kvmclock presence is then: | ||
139 | |||
140 | if (!kvm_para_available()) /* refer to cpuid.txt */ | ||
141 | return NON_PRESENT; | ||
142 | |||
143 | flags = cpuid_eax(0x40000001); | ||
144 | if (flags & (1 << 3)) { /* bit 3: new MSRs */ | ||
145 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; | ||
146 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; | ||
147 | return PRESENT; | ||
148 | } else if (flags & (1 << 0)) { /* bit 0: deprecated MSRs */ | ||
149 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; | ||
150 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; | ||
151 | return PRESENT; | ||
152 | } else | ||
153 | return NON_PRESENT; | ||
diff --git a/Documentation/kvm/review-checklist.txt b/Documentation/kvm/review-checklist.txt deleted file mode 100644 index 730475ae1b8d..000000000000 --- a/Documentation/kvm/review-checklist.txt +++ /dev/null | |||
@@ -1,38 +0,0 @@ | |||
1 | Review checklist for kvm patches | ||
2 | ================================ | ||
3 | |||
4 | 1. The patch must follow Documentation/CodingStyle and | ||
5 | Documentation/SubmittingPatches. | ||
6 | |||
7 | 2. Patches should be against kvm.git master branch. | ||
8 | |||
9 | 3. If the patch introduces or modifies a new userspace API: | ||
10 | - the API must be documented in Documentation/kvm/api.txt | ||
11 | - the API must be discoverable using KVM_CHECK_EXTENSION | ||
12 | |||
13 | 4. New state must include support for save/restore. | ||
14 | |||
15 | 5. New features must default to off (userspace should explicitly request them). | ||
16 | Performance improvements can and should default to on. | ||
17 | |||
18 | 6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2 | ||
19 | |||
20 | 7. Emulator changes should be accompanied by unit tests for qemu-kvm.git | ||
21 | kvm/test directory. | ||
22 | |||
23 | 8. Changes should be vendor neutral when possible. Changes to common code | ||
24 | are better than duplicating changes to vendor code. | ||
25 | |||
26 | 9. Similarly, prefer changes to arch independent code over changes to arch | ||
27 | dependent code. | ||
28 | |||
29 | 10. User/kernel interfaces and guest/host interfaces must be 64-bit clean | ||
30 | (all variables and sizes naturally aligned on 64-bit; use specific types | ||
31 | only - u64 rather than ulong). | ||
32 | |||
33 | 11. New guest visible features must either be documented in a hardware manual | ||
34 | or be accompanied by documentation. | ||
35 | |||
36 | 12. Features must be robust against reset and kexec - for example, shared | ||
37 | host/guest memory must be unshared to prevent the host from writing to | ||
38 | guest memory that the guest has not reserved for this purpose. | ||