Diffstat (limited to 'Documentation/virtual/kvm')
-rw-r--r-- | Documentation/virtual/kvm/api.txt | 1451 | ||||
-rw-r--r-- | Documentation/virtual/kvm/cpuid.txt | 45 | ||||
-rw-r--r-- | Documentation/virtual/kvm/locking.txt | 25 | ||||
-rw-r--r-- | Documentation/virtual/kvm/mmu.txt | 348 | ||||
-rw-r--r-- | Documentation/virtual/kvm/msr.txt | 187 | ||||
-rw-r--r-- | Documentation/virtual/kvm/ppc-pv.txt | 196 | ||||
-rw-r--r-- | Documentation/virtual/kvm/review-checklist.txt | 38 | ||||
-rw-r--r-- | Documentation/virtual/kvm/timekeeping.txt | 612 |
8 files changed, 2902 insertions, 0 deletions
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
new file mode 100644
index 000000000000..9bef4e4cec50
--- /dev/null
+++ b/Documentation/virtual/kvm/api.txt
@@ -0,0 +1,1451 @@ | |||
1 | The Definitive KVM (Kernel-based Virtual Machine) API Documentation | ||
2 | =================================================================== | ||
3 | |||
4 | 1. General description | ||
5 | |||
6 | The kvm API is a set of ioctls that are issued to control various aspects | ||
7 | of a virtual machine. The ioctls belong to three classes: | ||
8 | |||
9 | - System ioctls: These query and set global attributes which affect the | ||
10 | whole kvm subsystem. In addition a system ioctl is used to create | ||
11 | virtual machines. | ||
12 | |||
13 | - VM ioctls: These query and set attributes that affect an entire virtual | ||
14 | machine, for example memory layout. In addition a VM ioctl is used to | ||
15 | create virtual cpus (vcpus). | ||
16 | |||
17 | Only run VM ioctls from the same process (address space) that was used | ||
18 | to create the VM. | ||
19 | |||
20 | - vcpu ioctls: These query and set attributes that control the operation | ||
21 | of a single virtual cpu. | ||
22 | |||
23 | Only run vcpu ioctls from the same thread that was used to create the | ||
24 | vcpu. | ||
25 | |||
26 | 2. File descriptors | ||
27 | |||
28 | The kvm API is centered around file descriptors. An initial | ||
29 | open("/dev/kvm") obtains a handle to the kvm subsystem; this handle | ||
30 | can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this | ||
31 | handle will create a VM file descriptor which can be used to issue VM | ||
32 | ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu | ||
33 | and return a file descriptor pointing to it. Finally, ioctls on a vcpu | ||
34 | fd can be used to control the vcpu, including the important task of | ||
35 | actually running guest code. | ||
36 | |||
37 | In general file descriptors can be migrated among processes by means | ||
38 | of fork() and the SCM_RIGHTS facility of unix domain sockets. These | ||
39 | kinds of tricks are explicitly not supported by kvm. While they will | ||
40 | not cause harm to the host, their actual behavior is not guaranteed by | ||
41 | the API. The only supported use is one virtual machine per process, | ||
42 | and one vcpu per thread. | ||
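
The chain of file descriptors described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper name kvm_create_vm_and_vcpu is hypothetical, the caller is assumed to have already opened /dev/kvm, and vcpu id 0 is an arbitrary choice.

```c
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical helper: given a system fd for /dev/kvm, create one VM and
 * one vcpu (id 0). Returns 0 and stores both fds on success; returns -1
 * and leaves *vm_fd and *vcpu_fd untouched on error. */
int kvm_create_vm_and_vcpu(int kvm_fd, int *vm_fd, int *vcpu_fd)
{
    /* System ioctl on the /dev/kvm handle yields a VM fd. */
    int vm = ioctl(kvm_fd, KVM_CREATE_VM, 0);
    if (vm < 0)
        return -1;

    /* VM ioctl on the VM fd yields a vcpu fd. */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    if (vcpu < 0) {
        close(vm);
        return -1;
    }

    *vm_fd = vm;
    *vcpu_fd = vcpu;
    return 0;
}
```

In keeping with the supported usage model, the thread that calls this helper would also be the one that later issues ioctls on the returned vcpu fd.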
43 | |||
44 | 3. Extensions | ||
45 | |||
46 | As of Linux 2.6.22, the KVM ABI has been stabilized: no backward | ||
47 | incompatible changes are allowed. However, there is an extension | ||
48 | facility that allows backward-compatible extensions to the API to be | ||
49 | queried and used. | ||
50 | |||
51 | The extension mechanism is not based on the Linux version number. | ||
52 | Instead, kvm defines extension identifiers and a facility to query | ||
53 | whether a particular extension identifier is available. If it is, a | ||
54 | set of ioctls is available for application use. | ||
55 | |||
56 | 4. API description | ||
57 | |||
58 | This section describes ioctls that can be used to control kvm guests. | ||
59 | For each ioctl, the following information is provided along with a | ||
60 | description: | ||
61 | |||
62 | Capability: which KVM extension provides this ioctl. Can be 'basic', | ||
63 | which means that it will be provided by any kernel that supports | ||
64 | API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which | ||
65 | means availability needs to be checked with KVM_CHECK_EXTENSION | ||
66 | (see section 4.4). | ||
67 | |||
68 | Architectures: which instruction set architectures provide this ioctl. | ||
69 | x86 includes both i386 and x86_64. | ||
70 | |||
71 | Type: system, vm, or vcpu. | ||
72 | |||
73 | Parameters: what parameters are accepted by the ioctl. | ||
74 | |||
75 | Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL) | ||
76 | are not detailed, but errors with specific meanings are. | ||
77 | |||
78 | 4.1 KVM_GET_API_VERSION | ||
79 | |||
80 | Capability: basic | ||
81 | Architectures: all | ||
82 | Type: system ioctl | ||
83 | Parameters: none | ||
84 | Returns: the constant KVM_API_VERSION (=12) | ||
85 | |||
86 | This identifies the API version as the stable kvm API. It is not | ||
87 | expected that this number will change. However, Linux 2.6.20 and | ||
88 | 2.6.21 report earlier versions; these are not documented and not | ||
89 | supported. Applications should refuse to run if KVM_GET_API_VERSION | ||
90 | returns a value other than 12. If this check passes, all ioctls | ||
91 | described as 'basic' will be available. | ||
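
A minimal sketch of the version check recommended above, assuming the caller passes the device path (normally "/dev/kvm"). The helper name kvm_open_checked is hypothetical.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical helper: open the kvm device at `path` and verify that it
 * reports the stable API version. Returns a system fd on success, or -1
 * if the device cannot be opened or reports a version other than 12. */
int kvm_open_checked(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* KVM_GET_API_VERSION takes no parameters. */
    int version = ioctl(fd, KVM_GET_API_VERSION, 0);
    if (version != 12) {
        close(fd);
        return -1;
    }
    return fd;
}
```

An application would typically call kvm_open_checked("/dev/kvm") once at startup and refuse to continue if it returns -1.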
92 | |||
93 | 4.2 KVM_CREATE_VM | ||
94 | |||
95 | Capability: basic | ||
96 | Architectures: all | ||
97 | Type: system ioctl | ||
98 | Parameters: none | ||
99 | Returns: a VM fd that can be used to control the new virtual machine. | ||
100 | |||
101 | The new VM has no virtual cpus and no memory. An mmap() of a VM fd | ||
102 | will access the virtual machine's physical address space; offset zero | ||
103 | corresponds to guest physical address zero. Use of mmap() on a VM fd | ||
104 | is discouraged if userspace memory allocation (KVM_CAP_USER_MEMORY) is | ||
105 | available. | ||
106 | |||
107 | 4.3 KVM_GET_MSR_INDEX_LIST | ||
108 | |||
109 | Capability: basic | ||
110 | Architectures: x86 | ||
111 | Type: system | ||
112 | Parameters: struct kvm_msr_list (in/out) | ||
113 | Returns: 0 on success; -1 on error | ||
114 | Errors: | ||
115 | E2BIG: the msr index list is too big to fit in the array specified by | ||
116 | the user. | ||
117 | |||
118 | struct kvm_msr_list { | ||
119 | __u32 nmsrs; /* number of msrs in entries */ | ||
120 | __u32 indices[0]; | ||
121 | }; | ||
122 | |||
123 | This ioctl returns the guest msrs that are supported. The list varies | ||
124 | by kvm version and host processor, but does not change otherwise. The | ||
125 | user fills in the size of the indices array in nmsrs, and in return | ||
126 | kvm adjusts nmsrs to reflect the actual number of msrs and fills in | ||
127 | the indices array with their numbers. | ||
128 | |||
129 | Note: if kvm indicates that it supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are | ||
130 | not returned in the MSR list, as different vcpus can have a different number | ||
131 | of banks, as set via the KVM_X86_SETUP_MCE ioctl. | ||
132 | |||
133 | 4.4 KVM_CHECK_EXTENSION | ||
134 | |||
135 | Capability: basic | ||
136 | Architectures: all | ||
137 | Type: system ioctl | ||
138 | Parameters: extension identifier (KVM_CAP_*) | ||
139 | Returns: 0 if unsupported; 1 (or some other positive integer) if supported | ||
140 | |||
141 | The API allows the application to query about extensions to the core | ||
142 | kvm API. Userspace passes an extension identifier (an integer) and | ||
143 | receives an integer that describes the extension availability. | ||
144 | Generally 0 means no and 1 means yes, but some extensions may report | ||
145 | additional information in the integer return value. | ||
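
As a rough illustration of the query described above, the wrapper below treats any positive return value as "supported" and passes the value through, since some extensions encode extra information in it. The function name kvm_check_extension is hypothetical.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical wrapper: returns the positive value reported for `cap`,
 * or 0 if the extension is unsupported or the query itself fails. */
int kvm_check_extension(int kvm_fd, long cap)
{
    int ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap);
    return ret > 0 ? ret : 0;
}
```

For example, kvm_check_extension(kvm_fd, KVM_CAP_USER_MEMORY) could be used to decide whether userspace memory allocation is available.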
146 | |||
147 | 4.5 KVM_GET_VCPU_MMAP_SIZE | ||
148 | |||
149 | Capability: basic | ||
150 | Architectures: all | ||
151 | Type: system ioctl | ||
152 | Parameters: none | ||
153 | Returns: size of vcpu mmap area, in bytes | ||
154 | |||
155 | The KVM_RUN ioctl (see section 4.10) communicates with userspace via a shared | ||
156 | memory region. This ioctl returns the size of that region. See the | ||
157 | KVM_RUN documentation for details. | ||
158 | |||
159 | 4.6 KVM_SET_MEMORY_REGION | ||
160 | |||
161 | Capability: basic | ||
162 | Architectures: all | ||
163 | Type: vm ioctl | ||
164 | Parameters: struct kvm_memory_region (in) | ||
165 | Returns: 0 on success, -1 on error | ||
166 | |||
167 | This ioctl is obsolete and has been removed. | ||
168 | |||
169 | 4.7 KVM_CREATE_VCPU | ||
170 | |||
171 | Capability: basic | ||
172 | Architectures: all | ||
173 | Type: vm ioctl | ||
174 | Parameters: vcpu id (apic id on x86) | ||
175 | Returns: vcpu fd on success, -1 on error | ||
176 | |||
177 | This API adds a vcpu to a virtual machine. The vcpu id is a small integer | ||
178 | in the range [0, max_vcpus). | ||
179 | |||
180 | 4.8 KVM_GET_DIRTY_LOG | ||
181 | |||
182 | Capability: basic | ||
183 | Architectures: x86 | ||
184 | Type: vm ioctl | ||
185 | Parameters: struct kvm_dirty_log (in/out) | ||
186 | Returns: 0 on success, -1 on error | ||
187 | |||
188 | /* for KVM_GET_DIRTY_LOG */ | ||
189 | struct kvm_dirty_log { | ||
190 | __u32 slot; | ||
191 | __u32 padding; | ||
192 | union { | ||
193 | void __user *dirty_bitmap; /* one bit per page */ | ||
194 | __u64 padding; | ||
195 | }; | ||
196 | }; | ||
197 | |||
198 | Given a memory slot, return a bitmap containing any pages dirtied | ||
199 | since the last call to this ioctl. Bit 0 is the first page in the | ||
200 | memory slot. Ensure the entire structure is cleared to avoid padding | ||
201 | issues. | ||
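
Once the bitmap has been filled in, userspace has to walk it. The sketch below shows one way to do that. It assumes the bits are packed least-significant-first within each byte, as on little-endian x86; the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Test bit `page` of the bitmap; bit 0 is the first page of the slot.
 * Assumption: bits are packed least-significant-first within each byte. */
int page_is_dirty(const uint8_t *bitmap, size_t page)
{
    return (bitmap[page / 8] >> (page % 8)) & 1;
}

/* Count how many of the first `npages` pages are marked dirty. */
size_t count_dirty_pages(const uint8_t *bitmap, size_t npages)
{
    size_t n = 0;
    for (size_t i = 0; i < npages; i++)
        n += (size_t)page_is_dirty(bitmap, i);
    return n;
}
```

The buffer passed in dirty_bitmap needs at least one bit per page in the slot, so a caller would size it from the slot's page count before issuing the ioctl.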
202 | |||
203 | 4.9 KVM_SET_MEMORY_ALIAS | ||
204 | |||
205 | Capability: basic | ||
206 | Architectures: x86 | ||
207 | Type: vm ioctl | ||
208 | Parameters: struct kvm_memory_alias (in) | ||
209 | Returns: 0 on success, -1 on error | ||
210 | |||
211 | This ioctl is obsolete and has been removed. | ||
212 | |||
213 | 4.10 KVM_RUN | ||
214 | |||
215 | Capability: basic | ||
216 | Architectures: all | ||
217 | Type: vcpu ioctl | ||
218 | Parameters: none | ||
219 | Returns: 0 on success, -1 on error | ||
220 | Errors: | ||
221 | EINTR: an unmasked signal is pending | ||
222 | |||
223 | This ioctl is used to run a guest virtual cpu. While there are no | ||
224 | explicit parameters, there is an implicit parameter block that can be | ||
225 | obtained by mmap()ing the vcpu fd at offset 0, with the size given by | ||
226 | KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct | ||
227 | kvm_run' (see below). | ||
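
Obtaining the implicit parameter block might look like the sketch below. The helper name kvm_map_run is hypothetical, and the protection and sharing flags (read/write, MAP_SHARED) are assumptions that match common usage rather than requirements stated in this document.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Hypothetical helper: map the kvm_run area of a vcpu. `kvm_fd` is the
 * system fd (used to query the size), `vcpu_fd` the vcpu to map.
 * Returns NULL on failure. */
struct kvm_run *kvm_map_run(int kvm_fd, int vcpu_fd)
{
    /* System ioctl: size of the shared region, in bytes. */
    int size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (size <= 0)
        return NULL;

    /* Map the vcpu fd at offset 0 with the size reported above. */
    void *p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, vcpu_fd, 0);
    return p == MAP_FAILED ? NULL : (struct kvm_run *)p;
}
```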
228 | |||
229 | 4.11 KVM_GET_REGS | ||
230 | |||
231 | Capability: basic | ||
232 | Architectures: all | ||
233 | Type: vcpu ioctl | ||
234 | Parameters: struct kvm_regs (out) | ||
235 | Returns: 0 on success, -1 on error | ||
236 | |||
237 | Reads the general purpose registers from the vcpu. | ||
238 | |||
239 | /* x86 */ | ||
240 | struct kvm_regs { | ||
241 | /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */ | ||
242 | __u64 rax, rbx, rcx, rdx; | ||
243 | __u64 rsi, rdi, rsp, rbp; | ||
244 | __u64 r8, r9, r10, r11; | ||
245 | __u64 r12, r13, r14, r15; | ||
246 | __u64 rip, rflags; | ||
247 | }; | ||
248 | |||
249 | 4.12 KVM_SET_REGS | ||
250 | |||
251 | Capability: basic | ||
252 | Architectures: all | ||
253 | Type: vcpu ioctl | ||
254 | Parameters: struct kvm_regs (in) | ||
255 | Returns: 0 on success, -1 on error | ||
256 | |||
257 | Writes the general purpose registers into the vcpu. | ||
258 | |||
259 | See KVM_GET_REGS for the data structure. | ||
260 | |||
261 | 4.13 KVM_GET_SREGS | ||
262 | |||
263 | Capability: basic | ||
264 | Architectures: x86 | ||
265 | Type: vcpu ioctl | ||
266 | Parameters: struct kvm_sregs (out) | ||
267 | Returns: 0 on success, -1 on error | ||
268 | |||
269 | Reads special registers from the vcpu. | ||
270 | |||
271 | /* x86 */ | ||
272 | struct kvm_sregs { | ||
273 | struct kvm_segment cs, ds, es, fs, gs, ss; | ||
274 | struct kvm_segment tr, ldt; | ||
275 | struct kvm_dtable gdt, idt; | ||
276 | __u64 cr0, cr2, cr3, cr4, cr8; | ||
277 | __u64 efer; | ||
278 | __u64 apic_base; | ||
279 | __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64]; | ||
280 | }; | ||
281 | |||
282 | interrupt_bitmap is a bitmap of pending external interrupts. At most | ||
283 | one bit may be set. This interrupt has been acknowledged by the APIC | ||
284 | but not yet injected into the cpu core. | ||
285 | |||
286 | 4.14 KVM_SET_SREGS | ||
287 | |||
288 | Capability: basic | ||
289 | Architectures: x86 | ||
290 | Type: vcpu ioctl | ||
291 | Parameters: struct kvm_sregs (in) | ||
292 | Returns: 0 on success, -1 on error | ||
293 | |||
294 | Writes special registers into the vcpu. See KVM_GET_SREGS for the | ||
295 | data structures. | ||
296 | |||
297 | 4.15 KVM_TRANSLATE | ||
298 | |||
299 | Capability: basic | ||
300 | Architectures: x86 | ||
301 | Type: vcpu ioctl | ||
302 | Parameters: struct kvm_translation (in/out) | ||
303 | Returns: 0 on success, -1 on error | ||
304 | |||
305 | Translates a virtual address according to the vcpu's current address | ||
306 | translation mode. | ||
307 | |||
308 | struct kvm_translation { | ||
309 | /* in */ | ||
310 | __u64 linear_address; | ||
311 | |||
312 | /* out */ | ||
313 | __u64 physical_address; | ||
314 | __u8 valid; | ||
315 | __u8 writeable; | ||
316 | __u8 usermode; | ||
317 | __u8 pad[5]; | ||
318 | }; | ||
319 | |||
320 | 4.16 KVM_INTERRUPT | ||
321 | |||
322 | Capability: basic | ||
323 | Architectures: x86, ppc | ||
324 | Type: vcpu ioctl | ||
325 | Parameters: struct kvm_interrupt (in) | ||
326 | Returns: 0 on success, -1 on error | ||
327 | |||
328 | Queues a hardware interrupt vector to be injected. This is only | ||
329 | useful if in-kernel local APIC or equivalent is not used. | ||
330 | |||
331 | /* for KVM_INTERRUPT */ | ||
332 | struct kvm_interrupt { | ||
333 | /* in */ | ||
334 | __u32 irq; | ||
335 | }; | ||
336 | |||
337 | X86: | ||
338 | |||
339 | Note 'irq' is an interrupt vector, not an interrupt pin or line. | ||
340 | |||
341 | PPC: | ||
342 | |||
343 | Queues an external interrupt to be injected. This ioctl is overloaded | ||
344 | with 3 different irq values: | ||
345 | |||
346 | a) KVM_INTERRUPT_SET | ||
347 | |||
348 | This injects an edge type external interrupt into the guest once it's ready | ||
349 | to receive interrupts. When injected, the interrupt is done. | ||
350 | |||
351 | b) KVM_INTERRUPT_UNSET | ||
352 | |||
353 | This unsets any pending interrupt. | ||
354 | |||
355 | Only available with KVM_CAP_PPC_UNSET_IRQ. | ||
356 | |||
357 | c) KVM_INTERRUPT_SET_LEVEL | ||
358 | |||
359 | This injects a level type external interrupt into the guest context. The | ||
360 | interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET | ||
361 | is triggered. | ||
362 | |||
363 | Only available with KVM_CAP_PPC_IRQ_LEVEL. | ||
364 | |||
365 | Note that any value for 'irq' other than the ones stated above is invalid | ||
366 | and results in unexpected behavior. | ||
367 | |||
368 | 4.17 KVM_DEBUG_GUEST | ||
369 | |||
370 | Capability: basic | ||
371 | Architectures: none | ||
372 | Type: vcpu ioctl | ||
373 | Parameters: none | ||
374 | Returns: -1 on error | ||
375 | |||
376 | Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead. | ||
377 | |||
378 | 4.18 KVM_GET_MSRS | ||
379 | |||
380 | Capability: basic | ||
381 | Architectures: x86 | ||
382 | Type: vcpu ioctl | ||
383 | Parameters: struct kvm_msrs (in/out) | ||
384 | Returns: 0 on success, -1 on error | ||
385 | |||
386 | Reads model-specific registers from the vcpu. Supported msr indices can | ||
387 | be obtained using KVM_GET_MSR_INDEX_LIST. | ||
388 | |||
389 | struct kvm_msrs { | ||
390 | __u32 nmsrs; /* number of msrs in entries */ | ||
391 | __u32 pad; | ||
392 | |||
393 | struct kvm_msr_entry entries[0]; | ||
394 | }; | ||
395 | |||
396 | struct kvm_msr_entry { | ||
397 | __u32 index; | ||
398 | __u32 reserved; | ||
399 | __u64 data; | ||
400 | }; | ||
401 | |||
402 | Application code should set the 'nmsrs' member (which indicates the | ||
403 | size of the entries array) and the 'index' member of each array entry. | ||
404 | kvm will fill in the 'data' member. | ||
405 | |||
406 | 4.19 KVM_SET_MSRS | ||
407 | |||
408 | Capability: basic | ||
409 | Architectures: x86 | ||
410 | Type: vcpu ioctl | ||
411 | Parameters: struct kvm_msrs (in) | ||
412 | Returns: 0 on success, -1 on error | ||
413 | |||
414 | Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the | ||
415 | data structures. | ||
416 | |||
417 | Application code should set the 'nmsrs' member (which indicates the | ||
418 | size of the entries array), and the 'index' and 'data' members of each | ||
419 | array entry. | ||
420 | |||
421 | 4.20 KVM_SET_CPUID | ||
422 | |||
423 | Capability: basic | ||
424 | Architectures: x86 | ||
425 | Type: vcpu ioctl | ||
426 | Parameters: struct kvm_cpuid (in) | ||
427 | Returns: 0 on success, -1 on error | ||
428 | |||
429 | Defines the vcpu responses to the cpuid instruction. Applications | ||
430 | should use the KVM_SET_CPUID2 ioctl if available. | ||
431 | |||
432 | |||
433 | struct kvm_cpuid_entry { | ||
434 | __u32 function; | ||
435 | __u32 eax; | ||
436 | __u32 ebx; | ||
437 | __u32 ecx; | ||
438 | __u32 edx; | ||
439 | __u32 padding; | ||
440 | }; | ||
441 | |||
442 | /* for KVM_SET_CPUID */ | ||
443 | struct kvm_cpuid { | ||
444 | __u32 nent; | ||
445 | __u32 padding; | ||
446 | struct kvm_cpuid_entry entries[0]; | ||
447 | }; | ||
448 | |||
449 | 4.21 KVM_SET_SIGNAL_MASK | ||
450 | |||
451 | Capability: basic | ||
452 | Architectures: x86 | ||
453 | Type: vcpu ioctl | ||
454 | Parameters: struct kvm_signal_mask (in) | ||
455 | Returns: 0 on success, -1 on error | ||
456 | |||
457 | Defines which signals are blocked during execution of KVM_RUN. This | ||
458 | signal mask temporarily overrides the thread's signal mask. Any | ||
459 | unblocked signal received (except SIGKILL and SIGSTOP, which retain | ||
460 | their traditional behaviour) will cause KVM_RUN to return with -EINTR. | ||
461 | |||
462 | Note the signal will only be delivered if not blocked by the original | ||
463 | signal mask. | ||
464 | |||
465 | /* for KVM_SET_SIGNAL_MASK */ | ||
466 | struct kvm_signal_mask { | ||
467 | __u32 len; | ||
468 | __u8 sigset[0]; | ||
469 | }; | ||
470 | |||
471 | 4.22 KVM_GET_FPU | ||
472 | |||
473 | Capability: basic | ||
474 | Architectures: x86 | ||
475 | Type: vcpu ioctl | ||
476 | Parameters: struct kvm_fpu (out) | ||
477 | Returns: 0 on success, -1 on error | ||
478 | |||
479 | Reads the floating point state from the vcpu. | ||
480 | |||
481 | /* for KVM_GET_FPU and KVM_SET_FPU */ | ||
482 | struct kvm_fpu { | ||
483 | __u8 fpr[8][16]; | ||
484 | __u16 fcw; | ||
485 | __u16 fsw; | ||
486 | __u8 ftwx; /* in fxsave format */ | ||
487 | __u8 pad1; | ||
488 | __u16 last_opcode; | ||
489 | __u64 last_ip; | ||
490 | __u64 last_dp; | ||
491 | __u8 xmm[16][16]; | ||
492 | __u32 mxcsr; | ||
493 | __u32 pad2; | ||
494 | }; | ||
495 | |||
496 | 4.23 KVM_SET_FPU | ||
497 | |||
498 | Capability: basic | ||
499 | Architectures: x86 | ||
500 | Type: vcpu ioctl | ||
501 | Parameters: struct kvm_fpu (in) | ||
502 | Returns: 0 on success, -1 on error | ||
503 | |||
504 | Writes the floating point state to the vcpu. | ||
505 | |||
506 | /* for KVM_GET_FPU and KVM_SET_FPU */ | ||
507 | struct kvm_fpu { | ||
508 | __u8 fpr[8][16]; | ||
509 | __u16 fcw; | ||
510 | __u16 fsw; | ||
511 | __u8 ftwx; /* in fxsave format */ | ||
512 | __u8 pad1; | ||
513 | __u16 last_opcode; | ||
514 | __u64 last_ip; | ||
515 | __u64 last_dp; | ||
516 | __u8 xmm[16][16]; | ||
517 | __u32 mxcsr; | ||
518 | __u32 pad2; | ||
519 | }; | ||
520 | |||
521 | 4.24 KVM_CREATE_IRQCHIP | ||
522 | |||
523 | Capability: KVM_CAP_IRQCHIP | ||
524 | Architectures: x86, ia64 | ||
525 | Type: vm ioctl | ||
526 | Parameters: none | ||
527 | Returns: 0 on success, -1 on error | ||
528 | |||
529 | Creates an interrupt controller model in the kernel. On x86, creates a virtual | ||
530 | ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a | ||
531 | local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23 | ||
532 | only go to the IOAPIC. On ia64, an IOSAPIC is created. | ||
533 | |||
534 | 4.25 KVM_IRQ_LINE | ||
535 | |||
536 | Capability: KVM_CAP_IRQCHIP | ||
537 | Architectures: x86, ia64 | ||
538 | Type: vm ioctl | ||
539 | Parameters: struct kvm_irq_level | ||
540 | Returns: 0 on success, -1 on error | ||
541 | |||
542 | Sets the level of a GSI input to the interrupt controller model in the kernel. | ||
543 | Requires that an interrupt controller model has been previously created with | ||
544 | KVM_CREATE_IRQCHIP. Note that edge-triggered interrupts require the level | ||
545 | to be set to 1 and then back to 0. | ||
546 | |||
547 | struct kvm_irq_level { | ||
548 | union { | ||
549 | __u32 irq; /* GSI */ | ||
550 | __s32 status; /* not used for KVM_IRQ_LEVEL */ | ||
551 | }; | ||
552 | __u32 level; /* 0 or 1 */ | ||
553 | }; | ||
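
The note about edge-triggered interrupts can be illustrated with a small sketch that raises and then lowers a GSI. The helper name kvm_pulse_irq is hypothetical, and the VM is assumed to already have an in-kernel interrupt controller created with KVM_CREATE_IRQCHIP.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper: deliver an edge-triggered interrupt on `gsi` by
 * setting the line to 1 and then back to 0. Returns 0 on success, -1 if
 * either ioctl fails. */
int kvm_pulse_irq(int vm_fd, unsigned int gsi)
{
    struct kvm_irq_level lvl;

    memset(&lvl, 0, sizeof(lvl));
    lvl.irq = gsi;

    lvl.level = 1;                      /* rising edge */
    if (ioctl(vm_fd, KVM_IRQ_LINE, &lvl) < 0)
        return -1;

    lvl.level = 0;                      /* return the line to idle */
    if (ioctl(vm_fd, KVM_IRQ_LINE, &lvl) < 0)
        return -1;

    return 0;
}
```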
554 | |||
555 | 4.26 KVM_GET_IRQCHIP | ||
556 | |||
557 | Capability: KVM_CAP_IRQCHIP | ||
558 | Architectures: x86, ia64 | ||
559 | Type: vm ioctl | ||
560 | Parameters: struct kvm_irqchip (in/out) | ||
561 | Returns: 0 on success, -1 on error | ||
562 | |||
563 | Reads the state of a kernel interrupt controller created with | ||
564 | KVM_CREATE_IRQCHIP into a buffer provided by the caller. | ||
565 | |||
566 | struct kvm_irqchip { | ||
567 | __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ | ||
568 | __u32 pad; | ||
569 | union { | ||
570 | char dummy[512]; /* reserving space */ | ||
571 | struct kvm_pic_state pic; | ||
572 | struct kvm_ioapic_state ioapic; | ||
573 | } chip; | ||
574 | }; | ||
575 | |||
576 | 4.27 KVM_SET_IRQCHIP | ||
577 | |||
578 | Capability: KVM_CAP_IRQCHIP | ||
579 | Architectures: x86, ia64 | ||
580 | Type: vm ioctl | ||
581 | Parameters: struct kvm_irqchip (in) | ||
582 | Returns: 0 on success, -1 on error | ||
583 | |||
584 | Sets the state of a kernel interrupt controller created with | ||
585 | KVM_CREATE_IRQCHIP from a buffer provided by the caller. | ||
586 | |||
587 | struct kvm_irqchip { | ||
588 | __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ | ||
589 | __u32 pad; | ||
590 | union { | ||
591 | char dummy[512]; /* reserving space */ | ||
592 | struct kvm_pic_state pic; | ||
593 | struct kvm_ioapic_state ioapic; | ||
594 | } chip; | ||
595 | }; | ||
596 | |||
597 | 4.28 KVM_XEN_HVM_CONFIG | ||
598 | |||
599 | Capability: KVM_CAP_XEN_HVM | ||
600 | Architectures: x86 | ||
601 | Type: vm ioctl | ||
602 | Parameters: struct kvm_xen_hvm_config (in) | ||
603 | Returns: 0 on success, -1 on error | ||
604 | |||
605 | Sets the MSR that the Xen HVM guest uses to initialize its hypercall | ||
606 | page, and provides the starting address and size of the hypercall | ||
607 | blobs in userspace. When the guest writes the MSR, kvm copies one | ||
608 | page of a blob (32- or 64-bit, depending on the vcpu mode) to guest | ||
609 | memory. | ||
610 | |||
611 | struct kvm_xen_hvm_config { | ||
612 | __u32 flags; | ||
613 | __u32 msr; | ||
614 | __u64 blob_addr_32; | ||
615 | __u64 blob_addr_64; | ||
616 | __u8 blob_size_32; | ||
617 | __u8 blob_size_64; | ||
618 | __u8 pad2[30]; | ||
619 | }; | ||
620 | |||
621 | 4.29 KVM_GET_CLOCK | ||
622 | |||
623 | Capability: KVM_CAP_ADJUST_CLOCK | ||
624 | Architectures: x86 | ||
625 | Type: vm ioctl | ||
626 | Parameters: struct kvm_clock_data (out) | ||
627 | Returns: 0 on success, -1 on error | ||
628 | |||
629 | Gets the current timestamp of kvmclock as seen by the current guest. In | ||
630 | conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity in scenarios | ||
631 | such as migration. | ||
632 | |||
633 | struct kvm_clock_data { | ||
634 | __u64 clock; /* kvmclock current value */ | ||
635 | __u32 flags; | ||
636 | __u32 pad[9]; | ||
637 | }; | ||
638 | |||
639 | 4.30 KVM_SET_CLOCK | ||
640 | |||
641 | Capability: KVM_CAP_ADJUST_CLOCK | ||
642 | Architectures: x86 | ||
643 | Type: vm ioctl | ||
644 | Parameters: struct kvm_clock_data (in) | ||
645 | Returns: 0 on success, -1 on error | ||
646 | |||
647 | Sets the current timestamp of kvmclock to the value specified in its parameter. | ||
648 | In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity in scenarios | ||
649 | such as migration. | ||
650 | |||
651 | struct kvm_clock_data { | ||
652 | __u64 clock; /* kvmclock current value */ | ||
653 | __u32 flags; | ||
654 | __u32 pad[9]; | ||
655 | }; | ||
656 | |||
657 | 4.31 KVM_GET_VCPU_EVENTS | ||
658 | |||
659 | Capability: KVM_CAP_VCPU_EVENTS | ||
660 | Extended by: KVM_CAP_INTR_SHADOW | ||
661 | Architectures: x86 | ||
662 | Type: vcpu ioctl | ||
663 | Parameters: struct kvm_vcpu_events (out) | ||
664 | Returns: 0 on success, -1 on error | ||
665 | |||
666 | Gets currently pending exceptions, interrupts, and NMIs as well as related | ||
667 | states of the vcpu. | ||
668 | |||
669 | struct kvm_vcpu_events { | ||
670 | struct { | ||
671 | __u8 injected; | ||
672 | __u8 nr; | ||
673 | __u8 has_error_code; | ||
674 | __u8 pad; | ||
675 | __u32 error_code; | ||
676 | } exception; | ||
677 | struct { | ||
678 | __u8 injected; | ||
679 | __u8 nr; | ||
680 | __u8 soft; | ||
681 | __u8 shadow; | ||
682 | } interrupt; | ||
683 | struct { | ||
684 | __u8 injected; | ||
685 | __u8 pending; | ||
686 | __u8 masked; | ||
687 | __u8 pad; | ||
688 | } nmi; | ||
689 | __u32 sipi_vector; | ||
690 | __u32 flags; | ||
691 | }; | ||
692 | |||
693 | KVM_VCPUEVENT_VALID_SHADOW may be set in the flags field to signal that | ||
694 | interrupt.shadow contains a valid state. Otherwise, this field is undefined. | ||
695 | |||
696 | 4.32 KVM_SET_VCPU_EVENTS | ||
697 | |||
698 | Capability: KVM_CAP_VCPU_EVENTS | ||
699 | Extended by: KVM_CAP_INTR_SHADOW | ||
700 | Architectures: x86 | ||
701 | Type: vcpu ioctl | ||
702 | Parameters: struct kvm_vcpu_events (in) | ||
703 | Returns: 0 on success, -1 on error | ||
704 | |||
705 | Set pending exceptions, interrupts, and NMIs as well as related states of the | ||
706 | vcpu. | ||
707 | |||
708 | See KVM_GET_VCPU_EVENTS for the data structure. | ||
709 | |||
710 | Fields that may be modified asynchronously by running VCPUs can be excluded | ||
711 | from the update. These fields are nmi.pending and sipi_vector. Keep the | ||
712 | corresponding bits in the flags field cleared to suppress overwriting the | ||
713 | current in-kernel state. The bits are: | ||
714 | |||
715 | KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel | ||
716 | KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector | ||
717 | |||
718 | If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in | ||
719 | the flags field to signal that interrupt.shadow contains a valid state and | ||
720 | shall be written into the VCPU. | ||
721 | |||
722 | 4.33 KVM_GET_DEBUGREGS | ||
723 | |||
724 | Capability: KVM_CAP_DEBUGREGS | ||
725 | Architectures: x86 | ||
726 | Type: vcpu ioctl | ||
727 | Parameters: struct kvm_debugregs (out) | ||
728 | Returns: 0 on success, -1 on error | ||
729 | |||
730 | Reads debug registers from the vcpu. | ||
731 | |||
732 | struct kvm_debugregs { | ||
733 | __u64 db[4]; | ||
734 | __u64 dr6; | ||
735 | __u64 dr7; | ||
736 | __u64 flags; | ||
737 | __u64 reserved[9]; | ||
738 | }; | ||
739 | |||
740 | 4.34 KVM_SET_DEBUGREGS | ||
741 | |||
742 | Capability: KVM_CAP_DEBUGREGS | ||
743 | Architectures: x86 | ||
744 | Type: vcpu ioctl | ||
745 | Parameters: struct kvm_debugregs (in) | ||
746 | Returns: 0 on success, -1 on error | ||
747 | |||
748 | Writes debug registers into the vcpu. | ||
749 | |||
750 | See KVM_GET_DEBUGREGS for the data structure. The flags field is not yet | ||
751 | used and must be cleared on entry. | ||
752 | |||
753 | 4.35 KVM_SET_USER_MEMORY_REGION | ||
754 | |||
755 | Capability: KVM_CAP_USER_MEMORY | ||
756 | Architectures: all | ||
757 | Type: vm ioctl | ||
758 | Parameters: struct kvm_userspace_memory_region (in) | ||
759 | Returns: 0 on success, -1 on error | ||
760 | |||
761 | struct kvm_userspace_memory_region { | ||
762 | __u32 slot; | ||
763 | __u32 flags; | ||
764 | __u64 guest_phys_addr; | ||
765 | __u64 memory_size; /* bytes */ | ||
766 | __u64 userspace_addr; /* start of the userspace allocated memory */ | ||
767 | }; | ||
768 | |||
769 | /* for kvm_memory_region::flags */ | ||
770 | #define KVM_MEM_LOG_DIRTY_PAGES 1UL | ||
771 | |||
772 | This ioctl allows the user to create or modify a guest physical memory | ||
773 | slot. When changing an existing slot, it may be moved in the guest | ||
774 | physical memory space, or its flags may be modified. It may not be | ||
775 | resized. Slots may not overlap in guest physical address space. | ||
776 | |||
777 | Memory for the region is taken starting at the address denoted by the | ||
778 | field userspace_addr, which must point at user addressable memory for | ||
779 | the entire memory slot size. Any object may back this memory, including | ||
780 | anonymous memory, ordinary files, and hugetlbfs. | ||
781 | |||
782 | It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr | ||
783 | be identical. This allows large pages in the guest to be backed by large | ||
784 | pages in the host. | ||
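
The alignment recommendation above amounts to comparing the low 21 bits of the two addresses (2^21 bytes is 2 MiB). A small check, with hypothetical names, could look like this:

```c
#include <stdint.h>

/* Mask covering the low 21 bits of an address. */
#define LOW21_MASK ((UINT64_C(1) << 21) - 1)

/* Returns 1 if guest_phys_addr and userspace_addr agree in their low
 * 21 bits, 0 otherwise. */
int low21_match(uint64_t guest_phys_addr, uint64_t userspace_addr)
{
    return ((guest_phys_addr ^ userspace_addr) & LOW21_MASK) == 0;
}
```

A caller could run this check before filling in struct kvm_userspace_memory_region and, if it fails, adjust where the backing memory is mapped.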
785 | |||
786 | The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which | ||
787 | instructs kvm to keep track of writes to memory within the slot. See | ||
788 | the KVM_GET_DIRTY_LOG ioctl. | ||
789 | |||
790 | When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory | ||
791 | region are automatically reflected into the guest. For example, an mmap() | ||
792 | that affects the region will be made visible immediately. Another example | ||
793 | is madvise(MADV_DROP). | ||
794 | |||
795 | It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl. | ||
796 | The KVM_SET_MEMORY_REGION does not allow fine grained control over memory | ||
797 | allocation and is deprecated. | ||
798 | |||
799 | 4.36 KVM_SET_TSS_ADDR | ||
800 | |||
801 | Capability: KVM_CAP_SET_TSS_ADDR | ||
802 | Architectures: x86 | ||
803 | Type: vm ioctl | ||
804 | Parameters: unsigned long tss_address (in) | ||
805 | Returns: 0 on success, -1 on error | ||
806 | |||
807 | This ioctl defines the physical address of a three-page region in the guest | ||
808 | physical address space. The region must be within the first 4GB of the | ||
809 | guest physical address space and must not conflict with any memory slot | ||
810 | or any mmio address. The guest may malfunction if it accesses this memory | ||
811 | region. | ||
812 | |||
813 | This ioctl is required on Intel-based hosts. This is needed on Intel hardware | ||
814 | because of a quirk in the virtualization implementation (see the internals | ||
815 | documentation when it pops into existence). | ||
816 | |||
817 | 4.37 KVM_ENABLE_CAP | ||
818 | |||
819 | Capability: KVM_CAP_ENABLE_CAP | ||
820 | Architectures: ppc | ||
821 | Type: vcpu ioctl | ||
822 | Parameters: struct kvm_enable_cap (in) | ||
823 | Returns: 0 on success; -1 on error | ||
824 | |||
825 | Not all extensions are enabled by default. Using this ioctl the application | ||
826 | can enable an extension, making it available to the guest. | ||
827 | |||
828 | On systems that do not support this ioctl, it always fails. On systems that | ||
829 | do support it, it only works for extensions that are supported for enablement. | ||
830 | |||
831 | To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should | ||
832 | be used. | ||
833 | |||
834 | struct kvm_enable_cap { | ||
835 | /* in */ | ||
836 | __u32 cap; | ||
837 | |||
838 | The capability that is supposed to get enabled. | ||
839 | |||
840 | __u32 flags; | ||
841 | |||
842 | A bitfield indicating future enhancements. Has to be 0 for now. | ||
843 | |||
844 | __u64 args[4]; | ||
845 | |||
846 | Arguments for enabling a feature. If a feature needs initial values to | ||
847 | function properly, this is the place to put them. | ||
848 | |||
849 | __u8 pad[64]; | ||
850 | }; | ||
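The request structure above can be prepared as in the following minimal sketch.
The struct is a local mirror of the documented layout (real code would take it
from <linux/kvm.h>), and enable_cap_init() is a hypothetical helper name, not
part of the kvm API. It keeps 'flags' and 'pad' zeroed, as required for now.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the layout documented above; real code would use the
 * definition from <linux/kvm.h> instead of redefining it. */
struct kvm_enable_cap {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t  pad[64];
};

/* Hypothetical helper: prepare a request for capability 'cap' with up to
 * four initial arguments. 'flags' and 'pad' are left at zero. */
static void enable_cap_init(struct kvm_enable_cap *req, uint32_t cap,
			    const uint64_t *args, size_t nargs)
{
	assert(nargs <= 4);
	memset(req, 0, sizeof(*req));
	req->cap = cap;
	if (nargs)
		memcpy(req->args, args, nargs * sizeof(args[0]));
}
```

The prepared structure would then be passed to the KVM_ENABLE_CAP ioctl on a
vcpu fd, after KVM_CHECK_EXTENSION has confirmed the capability.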
851 | |||
852 | 4.38 KVM_GET_MP_STATE | ||
853 | |||
854 | Capability: KVM_CAP_MP_STATE | ||
855 | Architectures: x86, ia64 | ||
856 | Type: vcpu ioctl | ||
857 | Parameters: struct kvm_mp_state (out) | ||
858 | Returns: 0 on success; -1 on error | ||
859 | |||
860 | struct kvm_mp_state { | ||
861 | __u32 mp_state; | ||
862 | }; | ||
863 | |||
864 | Returns the vcpu's current "multiprocessing state" (though also valid on | ||
865 | uniprocessor guests). | ||
866 | |||
867 | Possible values are: | ||
868 | |||
869 | - KVM_MP_STATE_RUNNABLE: the vcpu is currently running | ||
870 | - KVM_MP_STATE_UNINITIALIZED: the vcpu is an application processor (AP) | ||
871 | which has not yet received an INIT signal | ||
872 | - KVM_MP_STATE_INIT_RECEIVED: the vcpu has received an INIT signal, and is | ||
873 | now ready for a SIPI | ||
874 | - KVM_MP_STATE_HALTED: the vcpu has executed a HLT instruction and | ||
875 | is waiting for an interrupt | ||
876 | - KVM_MP_STATE_SIPI_RECEIVED: the vcpu has just received a SIPI (vector | ||
877 | accessible via KVM_GET_VCPU_EVENTS) | ||
878 | |||
879 | This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel | ||
880 | irqchip, the multiprocessing state must be maintained by userspace. | ||
881 | |||
882 | 4.39 KVM_SET_MP_STATE | ||
883 | |||
884 | Capability: KVM_CAP_MP_STATE | ||
885 | Architectures: x86, ia64 | ||
886 | Type: vcpu ioctl | ||
887 | Parameters: struct kvm_mp_state (in) | ||
888 | Returns: 0 on success; -1 on error | ||
889 | |||
890 | Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for | ||
891 | arguments. | ||
892 | |||
893 | This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel | ||
894 | irqchip, the multiprocessing state must be maintained by userspace. | ||
895 | |||
896 | 4.40 KVM_SET_IDENTITY_MAP_ADDR | ||
897 | |||
898 | Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR | ||
899 | Architectures: x86 | ||
900 | Type: vm ioctl | ||
901 | Parameters: unsigned long identity (in) | ||
902 | Returns: 0 on success, -1 on error | ||
903 | |||
904 | This ioctl defines the physical address of a one-page region in the guest | ||
905 | physical address space. The region must be within the first 4GB of the | ||
906 | guest physical address space and must not conflict with any memory slot | ||
907 | or any mmio address. The guest may malfunction if it accesses this memory | ||
908 | region. | ||
909 | |||
910 | This ioctl is required on Intel-based hosts. This is needed on Intel hardware | ||
911 | because of a quirk in the virtualization implementation (see the internals | ||
912 | documentation when it pops into existence). | ||
913 | |||
914 | 4.41 KVM_SET_BOOT_CPU_ID | ||
915 | |||
916 | Capability: KVM_CAP_SET_BOOT_CPU_ID | ||
917 | Architectures: x86, ia64 | ||
918 | Type: vm ioctl | ||
919 | Parameters: unsigned long vcpu_id | ||
920 | Returns: 0 on success, -1 on error | ||
921 | |||
922 | Define which vcpu is the Bootstrap Processor (BSP). Values are the same | ||
923 | as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default | ||
924 | is vcpu 0. | ||
925 | |||
926 | 4.42 KVM_GET_XSAVE | ||
927 | |||
928 | Capability: KVM_CAP_XSAVE | ||
929 | Architectures: x86 | ||
930 | Type: vcpu ioctl | ||
931 | Parameters: struct kvm_xsave (out) | ||
932 | Returns: 0 on success, -1 on error | ||
933 | |||
934 | struct kvm_xsave { | ||
935 | __u32 region[1024]; | ||
936 | }; | ||
937 | |||
938 | This ioctl copies the current vcpu's xsave struct to userspace. | ||
939 | |||
940 | 4.43 KVM_SET_XSAVE | ||
941 | |||
942 | Capability: KVM_CAP_XSAVE | ||
943 | Architectures: x86 | ||
944 | Type: vcpu ioctl | ||
945 | Parameters: struct kvm_xsave (in) | ||
946 | Returns: 0 on success, -1 on error | ||
947 | |||
948 | struct kvm_xsave { | ||
949 | __u32 region[1024]; | ||
950 | }; | ||
951 | |||
952 | This ioctl copies userspace's xsave struct to the kernel. | ||
953 | |||
954 | 4.44 KVM_GET_XCRS | ||
955 | |||
956 | Capability: KVM_CAP_XCRS | ||
957 | Architectures: x86 | ||
958 | Type: vcpu ioctl | ||
959 | Parameters: struct kvm_xcrs (out) | ||
960 | Returns: 0 on success, -1 on error | ||
961 | |||
962 | struct kvm_xcr { | ||
963 | __u32 xcr; | ||
964 | __u32 reserved; | ||
965 | __u64 value; | ||
966 | }; | ||
967 | |||
968 | struct kvm_xcrs { | ||
969 | __u32 nr_xcrs; | ||
970 | __u32 flags; | ||
971 | struct kvm_xcr xcrs[KVM_MAX_XCRS]; | ||
972 | __u64 padding[16]; | ||
973 | }; | ||
974 | |||
975 | This ioctl copies the current vcpu's xcrs to userspace. | ||
976 | |||
977 | 4.45 KVM_SET_XCRS | ||
978 | |||
979 | Capability: KVM_CAP_XCRS | ||
980 | Architectures: x86 | ||
981 | Type: vcpu ioctl | ||
982 | Parameters: struct kvm_xcrs (in) | ||
983 | Returns: 0 on success, -1 on error | ||
984 | |||
985 | struct kvm_xcr { | ||
986 | __u32 xcr; | ||
987 | __u32 reserved; | ||
988 | __u64 value; | ||
989 | }; | ||
990 | |||
991 | struct kvm_xcrs { | ||
992 | __u32 nr_xcrs; | ||
993 | __u32 flags; | ||
994 | struct kvm_xcr xcrs[KVM_MAX_XCRS]; | ||
995 | __u64 padding[16]; | ||
996 | }; | ||
997 | |||
998 | This ioctl sets the vcpu's xcrs to the values specified by userspace. | ||
999 | |||
1000 | 4.46 KVM_GET_SUPPORTED_CPUID | ||
1001 | |||
1002 | Capability: KVM_CAP_EXT_CPUID | ||
1003 | Architectures: x86 | ||
1004 | Type: system ioctl | ||
1005 | Parameters: struct kvm_cpuid2 (in/out) | ||
1006 | Returns: 0 on success, -1 on error | ||
1007 | |||
1008 | struct kvm_cpuid2 { | ||
1009 | __u32 nent; | ||
1010 | __u32 padding; | ||
1011 | struct kvm_cpuid_entry2 entries[0]; | ||
1012 | }; | ||
1013 | |||
1014 | #define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1 | ||
1015 | #define KVM_CPUID_FLAG_STATEFUL_FUNC 2 | ||
1016 | #define KVM_CPUID_FLAG_STATE_READ_NEXT 4 | ||
1017 | |||
1018 | struct kvm_cpuid_entry2 { | ||
1019 | __u32 function; | ||
1020 | __u32 index; | ||
1021 | __u32 flags; | ||
1022 | __u32 eax; | ||
1023 | __u32 ebx; | ||
1024 | __u32 ecx; | ||
1025 | __u32 edx; | ||
1026 | __u32 padding[3]; | ||
1027 | }; | ||
1028 | |||
1029 | This ioctl returns x86 cpuid features which are supported by both the hardware | ||
1030 | and kvm. Userspace can use the information returned by this ioctl to | ||
1031 | construct cpuid information (for KVM_SET_CPUID2) that is consistent with | ||
1032 | hardware, kernel, and userspace capabilities, and with user requirements (for | ||
1033 | example, the user may wish to constrain cpuid to emulate older hardware, | ||
1034 | or for feature consistency across a cluster). | ||
1035 | |||
1036 | Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure | ||
1037 | with the 'nent' field indicating the number of entries in the variable-size | ||
1038 | array 'entries'. If the number of entries is too low to describe the cpu | ||
1039 | capabilities, an error (E2BIG) is returned. If the number is too high, | ||
1040 | the 'nent' field is adjusted and an error (ENOMEM) is returned. If the | ||
1041 | number is just right, the 'nent' field is adjusted to the number of valid | ||
1042 | entries in the 'entries' array, which is then filled. | ||
1043 | |||
1044 | The entries returned are the host cpuid as returned by the cpuid instruction, | ||
1045 | with unknown or unsupported features masked out. Some features (for example, | ||
1046 | x2apic) may not be present in the host cpu, but are exposed by kvm if it can | ||
1047 | emulate them efficiently. The fields in each entry are defined as follows: | ||
1048 | |||
1049 | function: the eax value used to obtain the entry | ||
1050 | index: the ecx value used to obtain the entry (for entries that are | ||
1051 | affected by ecx) | ||
1052 | flags: an OR of zero or more of the following: | ||
1053 | KVM_CPUID_FLAG_SIGNIFCANT_INDEX: | ||
1054 | if the index field is valid | ||
1055 | KVM_CPUID_FLAG_STATEFUL_FUNC: | ||
1056 | if cpuid for this function returns different values for successive | ||
1057 | invocations; there will be several entries with the same function, | ||
1058 | all with this flag set | ||
1059 | KVM_CPUID_FLAG_STATE_READ_NEXT: | ||
1060 | for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is | ||
1061 | the first entry to be read by a cpu | ||
1062 | eax, ebx, ecx, edx: the values returned by the cpuid instruction for | ||
1063 | this function/index combination | ||
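As a rough sketch of how userspace might size the variable-length kvm_cpuid2
buffer and look up an entry in the result, consider the code below. The structs
are local mirrors of the layouts above (real code would use <linux/kvm.h>);
cpuid2_alloc() and cpuid2_find() are hypothetical helper names. The lookup
honours KVM_CPUID_FLAG_SIGNIFCANT_INDEX but does not model stateful functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirrors of the layouts above; real code would use <linux/kvm.h>. */
struct kvm_cpuid_entry2 {
	uint32_t function, index, flags;
	uint32_t eax, ebx, ecx, edx;
	uint32_t padding[3];
};

struct kvm_cpuid2 {
	uint32_t nent;
	uint32_t padding;
	struct kvm_cpuid_entry2 entries[];
};

#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1

/* Allocate a zeroed kvm_cpuid2 with room for 'nent' entries. */
static struct kvm_cpuid2 *cpuid2_alloc(uint32_t nent)
{
	struct kvm_cpuid2 *c;

	c = calloc(1, sizeof(*c) + (size_t)nent * sizeof(c->entries[0]));
	if (c)
		c->nent = nent;
	return c;
}

/* Find the entry for function/index; 'index' is compared only when the
 * entry marks its index field as valid. Returns NULL if not found. */
static const struct kvm_cpuid_entry2 *
cpuid2_find(const struct kvm_cpuid2 *c, uint32_t function, uint32_t index)
{
	assert(c != NULL);
	for (uint32_t i = 0; i < c->nent; i++) {
		const struct kvm_cpuid_entry2 *e = &c->entries[i];

		if (e->function != function)
			continue;
		if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) &&
		    e->index != index)
			continue;
		return e;
	}
	return NULL;
}
```

A caller would pass the allocated buffer to KVM_GET_SUPPORTED_CPUID and, if the
ioctl fails with E2BIG, free it and retry with a larger 'nent'.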
1064 | |||
1065 | 4.47 KVM_PPC_GET_PVINFO | ||
1066 | |||
1067 | Capability: KVM_CAP_PPC_GET_PVINFO | ||
1068 | Architectures: ppc | ||
1069 | Type: vm ioctl | ||
1070 | Parameters: struct kvm_ppc_pvinfo (out) | ||
1071 | Returns: 0 on success, !0 on error | ||
1072 | |||
1073 | struct kvm_ppc_pvinfo { | ||
1074 | __u32 flags; | ||
1075 | __u32 hcall[4]; | ||
1076 | __u8 pad[108]; | ||
1077 | }; | ||
1078 | |||
1079 | This ioctl fetches PV specific information that needs to be passed to the guest | ||
1080 | using the device tree or other means from vm context. | ||
1081 | |||
1082 | For now the only implemented piece of information distributed here is an array | ||
1083 | of 4 instructions that make up a hypercall. | ||
1084 | |||
1085 | If any additional field gets added to this structure later on, a bit for that | ||
1086 | additional piece of information will be set in the flags bitmap. | ||
1087 | |||
1088 | 4.48 KVM_ASSIGN_PCI_DEVICE | ||
1089 | |||
1090 | Capability: KVM_CAP_DEVICE_ASSIGNMENT | ||
1091 | Architectures: x86 ia64 | ||
1092 | Type: vm ioctl | ||
1093 | Parameters: struct kvm_assigned_pci_dev (in) | ||
1094 | Returns: 0 on success, -1 on error | ||
1095 | |||
1096 | Assigns a host PCI device to the VM. | ||
1097 | |||
1098 | struct kvm_assigned_pci_dev { | ||
1099 | __u32 assigned_dev_id; | ||
1100 | __u32 busnr; | ||
1101 | __u32 devfn; | ||
1102 | __u32 flags; | ||
1103 | __u32 segnr; | ||
1104 | union { | ||
1105 | __u32 reserved[11]; | ||
1106 | }; | ||
1107 | }; | ||
1108 | |||
1109 | The PCI device is specified by the triple segnr, busnr, and devfn. | ||
1110 | Identification in succeeding service requests is done via assigned_dev_id. The | ||
1111 | following flags are specified: | ||
1112 | |||
1113 | /* Depends on KVM_CAP_IOMMU */ | ||
1114 | #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0) | ||
1115 | |||
1116 | 4.49 KVM_DEASSIGN_PCI_DEVICE | ||
1117 | |||
1118 | Capability: KVM_CAP_DEVICE_DEASSIGNMENT | ||
1119 | Architectures: x86 ia64 | ||
1120 | Type: vm ioctl | ||
1121 | Parameters: struct kvm_assigned_pci_dev (in) | ||
1122 | Returns: 0 on success, -1 on error | ||
1123 | |||
1124 | Ends PCI device assignment, releasing all associated resources. | ||
1125 | |||
1126 | See KVM_ASSIGN_PCI_DEVICE for the data structure. Only assigned_dev_id is | ||
1127 | used in kvm_assigned_pci_dev to identify the device. | ||
1128 | |||
1129 | 4.50 KVM_ASSIGN_DEV_IRQ | ||
1130 | |||
1131 | Capability: KVM_CAP_ASSIGN_DEV_IRQ | ||
1132 | Architectures: x86 ia64 | ||
1133 | Type: vm ioctl | ||
1134 | Parameters: struct kvm_assigned_irq (in) | ||
1135 | Returns: 0 on success, -1 on error | ||
1136 | |||
1137 | Assigns an IRQ to a passed-through device. | ||
1138 | |||
1139 | struct kvm_assigned_irq { | ||
1140 | __u32 assigned_dev_id; | ||
1141 | __u32 host_irq; | ||
1142 | __u32 guest_irq; | ||
1143 | __u32 flags; | ||
1144 | union { | ||
1145 | struct { | ||
1146 | __u32 addr_lo; | ||
1147 | __u32 addr_hi; | ||
1148 | __u32 data; | ||
1149 | } guest_msi; | ||
1150 | __u32 reserved[12]; | ||
1151 | }; | ||
1152 | }; | ||
1153 | |||
1154 | The following flags are defined: | ||
1155 | |||
1156 | #define KVM_DEV_IRQ_HOST_INTX (1 << 0) | ||
1157 | #define KVM_DEV_IRQ_HOST_MSI (1 << 1) | ||
1158 | #define KVM_DEV_IRQ_HOST_MSIX (1 << 2) | ||
1159 | |||
1160 | #define KVM_DEV_IRQ_GUEST_INTX (1 << 8) | ||
1161 | #define KVM_DEV_IRQ_GUEST_MSI (1 << 9) | ||
1162 | #define KVM_DEV_IRQ_GUEST_MSIX (1 << 10) | ||
1163 | |||
1164 | It is not valid to specify multiple types per host or guest IRQ. However, the | ||
1165 | IRQ type of host and guest can differ or can even be null. | ||
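The rule above (at most one type per side, either side may be empty) can be
checked as in this minimal sketch. The EXAMPLE_* masks and the function names
are illustrative assumptions, not kvm definitions; whether the kernel also
rejects bits outside these masks is not stated here, so the sketch ignores them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KVM_DEV_IRQ_HOST_INTX	(1 << 0)
#define KVM_DEV_IRQ_HOST_MSI	(1 << 1)
#define KVM_DEV_IRQ_HOST_MSIX	(1 << 2)

#define KVM_DEV_IRQ_GUEST_INTX	(1 << 8)
#define KVM_DEV_IRQ_GUEST_MSI	(1 << 9)
#define KVM_DEV_IRQ_GUEST_MSIX	(1 << 10)

/* Illustrative masks grouping the host-side and guest-side type bits. */
#define EXAMPLE_HOST_MASK  (KVM_DEV_IRQ_HOST_INTX | KVM_DEV_IRQ_HOST_MSI | \
			    KVM_DEV_IRQ_HOST_MSIX)
#define EXAMPLE_GUEST_MASK (KVM_DEV_IRQ_GUEST_INTX | KVM_DEV_IRQ_GUEST_MSI | \
			    KVM_DEV_IRQ_GUEST_MSIX)

static_assert((EXAMPLE_HOST_MASK & EXAMPLE_GUEST_MASK) == 0,
	      "host and guest type bits must not overlap");

/* True when zero or one bit is set in v. */
static bool at_most_one_bit(uint32_t v)
{
	return (v & (v - 1)) == 0;
}

/* True when 'flags' names at most one host type and at most one guest type. */
static bool assigned_irq_flags_valid(uint32_t flags)
{
	return at_most_one_bit(flags & EXAMPLE_HOST_MASK) &&
	       at_most_one_bit(flags & EXAMPLE_GUEST_MASK);
}
```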
1166 | |||
1167 | 4.51 KVM_DEASSIGN_DEV_IRQ | ||
1168 | |||
1169 | Capability: KVM_CAP_ASSIGN_DEV_IRQ | ||
1170 | Architectures: x86 ia64 | ||
1171 | Type: vm ioctl | ||
1172 | Parameters: struct kvm_assigned_irq (in) | ||
1173 | Returns: 0 on success, -1 on error | ||
1174 | |||
1175 | Ends an IRQ assignment to a passed-through device. | ||
1176 | |||
1177 | See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified | ||
1178 | by assigned_dev_id; flags must correspond to the IRQ type specified on | ||
1179 | KVM_ASSIGN_DEV_IRQ. Partial deassignment of host or guest IRQ is allowed. | ||
1180 | |||
1181 | 4.52 KVM_SET_GSI_ROUTING | ||
1182 | |||
1183 | Capability: KVM_CAP_IRQ_ROUTING | ||
1184 | Architectures: x86 ia64 | ||
1185 | Type: vm ioctl | ||
1186 | Parameters: struct kvm_irq_routing (in) | ||
1187 | Returns: 0 on success, -1 on error | ||
1188 | |||
1189 | Sets the GSI routing table entries, overwriting any previously set entries. | ||
1190 | |||
1191 | struct kvm_irq_routing { | ||
1192 | __u32 nr; | ||
1193 | __u32 flags; | ||
1194 | struct kvm_irq_routing_entry entries[0]; | ||
1195 | }; | ||
1196 | |||
1197 | No flags are specified so far; the corresponding field must be set to zero. | ||
1198 | |||
1199 | struct kvm_irq_routing_entry { | ||
1200 | __u32 gsi; | ||
1201 | __u32 type; | ||
1202 | __u32 flags; | ||
1203 | __u32 pad; | ||
1204 | union { | ||
1205 | struct kvm_irq_routing_irqchip irqchip; | ||
1206 | struct kvm_irq_routing_msi msi; | ||
1207 | __u32 pad[8]; | ||
1208 | } u; | ||
1209 | }; | ||
1210 | |||
1211 | /* gsi routing entry types */ | ||
1212 | #define KVM_IRQ_ROUTING_IRQCHIP 1 | ||
1213 | #define KVM_IRQ_ROUTING_MSI 2 | ||
1214 | |||
1215 | No flags are specified so far; the corresponding field must be set to zero. | ||
1216 | |||
1217 | struct kvm_irq_routing_irqchip { | ||
1218 | __u32 irqchip; | ||
1219 | __u32 pin; | ||
1220 | }; | ||
1221 | |||
1222 | struct kvm_irq_routing_msi { | ||
1223 | __u32 address_lo; | ||
1224 | __u32 address_hi; | ||
1225 | __u32 data; | ||
1226 | __u32 pad; | ||
1227 | }; | ||
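The structures above could be combined into a routing table roughly as follows.
This is a sketch under stated assumptions: the structs are local mirrors of the
documented layouts (real code would use <linux/kvm.h>), and routing_alloc(),
routing_set_irqchip() and routing_set_msi() are hypothetical helper names.
Both 'flags' fields are left at zero, as required above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirrors of the layouts above; real code would use <linux/kvm.h>. */
struct kvm_irq_routing_irqchip { uint32_t irqchip, pin; };
struct kvm_irq_routing_msi { uint32_t address_lo, address_hi, data, pad; };

struct kvm_irq_routing_entry {
	uint32_t gsi, type, flags, pad;
	union {
		struct kvm_irq_routing_irqchip irqchip;
		struct kvm_irq_routing_msi msi;
		uint32_t pad[8];
	} u;
};

struct kvm_irq_routing {
	uint32_t nr;
	uint32_t flags;
	struct kvm_irq_routing_entry entries[];
};

#define KVM_IRQ_ROUTING_IRQCHIP 1
#define KVM_IRQ_ROUTING_MSI 2

/* Allocate a zeroed table with 'nr' entries. */
static struct kvm_irq_routing *routing_alloc(uint32_t nr)
{
	struct kvm_irq_routing *t;

	t = calloc(1, sizeof(*t) + (size_t)nr * sizeof(t->entries[0]));
	if (t)
		t->nr = nr;
	return t;
}

/* Route 'gsi' to a pin of an in-kernel irqchip. */
static void routing_set_irqchip(struct kvm_irq_routing *t, uint32_t slot,
				uint32_t gsi, uint32_t irqchip, uint32_t pin)
{
	assert(slot < t->nr);
	t->entries[slot].gsi = gsi;
	t->entries[slot].type = KVM_IRQ_ROUTING_IRQCHIP;
	t->entries[slot].u.irqchip.irqchip = irqchip;
	t->entries[slot].u.irqchip.pin = pin;
}

/* Route 'gsi' to an MSI message (64-bit address split in two halves). */
static void routing_set_msi(struct kvm_irq_routing *t, uint32_t slot,
			    uint32_t gsi, uint64_t addr, uint32_t data)
{
	assert(slot < t->nr);
	t->entries[slot].gsi = gsi;
	t->entries[slot].type = KVM_IRQ_ROUTING_MSI;
	t->entries[slot].u.msi.address_lo = (uint32_t)addr;
	t->entries[slot].u.msi.address_hi = (uint32_t)(addr >> 32);
	t->entries[slot].u.msi.data = data;
}
```

Because the ioctl overwrites any previously set entries, the table passed in
would need to describe every route that should remain active.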
1228 | |||
1229 | 4.53 KVM_ASSIGN_SET_MSIX_NR | ||
1230 | |||
1231 | Capability: KVM_CAP_DEVICE_MSIX | ||
1232 | Architectures: x86 ia64 | ||
1233 | Type: vm ioctl | ||
1234 | Parameters: struct kvm_assigned_msix_nr (in) | ||
1235 | Returns: 0 on success, -1 on error | ||
1236 | |||
1237 | Set the number of MSI-X interrupts for an assigned device. This service can | ||
1238 | only be called once in the lifetime of an assigned device. | ||
1239 | |||
1240 | struct kvm_assigned_msix_nr { | ||
1241 | __u32 assigned_dev_id; | ||
1242 | __u16 entry_nr; | ||
1243 | __u16 padding; | ||
1244 | }; | ||
1245 | |||
1246 | #define KVM_MAX_MSIX_PER_DEV 256 | ||
1247 | |||
1248 | 4.54 KVM_ASSIGN_SET_MSIX_ENTRY | ||
1249 | |||
1250 | Capability: KVM_CAP_DEVICE_MSIX | ||
1251 | Architectures: x86 ia64 | ||
1252 | Type: vm ioctl | ||
1253 | Parameters: struct kvm_assigned_msix_entry (in) | ||
1254 | Returns: 0 on success, -1 on error | ||
1255 | |||
1256 | Specifies the routing of an MSI-X assigned device interrupt to a GSI. Setting | ||
1257 | the GSI vector to zero means disabling the interrupt. | ||
1258 | |||
1259 | struct kvm_assigned_msix_entry { | ||
1260 | __u32 assigned_dev_id; | ||
1261 | __u32 gsi; | ||
1262 | __u16 entry; /* The index of entry in the MSI-X table */ | ||
1263 | __u16 padding[3]; | ||
1264 | }; | ||
1265 | |||
1266 | 5. The kvm_run structure | ||
1267 | |||
1268 | Application code obtains a pointer to the kvm_run structure by | ||
1269 | mmap()ing a vcpu fd. From that point, application code can control | ||
1270 | execution by changing fields in kvm_run prior to calling the KVM_RUN | ||
1271 | ioctl, and obtain information about the reason KVM_RUN returned by | ||
1272 | looking up structure members. | ||
1273 | |||
1274 | struct kvm_run { | ||
1275 | /* in */ | ||
1276 | __u8 request_interrupt_window; | ||
1277 | |||
1278 | Request that KVM_RUN return when it becomes possible to inject external | ||
1279 | interrupts into the guest. Useful in conjunction with KVM_INTERRUPT. | ||
1280 | |||
1281 | __u8 padding1[7]; | ||
1282 | |||
1283 | /* out */ | ||
1284 | __u32 exit_reason; | ||
1285 | |||
1286 | When KVM_RUN has returned successfully (return value 0), this informs | ||
1287 | application code why KVM_RUN has returned. Allowable values for this | ||
1288 | field are detailed below. | ||
1289 | |||
1290 | __u8 ready_for_interrupt_injection; | ||
1291 | |||
1292 | If request_interrupt_window has been specified, this field indicates | ||
1293 | an interrupt can be injected now with KVM_INTERRUPT. | ||
1294 | |||
1295 | __u8 if_flag; | ||
1296 | |||
1297 | The value of the current interrupt flag. Only valid if in-kernel | ||
1298 | local APIC is not used. | ||
1299 | |||
1300 | __u8 padding2[2]; | ||
1301 | |||
1302 | /* in (pre_kvm_run), out (post_kvm_run) */ | ||
1303 | __u64 cr8; | ||
1304 | |||
1305 | The value of the cr8 register. Only valid if in-kernel local APIC is | ||
1306 | not used. Both input and output. | ||
1307 | |||
1308 | __u64 apic_base; | ||
1309 | |||
1310 | The value of the APIC BASE msr. Only valid if in-kernel local | ||
1311 | APIC is not used. Both input and output. | ||
1312 | |||
1313 | union { | ||
1314 | /* KVM_EXIT_UNKNOWN */ | ||
1315 | struct { | ||
1316 | __u64 hardware_exit_reason; | ||
1317 | } hw; | ||
1318 | |||
1319 | If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown | ||
1320 | reasons. Further architecture-specific information is available in | ||
1321 | hardware_exit_reason. | ||
1322 | |||
1323 | /* KVM_EXIT_FAIL_ENTRY */ | ||
1324 | struct { | ||
1325 | __u64 hardware_entry_failure_reason; | ||
1326 | } fail_entry; | ||
1327 | |||
1328 | If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due | ||
1329 | to unknown reasons. Further architecture-specific information is | ||
1330 | available in hardware_entry_failure_reason. | ||
1331 | |||
1332 | /* KVM_EXIT_EXCEPTION */ | ||
1333 | struct { | ||
1334 | __u32 exception; | ||
1335 | __u32 error_code; | ||
1336 | } ex; | ||
1337 | |||
1338 | Unused. | ||
1339 | |||
1340 | /* KVM_EXIT_IO */ | ||
1341 | struct { | ||
1342 | #define KVM_EXIT_IO_IN 0 | ||
1343 | #define KVM_EXIT_IO_OUT 1 | ||
1344 | __u8 direction; | ||
1345 | __u8 size; /* bytes */ | ||
1346 | __u16 port; | ||
1347 | __u32 count; | ||
1348 | __u64 data_offset; /* relative to kvm_run start */ | ||
1349 | } io; | ||
1350 | |||
1351 | If exit_reason is KVM_EXIT_IO, then the vcpu has | ||
1352 | executed a port I/O instruction which could not be satisfied by kvm. | ||
1353 | data_offset describes where the data is located (KVM_EXIT_IO_OUT) or | ||
1354 | where kvm expects application code to place the data for the next | ||
1355 | KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array. | ||
1356 | |||
1357 | struct { | ||
1358 | struct kvm_debug_exit_arch arch; | ||
1359 | } debug; | ||
1360 | |||
1361 | Unused. | ||
1362 | |||
1363 | /* KVM_EXIT_MMIO */ | ||
1364 | struct { | ||
1365 | __u64 phys_addr; | ||
1366 | __u8 data[8]; | ||
1367 | __u32 len; | ||
1368 | __u8 is_write; | ||
1369 | } mmio; | ||
1370 | |||
1371 | If exit_reason is KVM_EXIT_MMIO, then the vcpu has | ||
1372 | executed a memory-mapped I/O instruction which could not be satisfied | ||
1373 | by kvm. The 'data' member contains the written data if 'is_write' is | ||
1374 | true, and should be filled by application code otherwise. | ||
1375 | |||
1376 | NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding | ||
1377 | operations are complete (and guest state is consistent) only after userspace | ||
1378 | has re-entered the kernel with KVM_RUN. The kernel side will first finish | ||
1379 | incomplete operations and then check for pending signals. Userspace | ||
1380 | can re-enter the guest with an unmasked signal pending to complete | ||
1381 | pending operations. | ||
1382 | |||
1383 | /* KVM_EXIT_HYPERCALL */ | ||
1384 | struct { | ||
1385 | __u64 nr; | ||
1386 | __u64 args[6]; | ||
1387 | __u64 ret; | ||
1388 | __u32 longmode; | ||
1389 | __u32 pad; | ||
1390 | } hypercall; | ||
1391 | |||
1392 | Unused. This was once used for 'hypercall to userspace'. To implement | ||
1393 | such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390). | ||
1394 | Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO. | ||
1395 | |||
1396 | /* KVM_EXIT_TPR_ACCESS */ | ||
1397 | struct { | ||
1398 | __u64 rip; | ||
1399 | __u32 is_write; | ||
1400 | __u32 pad; | ||
1401 | } tpr_access; | ||
1402 | |||
1403 | To be documented (KVM_TPR_ACCESS_REPORTING). | ||
1404 | |||
1405 | /* KVM_EXIT_S390_SIEIC */ | ||
1406 | struct { | ||
1407 | __u8 icptcode; | ||
1408 | __u64 mask; /* psw upper half */ | ||
1409 | __u64 addr; /* psw lower half */ | ||
1410 | __u16 ipa; | ||
1411 | __u32 ipb; | ||
1412 | } s390_sieic; | ||
1413 | |||
1414 | s390 specific. | ||
1415 | |||
1416 | /* KVM_EXIT_S390_RESET */ | ||
1417 | #define KVM_S390_RESET_POR 1 | ||
1418 | #define KVM_S390_RESET_CLEAR 2 | ||
1419 | #define KVM_S390_RESET_SUBSYSTEM 4 | ||
1420 | #define KVM_S390_RESET_CPU_INIT 8 | ||
1421 | #define KVM_S390_RESET_IPL 16 | ||
1422 | __u64 s390_reset_flags; | ||
1423 | |||
1424 | s390 specific. | ||
1425 | |||
1426 | /* KVM_EXIT_DCR */ | ||
1427 | struct { | ||
1428 | __u32 dcrn; | ||
1429 | __u32 data; | ||
1430 | __u8 is_write; | ||
1431 | } dcr; | ||
1432 | |||
1433 | powerpc specific. | ||
1434 | |||
1435 | /* KVM_EXIT_OSI */ | ||
1436 | struct { | ||
1437 | __u64 gprs[32]; | ||
1438 | } osi; | ||
1439 | |||
1440 | MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch | ||
1441 | hypercalls and exit with this exit struct that contains all the guest gprs. | ||
1442 | |||
1443 | If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall. | ||
1444 | Userspace can now handle the hypercall and when it's done modify the gprs as | ||
1445 | necessary. Upon guest entry all guest GPRs will then be replaced by the values | ||
1446 | in this struct. | ||
1447 | |||
1448 | /* Fix the size of the union. */ | ||
1449 | char padding[256]; | ||
1450 | }; | ||
1451 | }; | ||
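For the KVM_EXIT_IO case described above, the data location can be derived from
data_offset as in this minimal sketch. It assumes "packed array" means 'count'
consecutive items of 'size' bytes each; struct example_exit_io and the
exit_io_*() helpers are illustrative names that mirror only the io member, not
kvm definitions. In real code the values come from the mmap()ed kvm_run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the 'io' member only; real code reads run->io directly. */
struct example_exit_io {
	uint8_t  direction;
	uint8_t  size;		/* bytes per item */
	uint16_t port;
	uint32_t count;
	uint64_t data_offset;	/* relative to kvm_run start */
};

/* Start of the data area, given the base address of the kvm_run mapping. */
static uint8_t *exit_io_data(void *run, const struct example_exit_io *io)
{
	return (uint8_t *)run + io->data_offset;
}

/* Total number of data bytes, assuming count items of size bytes each. */
static size_t exit_io_len(const struct example_exit_io *io)
{
	return (size_t)io->size * io->count;
}

/* Address of item i within the packed array. */
static uint8_t *exit_io_item(void *run, const struct example_exit_io *io,
			     uint32_t i)
{
	assert(i < io->count);
	return exit_io_data(run, io) + (size_t)i * io->size;
}
```

For KVM_EXIT_IO_OUT the application would read from these addresses; for
KVM_EXIT_IO_IN it would write the data there before the next KVM_RUN.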
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt new file mode 100644 index 000000000000..882068538c9c --- /dev/null +++ b/Documentation/virtual/kvm/cpuid.txt | |||
@@ -0,0 +1,45 @@ | |||
1 | KVM CPUID bits | ||
2 | Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 | ||
3 | ===================================================== | ||
4 | |||
5 | A guest running on a kvm host can check some of its features using | ||
6 | cpuid. This is not always guaranteed to work, since userspace can | ||
7 | mask-out some, or even all KVM-related cpuid features before launching | ||
8 | a guest. | ||
9 | |||
10 | KVM cpuid functions are: | ||
11 | |||
12 | function: KVM_CPUID_SIGNATURE (0x40000000) | ||
13 | returns : eax = 0, | ||
14 | ebx = 0x4b4d564b, | ||
15 | ecx = 0x564b4d56, | ||
16 | edx = 0x4d. | ||
17 | Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM". | ||
18 | This function queries the presence of KVM cpuid leaves. | ||
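The mapping from the three register values to the string can be sketched as
below: each register contributes four bytes, least significant byte first.
cpuid_signature() and is_kvm_signature() are hypothetical helper names; the
register values would come from executing cpuid with eax = 0x40000000.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unpack ebx, ecx, edx (low byte first) into a NUL-terminated string. */
static void cpuid_signature(uint32_t ebx, uint32_t ecx, uint32_t edx,
			    char out[13])
{
	const uint32_t regs[3] = { ebx, ecx, edx };

	assert(out != NULL);
	for (int r = 0; r < 3; r++)
		for (int b = 0; b < 4; b++)
			out[4 * r + b] = (char)((regs[r] >> (8 * b)) & 0xff);
	out[12] = '\0';
}

/* Nonzero when the registers spell the KVM signature. */
static int is_kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	char sig[13];

	cpuid_signature(ebx, ecx, edx, sig);
	return strcmp(sig, "KVMKVMKVM") == 0;
}
```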
19 | |||
20 | |||
21 | function: KVM_CPUID_FEATURES (0x40000001) | ||
22 | returns : ebx, ecx, edx = 0 | ||
23 | eax = an OR'ed group of (1 << flag), where each flag is: | ||
24 | |||
25 | |||
26 | flag || value || meaning | ||
27 | ============================================================================= | ||
28 | KVM_FEATURE_CLOCKSOURCE || 0 || kvmclock available at msrs | ||
29 | || || 0x11 and 0x12. | ||
30 | ------------------------------------------------------------------------------ | ||
31 | KVM_FEATURE_NOP_IO_DELAY || 1 || not necessary to perform delays | ||
32 | || || on PIO operations. | ||
33 | ------------------------------------------------------------------------------ | ||
34 | KVM_FEATURE_MMU_OP || 2 || deprecated. | ||
35 | ------------------------------------------------------------------------------ | ||
36 | KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs | ||
37 | || || 0x4b564d00 and 0x4b564d01 | ||
38 | ------------------------------------------------------------------------------ | ||
39 | KVM_FEATURE_ASYNC_PF || 4 || async pf can be enabled by | ||
40 | || || writing to msr 0x4b564d02 | ||
41 | ------------------------------------------------------------------------------ | ||
42 | KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side | ||
43 | || || per-cpu warps are expected in | ||
44 | || || kvmclock. | ||
45 | ------------------------------------------------------------------------------ | ||
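Testing a feature from the table above amounts to checking bit 'flag' of eax.
A minimal sketch follows; kvm_feature_present() is a hypothetical helper name,
the constants repeat the values listed in the table, and 'eax' is assumed to
hold the value returned for KVM_CPUID_FEATURES.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as listed in the table above. */
#define KVM_FEATURE_CLOCKSOURCE			0
#define KVM_FEATURE_NOP_IO_DELAY		1
#define KVM_FEATURE_MMU_OP			2
#define KVM_FEATURE_CLOCKSOURCE2		3
#define KVM_FEATURE_ASYNC_PF			4
#define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT	24

/* True when bit 'flag' is set in the eax value of KVM_CPUID_FEATURES. */
static bool kvm_feature_present(uint32_t eax, unsigned int flag)
{
	assert(flag < 32);
	return (eax >> flag) & 1u;
}
```

Since userspace can mask out these bits, a clear bit only means the feature is
not advertised to this guest.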
diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt new file mode 100644 index 000000000000..3b4cd3bf5631 --- /dev/null +++ b/Documentation/virtual/kvm/locking.txt | |||
@@ -0,0 +1,25 @@ | |||
1 | KVM Lock Overview | ||
2 | ================= | ||
3 | |||
4 | 1. Acquisition Orders | ||
5 | --------------------- | ||
6 | |||
7 | (to be written) | ||
8 | |||
9 | 2. Reference | ||
10 | ------------ | ||
11 | |||
12 | Name: kvm_lock | ||
13 | Type: raw_spinlock | ||
14 | Arch: any | ||
15 | Protects: - vm_list | ||
16 | - hardware virtualization enable/disable | ||
17 | Comment: 'raw' because hardware enabling/disabling must be atomic wrt | ||
18 | migration. | ||
19 | |||
20 | Name: kvm_arch::tsc_write_lock | ||
21 | Type: raw_spinlock | ||
22 | Arch: x86 | ||
23 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} | ||
24 | - tsc offset in vmcb | ||
25 | Comment: 'raw' because updating the tsc offsets must not be preempted. | ||
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt new file mode 100644 index 000000000000..f46aa58389ca --- /dev/null +++ b/Documentation/virtual/kvm/mmu.txt | |||
@@ -0,0 +1,348 @@ | |||
1 | The x86 kvm shadow mmu | ||
2 | ====================== | ||
3 | |||
4 | The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible | ||
5 | for presenting a standard x86 mmu to the guest, while translating guest | ||
6 | physical addresses to host physical addresses. | ||
7 | |||
8 | The mmu code attempts to satisfy the following requirements: | ||
9 | |||
10 | - correctness: the guest should not be able to determine that it is running | ||
11 | on an emulated mmu except for timing (we attempt to comply | ||
12 | with the specification, not emulate the characteristics of | ||
13 | a particular implementation such as tlb size) | ||
14 | - security: the guest must not be able to touch host memory not assigned | ||
15 | to it | ||
16 | - performance: minimize the performance penalty imposed by the mmu | ||
17 | - scaling: need to scale to large memory and large vcpu guests | ||
18 | - hardware: support the full range of x86 virtualization hardware | ||
19 | - integration: Linux memory management code must be in control of guest memory | ||
20 | so that swapping, page migration, page merging, transparent | ||
21 | hugepages, and similar features work without change | ||
22 | - dirty tracking: report writes to guest memory to enable live migration | ||
23 | and framebuffer-based displays | ||
24 | - footprint: keep the amount of pinned kernel memory low (most memory | ||
25 | should be shrinkable) | ||
26 | - reliability: avoid multipage or GFP_ATOMIC allocations | ||
27 | |||
28 | Acronyms | ||
29 | ======== | ||
30 | |||
31 | pfn host page frame number | ||
32 | hpa host physical address | ||
33 | hva host virtual address | ||
34 | gfn guest frame number | ||
35 | gpa guest physical address | ||
36 | gva guest virtual address | ||
37 | ngpa nested guest physical address | ||
38 | ngva nested guest virtual address | ||
39 | pte page table entry (used also to refer generically to paging structure | ||
40 | entries) | ||
41 | gpte guest pte (referring to gfns) | ||
42 | spte shadow pte (referring to pfns) | ||
43 | tdp two dimensional paging (vendor neutral term for NPT and EPT) | ||
44 | |||
45 | Virtual and real hardware supported | ||
46 | =================================== | ||
47 | |||
48 | The mmu supports first-generation mmu hardware, which allows an atomic switch | ||
49 | of the current paging mode and cr3 during guest entry, as well as | ||
50 | two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware | ||
51 | it exposes is the traditional 2/3/4 level x86 mmu, with support for global | ||
52 | pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support | ||
53 | exposing NPT capable hardware on NPT capable hosts. | ||
54 | |||
55 | Translation | ||
56 | =========== | ||
57 | |||
58 | The primary job of the mmu is to program the processor's mmu to translate | ||
59 | addresses for the guest. Different translations are required at different | ||
60 | times: | ||
61 | |||
62 | - when guest paging is disabled, we translate guest physical addresses to | ||
63 | host physical addresses (gpa->hpa) | ||
64 | - when guest paging is enabled, we translate guest virtual addresses, to | ||
65 | guest physical addresses, to host physical addresses (gva->gpa->hpa) | ||
66 | - when the guest launches a guest of its own, we translate nested guest | ||
67 | virtual addresses, to nested guest physical addresses, to guest physical | ||
68 | addresses, to host physical addresses (ngva->ngpa->gpa->hpa) | ||
69 | |||
70 | The primary challenge is to encode between 1 and 3 translations into hardware | ||
71 | that supports only 1 (traditional) or 2 (tdp) translations. When the | ||
72 | number of required translations matches the hardware, the mmu operates in | ||
73 | direct mode; otherwise it operates in shadow mode (see below). | ||
74 | |||
75 | Memory | ||
76 | ====== | ||
77 | |||
78 | Guest memory (gpa) is part of the user address space of the process that is | ||
79 | using kvm. Userspace defines the translation between guest addresses and user | ||
80 | addresses (gpa->hva); note that two gpas may alias to the same hva, but not | ||
81 | vice versa. | ||
82 | |||
83 | These hvas may be backed using any method available to the host: anonymous | ||
84 | memory, file backed memory, and device memory. Memory might be paged by the | ||
85 | host at any time. | ||
86 | |||
87 | Events | ||
88 | ====== | ||
89 | |||
90 | The mmu is driven by events, some from the guest, some from the host. | ||
91 | |||
92 | Guest generated events: | ||
93 | - writes to control registers (especially cr3) | ||
94 | - invlpg/invlpga instruction execution | ||
95 | - access to missing or protected translations | ||
96 | |||
97 | Host generated events: | ||
98 | - changes in the gpa->hpa translation (either through gpa->hva changes or | ||
99 | through hva->hpa changes) | ||
100 | - memory pressure (the shrinker) | ||
101 | |||
102 | Shadow pages | ||
103 | ============ | ||
104 | |||
105 | The principal data structure is the shadow page, 'struct kvm_mmu_page'. A | ||
106 | shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A | ||
107 | shadow page may contain a mix of leaf and nonleaf sptes. | ||
108 | |||
109 | A nonleaf spte allows the hardware mmu to reach the leaf pages and | ||
110 | is not related to a translation directly. It points to other shadow pages. | ||
111 | |||
112 | A leaf spte corresponds to either one or two translations encoded into | ||
113 | one paging structure entry. These are always the lowest level of the | ||
114 | translation stack, with optional higher level translations left to NPT/EPT. | ||
115 | Leaf ptes point at guest pages. | ||
116 | |||
117 | The following table shows translations encoded by leaf ptes, with higher-level | ||
118 | translations in parentheses: | ||
119 | |||
120 | Non-nested guests: | ||
121 | nonpaging: gpa->hpa | ||
122 | paging: gva->gpa->hpa | ||
123 | paging, tdp: (gva->)gpa->hpa | ||
124 | Nested guests: | ||
125 | non-tdp: ngva->gpa->hpa (*) | ||
126 | tdp: (ngva->)ngpa->gpa->hpa | ||
127 | |||
128 | (*) the guest hypervisor will encode the ngva->gpa translation into its page | ||
129 | tables if npt is not present | ||
130 | |||
131 | Shadow pages contain the following information: | ||
132 | role.level: | ||
133 | The level in the shadow paging hierarchy that this shadow page belongs to. | ||
134 | 1=4k sptes, 2=2M sptes, 3=1G sptes, etc. | ||
135 | role.direct: | ||
136 | If set, leaf sptes reachable from this page are for a linear range. | ||
137 | Examples include real mode translation, large guest pages backed by small | ||
138 | host pages, and gpa->hpa translations when NPT or EPT is active. | ||
139 | The linear range starts at (gfn << PAGE_SHIFT) and its size is determined | ||
140 | by role.level (2MB for first level, 1GB for second level, 0.5TB for third | ||
141 | level, 256TB for fourth level) | ||
142 | If clear, this page corresponds to a guest page table denoted by the gfn | ||
143 | field. | ||
144 | role.quadrant: | ||
145 | When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit | ||
146 | sptes. That means a guest page table contains more ptes than the host, | ||
147 | so multiple shadow pages are needed to shadow one guest page. | ||
148 | For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the | ||
149 | first or second 512-gpte block in the guest page table. For second-level | ||
150 | page tables, each 32-bit gpte is converted to two 64-bit sptes | ||
151 | (since each first-level guest page is shadowed by two first-level | ||
152 | shadow pages) so role.quadrant takes values in the range 0..3. Each | ||
153 | quadrant maps 1GB virtual address space. | ||
154 | role.access: | ||
155 | Inherited guest access permissions in the form uwx. Note execute | ||
156 | permission is positive, not negative. | ||
157 | role.invalid: | ||
158 | The page is invalid and should not be used. It is a root page that is | ||
159 | currently pinned (by a cpu hardware register pointing to it); once it is | ||
160 | unpinned it will be destroyed. | ||
161 | role.cr4_pae: | ||
162 | Contains the value of cr4.pae for which the page is valid (e.g. whether | ||
163 | 32-bit or 64-bit gptes are in use). | ||
164 | role.nxe: | ||
165 | Contains the value of efer.nxe for which the page is valid. | ||
166 | role.cr0_wp: | ||
167 | Contains the value of cr0.wp for which the page is valid. | ||
168 | gfn: | ||
169 | Either the guest page table containing the translations shadowed by this | ||
170 | page, or the base page frame for linear translations. See role.direct. | ||
171 | spt: | ||
172 | A pageful of 64-bit sptes containing the translations for this page. | ||
173 | Accessed by both kvm and hardware. | ||
174 | The page pointed to by spt will have its page->private pointing back | ||
175 | at the shadow page structure. | ||
176 | sptes in spt point either at guest pages, or at lower-level shadow pages. | ||
177 | Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point | ||
178 | at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte. | ||
179 | The spt array forms a DAG structure with the shadow page as a node, and | ||
180 | guest pages as leaves. | ||
181 | gfns: | ||
182 | An array of 512 guest frame numbers, one for each present pte. Used to | ||
183 | perform a reverse map from a pte to a gfn. When role.direct is set, any | ||
184 | element of this array can be calculated from the gfn field when used, in | ||
185 | this case, the array of gfns is not allocated. See role.direct and gfn. | ||
186 | slot_bitmap: | ||
187 | A bitmap containing one bit per memory slot. If the page contains a pte | ||
188 | mapping a page from memory slot n, then bit n of slot_bitmap will be set | ||
189 | (if a page is aliased among several slots, then it is not guaranteed that | ||
190 | all slots will be marked). | ||
191 | Used during dirty logging to avoid scanning a shadow page if none of its | ||
192 | pages need tracking. | ||
193 | root_count: | ||
194 | A counter keeping track of how many hardware registers (guest cr3 or | ||
195 | pdptrs) are now pointing at the page. While this counter is nonzero, the | ||
196 | page cannot be destroyed. See role.invalid. | ||
197 | multimapped: | ||
198 | Whether there exist multiple sptes pointing at this page. | ||
199 | parent_pte/parent_ptes: | ||
200 | If multimapped is zero, parent_pte points at the single spte that points at | ||
201 | this page's spt. Otherwise, parent_ptes points at a data structure | ||
202 | with a list of parent_ptes. | ||
203 | unsync: | ||
204 | If true, then the translations in this page may not match the guest's | ||
205 | translation. This is equivalent to the state of the tlb when a pte is | ||
206 | changed but before the tlb entry is flushed. Accordingly, unsync ptes | ||
207 | are synchronized when the guest executes invlpg or flushes its tlb by | ||
208 | other means. Valid for leaf pages. | ||
209 | unsync_children: | ||
210 | How many sptes in the page point at pages that are unsync (or have | ||
211 | unsynchronized children). | ||
212 | unsync_child_bitmap: | ||
213 | A bitmap indicating which sptes in spt point (directly or indirectly) at | ||
214 | pages that may be unsynchronized. Used to quickly locate all unsynchronized | ||
215 | pages reachable from a given page. | ||
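
The linear range sizes quoted under role.direct follow from each shadow page
holding 512 sptes (9 address bits per level) on top of 4k base pages. The
following is a minimal sketch of that arithmetic only; the helper and macro
names are hypothetical and do not appear in the kvm sources.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT     12  /* 4k base pages (assumption: x86) */
#define SKETCH_PT_LEVEL_BITS   9  /* 512 sptes per shadow page */

/* Bytes of address space covered by one direct shadow page at role.level. */
static inline uint64_t direct_range_size(unsigned int level)
{
	return 1ULL << (SKETCH_PAGE_SHIFT + SKETCH_PT_LEVEL_BITS * level);
}

/* First byte of the linear range, derived from the page's gfn field. */
static inline uint64_t direct_range_start(uint64_t gfn)
{
	return gfn << SKETCH_PAGE_SHIFT;
}
```

For level 1 this gives 2MB and for level 4 it gives 256TB, matching the
figures in the role.direct description.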
216 | |||
217 | Reverse map | ||
218 | =========== | ||
219 | |||
220 | The mmu maintains a reverse mapping whereby all ptes mapping a page can be | ||
221 | reached given its gfn. This is used, for example, when swapping out a page. | ||
222 | |||
223 | Synchronized and unsynchronized pages | ||
224 | ===================================== | ||
225 | |||
226 | The guest uses two events to synchronize its tlb and page tables: tlb flushes | ||
227 | and page invalidations (invlpg). | ||
228 | |||
229 | A tlb flush means that we need to synchronize all sptes reachable from the | ||
230 | guest's cr3. This is expensive, so we keep all guest page tables write | ||
231 | protected, and synchronize sptes to gptes when a gpte is written. | ||
232 | |||
233 | A special case is when a guest page table is reachable from the current | ||
234 | guest cr3. In this case, the guest is obliged to issue an invlpg instruction | ||
235 | before using the translation. We take advantage of that by removing write | ||
236 | protection from the guest page, and allowing the guest to modify it freely. | ||
237 | We synchronize modified gptes when the guest invokes invlpg. This reduces | ||
238 | the amount of emulation we have to do when the guest modifies multiple gptes, | ||
239 | or when a guest page is no longer used as a page table and is used for | ||
240 | random guest data. | ||
241 | |||
242 | As a side effect we have to resynchronize all reachable unsynchronized shadow | ||
243 | pages on a tlb flush. | ||
244 | |||
245 | |||
246 | Reaction to events | ||
247 | ================== | ||
248 | |||
249 | - guest page fault (or npt page fault, or ept violation) | ||
250 | |||
251 | This is the most complicated event. The cause of a page fault can be: | ||
252 | |||
253 | - a true guest fault (the guest translation won't allow the access) (*) | ||
254 | - access to a missing translation | ||
255 | - access to a protected translation | ||
256 | - when logging dirty pages, memory is write protected | ||
257 | - synchronized shadow pages are write protected (*) | ||
258 | - access to untranslatable memory (mmio) | ||
259 | |||
260 | (*) not applicable in direct mode | ||
261 | |||
262 | Handling a page fault is performed as follows: | ||
263 | |||
264 | - if needed, walk the guest page tables to determine the guest translation | ||
265 | (gva->gpa or ngpa->gpa) | ||
266 | - if permissions are insufficient, reflect the fault back to the guest | ||
267 | - determine the host page | ||
268 | - if this is an mmio request, there is no host page; call the emulator | ||
269 | to emulate the instruction instead | ||
270 | - walk the shadow page table to find the spte for the translation, | ||
271 | instantiating missing intermediate page tables as necessary | ||
272 | - try to unsynchronize the page | ||
273 | - if successful, we can let the guest continue and modify the gpte | ||
274 | - emulate the instruction | ||
275 | - if failed, unshadow the page and let the guest continue | ||
276 | - update any translations that were modified by the instruction | ||
277 | |||
278 | invlpg handling: | ||
279 | |||
280 | - walk the shadow page hierarchy and drop affected translations | ||
281 | - try to reinstantiate the indicated translation in the hope that the | ||
282 | guest will use it in the near future | ||
283 | |||
284 | Guest control register updates: | ||
285 | |||
286 | - mov to cr3 | ||
287 | - look up new shadow roots | ||
288 | - synchronize newly reachable shadow pages | ||
289 | |||
290 | - mov to cr0/cr4/efer | ||
291 | - set up mmu context for new paging mode | ||
292 | - look up new shadow roots | ||
293 | - synchronize newly reachable shadow pages | ||
294 | |||
295 | Host translation updates: | ||
296 | |||
297 | - mmu notifier called with updated hva | ||
298 | - look up affected sptes through reverse map | ||
299 | - drop (or update) translations | ||
300 | |||
301 | Emulating cr0.wp | ||
302 | ================ | ||
303 | |||
304 | If tdp is not enabled, the host must keep cr0.wp=1 so page write protection | ||
305 | works for the guest kernel, not just guest userspace. When the guest | ||
306 | cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0, | ||
307 | we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the | ||
308 | semantics require allowing any guest kernel access plus user read access). | ||
309 | |||
310 | We handle this by mapping the permissions to two possible sptes, depending | ||
311 | on fault type: | ||
312 | |||
313 | - kernel write fault: spte.u=0, spte.w=1 (allows full kernel access, | ||
314 | disallows user access) | ||
315 | - read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel | ||
316 | write access) | ||
317 | |||
318 | (user write faults generate a #PF) | ||
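
The choice between the two sptes can be sketched as a small decision
function. This is an illustration of the rule above for the case gpte.u=1,
gpte.w=0 with guest cr0.wp=0; the type and function names are hypothetical
and are not taken from the kvm mmu code.

```c
#include <stdbool.h>

enum wp0_result {
	WP0_MAP,        /* install the spte described by *out */
	WP0_INJECT_PF,  /* reflect a #PF to the guest */
};

struct wp0_spte {
	bool u;  /* spte.u: user access allowed */
	bool w;  /* spte.w: write access allowed */
};

/* Pick spte permissions from the type of the faulting access. */
static enum wp0_result wp0_choose_spte(bool write_fault, bool user_fault,
				       struct wp0_spte *out)
{
	if (write_fault && user_fault)
		return WP0_INJECT_PF;   /* user writes are never allowed */

	if (write_fault) {
		/* kernel write: full kernel access, no user access */
		out->u = false;
		out->w = true;
	} else {
		/* read: full read access, no kernel write access */
		out->u = true;
		out->w = false;
	}
	return WP0_MAP;
}
```

Because the two sptes are mutually exclusive, an access of the other kind
faults again and the spte is switched at that point.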
319 | |||
320 | Large pages | ||
321 | =========== | ||
322 | |||
323 | The mmu supports all combinations of large and small guest and host pages. | ||
324 | Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as | ||
325 | two separate 2M pages, on both guest and host, since the mmu always uses PAE | ||
326 | paging. | ||
327 | |||
328 | To instantiate a large spte, four constraints must be satisfied: | ||
329 | |||
330 | - the spte must point to a large host page | ||
331 | - the guest pte must be a large pte of at least equivalent size (if tdp is | ||
332 | enabled, there is no guest pte and this condition is satisfied) | ||
333 | - if the spte will be writeable, the large page frame may not overlap any | ||
334 | write-protected pages | ||
335 | - the guest page must be wholly contained by a single memory slot | ||
336 | |||
337 | To check the last two conditions, the mmu maintains a ->write_count set of | ||
338 | arrays for each memory slot and large page size. Every write protected page | ||
339 | causes its write_count to be incremented, thus preventing instantiation of | ||
340 | a large spte. The frames at the end of an unaligned memory slot have | ||
341 | artificially inflated ->write_counts so they can never be instantiated. | ||
342 | |||
343 | Further reading | ||
344 | =============== | ||
345 | |||
346 | - NPT presentation from KVM Forum 2008 | ||
347 | http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf | ||
348 | |||
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt new file mode 100644 index 000000000000..d079aed27e03 --- /dev/null +++ b/Documentation/virtual/kvm/msr.txt | |||
@@ -0,0 +1,187 @@ | |||
1 | KVM-specific MSRs. | ||
2 | Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 | ||
3 | ===================================================== | ||
4 | |||
5 | KVM makes use of some custom MSRs to service some requests. | ||
6 | |||
7 | Custom MSRs have a range reserved for them, that goes from | ||
8 | 0x4b564d00 to 0x4b564dff. There are MSRs outside this area, | ||
9 | but they are deprecated and their use is discouraged. | ||
10 | |||
11 | Custom MSR list | ||
12 | --------------- | ||
13 | |||
14 | The current supported Custom MSR list is: | ||
15 | |||
16 | MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 | ||
17 | |||
18 | data: 4-byte aligned physical address of a memory area which must be | ||
19 | in guest RAM. This memory is expected to hold a copy of the following | ||
20 | structure: | ||
21 | |||
22 | struct pvclock_wall_clock { | ||
23 | u32 version; | ||
24 | u32 sec; | ||
25 | u32 nsec; | ||
26 | } __attribute__((__packed__)); | ||
27 | |||
28 | whose data will be filled in by the hypervisor. The hypervisor is only | ||
29 | guaranteed to update this data at the moment of MSR write. | ||
30 | Users that want to reliably query this information more than once have | ||
31 | to write more than once to this MSR. Fields have the following meanings: | ||
32 | |||
33 | version: guest has to check version before and after grabbing | ||
34 | time information and check that they are both equal and even. | ||
35 | An odd version indicates an in-progress update. | ||
36 | |||
37 | sec: number of seconds for wallclock. | ||
38 | |||
39 | nsec: number of nanoseconds for wallclock. | ||
40 | |||
41 | Note that although MSRs are per-CPU entities, the effect of this | ||
42 | particular MSR is global. | ||
43 | |||
44 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | ||
45 | leaf prior to usage. | ||
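
The version check described above amounts to a retry loop around the field
reads. The sketch below assumes the structure has already been registered by
writing its address to the MSR; the function name is hypothetical, and
__sync_synchronize() (a GCC/Clang builtin) stands in for whatever read
barrier the guest kernel would really use.

```c
#include <stdint.h>

struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/*
 * Copy out a consistent (sec, nsec) pair. Retries while the version is
 * odd (update in progress) or changed between the two reads.
 */
static void read_wall_clock(const volatile struct pvclock_wall_clock *wc,
			    uint32_t *sec, uint32_t *nsec)
{
	uint32_t before, after;

	do {
		before = wc->version;
		__sync_synchronize();   /* read version before the payload */
		*sec = wc->sec;
		*nsec = wc->nsec;
		__sync_synchronize();   /* read payload before re-checking */
		after = wc->version;
	} while (before != after || (before & 1));
}
```

The same loop shape applies to struct pvclock_vcpu_time_info, which uses an
identical version field.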
46 | |||
47 | MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 | ||
48 | |||
49 | data: 4-byte aligned physical address of a memory area which must be in | ||
50 | guest RAM, plus an enable bit in bit 0. This memory is expected to hold | ||
51 | a copy of the following structure: | ||
52 | |||
53 | struct pvclock_vcpu_time_info { | ||
54 | u32 version; | ||
55 | u32 pad0; | ||
56 | u64 tsc_timestamp; | ||
57 | u64 system_time; | ||
58 | u32 tsc_to_system_mul; | ||
59 | s8 tsc_shift; | ||
60 | u8 flags; | ||
61 | u8 pad[2]; | ||
62 | } __attribute__((__packed__)); /* 32 bytes */ | ||
63 | |||
64 | whose data will be filled in by the hypervisor periodically. Only one | ||
65 | write, or registration, is needed for each VCPU. The interval between | ||
66 | updates of this structure is arbitrary and implementation-dependent. | ||
67 | The hypervisor may update this structure at any time it sees fit until | ||
68 | anything with bit0 == 0 is written to it. | ||
69 | |||
70 | Fields have the following meanings: | ||
71 | |||
72 | version: guest has to check version before and after grabbing | ||
73 | time information and check that they are both equal and even. | ||
74 | An odd version indicates an in-progress update. | ||
75 | |||
76 | tsc_timestamp: the tsc value at the current VCPU at the time | ||
77 | of the update of this structure. Guests can subtract this value | ||
78 | from current tsc to derive a notion of elapsed time since the | ||
79 | structure update. | ||
80 | |||
81 | system_time: a host notion of monotonic time, including sleep | ||
82 | time at the time this structure was last updated. Unit is | ||
83 | nanoseconds. | ||
84 | |||
85 | tsc_to_system_mul: multiplier used when converting a | ||
86 | tsc-related quantity to nanoseconds. It is a 32-bit binary | ||
87 | fraction: the product is shifted right by 32 bits. | ||
88 | |||
89 | tsc_shift: signed shift applied to the tsc-related quantity | ||
90 | before the multiplication, so that the multiplication by | ||
91 | tsc_to_system_mul does not overflow. A positive value denotes | ||
92 | a left shift, a negative value a right shift. | ||
93 | |||
94 | With this information, guests can derive per-CPU time by | ||
95 | doing: | ||
96 | |||
97 | time = (current_tsc - tsc_timestamp) | ||
98 | time = (tsc_shift >= 0) ? time << tsc_shift : time >> -tsc_shift | ||
99 | time = ((time * tsc_to_system_mul) >> 32) + system_time | ||
100 | |||
101 | flags: bits in this field indicate extended capabilities | ||
102 | coordinated between the guest and the hypervisor. Availability | ||
103 | of specific flags has to be checked in 0x40000001 cpuid leaf. | ||
104 | Current flags are: | ||
105 | |||
106 | flag bit | cpuid bit | meaning | ||
107 | ------------------------------------------------------------- | ||
108 | | | time measures taken across | ||
109 | 0 | 24 | multiple cpus are guaranteed to | ||
110 | | | be monotonic | ||
111 | ------------------------------------------------------------- | ||
112 | |||
113 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | ||
114 | leaf prior to usage. | ||
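
A sketch of the tsc-to-nanoseconds conversion follows. It assumes the
scaling convention used by Linux's pvclock code: tsc_shift is a signed shift
applied to the tsc delta first, and tsc_to_system_mul is a 32-bit binary
fraction, so the product is shifted right by 32 bits. The function names are
hypothetical, and the version check is omitted here for brevity.

```c
#include <stdint.h>

/* Scale a tsc delta to nanoseconds: (delta shifted) * mul / 2^32. */
static uint64_t pvclock_scale(uint64_t delta, uint32_t mul, int8_t shift)
{
	uint64_t hi, lo;

	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;

	/* 64x32 multiply, then >> 32, done in two halves to avoid overflow. */
	hi = (delta >> 32) * mul;
	lo = (delta & 0xffffffffULL) * mul;
	return hi + (lo >> 32);
}

/* Per-CPU time in nanoseconds from one snapshot of the structure. */
static uint64_t pvclock_time_ns(uint64_t current_tsc, uint64_t tsc_timestamp,
				uint64_t system_time, uint32_t mul,
				int8_t shift)
{
	return system_time +
	       pvclock_scale(current_tsc - tsc_timestamp, mul, shift);
}
```

For example, with tsc_to_system_mul = 0x80000000 (one half) and
tsc_shift = 1, each tsc cycle counts as exactly one nanosecond, which
describes a 1 GHz tsc.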
115 | |||
116 | |||
117 | MSR_KVM_WALL_CLOCK: 0x11 | ||
118 | |||
119 | data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. | ||
120 | |||
121 | This MSR falls outside the reserved KVM range and may be removed in the | ||
122 | future. Its usage is deprecated. | ||
123 | |||
124 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | ||
125 | leaf prior to usage. | ||
126 | |||
127 | MSR_KVM_SYSTEM_TIME: 0x12 | ||
128 | |||
129 | data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. | ||
130 | |||
131 | This MSR falls outside the reserved KVM range and may be removed in the | ||
132 | future. Its usage is deprecated. | ||
133 | |||
134 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | ||
135 | leaf prior to usage. | ||
136 | |||
137 | The suggested algorithm for detecting kvmclock presence is then: | ||
138 | |||
139 | if (!kvm_para_available()) /* refer to cpuid.txt */ | ||
140 | return NON_PRESENT; | ||
141 | |||
142 | flags = cpuid_eax(0x40000001); | ||
143 | if (flags & (1 << 3)) { /* new MSRs: cpuid bit 3 */ | ||
144 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; | ||
145 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; | ||
146 | return PRESENT; | ||
147 | } else if (flags & (1 << 0)) { /* deprecated MSRs: cpuid bit 0 */ | ||
148 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; | ||
149 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; | ||
150 | return PRESENT; | ||
151 | } else | ||
152 | return NON_PRESENT; | ||
153 | |||
154 | MSR_KVM_ASYNC_PF_EN: 0x4b564d02 | ||
155 | data: Bits 63-6 hold the 64-byte aligned physical address of a | ||
156 | 64 byte memory area which must be in guest RAM and must be | ||
157 | zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1 | ||
158 | when asynchronous page faults are enabled on the vcpu, 0 when | ||
159 | disabled. Bit 1 is 1 if asynchronous page faults can be injected | ||
160 | when vcpu is in cpl == 0. | ||
161 | |||
162 | The first 4 bytes of the 64-byte memory location will be written | ||
163 | to by the hypervisor at the time of asynchronous page fault (APF) | ||
164 | injection to indicate the type of asynchronous page fault. A value | ||
165 | of 1 means that the page referred to by the page fault is not | ||
166 | present. A value of 2 means that the page is now available. | ||
167 | Disabling interrupts inhibits APFs. The guest must not enable | ||
168 | interrupts before the reason is read, or it may be overwritten by | ||
169 | another APF. Since an APF uses the same exception vector as a | ||
170 | regular page fault, the guest must reset the reason to 0 before it | ||
171 | does something that can generate a normal page fault. If the APF | ||
172 | reason is 0 during a page fault, it is a regular page | ||
173 | fault. | ||
174 | |||
175 | During delivery of a type 1 APF, cr2 contains a token that will | ||
176 | be used to notify the guest when the missing page becomes | ||
177 | available. When the page becomes available, a type 2 APF is sent | ||
178 | with cr2 set to the token associated with the page. There is a | ||
179 | special token, 0xffffffff, which tells the vcpu that it should wake | ||
180 | up all processes waiting for APFs and that no individual type 2 | ||
181 | APFs will be sent. | ||
182 | |||
183 | If APF is disabled while there are outstanding APFs, they will | ||
184 | not be delivered. | ||
185 | |||
186 | Currently a type 2 APF will always be delivered on the same vcpu as | ||
187 | the type 1 APF was, but the guest should not rely on that. | ||
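
The guest side of this protocol can be sketched as a classifier called at
the top of the page fault handler, with interrupts still disabled. The enum
and function names below are hypothetical; only the reason values 0, 1 and 2
and the token 0xffffffff come from the description above.

```c
#include <stdint.h>

#define APF_TOKEN_WAKE_ALL 0xffffffffULL

enum apf_action {
	APF_REGULAR_FAULT,  /* reason 0: handle as a normal page fault */
	APF_WAIT,           /* reason 1: block the task, keyed by token */
	APF_WAKE_ONE,       /* reason 2: wake the task waiting on token */
	APF_WAKE_ALL,       /* reason 2 with the special token */
	APF_UNKNOWN,        /* any other value: not described here */
};

/*
 * Read and reset the reason word (the first 4 bytes of the shared area),
 * then decide what to do. cr2 carries the token for APF reasons 1 and 2.
 */
static enum apf_action apf_classify(volatile uint32_t *reason_word,
				    uint64_t cr2)
{
	uint32_t reason = *reason_word;

	*reason_word = 0;   /* reset before anything can fault again */

	switch (reason) {
	case 0:
		return APF_REGULAR_FAULT;
	case 1:
		return APF_WAIT;
	case 2:
		return cr2 == APF_TOKEN_WAKE_ALL ? APF_WAKE_ALL
						 : APF_WAKE_ONE;
	default:
		return APF_UNKNOWN;
	}
}
```

Resetting the reason word immediately after reading it is what lets a later
fault with reason 0 be recognized as a regular page fault.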
diff --git a/Documentation/virtual/kvm/ppc-pv.txt b/Documentation/virtual/kvm/ppc-pv.txt new file mode 100644 index 000000000000..3ab969c59046 --- /dev/null +++ b/Documentation/virtual/kvm/ppc-pv.txt | |||
@@ -0,0 +1,196 @@ | |||
1 | The PPC KVM paravirtual interface | ||
2 | ================================= | ||
3 | |||
4 | The basic execution principle by which KVM on PowerPC works is to run all kernel | ||
5 | space code in PR=1 which is user space. This way we trap all privileged | ||
6 | instructions and can emulate them accordingly. | ||
7 | |||
8 | Unfortunately that is also the downfall. There are quite a few privileged | ||
9 | instructions that needlessly return us to the hypervisor even though they | ||
10 | could be handled differently. | ||
11 | |||
12 | This is what the PPC PV interface helps with. It takes privileged instructions | ||
13 | and transforms them into unprivileged ones with some help from the hypervisor. | ||
14 | This cuts down virtualization costs by about 50% on some of my benchmarks. | ||
15 | |||
16 | The code for that interface can be found in arch/powerpc/kernel/kvm* | ||
17 | |||
18 | Querying for existence | ||
19 | ====================== | ||
20 | |||
21 | To find out if we're running on KVM or not, we leverage the device tree. When | ||
22 | Linux is running on KVM, a node /hypervisor exists. That node contains a | ||
23 | compatible property with the value "linux,kvm". | ||
24 | |||
25 | Once you have determined you're running under a PV capable KVM, you can now use | ||
26 | hypercalls as described below. | ||
27 | |||
28 | KVM hypercalls | ||
29 | ============== | ||
30 | |||
31 | Inside the device tree's /hypervisor node there's a property called | ||
32 | 'hypercall-instructions'. This property contains at most 4 opcodes that make | ||
33 | up the hypercall. To call a hypercall, just call these instructions. | ||
34 | |||
35 | The parameters are as follows: | ||
36 | |||
37 | Register IN OUT | ||
38 | |||
39 | r0 - volatile | ||
40 | r3 1st parameter Return code | ||
41 | r4 2nd parameter 1st output value | ||
42 | r5 3rd parameter 2nd output value | ||
43 | r6 4th parameter 3rd output value | ||
44 | r7 5th parameter 4th output value | ||
45 | r8 6th parameter 5th output value | ||
46 | r9 7th parameter 6th output value | ||
47 | r10 8th parameter 7th output value | ||
48 | r11 hypercall number 8th output value | ||
49 | r12 - volatile | ||
50 | |||
51 | Hypercall definitions are shared in generic code, so the same hypercall numbers | ||
52 | apply for x86 and powerpc alike with the exception that each KVM hypercall | ||
53 | also needs to be ORed with the KVM vendor code which is (42 << 16). | ||
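
As a small illustration of the vendor code rule, the value loaded into r11
can be formed as shown below. The macro and function names are hypothetical;
only the constant 42 and the shift by 16 come from the text above.

```c
#include <stdint.h>

#define SKETCH_KVM_VENDOR_CODE 42u

/* Combine a generic KVM hypercall number with the KVM vendor code. */
static inline uint32_t ppc_kvm_hcall_nr(uint32_t generic_nr)
{
	return (SKETCH_KVM_VENDOR_CODE << 16) | generic_nr;
}
```

The generic number must fit in the low 16 bits for the vendor code to remain
distinguishable.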
54 | |||
55 | Return codes can be as follows: | ||
56 | |||
57 | Code Meaning | ||
58 | |||
59 | 0 Success | ||
60 | 12 Hypercall not implemented | ||
61 | <0 Error | ||
62 | |||
63 | The magic page | ||
64 | ============== | ||
65 | |||
66 | To enable communication between the hypervisor and guest there is a new shared | ||
67 | page that contains parts of supervisor visible register state. The guest can | ||
68 | map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE. | ||
69 | |||
70 | With this hypercall issued the guest always gets the magic page mapped at the | ||
71 | desired location in effective and physical address space. For now, we always | ||
72 | map the page to -4096. This way we can access it using absolute load and store | ||
73 | functions. The following instruction reads the first field of the magic page: | ||
74 | |||
75 | ld rX, -4096(0) | ||
76 | |||
77 | The interface is designed to be extensible should there be need later to add | ||
78 | additional registers to the magic page. If you add fields to the magic page, | ||
79 | also define a new hypercall feature to indicate that the host can give you more | ||
80 | registers. Only if the host supports the additional features, make use of them. | ||
81 | |||
82 | The magic page has the following layout as described in | ||
83 | arch/powerpc/include/asm/kvm_para.h: | ||
84 | |||
85 | struct kvm_vcpu_arch_shared { | ||
86 | __u64 scratch1; | ||
87 | __u64 scratch2; | ||
88 | __u64 scratch3; | ||
89 | __u64 critical; /* Guest may not get interrupts if == r1 */ | ||
90 | __u64 sprg0; | ||
91 | __u64 sprg1; | ||
92 | __u64 sprg2; | ||
93 | __u64 sprg3; | ||
94 | __u64 srr0; | ||
95 | __u64 srr1; | ||
96 | __u64 dar; | ||
97 | __u64 msr; | ||
98 | __u32 dsisr; | ||
99 | __u32 int_pending; /* Tells the guest if we have an interrupt */ | ||
100 | }; | ||
101 | |||
102 | Additions to the page must only occur at the end. Struct fields are always 32 | ||
103 | or 64 bit aligned, depending on them being 32 or 64 bit wide respectively. | ||
104 | |||
105 | Magic page features | ||
106 | =================== | ||
107 | |||
108 | When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE, | ||
109 | a second return value is passed to the guest. This second return value contains | ||
110 | a bitmap of available features inside the magic page. | ||
111 | |||
112 | The following enhancements to the magic page are currently available: | ||
113 | |||
114 | KVM_MAGIC_FEAT_SR Maps SR registers r/w in the magic page | ||
115 | |||
116 | For enhanced features in the magic page, please check for the existence of the | ||
117 | feature before using them! | ||
118 | |||
119 | MSR bits | ||
120 | ======== | ||
121 | |||
122 | The MSR contains bits that require hypervisor intervention and bits that do | ||
123 | not require direct hypervisor intervention because they only get interpreted | ||
124 | when entering the guest or don't have any impact on the hypervisor's behavior. | ||
125 | |||
126 | The following bits are safe to be set inside the guest: | ||
127 | |||
128 | MSR_EE | ||
129 | MSR_RI | ||
130 | MSR_CR | ||
131 | MSR_ME | ||
132 | |||
133 | If any other bit changes in the MSR, please still use mtmsr(d). | ||
134 | |||
135 | Patched instructions | ||
136 | ==================== | ||
137 | |||
138 | The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions | ||
139 | respectively on 32 bit systems with an added offset of 4 to account for big | ||
140 | endianness. | ||
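
The displacement used by a patched load or store can be derived from the
magic page layout with offsetof(). The sketch below assumes the page is
mapped at -4096 as described earlier and copies the structure from this
document with fixed-width types; the macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct kvm_vcpu_arch_shared {
	uint64_t scratch1;
	uint64_t scratch2;
	uint64_t scratch3;
	uint64_t critical;
	uint64_t sprg0;
	uint64_t sprg1;
	uint64_t sprg2;
	uint64_t sprg3;
	uint64_t srr0;
	uint64_t srr1;
	uint64_t dar;
	uint64_t msr;
	uint32_t dsisr;
	uint32_t int_pending;
};

#define MAGIC_PAGE_BASE (-4096)

/* Displacement for "ld"/"std" on a field (64 bit guests). */
#define MAGIC_DISP(field) \
	(MAGIC_PAGE_BASE + (int)offsetof(struct kvm_vcpu_arch_shared, field))

/*
 * Displacement for "lwz"/"stw" on a 64-bit field (32 bit guests): the low
 * word of a big endian 64-bit value lives 4 bytes further on.
 */
#define MAGIC_DISP_LOW32(field) (MAGIC_DISP(field) + 4)
```

The extra offset of 4 applies only to the 64-bit fields; dsisr and
int_pending are already 32 bits wide and are accessed at MAGIC_DISP().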
141 | |||
142 | The following is a list of mappings the Linux kernel performs when running as | ||
143 | guest. Implementing any of those mappings is optional, as the instruction traps | ||
144 | also act on the shared page. So calling privileged instructions still works as | ||
145 | before. | ||
146 | |||
147 | From To | ||
148 | ==== == | ||
149 | |||
150 | mfmsr rX ld rX, magic_page->msr | ||
151 | mfsprg rX, 0 ld rX, magic_page->sprg0 | ||
152 | mfsprg rX, 1 ld rX, magic_page->sprg1 | ||
153 | mfsprg rX, 2 ld rX, magic_page->sprg2 | ||
154 | mfsprg rX, 3 ld rX, magic_page->sprg3 | ||
155 | mfsrr0 rX ld rX, magic_page->srr0 | ||
156 | mfsrr1 rX ld rX, magic_page->srr1 | ||
157 | mfdar rX ld rX, magic_page->dar | ||
158 | mfdsisr rX lwz rX, magic_page->dsisr | ||
159 | |||
160 | mtmsr rX std rX, magic_page->msr | ||
161 | mtsprg 0, rX std rX, magic_page->sprg0 | ||
162 | mtsprg 1, rX std rX, magic_page->sprg1 | ||
163 | mtsprg 2, rX std rX, magic_page->sprg2 | ||
164 | mtsprg 3, rX std rX, magic_page->sprg3 | ||
165 | mtsrr0 rX std rX, magic_page->srr0 | ||
166 | mtsrr1 rX std rX, magic_page->srr1 | ||
167 | mtdar rX std rX, magic_page->dar | ||
168 | mtdsisr rX stw rX, magic_page->dsisr | ||
169 | |||
170 | tlbsync nop | ||
171 | |||
172 | mtmsrd rX, 0 b <special mtmsr section> | ||
173 | mtmsr rX b <special mtmsr section> | ||
174 | |||
175 | mtmsrd rX, 1 b <special mtmsrd section> | ||
176 | |||
177 | [Book3S only] | ||
178 | mtsrin rX, rY b <special mtsrin section> | ||
179 | |||
180 | [BookE only] | ||
181 | wrteei [0|1] b <special wrteei section> | ||
182 | |||
183 | |||
184 | Some instructions require more logic to determine what's going on than a load | ||
185 | or store instruction can deliver. To enable patching of those, we keep some | ||
186 | RAM around where we can live translate instructions to. What happens is the | ||
187 | following: | ||
188 | |||
189 | 1) copy emulation code to memory | ||
190 | 2) patch that code to fit the emulated instruction | ||
191 | 3) patch that code to return to the original pc + 4 | ||
192 | 4) patch the original instruction to branch to the new code | ||
193 | |||
194 | That way we can inject an arbitrary amount of code as replacement for a single | ||
195 | instruction. This allows us to check for pending interrupts when setting EE=1 | ||
196 | for example. | ||
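
Steps 3 and 4 both come down to encoding a relative PowerPC branch between
two addresses. The sketch below builds an unconditional "b" instruction
(I-form, primary opcode 18, AA=0, LK=0) and rejects targets outside the
26-bit signed displacement range. The function name is hypothetical and the
code is not taken from arch/powerpc/kernel/kvm*.

```c
#include <stdint.h>

/*
 * Encode "b target" for an instruction located at address `from`.
 * Returns 0 and stores the opcode in *insn, or -1 if the target is
 * unaligned or out of range for a relative branch.
 */
static int ppc_make_branch(uint32_t from, uint32_t to, uint32_t *insn)
{
	int32_t off = (int32_t)(to - from);

	if (off & 3)
		return -1;      /* instructions are word aligned */
	if (off < -0x02000000 || off > 0x01fffffc)
		return -1;      /* does not fit the 26-bit displacement */

	*insn = 0x48000000u | ((uint32_t)off & 0x03fffffcu);
	return 0;
}
```

With this helper, step 3 would encode a branch from the end of the copied
emulation code to the original pc + 4, and step 4 a branch from the original
instruction to the start of that code.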
diff --git a/Documentation/virtual/kvm/review-checklist.txt b/Documentation/virtual/kvm/review-checklist.txt new file mode 100644 index 000000000000..a850986ed684 --- /dev/null +++ b/Documentation/virtual/kvm/review-checklist.txt | |||
@@ -0,0 +1,38 @@ | |||
1 | Review checklist for kvm patches | ||
2 | ================================ | ||
3 | |||
4 | 1. The patch must follow Documentation/CodingStyle and | ||
5 | Documentation/SubmittingPatches. | ||
6 | |||
7 | 2. Patches should be against kvm.git master branch. | ||
8 | |||
9 | 3. If the patch introduces or modifies a new userspace API: | ||
10 | - the API must be documented in Documentation/virtual/kvm/api.txt | ||
11 | - the API must be discoverable using KVM_CHECK_EXTENSION | ||
12 | |||
13 | 4. New state must include support for save/restore. | ||
14 | |||
15 | 5. New features must default to off (userspace should explicitly request them). | ||
16 | Performance improvements can and should default to on. | ||
17 | |||
18 | 6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID | ||
19 | |||
20 | 7. Emulator changes should be accompanied by unit tests for qemu-kvm.git | ||
21 | kvm/test directory. | ||
22 | |||
23 | 8. Changes should be vendor neutral when possible. Changes to common code | ||
24 | are better than duplicating changes to vendor code. | ||
25 | |||
26 | 9. Similarly, prefer changes to arch independent code over changes to arch | ||
27 | dependent code. | ||
28 | |||
29 | 10. User/kernel interfaces and guest/host interfaces must be 64-bit clean | ||
30 | (all variables and sizes naturally aligned on 64-bit; use specific types | ||
31 | only - u64 rather than ulong). | ||
32 | |||
33 | 11. New guest visible features must either be documented in a hardware manual | ||
34 | or be accompanied by documentation. | ||
35 | |||
36 | 12. Features must be robust against reset and kexec - for example, shared | ||
37 | host/guest memory must be unshared to prevent the host from writing to | ||
38 | guest memory that the guest has not reserved for this purpose. | ||
diff --git a/Documentation/virtual/kvm/timekeeping.txt b/Documentation/virtual/kvm/timekeeping.txt new file mode 100644 index 000000000000..df8946377cb6 --- /dev/null +++ b/Documentation/virtual/kvm/timekeeping.txt | |||
@@ -0,0 +1,612 @@ | |||
1 | |||
2 | Timekeeping Virtualization for X86-Based Architectures | ||
3 | |||
4 | Zachary Amsden <zamsden@redhat.com> | ||
5 | Copyright (c) 2010, Red Hat. All rights reserved. | ||
6 | |||
7 | 1) Overview | ||
8 | 2) Timing Devices | ||
9 | 3) TSC Hardware | ||
10 | 4) Virtualization Problems | ||
11 | |||
12 | ========================================================================= | ||
13 | |||
14 | 1) Overview | ||
15 | |||
16 | One of the most complicated parts of the X86 platform, and specifically, | ||
17 | the virtualization of this platform is the plethora of timing devices available | ||
18 | and the complexity of emulating those devices. In addition, virtualization of | ||
19 | time introduces a new set of challenges because it introduces a multiplexed | ||
20 | division of time beyond the control of the guest CPU. | ||
21 | |||
22 | First, we will describe the various timekeeping hardware available, then | ||
23 | present some of the problems which arise and solutions available, giving | ||
24 | specific recommendations for certain classes of KVM guests. | ||
25 | |||
26 | The purpose of this document is to collect data and information relevant to | ||
27 | timekeeping which may be difficult to find elsewhere, specifically, | ||
28 | information relevant to KVM and hardware-based virtualization. | ||
29 | |||
30 | ========================================================================= | ||
31 | |||
32 | 2) Timing Devices | ||
33 | |||
34 | First we discuss the basic hardware devices available. TSC and the related | ||
35 | KVM clock are special enough to warrant a full exposition and are described in | ||
36 | the following section. | ||
37 | |||
38 | 2.1) i8254 - PIT | ||
39 | |||
40 | One of the first timer devices available is the programmable interval timer, | ||
41 | or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three | ||
42 | channels which can be programmed to deliver periodic or one-shot interrupts. | ||
43 | These three channels can be configured in different modes and have individual | ||
44 | counters. Channel 1 and 2 were not available for general use in the original | ||
45 | IBM PC, and historically were connected to control RAM refresh and the PC | ||
46 | speaker. Now the PIT is typically integrated as part of an emulated chipset | ||
47 | and a separate physical PIT is not used. | ||
48 | |||
49 | The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done | ||
50 | using single or multiple byte access to the I/O ports. There are 6 modes | ||
51 | available, but not all modes are available to all timers, as only timer 2 | ||
52 | has a connected gate input, required for modes 1 and 5. The gate line is | ||
53 | controlled by port 61h, bit 0, as illustrated in the following diagram. | ||
54 | |||
55 | -------------- ---------------- | ||
56 | | | | | | ||
57 | | 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0 | ||
58 | | Clock | | | | | ||
59 | -------------- | +->| GATE TIMER 0 | | ||
60 | | ---------------- | ||
61 | | | ||
62 | | ---------------- | ||
63 | | | | | ||
64 | |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM | ||
65 | | | | (aka /dev/null) | ||
66 | | +->| GATE TIMER 1 | | ||
67 | | ---------------- | ||
68 | | | ||
69 | | ---------------- | ||
70 | | | | | ||
71 | |------>| CLOCK OUT | ---------> Port 61h, bit 5 | ||
72 | | | | | ||
73 | Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____ | ||
74 | ---------------- _| )--|LPF|---Speaker | ||
75 | / *---- \___/ | ||
76 | Port 61h, bit 1 -----------------------------------/ | ||
77 | |||
78 | The timer modes are now described. | ||
79 | |||
80 | Mode 0: Single Timeout. This is a one-shot software timeout that counts down | ||
81 | when the gate is high (always true for timers 0 and 1). When the count | ||
82 | reaches zero, the output goes high. | ||
83 | |||
84 | Mode 1: Triggered One-shot. The output is initially set high. When the gate | ||
85 | line is set high, a countdown is initiated (which does not stop if the gate is | ||
86 | lowered), during which the output is set low. When the count reaches zero, | ||
87 | the output goes high. | ||
88 | |||
89 | Mode 2: Rate Generator. The output is initially set high. When the countdown | ||
90 | reaches 1, the output goes low for one count and then returns high. The value | ||
91 | is reloaded and the countdown automatically resumes. If the gate line goes | ||
92 | low, the count is halted. If the output is low when the gate is lowered, the | ||
93 | output automatically goes high (this only affects timer 2). | ||
94 | |||
95 | Mode 3: Square Wave. This generates a high / low square wave. The count | ||
96 | determines the length of the pulse, which alternates between high and low | ||
97 | when zero is reached. The count only proceeds when gate is high and is | ||
98 | automatically reloaded on reaching zero. The count is decremented twice at | ||
99 | each clock to generate a full high / low cycle at the full periodic rate. | ||
100 | If the count is even, the clock remains high for N/2 counts and low for N/2 | ||
101 | counts; if the count is odd, the clock is high for (N+1)/2 counts and low | ||
102 | for (N-1)/2 counts. Only even values are latched by the counter, so odd | ||
103 | values are not observed when reading. This is the intended mode for timer 2, | ||
104 | which generates sine-like tones by low-pass filtering the square wave output. | ||
105 | |||
106 | Mode 4: Software Strobe. After programming this mode and loading the counter, | ||
107 | the output remains high until the counter reaches zero. Then the output | ||
108 | goes low for 1 clock cycle and returns high. The counter is not reloaded. | ||
109 | Counting only occurs when gate is high. | ||
110 | |||
111 | Mode 5: Hardware Strobe. After programming and loading the counter, the | ||
112 | output remains high. When the gate is raised, a countdown is initiated | ||
113 | (which does not stop if the gate is lowered). When the counter reaches zero, | ||
114 | the output goes low for 1 clock cycle and then returns high. The counter is | ||
115 | not reloaded. | ||
116 | |||
117 | In addition to normal binary counting, the PIT supports BCD counting. The | ||
118 | command port, 0x43 is used to set the counter and mode for each of the three | ||
119 | timers. | ||
120 | |||
121 | PIT commands, issued to port 0x43, using the following bit encoding: | ||
122 | |||
123 | Bit 7-4: Command (See table below) | ||
124 | Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) | ||
125 | Bit 0 : Binary (0) / BCD (1) | ||
126 | |||
127 | Command table: | ||
128 | |||
129 | 0000 - Latch Timer 0 count for port 0x40 | ||
130 | sample and hold the count to be read in port 0x40; | ||
131 | additional commands ignored until counter is read; | ||
132 | mode bits ignored. | ||
133 | |||
134 | 0001 - Set Timer 0 LSB mode for port 0x40 | ||
135 | set timer to read LSB only and force MSB to zero; | ||
136 | mode bits set timer mode | ||
137 | |||
138 | 0010 - Set Timer 0 MSB mode for port 0x40 | ||
139 | set timer to read MSB only and force LSB to zero; | ||
140 | mode bits set timer mode | ||
141 | |||
142 | 0011 - Set Timer 0 16-bit mode for port 0x40 | ||
143 | set timer to read / write LSB first, then MSB; | ||
144 | mode bits set timer mode | ||
145 | |||
146 | 0100 - Latch Timer 1 count for port 0x41 - as described above | ||
147 | 0101 - Set Timer 1 LSB mode for port 0x41 - as described above | ||
148 | 0110 - Set Timer 1 MSB mode for port 0x41 - as described above | ||
149 | 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above | ||
150 | |||
151 | 1000 - Latch Timer 2 count for port 0x42 - as described above | ||
152 | 1001 - Set Timer 2 LSB mode for port 0x42 - as described above | ||
153 | 1010 - Set Timer 2 MSB mode for port 0x42 - as described above | ||
154 | 1011 - Set Timer 2 16-bit mode for port 0x42 - as described above | ||
155 | |||
156 | 1101 - General counter latch | ||
157 | Latch combination of counters into corresponding ports | ||
158 | Bit 3 = Counter 2 | ||
159 | Bit 2 = Counter 1 | ||
160 | Bit 1 = Counter 0 | ||
161 | Bit 0 = Unused | ||
162 | |||
163 | 1110 - Latch timer status | ||
164 | Latch combination of counter mode into corresponding ports | ||
165 | Bit 3 = Counter 2 | ||
166 | Bit 2 = Counter 1 | ||
167 | Bit 1 = Counter 0 | ||
168 | |||
169 | The output of ports 0x40-0x42 following this command will be: | ||
170 | |||
171 | Bit 7 = Output pin | ||
172 | Bit 6 = Count loaded (0 if timer has expired) | ||
173 | Bit 5-4 = Read / Write mode | ||
174 | 01 = LSB only | ||
175 | 10 = MSB only | ||
176 | 11 = LSB / MSB (16-bit) | ||
177 | Bit 3-1 = Mode | ||
178 | Bit 0 = Binary (0) / BCD mode (1) | ||
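As a rough illustration of the command encoding above, the following C sketch
builds a command byte for port 0x43 and picks a 16-bit reload value for a
requested interrupt rate. The names pit_command, pit_reload_for_hz and
PIT_BASE_HZ are hypothetical and not taken from any kernel source. The sketch
also assumes the usual 8254 convention that a reload value of 0 stands for
65536 in binary mode, which is not stated in the text above.

```c
#include <stdint.h>

#define PIT_BASE_HZ 1193182u  /* fixed base clock, 1.193182 MHz */

/* Read / write field of the command byte (bits 5-4). */
enum pit_access {
    PIT_LATCH = 0,  /* latch count; mode bits ignored */
    PIT_LSB   = 1,  /* LSB only */
    PIT_MSB   = 2,  /* MSB only */
    PIT_LOHI  = 3   /* LSB first, then MSB (16-bit) */
};

/* Build the byte written to port 0x43 for timer 0-2 and mode 0-5. */
static uint8_t pit_command(unsigned timer, enum pit_access access,
                           unsigned mode, int bcd)
{
    return (uint8_t)(((timer & 3u) << 6) |
                     ((unsigned)access << 4) |
                     ((mode & 7u) << 1) |
                     (bcd ? 1u : 0u));
}

/*
 * Choose the reload value closest to the requested rate.
 * Assumption: 0 encodes the maximum count of 65536, so rates too slow
 * to represent (and a request of 0 Hz) fall back to that slowest rate.
 */
static uint16_t pit_reload_for_hz(uint32_t hz)
{
    uint32_t div;

    if (hz == 0)
        return 0;
    div = (PIT_BASE_HZ + hz / 2) / hz;  /* round to nearest */
    if (div > 65535)
        return 0;
    if (div < 1)
        div = 1;
    return (uint16_t)div;
}
```

For example, pit_command(0, PIT_LOHI, 2, 0) selects timer 0 as a rate
generator with 16-bit access, and the two bytes of pit_reload_for_hz(1000)
would then be written to port 0x40, LSB first.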
179 | |||
180 | 2.2) RTC | ||
181 | |||
182 | The second device which was available in the original PC was the MC146818 real | ||
183 | time clock. The original device is now obsolete, and usually emulated by the | ||
184 | system chipset, sometimes by an HPET and some frankenstein IRQ routing. | ||
185 | |||
186 | The RTC is accessed through CMOS variables, which use an index register to | ||
187 | control which bytes are read. Since there is only one index register, read | ||
188 | of the CMOS and read of the RTC require lock protection (in addition, it is | ||
189 | dangerous to allow userspace utilities such as hwclock to have direct RTC | ||
190 | access, as they could corrupt kernel reads and writes of CMOS memory). | ||
191 | |||
192 | The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt | ||
193 | can function as a periodic timer, an additional once a day alarm, and can issue | ||
194 | interrupts after an update of the CMOS registers by the MC146818 is complete. | ||
195 | The type of interrupt is signalled in the RTC status registers. | ||
196 | |||
197 | The RTC will update the current time fields by battery power even while the | ||
198 | system is off. The current time fields should not be read while an update is | ||
199 | in progress, as indicated in the status register. | ||
200 | |||
201 | The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be | ||
202 | programmed to a 32kHz divider if the RTC is to count seconds. | ||
203 | |||
204 | This is the RAM map originally used for the RTC/CMOS: | ||
205 | |||
206 | Location Size Description | ||
207 | ------------------------------------------ | ||
208 | 00h byte Current second (BCD) | ||
209 | 01h byte Seconds alarm (BCD) | ||
210 | 02h byte Current minute (BCD) | ||
211 | 03h byte Minutes alarm (BCD) | ||
212 | 04h byte Current hour (BCD) | ||
213 | 05h byte Hours alarm (BCD) | ||
214 | 06h byte Current day of week (BCD) | ||
215 | 07h byte Current day of month (BCD) | ||
216 | 08h byte Current month (BCD) | ||
217 | 09h byte Current year (BCD) | ||
218 | 0Ah byte Register A | ||
219 | bit 7 = Update in progress | ||
220 | bit 6-4 = Divider for clock | ||
221 | 000 = 4.194 MHz | ||
222 | 001 = 1.049 MHz | ||
223 | 010 = 32 kHz | ||
224 | 10X = test modes | ||
225 | 110 = reset / disable | ||
226 | 111 = reset / disable | ||
227 | bit 3-0 = Rate selection for periodic interrupt | ||
228 | 0000 = periodic timer disabled | ||
229 | 0001 = 3.90625 mS | ||
230 | 0010 = 7.8125 mS | ||
231 | 0011 = .122070 mS | ||
232 | 0100 = .244141 mS | ||
233 | ... | ||
234 | 1101 = 125 mS | ||
235 | 1110 = 250 mS | ||
236 | 1111 = 500 mS | ||
237 | 0Bh byte Register B | ||
238 | bit 7 = Run (0) / Halt (1) | ||
239 | bit 6 = Periodic interrupt enable | ||
240 | bit 5 = Alarm interrupt enable | ||
241 | bit 4 = Update-ended interrupt enable | ||
242 | bit 3 = Square wave output enable | ||
243 | bit 2 = BCD calendar (0) / Binary (1) | ||
244 | bit 1 = 12-hour mode (0) / 24-hour mode (1) | ||
245 | bit 0 = 0 (DST off) / 1 (DST enabled) | ||
246 | 0Ch byte Register C (read only) | ||
247 | bit 7 = interrupt request flag (IRQF) | ||
248 | bit 6 = periodic interrupt flag (PF) | ||
249 | bit 5 = alarm interrupt flag (AF) | ||
250 | bit 4 = update interrupt flag (UF) | ||
251 | bit 3-0 = reserved | ||
252 | 0Dh byte Register D (read only) | ||
253 | bit 7 = RTC has power | ||
254 | bit 6-0 = reserved | ||
255 | 32h byte Current century BCD (*) | ||
256 | (*) location vendor specific and now determined from ACPI global tables | ||
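Two small decoding steps follow from the RAM map: the time fields are packed
BCD by default, and the periodic interrupt period is selected by the low four
bits of Register A. The sketch below is illustrative only; the helper names
are hypothetical, and the period formula is inferred from the table entries
under the assumption of a 32.768 kHz time base, where selections 0001 and
0010 produce the same periods as 1000 and 1001.

```c
#include <stdint.h>

/* Convert one packed BCD byte (for example 0x59) to binary (59). */
static unsigned bcd_to_bin(uint8_t bcd)
{
    return (bcd >> 4) * 10u + (bcd & 0x0Fu);
}

/*
 * Period of the periodic interrupt in nanoseconds, given the value of
 * Register A.  Returns 0 when the periodic timer is disabled.
 * Assumption: 32.768 kHz time base, so the rate is 32768 >> (sel - 1) Hz
 * for selections 3..15, and selections 1 and 2 alias 8 and 9.
 */
static uint32_t rtc_period_ns(uint8_t reg_a)
{
    unsigned sel = reg_a & 0x0Fu;

    if (sel == 0)
        return 0;
    if (sel <= 2)
        sel += 7;
    return 1000000000u / (32768u >> (sel - 1));
}
```

With the divider bits set for 32 kHz (Register A = 0x26), rtc_period_ns
yields a period just under one millisecond, corresponding to 1024 Hz.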
257 | |||
258 | 2.3) APIC | ||
259 | |||
260 | On Pentium and later processors, an on-board timer is available to each CPU | ||
261 | as part of the Advanced Programmable Interrupt Controller. The APIC is | ||
262 | accessed through memory-mapped registers and provides interrupt service to each | ||
263 | CPU, used for IPIs and local timer interrupts. | ||
264 | |||
265 | Although in theory the APIC is a safe and stable source for local interrupts, | ||
266 | in practice, many bugs and glitches have occurred due to the special nature of | ||
267 | the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect | ||
268 | the use of the APIC and that workarounds may be required. In addition, some of | ||
269 | these workarounds pose unique constraints for virtualization - requiring either | ||
270 | extra overhead incurred from extra reads of memory-mapped I/O or additional | ||
271 | functionality that may be more computationally expensive to implement. | ||
272 | |||
273 | Since the APIC is documented quite well in the Intel and AMD manuals, we will | ||
274 | avoid repetition of the detail here. It should be pointed out that the APIC | ||
275 | timer is programmed through the LVT (local vector table) timer register, is | ||
276 | capable of one-shot or periodic operation, and is based on the bus clock | ||
277 | divided down by the programmable divider register. | ||
278 | |||
279 | 2.4) HPET | ||
280 | |||
281 | HPET is quite complex, and was originally intended to replace the PIT / RTC | ||
282 | support of the X86 PC. It remains to be seen whether that will be the case, as | ||
283 | the de facto standard of PC hardware is to emulate these older devices. Some | ||
284 | systems designated as legacy free may support only the HPET as a hardware timer | ||
285 | device. | ||
286 | |||
287 | The HPET spec is rather loose and vague, requiring at least 3 hardware timers, | ||
288 | but allowing implementation freedom to support many more. It also imposes no | ||
289 | fixed rate on the timer frequency, but does impose some extremal values on | ||
290 | frequency, error and slew. | ||
291 | |||
292 | In general, the HPET is recommended as a high precision (compared to PIT / RTC) | ||
293 | time source which is independent of local variation (as there is only one HPET | ||
294 | in any given system). The HPET is also memory-mapped, and its presence is | ||
295 | indicated through ACPI tables by the BIOS. | ||
296 | |||
297 | Detailed specification of the HPET is beyond the current scope of this | ||
298 | document, as it is also very well documented elsewhere. | ||
299 | |||
300 | 2.5) Offboard Timers | ||
301 | |||
302 | Several cards, both proprietary (watchdog boards) and commonplace (e1000) have | ||
303 | timing chips built into the cards which may have registers which are accessible | ||
304 | to kernel or user drivers. To the author's knowledge, using these to generate | ||
305 | a clocksource for a Linux or other kernel has not yet been attempted and is in | ||
306 | general frowned upon as not playing by the agreed rules of the game. Such a | ||
307 | timer device would require additional support to be virtualized properly and is | ||
308 | not considered important at this time as no known operating system does this. | ||
309 | |||
310 | ========================================================================= | ||
311 | |||
312 | 3) TSC Hardware | ||
313 | |||
314 | The TSC or time stamp counter is relatively simple in theory; it counts | ||
315 | instruction cycles issued by the processor, which can be used as a measure of | ||
316 | time. In practice, due to a number of problems, it is the most complicated | ||
317 | timekeeping device to use. | ||
318 | |||
319 | The TSC is represented internally as a 64-bit MSR which can be read with the | ||
320 | RDMSR, RDTSC, or RDTSCP (when available) instructions. The TSC can also be | ||
321 | written, but on older hardware it was generally only possible to write the low | ||
322 | 32 bits of the 64-bit counter, and the upper 32 bits of the counter were | ||
323 | cleared by the write. Now, however, on Intel processors family | ||
324 | 0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction | ||
325 | has been lifted and all 64-bits are writable. On AMD systems, the ability to | ||
326 | write the TSC MSR is not an architectural guarantee. | ||
327 | |||
328 | The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by | ||
329 | means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access. | ||
330 | |||
331 | Some vendors have implemented an additional instruction, RDTSCP, which returns | ||
332 | atomically not just the TSC, but an indicator which corresponds to the | ||
333 | processor number. This can be used to index into an array of TSC variables to | ||
334 | determine offset information in SMP systems where TSCs are not synchronized. | ||
335 | The presence of this instruction must be determined by consulting CPUID feature | ||
336 | bits. | ||
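The array lookup described above can be sketched as follows. This is a
minimal illustration, not an implementation from any kernel: the function
name, the offsets array and its contents are hypothetical, and the raw TSC
value and processor number are assumed to have been obtained together from a
single RDTSCP so that they refer to the same CPU.

```c
#include <stdint.h>

/*
 * Translate a raw TSC reading into a common time base using a table of
 * per-CPU offsets (hypothetical; filled in by whatever software measured
 * the skew between TSCs).  Readings from an unknown CPU are returned
 * unchanged.
 */
static uint64_t tsc_to_common(uint64_t raw_tsc, uint32_t cpu,
                              const int64_t *offsets, uint32_t ncpus)
{
    if (cpu >= ncpus)
        return raw_tsc;
    /* Unsigned addition wraps, so negative offsets work as expected. */
    return raw_tsc + (uint64_t)offsets[cpu];
}
```

Because the counter value and the processor number are returned atomically,
the caller cannot be migrated between reading the TSC and selecting the
offset, which is the property that makes the lookup safe.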
337 | |||
338 | Both VMX and SVM provide extension fields in the virtualization hardware which | ||
339 | allow the guest visible TSC to be offset by a constant. Newer implementations | ||
340 | promise to allow the TSC to additionally be scaled, but this hardware is not | ||
341 | yet widely available. | ||
342 | |||
343 | 3.1) TSC synchronization | ||
344 | |||
345 | The TSC is a CPU-local clock in most implementations. This means, on SMP | ||
346 | platforms, the TSCs of different CPUs may start at different times depending | ||
347 | on when the CPUs are powered on. Generally, CPUs on the same die will share | ||
348 | the same clock; however, this is not always the case. | ||
349 | |||
350 | The BIOS may attempt to resynchronize the TSCs during the poweron process and | ||
351 | the operating system or other system software may attempt to do this as well. | ||
352 | Several hardware limitations make the problem worse - if it is not possible to | ||
353 | write the full 64-bits of the TSC, it may be impossible to match the TSC in | ||
354 | newly arriving CPUs to that of the rest of the system, resulting in | ||
355 | unsynchronized TSCs. Resynchronization may be attempted by BIOS or system | ||
356 | software, but in practice, perfectly synchronized TSCs are not possible unless | ||
357 | all values are read from the same clock, which generally is only possible on | ||
358 | single socket systems or those with special hardware support. | ||
359 | |||
360 | 3.2) TSC and CPU hotplug | ||
361 | |||
362 | As touched on already, CPUs which arrive later than the boot time of the system | ||
363 | may not have a TSC value that is synchronized with the rest of the system. | ||
364 | Either system software, BIOS, or SMM code may actually try to establish the TSC | ||
365 | to a value matching the rest of the system, but a perfect match is usually not | ||
366 | a guarantee. This can have the effect of bringing a system from a state where | ||
367 | TSC is synchronized back to a state where TSC synchronization flaws, however | ||
368 | small, may be exposed to the OS and any virtualization environment. | ||
369 | |||
370 | 3.3) TSC and multi-socket / NUMA | ||
371 | |||
372 | Multi-socket systems, especially large multi-socket systems, are likely to have | ||
373 | individual clocksources rather than a single, universally distributed clock. | ||
374 | Since these clocks are driven by different crystals, they will not have | ||
375 | perfectly matched frequency, and temperature and electrical variations will | ||
376 | cause the CPU clocks, and thus the TSCs to drift over time. Depending on the | ||
377 | exact clock and bus design, the drift may or may not be fixed in absolute | ||
378 | error, and may accumulate over time. | ||
379 | |||
380 | In addition, very large systems may deliberately slew the clocks of individual | ||
381 | cores. This technique, known as spread-spectrum clocking, reduces EMI at the | ||
382 | clock frequency and harmonics of it, which may be required to pass FCC | ||
383 | standards for telecommunications and computer equipment. | ||
384 | |||
385 | It is recommended not to trust the TSCs to remain synchronized on NUMA or | ||
386 | multiple socket systems for these reasons. | ||
387 | |||
388 | 3.4) TSC and C-states | ||
389 | |||
390 | C-states, or idling states of the processor, especially C1E and deeper sleep | ||
391 | states may be problematic for TSC as well. The TSC may stop advancing in such | ||
392 | a state, resulting in a TSC which is behind that of other CPUs when execution | ||
393 | is resumed. Such CPUs must be detected and flagged by the operating system | ||
394 | based on CPU and chipset identifications. | ||
395 | |||
396 | The TSC in such a case may be corrected by catching it up to a known external | ||
397 | clocksource. | ||
398 | |||
399 | 3.5) TSC frequency change / P-states | ||
400 | |||
401 | To make things slightly more interesting, some CPUs may change frequency. They | ||
402 | may or may not run the TSC at the same rate, and because the frequency change | ||
403 | may be staggered or slewed, at some points in time, the TSC rate may not be | ||
404 | known other than falling within a range of values. In this case, the TSC will | ||
405 | not be a stable time source, and must be calibrated against a known, stable, | ||
406 | external clock to be a usable source of time. | ||
407 | |||
408 | Whether the TSC runs at a constant rate or scales with the P-state is model | ||
409 | dependent and must be determined by inspecting CPUID, chipset or vendor | ||
410 | specific MSR fields. | ||
411 | |||
412 | In addition, some vendors have known bugs where the P-state is actually | ||
413 | compensated for properly during normal operation, but when the processor is | ||
414 | inactive, the P-state may be raised temporarily to service cache misses from | ||
415 | other processors. In such cases, the TSC on halted CPUs could advance faster | ||
416 | than that of non-halted processors. AMD Turion processors are known to have | ||
417 | this problem. | ||
418 | |||
419 | 3.6) TSC and STPCLK / T-states | ||
420 | |||
421 | External signals given to the processor may also have the effect of stopping | ||
422 | the TSC. This is typically done for thermal emergency power control to prevent | ||
423 | an overheating condition, and typically, there is no way to detect that this | ||
424 | condition has happened. | ||
425 | |||
426 | 3.7) TSC virtualization - VMX | ||
427 | |||
428 | VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | ||
429 | instructions, which is enough for full virtualization of TSC in any manner. In | ||
430 | addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET | ||
431 | field specified in the VMCS. Special instructions must be used to read and | ||
432 | write the VMCS field. | ||
433 | |||
434 | 3.8) TSC virtualization - SVM | ||
435 | |||
436 | SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | ||
437 | instructions, which is enough for full virtualization of TSC in any manner. In | ||
438 | addition, SVM allows passing through the host TSC plus an additional offset | ||
439 | field specified in the SVM control block. | ||
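For both VMX and SVM the passthrough case reduces to simple modular
arithmetic: the guest observes the host TSC plus the programmed offset. The
sketch below shows only that arithmetic, under the assumption that the offset
is applied as a 64-bit wrapping addition; the function names are hypothetical
and no VMCS or control block access is shown.

```c
#include <stdint.h>

/* Value the guest observes for a given host TSC and programmed offset. */
static uint64_t guest_tsc(uint64_t host_tsc, uint64_t offset)
{
    return host_tsc + offset;  /* wraps modulo 2^64 */
}

/*
 * Offset to program so that the guest observes 'desired' at the moment
 * the host TSC reads 'host_tsc_now'.  A guest TSC that lags the host
 * simply yields a large wrapped value.
 */
static uint64_t offset_for_guest_value(uint64_t host_tsc_now,
                                       uint64_t desired)
{
    return desired - host_tsc_now;
}
```

For instance, programming offset_for_guest_value(host_now, 0) when a guest
is created makes its TSC appear to start counting from zero.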
440 | |||
441 | 3.9) TSC feature bits in Linux | ||
442 | |||
443 | In summary, there is no way to guarantee the TSC remains in perfect | ||
444 | synchronization unless it is explicitly guaranteed by the architecture. Even | ||
445 | if so, the TSCs in multi-sockets or NUMA systems may still run independently | ||
446 | despite being locally consistent. | ||
447 | |||
448 | The following feature bits are used by Linux to signal various TSC attributes, | ||
449 | but they can only be taken to be meaningful for UP or single node systems. | ||
450 | |||
451 | X86_FEATURE_TSC : The TSC is available in hardware | ||
452 | X86_FEATURE_RDTSCP : The RDTSCP instruction is available | ||
453 | X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states | ||
454 | X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states | ||
455 | X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware) | ||
456 | |||
457 | 4) Virtualization Problems | ||
458 | |||
459 | Timekeeping is especially problematic for virtualization because a number of | ||
460 | challenges arise. The most obvious problem is that time is now shared between | ||
461 | the host and, potentially, a number of virtual machines. Thus the virtual | ||
462 | operating system does not run with 100% usage of the CPU, despite the fact that | ||
463 | it may very well make that assumption. It may expect it to remain true to very | ||
464 | exacting bounds when interrupt sources are disabled, but in reality only its | ||
465 | virtual interrupt sources are disabled, and the machine may still be preempted | ||
466 | at any time. This causes problems as the passage of real time, the injection | ||
467 | of machine interrupts and the associated clock sources are no longer completely | ||
468 | synchronized with real time. | ||
469 | |||
470 | This same problem can occur on native hardware to a degree, as SMM mode may | ||
471 | steal cycles from the operating system on X86 systems when it is used by the | ||
472 | BIOS, but not in such an extreme fashion. However, the fact that SMM mode may | ||
473 | cause similar problems to virtualization makes it a good justification for | ||
474 | solving many of these problems on bare metal. | ||
475 | |||
476 | 4.1) Interrupt clocking | ||
477 | |||
478 | One of the most immediate problems that occurs with legacy operating systems | ||
479 | is that the system timekeeping routines are often designed to keep track of | ||
480 | time by counting periodic interrupts. These interrupts may come from the PIT | ||
481 | or the RTC, but the problem is the same: the host virtualization engine may not | ||
482 | be able to deliver the proper number of interrupts per second, and so guest | ||
483 | time may fall behind. This is especially problematic if a high interrupt rate | ||
484 | is selected, such as 1000 HZ, which is unfortunately the default for many Linux | ||
485 | guests. | ||
486 | |||
487 | There are three approaches to solving this problem; first, it may be possible | ||
488 | to simply ignore it. Guests which have a separate time source for tracking | ||
489 | 'wall clock' or 'real time' may not need any adjustment of their interrupts to | ||
490 | maintain proper time. If this is not sufficient, it may be necessary to inject | ||
491 | additional interrupts into the guest in order to increase the effective | ||
492 | interrupt rate. This approach leads to complications in extreme conditions, | ||
493 | where host load or guest lag is too much to compensate for, and thus another | ||
494 | solution to the problem has arisen: the guest may need to become aware of lost | ||
495 | ticks and compensate for them internally. Although promising in theory, the | ||
496 | implementation of this policy in Linux has been extremely error prone, and a | ||
497 | number of buggy variants of lost tick compensation are distributed across | ||
498 | commonly used Linux systems. | ||
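The idea behind guest-side lost tick compensation can be sketched as below.
This is a simplified illustration under stated assumptions, not the policy
used by any Linux version: the struct and function names are hypothetical,
the guest is assumed to have an independent clock readable in nanoseconds,
and period_ns must be nonzero.

```c
#include <stdint.h>

struct tick_state {
    uint64_t last_ns;    /* time up to which ticks have been accounted */
    uint64_t period_ns;  /* nominal tick period, must be nonzero */
    uint64_t ticks;      /* total ticks accounted so far */
};

/*
 * Called from the timer interrupt with the current reading of an
 * independent clock.  Accounts for every whole period that has elapsed,
 * including ticks that were never delivered, and returns that number.
 * An interrupt arriving early (for example one injected by the host to
 * catch up) accounts for nothing, which avoids double counting.
 */
static uint64_t account_ticks(struct tick_state *s, uint64_t now_ns)
{
    uint64_t n = (now_ns - s->last_ns) / s->period_ns;

    s->last_ns += n * s->period_ns;
    s->ticks += n;
    return n;
}
```

Advancing last_ns by whole periods rather than setting it to now_ns keeps the
fractional remainder, so rounding error does not accumulate across calls.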
499 | |||
500 | Windows uses periodic RTC clocking as a means of keeping time internally, and | ||
501 | thus requires interrupt slewing to keep proper time. It does use a low enough | ||
502 | rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in | ||
503 | practice. | ||
504 | |||
505 | 4.2) TSC sampling and serialization | ||
506 | |||
507 | As the highest precision time source available, the cycle counter of the CPU | ||
508 | has aroused much interest from developers. As explained above, this timer has | ||
509 | many problems unique to its nature as a local, potentially unstable and | ||
510 | potentially unsynchronized source. One issue which is not unique to the TSC, | ||
511 | but is highlighted because of its very precise nature is sampling delay. By | ||
512 | definition, the counter, once read, is already old. However, it is also | ||
513 | possible for the counter to be read ahead of the actual use of the result. | ||
514 | This is a consequence of the superscalar execution of the instruction stream, | ||
515 | which may execute instructions out of order. Such execution is called | ||
516 | non-serialized. Forcing serialized execution is necessary for precise | ||
517 | measurement with the TSC, and requires a serializing instruction, such as CPUID | ||
518 | or an MSR read. | ||
519 | |||
520 | Since CPUID may actually be virtualized by a trap and emulate mechanism, this | ||
521 | serialization can pose a performance issue for hardware virtualization. An | ||
522 | accurate time stamp counter reading may therefore not always be available, and | ||
523 | it may be necessary for an implementation to guard against "backwards" reads of | ||
524 | the TSC as seen from other CPUs, even in an otherwise perfectly synchronized | ||
525 | system. | ||
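One common way to guard against "backwards" reads is to remember the largest
value handed out so far and never return anything smaller. The sketch below
shows the single-threaded form of that idea; the names are hypothetical, and
a real multi-CPU implementation would need an atomic compare-and-exchange on
the shared value, which is omitted here.

```c
#include <stdint.h>

struct mono_clock {
    uint64_t last;  /* largest value returned so far */
};

/*
 * Filter a raw counter sample so that successive results never go
 * backwards.  A sample older than the last returned value is replaced
 * by that value.  Not safe for concurrent callers as written.
 */
static uint64_t mono_read(struct mono_clock *c, uint64_t sample)
{
    if (sample > c->last)
        c->last = sample;
    return c->last;
}
```

The cost of this approach is resolution: several readers may observe the same
value until the underlying counter passes the stored maximum.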
526 | |||
527 | 4.3) Timespec aliasing | ||
528 | |||
529 | Additionally, this lack of serialization from the TSC poses another challenge | ||
530 | when using results of the TSC when measured against another time source. As | ||
531 | the TSC is much higher precision, many possible values of the TSC may be read | ||
532 | while another clock is still expressing the same value. | ||
533 | |||
534 | That is, you may read (T,T+10) while external clock C maintains the same value. | ||
535 | Due to non-serialized reads, you may actually end up with a range which | ||
536 | fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but | ||
537 | calibrated against an external value may have a range of valid values. | ||
538 | Re-calibrating this computation may actually cause time, as computed after the | ||
539 | calibration, to go backwards, compared with time computed before the | ||
540 | calibration. | ||
541 | |||
542 | This problem is particularly pronounced with an internal time source in Linux, | ||
543 | the kernel time, which is expressed in the theoretically high resolution | ||
544 | timespec - but which advances in much larger granularity intervals, sometimes | ||
545 | at the rate of jiffies, and possibly in catchup modes, at a much larger step. | ||
546 | |||
547 | This aliasing requires care in the computation and recalibration of kvmclock | ||
548 | and any other values derived from TSC computation (such as TSC virtualization | ||
549 | itself). | ||
550 | |||
551 | 4.4) Migration | ||
552 | |||
553 | Migration of a virtual machine raises problems for timekeeping in two ways. | ||
554 | First, the migration itself may take time, during which interrupts cannot be | ||
555 | delivered, and after which, the guest time may need to be caught up. NTP may | ||
556 | be able to help to some degree here, as the clock correction required is | ||
557 | typically small enough to fall in the NTP-correctable window. | ||
558 | |||
559 | An additional concern is that timers based off the TSC (or HPET, if the raw bus | ||
560 | clock is exposed) may now be running at different rates, requiring compensation | ||
561 | in some way in the hypervisor by virtualizing these timers. In addition, | ||
562 | migrating to a faster machine may preclude the use of a passthrough TSC, as a | ||
563 | faster clock cannot be made visible to a guest without the potential of time | ||
564 | advancing faster than usual. A slower clock is less of a problem, as it can | ||
565 | always be caught up to the original rate. KVM clock avoids these problems by | ||
566 | simply storing multipliers and offsets against the TSC for the guest to convert | ||
567 | back into nanosecond resolution values. | ||
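The multiplier-and-offset conversion can be sketched with 32.32 fixed-point
arithmetic. The layout below is an assumption made for illustration and is
not presented as the actual kvmclock ABI; struct clock_params and its field
names are hypothetical. The 64 x 32 bit multiply is split into two halves so
that only standard 64-bit integer arithmetic is needed.

```c
#include <stdint.h>

/* Hypothetical parameters published by the host for the guest to use. */
struct clock_params {
    uint64_t tsc_base;   /* TSC value at the reference point */
    uint64_t ns_base;    /* nanoseconds at the reference point */
    uint32_t mul_frac;   /* multiplier as a 0.32 fixed-point fraction */
    int8_t   shift;      /* pre-shift applied to the TSC delta */
};

/* Compute (delta shifted by 'shift') * mul_frac / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    /* High half contributes hi * mul exactly; low half is truncated. */
    return (delta >> 32) * mul_frac +
           (((delta & 0xFFFFFFFFu) * mul_frac) >> 32);
}

/* Convert a TSC reading to nanoseconds using the published parameters. */
static uint64_t tsc_to_ns(const struct clock_params *p, uint64_t tsc)
{
    return p->ns_base + scale_delta(tsc - p->tsc_base, p->mul_frac, p->shift);
}
```

For a 2 GHz TSC, mul_frac = 0x80000000 with shift = 0 halves the tick count;
after migration to a host with a different TSC rate, only the parameters
change while the guest keeps using the same conversion.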
568 | |||
569 | 4.5) Scheduling | ||
570 | |||
571 | Since scheduling may be based on precise timing and firing of interrupts, the | ||
572 | scheduling algorithms of an operating system may be adversely affected by | ||
573 | virtualization. In theory, the effect is random and should be uniformly | ||
574 | distributed, but in contrived as well as real scenarios (guest device access, | ||
575 | causes of virtualization exits, possible context switch), this may not always | ||
576 | be the case. The effect of this has not been well studied. | ||
577 | |||
578 | In an attempt to work around this, several implementations have provided a | ||
579 | paravirtualized scheduler clock, which reveals the true amount of CPU time for | ||
580 | which a virtual machine has been running. | ||
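
As a rough sketch of how a guest scheduler might use such a clock, assume the
host exposes a cumulative count of nanoseconds during which the vcpu was
runnable but not actually running. The structure and function names below are
hypothetical and do not correspond to any particular implementation. The guest
charges a task only for the portion of the elapsed interval in which the vcpu
really executed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sample taken by the guest scheduler at a context switch. */
struct sched_sample {
	uint64_t wall_ns;   /* guest-visible elapsed time */
	uint64_t stolen_ns; /* cumulative time the vcpu was not running */
};

/* CPU time to charge to a task that ran between two samples. */
static uint64_t charged_runtime_ns(const struct sched_sample *start,
				   const struct sched_sample *end)
{
	assert(end->wall_ns >= start->wall_ns);
	assert(end->stolen_ns >= start->stolen_ns);

	uint64_t wall = end->wall_ns - start->wall_ns;
	uint64_t stolen = end->stolen_ns - start->stolen_ns;

	/* Clamp: the two counters may not be sampled atomically. */
	return stolen >= wall ? 0 : wall - stolen;
}
```

With this accounting, a task whose 10 ms timeslice included 4 ms during which
the vcpu was descheduled is charged 6 ms rather than the full 10 ms.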
581 | |||
582 | 4.6) Watchdogs | ||
583 | |||
584 | Watchdog timers, such as the lock detector in Linux, may fire accidentally when | ||
585 | running under hardware virtualization due to timer interrupts being delayed or | ||
586 | misinterpretation of the passage of real time. Usually, these warnings are | ||
587 | spurious and can be ignored, but in some circumstances it may be necessary to | ||
588 | disable such detection. | ||
589 | |||
590 | 4.7) Delays and precision timing | ||
591 | |||
592 | Precise timing and delays may not be possible in a virtualized system. This | ||
593 | becomes a problem if the system is controlling physical hardware, or issues delays to | ||
594 | compensate for slower I/O to and from devices. The first issue is not solvable | ||
595 | in general for a virtualized system; hardware control software can't be | ||
596 | adequately virtualized without a full real-time operating system, which would | ||
597 | require an RT-aware virtualization platform. | ||
598 | |||
599 | The second issue may cause performance problems, but this is unlikely to be a | ||
600 | significant issue. In many cases these delays may be eliminated through | ||
601 | configuration or paravirtualization. | ||
602 | |||
603 | 4.8) Covert channels and leaks | ||
604 | |||
605 | In addition to the above problems, time information about the host will | ||
606 | inevitably leak to the guest in anything but a perfect implementation of | ||
607 | virtualized time. This may allow the guest to infer the presence of a | ||
608 | hypervisor (as in a red-pill type detection), and it may allow information to | ||
609 | leak between guests by using CPU utilization itself as a signalling channel. | ||
610 | Preventing such problems would require completely isolated virtual time, which | ||
611 | may no longer track real time. This may be useful in certain security or QA | ||
612 | contexts, but in general isn't recommended for real-world deployment scenarios. | ||