author | Linus Torvalds <torvalds@linux-foundation.org> | 2010-08-04 13:43:01 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2010-08-04 13:43:01 -0400 |
commit | 5e83f6fbdb020b70c0e413312801424d13c58d68 (patch) | |
tree | ca270178fa891813dbc47751c331fed975d3766c /Documentation | |
parent | fe445c6e2cb62a566e1a89f8798de11459975710 (diff) | |
parent | 3444d7da1839b851eefedd372978d8a982316c36 (diff) |
Merge branch 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (198 commits)
KVM: VMX: Fix host GDT.LIMIT corruption
KVM: MMU: using __xchg_spte more smarter
KVM: MMU: cleanup spte set and accssed/dirty tracking
KVM: MMU: don't atomicly set spte if it's not present
KVM: MMU: fix page dirty tracking lost while sync page
KVM: MMU: fix broken page accessed tracking with ept enabled
KVM: MMU: add missing reserved bits check in speculative path
KVM: MMU: fix mmu notifier invalidate handler for huge spte
KVM: x86 emulator: fix xchg instruction emulation
KVM: x86: Call mask notifiers from pic
KVM: x86: never re-execute instruction with enabled tdp
KVM: Document KVM_GET_SUPPORTED_CPUID2 ioctl
KVM: x86: emulator: inc/dec can have lock prefix
KVM: MMU: Eliminate redundant temporaries in FNAME(fetch)
KVM: MMU: Validate all gptes during fetch, not just those used for new pages
KVM: MMU: Simplify spte fetch() function
KVM: MMU: Add gpte_valid() helper
KVM: MMU: Add validate_direct_spte() helper
KVM: MMU: Add drop_large_spte() helper
KVM: MMU: Use __set_spte to link shadow pages
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/feature-removal-schedule.txt | 21 | ||||
-rw-r--r-- | Documentation/kvm/api.txt | 208 | ||||
-rw-r--r-- | Documentation/kvm/mmu.txt | 52 | ||||
-rw-r--r-- | Documentation/kvm/msr.txt | 153 | ||||
-rw-r--r-- | Documentation/kvm/review-checklist.txt | 38 |
5 files changed, 413 insertions, 59 deletions
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 79cb554761af..b273d35039ed 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -487,17 +487,6 @@ Who: Jan Kiszka <jan.kiszka@web.de> | |||
487 | 487 | ||
488 | ---------------------------- | 488 | ---------------------------- |
489 | 489 | ||
490 | What: KVM memory aliases support | ||
491 | When: July 2010 | ||
492 | Why: Memory aliasing support is used for speeding up guest vga access | ||
493 | through the vga windows. | ||
494 | |||
495 | Modern userspace no longer uses this feature, so it's just bitrotted | ||
496 | code and can be removed with no impact. | ||
497 | Who: Avi Kivity <avi@redhat.com> | ||
498 | |||
499 | ---------------------------- | ||
500 | |||
501 | What: xtime, wall_to_monotonic | 490 | What: xtime, wall_to_monotonic |
502 | When: 2.6.36+ | 491 | When: 2.6.36+ |
503 | Files: kernel/time/timekeeping.c include/linux/time.h | 492 | Files: kernel/time/timekeeping.c include/linux/time.h |
@@ -508,16 +497,6 @@ Who: John Stultz <johnstul@us.ibm.com> | |||
508 | 497 | ||
509 | ---------------------------- | 498 | ---------------------------- |
510 | 499 | ||
511 | What: KVM kernel-allocated memory slots | ||
512 | When: July 2010 | ||
513 | Why: Since 2.6.25, kvm supports user-allocated memory slots, which are | ||
514 | much more flexible than kernel-allocated slots. All current userspace | ||
515 | supports the newer interface and this code can be removed with no | ||
516 | impact. | ||
517 | Who: Avi Kivity <avi@redhat.com> | ||
518 | |||
519 | ---------------------------- | ||
520 | |||
521 | What: KVM paravirt mmu host support | 500 | What: KVM paravirt mmu host support |
522 | When: January 2011 | 501 | When: January 2011 |
523 | Why: The paravirt mmu host support is slower than non-paravirt mmu, both | 502 | Why: The paravirt mmu host support is slower than non-paravirt mmu, both |
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index a237518e51b9..5f5b64982b1a 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -126,6 +126,10 @@ user fills in the size of the indices array in nmsrs, and in return | |||
126 | kvm adjusts nmsrs to reflect the actual number of msrs and fills in | 126 | kvm adjusts nmsrs to reflect the actual number of msrs and fills in |
127 | the indices array with their numbers. | 127 | the indices array with their numbers. |
128 | 128 | ||
129 | Note: if kvm indicates support for MCE (KVM_CAP_MCE), then the MCE bank MSRs are | ||
130 | not returned in the MSR list, as different vcpus can have a different number | ||
131 | of banks, as set via the KVM_X86_SETUP_MCE ioctl. | ||
132 | |||
129 | 4.4 KVM_CHECK_EXTENSION | 133 | 4.4 KVM_CHECK_EXTENSION |
130 | 134 | ||
131 | Capability: basic | 135 | Capability: basic |
@@ -160,29 +164,7 @@ Type: vm ioctl | |||
160 | Parameters: struct kvm_memory_region (in) | 164 | Parameters: struct kvm_memory_region (in) |
161 | Returns: 0 on success, -1 on error | 165 | Returns: 0 on success, -1 on error |
162 | 166 | ||
163 | struct kvm_memory_region { | 167 | This ioctl is obsolete and has been removed. |
164 | __u32 slot; | ||
165 | __u32 flags; | ||
166 | __u64 guest_phys_addr; | ||
167 | __u64 memory_size; /* bytes */ | ||
168 | }; | ||
169 | |||
170 | /* for kvm_memory_region::flags */ | ||
171 | #define KVM_MEM_LOG_DIRTY_PAGES 1UL | ||
172 | |||
173 | This ioctl allows the user to create or modify a guest physical memory | ||
174 | slot. When changing an existing slot, it may be moved in the guest | ||
175 | physical memory space, or its flags may be modified. It may not be | ||
176 | resized. Slots may not overlap. | ||
177 | |||
178 | The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which | ||
179 | instructs kvm to keep track of writes to memory within the slot. See | ||
180 | the KVM_GET_DIRTY_LOG ioctl. | ||
181 | |||
182 | It is recommended to use the KVM_SET_USER_MEMORY_REGION ioctl instead | ||
183 | of this API, if available. This newer API allows placing guest memory | ||
184 | at specified locations in the host address space, yielding better | ||
185 | control and easy access. | ||
186 | 168 | ||
187 | 4.6 KVM_CREATE_VCPU | 169 | 4.6 KVM_CREATE_VCPU |
188 | 170 | ||
@@ -226,17 +208,7 @@ Type: vm ioctl | |||
226 | Parameters: struct kvm_memory_alias (in) | 208 | Parameters: struct kvm_memory_alias (in) |
227 | Returns: 0 (success), -1 (error) | 209 | Returns: 0 (success), -1 (error) |
228 | 210 | ||
229 | struct kvm_memory_alias { | 211 | This ioctl is obsolete and has been removed. |
230 | __u32 slot; /* this has a different namespace than memory slots */ | ||
231 | __u32 flags; | ||
232 | __u64 guest_phys_addr; | ||
233 | __u64 memory_size; | ||
234 | __u64 target_phys_addr; | ||
235 | }; | ||
236 | |||
237 | Defines a guest physical address space region as an alias to another | ||
238 | region. Useful for aliased address, for example the VGA low memory | ||
239 | window. Should not be used with userspace memory. | ||
240 | 212 | ||
241 | 4.9 KVM_RUN | 213 | 4.9 KVM_RUN |
242 | 214 | ||
@@ -892,6 +864,174 @@ arguments. | |||
892 | This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel | 864 | This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel |
893 | irqchip, the multiprocessing state must be maintained by userspace. | 865 | irqchip, the multiprocessing state must be maintained by userspace. |
894 | 866 | ||
867 | 4.39 KVM_SET_IDENTITY_MAP_ADDR | ||
868 | |||
869 | Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR | ||
870 | Architectures: x86 | ||
871 | Type: vm ioctl | ||
872 | Parameters: unsigned long identity (in) | ||
873 | Returns: 0 on success, -1 on error | ||
874 | |||
875 | This ioctl defines the physical address of a one-page region in the guest | ||
876 | physical address space. The region must be within the first 4GB of the | ||
877 | guest physical address space and must not conflict with any memory slot | ||
878 | or any mmio address. The guest may malfunction if it accesses this memory | ||
879 | region. | ||
880 | |||
881 | This ioctl is required on Intel-based hosts. This is needed on Intel hardware | ||
882 | because of a quirk in the virtualization implementation (see the internals | ||
883 | documentation when it pops into existence). | ||
884 | |||
885 | 4.40 KVM_SET_BOOT_CPU_ID | ||
886 | |||
887 | Capability: KVM_CAP_SET_BOOT_CPU_ID | ||
888 | Architectures: x86, ia64 | ||
889 | Type: vm ioctl | ||
890 | Parameters: unsigned long vcpu_id | ||
891 | Returns: 0 on success, -1 on error | ||
892 | |||
893 | Define which vcpu is the Bootstrap Processor (BSP). Values are the same | ||
894 | as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default | ||
895 | is vcpu 0. | ||
896 | |||
897 | 4.41 KVM_GET_XSAVE | ||
898 | |||
899 | Capability: KVM_CAP_XSAVE | ||
900 | Architectures: x86 | ||
901 | Type: vcpu ioctl | ||
902 | Parameters: struct kvm_xsave (out) | ||
903 | Returns: 0 on success, -1 on error | ||
904 | |||
905 | struct kvm_xsave { | ||
906 | __u32 region[1024]; | ||
907 | }; | ||
908 | |||
909 | This ioctl copies the current vcpu's xsave struct to userspace. | ||
910 | |||
911 | 4.42 KVM_SET_XSAVE | ||
912 | |||
913 | Capability: KVM_CAP_XSAVE | ||
914 | Architectures: x86 | ||
915 | Type: vcpu ioctl | ||
916 | Parameters: struct kvm_xsave (in) | ||
917 | Returns: 0 on success, -1 on error | ||
918 | |||
919 | struct kvm_xsave { | ||
920 | __u32 region[1024]; | ||
921 | }; | ||
922 | |||
923 | This ioctl copies userspace's xsave struct to the kernel. | ||
924 | |||
925 | 4.43 KVM_GET_XCRS | ||
926 | |||
927 | Capability: KVM_CAP_XCRS | ||
928 | Architectures: x86 | ||
929 | Type: vcpu ioctl | ||
930 | Parameters: struct kvm_xcrs (out) | ||
931 | Returns: 0 on success, -1 on error | ||
932 | |||
933 | struct kvm_xcr { | ||
934 | __u32 xcr; | ||
935 | __u32 reserved; | ||
936 | __u64 value; | ||
937 | }; | ||
938 | |||
939 | struct kvm_xcrs { | ||
940 | __u32 nr_xcrs; | ||
941 | __u32 flags; | ||
942 | struct kvm_xcr xcrs[KVM_MAX_XCRS]; | ||
943 | __u64 padding[16]; | ||
944 | }; | ||
945 | |||
946 | This ioctl copies the current vcpu's xcrs to userspace. | ||
947 | |||
948 | 4.44 KVM_SET_XCRS | ||
949 | |||
950 | Capability: KVM_CAP_XCRS | ||
951 | Architectures: x86 | ||
952 | Type: vcpu ioctl | ||
953 | Parameters: struct kvm_xcrs (in) | ||
954 | Returns: 0 on success, -1 on error | ||
955 | |||
956 | struct kvm_xcr { | ||
957 | __u32 xcr; | ||
958 | __u32 reserved; | ||
959 | __u64 value; | ||
960 | }; | ||
961 | |||
962 | struct kvm_xcrs { | ||
963 | __u32 nr_xcrs; | ||
964 | __u32 flags; | ||
965 | struct kvm_xcr xcrs[KVM_MAX_XCRS]; | ||
966 | __u64 padding[16]; | ||
967 | }; | ||
968 | |||
969 | This ioctl sets the vcpu's xcrs to the values specified by userspace. | ||
970 | |||
971 | 4.45 KVM_GET_SUPPORTED_CPUID | ||
972 | |||
973 | Capability: KVM_CAP_EXT_CPUID | ||
974 | Architectures: x86 | ||
975 | Type: system ioctl | ||
976 | Parameters: struct kvm_cpuid2 (in/out) | ||
977 | Returns: 0 on success, -1 on error | ||
978 | |||
979 | struct kvm_cpuid2 { | ||
980 | __u32 nent; | ||
981 | __u32 padding; | ||
982 | struct kvm_cpuid_entry2 entries[0]; | ||
983 | }; | ||
984 | |||
985 | #define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1 | ||
986 | #define KVM_CPUID_FLAG_STATEFUL_FUNC 2 | ||
987 | #define KVM_CPUID_FLAG_STATE_READ_NEXT 4 | ||
988 | |||
989 | struct kvm_cpuid_entry2 { | ||
990 | __u32 function; | ||
991 | __u32 index; | ||
992 | __u32 flags; | ||
993 | __u32 eax; | ||
994 | __u32 ebx; | ||
995 | __u32 ecx; | ||
996 | __u32 edx; | ||
997 | __u32 padding[3]; | ||
998 | }; | ||
999 | |||
1000 | This ioctl returns x86 cpuid features which are supported by both the hardware | ||
1001 | and kvm. Userspace can use the information returned by this ioctl to | ||
1002 | construct cpuid information (for KVM_SET_CPUID2) that is consistent with | ||
1003 | hardware, kernel, and userspace capabilities, and with user requirements (for | ||
1004 | example, the user may wish to constrain cpuid to emulate older hardware, | ||
1005 | or for feature consistency across a cluster). | ||
1006 | |||
1007 | Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure | ||
1008 | with the 'nent' field indicating the number of entries in the variable-size | ||
1009 | array 'entries'. If the number of entries is too low to describe the cpu | ||
1010 | capabilities, an error (E2BIG) is returned. If the number is too high, | ||
1011 | the 'nent' field is adjusted and an error (ENOMEM) is returned. If the | ||
1012 | number is just right, the 'nent' field is adjusted to the number of valid | ||
1013 | entries in the 'entries' array, which is then filled. | ||
1014 | |||
1015 | The entries returned are the host cpuid as returned by the cpuid instruction, | ||
1016 | with unknown or unsupported features masked out. The fields in each entry | ||
1017 | are defined as follows: | ||
1018 | |||
1019 | function: the eax value used to obtain the entry | ||
1020 | index: the ecx value used to obtain the entry (for entries that are | ||
1021 | affected by ecx) | ||
1022 | flags: an OR of zero or more of the following: | ||
1023 | KVM_CPUID_FLAG_SIGNIFCANT_INDEX: | ||
1024 | if the index field is valid | ||
1025 | KVM_CPUID_FLAG_STATEFUL_FUNC: | ||
1026 | if cpuid for this function returns different values for successive | ||
1027 | invocations; there will be several entries with the same function, | ||
1028 | all with this flag set | ||
1029 | KVM_CPUID_FLAG_STATE_READ_NEXT: | ||
1030 | for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is | ||
1031 | the first entry to be read by a cpu | ||
1032 | eax, ebx, ecx, edx: the values returned by the cpuid instruction for | ||
1033 | this function/index combination | ||
1034 | |||
895 | 5. The kvm_run structure | 1035 | 5. The kvm_run structure |
896 | 1036 | ||
897 | Application code obtains a pointer to the kvm_run structure by | 1037 | Application code obtains a pointer to the kvm_run structure by |
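The field descriptions for KVM_GET_SUPPORTED_CPUID in the api.txt hunk above say that 'index' is meaningful only when KVM_CPUID_FLAG_SIGNIFCANT_INDEX is set. A minimal sketch of a lookup over a returned 'entries' array, under that reading, might look like the following. The struct and flag are local mirrors of the definitions quoted in the hunk (real code would include <linux/kvm.h> instead), and find_cpuid_entry() is a hypothetical helper, not part of the KVM API. Stateful entries (KVM_CPUID_FLAG_STATEFUL_FUNC) are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of KVM_CPUID_FLAG_SIGNIFCANT_INDEX (value 1 in the text). */
#define CPUID_FLAG_SIGNIFICANT_INDEX 1u

/* Local mirror of struct kvm_cpuid_entry2 as quoted in the api.txt text. */
struct cpuid_entry2 {
	uint32_t function;
	uint32_t index;
	uint32_t flags;
	uint32_t eax;
	uint32_t ebx;
	uint32_t ecx;
	uint32_t edx;
	uint32_t padding[3];
};

/*
 * Hypothetical helper: return the first entry matching (function, index),
 * or NULL if there is none. The index is compared only for entries that
 * mark it as significant; for all other entries it is ignored.
 */
const struct cpuid_entry2 *find_cpuid_entry(const struct cpuid_entry2 *entries,
					    uint32_t nent,
					    uint32_t function, uint32_t index)
{
	assert(entries != NULL || nent == 0);

	for (uint32_t i = 0; i < nent; i++) {
		const struct cpuid_entry2 *e = &entries[i];

		if (e->function != function)
			continue;
		if ((e->flags & CPUID_FLAG_SIGNIFICANT_INDEX) && e->index != index)
			continue;
		return e;
	}
	return NULL;
}
```

Userspace could run such a lookup over the array filled in by the ioctl before deciding which bits to pass on through KVM_SET_CPUID2.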
diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt
index aaed6ab9d7ab..142cc5136650 100644
--- a/Documentation/kvm/mmu.txt
+++ b/Documentation/kvm/mmu.txt
@@ -77,10 +77,10 @@ Memory | |||
77 | 77 | ||
78 | Guest memory (gpa) is part of the user address space of the process that is | 78 | Guest memory (gpa) is part of the user address space of the process that is |
79 | using kvm. Userspace defines the translation between guest addresses and user | 79 | using kvm. Userspace defines the translation between guest addresses and user |
80 | addresses (gpa->hva); note that two gpas may alias to the same gva, but not | 80 | addresses (gpa->hva); note that two gpas may alias to the same hva, but not |
81 | vice versa. | 81 | vice versa. |
82 | 82 | ||
83 | These gvas may be backed using any method available to the host: anonymous | 83 | These hvas may be backed using any method available to the host: anonymous |
84 | memory, file backed memory, and device memory. Memory might be paged by the | 84 | memory, file backed memory, and device memory. Memory might be paged by the |
85 | host at any time. | 85 | host at any time. |
86 | 86 | ||
@@ -161,7 +161,7 @@ Shadow pages contain the following information: | |||
161 | role.cr4_pae: | 161 | role.cr4_pae: |
162 | Contains the value of cr4.pae for which the page is valid (e.g. whether | 162 | Contains the value of cr4.pae for which the page is valid (e.g. whether |
163 | 32-bit or 64-bit gptes are in use). | 163 | 32-bit or 64-bit gptes are in use). |
164 | role.cr4_nxe: | 164 | role.nxe: |
165 | Contains the value of efer.nxe for which the page is valid. | 165 | Contains the value of efer.nxe for which the page is valid. |
166 | role.cr0_wp: | 166 | role.cr0_wp: |
167 | Contains the value of cr0.wp for which the page is valid. | 167 | Contains the value of cr0.wp for which the page is valid. |
@@ -180,7 +180,9 @@ Shadow pages contain the following information: | |||
180 | guest pages as leaves. | 180 | guest pages as leaves. |
181 | gfns: | 181 | gfns: |
182 | An array of 512 guest frame numbers, one for each present pte. Used to | 182 | An array of 512 guest frame numbers, one for each present pte. Used to |
183 | perform a reverse map from a pte to a gfn. | 183 | perform a reverse map from a pte to a gfn. When role.direct is set, any |
184 | element of this array can be calculated from the gfn field when used, in | ||
185 | this case, the array of gfns is not allocated. See role.direct and gfn. | ||
184 | slot_bitmap: | 186 | slot_bitmap: |
185 | A bitmap containing one bit per memory slot. If the page contains a pte | 187 | A bitmap containing one bit per memory slot. If the page contains a pte |
186 | mapping a page from memory slot n, then bit n of slot_bitmap will be set | 188 | mapping a page from memory slot n, then bit n of slot_bitmap will be set |
@@ -296,6 +298,48 @@ Host translation updates: | |||
296 | - look up affected sptes through reverse map | 298 | - look up affected sptes through reverse map |
297 | - drop (or update) translations | 299 | - drop (or update) translations |
298 | 300 | ||
301 | Emulating cr0.wp | ||
302 | ================ | ||
303 | |||
304 | If tdp is not enabled, the host must keep cr0.wp=1 so page write protection | ||
305 | works for the guest kernel, not guest userspace. When the guest | ||
306 | cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0, | ||
307 | we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the | ||
308 | semantics require allowing any guest kernel access plus user read access). | ||
309 | |||
310 | We handle this by mapping the permissions to two possible sptes, depending | ||
311 | on fault type: | ||
312 | |||
313 | - kernel write fault: spte.u=0, spte.w=1 (allows full kernel access, | ||
314 | disallows user access) | ||
315 | - read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel | ||
316 | write access) | ||
317 | |||
318 | (user write faults generate a #PF) | ||
319 | |||
320 | Large pages | ||
321 | =========== | ||
322 | |||
323 | The mmu supports all combinations of large and small guest and host pages. | ||
324 | Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as | ||
325 | two separate 2M pages, on both guest and host, since the mmu always uses PAE | ||
326 | paging. | ||
327 | |||
328 | To instantiate a large spte, four constraints must be satisfied: | ||
329 | |||
330 | - the spte must point to a large host page | ||
331 | - the guest pte must be a large pte of at least equivalent size (if tdp is | ||
332 | enabled, there is no guest pte and this condition is satisfied) | ||
333 | - if the spte will be writeable, the large page frame may not overlap any | ||
334 | write-protected pages | ||
335 | - the guest page must be wholly contained by a single memory slot | ||
336 | |||
337 | To check the last two conditions, the mmu maintains a ->write_count set of | ||
338 | arrays for each memory slot and large page size. Every write protected page | ||
339 | causes its write_count to be incremented, thus preventing instantiation of | ||
340 | a large spte. The frames at the end of an unaligned memory slot have | ||
341 | artificially inflated ->write_counts so they can never be instantiated. | ||
342 | |||
299 | Further reading | 343 | Further reading |
300 | =============== | 344 | =============== |
301 | 345 | ||
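The "Emulating cr0.wp" section added in the mmu.txt hunk above maps one guest permission set (gpte.u=1, gpte.w=0 with guest cr0.wp=0) to two different sptes depending on the fault type. The decision can be sketched as a small function. This is only an illustration of the rule as stated in the text, not the kernel's code; the enum, the struct, and spte_for_wp0_fault() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Fault types distinguished by the text. */
enum fault_kind { FAULT_READ, FAULT_KERNEL_WRITE, FAULT_USER_WRITE };

struct spte_perm {
	bool map;   /* false: no spte is installed, the guest gets a #PF */
	bool user;  /* spte.u */
	bool write; /* spte.w */
};

/*
 * Hypothetical helper for a page with gpte.u=1, gpte.w=0 while the guest
 * runs with cr0.wp=0 and the host keeps cr0.wp=1.
 */
struct spte_perm spte_for_wp0_fault(enum fault_kind kind)
{
	struct spte_perm p = { false, false, false };

	switch (kind) {
	case FAULT_KERNEL_WRITE:
		/* Full kernel access, no user access. */
		p.map = true;
		p.user = false;
		p.write = true;
		break;
	case FAULT_READ:
		/* Full read access, no kernel write access. */
		p.map = true;
		p.user = true;
		p.write = false;
		break;
	case FAULT_USER_WRITE:
		/* Forwarded to the guest as a #PF; nothing is mapped. */
		break;
	default:
		assert(!"unknown fault kind");
	}
	return p;
}
```

Because each spte covers only part of the guest's permissions, a later fault of the other type is expected to replace the spte with the other variant.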
diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
new file mode 100644
index 000000000000..8ddcfe84c09a
--- /dev/null
+++ b/Documentation/kvm/msr.txt
@@ -0,0 +1,153 @@ | |||
1 | KVM-specific MSRs. | ||
2 | Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 | ||
3 | ===================================================== | ||
4 | |||
5 | KVM makes use of some custom MSRs to service some requests. | ||
6 | At present, this facility is only used by kvmclock. | ||
7 | |||
8 | Custom MSRs have a range reserved for them, that goes from | ||
9 | 0x4b564d00 to 0x4b564dff. There are MSRs outside this area, | ||
10 | but they are deprecated and their use is discouraged. | ||
11 | |||
12 | Custom MSR list | ||
13 | --------------- | ||
14 | |||
15 | The currently supported custom MSRs are: | ||
16 | |||
17 | MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 | ||
18 | |||
19 | data: 4-byte aligned physical address of a memory area which must be | ||
20 | in guest RAM. This memory is expected to hold a copy of the following | ||
21 | structure: | ||
22 | |||
23 | struct pvclock_wall_clock { | ||
24 | u32 version; | ||
25 | u32 sec; | ||
26 | u32 nsec; | ||
27 | } __attribute__((__packed__)); | ||
28 | |||
29 | whose data will be filled in by the hypervisor. The hypervisor is only | ||
30 | guaranteed to update this data at the moment of MSR write. | ||
31 | Users that want to reliably query this information more than once have | ||
32 | to write more than once to this MSR. Fields have the following meanings: | ||
33 | |||
34 | version: guest has to check version before and after grabbing | ||
35 | time information and check that they are both equal and even. | ||
36 | An odd version indicates an in-progress update. | ||
37 | |||
38 | sec: number of seconds for wallclock. | ||
39 | |||
40 | nsec: number of nanoseconds for wallclock. | ||
41 | |||
42 | Note that although MSRs are per-CPU entities, the effect of this | ||
43 | particular MSR is global. | ||
44 | |||
45 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | ||
46 | leaf prior to usage. | ||
47 | |||
48 | MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 | ||
49 | |||
50 | data: 4-byte aligned physical address of a memory area which must be in | ||
51 | guest RAM, plus an enable bit in bit 0. This memory is expected to hold | ||
52 | a copy of the following structure: | ||
53 | |||
54 | struct pvclock_vcpu_time_info { | ||
55 | u32 version; | ||
56 | u32 pad0; | ||
57 | u64 tsc_timestamp; | ||
58 | u64 system_time; | ||
59 | u32 tsc_to_system_mul; | ||
60 | s8 tsc_shift; | ||
61 | u8 flags; | ||
62 | u8 pad[2]; | ||
63 | } __attribute__((__packed__)); /* 32 bytes */ | ||
64 | |||
65 | whose data will be filled in by the hypervisor periodically. Only one | ||
66 | write, or registration, is needed for each VCPU. The interval between | ||
67 | updates of this structure is arbitrary and implementation-dependent. | ||
68 | The hypervisor may update this structure at any time it sees fit until | ||
69 | anything with bit0 == 0 is written to it. | ||
70 | |||
71 | Fields have the following meanings: | ||
72 | |||
73 | version: guest has to check version before and after grabbing | ||
74 | time information and check that they are both equal and even. | ||
75 | An odd version indicates an in-progress update. | ||
76 | |||
77 | tsc_timestamp: the tsc value at the current VCPU at the time | ||
78 | of the update of this structure. Guests can subtract this value | ||
79 | from current tsc to derive a notion of elapsed time since the | ||
80 | structure update. | ||
81 | |||
82 | system_time: a host notion of monotonic time, including sleep | ||
83 | time at the time this structure was last updated. Unit is | ||
84 | nanoseconds. | ||
85 | |||
86 | tsc_to_system_mul: a function of the tsc frequency. One has | ||
87 | to multiply any tsc-related quantity by this value to get | ||
88 | a value in nanoseconds, besides dividing by 2^tsc_shift. | ||
89 | |||
90 | tsc_shift: cycle to nanosecond divider, as a power of two, to | ||
91 | allow for shift rights. One has to shift right any tsc-related | ||
92 | quantity by this value to get a value in nanoseconds, besides | ||
93 | multiplying by tsc_to_system_mul. | ||
94 | |||
95 | With this information, guests can derive per-CPU time by | ||
96 | doing: | ||
97 | |||
98 | time = (current_tsc - tsc_timestamp) | ||
99 | time = (time * tsc_to_system_mul) >> tsc_shift | ||
100 | time = time + system_time | ||
101 | |||
102 | flags: bits in this field indicate extended capabilities | ||
103 | coordinated between the guest and the hypervisor. Availability | ||
104 | of specific flags has to be checked in 0x40000001 cpuid leaf. | ||
105 | Current flags are: | ||
106 | |||
107 | flag bit | cpuid bit | meaning | ||
108 | ------------------------------------------------------------- | ||
109 | | | time measures taken across | ||
110 | 0 | 24 | multiple cpus are guaranteed to | ||
111 | | | be monotonic | ||
112 | ------------------------------------------------------------- | ||
113 | |||
114 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | ||
115 | leaf prior to usage. | ||
116 | |||
117 | |||
118 | MSR_KVM_WALL_CLOCK: 0x11 | ||
119 | |||
120 | data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. | ||
121 | |||
122 | This MSR falls outside the reserved KVM range and may be removed in the | ||
123 | future. Its usage is deprecated. | ||
124 | |||
125 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | ||
126 | leaf prior to usage. | ||
127 | |||
128 | MSR_KVM_SYSTEM_TIME: 0x12 | ||
129 | |||
130 | data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. | ||
131 | |||
132 | This MSR falls outside the reserved KVM range and may be removed in the | ||
133 | future. Its usage is deprecated. | ||
134 | |||
135 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | ||
136 | leaf prior to usage. | ||
137 | |||
138 | The suggested algorithm for detecting kvmclock presence is then: | ||
139 | |||
140 | if (!kvm_para_available()) /* refer to cpuid.txt */ | ||
141 | return NON_PRESENT; | ||
142 | |||
143 | flags = cpuid_eax(0x40000001); | ||
144 | if (flags & (1 << 3)) { /* bit 3: the _NEW MSRs are available */ | ||
145 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; | ||
146 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; | ||
147 | return PRESENT; | ||
148 | } else if (flags & (1 << 0)) { /* bit 0: only the deprecated MSRs */ | ||
149 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; | ||
150 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; | ||
151 | return PRESENT; | ||
152 | } else | ||
153 | return NON_PRESENT; | ||
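The msr.txt text above describes two things a guest does with pvclock_vcpu_time_info: check that 'version' is equal and even before and after reading, and apply the three-step time formula. A minimal sketch combining both, assuming the formula exactly as written in the text, could look like this. pvclock_try_read() is a hypothetical name; the sketch assumes tsc_shift is non-negative, ignores 64-bit overflow in the multiplication, and omits the read barriers a real guest would need between the version reads and the field reads.

```c
#include <assert.h>
#include <stdint.h>

/* Layout copied from the msr.txt text, using <stdint.h> types. */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
} __attribute__((__packed__));

/*
 * Hypothetical helper. Returns 1 and stores nanoseconds in *ns_out when a
 * consistent snapshot was read; returns 0 when an update was in progress
 * (odd or changed version) and the caller should retry.
 */
int pvclock_try_read(const volatile struct pvclock_vcpu_time_info *info,
		     uint64_t current_tsc, uint64_t *ns_out)
{
	uint32_t before = info->version;
	uint64_t tsc_timestamp = info->tsc_timestamp;
	uint64_t system_time = info->system_time;
	uint32_t mul = info->tsc_to_system_mul;
	int8_t shift = info->tsc_shift;
	uint32_t after = info->version;
	uint64_t delta;

	if (before != after || (before & 1))
		return 0;

	/* Assumption of this sketch: only non-negative shifts are handled. */
	assert(shift >= 0);

	delta = current_tsc - tsc_timestamp;  /* time = current_tsc - tsc_timestamp */
	delta = (delta * mul) >> shift;       /* scale as written in the text */
	*ns_out = delta + system_time;        /* add the host's monotonic base */
	return 1;
}
```

A caller would typically loop until the function returns 1, since the hypervisor may update the structure at any time while the enable bit is set.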
diff --git a/Documentation/kvm/review-checklist.txt b/Documentation/kvm/review-checklist.txt
new file mode 100644
index 000000000000..730475ae1b8d
--- /dev/null
+++ b/Documentation/kvm/review-checklist.txt
@@ -0,0 +1,38 @@ | |||
1 | Review checklist for kvm patches | ||
2 | ================================ | ||
3 | |||
4 | 1. The patch must follow Documentation/CodingStyle and | ||
5 | Documentation/SubmittingPatches. | ||
6 | |||
7 | 2. Patches should be against kvm.git master branch. | ||
8 | |||
9 | 3. If the patch introduces or modifies a new userspace API: | ||
10 | - the API must be documented in Documentation/kvm/api.txt | ||
11 | - the API must be discoverable using KVM_CHECK_EXTENSION | ||
12 | |||
13 | 4. New state must include support for save/restore. | ||
14 | |||
15 | 5. New features must default to off (userspace should explicitly request them). | ||
16 | Performance improvements can and should default to on. | ||
17 | |||
18 | 6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID. | ||
19 | |||
20 | 7. Emulator changes should be accompanied by unit tests in the qemu-kvm.git | ||
21 | kvm/test directory. | ||
22 | |||
23 | 8. Changes should be vendor neutral when possible. Changes to common code | ||
24 | are better than duplicating changes to vendor code. | ||
25 | |||
26 | 9. Similarly, prefer changes to arch independent code over changes to arch | ||
27 | dependent code. | ||
28 | |||
29 | 10. User/kernel interfaces and guest/host interfaces must be 64-bit clean | ||
30 | (all variables and sizes naturally aligned on 64-bit; use specific types | ||
31 | only - u64 rather than ulong). | ||
32 | |||
33 | 11. New guest visible features must either be documented in a hardware manual | ||
34 | or be accompanied by documentation. | ||
35 | |||
36 | 12. Features must be robust against reset and kexec - for example, shared | ||
37 | host/guest memory must be unshared to prevent the host from writing to | ||
38 | guest memory that the guest has not reserved for this purpose. | ||