author Rob Landley <rlandley@parallels.com> 2011-05-06 12:22:02 -0400
committer Randy Dunlap <randy.dunlap@oracle.com> 2011-05-06 12:22:02 -0400
commit ed16648eb5b86917f0b90bdcdbc857202da72f90
tree a8198415a6c2f1909f02340b05d36e1d53b82320 /Documentation/virtual
parent bfd412db9e7b0d8f7b9c09d12d07aa2ac785f1d0
Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:

  cd Documentation
  mkdir virtual
  git mv kvm uml lguest virtual

Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Diffstat (limited to 'Documentation/virtual')
-rw-r--r--  Documentation/virtual/kvm/api.txt                 1451
-rw-r--r--  Documentation/virtual/kvm/cpuid.txt                 45
-rw-r--r--  Documentation/virtual/kvm/locking.txt               25
-rw-r--r--  Documentation/virtual/kvm/mmu.txt                  348
-rw-r--r--  Documentation/virtual/kvm/msr.txt                  187
-rw-r--r--  Documentation/virtual/kvm/ppc-pv.txt               196
-rw-r--r--  Documentation/virtual/kvm/review-checklist.txt      38
-rw-r--r--  Documentation/virtual/kvm/timekeeping.txt          612
-rw-r--r--  Documentation/virtual/lguest/.gitignore              1
-rw-r--r--  Documentation/virtual/lguest/Makefile                8
-rw-r--r--  Documentation/virtual/lguest/extract                58
-rw-r--r--  Documentation/virtual/lguest/lguest.c             2095
-rw-r--r--  Documentation/virtual/lguest/lguest.txt            128
-rw-r--r--  Documentation/virtual/uml/UserModeLinux-HOWTO.txt 4579
14 files changed, 9771 insertions, 0 deletions
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
new file mode 100644
index 000000000000..9bef4e4cec50
--- /dev/null
+++ b/Documentation/virtual/kvm/api.txt
@@ -0,0 +1,1451 @@
The Definitive KVM (Kernel-based Virtual Machine) API Documentation
===================================================================

1. General description

The kvm API is a set of ioctls that are issued to control various aspects
of a virtual machine. The ioctls belong to three classes:

 - System ioctls: These query and set global attributes which affect the
   whole kvm subsystem. In addition a system ioctl is used to create
   virtual machines.

 - VM ioctls: These query and set attributes that affect an entire virtual
   machine, for example memory layout. In addition a VM ioctl is used to
   create virtual cpus (vcpus).

   Only run VM ioctls from the same process (address space) that was used
   to create the VM.

 - vcpu ioctls: These query and set attributes that control the operation
   of a single virtual cpu.

   Only run vcpu ioctls from the same thread that was used to create the
   vcpu.

2. File descriptors

The kvm API is centered around file descriptors. An initial
open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
handle will create a VM file descriptor which can be used to issue VM
ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
and return a file descriptor pointing to it. Finally, ioctls on a vcpu
fd can be used to control the vcpu, including the important task of
actually running guest code.

In general file descriptors can be migrated among processes by means
of fork() and the SCM_RIGHTS facility of Unix domain sockets. These
kinds of tricks are explicitly not supported by kvm. While they will
not cause harm to the host, their actual behavior is not guaranteed by
the API. The only supported use is one virtual machine per process,
and one vcpu per thread.
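
The descriptor hierarchy described above can be sketched in C as follows.
This is only an illustration: struct kvm_handles and kvm_handles_open() are
hypothetical names, not part of the kvm API, and passing 0 as the ioctl
argument is an assumption (KVM_CREATE_VM is documented as taking no
parameters, and vcpu id 0 is simply the first valid id).

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical container for the three kinds of descriptor. */
struct kvm_handles {
	int sys_fd;	/* open("/dev/kvm"): system ioctls */
	int vm_fd;	/* KVM_CREATE_VM: VM ioctls */
	int vcpu_fd;	/* KVM_CREATE_VCPU: vcpu ioctls */
};

/* Returns 0 on success; on failure returns -1 with every field set to -1. */
int kvm_handles_open(struct kvm_handles *h)
{
	h->sys_fd = h->vm_fd = h->vcpu_fd = -1;

	h->sys_fd = open("/dev/kvm", O_RDWR);
	if (h->sys_fd < 0)
		return -1;

	h->vm_fd = ioctl(h->sys_fd, KVM_CREATE_VM, 0UL);
	if (h->vm_fd < 0)
		goto fail;

	/* vcpu id 0; call this from the thread that will run the vcpu. */
	h->vcpu_fd = ioctl(h->vm_fd, KVM_CREATE_VCPU, 0UL);
	if (h->vcpu_fd < 0)
		goto fail;

	return 0;

fail:
	if (h->vm_fd >= 0)
		close(h->vm_fd);
	close(h->sys_fd);
	h->sys_fd = h->vm_fd = h->vcpu_fd = -1;
	return -1;
}
```

Keeping the three descriptors together makes it harder to issue an ioctl
on the wrong class of fd by accident.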

3. Extensions

As of Linux 2.6.22, the KVM ABI has been stabilized: no backward
incompatible changes are allowed. However, there is an extension
facility that allows backward-compatible extensions to the API to be
queried and used.

The extension mechanism is not based on the Linux version number.
Instead, kvm defines extension identifiers and a facility to query
whether a particular extension identifier is available. If it is, a
set of ioctls is available for application use.

4. API description

This section describes ioctls that can be used to control kvm guests.
For each ioctl, the following information is provided along with a
description:

  Capability: which KVM extension provides this ioctl. Can be 'basic',
      which means that it will be provided by any kernel that supports
      API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which
      means availability needs to be checked with KVM_CHECK_EXTENSION
      (see section 4.4).

  Architectures: which instruction set architectures provide this ioctl.
      x86 includes both i386 and x86_64.

  Type: system, vm, or vcpu.

  Parameters: what parameters are accepted by the ioctl.

  Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL)
      are not detailed, but errors with specific meanings are.

784.1 KVM_GET_API_VERSION
79
80Capability: basic
81Architectures: all
82Type: system ioctl
83Parameters: none
84Returns: the constant KVM_API_VERSION (=12)
85
86This identifies the API version as the stable kvm API. It is not
87expected that this number will change. However, Linux 2.6.20 and
882.6.21 report earlier versions; these are not documented and not
89supported. Applications should refuse to run if KVM_GET_API_VERSION
90returns a value other than 12. If this check passes, all ioctls
91described as 'basic' will be available.
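
A minimal sketch of the version check recommended above, assuming sys_fd
was obtained from open("/dev/kvm"). The helper name kvm_api_is_stable() is
hypothetical.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns 1 if sys_fd reports the stable API version, 0 otherwise. */
int kvm_api_is_stable(int sys_fd)
{
	int version = ioctl(sys_fd, KVM_GET_API_VERSION, 0UL);

	/* KVM_API_VERSION is 12 in <linux/kvm.h>; -1 signals an ioctl error. */
	return version == KVM_API_VERSION;
}
```

An application would typically call this once at startup and exit with a
diagnostic if it returns 0.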
92
934.2 KVM_CREATE_VM
94
95Capability: basic
96Architectures: all
97Type: system ioctl
98Parameters: none
99Returns: a VM fd that can be used to control the new virtual machine.
100
101The new VM has no virtual cpus and no memory. An mmap() of a VM fd
102will access the virtual machine's physical address space; offset zero
103corresponds to guest physical address zero. Use of mmap() on a VM fd
104is discouraged if userspace memory allocation (KVM_CAP_USER_MEMORY) is
105available.
106
4.3 KVM_GET_MSR_INDEX_LIST

Capability: basic
Architectures: x86
Type: system ioctl
Parameters: struct kvm_msr_list (in/out)
Returns: 0 on success; -1 on error
Errors:
  E2BIG: the msr index list is too big to fit in the array specified by
         the user.

struct kvm_msr_list {
	__u32 nmsrs; /* number of msrs in entries */
	__u32 indices[0];
};

This ioctl returns the guest msrs that are supported. The list varies
by kvm version and host processor, but does not change otherwise. The
user fills in the size of the indices array in nmsrs, and in return
kvm adjusts nmsrs to reflect the actual number of msrs and fills in
the indices array with their numbers.

Note: if kvm indicates support for MCE (KVM_CAP_MCE), then the MCE bank MSRs
are not returned in the MSR list, as different vcpus can have a different
number of banks, as set via the KVM_X86_SETUP_MCE ioctl.
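
One way to use the nmsrs/E2BIG protocol is a two-step probe, sketched
below. It assumes that kvm writes the required count back into nmsrs even
when the call fails with E2BIG; the text above does not state this
explicitly. kvm_fetch_msr_list() is a hypothetical helper name, and the
code is guarded because struct kvm_msr_list only exists on x86.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#if defined(__x86_64__) || defined(__i386__)
/* Returns a malloc'ed list the caller must free(), or NULL on error. */
struct kvm_msr_list *kvm_fetch_msr_list(int sys_fd)
{
	struct kvm_msr_list probe = { .nmsrs = 0 };
	struct kvm_msr_list *list;

	/* Step 1: offer room for zero entries to learn the real count. */
	if (ioctl(sys_fd, KVM_GET_MSR_INDEX_LIST, &probe) < 0 &&
	    errno != E2BIG)
		return NULL;

	/* Step 2: allocate space for that many indices and ask again. */
	list = calloc(1, sizeof(*list) +
			 probe.nmsrs * sizeof(list->indices[0]));
	if (!list)
		return NULL;
	list->nmsrs = probe.nmsrs;

	if (ioctl(sys_fd, KVM_GET_MSR_INDEX_LIST, list) < 0) {
		free(list);
		return NULL;
	}
	return list;
}
#endif
```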
132
1334.4 KVM_CHECK_EXTENSION
134
135Capability: basic
136Architectures: all
137Type: system ioctl
138Parameters: extension identifier (KVM_CAP_*)
139Returns: 0 if unsupported; 1 (or some other positive integer) if supported
140
141The API allows the application to query about extensions to the core
142kvm API. Userspace passes an extension identifier (an integer) and
143receives an integer that describes the extension availability.
144Generally 0 means no and 1 means yes, but some extensions may report
145additional information in the integer return value.
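
A short sketch of an extension query. The helper name
kvm_check_extension() is hypothetical; it folds the error case (-1) into
"unsupported" so that callers only have to test for a positive value.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Returns 0 if the extension is unsupported (or the query failed), and
 * the positive value reported by kvm otherwise.
 */
int kvm_check_extension(int sys_fd, long cap)
{
	int ret = ioctl(sys_fd, KVM_CHECK_EXTENSION, (unsigned long)cap);

	return ret > 0 ? ret : 0;
}
```

For example, kvm_check_extension(sys_fd, KVM_CAP_USER_MEMORY) could be
tested before using KVM_SET_USER_MEMORY_REGION.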
146
4.5 KVM_GET_VCPU_MMAP_SIZE

Capability: basic
Architectures: all
Type: system ioctl
Parameters: none
Returns: size of vcpu mmap area, in bytes

The KVM_RUN ioctl (see section 4.10) communicates with userspace via a
shared memory region. This ioctl returns the size of that region. See the
KVM_RUN documentation for details.
158
1594.6 KVM_SET_MEMORY_REGION
160
161Capability: basic
162Architectures: all
163Type: vm ioctl
164Parameters: struct kvm_memory_region (in)
165Returns: 0 on success, -1 on error
166
167This ioctl is obsolete and has been removed.
168
1694.7 KVM_CREATE_VCPU
170
171Capability: basic
172Architectures: all
173Type: vm ioctl
174Parameters: vcpu id (apic id on x86)
175Returns: vcpu fd on success, -1 on error
176
177This API adds a vcpu to a virtual machine. The vcpu id is a small integer
178in the range [0, max_vcpus).
179
4.8 KVM_GET_DIRTY_LOG

Capability: basic
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_dirty_log (in/out)
Returns: 0 on success, -1 on error

/* for KVM_GET_DIRTY_LOG */
struct kvm_dirty_log {
	__u32 slot;
	__u32 padding1;
	union {
		void __user *dirty_bitmap; /* one bit per page */
		__u64 padding2;
	};
};

Given a memory slot, return a bitmap containing any pages dirtied
since the last call to this ioctl. Bit 0 is the first page in the
memory slot. Ensure the entire structure is cleared to avoid padding
issues.
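
The following sketch shows the structure being cleared before use and one
way to test a bit in the returned bitmap. Both helper names are
hypothetical. It assumes the caller's buffer holds at least one bit per
page of the slot (rounded up generously) and that bits are numbered from
the least significant bit of each byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Fetch the dirty bitmap of one slot into a caller-provided buffer. */
int kvm_fetch_dirty_log(int vm_fd, uint32_t slot, void *bitmap)
{
	struct kvm_dirty_log log;

	memset(&log, 0, sizeof(log));	/* clears the padding as well */
	log.slot = slot;
	log.dirty_bitmap = bitmap;
	return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}

/* Returns 1 if 'page' (0 = first page of the slot) is marked dirty. */
int dirty_bitmap_test(const uint8_t *bitmap, size_t page)
{
	return (bitmap[page / 8] >> (page % 8)) & 1;
}
```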
202
2034.9 KVM_SET_MEMORY_ALIAS
204
205Capability: basic
206Architectures: x86
207Type: vm ioctl
208Parameters: struct kvm_memory_alias (in)
209Returns: 0 (success), -1 (error)
210
211This ioctl is obsolete and has been removed.
212
2134.10 KVM_RUN
214
215Capability: basic
216Architectures: all
217Type: vcpu ioctl
218Parameters: none
219Returns: 0 on success, -1 on error
220Errors:
221 EINTR: an unmasked signal is pending
222
223This ioctl is used to run a guest virtual cpu. While there are no
224explicit parameters, there is an implicit parameter block that can be
225obtained by mmap()ing the vcpu fd at offset 0, with the size given by
226KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct
227kvm_run' (see below).
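
Mapping the implicit parameter block might look like the sketch below.
kvm_map_run() is a hypothetical helper; the protection and sharing flags
passed to mmap() are assumptions, chosen because both userspace and the
kernel need to read and write the block.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Maps the kvm_run block of a vcpu. On success returns the mapping and
 * stores its size in *size_out (for a later munmap()); returns NULL on
 * failure and leaves *size_out untouched.
 */
struct kvm_run *kvm_map_run(int sys_fd, int vcpu_fd, size_t *size_out)
{
	int size = ioctl(sys_fd, KVM_GET_VCPU_MMAP_SIZE, 0UL);
	void *p;

	if (size <= 0)
		return NULL;

	p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED,
		 vcpu_fd, 0);
	if (p == MAP_FAILED)
		return NULL;

	*size_out = (size_t)size;
	return p;
}
```

Note that the size comes from the system fd while the mapping itself is
made on the vcpu fd.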
228
2294.11 KVM_GET_REGS
230
231Capability: basic
232Architectures: all
233Type: vcpu ioctl
234Parameters: struct kvm_regs (out)
235Returns: 0 on success, -1 on error
236
237Reads the general purpose registers from the vcpu.
238
239/* x86 */
240struct kvm_regs {
241 /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
242 __u64 rax, rbx, rcx, rdx;
243 __u64 rsi, rdi, rsp, rbp;
244 __u64 r8, r9, r10, r11;
245 __u64 r12, r13, r14, r15;
246 __u64 rip, rflags;
247};
248
2494.12 KVM_SET_REGS
250
251Capability: basic
252Architectures: all
253Type: vcpu ioctl
254Parameters: struct kvm_regs (in)
255Returns: 0 on success, -1 on error
256
257Writes the general purpose registers into the vcpu.
258
259See KVM_GET_REGS for the data structure.
260
2614.13 KVM_GET_SREGS
262
263Capability: basic
264Architectures: x86
265Type: vcpu ioctl
266Parameters: struct kvm_sregs (out)
267Returns: 0 on success, -1 on error
268
269Reads special registers from the vcpu.
270
271/* x86 */
272struct kvm_sregs {
273 struct kvm_segment cs, ds, es, fs, gs, ss;
274 struct kvm_segment tr, ldt;
275 struct kvm_dtable gdt, idt;
276 __u64 cr0, cr2, cr3, cr4, cr8;
277 __u64 efer;
278 __u64 apic_base;
279 __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
280};
281
282interrupt_bitmap is a bitmap of pending external interrupts. At most
283one bit may be set. This interrupt has been acknowledged by the APIC
284but not yet injected into the cpu core.
285
2864.14 KVM_SET_SREGS
287
288Capability: basic
289Architectures: x86
290Type: vcpu ioctl
291Parameters: struct kvm_sregs (in)
292Returns: 0 on success, -1 on error
293
294Writes special registers into the vcpu. See KVM_GET_SREGS for the
295data structures.
296
2974.15 KVM_TRANSLATE
298
299Capability: basic
300Architectures: x86
301Type: vcpu ioctl
302Parameters: struct kvm_translation (in/out)
303Returns: 0 on success, -1 on error
304
305Translates a virtual address according to the vcpu's current address
306translation mode.
307
308struct kvm_translation {
309 /* in */
310 __u64 linear_address;
311
312 /* out */
313 __u64 physical_address;
314 __u8 valid;
315 __u8 writeable;
316 __u8 usermode;
317 __u8 pad[5];
318};
319
3204.16 KVM_INTERRUPT
321
322Capability: basic
323Architectures: x86, ppc
324Type: vcpu ioctl
325Parameters: struct kvm_interrupt (in)
326Returns: 0 on success, -1 on error
327
328Queues a hardware interrupt vector to be injected. This is only
329useful if in-kernel local APIC or equivalent is not used.
330
331/* for KVM_INTERRUPT */
332struct kvm_interrupt {
333 /* in */
334 __u32 irq;
335};
336
X86:

Note 'irq' is an interrupt vector, not an interrupt pin or line.

PPC:

Queues an external interrupt to be injected. This ioctl is overloaded
with 3 different irq values:

a) KVM_INTERRUPT_SET

   This injects an edge type external interrupt into the guest once it's ready
   to receive interrupts. When injected, the interrupt is done.

b) KVM_INTERRUPT_UNSET

   This unsets any pending interrupt.

   Only available with KVM_CAP_PPC_UNSET_IRQ.

c) KVM_INTERRUPT_SET_LEVEL

   This injects a level type external interrupt into the guest context. The
   interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET
   is triggered.

   Only available with KVM_CAP_PPC_IRQ_LEVEL.

Note that any value for 'irq' other than the ones stated above is invalid
and results in unexpected behavior.
367
4.17 KVM_DEBUG_GUEST

Capability: basic
Architectures: none
Type: vcpu ioctl
Parameters: none
Returns: -1 on error

Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
377
3784.18 KVM_GET_MSRS
379
380Capability: basic
381Architectures: x86
382Type: vcpu ioctl
383Parameters: struct kvm_msrs (in/out)
384Returns: 0 on success, -1 on error
385
386Reads model-specific registers from the vcpu. Supported msr indices can
387be obtained using KVM_GET_MSR_INDEX_LIST.
388
389struct kvm_msrs {
390 __u32 nmsrs; /* number of msrs in entries */
391 __u32 pad;
392
393 struct kvm_msr_entry entries[0];
394};
395
396struct kvm_msr_entry {
397 __u32 index;
398 __u32 reserved;
399 __u64 data;
400};
401
402Application code should set the 'nmsrs' member (which indicates the
403size of the entries array) and the 'index' member of each array entry.
404kvm will fill in the 'data' member.
405
4064.19 KVM_SET_MSRS
407
408Capability: basic
409Architectures: x86
410Type: vcpu ioctl
411Parameters: struct kvm_msrs (in)
412Returns: 0 on success, -1 on error
413
414Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the
415data structures.
416
417Application code should set the 'nmsrs' member (which indicates the
418size of the entries array), and the 'index' and 'data' members of each
419array entry.
420
4214.20 KVM_SET_CPUID
422
423Capability: basic
424Architectures: x86
425Type: vcpu ioctl
426Parameters: struct kvm_cpuid (in)
427Returns: 0 on success, -1 on error
428
429Defines the vcpu responses to the cpuid instruction. Applications
430should use the KVM_SET_CPUID2 ioctl if available.
431
432
433struct kvm_cpuid_entry {
434 __u32 function;
435 __u32 eax;
436 __u32 ebx;
437 __u32 ecx;
438 __u32 edx;
439 __u32 padding;
440};
441
442/* for KVM_SET_CPUID */
443struct kvm_cpuid {
444 __u32 nent;
445 __u32 padding;
446 struct kvm_cpuid_entry entries[0];
447};
448
4.21 KVM_SET_SIGNAL_MASK

Capability: basic
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_signal_mask (in)
Returns: 0 on success, -1 on error

Defines which signals are blocked during execution of KVM_RUN. This
signal mask temporarily overrides the thread's signal mask. Any
unblocked signal received (except SIGKILL and SIGSTOP, which retain
their traditional behaviour) will cause KVM_RUN to return with -EINTR.

Note the signal will only be delivered if not blocked by the original
signal mask.

/* for KVM_SET_SIGNAL_MASK */
struct kvm_signal_mask {
	__u32 len;
	__u8 sigset[0];
};
470
4714.22 KVM_GET_FPU
472
473Capability: basic
474Architectures: x86
475Type: vcpu ioctl
476Parameters: struct kvm_fpu (out)
477Returns: 0 on success, -1 on error
478
479Reads the floating point state from the vcpu.
480
481/* for KVM_GET_FPU and KVM_SET_FPU */
482struct kvm_fpu {
483 __u8 fpr[8][16];
484 __u16 fcw;
485 __u16 fsw;
486 __u8 ftwx; /* in fxsave format */
487 __u8 pad1;
488 __u16 last_opcode;
489 __u64 last_ip;
490 __u64 last_dp;
491 __u8 xmm[16][16];
492 __u32 mxcsr;
493 __u32 pad2;
494};
495
4964.23 KVM_SET_FPU
497
498Capability: basic
499Architectures: x86
500Type: vcpu ioctl
501Parameters: struct kvm_fpu (in)
502Returns: 0 on success, -1 on error
503
504Writes the floating point state to the vcpu.
505
506/* for KVM_GET_FPU and KVM_SET_FPU */
507struct kvm_fpu {
508 __u8 fpr[8][16];
509 __u16 fcw;
510 __u16 fsw;
511 __u8 ftwx; /* in fxsave format */
512 __u8 pad1;
513 __u16 last_opcode;
514 __u64 last_ip;
515 __u64 last_dp;
516 __u8 xmm[16][16];
517 __u32 mxcsr;
518 __u32 pad2;
519};
520
4.24 KVM_CREATE_IRQCHIP

Capability: KVM_CAP_IRQCHIP
Architectures: x86, ia64
Type: vm ioctl
Parameters: none
Returns: 0 on success, -1 on error

Creates an interrupt controller model in the kernel. On x86, creates a virtual
ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a
local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSIs 16-23
only go to the IOAPIC. On ia64, an IOSAPIC is created.

4.25 KVM_IRQ_LINE

Capability: KVM_CAP_IRQCHIP
Architectures: x86, ia64
Type: vm ioctl
Parameters: struct kvm_irq_level (in)
Returns: 0 on success, -1 on error

Sets the level of a GSI input to the interrupt controller model in the kernel.
Requires that an interrupt controller model has been previously created with
KVM_CREATE_IRQCHIP. Note that edge-triggered interrupts require the level
to be set to 1 and then back to 0.

struct kvm_irq_level {
	union {
		__u32 irq;     /* GSI */
		__s32 status;  /* not used for KVM_IRQ_LINE */
	};
	__u32 level;           /* 0 or 1 */
};
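
The edge-triggered case mentioned above (level set to 1, then back to 0)
can be sketched like this. Both helper names are hypothetical.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Drive one GSI input of the in-kernel interrupt controller. */
int kvm_set_irq_level(int vm_fd, unsigned int gsi, unsigned int level)
{
	struct kvm_irq_level irq;

	memset(&irq, 0, sizeof(irq));
	irq.irq = gsi;
	irq.level = level;	/* 0 or 1 */
	return ioctl(vm_fd, KVM_IRQ_LINE, &irq);
}

/* Edge-triggered interrupt: raise the line, then lower it again. */
int kvm_pulse_irq(int vm_fd, unsigned int gsi)
{
	if (kvm_set_irq_level(vm_fd, gsi, 1) < 0)
		return -1;
	return kvm_set_irq_level(vm_fd, gsi, 0);
}
```

For a level-triggered source, a caller would instead use
kvm_set_irq_level() directly and hold the line at 1 until the device
deasserts it.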
554
5554.26 KVM_GET_IRQCHIP
556
557Capability: KVM_CAP_IRQCHIP
558Architectures: x86, ia64
559Type: vm ioctl
560Parameters: struct kvm_irqchip (in/out)
561Returns: 0 on success, -1 on error
562
563Reads the state of a kernel interrupt controller created with
564KVM_CREATE_IRQCHIP into a buffer provided by the caller.
565
566struct kvm_irqchip {
567 __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
568 __u32 pad;
569 union {
570 char dummy[512]; /* reserving space */
571 struct kvm_pic_state pic;
572 struct kvm_ioapic_state ioapic;
573 } chip;
574};
575
5764.27 KVM_SET_IRQCHIP
577
578Capability: KVM_CAP_IRQCHIP
579Architectures: x86, ia64
580Type: vm ioctl
581Parameters: struct kvm_irqchip (in)
582Returns: 0 on success, -1 on error
583
584Sets the state of a kernel interrupt controller created with
585KVM_CREATE_IRQCHIP from a buffer provided by the caller.
586
587struct kvm_irqchip {
588 __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
589 __u32 pad;
590 union {
591 char dummy[512]; /* reserving space */
592 struct kvm_pic_state pic;
593 struct kvm_ioapic_state ioapic;
594 } chip;
595};
596
5974.28 KVM_XEN_HVM_CONFIG
598
599Capability: KVM_CAP_XEN_HVM
600Architectures: x86
601Type: vm ioctl
602Parameters: struct kvm_xen_hvm_config (in)
603Returns: 0 on success, -1 on error
604
605Sets the MSR that the Xen HVM guest uses to initialize its hypercall
606page, and provides the starting address and size of the hypercall
607blobs in userspace. When the guest writes the MSR, kvm copies one
608page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
609memory.
610
611struct kvm_xen_hvm_config {
612 __u32 flags;
613 __u32 msr;
614 __u64 blob_addr_32;
615 __u64 blob_addr_64;
616 __u8 blob_size_32;
617 __u8 blob_size_64;
618 __u8 pad2[30];
619};
620
6214.29 KVM_GET_CLOCK
622
623Capability: KVM_CAP_ADJUST_CLOCK
624Architectures: x86
625Type: vm ioctl
626Parameters: struct kvm_clock_data (out)
627Returns: 0 on success, -1 on error
628
629Gets the current timestamp of kvmclock as seen by the current guest. In
630conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios
631such as migration.
632
633struct kvm_clock_data {
634 __u64 clock; /* kvmclock current value */
635 __u32 flags;
636 __u32 pad[9];
637};
638
6394.30 KVM_SET_CLOCK
640
641Capability: KVM_CAP_ADJUST_CLOCK
642Architectures: x86
643Type: vm ioctl
644Parameters: struct kvm_clock_data (in)
645Returns: 0 on success, -1 on error
646
647Sets the current timestamp of kvmclock to the value specified in its parameter.
648In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
649such as migration.
650
651struct kvm_clock_data {
652 __u64 clock; /* kvmclock current value */
653 __u32 flags;
654 __u32 pad[9];
655};
656
4.31 KVM_GET_VCPU_EVENTS

Capability: KVM_CAP_VCPU_EVENTS
Extended by: KVM_CAP_INTR_SHADOW
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_vcpu_events (out)
Returns: 0 on success, -1 on error

Gets currently pending exceptions, interrupts, and NMIs as well as related
states of the vcpu.

struct kvm_vcpu_events {
	struct {
		__u8 injected;
		__u8 nr;
		__u8 has_error_code;
		__u8 pad;
		__u32 error_code;
	} exception;
	struct {
		__u8 injected;
		__u8 nr;
		__u8 soft;
		__u8 shadow;
	} interrupt;
	struct {
		__u8 injected;
		__u8 pending;
		__u8 masked;
		__u8 pad;
	} nmi;
	__u32 sipi_vector;
	__u32 flags;
};

KVM_VCPUEVENT_VALID_SHADOW may be set in the flags field to signal that
interrupt.shadow contains a valid state. Otherwise, this field is undefined.
695
4.32 KVM_SET_VCPU_EVENTS

Capability: KVM_CAP_VCPU_EVENTS
Extended by: KVM_CAP_INTR_SHADOW
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_vcpu_events (in)
Returns: 0 on success, -1 on error

Sets pending exceptions, interrupts, and NMIs as well as related states of the
vcpu.

See KVM_GET_VCPU_EVENTS for the data structure.

Fields that may be modified asynchronously by running VCPUs can be excluded
from the update. These fields are nmi.pending and sipi_vector. Keep the
corresponding bits in the flags field cleared to suppress overwriting the
current in-kernel state. The bits are:

KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel
KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector

If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in
the flags field to signal that interrupt.shadow contains a valid state and
shall be written into the VCPU.
721
4.33 KVM_GET_DEBUGREGS

Capability: KVM_CAP_DEBUGREGS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_debugregs (out)
Returns: 0 on success, -1 on error

Reads debug registers from the vcpu.

struct kvm_debugregs {
	__u64 db[4];
	__u64 dr6;
	__u64 dr7;
	__u64 flags;
	__u64 reserved[9];
};
739
4.34 KVM_SET_DEBUGREGS

Capability: KVM_CAP_DEBUGREGS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_debugregs (in)
Returns: 0 on success, -1 on error

Writes debug registers into the vcpu.

See KVM_GET_DEBUGREGS for the data structure. The flags field is not yet
used and must be cleared on entry.
752
4.35 KVM_SET_USER_MEMORY_REGION

Capability: KVM_CAP_USER_MEMORY
Architectures: all
Type: vm ioctl
Parameters: struct kvm_userspace_memory_region (in)
Returns: 0 on success, -1 on error

struct kvm_userspace_memory_region {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size;    /* bytes */
	__u64 userspace_addr; /* start of the userspace allocated memory */
};

/* for kvm_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES  1UL

This ioctl allows the user to create or modify a guest physical memory
slot. When changing an existing slot, it may be moved in the guest
physical memory space, or its flags may be modified. It may not be
resized. Slots may not overlap in guest physical address space.

Memory for the region is taken starting at the address denoted by the
field userspace_addr, which must point at user addressable memory for
the entire memory slot size. Any object may back this memory, including
anonymous memory, ordinary files, and hugetlbfs.

It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
be identical. This allows large pages in the guest to be backed by large
pages in the host.

The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which
instructs kvm to keep track of writes to memory within the slot. See
the KVM_GET_DIRTY_LOG ioctl.

When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately. Another
example is madvise(MADV_DROP).

It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
The KVM_SET_MEMORY_REGION ioctl does not allow fine-grained control over
memory allocation and is deprecated.
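
A sketch of registering a slot, together with a check for the 21-bit
alignment recommendation above. Both function names are hypothetical, and
the caller is assumed to have allocated 'mem' already (for example with
mmap()).

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns 1 if the lower 21 bits of both addresses are identical. */
int kvm_large_page_friendly(uint64_t guest_phys_addr, uint64_t userspace_addr)
{
	const uint64_t low21 = (UINT64_C(1) << 21) - 1;

	return ((guest_phys_addr ^ userspace_addr) & low21) == 0;
}

/* Create or modify one guest physical memory slot. */
int kvm_set_slot(int vm_fd, uint32_t slot, uint64_t guest_phys_addr,
		 void *mem, uint64_t size, int log_dirty)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = log_dirty ? KVM_MEM_LOG_DIRTY_PAGES : 0;
	region.guest_phys_addr = guest_phys_addr;
	region.memory_size = size;
	region.userspace_addr = (uint64_t)(uintptr_t)mem;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}
```

Checking kvm_large_page_friendly() before calling kvm_set_slot() is
optional; a mismatch does not make the ioctl fail, it only prevents large
guest pages from being backed by large host pages.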
798
7994.36 KVM_SET_TSS_ADDR
800
801Capability: KVM_CAP_SET_TSS_ADDR
802Architectures: x86
803Type: vm ioctl
804Parameters: unsigned long tss_address (in)
805Returns: 0 on success, -1 on error
806
807This ioctl defines the physical address of a three-page region in the guest
808physical address space. The region must be within the first 4GB of the
809guest physical address space and must not conflict with any memory slot
810or any mmio address. The guest may malfunction if it accesses this memory
811region.
812
813This ioctl is required on Intel-based hosts. This is needed on Intel hardware
814because of a quirk in the virtualization implementation (see the internals
815documentation when it pops into existence).
816
4.37 KVM_ENABLE_CAP

Capability: KVM_CAP_ENABLE_CAP
Architectures: ppc
Type: vcpu ioctl
Parameters: struct kvm_enable_cap (in)
Returns: 0 on success; -1 on error

Not all extensions are enabled by default. Using this ioctl the application
can enable an extension, making it available to the guest.

On systems that do not support this ioctl, it always fails. On systems that
do support it, it only works for extensions that are supported for enablement.

To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should
be used.

struct kvm_enable_cap {
       /* in */
       __u32 cap;

The capability that is supposed to get enabled.

       __u32 flags;

A bitfield indicating future enhancements. Has to be 0 for now.

       __u64 args[4];

Arguments for enabling a feature. If a feature needs initial values to
function properly, this is the place to put them.

       __u8  pad[64];
};
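
The check-then-enable sequence described above could be sketched as
follows. kvm_enable_vcpu_cap() is a hypothetical helper; it passes no
arguments in args[], which is an assumption that only suits capabilities
that need no initial values.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns 0 if the capability was enabled on the vcpu, -1 otherwise. */
int kvm_enable_vcpu_cap(int sys_fd, int vcpu_fd, unsigned int cap)
{
	struct kvm_enable_cap enable;

	/* First confirm that the capability is known to this kernel. */
	if (ioctl(sys_fd, KVM_CHECK_EXTENSION, (unsigned long)cap) <= 0)
		return -1;

	memset(&enable, 0, sizeof(enable));	/* flags and pad must be 0 */
	enable.cap = cap;

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &enable) < 0 ? -1 : 0;
}
```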
851
8524.38 KVM_GET_MP_STATE
853
854Capability: KVM_CAP_MP_STATE
855Architectures: x86, ia64
856Type: vcpu ioctl
857Parameters: struct kvm_mp_state (out)
858Returns: 0 on success; -1 on error
859
860struct kvm_mp_state {
861 __u32 mp_state;
862};
863
864Returns the vcpu's current "multiprocessing state" (though also valid on
865uniprocessor guests).
866
867Possible values are:
868
869 - KVM_MP_STATE_RUNNABLE: the vcpu is currently running
870 - KVM_MP_STATE_UNINITIALIZED: the vcpu is an application processor (AP)
871 which has not yet received an INIT signal
872 - KVM_MP_STATE_INIT_RECEIVED: the vcpu has received an INIT signal, and is
873 now ready for a SIPI
874 - KVM_MP_STATE_HALTED: the vcpu has executed a HLT instruction and
875 is waiting for an interrupt
876 - KVM_MP_STATE_SIPI_RECEIVED: the vcpu has just received a SIPI (vector
877 accessible via KVM_GET_VCPU_EVENTS)
878
879This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
880irqchip, the multiprocessing state must be maintained by userspace.
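
For logging or debugging, the state values listed above can be turned
into strings. This is a minimal sketch; both function names and the
returned strings are hypothetical.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Map an mp_state value to a short name for diagnostics. */
const char *kvm_mp_state_name(unsigned int mp_state)
{
	switch (mp_state) {
	case KVM_MP_STATE_RUNNABLE:      return "runnable";
	case KVM_MP_STATE_UNINITIALIZED: return "uninitialized";
	case KVM_MP_STATE_INIT_RECEIVED: return "init-received";
	case KVM_MP_STATE_HALTED:        return "halted";
	case KVM_MP_STATE_SIPI_RECEIVED: return "sipi-received";
	default:                         return "unknown";
	}
}

/* Returns the state name of a vcpu, or NULL if the ioctl fails. */
const char *kvm_vcpu_mp_state_name(int vcpu_fd)
{
	struct kvm_mp_state state = { 0 };

	if (ioctl(vcpu_fd, KVM_GET_MP_STATE, &state) < 0)
		return NULL;
	return kvm_mp_state_name(state.mp_state);
}
```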
881
8824.39 KVM_SET_MP_STATE
883
884Capability: KVM_CAP_MP_STATE
885Architectures: x86, ia64
886Type: vcpu ioctl
887Parameters: struct kvm_mp_state (in)
888Returns: 0 on success; -1 on error
889
890Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for
891arguments.
892
893This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
894irqchip, the multiprocessing state must be maintained by userspace.
895
8964.40 KVM_SET_IDENTITY_MAP_ADDR
897
898Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
899Architectures: x86
900Type: vm ioctl
901Parameters: unsigned long identity (in)
902Returns: 0 on success, -1 on error
903
904This ioctl defines the physical address of a one-page region in the guest
905physical address space. The region must be within the first 4GB of the
906guest physical address space and must not conflict with any memory slot
907or any mmio address. The guest may malfunction if it accesses this memory
908region.
909
910This ioctl is required on Intel-based hosts. This is needed on Intel hardware
911because of a quirk in the virtualization implementation (see the internals
912documentation when it pops into existence).
913
9144.41 KVM_SET_BOOT_CPU_ID
915
916Capability: KVM_CAP_SET_BOOT_CPU_ID
917Architectures: x86, ia64
918Type: vm ioctl
919Parameters: unsigned long vcpu_id
920Returns: 0 on success, -1 on error
921
922Define which vcpu is the Bootstrap Processor (BSP). Values are the same
923as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default
924is vcpu 0.
925
4.42 KVM_GET_XSAVE

Capability: KVM_CAP_XSAVE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (out)
Returns: 0 on success, -1 on error

struct kvm_xsave {
	__u32 region[1024];
};

This ioctl copies the current vcpu's xsave struct to userspace.
939
4.43 KVM_SET_XSAVE

Capability: KVM_CAP_XSAVE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (in)
Returns: 0 on success, -1 on error

struct kvm_xsave {
	__u32 region[1024];
};

This ioctl copies userspace's xsave struct to the kernel.
953
4.44 KVM_GET_XCRS

Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (out)
Returns: 0 on success, -1 on error

struct kvm_xcr {
	__u32 xcr;
	__u32 reserved;
	__u64 value;
};

struct kvm_xcrs {
	__u32 nr_xcrs;
	__u32 flags;
	struct kvm_xcr xcrs[KVM_MAX_XCRS];
	__u64 padding[16];
};

This ioctl copies the current vcpu's xcrs to userspace.
976
4.45 KVM_SET_XCRS

Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (in)
Returns: 0 on success, -1 on error

struct kvm_xcr {
	__u32 xcr;
	__u32 reserved;
	__u64 value;
};

struct kvm_xcrs {
	__u32 nr_xcrs;
	__u32 flags;
	struct kvm_xcr xcrs[KVM_MAX_XCRS];
	__u64 padding[16];
};

This ioctl sets the vcpu's xcrs to the values specified by userspace.
999
10004.46 KVM_GET_SUPPORTED_CPUID
1001
1002Capability: KVM_CAP_EXT_CPUID
1003Architectures: x86
1004Type: system ioctl
1005Parameters: struct kvm_cpuid2 (in/out)
1006Returns: 0 on success, -1 on error
1007
1008struct kvm_cpuid2 {
1009 __u32 nent;
1010 __u32 padding;
1011 struct kvm_cpuid_entry2 entries[0];
1012};
1013
1014#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1
1015#define KVM_CPUID_FLAG_STATEFUL_FUNC 2
1016#define KVM_CPUID_FLAG_STATE_READ_NEXT 4
1017
1018struct kvm_cpuid_entry2 {
1019 __u32 function;
1020 __u32 index;
1021 __u32 flags;
1022 __u32 eax;
1023 __u32 ebx;
1024 __u32 ecx;
1025 __u32 edx;
1026 __u32 padding[3];
1027};
1028
1029This ioctl returns x86 cpuid features which are supported by both the hardware
1030and kvm. Userspace can use the information returned by this ioctl to
1031construct cpuid information (for KVM_SET_CPUID2) that is consistent with
1032hardware, kernel, and userspace capabilities, and with user requirements (for
1033example, the user may wish to constrain cpuid to emulate older hardware,
1034or for feature consistency across a cluster).
1035
1036Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
1037with the 'nent' field indicating the number of entries in the variable-size
1038array 'entries'. If the number of entries is too low to describe the cpu
1039capabilities, an error (E2BIG) is returned. If the number is too high,
1040the 'nent' field is adjusted and an error (ENOMEM) is returned. If the
1041number is just right, the 'nent' field is adjusted to the number of valid
1042entries in the 'entries' array, which is then filled.
1043
1044The entries returned are the host cpuid as returned by the cpuid instruction,
1045with unknown or unsupported features masked out. Some features (for example,
1046x2apic) may not be present in the host cpu, but are exposed by kvm if it can
1047emulate them efficiently. The fields in each entry are defined as follows:
1048
1049 function: the eax value used to obtain the entry
1050 index: the ecx value used to obtain the entry (for entries that are
1051 affected by ecx)
1052 flags: an OR of zero or more of the following:
1053 KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
1054 if the index field is valid
1055 KVM_CPUID_FLAG_STATEFUL_FUNC:
1056 if cpuid for this function returns different values for successive
1057 invocations; there will be several entries with the same function,
1058 all with this flag set
1059 KVM_CPUID_FLAG_STATE_READ_NEXT:
1060 for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
1061 the first entry to be read by a cpu
1062 eax, ebx, ecx, edx: the values returned by the cpuid instruction for
1063 this function/index combination
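
The entry fields described above can be searched as in the following minimal
sketch. It assumes the caller already holds a filled 'entries' array; the
structure is mirrored locally under the illustrative name cpuid_entry2 so the
sketch builds without <linux/kvm.h>, and find_cpuid_entry() is a hypothetical
helper, not part of the kvm API.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag name spelled exactly as in the documentation above. */
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1

/* Local mirror of struct kvm_cpuid_entry2 (illustrative name). */
struct cpuid_entry2 {
	uint32_t function;
	uint32_t index;
	uint32_t flags;
	uint32_t eax, ebx, ecx, edx;
	uint32_t padding[3];
};

/*
 * Hypothetical helper: return the first entry matching function/index,
 * or NULL. The index is compared only when the entry marks it as valid.
 */
static const struct cpuid_entry2 *
find_cpuid_entry(const struct cpuid_entry2 *entries, uint32_t nent,
		 uint32_t function, uint32_t index)
{
	for (uint32_t i = 0; i < nent; i++) {
		const struct cpuid_entry2 *e = &entries[i];

		if (e->function != function)
			continue;
		if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) &&
		    e->index != index)
			continue;
		return e;
	}
	return NULL;
}
```

Entries without KVM_CPUID_FLAG_SIGNIFCANT_INDEX match any requested index,
which mirrors how the cpuid instruction ignores ecx for those functions.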
1064
10654.47 KVM_PPC_GET_PVINFO
1066
1067Capability: KVM_CAP_PPC_GET_PVINFO
1068Architectures: ppc
1069Type: vm ioctl
1070Parameters: struct kvm_ppc_pvinfo (out)
1071Returns: 0 on success, !0 on error
1072
1073struct kvm_ppc_pvinfo {
1074 __u32 flags;
1075 __u32 hcall[4];
1076 __u8 pad[108];
1077};
1078
1079This ioctl fetches PV specific information that needs to be passed to the guest
1080using the device tree or other means from vm context.
1081
1082For now the only implemented piece of information distributed here is an array
1083of 4 instructions that make up a hypercall.
1084
1085If any additional field gets added to this structure later on, a bit for that
1086additional piece of information will be set in the flags bitmap.
1087
10884.48 KVM_ASSIGN_PCI_DEVICE
1089
1090Capability: KVM_CAP_DEVICE_ASSIGNMENT
1091Architectures: x86 ia64
1092Type: vm ioctl
1093Parameters: struct kvm_assigned_pci_dev (in)
1094Returns: 0 on success, -1 on error
1095
1096Assigns a host PCI device to the VM.
1097
1098struct kvm_assigned_pci_dev {
1099 __u32 assigned_dev_id;
1100 __u32 busnr;
1101 __u32 devfn;
1102 __u32 flags;
1103 __u32 segnr;
1104 union {
1105 __u32 reserved[11];
1106 };
1107};
1108
1109The PCI device is specified by the triple segnr, busnr, and devfn.
1110Identification in succeeding service requests is done via assigned_dev_id. The
1111following flags are specified:
1112
1113/* Depends on KVM_CAP_IOMMU */
1114#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
1115
11164.49 KVM_DEASSIGN_PCI_DEVICE
1117
1118Capability: KVM_CAP_DEVICE_DEASSIGNMENT
1119Architectures: x86 ia64
1120Type: vm ioctl
1121Parameters: struct kvm_assigned_pci_dev (in)
1122Returns: 0 on success, -1 on error
1123
1124Ends PCI device assignment, releasing all associated resources.
1125
1126See KVM_ASSIGN_PCI_DEVICE for the data structure. Only assigned_dev_id is
1127used in kvm_assigned_pci_dev to identify the device.
1128
11294.50 KVM_ASSIGN_DEV_IRQ
1130
1131Capability: KVM_CAP_ASSIGN_DEV_IRQ
1132Architectures: x86 ia64
1133Type: vm ioctl
1134Parameters: struct kvm_assigned_irq (in)
1135Returns: 0 on success, -1 on error
1136
1137Assigns an IRQ to a passed-through device.
1138
1139struct kvm_assigned_irq {
1140 __u32 assigned_dev_id;
1141 __u32 host_irq;
1142 __u32 guest_irq;
1143 __u32 flags;
1144 union {
1145 struct {
1146 __u32 addr_lo;
1147 __u32 addr_hi;
1148 __u32 data;
1149 } guest_msi;
1150 __u32 reserved[12];
1151 };
1152};
1153
1154The following flags are defined:
1155
1156#define KVM_DEV_IRQ_HOST_INTX (1 << 0)
1157#define KVM_DEV_IRQ_HOST_MSI (1 << 1)
1158#define KVM_DEV_IRQ_HOST_MSIX (1 << 2)
1159
1160#define KVM_DEV_IRQ_GUEST_INTX (1 << 8)
1161#define KVM_DEV_IRQ_GUEST_MSI (1 << 9)
1162#define KVM_DEV_IRQ_GUEST_MSIX (1 << 10)
1163
1164It is not valid to specify multiple types per host or guest IRQ. However, the
1165IRQ type of host and guest can differ or can even be null.
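
The rule above (at most one type per side, either side possibly null) can be
checked before issuing the ioctl. This is a minimal sketch: the two masks and
dev_irq_flags_valid() are names invented here for illustration, and the sketch
does not claim to match the kernel's own validation.

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_DEV_IRQ_HOST_INTX	(1 << 0)
#define KVM_DEV_IRQ_HOST_MSI	(1 << 1)
#define KVM_DEV_IRQ_HOST_MSIX	(1 << 2)

#define KVM_DEV_IRQ_GUEST_INTX	(1 << 8)
#define KVM_DEV_IRQ_GUEST_MSI	(1 << 9)
#define KVM_DEV_IRQ_GUEST_MSIX	(1 << 10)

/* Illustrative masks, built only from the flags listed above. */
#define HOST_TYPES  (KVM_DEV_IRQ_HOST_INTX | KVM_DEV_IRQ_HOST_MSI | \
		     KVM_DEV_IRQ_HOST_MSIX)
#define GUEST_TYPES (KVM_DEV_IRQ_GUEST_INTX | KVM_DEV_IRQ_GUEST_MSI | \
		     KVM_DEV_IRQ_GUEST_MSIX)

/* True if zero or one bit of x is set. */
static bool at_most_one_bit(uint32_t x)
{
	return (x & (x - 1)) == 0;
}

/* Hypothetical check: one IRQ type at most for host, one at most for guest. */
static bool dev_irq_flags_valid(uint32_t flags)
{
	return at_most_one_bit(flags & HOST_TYPES) &&
	       at_most_one_bit(flags & GUEST_TYPES);
}
```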
1166
11674.51 KVM_DEASSIGN_DEV_IRQ
1168
1169Capability: KVM_CAP_ASSIGN_DEV_IRQ
1170Architectures: x86 ia64
1171Type: vm ioctl
1172Parameters: struct kvm_assigned_irq (in)
1173Returns: 0 on success, -1 on error
1174
1175Ends an IRQ assignment to a passed-through device.
1176
1177See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
1178by assigned_dev_id; flags must correspond to the IRQ type specified on
1179KVM_ASSIGN_DEV_IRQ. Partial deassignment of host or guest IRQ is allowed.
1180
11814.52 KVM_SET_GSI_ROUTING
1182
1183Capability: KVM_CAP_IRQ_ROUTING
1184Architectures: x86 ia64
1185Type: vm ioctl
1186Parameters: struct kvm_irq_routing (in)
1187Returns: 0 on success, -1 on error
1188
1189Sets the GSI routing table entries, overwriting any previously set entries.
1190
1191struct kvm_irq_routing {
1192 __u32 nr;
1193 __u32 flags;
1194 struct kvm_irq_routing_entry entries[0];
1195};
1196
1197No flags are specified so far; the corresponding field must be set to zero.
1198
1199struct kvm_irq_routing_entry {
1200 __u32 gsi;
1201 __u32 type;
1202 __u32 flags;
1203 __u32 pad;
1204 union {
1205 struct kvm_irq_routing_irqchip irqchip;
1206 struct kvm_irq_routing_msi msi;
1207 __u32 pad[8];
1208 } u;
1209};
1210
1211/* gsi routing entry types */
1212#define KVM_IRQ_ROUTING_IRQCHIP 1
1213#define KVM_IRQ_ROUTING_MSI 2
1214
1215No flags are specified so far; the corresponding field must be set to zero.
1216
1217struct kvm_irq_routing_irqchip {
1218 __u32 irqchip;
1219 __u32 pin;
1220};
1221
1222struct kvm_irq_routing_msi {
1223 __u32 address_lo;
1224 __u32 address_hi;
1225 __u32 data;
1226 __u32 pad;
1227};
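
Because 'entries' is a variable-size array, the table has to be allocated with
room for 'nr' entries. The sketch below shows one way to lay out such a table
with one irqchip route and one MSI route. The structures are mirrored locally
under illustrative names so the sketch builds without <linux/kvm.h>, and the
GSI numbers, pin, address and data values are placeholders chosen for the
example, not values taken from this document.

```c
#include <stdint.h>
#include <stdlib.h>

#define KVM_IRQ_ROUTING_IRQCHIP 1
#define KVM_IRQ_ROUTING_MSI     2

/* Local mirrors of the structures documented above (illustrative names). */
struct routing_irqchip { uint32_t irqchip, pin; };
struct routing_msi { uint32_t address_lo, address_hi, data, pad; };

struct routing_entry {
	uint32_t gsi, type, flags, pad;
	union {
		struct routing_irqchip irqchip;
		struct routing_msi msi;
		uint32_t pad[8];
	} u;
};

struct routing_table {
	uint32_t nr, flags;
	struct routing_entry entries[];
};

/* Hypothetical helper: build a two-entry table; caller frees the result. */
static struct routing_table *build_example_routes(void)
{
	const uint32_t nr = 2;
	struct routing_table *t;

	/* calloc() zeroes both 'flags' fields and all padding. */
	t = calloc(1, sizeof(*t) + nr * sizeof(t->entries[0]));
	if (!t)
		return NULL;
	t->nr = nr;

	/* Placeholder: GSI 4 delivered to pin 4 of irqchip 0. */
	t->entries[0].gsi = 4;
	t->entries[0].type = KVM_IRQ_ROUTING_IRQCHIP;
	t->entries[0].u.irqchip.irqchip = 0;
	t->entries[0].u.irqchip.pin = 4;

	/* Placeholder: GSI 24 delivered as an MSI message. */
	t->entries[1].gsi = 24;
	t->entries[1].type = KVM_IRQ_ROUTING_MSI;
	t->entries[1].u.msi.address_lo = 0xfee00000u;
	t->entries[1].u.msi.data = 0x4041;

	return t;
}
```

Since the ioctl overwrites any previously set entries, userspace would pass
the complete table each time rather than an incremental update.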
1228
12294.53 KVM_ASSIGN_SET_MSIX_NR
1230
1231Capability: KVM_CAP_DEVICE_MSIX
1232Architectures: x86 ia64
1233Type: vm ioctl
1234Parameters: struct kvm_assigned_msix_nr (in)
1235Returns: 0 on success, -1 on error
1236
1237Set the number of MSI-X interrupts for an assigned device. This service can
1238only be called once in the lifetime of an assigned device.
1239
1240struct kvm_assigned_msix_nr {
1241 __u32 assigned_dev_id;
1242 __u16 entry_nr;
1243 __u16 padding;
1244};
1245
1246#define KVM_MAX_MSIX_PER_DEV 256
1247
12484.54 KVM_ASSIGN_SET_MSIX_ENTRY
1249
1250Capability: KVM_CAP_DEVICE_MSIX
1251Architectures: x86 ia64
1252Type: vm ioctl
1253Parameters: struct kvm_assigned_msix_entry (in)
1254Returns: 0 on success, -1 on error
1255
1256Specifies the routing of an MSI-X assigned device interrupt to a GSI. Setting
1257the GSI vector to zero means disabling the interrupt.
1258
1259struct kvm_assigned_msix_entry {
1260 __u32 assigned_dev_id;
1261 __u32 gsi;
1262 __u16 entry; /* The index of entry in the MSI-X table */
1263 __u16 padding[3];
1264};
1265
12665. The kvm_run structure
1267
1268Application code obtains a pointer to the kvm_run structure by
1269mmap()ing a vcpu fd. From that point, application code can control
1270execution by changing fields in kvm_run prior to calling the KVM_RUN
1271ioctl, and obtain information about the reason KVM_RUN returned by
1272looking up structure members.
1273
1274struct kvm_run {
1275 /* in */
1276 __u8 request_interrupt_window;
1277
1278Request that KVM_RUN return when it becomes possible to inject external
1279interrupts into the guest. Useful in conjunction with KVM_INTERRUPT.
1280
1281 __u8 padding1[7];
1282
1283 /* out */
1284 __u32 exit_reason;
1285
1286When KVM_RUN has returned successfully (return value 0), this informs
1287application code why KVM_RUN has returned. Allowable values for this
1288field are detailed below.
1289
1290 __u8 ready_for_interrupt_injection;
1291
1292If request_interrupt_window has been specified, this field indicates
1293an interrupt can be injected now with KVM_INTERRUPT.
1294
1295 __u8 if_flag;
1296
1297The value of the current interrupt flag. Only valid if in-kernel
1298local APIC is not used.
1299
1300 __u8 padding2[2];
1301
1302 /* in (pre_kvm_run), out (post_kvm_run) */
1303 __u64 cr8;
1304
1305The value of the cr8 register. Only valid if in-kernel local APIC is
1306not used. Both input and output.
1307
1308 __u64 apic_base;
1309
1310The value of the APIC BASE msr. Only valid if in-kernel local
1311APIC is not used. Both input and output.
1312
1313 union {
1314 /* KVM_EXIT_UNKNOWN */
1315 struct {
1316 __u64 hardware_exit_reason;
1317 } hw;
1318
1319If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown
1320reasons. Further architecture-specific information is available in
1321hardware_exit_reason.
1322
1323 /* KVM_EXIT_FAIL_ENTRY */
1324 struct {
1325 __u64 hardware_entry_failure_reason;
1326 } fail_entry;
1327
1328If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due
1329to unknown reasons. Further architecture-specific information is
1330available in hardware_entry_failure_reason.
1331
1332 /* KVM_EXIT_EXCEPTION */
1333 struct {
1334 __u32 exception;
1335 __u32 error_code;
1336 } ex;
1337
1338Unused.
1339
1340 /* KVM_EXIT_IO */
1341 struct {
1342#define KVM_EXIT_IO_IN 0
1343#define KVM_EXIT_IO_OUT 1
1344 __u8 direction;
1345 __u8 size; /* bytes */
1346 __u16 port;
1347 __u32 count;
1348 __u64 data_offset; /* relative to kvm_run start */
1349 } io;
1350
1351If exit_reason is KVM_EXIT_IO, then the vcpu has
1352executed a port I/O instruction which could not be satisfied by kvm.
1353data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
1354where kvm expects application code to place the data for the next
1355KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array.
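
Locating the packed data array from data_offset can be sketched as follows.
The io member is mirrored locally as struct exit_io (an illustrative name),
run_base stands for the start of the mmap()ed kvm_run area, and the length
calculation assumes the array holds 'count' items of 'size' bytes each.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the 'io' member of struct kvm_run (illustrative name). */
struct exit_io {
	uint8_t direction;
	uint8_t size;		/* bytes */
	uint16_t port;
	uint32_t count;
	uint64_t data_offset;	/* relative to kvm_run start */
};

/* Hypothetical helper: pointer to the packed data array. */
static uint8_t *io_data(void *run_base, const struct exit_io *io)
{
	return (uint8_t *)run_base + io->data_offset;
}

/* Hypothetical helper: total length of the packed array in bytes. */
static size_t io_data_len(const struct exit_io *io)
{
	return (size_t)io->size * io->count;
}
```

For KVM_EXIT_IO_OUT userspace reads io_data_len() bytes from io_data(); for
KVM_EXIT_IO_IN it writes them there before the next KVM_RUN.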
1356
1357 struct {
1358 struct kvm_debug_exit_arch arch;
1359 } debug;
1360
1361Unused.
1362
1363 /* KVM_EXIT_MMIO */
1364 struct {
1365 __u64 phys_addr;
1366 __u8 data[8];
1367 __u32 len;
1368 __u8 is_write;
1369 } mmio;
1370
1371If exit_reason is KVM_EXIT_MMIO, then the vcpu has
1372executed a memory-mapped I/O instruction which could not be satisfied
1373by kvm. The 'data' member contains the written data if 'is_write' is
1374true, and should be filled by application code otherwise.
1375
1376NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
1377operations are complete (and guest state is consistent) only after userspace
1378has re-entered the kernel with KVM_RUN. The kernel side will first finish
1379incomplete operations and then check for pending signals. Userspace
1380can re-enter the guest with an unmasked signal pending to complete
1381pending operations.
1382
1383 /* KVM_EXIT_HYPERCALL */
1384 struct {
1385 __u64 nr;
1386 __u64 args[6];
1387 __u64 ret;
1388 __u32 longmode;
1389 __u32 pad;
1390 } hypercall;
1391
1392Unused. This was once used for 'hypercall to userspace'. To implement
1393such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
1394Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
1395
1396 /* KVM_EXIT_TPR_ACCESS */
1397 struct {
1398 __u64 rip;
1399 __u32 is_write;
1400 __u32 pad;
1401 } tpr_access;
1402
1403To be documented (KVM_TPR_ACCESS_REPORTING).
1404
1405 /* KVM_EXIT_S390_SIEIC */
1406 struct {
1407 __u8 icptcode;
1408 __u64 mask; /* psw upper half */
1409 __u64 addr; /* psw lower half */
1410 __u16 ipa;
1411 __u32 ipb;
1412 } s390_sieic;
1413
1414s390 specific.
1415
1416 /* KVM_EXIT_S390_RESET */
1417#define KVM_S390_RESET_POR 1
1418#define KVM_S390_RESET_CLEAR 2
1419#define KVM_S390_RESET_SUBSYSTEM 4
1420#define KVM_S390_RESET_CPU_INIT 8
1421#define KVM_S390_RESET_IPL 16
1422 __u64 s390_reset_flags;
1423
1424s390 specific.
1425
1426 /* KVM_EXIT_DCR */
1427 struct {
1428 __u32 dcrn;
1429 __u32 data;
1430 __u8 is_write;
1431 } dcr;
1432
1433powerpc specific.
1434
1435 /* KVM_EXIT_OSI */
1436 struct {
1437 __u64 gprs[32];
1438 } osi;
1439
1440MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch
1441hypercalls and exit with this exit struct that contains all the guest gprs.
1442
1443If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall.
1444Userspace can now handle the hypercall and when it's done modify the gprs as
1445necessary. Upon guest entry all guest GPRs will then be replaced by the values
1446in this struct.
1447
1448 /* Fix the size of the union. */
1449 char padding[256];
1450 };
1451};
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
new file mode 100644
index 000000000000..882068538c9c
--- /dev/null
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -0,0 +1,45 @@
1KVM CPUID bits
2Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
3=====================================================
4
5A guest running on a kvm host can check some of its features using
6cpuid. This is not always guaranteed to work, since userspace can
7mask-out some, or even all KVM-related cpuid features before launching
8a guest.
9
10KVM cpuid functions are:
11
12function: KVM_CPUID_SIGNATURE (0x40000000)
13returns : eax = 0,
14 ebx = 0x4b4d564b,
15 ecx = 0x564b4d56,
16 edx = 0x4d.
17Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
18This function queries the presence of KVM cpuid leafs.
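
The register values above can be turned back into the signature string as in
this minimal sketch. kvm_signature() is a hypothetical helper; it extracts
bytes by shifting, so it does not depend on host byte order.

```c
#include <stdint.h>

/* Store the four bytes of reg into dst, least significant byte first. */
static void put_reg(char *dst, uint32_t reg)
{
	for (int i = 0; i < 4; i++)
		dst[i] = (char)((reg >> (8 * i)) & 0xff);
}

/* Hypothetical helper: build the NUL-terminated signature from ebx/ecx/edx. */
static void kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx,
			  char out[13])
{
	put_reg(out, ebx);
	put_reg(out + 4, ecx);
	put_reg(out + 8, edx);
	out[12] = '\0';
}
```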
19
20
21function: KVM_CPUID_FEATURES (0x40000001)
22returns : ebx, ecx, edx = 0
23 eax = an OR'ed group of (1 << flag), where each flag is:
24
25
26flag || value || meaning
27=============================================================================
28KVM_FEATURE_CLOCKSOURCE || 0 || kvmclock available at msrs
29 || || 0x11 and 0x12.
30------------------------------------------------------------------------------
31KVM_FEATURE_NOP_IO_DELAY || 1 || not necessary to perform delays
32 || || on PIO operations.
33------------------------------------------------------------------------------
34KVM_FEATURE_MMU_OP || 2 || deprecated.
35------------------------------------------------------------------------------
36KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs
37 || || 0x4b564d00 and 0x4b564d01
38------------------------------------------------------------------------------
39KVM_FEATURE_ASYNC_PF || 4 || async pf can be enabled by
40 || || writing to msr 0x4b564d02
41------------------------------------------------------------------------------
42KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
43 || || per-cpu warps are expected in
44 || || kvmclock.
45------------------------------------------------------------------------------
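
Testing a feature from the table above amounts to checking bit 'value' of the
eax returned by KVM_CPUID_FEATURES. A minimal sketch, assuming the caller has
already executed cpuid and holds eax; kvm_has_feature() is a hypothetical
helper name and expects a flag below 32:

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the table above. */
enum {
	KVM_FEATURE_CLOCKSOURCE            = 0,
	KVM_FEATURE_NOP_IO_DELAY           = 1,
	KVM_FEATURE_MMU_OP                 = 2,
	KVM_FEATURE_CLOCKSOURCE2           = 3,
	KVM_FEATURE_ASYNC_PF               = 4,
	KVM_FEATURE_CLOCKSOURCE_STABLE_BIT = 24,
};

/* Hypothetical helper: true if bit 'flag' is set in the features eax. */
static bool kvm_has_feature(uint32_t features_eax, unsigned int flag)
{
	return (features_eax >> flag) & 1u;
}
```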
diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
new file mode 100644
index 000000000000..3b4cd3bf5631
--- /dev/null
+++ b/Documentation/virtual/kvm/locking.txt
@@ -0,0 +1,25 @@
1KVM Lock Overview
2=================
3
41. Acquisition Orders
5---------------------
6
7(to be written)
8
92. Reference
10------------
11
12Name: kvm_lock
13Type: raw_spinlock
14Arch: any
15Protects: - vm_list
16 - hardware virtualization enable/disable
17Comment: 'raw' because hardware enabling/disabling must be atomic w.r.t.
18 migration.
19
20Name: kvm_arch::tsc_write_lock
21Type: raw_spinlock
22Arch: x86
23Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
24 - tsc offset in vmcb
25Comment: 'raw' because updating the tsc offsets must not be preempted.
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
new file mode 100644
index 000000000000..f46aa58389ca
--- /dev/null
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -0,0 +1,348 @@
1The x86 kvm shadow mmu
2======================
3
4The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
5for presenting a standard x86 mmu to the guest, while translating guest
6physical addresses to host physical addresses.
7
8The mmu code attempts to satisfy the following requirements:
9
10- correctness: the guest should not be able to determine that it is running
11 on an emulated mmu except for timing (we attempt to comply
12 with the specification, not emulate the characteristics of
13 a particular implementation such as tlb size)
14- security: the guest must not be able to touch host memory not assigned
15 to it
16- performance: minimize the performance penalty imposed by the mmu
17- scaling: need to scale to large memory and large vcpu guests
18- hardware: support the full range of x86 virtualization hardware
19- integration: Linux memory management code must be in control of guest memory
20 so that swapping, page migration, page merging, transparent
21 hugepages, and similar features work without change
22- dirty tracking: report writes to guest memory to enable live migration
23 and framebuffer-based displays
24- footprint: keep the amount of pinned kernel memory low (most memory
25 should be shrinkable)
26- reliability: avoid multipage or GFP_ATOMIC allocations
27
28Acronyms
29========
30
31pfn host page frame number
32hpa host physical address
33hva host virtual address
34gfn guest frame number
35gpa guest physical address
36gva guest virtual address
37ngpa nested guest physical address
38ngva nested guest virtual address
39pte page table entry (used also to refer generically to paging structure
40 entries)
41gpte guest pte (referring to gfns)
42spte shadow pte (referring to pfns)
43tdp two dimensional paging (vendor neutral term for NPT and EPT)
44
45Virtual and real hardware supported
46===================================
47
48The mmu supports first-generation mmu hardware, which allows an atomic switch
49of the current paging mode and cr3 during guest entry, as well as
50two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware
51it exposes is the traditional 2/3/4 level x86 mmu, with support for global
52pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support
53exposing NPT capable hardware on NPT capable hosts.
54
55Translation
56===========
57
58The primary job of the mmu is to program the processor's mmu to translate
59addresses for the guest. Different translations are required at different
60times:
61
62- when guest paging is disabled, we translate guest physical addresses to
63 host physical addresses (gpa->hpa)
64- when guest paging is enabled, we translate guest virtual addresses, to
65 guest physical addresses, to host physical addresses (gva->gpa->hpa)
66- when the guest launches a guest of its own, we translate nested guest
67 virtual addresses, to nested guest physical addresses, to guest physical
68 addresses, to host physical addresses (ngva->ngpa->gpa->hpa)
69
70The primary challenge is to encode between 1 and 3 translations into hardware
71that supports only 1 (traditional) or 2 (tdp) translations. When the
72number of required translations matches the hardware, the mmu operates in
73direct mode; otherwise it operates in shadow mode (see below).
74
75Memory
76======
77
78Guest memory (gpa) is part of the user address space of the process that is
79using kvm. Userspace defines the translation between guest addresses and user
80addresses (gpa->hva); note that two gpas may alias to the same hva, but not
81vice versa.
82
83These hvas may be backed using any method available to the host: anonymous
84memory, file backed memory, and device memory. Memory might be paged by the
85host at any time.
86
87Events
88======
89
90The mmu is driven by events, some from the guest, some from the host.
91
92Guest generated events:
93- writes to control registers (especially cr3)
94- invlpg/invlpga instruction execution
95- access to missing or protected translations
96
97Host generated events:
98- changes in the gpa->hpa translation (either through gpa->hva changes or
99 through hva->hpa changes)
100- memory pressure (the shrinker)
101
102Shadow pages
103============
104
105The principal data structure is the shadow page, 'struct kvm_mmu_page'. A
106shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A
107shadow page may contain a mix of leaf and nonleaf sptes.
108
109A nonleaf spte allows the hardware mmu to reach the leaf pages and
110is not related to a translation directly. It points to other shadow pages.
111
112A leaf spte corresponds to either one or two translations encoded into
113one paging structure entry. These are always the lowest level of the
114translation stack, with optional higher level translations left to NPT/EPT.
115Leaf ptes point at guest pages.
116
117The following table shows translations encoded by leaf ptes, with higher-level
118translations in parentheses:
119
120 Non-nested guests:
121 nonpaging: gpa->hpa
122 paging: gva->gpa->hpa
123 paging, tdp: (gva->)gpa->hpa
124 Nested guests:
125 non-tdp: ngva->gpa->hpa (*)
126 tdp: (ngva->)ngpa->gpa->hpa
127
128(*) the guest hypervisor will encode the ngva->gpa translation into its page
129 tables if npt is not present
130
131Shadow pages contain the following information:
132 role.level:
133 The level in the shadow paging hierarchy that this shadow page belongs to.
134 1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
135 role.direct:
136 If set, leaf sptes reachable from this page are for a linear range.
137 Examples include real mode translation, large guest pages backed by small
138 host pages, and gpa->hpa translations when NPT or EPT is active.
139 The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
140 by role.level (2MB for first level, 1GB for second level, 0.5TB for third
141 level, 256TB for fourth level)
142 If clear, this page corresponds to a guest page table denoted by the gfn
143 field.
144 role.quadrant:
145 When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
146 sptes. That means a guest page table contains more ptes than the host,
147 so multiple shadow pages are needed to shadow one guest page.
148 For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
149 first or second 512-gpte block in the guest page table. For second-level
150 page tables, each 32-bit gpte is converted to two 64-bit sptes
151 (since each first-level guest page is shadowed by two first-level
152 shadow pages) so role.quadrant takes values in the range 0..3. Each
153 quadrant maps 1GB virtual address space.
154 role.access:
155 Inherited guest access permissions in the form uwx. Note execute
156 permission is positive, not negative.
157 role.invalid:
158 The page is invalid and should not be used. It is a root page that is
159 currently pinned (by a cpu hardware register pointing to it); once it is
160 unpinned it will be destroyed.
161 role.cr4_pae:
162 Contains the value of cr4.pae for which the page is valid (e.g. whether
163 32-bit or 64-bit gptes are in use).
164 role.nxe:
165 Contains the value of efer.nxe for which the page is valid.
166 role.cr0_wp:
167 Contains the value of cr0.wp for which the page is valid.
168 gfn:
169 Either the guest page table containing the translations shadowed by this
170 page, or the base page frame for linear translations. See role.direct.
171 spt:
172 A pageful of 64-bit sptes containing the translations for this page.
173 Accessed by both kvm and hardware.
174 The page pointed to by spt will have its page->private pointing back
175 at the shadow page structure.
176 sptes in spt point either at guest pages, or at lower-level shadow pages.
177 Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
178 at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte.
179 The spt array forms a DAG structure with the shadow page as a node, and
180 guest pages as leaves.
181 gfns:
182 An array of 512 guest frame numbers, one for each present pte. Used to
183 perform a reverse map from a pte to a gfn. When role.direct is set, any
184 element of this array can be calculated from the gfn field when used; in
185 this case the array of gfns is not allocated. See role.direct and gfn.
186 slot_bitmap:
187 A bitmap containing one bit per memory slot. If the page contains a pte
188 mapping a page from memory slot n, then bit n of slot_bitmap will be set
189 (if a page is aliased among several slots, then it is not guaranteed that
190 all slots will be marked).
191 Used during dirty logging to avoid scanning a shadow page if none of its
192 pages need tracking.
193 root_count:
194 A counter keeping track of how many hardware registers (guest cr3 or
195 pdptrs) are now pointing at the page. While this counter is nonzero, the
196 page cannot be destroyed. See role.invalid.
197 multimapped:
198 Whether there exist multiple sptes pointing at this page.
199 parent_pte/parent_ptes:
200 If multimapped is zero, parent_pte points at the single spte that points at
201 this page's spt. Otherwise, parent_ptes points at a data structure
202 with a list of parent_ptes.
203 unsync:
204 If true, then the translations in this page may not match the guest's
205 translation. This is equivalent to the state of the tlb when a pte is
206 changed but before the tlb entry is flushed. Accordingly, unsync ptes
207 are synchronized when the guest executes invlpg or flushes its tlb by
208 other means. Valid for leaf pages.
209 unsync_children:
210 How many sptes in the page point at pages that are unsync (or have
211 unsynchronized children).
212 unsync_child_bitmap:
213 A bitmap indicating which sptes in spt point (directly or indirectly) at
214 pages that may be unsynchronized. Used to quickly locate all unsynchronized
215 pages reachable from a given page.
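
The linear range sizes quoted under role.direct follow from 4k pages and 512
sptes per shadow page, so each level multiplies the range by 512. A small
sketch of that arithmetic; the macro and function names are illustrative and
not taken from the kvm source:

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT	12	/* 4k pages */
#define EXAMPLE_LEVEL_BITS	9	/* 512 sptes per shadow page */

/* Bytes covered by a direct shadow page at the given role.level (1..4). */
static uint64_t direct_range_bytes(unsigned int level)
{
	return 1ULL << (EXAMPLE_PAGE_SHIFT + EXAMPLE_LEVEL_BITS * level);
}
```

Level 1 yields 2MB, level 2 yields 1GB, level 3 yields 0.5TB and level 4
yields 256TB, matching the figures given for role.direct.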
216
217Reverse map
218===========
219
220The mmu maintains a reverse mapping whereby all ptes mapping a page can be
221reached given its gfn. This is used, for example, when swapping out a page.
222
223Synchronized and unsynchronized pages
224=====================================
225
226The guest uses two events to synchronize its tlb and page tables: tlb flushes
227and page invalidations (invlpg).
228
229A tlb flush means that we need to synchronize all sptes reachable from the
230guest's cr3. This is expensive, so we keep all guest page tables write
231protected, and synchronize sptes to gptes when a gpte is written.
232
233A special case is when a guest page table is reachable from the current
234guest cr3. In this case, the guest is obliged to issue an invlpg instruction
235before using the translation. We take advantage of that by removing write
236protection from the guest page, and allowing the guest to modify it freely.
237We synchronize modified gptes when the guest invokes invlpg. This reduces
238the amount of emulation we have to do when the guest modifies multiple gptes,
239or when a guest page is no longer used as a page table and is used for
240random guest data.
241
242As a side effect we have to resynchronize all reachable unsynchronized shadow
243pages on a tlb flush.
244
245
246Reaction to events
247==================
248
249- guest page fault (or npt page fault, or ept violation)
250
251This is the most complicated event. The cause of a page fault can be:
252
253 - a true guest fault (the guest translation won't allow the access) (*)
254 - access to a missing translation
255 - access to a protected translation
256 - when logging dirty pages, memory is write protected
257 - synchronized shadow pages are write protected (*)
258 - access to untranslatable memory (mmio)
259
260 (*) not applicable in direct mode
261
262Handling a page fault is performed as follows:
263
264 - if needed, walk the guest page tables to determine the guest translation
265 (gva->gpa or ngpa->gpa)
266 - if permissions are insufficient, reflect the fault back to the guest
267 - determine the host page
268 - if this is an mmio request, there is no host page; call the emulator
269 to emulate the instruction instead
270 - walk the shadow page table to find the spte for the translation,
271 instantiating missing intermediate page tables as necessary
272 - try to unsynchronize the page
273 - if successful, we can let the guest continue and modify the gpte
274 - emulate the instruction
275 - if failed, unshadow the page and let the guest continue
276 - update any translations that were modified by the instruction
277
278invlpg handling:
279
280 - walk the shadow page hierarchy and drop affected translations
281 - try to reinstantiate the indicated translation in the hope that the
282 guest will use it in the near future
283
284Guest control register updates:
285
286- mov to cr3
287 - look up new shadow roots
288 - synchronize newly reachable shadow pages
289
290- mov to cr0/cr4/efer
291 - set up mmu context for new paging mode
292 - look up new shadow roots
293 - synchronize newly reachable shadow pages
294
295Host translation updates:
296
297 - mmu notifier called with updated hva
298 - look up affected sptes through reverse map
299 - drop (or update) translations
300
301Emulating cr0.wp
302================
303
304If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
305works for the guest kernel, not just guest userspace. When the guest
306cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
307we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
308semantics require allowing any guest kernel access plus user read access).
309
310We handle this by mapping the permissions to two possible sptes, depending
311on fault type:
312
313- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
314 disallows user access)
315- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
316 write access)
317
318(user write faults generate a #PF)
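
The two-way mapping above can be written as a small decision function. This
is a sketch of the rule as stated, for the case guest cr0.wp=0 with gpte.u=1
and gpte.w=0; the type and function names are invented for illustration and
do not correspond to names in the mmu code.

```c
#include <stdbool.h>

/* Illustrative view of the two spte permission bits involved. */
struct spte_perm {
	bool u;	/* user accessible */
	bool w;	/* writable */
};

enum fault_kind {
	FAULT_KERNEL_WRITE,
	FAULT_READ,
};

/* Pick the spte permissions for the fault being handled. */
static struct spte_perm wp0_spte_perm(enum fault_kind kind)
{
	struct spte_perm p;

	if (kind == FAULT_KERNEL_WRITE) {
		p.u = false;	/* disallows user access */
		p.w = true;	/* allows full kernel access */
	} else {
		p.u = true;	/* allows full read access */
		p.w = false;	/* disallows kernel write access */
	}
	return p;
}
```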
319
320Large pages
321===========
322
323The mmu supports all combinations of large and small guest and host pages.
324Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as
325two separate 2M pages, on both guest and host, since the mmu always uses PAE
326paging.
327
328To instantiate a large spte, four constraints must be satisfied:
329
330- the spte must point to a large host page
331- the guest pte must be a large pte of at least equivalent size (if tdp is
332 enabled, there is no guest pte and this condition is satisfied)
333- if the spte will be writeable, the large page frame may not overlap any
334 write-protected pages
335- the guest page must be wholly contained by a single memory slot
336
337To check the last two conditions, the mmu maintains a ->write_count set of
338arrays for each memory slot and large page size. Every write protected page
339causes its write_count to be incremented, thus preventing instantiation of
340a large spte. The frames at the end of an unaligned memory slot have
341artificially inflated ->write_counts so they can never be instantiated.
342
343Further reading
344===============
345
346- NPT presentation from KVM Forum 2008
347 http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf
348
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
new file mode 100644
index 000000000000..d079aed27e03
--- /dev/null
+++ b/Documentation/virtual/kvm/msr.txt
@@ -0,0 +1,187 @@
1KVM-specific MSRs.
2Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
3=====================================================
4
5KVM makes use of some custom MSRs to service some requests.
6
7Custom MSRs have a range reserved for them, that goes from
80x4b564d00 to 0x4b564dff. There are MSRs outside this area,
9but they are deprecated and their use is discouraged.
10
11Custom MSR list
12--------
13
14The currently supported custom MSR list is:
15
16MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00
17
18 data: 4-byte aligned physical address of a memory area which must be
19 in guest RAM. This memory is expected to hold a copy of the following
20 structure:
21
22 struct pvclock_wall_clock {
23 u32 version;
24 u32 sec;
25 u32 nsec;
26 } __attribute__((__packed__));
27
28 whose data will be filled in by the hypervisor. The hypervisor is only
29 guaranteed to update this data at the moment of MSR write.
30 Users that want to reliably query this information more than once have
31 to write more than once to this MSR. Fields have the following meanings:
32
33 version: guest has to check version before and after grabbing
34 time information and check that they are both equal and even.
35 An odd version indicates an in-progress update.
36
37 sec: number of seconds for wallclock.
38
39 nsec: number of nanoseconds for wallclock.
40
41 Note that although MSRs are per-CPU entities, the effect of this
42 particular MSR is global.
43
44 Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
45 leaf prior to usage.
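
The version check described above can be sketched as a small retry loop.
This is only an illustration: the helper name wall_clock_snapshot and the
max_tries parameter are assumptions, and a real guest would also need
read barriers between the version reads and the data reads.

```c
#include <stdint.h>

/* Layout copied from the structure shown above. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/*
 * Hypothetical helper: copy a consistent sec/nsec pair out of the shared
 * area. Returns 0 on success, -1 if no stable snapshot was observed
 * within max_tries attempts.
 */
static int wall_clock_snapshot(const volatile struct pvclock_wall_clock *wc,
			       uint32_t *sec, uint32_t *nsec, int max_tries)
{
	while (max_tries-- > 0) {
		uint32_t before = wc->version;
		/* A real guest needs a read barrier here ... */
		uint32_t s = wc->sec;
		uint32_t ns = wc->nsec;
		/* ... and another one here, before re-reading version. */
		uint32_t after = wc->version;

		/* Accept only if version is unchanged and even. */
		if (before == after && (before & 1) == 0) {
			*sec = s;
			*nsec = ns;
			return 0;
		}
	}
	return -1;
}
```

A caller would write the MSR first and then call the helper on the
registered area; an odd version makes every attempt fail.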
46
47MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
48
49 data: 4-byte aligned physical address of a memory area which must be in
50 guest RAM, plus an enable bit in bit 0. This memory is expected to hold
51 a copy of the following structure:
52
53 struct pvclock_vcpu_time_info {
54 u32 version;
55 u32 pad0;
56 u64 tsc_timestamp;
57 u64 system_time;
58 u32 tsc_to_system_mul;
59 s8 tsc_shift;
60 u8 flags;
61 u8 pad[2];
62 } __attribute__((__packed__)); /* 32 bytes */
63
64 whose data will be filled in by the hypervisor periodically. Only one
65 write, or registration, is needed for each VCPU. The interval between
66 updates of this structure is arbitrary and implementation-dependent.
67 The hypervisor may update this structure at any time it sees fit until
68 anything with bit0 == 0 is written to it.
69
70 Fields have the following meanings:
71
72 version: guest has to check version before and after grabbing
73 time information and check that they are both equal and even.
74 An odd version indicates an in-progress update.
75
76 tsc_timestamp: the tsc value at the current VCPU at the time
77 of the update of this structure. Guests can subtract this value
78 from current tsc to derive a notion of elapsed time since the
79 structure update.
80
81 system_time: a host notion of monotonic time, including sleep
82 time at the time this structure was last updated. Unit is
83 nanoseconds.
84
85 tsc_to_system_mul: a function of the tsc frequency. One has
86 to multiply any tsc-related quantity by this value to get
87 a value in nanoseconds, besides dividing by 2^tsc_shift
88
89 tsc_shift: cycle to nanosecond divider, as a power of two, to
90 allow for shift rights. One has to shift right any tsc-related
91 quantity by this value to get a value in nanoseconds, besides
92 multiplying by tsc_to_system_mul.
93
94 With this information, guests can derive per-CPU time by
95 doing:
96
97 time = (current_tsc - tsc_timestamp)
98 time = (time * tsc_to_system_mul) >> tsc_shift
99 time = time + system_time
100
101 flags: bits in this field indicate extended capabilities
102 coordinated between the guest and the hypervisor. Availability
103 of specific flags has to be checked in 0x40000001 cpuid leaf.
104 Current flags are:
105
106 flag bit | cpuid bit | meaning
107 -------------------------------------------------------------
108 | | time measures taken across
109 0 | 24 | multiple cpus are guaranteed to
110 | | be monotonic
111 -------------------------------------------------------------
112
113 Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
114 leaf prior to usage.
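
The three-step derivation given above can be written out literally as
follows. This is a sketch of the formula exactly as stated in this
section, not a reference implementation: the struct pvclock_sample and
the function name are assumptions, the inputs are expected to come from
a snapshot taken under the version check, and since tsc_shift is a
signed field a production guest must also handle a negative shift and
overflow of the 64-bit product.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for fields copied out of pvclock_vcpu_time_info. */
struct pvclock_sample {
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t tsc_shift;
};

/* Apply the documented steps to a TSC reading; result is in nanoseconds. */
static uint64_t pvclock_doc_time(const struct pvclock_sample *s,
				 uint64_t current_tsc)
{
	uint64_t time;

	/* This sketch only covers the non-negative shift case. */
	assert(s->tsc_shift >= 0 && s->tsc_shift < 64);

	time = current_tsc - s->tsc_timestamp;
	time = (time * s->tsc_to_system_mul) >> s->tsc_shift;
	return time + s->system_time;
}
```

For example, with tsc_timestamp = 1000, system_time = 5000,
tsc_to_system_mul = 3 and tsc_shift = 1, a TSC reading of 1010 gives
((10 * 3) >> 1) + 5000 = 5015.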
115
116
117MSR_KVM_WALL_CLOCK: 0x11
118
119 data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
120
121 This MSR falls outside the reserved KVM range and may be removed in the
122 future. Its usage is deprecated.
123
124 Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
125 leaf prior to usage.
126
127MSR_KVM_SYSTEM_TIME: 0x12
128
129 data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
130
131 This MSR falls outside the reserved KVM range and may be removed in the
132 future. Its usage is deprecated.
133
134 Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
135 leaf prior to usage.
136
137 The suggested algorithm for detecting kvmclock presence is then:
138
139 if (!kvm_para_available()) /* refer to cpuid.txt */
140 return NON_PRESENT;
141
142 flags = cpuid_eax(0x40000001);
143 if (flags & (1 << 3)) {
144 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
145 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
146 return PRESENT;
147 } else if (flags & (1 << 0)) {
148 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
149 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
150 return PRESENT;
151 } else
152 return NON_PRESENT;
153
154MSR_KVM_ASYNC_PF_EN: 0x4b564d02
155 data: Bits 63-6 hold 64-byte aligned physical address of a
156 64 byte memory area which must be in guest RAM and must be
157 zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
158 when asynchronous page faults are enabled on the vcpu, 0 when
159 disabled. Bit 1 is 1 if asynchronous page faults can be injected
160 when vcpu is in cpl == 0.
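
As an illustration of the layout above, the value written to the MSR
could be composed as in the sketch below. The macro and function names
are assumptions made for this example; only the enable bit (bit 0) and
the address bits (63-6) are handled.

```c
#include <assert.h>
#include <stdint.h>

#define APF_ENABLE_BIT	0x1ULL		/* bit 0: APF enabled (name is hypothetical) */
#define APF_ADDR_MASK	(~0x3fULL)	/* bits 63-6: 64-byte aligned address */

/* Combine the guest physical address of the 64-byte area with the enable bit. */
static uint64_t apf_msr_value(uint64_t area_gpa, int enable)
{
	/* The area must be 64-byte aligned so the low six bits are free. */
	assert((area_gpa & ~APF_ADDR_MASK) == 0);
	return area_gpa | (enable ? APF_ENABLE_BIT : 0);
}
```

Passing enable == 0 yields a value with bit 0 clear, which disables
asynchronous page faults on that vcpu.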
161
162 The first 4 bytes of the 64-byte memory location will be written to by
163 the hypervisor at the time of asynchronous page fault (APF)
164 injection to indicate the type of asynchronous page fault. A value
165 of 1 means that the page referred to by the page fault is not
166 present. A value of 2 means that the page is now available. Disabling
167 interrupts inhibits APFs. The guest must not enable interrupts
168 before the reason is read, or it may be overwritten by another
169 APF. Since APF uses the same exception vector as a regular page
170 fault, the guest must reset the reason to 0 before it does
171 something that can generate a normal page fault. If the APF
172 reason is 0 during a page fault, it is a regular page
173 fault.
174
175 During delivery of a type 1 APF, cr2 contains a token that will
176 be used to notify the guest when the missing page becomes
177 available. When the page becomes available, a type 2 APF is sent with
178 cr2 set to the token associated with the page. There is a special
179 token, 0xffffffff, which tells the vcpu that it should wake
180 up all processes waiting for APFs and that no individual type 2 APFs
181 will be sent.
182
183 If APF is disabled while there are outstanding APFs, they will
184 not be delivered.
185
186 Currently a type 2 APF will always be delivered on the same vcpu as
187 the type 1 APF was, but the guest should not rely on that.
diff --git a/Documentation/virtual/kvm/ppc-pv.txt b/Documentation/virtual/kvm/ppc-pv.txt
new file mode 100644
index 000000000000..3ab969c59046
--- /dev/null
+++ b/Documentation/virtual/kvm/ppc-pv.txt
@@ -0,0 +1,196 @@
1The PPC KVM paravirtual interface
2=================================
3
4The basic execution principle by which KVM on PowerPC works is to run all kernel
5space code in PR=1 which is user space. This way we trap all privileged
6instructions and can emulate them accordingly.
7
8Unfortunately that is also the downfall. There are quite a few privileged
9instructions that needlessly return us to the hypervisor even though they
10could be handled differently.
11
12This is what the PPC PV interface helps with. It takes privileged instructions
13and transforms them into unprivileged ones with some help from the hypervisor.
14This cuts down virtualization costs by about 50% on some of my benchmarks.
15
16The code for that interface can be found in arch/powerpc/kernel/kvm*
17
18Querying for existence
19======================
20
21To find out if we're running on KVM or not, we leverage the device tree. When
22Linux is running on KVM, a node /hypervisor exists. That node contains a
23compatible property with the value "linux,kvm".
24
25Once you have determined you're running under a PV capable KVM, you can now use
26hypercalls as described below.
27
28KVM hypercalls
29==============
30
31Inside the device tree's /hypervisor node there's a property called
32'hypercall-instructions'. This property contains at most 4 opcodes that make
33up the hypercall. To call a hypercall, just call these instructions.
34
35The parameters are as follows:
36
37 Register IN OUT
38
39 r0 - volatile
40 r3 1st parameter Return code
41 r4 2nd parameter 1st output value
42 r5 3rd parameter 2nd output value
43 r6 4th parameter 3rd output value
44 r7 5th parameter 4th output value
45 r8 6th parameter 5th output value
46 r9 7th parameter 6th output value
47 r10 8th parameter 7th output value
48 r11 hypercall number 8th output value
49 r12 - volatile
50
51Hypercall definitions are shared in generic code, so the same hypercall numbers
52apply for x86 and powerpc alike with the exception that each KVM hypercall
53also needs to be ORed with the KVM vendor code which is (42 << 16).
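
As a small illustration of the rule above, the value loaded into r11
could be computed as in the sketch below. The macro and function names
are assumptions made for this example; only the vendor code (42 << 16)
comes from the text.

```c
#include <assert.h>
#include <stdint.h>

#define KVM_VENDOR_CODE	42u	/* vendor code from the text; name is hypothetical */

/* OR a generic KVM hypercall number with the KVM vendor code. */
static uint32_t ppc_hcall_number(uint32_t generic_nr)
{
	/* The generic number must fit below the vendor field. */
	assert(generic_nr < (1u << 16));
	return (KVM_VENDOR_CODE << 16) | generic_nr;
}
```

A generic hypercall number of 4, for example, becomes 0x2a0004.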
54
55Return codes can be as follows:
56
57 Code Meaning
58
59 0 Success
60 12 Hypercall not implemented
61 <0 Error
62
63The magic page
64==============
65
66To enable communication between the hypervisor and guest there is a new shared
67page that contains parts of supervisor visible register state. The guest can
68map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
69
70With this hypercall issued the guest always gets the magic page mapped at the
71desired location in effective and physical address space. For now, we always
72map the page to -4096. This way we can access it using absolute load and store
73functions. The following instruction reads the first field of the magic page:
74
75 ld rX, -4096(0)
76
77The interface is designed to be extensible should there be need later to add
78additional registers to the magic page. If you add fields to the magic page,
79also define a new hypercall feature to indicate that the host can give you more
80registers. Only if the host supports the additional features, make use of them.
81
82The magic page has the following layout as described in
83arch/powerpc/include/asm/kvm_para.h:
84
85struct kvm_vcpu_arch_shared {
86 __u64 scratch1;
87 __u64 scratch2;
88 __u64 scratch3;
89 __u64 critical; /* Guest may not get interrupts if == r1 */
90 __u64 sprg0;
91 __u64 sprg1;
92 __u64 sprg2;
93 __u64 sprg3;
94 __u64 srr0;
95 __u64 srr1;
96 __u64 dar;
97 __u64 msr;
98 __u32 dsisr;
99 __u32 int_pending; /* Tells the guest if we have an interrupt */
100};
101
102Additions to the page must only occur at the end. Struct fields are always 32
103or 64 bit aligned, depending on them being 32 or 64 bit wide respectively.
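
To make the layout concrete, the sketch below recomputes the field
offsets and the effective address of a field when the page is mapped at
-4096, as described earlier. It is an illustration only: the kernel
types __u64 and __u32 are replaced by their stdint.h equivalents so the
fragment builds outside the kernel, and the helper magic_field_ea is an
assumed name.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field order as the structure above, with stdint.h types. */
struct kvm_vcpu_arch_shared {
	uint64_t scratch1;
	uint64_t scratch2;
	uint64_t scratch3;
	uint64_t critical;
	uint64_t sprg0;
	uint64_t sprg1;
	uint64_t sprg2;
	uint64_t sprg3;
	uint64_t srr0;
	uint64_t srr1;
	uint64_t dar;
	uint64_t msr;
	uint32_t dsisr;
	uint32_t int_pending;
};

/* Effective address of a field when the magic page is mapped at -4096. */
static int64_t magic_field_ea(size_t field_offset)
{
	return (int64_t)-4096 + (int64_t)field_offset;
}
```

With this layout, msr sits at offset 88, so a load from
magic_page->msr uses the displacement -4008.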
104
105Magic page features
106===================
107
108When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
109a second return value is passed to the guest. This second return value contains
110a bitmap of available features inside the magic page.
111
112The following enhancements to the magic page are currently available:
113
114 KVM_MAGIC_FEAT_SR Maps SR registers r/w in the magic page
115
116For enhanced features in the magic page, please check for the existence of the
117feature before using them!
118
119MSR bits
120========
121
122The MSR contains bits that require hypervisor intervention and bits that do
123not require direct hypervisor intervention because they only get interpreted
124when entering the guest or don't have any impact on the hypervisor's behavior.
125
126The following bits are safe to be set inside the guest:
127
128 MSR_EE
129 MSR_RI
130 MSR_CR
131 MSR_ME
132
133If any other bit changes in the MSR, please still use mtmsr(d).
134
135Patched instructions
136====================
137
138The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
139respectively on 32 bit systems with an added offset of 4 to accommodate for big
140endianness.
141
142The following is a list of mappings the Linux kernel performs when running as
143a guest. Implementing any of those mappings is optional, as the instruction traps
144also act on the shared page. So calling privileged instructions still works as
145before.
146
147From To
148==== ==
149
150mfmsr rX ld rX, magic_page->msr
151mfsprg rX, 0 ld rX, magic_page->sprg0
152mfsprg rX, 1 ld rX, magic_page->sprg1
153mfsprg rX, 2 ld rX, magic_page->sprg2
154mfsprg rX, 3 ld rX, magic_page->sprg3
155mfsrr0 rX ld rX, magic_page->srr0
156mfsrr1 rX ld rX, magic_page->srr1
157mfdar rX ld rX, magic_page->dar
158mfdsisr rX lwz rX, magic_page->dsisr
159
160mtmsr rX std rX, magic_page->msr
161mtsprg 0, rX std rX, magic_page->sprg0
162mtsprg 1, rX std rX, magic_page->sprg1
163mtsprg 2, rX std rX, magic_page->sprg2
164mtsprg 3, rX std rX, magic_page->sprg3
165mtsrr0 rX std rX, magic_page->srr0
166mtsrr1 rX std rX, magic_page->srr1
167mtdar rX std rX, magic_page->dar
168mtdsisr rX stw rX, magic_page->dsisr
169
170tlbsync nop
171
172mtmsrd rX, 0 b <special mtmsr section>
173mtmsr rX b <special mtmsr section>
174
175mtmsrd rX, 1 b <special mtmsrd section>
176
177[Book3S only]
178mtsrin rX, rY b <special mtsrin section>
179
180[BookE only]
181wrteei [0|1] b <special wrteei section>
182
183
184Some instructions require more logic to determine what's going on than a load
185or store instruction can deliver. To enable patching of those, we keep some
186RAM around where we can live translate instructions to. What happens is the
187following:
188
189 1) copy emulation code to memory
190 2) patch that code to fit the emulated instruction
191 3) patch that code to return to the original pc + 4
192 4) patch the original instruction to branch to the new code
193
194That way we can inject an arbitrary amount of code as replacement for a single
195instruction. This allows us to check for pending interrupts when setting EE=1
196for example.
diff --git a/Documentation/virtual/kvm/review-checklist.txt b/Documentation/virtual/kvm/review-checklist.txt
new file mode 100644
index 000000000000..730475ae1b8d
--- /dev/null
+++ b/Documentation/virtual/kvm/review-checklist.txt
@@ -0,0 +1,38 @@
1Review checklist for kvm patches
2================================
3
41. The patch must follow Documentation/CodingStyle and
5 Documentation/SubmittingPatches.
6
72. Patches should be against kvm.git master branch.
8
93. If the patch introduces or modifies a new userspace API:
10 - the API must be documented in Documentation/virtual/kvm/api.txt
11 - the API must be discoverable using KVM_CHECK_EXTENSION
12
134. New state must include support for save/restore.
14
155. New features must default to off (userspace should explicitly request them).
16 Performance improvements can and should default to on.
17
186. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2
19
207. Emulator changes should be accompanied by unit tests for qemu-kvm.git
21 kvm/test directory.
22
238. Changes should be vendor neutral when possible. Changes to common code
24 are better than duplicating changes to vendor code.
25
269. Similarly, prefer changes to arch independent code over arch dependent
27 code.
28
2910. User/kernel interfaces and guest/host interfaces must be 64-bit clean
30 (all variables and sizes naturally aligned on 64-bit; use specific types
31 only - u64 rather than ulong).
32
3311. New guest visible features must either be documented in a hardware manual
34 or be accompanied by documentation.
35
3612. Features must be robust against reset and kexec - for example, shared
37 host/guest memory must be unshared to prevent the host from writing to
38 guest memory that the guest has not reserved for this purpose.
diff --git a/Documentation/virtual/kvm/timekeeping.txt b/Documentation/virtual/kvm/timekeeping.txt
new file mode 100644
index 000000000000..df8946377cb6
--- /dev/null
+++ b/Documentation/virtual/kvm/timekeeping.txt
@@ -0,0 +1,612 @@
1
2 Timekeeping Virtualization for X86-Based Architectures
3
4 Zachary Amsden <zamsden@redhat.com>
5 Copyright (c) 2010, Red Hat. All rights reserved.
6
71) Overview
82) Timing Devices
93) TSC Hardware
104) Virtualization Problems
11
12=========================================================================
13
141) Overview
15
16One of the most complicated parts of the X86 platform, and specifically,
17the virtualization of this platform is the plethora of timing devices available
18and the complexity of emulating those devices. In addition, virtualization of
19time introduces a new set of challenges because it introduces a multiplexed
20division of time beyond the control of the guest CPU.
21
22First, we will describe the various timekeeping hardware available, then
23present some of the problems which arise and solutions available, giving
24specific recommendations for certain classes of KVM guests.
25
26The purpose of this document is to collect data and information relevant to
27timekeeping which may be difficult to find elsewhere, specifically,
28information relevant to KVM and hardware-based virtualization.
29
30=========================================================================
31
322) Timing Devices
33
34First we discuss the basic hardware devices available. TSC and the related
35KVM clock are special enough to warrant a full exposition and are described in
36the following section.
37
382.1) i8254 - PIT
39
40One of the first timer devices available is the programmable interval timer,
41or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
42channels which can be programmed to deliver periodic or one-shot interrupts.
43These three channels can be configured in different modes and have individual
44counters. Channel 1 and 2 were not available for general use in the original
45IBM PC, and historically were connected to control RAM refresh and the PC
46speaker. Now the PIT is typically integrated as part of an emulated chipset
47and a separate physical PIT is not used.
48
49The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
50using single or multiple byte access to the I/O ports. There are 6 modes
51available, but not all modes are available to all timers, as only timer 2
52has a connected gate input, required for modes 1 and 5. The gate line is
53controlled by port 61h, bit 0, as illustrated in the following diagram.
54
55 -------------- ----------------
56| | | |
57| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0
58| Clock | | | |
59 -------------- | +->| GATE TIMER 0 |
60 | ----------------
61 |
62 | ----------------
63 | | |
64 |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM
65 | | | (aka /dev/null)
66 | +->| GATE TIMER 1 |
67 | ----------------
68 |
69 | ----------------
70 | | |
71 |------>| CLOCK OUT | ---------> Port 61h, bit 5
72 | | |
73Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____
74 ---------------- _| )--|LPF|---Speaker
75 / *---- \___/
76Port 61h, bit 1 -----------------------------------/
77
78The timer modes are now described.
79
80Mode 0: Single Timeout. This is a one-shot software timeout that counts down
81 when the gate is high (always true for timers 0 and 1). When the count
82 reaches zero, the output goes high.
83
84Mode 1: Triggered One-shot. The output is initially set high. When the gate
85 line is set high, a countdown is initiated (which does not stop if the gate is
86 lowered), during which the output is set low. When the count reaches zero,
87 the output goes high.
88
89Mode 2: Rate Generator. The output is initially set high. When the countdown
90 reaches 1, the output goes low for one count and then returns high. The value
91 is reloaded and the countdown automatically resumes. If the gate line goes
92 low, the count is halted. If the output is low when the gate is lowered, the
93 output automatically goes high (this only affects timer 2).
94
95Mode 3: Square Wave. This generates a high / low square wave. The count
96 determines the length of the pulse, which alternates between high and low
97 when zero is reached. The count only proceeds when gate is high and is
98 automatically reloaded on reaching zero. The count is decremented twice at
99 each clock to generate a full high / low cycle at the full periodic rate.
100 If the count is even, the clock remains high for N/2 counts and low for N/2
101 counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
102 for (N-1)/2 counts. Only even values are latched by the counter, so odd
103 values are not observed when reading. This is the intended mode for timer 2,
104 which generates sine-like tones by low-pass filtering the square wave output.
105
106Mode 4: Software Strobe. After programming this mode and loading the counter,
107 the output remains high until the counter reaches zero. Then the output
108 goes low for 1 clock cycle and returns high. The counter is not reloaded.
109 Counting only occurs when gate is high.
110
111Mode 5: Hardware Strobe. After programming and loading the counter, the
112 output remains high. When the gate is raised, a countdown is initiated
113 (which does not stop if the gate is lowered). When the counter reaches zero,
114 the output goes low for 1 clock cycle and then returns high. The counter is
115 not reloaded.
116
117In addition to normal binary counting, the PIT supports BCD counting. The
118command port, 0x43 is used to set the counter and mode for each of the three
119timers.
120
121PIT commands, issued to port 0x43, using the following bit encoding:
122
123Bit 7-4: Command (See table below)
124Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
125Bit 0 : Binary (0) / BCD (1)
126
127Command table:
128
1290000 - Latch Timer 0 count for port 0x40
130 sample and hold the count to be read in port 0x40;
131 additional commands ignored until counter is read;
132 mode bits ignored.
133
1340001 - Set Timer 0 LSB mode for port 0x40
135 set timer to read LSB only and force MSB to zero;
136 mode bits set timer mode
137
1380010 - Set Timer 0 MSB mode for port 0x40
139 set timer to read MSB only and force LSB to zero;
140 mode bits set timer mode
141
1420011 - Set Timer 0 16-bit mode for port 0x40
143 set timer to read / write LSB first, then MSB;
144 mode bits set timer mode
145
1460100 - Latch Timer 1 count for port 0x41 - as described above
1470101 - Set Timer 1 LSB mode for port 0x41 - as described above
1480110 - Set Timer 1 MSB mode for port 0x41 - as described above
1490111 - Set Timer 1 16-bit mode for port 0x41 - as described above
150
1511000 - Latch Timer 2 count for port 0x42 - as described above
1521001 - Set Timer 2 LSB mode for port 0x42 - as described above
1531010 - Set Timer 2 MSB mode for port 0x42 - as described above
1541011 - Set Timer 2 16-bit mode for port 0x42 - as described above
155
1561101 - General counter latch
157 Latch combination of counters into corresponding ports
158 Bit 3 = Counter 2
159 Bit 2 = Counter 1
160 Bit 1 = Counter 0
161 Bit 0 = Unused
162
1631110 - Latch timer status
164 Latch combination of counter mode into corresponding ports
165 Bit 3 = Counter 2
166 Bit 2 = Counter 1
167 Bit 1 = Counter 0
168
169 The output of ports 0x40-0x42 following this command will be:
170
171 Bit 7 = Output pin
172 Bit 6 = Count loaded (0 if timer has expired)
173 Bit 5-4 = Read / Write mode
174 01 = LSB only
175 10 = MSB only
176 11 = LSB / MSB (16-bit)
177 Bit 3-1 = Mode
178 Bit 0 = Binary (0) / BCD mode (1)
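
The per-timer command encoding above can be illustrated with a small
helper that builds the byte written to port 0x43. This is a sketch; the
enum and function names are assumptions, and the two read-back commands
(1101 and 1110) are not covered.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 5-4 of the command byte, matching the command table above. */
enum pit_access {
	PIT_LATCH = 0,		/* xx00: latch count */
	PIT_LSB = 1,		/* xx01: LSB only */
	PIT_MSB = 2,		/* xx10: MSB only */
	PIT_LSB_MSB = 3		/* xx11: LSB first, then MSB */
};

/* Build a command byte: timer in bits 7-6, access in 5-4, mode in 3-1, BCD in 0. */
static uint8_t pit_command(unsigned timer, enum pit_access access,
			   unsigned mode, int bcd)
{
	assert(timer <= 2 && mode <= 5);
	return (uint8_t)((timer << 6) | ((unsigned)access << 4) |
			 (mode << 1) | (bcd ? 1u : 0u));
}
```

For instance, timer 0 in 16-bit access, mode 2, binary counting encodes
as 0x34, and timer 2 in 16-bit access, mode 3 encodes as 0xb6.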
179
1802.2) RTC
181
182The second device which was available in the original PC was the MC146818 real
183time clock. The original device is now obsolete, and usually emulated by the
184system chipset, sometimes by an HPET and some frankenstein IRQ routing.
185
186The RTC is accessed through CMOS variables, which uses an index register to
187control which bytes are read. Since there is only one index register, read
188of the CMOS and read of the RTC require lock protection (in addition, it is
189dangerous to allow userspace utilities such as hwclock to have direct RTC
190access, as they could corrupt kernel reads and writes of CMOS memory).
191
192The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
193can function as a periodic timer, an additional once a day alarm, and can issue
194interrupts after an update of the CMOS registers by the MC146818 is complete.
195The type of interrupt is signalled in the RTC status registers.
196
197The RTC will update the current time fields by battery power even while the
198system is off. The current time fields should not be read while an update is
199in progress, as indicated in the status register.
200
201The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
202programmed to a 32kHz divider if the RTC is to count seconds.
203
204This is the RAM map originally used for the RTC/CMOS:
205
206Location Size Description
207------------------------------------------
20800h byte Current second (BCD)
20901h byte Seconds alarm (BCD)
21002h byte Current minute (BCD)
21103h byte Minutes alarm (BCD)
21204h byte Current hour (BCD)
21305h byte Hours alarm (BCD)
21406h byte Current day of week (BCD)
21507h byte Current day of month (BCD)
21608h byte Current month (BCD)
21709h byte Current year (BCD)
2180Ah byte Register A
219 bit 7 = Update in progress
220 bit 6-4 = Divider for clock
221 000 = 4.194 MHz
222 001 = 1.049 MHz
223 010 = 32 kHz
224 10X = test modes
225 110 = reset / disable
226 111 = reset / disable
227 bit 3-0 = Rate selection for periodic interrupt
228 0000 = periodic timer disabled
229 0001 = 3.90625 mS
230 0010 = 7.8125 mS
231 0011 = .122070 mS
232 0100 = .244141 mS
233 ...
234 1101 = 125 mS
235 1110 = 250 mS
236 1111 = 500 mS
2370Bh byte Register B
238 bit 7 = Run (0) / Halt (1)
239 bit 6 = Periodic interrupt enable
240 bit 5 = Alarm interrupt enable
241 bit 4 = Update-ended interrupt enable
242 bit 3 = Square wave interrupt enable
243 bit 2 = BCD calendar (0) / Binary (1)
244 bit 1 = 12-hour mode (0) / 24-hour mode (1)
245 bit 0 = 0 (DST off) / 1 (DST enabled)
2460Ch byte Register C (read only)
247 bit 7 = interrupt request flag (IRQF)
248 bit 6 = periodic interrupt flag (PF)
249 bit 5 = alarm interrupt flag (AF)
250 bit 4 = update interrupt flag (UF)
251 bit 3-0 = reserved
2520Dh byte Register D (read only)
253 bit 7 = RTC has power
254 bit 6-0 = reserved
25532h byte Current century BCD (*)
256 (*) location vendor specific and now determined from ACPI global tables
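
Since most of the fields above are stored in BCD, and register A packs
several fields into one byte, a guest or emulator typically needs small
decoding helpers. The following is a minimal sketch; the function names
are assumptions made for this example.

```c
#include <stdint.h>

/* Convert one packed BCD byte (two decimal digits) to binary. */
static unsigned bcd_to_bin(uint8_t bcd)
{
	return (bcd >> 4) * 10u + (bcd & 0x0fu);
}

/* Register A, bit 7: an update of the time fields is in progress. */
static int rtc_update_in_progress(uint8_t reg_a)
{
	return (reg_a & 0x80) != 0;
}

/* Register A, bits 6-4: divider selection (010 selects 32 kHz). */
static unsigned rtc_divider_bits(uint8_t reg_a)
{
	return (reg_a >> 4) & 0x7u;
}
```

A seconds byte of 0x59 therefore decodes to 59, and a register A value
of 0x26 reports no update in progress with the 32 kHz divider selected.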
257
2582.3) APIC
259
260On Pentium and later processors, an on-board timer is available to each CPU
261as part of the Advanced Programmable Interrupt Controller. The APIC is
262accessed through memory-mapped registers and provides interrupt service to each
263CPU, used for IPIs and local timer interrupts.
264
265Although in theory the APIC is a safe and stable source for local interrupts,
266in practice, many bugs and glitches have occurred due to the special nature of
267the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect
268the use of the APIC and that workarounds may be required. In addition, some of
269these workarounds pose unique constraints for virtualization - requiring either
270extra overhead incurred from extra reads of memory-mapped I/O or additional
271functionality that may be more computationally expensive to implement.
272
273Since the APIC is documented quite well in the Intel and AMD manuals, we will
274avoid repetition of the detail here. It should be pointed out that the APIC
275timer is programmed through the LVT (local vector table) register, is capable
276of one-shot or periodic operation, and is based on the bus clock divided down
277by the programmable divider register.
278
2792.4) HPET
280
281HPET is quite complex, and was originally intended to replace the PIT / RTC
282support of the X86 PC. It remains to be seen whether that will be the case, as
283the de facto standard of PC hardware is to emulate these older devices. Some
284systems designated as legacy free may support only the HPET as a hardware timer
285device.
286
287The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
288but allowing implementation freedom to support many more. It also imposes no
289fixed rate on the timer frequency, but does impose some extremal values on
290frequency, error and slew.
291
292In general, the HPET is recommended as a high precision (compared to PIT / RTC)
293time source which is independent of local variation (as there is only one HPET
294in any given system). The HPET is also memory-mapped, and its presence is
295indicated through ACPI tables by the BIOS.
296
297Detailed specification of the HPET is beyond the current scope of this
298document, as it is also very well documented elsewhere.
299
3002.5) Offboard Timers
301
302Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
303timing chips built into the cards which may have registers which are accessible
304to kernel or user drivers. To the author's knowledge, using these to generate
305a clocksource for a Linux or other kernel has not yet been attempted and is in
306general frowned upon as not playing by the agreed rules of the game. Such a
307timer device would require additional support to be virtualized properly and is
308not considered important at this time as no known operating system does this.
309
310=========================================================================
311
3123) TSC Hardware
313
314The TSC or time stamp counter is relatively simple in theory; it counts
315instruction cycles issued by the processor, which can be used as a measure of
316time. In practice, due to a number of problems, it is the most complicated
317timekeeping device to use.
318
319The TSC is represented internally as a 64-bit MSR which can be read with the
320RDMSR, RDTSC, or RDTSCP (when available) instructions. The TSC can also be
321written, but because of hardware limitations, on old hardware it was generally
322only possible to write the low 32 bits of the 64-bit counter, and the upper
32332 bits of the counter were cleared. Now, however, on Intel processors family
3240Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
325has been lifted and all 64-bits are writable. On AMD systems, the ability to
326write the TSC MSR is not an architectural guarantee.
327
328The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
329means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
330
331Some vendors have implemented an additional instruction, RDTSCP, which returns
332atomically not just the TSC, but an indicator which corresponds to the
333processor number. This can be used to index into an array of TSC variables to
334determine offset information in SMP systems where TSCs are not synchronized.
335The presence of this instruction must be determined by consulting CPUID feature
336bits.
337
338Both VMX and SVM provide extension fields in the virtualization hardware which
339allow the guest visible TSC to be offset by a constant. Newer implementations
340promise to allow the TSC to additionally be scaled, but this hardware is not
341yet widely available.
342
3433.1) TSC synchronization
344
345The TSC is a CPU-local clock in most implementations. This means, on SMP
346platforms, the TSCs of different CPUs may start at different times depending
347on when the CPUs are powered on. Generally, CPUs on the same die will share
348the same clock, however, this is not always the case.
349
350The BIOS may attempt to resynchronize the TSCs during the poweron process and
351the operating system or other system software may attempt to do this as well.
352Several hardware limitations make the problem worse - if it is not possible to
353write the full 64-bits of the TSC, it may be impossible to match the TSC in
354newly arriving CPUs to that of the rest of the system, resulting in
355unsynchronized TSCs. This may be done by BIOS or system software, but in
356practice, getting a perfectly synchronized TSC will not be possible unless all
357values are read from the same clock, which generally only is possible on single
358socket systems or those with special hardware support.
359
3603.2) TSC and CPU hotplug
361
362As touched on already, CPUs which arrive later than the boot time of the system
363may not have a TSC value that is synchronized with the rest of the system.
364Either system software, BIOS, or SMM code may actually try to establish the TSC
365to a value matching the rest of the system, but a perfect match is usually not
366a guarantee. This can have the effect of bringing a system from a state where
367TSC is synchronized back to a state where TSC synchronization flaws, however
368small, may be exposed to the OS and any virtualization environment.
369
3703.3) TSC and multi-socket / NUMA
371
372Multi-socket systems, especially large multi-socket systems, are likely to have
373individual clocksources rather than a single, universally distributed clock.
374Since these clocks are driven by different crystals, they will not have
375perfectly matched frequency, and temperature and electrical variations will
376cause the CPU clocks, and thus the TSCs to drift over time. Depending on the
377exact clock and bus design, the drift may or may not be fixed in absolute
378error, and may accumulate over time.
379
380In addition, very large systems may deliberately slew the clocks of individual
381cores. This technique, known as spread-spectrum clocking, reduces EMI at the
382clock frequency and harmonics of it, which may be required to pass FCC
383standards for telecommunications and computer equipment.
384
385It is recommended not to trust the TSCs to remain synchronized on NUMA or
386multiple socket systems for these reasons.
387
3883.4) TSC and C-states
389
390C-states, or idling states of the processor, especially C1E and deeper sleep
391states, may be problematic for TSC as well. The TSC may stop advancing in such
392a state, resulting in a TSC which is behind that of other CPUs when execution
393is resumed. Such CPUs must be detected and flagged by the operating system
394based on CPU and chipset identifications.
395
396The TSC in such a case may be corrected by catching it up to a known external
397clocksource.
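
As a rough illustration of the catch-up step, the sketch below estimates the
value a stopped TSC should be advanced to after an idle period. The helper name
tsc_catch_up() and its parameters are assumptions made for this example, not an
existing kernel interface; it also assumes the idle duration was measured on an
external clock that kept running and that the TSC rate in kHz is already known.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: estimate where a stopped TSC should be after idle.
 *   tsc_at_idle - TSC value saved when the CPU entered the C-state
 *   idle_ns     - idle duration measured on an external clock
 *   tsc_khz     - calibrated TSC rate in kHz (cycles per millisecond)
 */
static uint64_t tsc_catch_up(uint64_t tsc_at_idle, uint64_t idle_ns,
			     uint32_t tsc_khz)
{
	uint64_t ms, rem_ns;

	assert(tsc_khz != 0);

	/* cycles = ns * kHz / 10^6; split the product to limit overflow. */
	ms = idle_ns / 1000000;
	rem_ns = idle_ns % 1000000;
	return tsc_at_idle + ms * tsc_khz + rem_ns * tsc_khz / 1000000;
}
```

For example, 2.5 ms of idle time at 2 GHz corresponds to 5,000,000 cycles to be
added to the saved value.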
398
3993.5) TSC frequency change / P-states
400
401To make things slightly more interesting, some CPUs may change frequency. They
402may or may not run the TSC at the same rate, and because the frequency change
403may be staggered or slewed, at some points in time, the TSC rate may not be
404known other than falling within a range of values. In this case, the TSC will
405not be a stable time source, and must be calibrated against a known, stable,
406external clock to be a usable source of time.
407
408Whether the TSC runs at a constant rate or scales with the P-state is model
409dependent and must be determined by inspecting CPUID, chipset or vendor
410specific MSR fields.
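
Calibration against an external clock amounts to sampling both counters at the
start and end of a window and dividing. The following is a minimal sketch under
that assumption; tsc_calibrate_khz() and its arguments are names invented for
this example, and the reference clock is assumed to report nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive the TSC rate in kHz from two paired samples
 * of the TSC and of a stable reference clock (in nanoseconds).
 */
static uint64_t tsc_calibrate_khz(uint64_t tsc_start, uint64_t tsc_end,
				  uint64_t ref_start_ns, uint64_t ref_end_ns)
{
	uint64_t cycles = tsc_end - tsc_start;
	uint64_t ns = ref_end_ns - ref_start_ns;

	assert(ns != 0);

	/*
	 * kHz = cycles per millisecond = cycles * 10^6 / ns.
	 * The product fits in 64 bits for calibration windows of a few
	 * seconds at GHz rates; longer windows would need wider math.
	 */
	return cycles * 1000000 / ns;
}
```

A window of 10 ms in which the TSC advances by 30,000,000 cycles yields
3,000,000 kHz, that is, 3 GHz. If the frequency changes during the window, the
result is only an average, which is the instability described above.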
411
412In addition, some vendors have known bugs where the P-state is actually
413compensated for properly during normal operation, but when the processor is
414inactive, the P-state may be raised temporarily to service cache misses from
415other processors. In such cases, the TSC on halted CPUs could advance faster
416than that of non-halted processors. AMD Turion processors are known to have
417this problem.
418
4193.6) TSC and STPCLK / T-states
420
421External signals given to the processor may also have the effect of stopping
422the TSC. This is typically done for thermal emergency power control to prevent
423an overheating condition, and typically, there is no way to detect that this
424condition has happened.
425
4263.7) TSC virtualization - VMX
427
428VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
429instructions, which is enough for full virtualization of TSC in any manner. In
430addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
431field specified in the VMCS. Special instructions must be used to read and
432write the VMCS field.
433
4343.8) TSC virtualization - SVM
435
436SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
437instructions, which is enough for full virtualization of TSC in any manner. In
438addition, SVM allows passing through the host TSC plus an additional offset
439field specified in the SVM control block.
440
4413.9) TSC feature bits in Linux
442
443In summary, there is no way to guarantee the TSC remains in perfect
444synchronization unless it is explicitly guaranteed by the architecture. Even
445if so, the TSCs in multi-sockets or NUMA systems may still run independently
446despite being locally consistent.
447
448The following feature bits are used by Linux to signal various TSC attributes,
449but they can only be taken to be meaningful for UP or single node systems.
450
451X86_FEATURE_TSC : The TSC is available in hardware
452X86_FEATURE_RDTSCP : The RDTSCP instruction is available
453X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
454X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
455X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
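
From userspace, these bits typically appear as lowercase names ("tsc",
"rdtscp", "constant_tsc", "nonstop_tsc", "tsc_reliable") in the "flags" line of
/proc/cpuinfo. The sketch below checks such a flags string for one name; the
helper has_cpu_flag() is an assumption for illustration, and reading the file
itself is left out. Whole-word matching matters because "tsc" is a substring of
several other flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the space-separated flags string
 * contains name as a whole word.
 */
static bool has_cpu_flag(const char *flags, const char *name)
{
	size_t len;
	const char *p = flags;

	assert(flags != NULL && name != NULL);

	len = strlen(name);
	if (len == 0)
		return false;

	while ((p = strstr(p, name)) != NULL) {
		bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}
```

As noted above, a positive result is only meaningful for UP or single node
systems.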
456
4574) Virtualization Problems
458
459Timekeeping is especially problematic for virtualization because a number of
460challenges arise. The most obvious problem is that time is now shared between
461the host and, potentially, a number of virtual machines. Thus the virtual
462operating system does not run with 100% usage of the CPU, despite the fact that
463it may very well make that assumption. It may expect it to remain true to very
464exacting bounds when interrupt sources are disabled, but in reality only its
465virtual interrupt sources are disabled, and the machine may still be preempted
466at any time. This causes problems as the passage of real time, the injection
467of machine interrupts and the associated clock sources are no longer completely
468synchronized with real time.
469
470This same problem can occur on native hardware to a degree, as SMM mode may
471steal cycles from the operating system on X86 systems when SMM mode is used by the
472BIOS, but not in such an extreme fashion. However, the fact that SMM mode may
473cause similar problems to virtualization makes it a good justification for
474solving many of these problems on bare metal.
475
4764.1) Interrupt clocking
477
478One of the most immediate problems that occurs with legacy operating systems
479is that the system timekeeping routines are often designed to keep track of
480time by counting periodic interrupts. These interrupts may come from the PIT
481or the RTC, but the problem is the same: the host virtualization engine may not
482be able to deliver the proper number of interrupts per second, and so guest
483time may fall behind. This is especially problematic if a high interrupt rate
484is selected, such as 1000 HZ, which is unfortunately the default for many Linux
485guests.
486
487There are three approaches to solving this problem; first, it may be possible
488to simply ignore it. Guests which have a separate time source for tracking
489'wall clock' or 'real time' may not need any adjustment of their interrupts to
490maintain proper time. If this is not sufficient, it may be necessary to inject
491additional interrupts into the guest in order to increase the effective
492interrupt rate. This approach leads to complications in extreme conditions,
493where host load or guest lag is too much to compensate for, and thus another
494solution to the problem has arisen: the guest may need to become aware of lost
495ticks and compensate for them internally. Although promising in theory, the
496implementation of this policy in Linux has been extremely error prone, and a
497number of buggy variants of lost tick compensation are distributed across
498commonly used Linux systems.
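
The core of lost tick compensation can be sketched as follows, assuming the
guest has an independent time source to compare against. The function
lost_ticks() and its parameters are hypothetical names for this example and do
not correspond to any particular Linux implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: called from the tick handler, returns how many
 * ticks were missed since the previous tick was handled.
 *   last_tick_ns - independent clock reading at the previous tick
 *   now_ns       - independent clock reading now
 *   tick_ns      - nominal tick period (1000000 for 1000 HZ)
 */
static uint64_t lost_ticks(uint64_t last_tick_ns, uint64_t now_ns,
			   uint64_t tick_ns)
{
	uint64_t elapsed;

	assert(tick_ns != 0 && now_ns >= last_tick_ns);

	elapsed = (now_ns - last_tick_ns) / tick_ns;

	/* One of the elapsed periods is the tick being handled right now. */
	return elapsed > 1 ? elapsed - 1 : 0;
}
```

The difficulty in practice lies less in this arithmetic than in the interaction
with a hypervisor that is also injecting catch-up interrupts: if both sides
compensate, guest time runs fast.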
499
500Windows uses periodic RTC clocking as a means of keeping time internally, and
501thus requires interrupt slewing to keep proper time. It does use a low enough
502rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
503practice.
504
5054.2) TSC sampling and serialization
506
507As the highest precision time source available, the cycle counter of the CPU
508has aroused much interest from developers. As explained above, this timer has
509many problems unique to its nature as a local, potentially unstable and
510potentially unsynchronized source. One issue which is not unique to the TSC,
511but is highlighted because of its very precise nature is sampling delay. By
512definition, the counter, once read is already old. However, it is also
513possible for the counter to be read ahead of the actual use of the result.
514This is a consequence of the superscalar execution of the instruction stream,
515which may execute instructions out of order. Such execution is called
516non-serialized. Forcing serialized execution is necessary for precise
517measurement with the TSC, and requires a serializing instruction, such as CPUID
518or an MSR read.
519
520Since CPUID may actually be virtualized by a trap and emulate mechanism, this
521serialization can pose a performance issue for hardware virtualization. An
522accurate time stamp counter reading may therefore not always be available, and
523it may be necessary for an implementation to guard against "backwards" reads of
524the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
525system.
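
One common way to guard against "backwards" reads is to remember the largest
value handed out so far and never return anything smaller. The sketch below
shows that idea using C11 atomics; monotonic_tsc() and last_tsc are names made
up for this example, the raw reading is passed in rather than obtained with
RDTSC, and the shared variable is a deliberate simplification that would become
a contention point on large systems.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Largest value returned so far, shared by all readers (example only). */
static _Atomic uint64_t last_tsc;

/*
 * Hypothetical helper: clamp a raw counter reading so that callers never
 * observe a value smaller than one that was already returned.
 */
static uint64_t monotonic_tsc(uint64_t raw)
{
	uint64_t last = atomic_load(&last_tsc);

	do {
		if (raw <= last)
			return last;	/* stale or reordered read */
		/* On failure, 'last' is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(&last_tsc, &last, raw));

	return raw;
}
```

The cost is that time appears to stand still briefly instead of going
backwards, which most consumers tolerate far better.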
526
5274.3) Timespec aliasing
528
529Additionally, this lack of serialization from the TSC poses another challenge
530when using results of the TSC when measured against another time source. As
531the TSC is much higher precision, many possible values of the TSC may be read
532while another clock is still expressing the same value.
533
534That is, you may read (T,T+10) while external clock C maintains the same value.
535Due to non-serialized reads, you may actually end up with a range which
536fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
537calibrated against an external value may have a range of valid values.
538Re-calibrating this computation may actually cause time, as computed after the
539calibration, to go backwards, compared with time computed before the
540calibration.
541
542This problem is particularly pronounced with an internal time source in Linux,
543the kernel time, which is expressed in the theoretically high resolution
544timespec - but which advances in much larger granularity intervals, sometimes
545at the rate of jiffies, and possibly in catchup modes, at a much larger step.
546
547This aliasing requires care in the computation and recalibration of kvmclock
548and any other values derived from TSC computation (such as TSC virtualization
549itself).
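
A small worked example may make the aliasing concrete. Assume, purely for
illustration, a TSC running at one cycle per nanosecond and an external clock
that only advances in 1000 ns steps; struct calibration and time_from_tsc() are
names invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* A calibration point: external clock value paired with a TSC reading. */
struct calibration {
	uint64_t ref_ns;	/* coarse external clock at calibration */
	uint64_t tsc;		/* TSC read at (about) the same moment */
};

/* Hypothetical helper: time derived from the TSC, 1 cycle == 1 ns. */
static uint64_t time_from_tsc(const struct calibration *c, uint64_t tsc)
{
	assert(tsc >= c->tsc);
	return c->ref_ns + (tsc - c->tsc);
}
```

Calibrating at TSC 5000 while the external clock reads 5000 gives a computed
time of 5890 at TSC 5890. If the computation is recalibrated at TSC 5900, while
the external clock still reads 5000, the computed time at TSC 5950 becomes
5050, which is earlier than the 5890 already reported.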
550
5514.4) Migration
552
553Migration of a virtual machine raises problems for timekeeping in two ways.
554First, the migration itself may take time, during which interrupts cannot be
555delivered, and after which, the guest time may need to be caught up. NTP may
556be able to help to some degree here, as the clock correction required is
557typically small enough to fall in the NTP-correctable window.
558
559An additional concern is that timers based off the TSC (or HPET, if the raw bus
560clock is exposed) may now be running at different rates, requiring compensation
561in some way in the hypervisor by virtualizing these timers. In addition,
562migrating to a faster machine may preclude the use of a passthrough TSC, as a
563faster clock cannot be made visible to a guest without the potential of time
564advancing faster than usual. A slower clock is less of a problem, as it can
565always be caught up to the original rate. KVM clock avoids these problems by
566simply storing multipliers and offsets against the TSC for the guest to convert
567back into nanosecond resolution values.
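
The conversion the guest performs can be sketched as a fixed-point scale of the
TSC delta. The structure and function below are simplified assumptions for
illustration, loosely modelled on a multiplier and shift pair; the field names
are invented here and the real kvmclock structure carries additional fields
such as a version counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified parameters published by the hypervisor. */
struct clock_params {
	uint64_t tsc_base;	/* TSC value when the parameters were set */
	uint64_t ns_base;	/* nanosecond time at tsc_base */
	uint32_t mul;		/* ns per cycle as a 0.32 fixed-point value */
	int8_t shift;		/* pre-scale applied to the delta */
};

static uint64_t tsc_to_ns(const struct clock_params *p, uint64_t tsc)
{
	uint64_t delta;

	assert(tsc >= p->tsc_base);
	delta = tsc - p->tsc_base;

	if (p->shift >= 0)
		delta <<= p->shift;
	else
		delta >>= -p->shift;

	/* (delta * mul) >> 32, split in halves so the product cannot overflow. */
	return p->ns_base + (delta >> 32) * p->mul +
	       (((delta & 0xffffffffu) * p->mul) >> 32);
}
```

For a 2 GHz TSC the multiplier would be 0x80000000 (0.5 ns per cycle) with a
shift of 0. After migration to a host with a different TSC rate, only the
published parameters need to change; the guest code stays the same.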
568
5694.5) Scheduling
570
571Since scheduling may be based on precise timing and firing of interrupts, the
572scheduling algorithms of an operating system may be adversely affected by
573virtualization. In theory, the effect is random and should be uniformly
574distributed, but in contrived as well as real scenarios (guest device access,
575causes of virtualization exits, possible context switch), this may not always
576be the case. The effect of this has not been well studied.
577
578In an attempt to work around this, several implementations have provided a
579paravirtualized scheduler clock, which reveals the true amount of CPU time for
580which a virtual machine has been running.
581
5824.6) Watchdogs
583
584Watchdog timers, such as the lock detector in Linux may fire accidentally when
585running under hardware virtualization due to timer interrupts being delayed or
586misinterpretation of the passage of real time. Usually, these warnings are
587spurious and can be ignored, but in some circumstances it may be necessary to
588disable such detection.
589
5904.7) Delays and precision timing
591
592Precise timing and delays may not be possible in a virtualized system. This
593can happen if the system is controlling physical hardware, or issues delays to
594compensate for slower I/O to and from devices. The first issue is not solvable
595in general for a virtualized system; hardware control software can't be
596adequately virtualized without a full real-time operating system, which would
597require an RT aware virtualization platform.
598
599The second issue may cause performance problems, but this is unlikely to be a
600significant issue. In many cases these delays may be eliminated through
601configuration or paravirtualization.
602
6034.8) Covert channels and leaks
604
605In addition to the above problems, time information will inevitably leak to the
606guest about the host in anything but a perfect implementation of virtualized
607time. This may allow the guest to infer the presence of a hypervisor (as in a
608red-pill type detection), and it may allow information to leak between guests
609by using CPU utilization itself as a signalling channel. Preventing such
610problems would require completely isolated virtual time which may not track
611real time any longer. This may be useful in certain security or QA contexts,
612but in general isn't recommended for real-world deployment scenarios.
diff --git a/Documentation/virtual/lguest/.gitignore b/Documentation/virtual/lguest/.gitignore
new file mode 100644
index 000000000000..115587fd5f65
--- /dev/null
+++ b/Documentation/virtual/lguest/.gitignore
@@ -0,0 +1 @@
lguest
diff --git a/Documentation/virtual/lguest/Makefile b/Documentation/virtual/lguest/Makefile
new file mode 100644
index 000000000000..bebac6b4f332
--- /dev/null
+++ b/Documentation/virtual/lguest/Makefile
@@ -0,0 +1,8 @@
1# This creates the demonstration utility "lguest" which runs a Linux guest.
2# Missing headers? Add "-I../../include -I../../arch/x86/include"
3CFLAGS:=-m32 -Wall -Wmissing-declarations -Wmissing-prototypes -O3 -U_FORTIFY_SOURCE
4
5all: lguest
6
7clean:
8 rm -f lguest
diff --git a/Documentation/virtual/lguest/extract b/Documentation/virtual/lguest/extract
new file mode 100644
index 000000000000..7730bb6e4b94
--- /dev/null
+++ b/Documentation/virtual/lguest/extract
@@ -0,0 +1,58 @@
1#! /bin/sh
2
3set -e
4
5PREFIX=$1
6shift
7
8trap 'rm -r $TMPDIR' 0
9TMPDIR=`mktemp -d`
10
11exec 3>/dev/null
12for f; do
13 while IFS="
14" read -r LINE; do
15 case "$LINE" in
16 *$PREFIX:[0-9]*:\**)
17 NUM=`echo "$LINE" | sed "s/.*$PREFIX:\([0-9]*\).*/\1/"`
18 if [ -f $TMPDIR/$NUM ]; then
19 echo "$TMPDIR/$NUM already exists prior to $f"
20 exit 1
21 fi
22 exec 3>>$TMPDIR/$NUM
23 echo $f | sed 's,\.\./,,g' > $TMPDIR/.$NUM
24 /bin/echo "$LINE" | sed -e "s/$PREFIX:[0-9]*//" -e "s/:\*/*/" >&3
25 ;;
26 *$PREFIX:[0-9]*)
27 NUM=`echo "$LINE" | sed "s/.*$PREFIX:\([0-9]*\).*/\1/"`
28 if [ -f $TMPDIR/$NUM ]; then
29 echo "$TMPDIR/$NUM already exists prior to $f"
30 exit 1
31 fi
32 exec 3>>$TMPDIR/$NUM
33 echo $f | sed 's,\.\./,,g' > $TMPDIR/.$NUM
34 /bin/echo "$LINE" | sed "s/$PREFIX:[0-9]*//" >&3
35 ;;
36 *:\**)
37 /bin/echo "$LINE" | sed -e "s/:\*/*/" -e "s,/\*\*/,," >&3
38 echo >&3
39 exec 3>/dev/null
40 ;;
41 *)
42 /bin/echo "$LINE" >&3
43 ;;
44 esac
45 done < $f
46 echo >&3
47 exec 3>/dev/null
48done
49
50LASTFILE=""
51for f in $TMPDIR/*; do
52 if [ "$LASTFILE" != $(cat $TMPDIR/.$(basename $f) ) ]; then
53 LASTFILE=$(cat $TMPDIR/.$(basename $f) )
54 echo "[ $LASTFILE ]"
55 fi
56 cat $f
57done
58
diff --git a/Documentation/virtual/lguest/lguest.c b/Documentation/virtual/lguest/lguest.c
new file mode 100644
index 000000000000..d9da7e148538
--- /dev/null
+++ b/Documentation/virtual/lguest/lguest.c
@@ -0,0 +1,2095 @@
1/*P:100
2 * This is the Launcher code, a simple program which lays out the "physical"
3 * memory for the new Guest by mapping the kernel image and the virtual
4 * devices, then opens /dev/lguest to tell the kernel about the Guest and
5 * control it.
6:*/
7#define _LARGEFILE64_SOURCE
8#define _GNU_SOURCE
9#include <stdio.h>
10#include <string.h>
11#include <unistd.h>
12#include <err.h>
13#include <stdint.h>
14#include <stdlib.h>
15#include <elf.h>
16#include <sys/mman.h>
17#include <sys/param.h>
18#include <sys/types.h>
19#include <sys/stat.h>
20#include <sys/wait.h>
21#include <sys/eventfd.h>
22#include <fcntl.h>
23#include <stdbool.h>
24#include <errno.h>
25#include <ctype.h>
26#include <sys/socket.h>
27#include <sys/ioctl.h>
28#include <sys/time.h>
29#include <time.h>
30#include <netinet/in.h>
31#include <net/if.h>
32#include <linux/sockios.h>
33#include <linux/if_tun.h>
34#include <sys/uio.h>
35#include <termios.h>
36#include <getopt.h>
37#include <assert.h>
38#include <sched.h>
39#include <limits.h>
40#include <stddef.h>
41#include <signal.h>
42#include <pwd.h>
43#include <grp.h>
44
45#include <linux/virtio_config.h>
46#include <linux/virtio_net.h>
47#include <linux/virtio_blk.h>
48#include <linux/virtio_console.h>
49#include <linux/virtio_rng.h>
50#include <linux/virtio_ring.h>
51#include <asm/bootparam.h>
52#include "../../include/linux/lguest_launcher.h"
53/*L:110
54 * We can ignore the 42 include files we need for this program, but I do want
55 * to draw attention to the use of kernel-style types.
56 *
57 * As Linus said, "C is a Spartan language, and so should your naming be." I
58 * like these abbreviations, so we define them here. Note that u64 is always
59 * unsigned long long, which works on all Linux systems: this means that we can
60 * use %llu in printf for any u64.
61 */
62typedef unsigned long long u64;
63typedef uint32_t u32;
64typedef uint16_t u16;
65typedef uint8_t u8;
66/*:*/
67
68#define PAGE_PRESENT 0x7 /* Present, RW, Execute */
69#define BRIDGE_PFX "bridge:"
70#ifndef SIOCBRADDIF
71#define SIOCBRADDIF 0x89a2 /* add interface to bridge */
72#endif
73/* We can have up to 256 pages for devices. */
74#define DEVICE_PAGES 256
75/* This will occupy 3 pages: it must be a power of 2. */
76#define VIRTQUEUE_NUM 256
77
78/*L:120
79 * verbose is both a global flag and a macro. The C preprocessor allows
80 * this, and although I wouldn't recommend it, it works quite nicely here.
81 */
82static bool verbose;
83#define verbose(args...) \
84 do { if (verbose) printf(args); } while(0)
85/*:*/
86
87/* The pointer to the start of guest memory. */
88static void *guest_base;
89/* The maximum guest physical address allowed, and maximum possible. */
90static unsigned long guest_limit, guest_max;
91/* The /dev/lguest file descriptor. */
92static int lguest_fd;
93
94/* a per-cpu variable indicating whose vcpu is currently running */
95static unsigned int __thread cpu_id;
96
97/* This is our list of devices. */
98struct device_list {
99 /* Counter to assign interrupt numbers. */
100 unsigned int next_irq;
101
102 /* Counter to print out convenient device numbers. */
103 unsigned int device_num;
104
105 /* The descriptor page for the devices. */
106 u8 *descpage;
107
108 /* A single linked list of devices. */
109 struct device *dev;
110 /* And a pointer to the last device for easy append. */
111 struct device *lastdev;
112};
113
114/* The list of Guest devices, based on command line arguments. */
115static struct device_list devices;
116
117/* The device structure describes a single device. */
118struct device {
119 /* The linked-list pointer. */
120 struct device *next;
121
122 /* The device's descriptor, as mapped into the Guest. */
123 struct lguest_device_desc *desc;
124
125 /* We can't trust desc values once Guest has booted: we use these. */
126 unsigned int feature_len;
127 unsigned int num_vq;
128
129 /* The name of this device, for --verbose. */
130 const char *name;
131
132 /* Any queues attached to this device */
133 struct virtqueue *vq;
134
135 /* Is it operational */
136 bool running;
137
138 /* Does Guest want an interrupt on empty? */
139 bool irq_on_empty;
140
141 /* Device-specific data. */
142 void *priv;
143};
144
145/* The virtqueue structure describes a queue attached to a device. */
146struct virtqueue {
147 struct virtqueue *next;
148
149 /* Which device owns me. */
150 struct device *dev;
151
152 /* The configuration for this queue. */
153 struct lguest_vqconfig config;
154
155 /* The actual ring of buffers. */
156 struct vring vring;
157
158 /* Last available index we saw. */
159 u16 last_avail_idx;
160
161 /* How many are used since we sent last irq? */
162 unsigned int pending_used;
163
164 /* Eventfd where Guest notifications arrive. */
165 int eventfd;
166
167 /* Function for the thread which is servicing this virtqueue. */
168 void (*service)(struct virtqueue *vq);
169 pid_t thread;
170};
171
172/* Remember the arguments to the program so we can "reboot" */
173static char **main_args;
174
175/* The original tty settings to restore on exit. */
176static struct termios orig_term;
177
178/*
179 * We have to be careful with barriers: our devices are all run in separate
180 * threads and so we need to make sure that changes visible to the Guest happen
181 * in precise order.
182 */
183#define wmb() __asm__ __volatile__("" : : : "memory")
184#define mb() __asm__ __volatile__("" : : : "memory")
185
186/*
187 * Convert an iovec element to the given type.
188 *
189 * This is a fairly ugly trick: we need to know the size of the type and
190 * alignment requirement to check the pointer is kosher. It's also nice to
191 * have the name of the type in case we report failure.
192 *
193 * Typing those three things all the time is cumbersome and error prone, so we
194 * have a macro which sets them all up and passes to the real function.
195 */
196#define convert(iov, type) \
197 ((type *)_convert((iov), sizeof(type), __alignof__(type), #type))
198
199static void *_convert(struct iovec *iov, size_t size, size_t align,
200 const char *name)
201{
202 if (iov->iov_len != size)
203 errx(1, "Bad iovec size %zu for %s", iov->iov_len, name);
204 if ((unsigned long)iov->iov_base % align != 0)
205 errx(1, "Bad alignment %p for %s", iov->iov_base, name);
206 return iov->iov_base;
207}
208
209/* Wrapper for the last available index. Makes it easier to change. */
210#define lg_last_avail(vq) ((vq)->last_avail_idx)
211
212/*
213 * The virtio configuration space is defined to be little-endian. x86 is
214 * little-endian too, but it's nice to be explicit so we have these helpers.
215 */
216#define cpu_to_le16(v16) (v16)
217#define cpu_to_le32(v32) (v32)
218#define cpu_to_le64(v64) (v64)
219#define le16_to_cpu(v16) (v16)
220#define le32_to_cpu(v32) (v32)
221#define le64_to_cpu(v64) (v64)
222
223/* Is this iovec empty? */
224static bool iov_empty(const struct iovec iov[], unsigned int num_iov)
225{
226 unsigned int i;
227
228 for (i = 0; i < num_iov; i++)
229 if (iov[i].iov_len)
230 return false;
231 return true;
232}
233
234/* Take len bytes from the front of this iovec. */
235static void iov_consume(struct iovec iov[], unsigned num_iov, unsigned len)
236{
237 unsigned int i;
238
239 for (i = 0; i < num_iov; i++) {
240 unsigned int used;
241
242 used = iov[i].iov_len < len ? iov[i].iov_len : len;
243 iov[i].iov_base += used;
244 iov[i].iov_len -= used;
245 len -= used;
246 }
247 assert(len == 0);
248}
249
250/* The device virtqueue descriptors are followed by feature bitmasks. */
251static u8 *get_feature_bits(struct device *dev)
252{
253 return (u8 *)(dev->desc + 1)
254 + dev->num_vq * sizeof(struct lguest_vqconfig);
255}
256
257/*L:100
258 * The Launcher code itself takes us out into userspace, that scary place where
259 * pointers run wild and free! Unfortunately, like most userspace programs,
260 * it's quite boring (which is why everyone likes to hack on the kernel!).
261 * Perhaps if you make up an Lguest Drinking Game at this point, it will get
262 * you through this section. Or, maybe not.
263 *
264 * The Launcher sets up a big chunk of memory to be the Guest's "physical"
265 * memory and stores it in "guest_base". In other words, Guest physical ==
266 * Launcher virtual with an offset.
267 *
268 * This can be tough to get your head around, but usually it just means that we
269 * use these trivial conversion functions when the Guest gives us its
270 * "physical" addresses:
271 */
272static void *from_guest_phys(unsigned long addr)
273{
274 return guest_base + addr;
275}
276
277static unsigned long to_guest_phys(const void *addr)
278{
279 return (addr - guest_base);
280}
281
282/*L:130
283 * Loading the Kernel.
284 *
285 * We start with a couple of simple helper routines. open_or_die() avoids
286 * error-checking code cluttering the callers:
287 */
288static int open_or_die(const char *name, int flags)
289{
290 int fd = open(name, flags);
291 if (fd < 0)
292 err(1, "Failed to open %s", name);
293 return fd;
294}
295
296/* map_zeroed_pages() takes a number of pages. */
297static void *map_zeroed_pages(unsigned int num)
298{
299 int fd = open_or_die("/dev/zero", O_RDONLY);
300 void *addr;
301
302 /*
303 * We use a private mapping (ie. if we write to the page, it will be
304 * copied). We allocate an extra two pages PROT_NONE to act as guard
305 * pages against read/write attempts that exceed allocated space.
306 */
307 addr = mmap(NULL, getpagesize() * (num+2),
308 PROT_NONE, MAP_PRIVATE, fd, 0);
309
310 if (addr == MAP_FAILED)
311 err(1, "Mmapping %u pages of /dev/zero", num);
312
313 if (mprotect(addr + getpagesize(), getpagesize() * num,
314 PROT_READ|PROT_WRITE) == -1)
315 err(1, "mprotect rw %u pages failed", num);
316
317 /*
318 * One neat mmap feature is that you can close the fd, and it
319 * stays mapped.
320 */
321 close(fd);
322
323 /* Return address after PROT_NONE page */
324 return addr + getpagesize();
325}
326
327/* Get some more pages for a device. */
328static void *get_pages(unsigned int num)
329{
330 void *addr = from_guest_phys(guest_limit);
331
332 guest_limit += num * getpagesize();
333 if (guest_limit > guest_max)
334 errx(1, "Not enough memory for devices");
335 return addr;
336}
337
338/*
339 * This routine is used to load the kernel or initrd. It tries mmap, but if
340 * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries),
341 * it falls back to reading the memory in.
342 */
343static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
344{
345 ssize_t r;
346
347 /*
348 * We map writable even though some segments are marked read-only.
349 * The kernel really wants to be writable: it patches its own
350 * instructions.
351 *
352 * MAP_PRIVATE means that the page won't be copied until a write is
353 * done to it. This allows us to share untouched memory between
354 * Guests.
355 */
356 if (mmap(addr, len, PROT_READ|PROT_WRITE,
357 MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED)
358 return;
359
360 /* pread does a seek and a read in one shot: saves a few lines. */
361 r = pread(fd, addr, len, offset);
362 if (r != len)
363 err(1, "Reading offset %lu len %lu gave %zi", offset, len, r);
364}
365
366/*
367 * This routine takes an open vmlinux image, which is in ELF, and maps it into
368 * the Guest memory. ELF = Executable and Linkable Format, which is the format used
369 * by all modern binaries on Linux including the kernel.
370 *
371 * The ELF headers give *two* addresses: a physical address, and a virtual
372 * address. We use the physical address; the Guest will map itself to the
373 * virtual address.
374 *
375 * We return the starting address.
376 */
377static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
378{
379 Elf32_Phdr phdr[ehdr->e_phnum];
380 unsigned int i;
381
382 /*
383 * Sanity checks on the main ELF header: an x86 executable with a
384 * reasonable number of correctly-sized program headers.
385 */
386 if (ehdr->e_type != ET_EXEC
387 || ehdr->e_machine != EM_386
388 || ehdr->e_phentsize != sizeof(Elf32_Phdr)
389 || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr))
390 errx(1, "Malformed elf header");
391
392 /*
393 * An ELF executable contains an ELF header and a number of "program"
394 * headers which indicate which parts ("segments") of the program to
395 * load where.
396 */
397
398 /* We read in all the program headers at once: */
399 if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
400 err(1, "Seeking to program headers");
401 if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
402 err(1, "Reading program headers");
403
404 /*
405 * Try all the headers: there are usually only three. A read-only one,
406 * a read-write one, and a "note" section which we don't load.
407 */
408 for (i = 0; i < ehdr->e_phnum; i++) {
409 /* If this isn't a loadable segment, we ignore it */
410 if (phdr[i].p_type != PT_LOAD)
411 continue;
412
413 verbose("Section %i: size %i addr %p\n",
414 i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
415
416 /* We map this section of the file at its physical address. */
417 map_at(elf_fd, from_guest_phys(phdr[i].p_paddr),
418 phdr[i].p_offset, phdr[i].p_filesz);
419 }
420
421 /* The entry point is given in the ELF header. */
422 return ehdr->e_entry;
423}
424
425/*L:150
426 * A bzImage, unlike an ELF file, is not meant to be loaded. You're supposed
427 * to jump into it and it will unpack itself. We used to have to perform some
428 * hairy magic because the unpacking code scared me.
429 *
430 * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote
431 * a small patch to jump over the tricky bits in the Guest, so now we just read
432 * the funky header so we know where in the file to load, and away we go!
433 */
434static unsigned long load_bzimage(int fd)
435{
436 struct boot_params boot;
437 int r;
438 /* Modern bzImages get loaded at 1M. */
439 void *p = from_guest_phys(0x100000);
440
441 /*
442 * Go back to the start of the file and read the header. It should be
443 * a Linux boot header (see Documentation/x86/i386/boot.txt)
444 */
445 lseek(fd, 0, SEEK_SET);
446 read(fd, &boot, sizeof(boot));
447
448 /* Inside the setup_hdr, we expect the magic "HdrS" */
449 if (memcmp(&boot.hdr.header, "HdrS", 4) != 0)
450 errx(1, "This doesn't look like a bzImage to me");
451
452 /* Skip over the extra sectors of the header. */
453 lseek(fd, (boot.hdr.setup_sects+1) * 512, SEEK_SET);
454
455 /* Now read everything into memory, in nice big chunks. */
456 while ((r = read(fd, p, 65536)) > 0)
457 p += r;
458
459 /* Finally, code32_start tells us where to enter the kernel. */
460 return boot.hdr.code32_start;
461}
462
463/*L:140
464 * Loading the kernel is easy when it's a "vmlinux", but most kernels
465 * come wrapped up in the self-decompressing "bzImage" format. With a little
466 * work, we can load those, too.
467 */
468static unsigned long load_kernel(int fd)
469{
470 Elf32_Ehdr hdr;
471
472 /* Read in the first few bytes. */
473 if (read(fd, &hdr, sizeof(hdr)) != sizeof(hdr))
474 err(1, "Reading kernel");
475
476 /* If it's an ELF file, it starts with "\177ELF" */
477 if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0)
478 return map_elf(fd, &hdr);
479
480 /* Otherwise we assume it's a bzImage, and try to load it. */
481 return load_bzimage(fd);
482}
483
484/*
485 * This is a trivial little helper to align pages. Andi Kleen hated it because
486 * it calls getpagesize() twice: "it's dumb code."
487 *
488 * Kernel guys get really het up about optimization, even when it's not
489 * necessary. I leave this code as a reaction against that.
490 */
491static inline unsigned long page_align(unsigned long addr)
492{
493 /* Add upwards and truncate downwards. */
494 return ((addr + getpagesize()-1) & ~(getpagesize()-1));
495}
496
497/*L:180
498 * An "initial ram disk" is a disk image loaded into memory along with the
499 * kernel which the kernel can use to boot from without needing any drivers.
500 * Most distributions now use this as standard: the initrd contains the code to
501 * load the appropriate driver modules for the current machine.
502 *
503 * Importantly, James Morris works for RedHat, and Fedora uses initrds for its
504 * kernels. He sent me this (and tells me when I break it).
505 */
506static unsigned long load_initrd(const char *name, unsigned long mem)
507{
508 int ifd;
509 struct stat st;
510 unsigned long len;
511
512 ifd = open_or_die(name, O_RDONLY);
513 /* fstat() is needed to get the file size. */
514 if (fstat(ifd, &st) < 0)
515 err(1, "fstat() on initrd '%s'", name);
516
517 /*
518 * We map the initrd at the top of memory, but mmap wants it to be
519 * page-aligned, so we round the size up for that.
520 */
521 len = page_align(st.st_size);
522 map_at(ifd, from_guest_phys(mem - len), 0, st.st_size);
523 /*
524 * Once a file is mapped, you can close the file descriptor. It's a
525 * little odd, but quite useful.
526 */
527 close(ifd);
528 verbose("mapped initrd %s size=%lu @ %p\n", name, len, (void*)mem-len);
529
530 /* We return the initrd size. */
531 return len;
532}
533/*:*/
534
535/*
536 * Simple routine to roll all the commandline arguments together with spaces
537 * between them.
538 */
539static void concat(char *dst, char *args[])
540{
541 unsigned int i, len = 0;
542
543 for (i = 0; args[i]; i++) {
544 if (i) {
545 strcat(dst+len, " ");
546 len++;
547 }
548 strcpy(dst+len, args[i]);
549 len += strlen(args[i]);
550 }
551 /* In case it's empty. */
552 dst[len] = '\0';
553}
554
555/*L:185
556 * This is where we actually tell the kernel to initialize the Guest. We
557 * saw the arguments it expects when we looked at initialize() in lguest_user.c:
558 * the base of Guest "physical" memory, the top physical page to allow and the
559 * entry point for the Guest.
560 */
561static void tell_kernel(unsigned long start)
562{
563 unsigned long args[] = { LHREQ_INITIALIZE,
564 (unsigned long)guest_base,
565 guest_limit / getpagesize(), start };
566 verbose("Guest: %p - %p (%#lx)\n",
567 guest_base, guest_base + guest_limit, guest_limit);
568 lguest_fd = open_or_die("/dev/lguest", O_RDWR);
569 if (write(lguest_fd, args, sizeof(args)) < 0)
570 err(1, "Writing to /dev/lguest");
571}
572/*:*/
573
574/*L:200
575 * Device Handling.
576 *
577 * When the Guest gives us a buffer, it sends an array of addresses and sizes.
578 * We need to make sure it's not trying to reach into the Launcher itself, so
579 * we have a convenient routine which checks it and exits with an error message
580 * if something funny is going on:
581 */
582static void *_check_pointer(unsigned long addr, unsigned int size,
583 unsigned int line)
584{
585 /*
586 * Check if the requested address and size exceed the allocated memory,
587 * or addr + size wraps around.
588 */
589 if ((addr + size) > guest_limit || (addr + size) < addr)
590 errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr);
591 /*
592 * We return a pointer for the caller's convenience, now we know it's
593 * safe to use.
594 */
595 return from_guest_phys(addr);
596}
597/* A macro which transparently hands the line number to the real function. */
598#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
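
/*
 * The bounds test can also be written so that no addition ever wraps, rather
 * than detecting the wrap afterwards. This is only a sketch: range_ok() and
 * its "limit" parameter are hypothetical stand-ins for the check against
 * guest_limit, and it returns a flag instead of exiting.
 */

```c
#include <stdbool.h>

/*
 * Hypothetical helper: true if [addr, addr + size) lies inside [0, limit).
 * The subtraction cannot underflow because addr <= limit is tested first.
 */
bool range_ok(unsigned long addr, unsigned long size, unsigned long limit)
{
	return addr <= limit && size <= limit - addr;
}
```

/*
 * A huge addr such as ~0UL fails the first test, so the "addr + size wraps"
 * case never needs a separate comparison.
 */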
599
600/*
601 * Each buffer in the virtqueues is actually a chain of descriptors. This
602 * function returns the next descriptor in the chain, or vq->vring.num if we're
603 * at the end.
604 */
605static unsigned next_desc(struct vring_desc *desc,
606 unsigned int i, unsigned int max)
607{
608 unsigned int next;
609
610 /* If this descriptor says it doesn't chain, we're done. */
611 if (!(desc[i].flags & VRING_DESC_F_NEXT))
612 return max;
613
614 /* Check they're not leading us off end of descriptors. */
615 next = desc[i].next;
616 /* Make sure compiler knows to grab that: we don't want it changing! */
617 wmb();
618
619 if (next >= max)
620 errx(1, "Desc next is %u", next);
621
622 return next;
623}
624
625/*
626 * This actually sends the interrupt for this virtqueue, if we've used a
627 * buffer.
628 */
629static void trigger_irq(struct virtqueue *vq)
630{
631 unsigned long buf[] = { LHREQ_IRQ, vq->config.irq };
632
633 /* Don't inform them if nothing used. */
634 if (!vq->pending_used)
635 return;
636 vq->pending_used = 0;
637
638 /* If they don't want an interrupt, don't send one... */
639 if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) {
640 /* ... unless they've asked us to force one on empty. */
641 if (!vq->dev->irq_on_empty
642 || lg_last_avail(vq) != vq->vring.avail->idx)
643 return;
644 }
645
646 /* Send the Guest an interrupt to tell them we used something up. */
647 if (write(lguest_fd, buf, sizeof(buf)) != 0)
648 err(1, "Triggering irq %i", vq->config.irq);
649}
650
651/*
652 * This looks in the virtqueue for the first available buffer, and converts
653 * it to an iovec for convenient access. Since descriptors consist of some
654 * number of output then some number of input descriptors, it's actually two
655 * iovecs, but we pack them into one and note how many of each there were.
656 *
657 * This function waits if necessary, and returns the descriptor number found.
658 */
659static unsigned wait_for_vq_desc(struct virtqueue *vq,
660 struct iovec iov[],
661 unsigned int *out_num, unsigned int *in_num)
662{
663 unsigned int i, head, max;
664 struct vring_desc *desc;
665 u16 last_avail = lg_last_avail(vq);
666
667 /* There's nothing available? */
668 while (last_avail == vq->vring.avail->idx) {
669 u64 event;
670
671 /*
672 * Since we're about to sleep, now is a good time to tell the
673 * Guest about what we've used up to now.
674 */
675 trigger_irq(vq);
676
677 /* OK, now we need to know about added descriptors. */
678 vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
679
680 /*
681 * They could have slipped one in as we were doing that: make
682 * sure it's written, then check again.
683 */
684 mb();
685 if (last_avail != vq->vring.avail->idx) {
686 vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
687 break;
688 }
689
690 /* Nothing new? Wait for eventfd to tell us they refilled. */
691 if (read(vq->eventfd, &event, sizeof(event)) != sizeof(event))
692 errx(1, "Event read failed?");
693
694 /* We don't need to be notified again. */
695 vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
696 }
697
698 /* Check it isn't doing very strange things with descriptor numbers. */
699 if ((u16)(vq->vring.avail->idx - last_avail) > vq->vring.num)
700 errx(1, "Guest moved avail index from %u to %u",
701 last_avail, vq->vring.avail->idx);
702
703 /*
704 * Grab the next descriptor number they're advertising, and increment
705 * the index we've seen.
706 */
707 head = vq->vring.avail->ring[last_avail % vq->vring.num];
708 lg_last_avail(vq)++;
709
710 /* If their number is silly, that's a fatal mistake. */
711 if (head >= vq->vring.num)
712 errx(1, "Guest says index %u is available", head);
713
714 /* When we start there are neither input nor output descriptors. */
715 *out_num = *in_num = 0;
716
717 max = vq->vring.num;
718 desc = vq->vring.desc;
719 i = head;
720
721 /*
722 * If this is an indirect entry, then this buffer contains a descriptor
723 * table which we handle as if it's any normal descriptor chain.
724 */
725 if (desc[i].flags & VRING_DESC_F_INDIRECT) {
726 if (desc[i].len % sizeof(struct vring_desc))
727 errx(1, "Invalid size for indirect buffer table");
728
729 max = desc[i].len / sizeof(struct vring_desc);
730 desc = check_pointer(desc[i].addr, desc[i].len);
731 i = 0;
732 }
733
734 do {
735 /* Grab the first descriptor, and check it's OK. */
736 iov[*out_num + *in_num].iov_len = desc[i].len;
737 iov[*out_num + *in_num].iov_base
738 = check_pointer(desc[i].addr, desc[i].len);
739 /* If this is an input descriptor, increment that count. */
740 if (desc[i].flags & VRING_DESC_F_WRITE)
741 (*in_num)++;
742 else {
743 /*
744 * If it's an output descriptor, they're all supposed
745 * to come before any input descriptors.
746 */
747 if (*in_num)
748 errx(1, "Descriptor has out after in");
749 (*out_num)++;
750 }
751
752 /* If we've got too many, that implies a descriptor loop. */
753 if (*out_num + *in_num > max)
754 errx(1, "Looped descriptor");
755 } while ((i = next_desc(desc, i, max)) != max);
756
757 return head;
758}
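
/*
 * The chain walk is easier to see with the I/O stripped away. The sketch
 * below uses a simplified stand-in for struct vring_desc (no addr or len) and
 * returns -1 where the Launcher would call errx(). Every sk_ name is
 * hypothetical; only the flag meanings are taken from the code above.
 */

```c
#include <stdint.h>

#define SK_F_NEXT  1	/* plays the role of VRING_DESC_F_NEXT */
#define SK_F_WRITE 2	/* plays the role of VRING_DESC_F_WRITE */

/* Simplified descriptor: just the chaining information. */
struct sk_desc {
	uint16_t flags;
	uint16_t next;
};

/*
 * Walk the chain starting at head, counting output descriptors, then input
 * (writable) ones. Returns 0 on success, or -1 for a bad chain: output after
 * input, a next index off the table, or a loop.
 */
int sk_walk(const struct sk_desc *desc, unsigned int max, unsigned int head,
	    unsigned int *out_num, unsigned int *in_num)
{
	unsigned int i = head;

	*out_num = *in_num = 0;
	if (head >= max)
		return -1;

	for (;;) {
		if (desc[i].flags & SK_F_WRITE)
			(*in_num)++;
		else if (*in_num)
			return -1;	/* output after input */
		else
			(*out_num)++;

		if (!(desc[i].flags & SK_F_NEXT))
			return 0;	/* end of chain */
		if (desc[i].next >= max)
			return -1;	/* leads off the table */
		/* More links than descriptors implies a loop. */
		if (*out_num + *in_num >= max)
			return -1;
		i = desc[i].next;
	}
}
```

/*
 * A chain of one output followed by two inputs yields out_num == 1 and
 * in_num == 2; two descriptors pointing at each other are rejected as a loop.
 */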
759
760/*
761 * After we've used one of their buffers, we tell the Guest about it. Sometime
762 * later we'll want to send them an interrupt using trigger_irq(); note that
763 * wait_for_vq_desc() does that for us if it has to wait.
764 */
765static void add_used(struct virtqueue *vq, unsigned int head, int len)
766{
767 struct vring_used_elem *used;
768
769 /*
770 * The virtqueue contains a ring of used buffers. Get a pointer to the
771 * next entry in that used ring.
772 */
773 used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num];
774 used->id = head;
775 used->len = len;
776 /* Make sure buffer is written before we update index. */
777 wmb();
778 vq->vring.used->idx++;
779 vq->pending_used++;
780}
781
782/* And here's the combo meal deal. Supersize me! */
783static void add_used_and_trigger(struct virtqueue *vq, unsigned head, int len)
784{
785 add_used(vq, head, len);
786 trigger_irq(vq);
787}
788
789/*
790 * The Console
791 *
792 * We associate some data with the console for our exit hack.
793 */
794struct console_abort {
795 /* How many times have they hit ^C? */
796 int count;
797 /* When did they start? */
798 struct timeval start;
799};
800
801/* This is the routine which handles console input (ie. stdin). */
802static void console_input(struct virtqueue *vq)
803{
804 int len;
805 unsigned int head, in_num, out_num;
806 struct console_abort *abort = vq->dev->priv;
807 struct iovec iov[vq->vring.num];
808
809 /* Make sure there's a descriptor available. */
810 head = wait_for_vq_desc(vq, iov, &out_num, &in_num);
811 if (out_num)
812 errx(1, "Output buffers in console in queue?");
813
814 /* Read into it. This is where we usually wait. */
815 len = readv(STDIN_FILENO, iov, in_num);
816 if (len <= 0) {
817 /* Ran out of input? */
818 warnx("Failed to get console input, ignoring console.");
819 /*
820 * For simplicity, dying threads kill the whole Launcher. So
821 * just nap here.
822 */
823 for (;;)
824 pause();
825 }
826
827 /* Tell the Guest we used a buffer. */
828 add_used_and_trigger(vq, head, len);
829
830 /*
831 * Three ^C within one second? Exit.
832 *
833 * This is such a hack, but works surprisingly well. Each ^C has to
834 * be in a buffer by itself, so they can't be too fast. But we check
835 * that we get three within about a second, so they can't be too
836 * slow.
837 */
838 if (len != 1 || ((char *)iov[0].iov_base)[0] != 3) {
839 abort->count = 0;
840 return;
841 }
842
843 abort->count++;
844 if (abort->count == 1)
845 gettimeofday(&abort->start, NULL);
846 else if (abort->count == 3) {
847 struct timeval now;
848 gettimeofday(&now, NULL);
849 /* Kill all Launcher processes with SIGINT, like normal ^C */
850 if (now.tv_sec <= abort->start.tv_sec+1)
851 kill(0, SIGINT);
852 abort->count = 0;
853 }
854}
855
856/* This is the routine which handles console output (ie. stdout). */
857static void console_output(struct virtqueue *vq)
858{
859 unsigned int head, out, in;
860 struct iovec iov[vq->vring.num];
861
862 /* We usually wait in here, for the Guest to give us something. */
863 head = wait_for_vq_desc(vq, iov, &out, &in);
864 if (in)
865 errx(1, "Input buffers in console output queue?");
866
867 /* writev can return a partial write, so we loop here. */
868 while (!iov_empty(iov, out)) {
869 int len = writev(STDOUT_FILENO, iov, out);
870 if (len <= 0)
871 err(1, "Write to stdout gave %i", len);
872 iov_consume(iov, out, len);
873 }
874
875 /*
876 * We're finished with that buffer: if we're going to sleep,
877 * wait_for_vq_desc() will prod the Guest with an interrupt.
878 */
879 add_used(vq, head, 0);
880}
881
882/*
883 * The Network
884 *
885 * Handling output for network is also simple: we get all the output buffers
886 * and write them to /dev/net/tun.
887 */
888struct net_info {
889 int tunfd;
890};
891
892static void net_output(struct virtqueue *vq)
893{
894 struct net_info *net_info = vq->dev->priv;
895 unsigned int head, out, in;
896 struct iovec iov[vq->vring.num];
897
898 /* We usually wait in here for the Guest to give us a packet. */
899 head = wait_for_vq_desc(vq, iov, &out, &in);
900 if (in)
901 errx(1, "Input buffers in net output queue?");
902 /*
903 * Send the whole thing through to /dev/net/tun. It expects the exact
904 * same format: what a coincidence!
905 */
906 if (writev(net_info->tunfd, iov, out) < 0)
907 errx(1, "Write to tun failed?");
908
909 /*
910 * Done with that one; wait_for_vq_desc() will send the interrupt if
911 * all packets are processed.
912 */
913 add_used(vq, head, 0);
914}
915
916/*
917 * Handling network input is a bit trickier, because I've tried to optimize it.
918 *
919 * First we have a helper routine which tells us if reading from this file
920 * descriptor (ie. the /dev/net/tun device) will block:
921 */
922static bool will_block(int fd)
923{
924 fd_set fdset;
925 struct timeval zero = { 0, 0 };
926 FD_ZERO(&fdset);
927 FD_SET(fd, &fdset);
928 return select(fd+1, &fdset, NULL, NULL, &zero) != 1;
929}
930
931/*
932 * This handles packets coming in from the tun device to our Guest. Like all
933 * service routines, it gets called again as soon as it returns, so you don't
934 * see a while(1) loop here.
935 */
936static void net_input(struct virtqueue *vq)
937{
938 int len;
939 unsigned int head, out, in;
940 struct iovec iov[vq->vring.num];
941 struct net_info *net_info = vq->dev->priv;
942
943 /*
944 * Get a descriptor to write an incoming packet into. This will also
945 * send an interrupt if they're out of descriptors.
946 */
947 head = wait_for_vq_desc(vq, iov, &out, &in);
948 if (out)
949 errx(1, "Output buffers in net input queue?");
950
951 /*
952 * If it looks like we'll block reading from the tun device, send them
953 * an interrupt.
954 */
955 if (vq->pending_used && will_block(net_info->tunfd))
956 trigger_irq(vq);
957
958 /*
959 * Read in the packet. This is where we normally wait (when there's no
960 * incoming network traffic).
961 */
962 len = readv(net_info->tunfd, iov, in);
963 if (len <= 0)
964 err(1, "Failed to read from tun.");
965
966 /*
967 * Mark that packet buffer as used, but don't interrupt here. We want
968 * to wait until we've done as much work as we can.
969 */
970 add_used(vq, head, len);
971}
972/*:*/
973
974/* This is the helper to create threads: run the service routine in a loop. */
975static int do_thread(void *_vq)
976{
977 struct virtqueue *vq = _vq;
978
979 for (;;)
980 vq->service(vq);
981 return 0;
982}
983
984/*
985 * When a child dies, we kill our entire process group with SIGTERM. This
986 * also has the side effect that the shell restores the console for us!
987 */
988static void kill_launcher(int signal)
989{
990 kill(0, SIGTERM);
991}
992
993static void reset_device(struct device *dev)
994{
995 struct virtqueue *vq;
996
997 verbose("Resetting device %s\n", dev->name);
998
999 /* Clear any features they've acked. */
1000 memset(get_feature_bits(dev) + dev->feature_len, 0, dev->feature_len);
1001
1002 /* We're going to be explicitly killing threads, so ignore them. */
1003 signal(SIGCHLD, SIG_IGN);
1004
1005 /* Zero out the virtqueues, get rid of their threads */
1006 for (vq = dev->vq; vq; vq = vq->next) {
1007 if (vq->thread != (pid_t)-1) {
1008 kill(vq->thread, SIGTERM);
1009 waitpid(vq->thread, NULL, 0);
1010 vq->thread = (pid_t)-1;
1011 }
1012 memset(vq->vring.desc, 0,
1013 vring_size(vq->config.num, LGUEST_VRING_ALIGN));
1014 lg_last_avail(vq) = 0;
1015 }
1016 dev->running = false;
1017
1018 /* Now we care if threads die. */
1019 signal(SIGCHLD, (void *)kill_launcher);
1020}
1021
1022/*L:216
1023 * This actually creates the thread which services the virtqueue for a device.
1024 */
1025static void create_thread(struct virtqueue *vq)
1026{
1027 /*
1028 * Create stack for thread. Since the stack grows downwards, we point
1029 * the stack pointer to the end of this region.
1030 */
1031 char *stack = malloc(32768);
1032 unsigned long args[] = { LHREQ_EVENTFD,
1033 vq->config.pfn*getpagesize(), 0 };
1034
1035 /* Create a zero-initialized eventfd. */
1036 vq->eventfd = eventfd(0, 0);
1037 if (vq->eventfd < 0)
1038 err(1, "Creating eventfd");
1039 args[2] = vq->eventfd;
1040
1041 /*
1042 * Attach an eventfd to this virtqueue: it will go off when the Guest
1043 * does an LHCALL_NOTIFY for this vq.
1044 */
1045 if (write(lguest_fd, &args, sizeof(args)) != 0)
1046 err(1, "Attaching eventfd");
1047
1048 /*
1049 * CLONE_VM: because it has to access the Guest memory, and SIGCHLD so
1050 * we get a signal if it dies.
1051 */
1052 vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq);
1053 if (vq->thread == (pid_t)-1)
1054 err(1, "Creating clone");
1055
1056 /* We close our local copy now the child has it. */
1057 close(vq->eventfd);
1058}
1059
1060static bool accepted_feature(struct device *dev, unsigned int bit)
1061{
1062 const u8 *features = get_feature_bits(dev) + dev->feature_len;
1063
1064 if (dev->feature_len <= bit / CHAR_BIT)
1065 return false;
1066 return features[bit / CHAR_BIT] & (1 << (bit % CHAR_BIT));
1067}
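
/*
 * A standalone sketch of the bit test, assuming (as in the code above) that
 * feature bit N lives in byte N / CHAR_BIT of the bitmap. sk_feature_set()
 * and its nbytes parameter are hypothetical; the point is that the byte index
 * must be strictly less than the bitmap length before we read it.
 */

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is "bit" set in a bitmap of nbytes bytes? */
bool sk_feature_set(const uint8_t *bits, unsigned int nbytes, unsigned int bit)
{
	/* A bit beyond the end of the bitmap is simply not set. */
	if (bit / CHAR_BIT >= nbytes)
		return false;
	return bits[bit / CHAR_BIT] & (1 << (bit % CHAR_BIT));
}
```

/*
 * With a four-byte bitmap, bit 24 is the lowest bit of the last byte, and bit
 * 32 is out of range.
 */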
1068
1069static void start_device(struct device *dev)
1070{
1071 unsigned int i;
1072 struct virtqueue *vq;
1073
1074 verbose("Device %s OK: offered", dev->name);
1075 for (i = 0; i < dev->feature_len; i++)
1076 verbose(" %02x", get_feature_bits(dev)[i]);
1077 verbose(", accepted");
1078 for (i = 0; i < dev->feature_len; i++)
1079 verbose(" %02x", get_feature_bits(dev)
1080 [dev->feature_len+i]);
1081
1082 dev->irq_on_empty = accepted_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY);
1083
1084 for (vq = dev->vq; vq; vq = vq->next) {
1085 if (vq->service)
1086 create_thread(vq);
1087 }
1088 dev->running = true;
1089}
1090
1091static void cleanup_devices(void)
1092{
1093 struct device *dev;
1094
1095 for (dev = devices.dev; dev; dev = dev->next)
1096 reset_device(dev);
1097
1098 /* If we saved off the original terminal settings, restore them now. */
1099 if (orig_term.c_lflag & (ISIG|ICANON|ECHO))
1100 tcsetattr(STDIN_FILENO, TCSANOW, &orig_term);
1101}
1102
1103/* When the Guest tells us they updated the status field, we handle it. */
1104static void update_device_status(struct device *dev)
1105{
1106 /* A zero status is a reset, otherwise it's a set of flags. */
1107 if (dev->desc->status == 0)
1108 reset_device(dev);
1109 else if (dev->desc->status & VIRTIO_CONFIG_S_FAILED) {
1110 warnx("Device %s configuration FAILED", dev->name);
1111 if (dev->running)
1112 reset_device(dev);
1113 } else if (dev->desc->status & VIRTIO_CONFIG_S_DRIVER_OK) {
1114 if (!dev->running)
1115 start_device(dev);
1116 }
1117}
1118
1119/*L:215
1120 * This is the generic routine we call when the Guest uses LHCALL_NOTIFY. In
1121 * particular, it's used to notify us of device status changes during boot.
1122 */
1123static void handle_output(unsigned long addr)
1124{
1125 struct device *i;
1126
1127 /* Check each device. */
1128 for (i = devices.dev; i; i = i->next) {
1129 struct virtqueue *vq;
1130
1131 /*
1132 * Notifications to device descriptors mean they updated the
1133 * device status.
1134 */
1135 if (from_guest_phys(addr) == i->desc) {
1136 update_device_status(i);
1137 return;
1138 }
1139
1140 /*
1141 * Devices *can* be used before status is set to DRIVER_OK.
1142 * The original plan was that they would never do this: they
1143 * would always finish setting up their status bits before
1144 * actually touching the virtqueues. In practice, we allowed
1145 * them to, and they do (eg. the disk probes for partition
1146 * tables as part of initialization).
1147 *
1148 * If we see this, we start the device: once it's running, we
1149 * expect the device to catch all the notifications.
1150 */
1151 for (vq = i->vq; vq; vq = vq->next) {
1152 if (addr != vq->config.pfn*getpagesize())
1153 continue;
1154 if (i->running)
1155 errx(1, "Notification on running %s", i->name);
1156 /* This just calls create_thread() for each virtqueue */
1157 start_device(i);
1158 return;
1159 }
1160 }
1161
1162 /*
1163 * Early console write is done using notify on a nul-terminated string
1164 * in Guest memory. It's also great for hacking debugging messages
1165 * into a Guest.
1166 */
1167 if (addr >= guest_limit)
1168 errx(1, "Bad NOTIFY %#lx", addr);
1169
1170 write(STDOUT_FILENO, from_guest_phys(addr),
1171 strnlen(from_guest_phys(addr), guest_limit - addr));
1172}
1173
1174/*L:190
1175 * Device Setup
1176 *
1177 * All devices need a descriptor so the Guest knows it exists, and a "struct
1178 * device" so the Launcher can keep track of it. We have common helper
1179 * routines to allocate and manage them.
1180 */
1181
1182/*
1183 * The layout of the device page is a "struct lguest_device_desc" followed by a
1184 * number of virtqueue descriptors, then two sets of feature bits, then an
1185 * array of configuration bytes. This routine returns the configuration
1186 * pointer.
1187 */
1188static u8 *device_config(const struct device *dev)
1189{
1190 return (void *)(dev->desc + 1)
1191 + dev->num_vq * sizeof(struct lguest_vqconfig)
1192 + dev->feature_len * 2;
1193}
1194
1195/*
1196 * This routine allocates a new "struct lguest_device_desc" from the descriptor
1197 * table page just above the Guest's normal memory. It returns a pointer to
1198 * that descriptor.
1199 */
1200static struct lguest_device_desc *new_dev_desc(u16 type)
1201{
1202 struct lguest_device_desc d = { .type = type };
1203 void *p;
1204
1205 /* Figure out where the next device config is, based on the last one. */
1206 if (devices.lastdev)
1207 p = device_config(devices.lastdev)
1208 + devices.lastdev->desc->config_len;
1209 else
1210 p = devices.descpage;
1211
1212 /* We only have one page for all the descriptors. */
1213 if (p + sizeof(d) > (void *)devices.descpage + getpagesize())
1214 errx(1, "Too many devices");
1215
1216 /* p might not be aligned, so we memcpy in. */
1217 return memcpy(p, &d, sizeof(d));
1218}
1219
1220/*
1221 * Each device descriptor is followed by the description of its virtqueues. We
1222 * specify how many descriptors the virtqueue is to have.
1223 */
1224static void add_virtqueue(struct device *dev, unsigned int num_descs,
1225 void (*service)(struct virtqueue *))
1226{
1227 unsigned int pages;
1228 struct virtqueue **i, *vq = malloc(sizeof(*vq));
1229 void *p;
1230
1231 /* First we need some memory for this virtqueue. */
1232 pages = (vring_size(num_descs, LGUEST_VRING_ALIGN) + getpagesize() - 1)
1233 / getpagesize();
1234 p = get_pages(pages);
1235
1236 /* Initialize the virtqueue */
1237 vq->next = NULL;
1238 vq->last_avail_idx = 0;
1239 vq->dev = dev;
1240
1241 /*
1242 * This is the routine the service thread will run, and its Process ID
1243 * once it's running.
1244 */
1245 vq->service = service;
1246 vq->thread = (pid_t)-1;
1247
1248 /* Initialize the configuration. */
1249 vq->config.num = num_descs;
1250 vq->config.irq = devices.next_irq++;
1251 vq->config.pfn = to_guest_phys(p) / getpagesize();
1252
1253 /* Initialize the vring. */
1254 vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN);
1255
1256 /*
1257 * Append virtqueue to this device's descriptor. We use
1258 * device_config() to get the end of the device's current virtqueues;
1259 * we check that we haven't added any config or feature information
1260 * yet, otherwise we'd be overwriting them.
1261 */
1262 assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0);
1263 memcpy(device_config(dev), &vq->config, sizeof(vq->config));
1264 dev->num_vq++;
1265 dev->desc->num_vq++;
1266
1267 verbose("Virtqueue page %#lx\n", to_guest_phys(p));
1268
1269 /*
1270 * Add to tail of list, so dev->vq is first vq, dev->vq->next is
1271 * second.
1272 */
1273 for (i = &dev->vq; *i; i = &(*i)->next);
1274 *i = vq;
1275}
1276
1277/*
1278 * The first half of the feature bitmask is for us to advertise features. The
1279 * second half is for the Guest to accept features.
1280 */
1281static void add_feature(struct device *dev, unsigned bit)
1282{
1283 u8 *features = get_feature_bits(dev);
1284
1285 /* We can't extend the feature bits once we've added config bytes */
1286 if (dev->desc->feature_len <= bit / CHAR_BIT) {
1287 assert(dev->desc->config_len == 0);
1288 dev->feature_len = dev->desc->feature_len = (bit/CHAR_BIT) + 1;
1289 }
1290
1291 features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT));
1292}
1293
1294/*
1295 * This routine sets the configuration fields for an existing device's
1296 * descriptor. It only works for the last device, but that's OK because that's
1297 * how we use it.
1298 */
1299static void set_config(struct device *dev, unsigned len, const void *conf)
1300{
1301 /* Check we haven't overflowed our single page. */
1302 if (device_config(dev) + len > devices.descpage + getpagesize())
1303 errx(1, "Too many devices");
1304
1305 /* Copy in the config information, and store the length. */
1306 memcpy(device_config(dev), conf, len);
1307 dev->desc->config_len = len;
1308
1309 /* Size must fit in config_len field (8 bits)! */
1310 assert(dev->desc->config_len == len);
1311}
1312
1313/*
1314 * This routine does all the creation and setup of a new device, including
1315 * calling new_dev_desc() to allocate the descriptor and device memory. We
1316 * don't actually start the service threads until later.
1317 *
1318 * See what I mean about userspace being boring?
1319 */
1320static struct device *new_device(const char *name, u16 type)
1321{
1322 struct device *dev = malloc(sizeof(*dev));
1323
1324 /* Now we populate the fields one at a time. */
1325 dev->desc = new_dev_desc(type);
1326 dev->name = name;
1327 dev->vq = NULL;
1328 dev->feature_len = 0;
1329 dev->num_vq = 0;
1330 dev->running = false;
1331
1332 /*
1333 * Append to device list. Prepending to a single-linked list is
1334 * easier, but the user expects the devices to be arranged on the bus
1335 * in command-line order. The first network device on the command line
1336 * is eth0, the first block device /dev/vda, etc.
1337 */
1338 if (devices.lastdev)
1339 devices.lastdev->next = dev;
1340 else
1341 devices.dev = dev;
1342 devices.lastdev = dev;
1343
1344 return dev;
1345}
1346
1347/*
1348 * Our first setup routine is the console. It's a fairly simple device, but
1349 * UNIX tty handling makes it uglier than it could be.
1350 */
1351static void setup_console(void)
1352{
1353 struct device *dev;
1354
1355 /* If we can save the initial standard input settings... */
1356 if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
1357 struct termios term = orig_term;
1358 /*
1359 * Then we turn off echo, line buffering and ^C etc: We want a
1360 * raw input stream to the Guest.
1361 */
1362 term.c_lflag &= ~(ISIG|ICANON|ECHO);
1363 tcsetattr(STDIN_FILENO, TCSANOW, &term);
1364 }
1365
1366 dev = new_device("console", VIRTIO_ID_CONSOLE);
1367
1368 /* We store the console state in dev->priv, and initialize it. */
1369 dev->priv = malloc(sizeof(struct console_abort));
1370 ((struct console_abort *)dev->priv)->count = 0;
1371
1372 /*
1373 * The console needs two virtqueues: the input then the output. When
1374 * they put something in the input queue, we make sure we're listening to
1375 * stdin. When they put something in the output queue, we write it to
1376 * stdout.
1377 */
1378 add_virtqueue(dev, VIRTQUEUE_NUM, console_input);
1379 add_virtqueue(dev, VIRTQUEUE_NUM, console_output);
1380
1381 verbose("device %u: console\n", ++devices.device_num);
1382}
1383/*:*/
1384
1385/*M:010
1386 * Inter-guest networking is an interesting area. Simplest is to have a
1387 * --sharenet=<name> option which opens or creates a named pipe. This can be
1388 * used to send packets to another guest in a 1:1 manner.
1389 *
1390 * More sophisticated is to use one of the tools developed for projects like
1391 * UML to do networking.
1392 *
1393 * Faster is to do virtio bonding in kernel. Doing this 1:1 would be
1394 * completely generic ("here's my vring, attach to your vring") and would work
1395 * for any traffic. Of course, namespace and permissions issues need to be
1396 * dealt with. A more sophisticated "multi-channel" virtio_net.c could hide
1397 * multiple inter-guest channels behind one interface, although it would
1398 * require some manner of hotplugging new virtio channels.
1399 *
1400 * Finally, we could implement a virtio network switch in the kernel.
1401:*/
1402
1403static u32 str2ip(const char *ipaddr)
1404{
1405 unsigned int b[4];
1406
1407 if (sscanf(ipaddr, "%u.%u.%u.%u", &b[0], &b[1], &b[2], &b[3]) != 4)
1408 errx(1, "Failed to parse IP address '%s'", ipaddr);
1409 return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
1410}
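
/*
 * str2ip() packs the four octets into one host-order word, first octet in
 * the top byte. A stricter variant might also reject octets above 255 and
 * trailing junk. This is only a sketch: sk_str2ip() is a hypothetical name,
 * and it returns an error code instead of calling errx().
 */

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 0 and fills *ip (host byte order) on success, -1 otherwise. */
int sk_str2ip(const char *s, uint32_t *ip)
{
	unsigned int b[4];
	char trailing;

	/* Exactly four conversions: a fifth means junk after the address. */
	if (sscanf(s, "%u.%u.%u.%u%c",
		   &b[0], &b[1], &b[2], &b[3], &trailing) != 4)
		return -1;
	if (b[0] > 255 || b[1] > 255 || b[2] > 255 || b[3] > 255)
		return -1;

	*ip = ((uint32_t)b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
	return 0;
}
```

/*
 * For example, "192.168.19.1" becomes 0xC0A81301; configure_device() then
 * runs the value through htonl() before handing it to the kernel.
 */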
1411
1412static void str2mac(const char *macaddr, unsigned char mac[6])
1413{
1414 unsigned int m[6];
1415 if (sscanf(macaddr, "%02x:%02x:%02x:%02x:%02x:%02x",
1416 &m[0], &m[1], &m[2], &m[3], &m[4], &m[5]) != 6)
1417 errx(1, "Failed to parse mac address '%s'", macaddr);
1418 mac[0] = m[0];
1419 mac[1] = m[1];
1420 mac[2] = m[2];
1421 mac[3] = m[3];
1422 mac[4] = m[4];
1423 mac[5] = m[5];
1424}
1425
1426/*
1427 * This code is "adapted" from libbridge: it attaches the Host end of the
1428 * network device to the bridge device specified by the command line.
1429 *
1430 * This is yet another James Morris contribution (I'm an IP-level guy, so I
1431 * dislike bridging), and I just try not to break it.
1432 */
1433static void add_to_bridge(int fd, const char *if_name, const char *br_name)
1434{
1435 int ifidx;
1436 struct ifreq ifr;
1437
1438 if (!*br_name)
1439 errx(1, "must specify bridge name");
1440
1441 ifidx = if_nametoindex(if_name);
1442 if (!ifidx)
1443 errx(1, "interface %s does not exist!", if_name);
1444
1445 strncpy(ifr.ifr_name, br_name, IFNAMSIZ);
1446 ifr.ifr_name[IFNAMSIZ-1] = '\0';
1447 ifr.ifr_ifindex = ifidx;
1448 if (ioctl(fd, SIOCBRADDIF, &ifr) < 0)
1449 err(1, "can't add %s to bridge %s", if_name, br_name);
1450}
1451
1452/*
1453 * This sets up the Host end of the network device: it assigns the given IP
1454 * address to the tap interface, then brings the interface up so packets will
1455 * flow.
1456 */
1457static void configure_device(int fd, const char *tapif, u32 ipaddr)
1458{
1459 struct ifreq ifr;
1460 struct sockaddr_in sin;
1461
1462 memset(&ifr, 0, sizeof(ifr));
1463 strcpy(ifr.ifr_name, tapif);
1464
1465 /* Don't read these incantations. Just cut & paste them like I did! */
1466 sin.sin_family = AF_INET;
1467 sin.sin_addr.s_addr = htonl(ipaddr);
1468 memcpy(&ifr.ifr_addr, &sin, sizeof(sin));
1469 if (ioctl(fd, SIOCSIFADDR, &ifr) != 0)
1470 err(1, "Setting %s interface address", tapif);
1471 ifr.ifr_flags = IFF_UP;
1472 if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0)
1473 err(1, "Bringing interface %s up", tapif);
1474}
1475
1476static int get_tun_device(char tapif[IFNAMSIZ])
1477{
1478 struct ifreq ifr;
1479 int netfd;
1480
1481 /* Start with this zeroed. Messy but sure. */
1482 memset(&ifr, 0, sizeof(ifr));
1483
1484 /*
1485 * We open the /dev/net/tun device and tell it we want a tap device. A
1486 * tap device is like a tun device, only somehow different. To tell
1487 * the truth, I completely blundered my way through this code, but it
1488 * works now!
1489 */
1490 netfd = open_or_die("/dev/net/tun", O_RDWR);
1491 ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
1492 strcpy(ifr.ifr_name, "tap%d");
1493 if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
1494 err(1, "configuring /dev/net/tun");
1495
1496 if (ioctl(netfd, TUNSETOFFLOAD,
1497 TUN_F_CSUM|TUN_F_TSO4|TUN_F_TSO6|TUN_F_TSO_ECN) != 0)
1498 err(1, "Could not set features for tun device");
1499
1500 /*
1501 * We don't need checksums calculated for packets coming in this
1502 * device: trust us!
1503 */
1504 ioctl(netfd, TUNSETNOCSUM, 1);
1505
1506 memcpy(tapif, ifr.ifr_name, IFNAMSIZ);
1507 return netfd;
1508}
1509
1510/*L:195
1511 * Our network is a Host<->Guest network. This can either use bridging or
1512 * routing, but the principle is the same: it uses the "tun" device to inject
1513 * packets into the Host as if they came in from a normal network card. We
1514 * just shunt packets between the Guest and the tun device.
1515 */
1516static void setup_tun_net(char *arg)
1517{
1518 struct device *dev;
1519 struct net_info *net_info = malloc(sizeof(*net_info));
1520 int ipfd;
1521 u32 ip = INADDR_ANY;
1522 bool bridging = false;
1523 char tapif[IFNAMSIZ], *p;
	struct virtio_net_config conf;

	/* Zero the config so no uninitialized stack bytes reach the Guest. */
	memset(&conf, 0, sizeof(conf));

	net_info->tunfd = get_tun_device(tapif);
1527
1528 /* First we create a new network device. */
1529 dev = new_device("net", VIRTIO_ID_NET);
1530 dev->priv = net_info;
1531
1532 /* Network devices need a recv and a send queue, just like console. */
1533 add_virtqueue(dev, VIRTQUEUE_NUM, net_input);
1534 add_virtqueue(dev, VIRTQUEUE_NUM, net_output);
1535
1536 /*
1537 * We need a socket to perform the magic network ioctls to bring up the
1538 * tap interface, connect to the bridge etc. Any socket will do!
1539 */
1540 ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
1541 if (ipfd < 0)
1542 err(1, "opening IP socket");
1543
1544 /* If the command line was --tunnet=bridge:<name> do bridging. */
1545 if (!strncmp(BRIDGE_PFX, arg, strlen(BRIDGE_PFX))) {
1546 arg += strlen(BRIDGE_PFX);
1547 bridging = true;
1548 }
1549
1550 /* A mac address may follow the bridge name or IP address */
1551 p = strchr(arg, ':');
1552 if (p) {
1553 str2mac(p+1, conf.mac);
1554 add_feature(dev, VIRTIO_NET_F_MAC);
1555 *p = '\0';
1556 }
1557
1558 /* arg is now either an IP address or a bridge name */
1559 if (bridging)
1560 add_to_bridge(ipfd, tapif, arg);
1561 else
1562 ip = str2ip(arg);
1563
1564 /* Set up the tun device. */
1565 configure_device(ipfd, tapif, ip);
1566
1567 add_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY);
1568 /* Expect Guest to handle everything except UFO */
1569 add_feature(dev, VIRTIO_NET_F_CSUM);
1570 add_feature(dev, VIRTIO_NET_F_GUEST_CSUM);
1571 add_feature(dev, VIRTIO_NET_F_GUEST_TSO4);
1572 add_feature(dev, VIRTIO_NET_F_GUEST_TSO6);
1573 add_feature(dev, VIRTIO_NET_F_GUEST_ECN);
1574 add_feature(dev, VIRTIO_NET_F_HOST_TSO4);
1575 add_feature(dev, VIRTIO_NET_F_HOST_TSO6);
1576 add_feature(dev, VIRTIO_NET_F_HOST_ECN);
1577 /* We handle indirect ring entries */
1578 add_feature(dev, VIRTIO_RING_F_INDIRECT_DESC);
1579 set_config(dev, sizeof(conf), &conf);
1580
1581 /* We don't need the socket any more; setup is done. */
1582 close(ipfd);
1583
1584 devices.device_num++;
1585
1586 if (bridging)
1587 verbose("device %u: tun %s attached to bridge: %s\n",
1588 devices.device_num, tapif, arg);
1589 else
1590 verbose("device %u: tun %s: %s\n",
1591 devices.device_num, tapif, arg);
1592}
1593/*:*/
1594
1595/* This hangs off device->priv. */
1596struct vblk_info {
1597 /* The size of the file. */
1598 off64_t len;
1599
1600 /* The file descriptor for the file. */
1601 int fd;
1602
1603};
1604
1605/*L:210
1606 * The Disk
1607 *
1608 * The disk only has one virtqueue, so it only has one thread. It is really
1609 * simple: the Guest asks for a block number and we read or write that position
1610 * in the file.
1611 *
1612 * Before we serviced each virtqueue in a separate thread, that was unacceptably
1613 * slow: the Guest waits until the read is finished before running anything
1614 * else, even if it could have been doing useful work.
1615 *
1616 * We could have used async I/O, except it's reputed to suck so hard that
1617 * characters actually go missing from your code when you try to use it.
1618 */
1619static void blk_request(struct virtqueue *vq)
1620{
1621 struct vblk_info *vblk = vq->dev->priv;
1622 unsigned int head, out_num, in_num, wlen;
1623 int ret;
1624 u8 *in;
1625 struct virtio_blk_outhdr *out;
1626 struct iovec iov[vq->vring.num];
1627 off64_t off;
1628
1629 /*
1630 * Get the next request, where we normally wait. It triggers the
1631 * interrupt to acknowledge previously serviced requests (if any).
1632 */
1633 head = wait_for_vq_desc(vq, iov, &out_num, &in_num);
1634
1635 /*
1636 * Every block request should contain at least one output buffer
1637 * (detailing the location on disk and the type of request) and one
1638 * input buffer (to hold the result).
1639 */
1640 if (out_num == 0 || in_num == 0)
1641 errx(1, "Bad virtblk cmd %u out=%u in=%u",
1642 head, out_num, in_num);
1643
1644 out = convert(&iov[0], struct virtio_blk_outhdr);
1645 in = convert(&iov[out_num+in_num-1], u8);
1646 /*
1647 * For historical reasons, block operations are expressed in 512 byte
1648 * "sectors".
1649 */
1650 off = out->sector * 512;
1651
1652 /*
1653 * In general the virtio block driver is allowed to try SCSI commands.
1654 * It'd be nice if we supported eject, for example, but we don't.
1655 */
1656 if (out->type & VIRTIO_BLK_T_SCSI_CMD) {
1657 fprintf(stderr, "Scsi commands unsupported\n");
1658 *in = VIRTIO_BLK_S_UNSUPP;
1659 wlen = sizeof(*in);
1660 } else if (out->type & VIRTIO_BLK_T_OUT) {
1661 /*
1662 * Write
1663 *
1664 * Move to the right location in the block file. This can fail
1665 * if they try to write past end.
1666 */
1667 if (lseek64(vblk->fd, off, SEEK_SET) != off)
1668 err(1, "Bad seek to sector %llu", out->sector);
1669
1670 ret = writev(vblk->fd, iov+1, out_num-1);
1671 verbose("WRITE to sector %llu: %i\n", out->sector, ret);
1672
1673 /*
1674 * Grr... Now we know how long the descriptor they sent was, we
1675 * make sure they didn't try to write over the end of the block
1676 * file (possibly extending it).
1677 */
		if (ret > 0 && off + ret > vblk->len) {
			/* Trim it back to the correct length */
			ftruncate64(vblk->fd, vblk->len);
			/* Die, bad Guest, die. */
			errx(1, "Write past end %llu+%u",
			     (unsigned long long)off, (unsigned int)ret);
		}
1684
1685 wlen = sizeof(*in);
1686 *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
1687 } else if (out->type & VIRTIO_BLK_T_FLUSH) {
1688 /* Flush */
1689 ret = fdatasync(vblk->fd);
1690 verbose("FLUSH fdatasync: %i\n", ret);
1691 wlen = sizeof(*in);
1692 *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
1693 } else {
1694 /*
1695 * Read
1696 *
1697 * Move to the right location in the block file. This can fail
1698 * if they try to read past end.
1699 */
1700 if (lseek64(vblk->fd, off, SEEK_SET) != off)
1701 err(1, "Bad seek to sector %llu", out->sector);
1702
1703 ret = readv(vblk->fd, iov+1, in_num-1);
1704 verbose("READ from sector %llu: %i\n", out->sector, ret);
1705 if (ret >= 0) {
1706 wlen = sizeof(*in) + ret;
1707 *in = VIRTIO_BLK_S_OK;
1708 } else {
1709 wlen = sizeof(*in);
1710 *in = VIRTIO_BLK_S_IOERR;
1711 }
1712 }
1713
1714 /* Finished that request. */
1715 add_used(vq, head, wlen);
1716}
1717
1718/*L:198 This actually sets up a virtual block device. */
1719static void setup_block_file(const char *filename)
1720{
1721 struct device *dev;
1722 struct vblk_info *vblk;
1723 struct virtio_blk_config conf;
1724
	/* Create the device. */
1726 dev = new_device("block", VIRTIO_ID_BLOCK);
1727
1728 /* The device has one virtqueue, where the Guest places requests. */
1729 add_virtqueue(dev, VIRTQUEUE_NUM, blk_request);
1730
1731 /* Allocate the room for our own bookkeeping */
1732 vblk = dev->priv = malloc(sizeof(*vblk));
1733
1734 /* First we open the file and store the length. */
1735 vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
1736 vblk->len = lseek64(vblk->fd, 0, SEEK_END);
1737
1738 /* We support FLUSH. */
1739 add_feature(dev, VIRTIO_BLK_F_FLUSH);
1740
1741 /* Tell Guest how many sectors this device has. */
1742 conf.capacity = cpu_to_le64(vblk->len / 512);
1743
1744 /*
1745 * Tell Guest not to put in too many descriptors at once: two are used
1746 * for the in and out elements.
1747 */
1748 add_feature(dev, VIRTIO_BLK_F_SEG_MAX);
1749 conf.seg_max = cpu_to_le32(VIRTQUEUE_NUM - 2);
1750
1751 /* Don't try to put whole struct: we have 8 bit limit. */
1752 set_config(dev, offsetof(struct virtio_blk_config, geometry), &conf);
1753
1754 verbose("device %u: virtblock %llu sectors\n",
1755 ++devices.device_num, le64_to_cpu(conf.capacity));
1756}
1757
1758/*L:211
1759 * Our random number generator device reads from /dev/random into the Guest's
1760 * input buffers. The usual case is that the Guest doesn't want random numbers
1761 * and so has no buffers although /dev/random is still readable, whereas
1762 * console is the reverse.
1763 *
1764 * The same logic applies, however.
1765 */
1766struct rng_info {
1767 int rfd;
1768};
1769
1770static void rng_input(struct virtqueue *vq)
1771{
1772 int len;
1773 unsigned int head, in_num, out_num, totlen = 0;
1774 struct rng_info *rng_info = vq->dev->priv;
1775 struct iovec iov[vq->vring.num];
1776
	/* First we need a buffer from the Guest's virtqueue. */
1778 head = wait_for_vq_desc(vq, iov, &out_num, &in_num);
1779 if (out_num)
1780 errx(1, "Output buffers in rng?");
1781
1782 /*
1783 * Just like the console write, we loop to cover the whole iovec.
1784 * In this case, short reads actually happen quite a bit.
1785 */
1786 while (!iov_empty(iov, in_num)) {
1787 len = readv(rng_info->rfd, iov, in_num);
1788 if (len <= 0)
1789 err(1, "Read from /dev/random gave %i", len);
1790 iov_consume(iov, in_num, len);
1791 totlen += len;
1792 }
1793
1794 /* Tell the Guest about the new input. */
1795 add_used(vq, head, totlen);
1796}
1797
1798/*L:199
1799 * This creates a "hardware" random number device for the Guest.
1800 */
1801static void setup_rng(void)
1802{
1803 struct device *dev;
1804 struct rng_info *rng_info = malloc(sizeof(*rng_info));
1805
	/* Our device's private info simply contains the /dev/random fd. */
1807 rng_info->rfd = open_or_die("/dev/random", O_RDONLY);
1808
1809 /* Create the new device. */
1810 dev = new_device("rng", VIRTIO_ID_RNG);
1811 dev->priv = rng_info;
1812
1813 /* The device has one virtqueue, where the Guest places inbufs. */
1814 add_virtqueue(dev, VIRTQUEUE_NUM, rng_input);
1815
1816 verbose("device %u: rng\n", devices.device_num++);
1817}
1818/* That's the end of device setup. */
1819
1820/*L:230 Reboot is pretty easy: clean up and exec() the Launcher afresh. */
1821static void __attribute__((noreturn)) restart_guest(void)
1822{
1823 unsigned int i;
1824
1825 /*
1826 * Since we don't track all open fds, we simply close everything beyond
1827 * stderr.
1828 */
1829 for (i = 3; i < FD_SETSIZE; i++)
1830 close(i);
1831
1832 /* Reset all the devices (kills all threads). */
1833 cleanup_devices();
1834
1835 execv(main_args[0], main_args);
1836 err(1, "Could not exec %s", main_args[0]);
1837}
1838
1839/*L:220
1840 * Finally we reach the core of the Launcher which runs the Guest, serves
1841 * its input and output, and finally, lays it to rest.
1842 */
1843static void __attribute__((noreturn)) run_guest(void)
1844{
1845 for (;;) {
1846 unsigned long notify_addr;
1847 int readval;
1848
1849 /* We read from the /dev/lguest device to run the Guest. */
1850 readval = pread(lguest_fd, &notify_addr,
1851 sizeof(notify_addr), cpu_id);
1852
1853 /* One unsigned long means the Guest did HCALL_NOTIFY */
1854 if (readval == sizeof(notify_addr)) {
1855 verbose("Notify on address %#lx\n", notify_addr);
1856 handle_output(notify_addr);
1857 /* ENOENT means the Guest died. Reading tells us why. */
1858 } else if (errno == ENOENT) {
1859 char reason[1024] = { 0 };
1860 pread(lguest_fd, reason, sizeof(reason)-1, cpu_id);
1861 errx(1, "%s", reason);
1862 /* ERESTART means that we need to reboot the guest */
1863 } else if (errno == ERESTART) {
1864 restart_guest();
1865 /* Anything else means a bug or incompatible change. */
1866 } else
1867 err(1, "Running guest failed");
1868 }
1869}
1870/*L:240
1871 * This is the end of the Launcher. The good news: we are over halfway
1872 * through! The bad news: the most fiendish part of the code still lies ahead
1873 * of us.
1874 *
1875 * Are you ready? Take a deep breath and join me in the core of the Host, in
1876 * "make Host".
1877:*/
1878
1879static struct option opts[] = {
1880 { "verbose", 0, NULL, 'v' },
1881 { "tunnet", 1, NULL, 't' },
1882 { "block", 1, NULL, 'b' },
1883 { "rng", 0, NULL, 'r' },
1884 { "initrd", 1, NULL, 'i' },
1885 { "username", 1, NULL, 'u' },
1886 { "chroot", 1, NULL, 'c' },
1887 { NULL },
1888};
1889static void usage(void)
1890{
	errx(1, "Usage: lguest [--verbose] "
	     "[--tunnet=(<ipaddr>:<macaddr>|bridge:<bridgename>:<macaddr>)\n"
	     "|--block=<filename>|--initrd=<filename>|--rng\n"
	     "|--username=<name>|--chroot=<path>]...\n"
	     "<mem-in-mb> vmlinux [args...]");
1895}
1896
1897/*L:105 The main routine is where the real work begins: */
1898int main(int argc, char *argv[])
1899{
1900 /* Memory, code startpoint and size of the (optional) initrd. */
1901 unsigned long mem = 0, start, initrd_size = 0;
1902 /* Two temporaries. */
1903 int i, c;
1904 /* The boot information for the Guest. */
1905 struct boot_params *boot;
1906 /* If they specify an initrd file to load. */
1907 const char *initrd_name = NULL;
1908
1909 /* Password structure for initgroups/setres[gu]id */
1910 struct passwd *user_details = NULL;
1911
1912 /* Directory to chroot to */
1913 char *chroot_path = NULL;
1914
1915 /* Save the args: we "reboot" by execing ourselves again. */
1916 main_args = argv;
1917
1918 /*
1919 * First we initialize the device list. We keep a pointer to the last
1920 * device, and the next interrupt number to use for devices (1:
1921 * remember that 0 is used by the timer).
1922 */
1923 devices.lastdev = NULL;
1924 devices.next_irq = 1;
1925
1926 /* We're CPU 0. In fact, that's the only CPU possible right now. */
1927 cpu_id = 0;
1928
1929 /*
1930 * We need to know how much memory so we can set up the device
1931 * descriptor and memory pages for the devices as we parse the command
1932 * line. So we quickly look through the arguments to find the amount
1933 * of memory now.
1934 */
1935 for (i = 1; i < argc; i++) {
1936 if (argv[i][0] != '-') {
1937 mem = atoi(argv[i]) * 1024 * 1024;
1938 /*
1939 * We start by mapping anonymous pages over all of
1940 * guest-physical memory range. This fills it with 0,
1941 * and ensures that the Guest won't be killed when it
1942 * tries to access it.
1943 */
1944 guest_base = map_zeroed_pages(mem / getpagesize()
1945 + DEVICE_PAGES);
1946 guest_limit = mem;
1947 guest_max = mem + DEVICE_PAGES*getpagesize();
1948 devices.descpage = get_pages(1);
1949 break;
1950 }
1951 }
1952
1953 /* The options are fairly straight-forward */
1954 while ((c = getopt_long(argc, argv, "v", opts, NULL)) != EOF) {
1955 switch (c) {
1956 case 'v':
1957 verbose = true;
1958 break;
1959 case 't':
1960 setup_tun_net(optarg);
1961 break;
1962 case 'b':
1963 setup_block_file(optarg);
1964 break;
1965 case 'r':
1966 setup_rng();
1967 break;
1968 case 'i':
1969 initrd_name = optarg;
1970 break;
1971 case 'u':
1972 user_details = getpwnam(optarg);
1973 if (!user_details)
1974 err(1, "getpwnam failed, incorrect username?");
1975 break;
1976 case 'c':
1977 chroot_path = optarg;
1978 break;
1979 default:
1980 warnx("Unknown argument %s", argv[optind]);
1981 usage();
1982 }
1983 }
1984 /*
1985 * After the other arguments we expect memory and kernel image name,
1986 * followed by command line arguments for the kernel.
1987 */
1988 if (optind + 2 > argc)
1989 usage();
1990
1991 verbose("Guest base is at %p\n", guest_base);
1992
1993 /* We always have a console device */
1994 setup_console();
1995
1996 /* Now we load the kernel */
1997 start = load_kernel(open_or_die(argv[optind+1], O_RDONLY));
1998
1999 /* Boot information is stashed at physical address 0 */
2000 boot = from_guest_phys(0);
2001
2002 /* Map the initrd image if requested (at top of physical memory) */
2003 if (initrd_name) {
2004 initrd_size = load_initrd(initrd_name, mem);
		/*
		 * These are the locations in the Linux boot header where the
		 * start and size of the initrd are expected to be found.
		 */
2009 boot->hdr.ramdisk_image = mem - initrd_size;
2010 boot->hdr.ramdisk_size = initrd_size;
2011 /* The bootloader type 0xFF means "unknown"; that's OK. */
2012 boot->hdr.type_of_loader = 0xFF;
2013 }
2014
2015 /*
2016 * The Linux boot header contains an "E820" memory map: ours is a
2017 * simple, single region.
2018 */
2019 boot->e820_entries = 1;
2020 boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM });
2021 /*
2022 * The boot header contains a command line pointer: we put the command
2023 * line after the boot header.
2024 */
2025 boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1);
2026 /* We use a simple helper to copy the arguments separated by spaces. */
2027 concat((char *)(boot + 1), argv+optind+2);
2028
2029 /* Boot protocol version: 2.07 supports the fields for lguest. */
2030 boot->hdr.version = 0x207;
2031
2032 /* The hardware_subarch value of "1" tells the Guest it's an lguest. */
2033 boot->hdr.hardware_subarch = 1;
2034
2035 /* Tell the entry path not to try to reload segment registers. */
2036 boot->hdr.loadflags |= KEEP_SEGMENTS;
2037
2038 /*
2039 * We tell the kernel to initialize the Guest: this returns the open
2040 * /dev/lguest file descriptor.
2041 */
2042 tell_kernel(start);
2043
2044 /* Ensure that we terminate if a device-servicing child dies. */
2045 signal(SIGCHLD, kill_launcher);
2046
2047 /* If we exit via err(), this kills all the threads, restores tty. */
2048 atexit(cleanup_devices);
2049
2050 /* If requested, chroot to a directory */
2051 if (chroot_path) {
2052 if (chroot(chroot_path) != 0)
2053 err(1, "chroot(\"%s\") failed", chroot_path);
2054
2055 if (chdir("/") != 0)
2056 err(1, "chdir(\"/\") failed");
2057
2058 verbose("chroot done\n");
2059 }
2060
2061 /* If requested, drop privileges */
2062 if (user_details) {
2063 uid_t u;
2064 gid_t g;
2065
2066 u = user_details->pw_uid;
2067 g = user_details->pw_gid;
2068
2069 if (initgroups(user_details->pw_name, g) != 0)
2070 err(1, "initgroups failed");
2071
2072 if (setresgid(g, g, g) != 0)
2073 err(1, "setresgid failed");
2074
2075 if (setresuid(u, u, u) != 0)
2076 err(1, "setresuid failed");
2077
2078 verbose("Dropping privileges completed\n");
2079 }
2080
2081 /* Finally, run the Guest. This doesn't return. */
2082 run_guest();
2083}
2084/*:*/
2085
2086/*M:999
2087 * Mastery is done: you now know everything I do.
2088 *
2089 * But surely you have seen code, features and bugs in your wanderings which
2090 * you now yearn to attack? That is the real game, and I look forward to you
2091 * patching and forking lguest into the Your-Name-Here-visor.
2092 *
2093 * Farewell, and good coding!
2094 * Rusty Russell.
2095 */
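
As a recap of the disk arithmetic used by blk_request() and setup_block_file() in lguest.c above, the following is a minimal sketch, not code from the Launcher. The helper names sector_to_offset(), capacity_in_sectors() and io_past_end() are hypothetical. Note one deliberate difference: lguest.c performs the write first and then truncates the file and exits if the Guest wrote past the end, whereas this sketch assumes the length of the transfer is known and checks the bounds up front.

```c
#include <stdbool.h>
#include <stdint.h>

/* Block requests are expressed in 512-byte sectors. */
#define SECTOR_SIZE 512ULL

/* Byte offset of a sector within the backing file (hypothetical helper). */
uint64_t sector_to_offset(uint64_t sector)
{
	return sector * SECTOR_SIZE;
}

/* Whole sectors in a file of file_len bytes; a partial tail is dropped. */
uint64_t capacity_in_sectors(uint64_t file_len)
{
	return file_len / SECTOR_SIZE;
}

/* True if nbytes starting at sector would run past the end of the file. */
bool io_past_end(uint64_t sector, uint64_t nbytes, uint64_t file_len)
{
	uint64_t off;

	/* Reject sector numbers whose byte offset would overflow 64 bits. */
	if (sector > UINT64_MAX / SECTOR_SIZE)
		return true;

	off = sector_to_offset(sector);
	return off > file_len || nbytes > file_len - off;
}
```

For a 1024-byte backing file, a 512-byte transfer at sector 1 fits exactly, while 513 bytes at the same sector does not.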
diff --git a/Documentation/virtual/lguest/lguest.txt b/Documentation/virtual/lguest/lguest.txt
new file mode 100644
index 000000000000..dad99978a6a8
--- /dev/null
+++ b/Documentation/virtual/lguest/lguest.txt
@@ -0,0 +1,128 @@
1 __
2 (___()'`; Rusty's Remarkably Unreliable Guide to Lguest
3 /, /` - or, A Young Coder's Illustrated Hypervisor
4 \\"--\\ http://lguest.ozlabs.org
5
6Lguest is designed to be a minimal 32-bit x86 hypervisor for the Linux kernel,
7for Linux developers and users to experiment with virtualization with the
8minimum of complexity. Nonetheless, it should have sufficient features to
9make it useful for specific tasks, and, of course, you are encouraged to fork
10and enhance it (see drivers/lguest/README).
11
12Features:
13
14- Kernel module which runs in a normal kernel.
15- Simple I/O model for communication.
16- Simple program to create new guests.
17- Logo contains cute puppies: http://lguest.ozlabs.org
18
19Developer features:
20
21- Fun to hack on.
22- No ABI: being tied to a specific kernel anyway, you can change anything.
23- Many opportunities for improvement or feature implementation.
24
25Running Lguest:
26
- The easiest way to run lguest is to use the same kernel as guest and host.
  You can configure them differently, but usually it's easiest not to.
29
30 You will need to configure your kernel with the following options:
31
32 "General setup":
33 "Prompt for development and/or incomplete code/drivers" = Y
34 (CONFIG_EXPERIMENTAL=y)
35
36 "Processor type and features":
37 "Paravirtualized guest support" = Y
38 "Lguest guest support" = Y
39 "High Memory Support" = off/4GB
40 "Alignment value to which kernel should be aligned" = 0x100000
41 (CONFIG_PARAVIRT=y, CONFIG_LGUEST_GUEST=y, CONFIG_HIGHMEM64G=n and
42 CONFIG_PHYSICAL_ALIGN=0x100000)
43
44 "Device Drivers":
45 "Block devices"
46 "Virtio block driver (EXPERIMENTAL)" = M/Y
47 "Network device support"
48 "Universal TUN/TAP device driver support" = M/Y
49 "Virtio network driver (EXPERIMENTAL)" = M/Y
50 (CONFIG_VIRTIO_BLK=m, CONFIG_VIRTIO_NET=m and CONFIG_TUN=m)
51
52 "Virtualization"
53 "Linux hypervisor example code" = M/Y
54 (CONFIG_LGUEST=m)
55
56- A tool called "lguest" is available in this directory: type "make"
57 to build it. If you didn't build your kernel in-tree, use "make
58 O=<builddir>".
59
60- Create or find a root disk image. There are several useful ones
61 around, such as the xm-test tiny root image at
62 http://xm-test.xensource.com/ramdisks/initrd-1.1-i386.img
63
64 For more serious work, I usually use a distribution ISO image and
65 install it under qemu, then make multiple copies:
66
67 dd if=/dev/zero of=rootfile bs=1M count=2048
68 qemu -cdrom image.iso -hda rootfile -net user -net nic -boot d
69
70 Make sure that you install a getty on /dev/hvc0 if you want to log in on the
71 console!
72
73- "modprobe lg" if you built it as a module.
74
75- Run an lguest as root:
76
  Documentation/virtual/lguest/lguest 64 vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/vda
78
79 Explanation:
80 64: the amount of memory to use, in MB.
81
82 vmlinux: the kernel image found in the top of your build directory. You
83 can also use a standard bzImage.
84
85 --tunnet=192.168.19.1: configures a "tap" device for networking with this
86 IP address.
87
88 --block=rootfile: a file or block device which becomes /dev/vda
89 inside the guest.
90
91 root=/dev/vda: this (and anything else on the command line) are
92 kernel boot parameters.
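
To make the <mem-in-mb> argument concrete, here is a minimal sketch (not code taken from lguest.c) of how a size in megabytes becomes a byte count and a page count. The helper names guest_mem_bytes() and guest_total_pages() are hypothetical, and DEVICE_PAGES = 256 is an assumed value for the pages the Launcher reserves for devices above Guest RAM.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MB		(1024ULL * 1024ULL)
#define DEVICE_PAGES	256	/* assumption: pages reserved for devices */

/* Parse "<mem-in-mb>" into bytes; returns 0 if the argument is malformed. */
uint64_t guest_mem_bytes(const char *arg)
{
	char *end;
	unsigned long long mb;

	/* Only plain decimal numbers are accepted (no sign, no whitespace). */
	if (arg[0] < '0' || arg[0] > '9')
		return 0;

	errno = 0;
	mb = strtoull(arg, &end, 10);
	if (errno != 0 || *end != '\0' || mb > UINT64_MAX / MB)
		return 0;

	return (uint64_t)mb * MB;
}

/* Pages to map: Guest RAM plus the device pages that sit above it. */
uint64_t guest_total_pages(uint64_t mem_bytes, uint64_t page_size)
{
	return mem_bytes / page_size + DEVICE_PAGES;
}
```

With "64" and 4096-byte pages this gives 67108864 bytes of Guest RAM and 16384 + 256 = 16640 pages in total.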
93
94- Configuring networking. I usually have the host masquerade, using
95 "iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE" and "echo 1 >
96 /proc/sys/net/ipv4/ip_forward". In this example, I would configure
97 eth0 inside the guest at 192.168.19.2.
98
99 Another method is to bridge the tap device to an external interface
100 using --tunnet=bridge:<bridgename>, and perhaps run dhcp on the guest
101 to obtain an IP address. The bridge needs to be configured first:
102 this option simply adds the tap interface to it.
103
104 A simple example on my system:
105
106 ifconfig eth0 0.0.0.0
107 brctl addbr lg0
108 ifconfig lg0 up
109 brctl addif lg0 eth0
110 dhclient lg0
111
112 Then use --tunnet=bridge:lg0 when launching the guest.
113
114 See:
115
116 http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
117
118 for general information on how to get bridging to work.
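
The two --tunnet forms described above (an IP address, or bridge:<bridgename>, each optionally followed by :<macaddr>) can be split apart roughly as follows. This is a minimal sketch under stated assumptions, not the Launcher's own parser: struct tunnet_arg and parse_tunnet() are hypothetical names, and the MAC address is decoded with sscanf() rather than with lguest.c's str2mac() helper.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define BRIDGE_PFX "bridge:"

/* Hypothetical result structure for a parsed --tunnet argument. */
struct tunnet_arg {
	bool bridging;		/* true for bridge:<bridgename> */
	char target[64];	/* bridge name or dotted-quad IP address */
	bool has_mac;		/* true if :<macaddr> was given */
	unsigned char mac[6];
};

/* Returns 0 on success, -1 if the argument is malformed. */
int parse_tunnet(const char *arg, struct tunnet_arg *out)
{
	const char *colon;
	size_t len;

	memset(out, 0, sizeof(*out));

	if (!strncmp(arg, BRIDGE_PFX, strlen(BRIDGE_PFX))) {
		arg += strlen(BRIDGE_PFX);
		out->bridging = true;
	}

	/* Everything up to the first ':' is the IP address or bridge name. */
	colon = strchr(arg, ':');
	len = colon ? (size_t)(colon - arg) : strlen(arg);
	if (len == 0 || len >= sizeof(out->target))
		return -1;
	memcpy(out->target, arg, len);	/* memset above NUL-terminates it */

	/* An optional MAC address follows the ':'. */
	if (colon) {
		unsigned int m[6];
		int i;

		if (sscanf(colon + 1, "%2x:%2x:%2x:%2x:%2x:%2x",
			   &m[0], &m[1], &m[2], &m[3], &m[4], &m[5]) != 6)
			return -1;
		for (i = 0; i < 6; i++)
			out->mac[i] = (unsigned char)m[i];
		out->has_mac = true;
	}
	return 0;
}
```

Splitting at the first ':' works because neither a dotted-quad IP address nor a bridge name contains a colon, so whatever follows it can only be the MAC address.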
119
- Random number generation. Using the --rng option will provide a
  /dev/hwrng in the guest that will read from the host's /dev/random.
  Use this option in conjunction with rng-tools (see ../../hw_random.txt)
  to provide entropy to the guest kernel's /dev/random.
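
Reads from /dev/random often return fewer bytes than requested, so anything that fills a buffer from it has to loop. The sketch below shows that loop in its simplest form. It is an illustration only: read_full() is a hypothetical helper that uses plain read() on a single buffer, whereas the Launcher's rng_input() works on an iovec with readv().

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd, retrying after short reads.
 * Returns the number of bytes read (less than len only at end of file),
 * or -1 on error.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = read(fd, (char *)buf + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: try again */
			return -1;
		}
		if (n == 0)
			break;			/* end of file */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

A caller would open /dev/random with O_RDONLY and pass the resulting descriptor along with the buffer it wants filled.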
124
125There is a helpful mailing list at http://ozlabs.org/mailman/listinfo/lguest
126
127Good luck!
128Rusty Russell rusty@rustcorp.com.au.
diff --git a/Documentation/virtual/uml/UserModeLinux-HOWTO.txt b/Documentation/virtual/uml/UserModeLinux-HOWTO.txt
new file mode 100644
index 000000000000..9b7e1904db1c
--- /dev/null
+++ b/Documentation/virtual/uml/UserModeLinux-HOWTO.txt
@@ -0,0 +1,4579 @@
1 User Mode Linux HOWTO
2 User Mode Linux Core Team
3 Mon Nov 18 14:16:16 EST 2002
4
5 This document describes the use and abuse of Jeff Dike's User Mode
6 Linux: a port of the Linux kernel as a normal Intel Linux process.
7 ______________________________________________________________________
8
9 Table of Contents
10
11 1. Introduction
12
13 1.1 How is User Mode Linux Different?
14 1.2 Why Would I Want User Mode Linux?
15
16 2. Compiling the kernel and modules
17
18 2.1 Compiling the kernel
19 2.2 Compiling and installing kernel modules
20 2.3 Compiling and installing uml_utilities
21
22 3. Running UML and logging in
23
24 3.1 Running UML
25 3.2 Logging in
26 3.3 Examples
27
28 4. UML on 2G/2G hosts
29
30 4.1 Introduction
31 4.2 The problem
32 4.3 The solution
33
34 5. Setting up serial lines and consoles
35
36 5.1 Specifying the device
37 5.2 Specifying the channel
38 5.3 Examples
39
40 6. Setting up the network
41
42 6.1 General setup
43 6.2 Userspace daemons
44 6.3 Specifying ethernet addresses
45 6.4 UML interface setup
46 6.5 Multicast
47 6.6 TUN/TAP with the uml_net helper
48 6.7 TUN/TAP with a preconfigured tap device
49 6.8 Ethertap
50 6.9 The switch daemon
51 6.10 Slip
52 6.11 Slirp
53 6.12 pcap
54 6.13 Setting up the host yourself
55
56 7. Sharing Filesystems between Virtual Machines
57
58 7.1 A warning
59 7.2 Using layered block devices
60 7.3 Note!
61 7.4 Another warning
62 7.5 uml_moo : Merging a COW file with its backing file
63
64 8. Creating filesystems
65
66 8.1 Create the filesystem file
67 8.2 Assign the file to a UML device
68 8.3 Creating and mounting the filesystem
69
70 9. Host file access
71
72 9.1 Using hostfs
73 9.2 hostfs as the root filesystem
74 9.3 Building hostfs
75
76 10. The Management Console
77 10.1 version
78 10.2 halt and reboot
79 10.3 config
80 10.4 remove
81 10.5 sysrq
82 10.6 help
83 10.7 cad
84 10.8 stop
85 10.9 go
86
87 11. Kernel debugging
88
89 11.1 Starting the kernel under gdb
90 11.2 Examining sleeping processes
91 11.3 Running ddd on UML
92 11.4 Debugging modules
93 11.5 Attaching gdb to the kernel
94 11.6 Using alternate debuggers
95
96 12. Kernel debugging examples
97
98 12.1 The case of the hung fsck
99 12.2 Episode 2: The case of the hung fsck
100
101 13. What to do when UML doesn't work
102
103 13.1 Strange compilation errors when you build from source
104 13.2 (obsolete)
105 13.3 A variety of panics and hangs with /tmp on a reiserfs filesystem
106 13.4 The compile fails with errors about conflicting types for 'open', 'dup', and 'waitpid'
107 13.5 UML doesn't work when /tmp is an NFS filesystem
108 13.6 UML hangs on boot when compiled with gprof support
109 13.7 syslogd dies with a SIGTERM on startup
110 13.8 TUN/TAP networking doesn't work on a 2.4 host
111 13.9 You can network to the host but not to other machines on the net
112 13.10 I have no root and I want to scream
113 13.11 UML build conflict between ptrace.h and ucontext.h
114 13.12 The UML BogoMips is exactly half the host's BogoMips
115 13.13 When you run UML, it immediately segfaults
116 13.14 xterms appear, then immediately disappear
117 13.15 Any other panic, hang, or strange behavior
118
119 14. Diagnosing Problems
120
121 14.1 Case 1 : Normal kernel panics
122 14.2 Case 2 : Tracing thread panics
123 14.3 Case 3 : Tracing thread panics caused by other threads
124 14.4 Case 4 : Hangs
125
126 15. Thanks
127
128 15.1 Code and Documentation
129 15.2 Flushing out bugs
130 15.3 Buglets and clean-ups
131 15.4 Case Studies
132 15.5 Other contributions
133
134
135 ______________________________________________________________________
136
  1. Introduction
138
139 Welcome to User Mode Linux. It's going to be fun.
140
141
142
  1.1. How is User Mode Linux Different?
144
145 Normally, the Linux Kernel talks straight to your hardware (video
146 card, keyboard, hard drives, etc), and any programs which run ask the
147 kernel to operate the hardware, like so:
148
149
150
151 +-----------+-----------+----+
152 | Process 1 | Process 2 | ...|
153 +-----------+-----------+----+
154 | Linux Kernel |
155 +----------------------------+
156 | Hardware |
157 +----------------------------+
158
159
160
161
162 The User Mode Linux Kernel is different; instead of talking to the
163 hardware, it talks to a `real' Linux kernel (called the `host kernel'
164 from now on), like any other program. Programs can then run inside
165 User-Mode Linux as if they were running under a normal kernel, like
166 so:
167
168
169
170 +----------------+
171 | Process 2 | ...|
172 +-----------+----------------+
173 | Process 1 | User-Mode Linux|
174 +----------------------------+
175 | Linux Kernel |
176 +----------------------------+
177 | Hardware |
178 +----------------------------+
179
180
181
182
183
  1.2. Why Would I Want User Mode Linux?
185
186
187 1. If User Mode Linux crashes, your host kernel is still fine.
188
189 2. You can run a usermode kernel as a non-root user.
190
191 3. You can debug the User Mode Linux like any normal process.
192
193 4. You can run gprof (profiling) and gcov (coverage testing).
194
195 5. You can play with your kernel without breaking things.
196
197 6. You can use it as a sandbox for testing new apps.
198
199 7. You can try new development kernels safely.
200
201 8. You can run different distributions simultaneously.
202
203 9. It's extremely fun.
204
205
206
207
208
  2. Compiling the kernel and modules
210
211
212
213
  2.1. Compiling the kernel
215
216
217 Compiling the user mode kernel is just like compiling any other
218 kernel. Let's go through the steps, using 2.4.0-prerelease (current
219 as of this writing) as an example:
220
221
  1. Download the latest UML patch from

     the download page <http://user-mode-linux.sourceforge.net/>
225
226 In this example, the file is uml-patch-2.4.0-prerelease.bz2.
227
228
  2. Download the matching kernel from your favourite kernel mirror,
     such as:

     <ftp://ftp.ca.kernel.org/pub/kernel/v2.4/linux-2.4.0-prerelease.tar.bz2>
235
236
  3. Make a directory and unpack the kernel into it. The tarball is
     bzip2-compressed, so tar needs -j rather than -z.


       host% mkdir ~/uml
       host% cd ~/uml
       host% tar -xjvf linux-2.4.0-prerelease.tar.bz2
259
260
261
262
263
264
  4. Apply the patch using


       host% cd ~/uml/linux
       host% bzcat uml-patch-2.4.0-prerelease.bz2 | patch -p1
276
277
278
279
280
281
  5. Run your favorite config; `make xconfig ARCH=um' is the most
     convenient. `make config ARCH=um' and `make menuconfig ARCH=um'
     will work as well. The defaults will give you a useful kernel. If
     you want to change something, go ahead; it probably won't hurt
     anything.
287
288
289 Note: If the host is configured with a 2G/2G address space split
290 rather than the usual 3G/1G split, then the packaged UML binaries
291 will not run. They will immediately segfault. See ``UML on 2G/2G
292 hosts'' for the scoop on running UML on your system.
293
294
295
296 6. Finish with `make linux ARCH=um': the result is a file called
297 `linux' in the top directory of your source tree.
298
299 Make sure that you don't build this kernel in /usr/src/linux. On some
300 distributions, /usr/include/asm is a link into this pool. The user-
301 mode build changes the other end of that link, and things that include
302 <asm/anything.h> stop compiling.
303
  The sources are also available from CVS at the project's CVS page,
  which has directions on getting the sources. You can also browse the
  CVS pool from there.
307
308 If you get the CVS sources, you will have to check them out into an
309 empty directory. You will then have to copy each file into the
310 corresponding directory in the appropriate kernel pool.
311
312 If you don't have the latest kernel pool, you can get the
313 corresponding user-mode sources with
314
315
316 host% cvs co -r v_2_3_x linux
317
318
319
320
321 where 'x' is the version in your pool. Note that you will not get the
322 bug fixes and enhancements that have gone into subsequent releases.
323
324
  2.2. Compiling and installing kernel modules
326
327 UML modules are built in the same way as the native kernel (with the
328 exception of the 'ARCH=um' that you always need for UML):
329
330
331 host% make modules ARCH=um
332
333
334
335
336 Any modules that you want to load into this kernel need to be built in
337 the user-mode pool. Modules from the native kernel won't work.
338
339 You can install them by using ftp or something to copy them into the
340 virtual machine and dropping them into /lib/modules/`uname -r`.
341
342 You can also get the kernel build process to install them as follows:
343
344 1. with the kernel not booted, mount the root filesystem in the top
345 level of the kernel pool:
346
347
348 host% mount root_fs mnt -o loop
349
350
351
352
353
354
355 2. run
356
357
358 host%
359 make modules_install INSTALL_MOD_PATH=`pwd`/mnt ARCH=um
360
361
362
363
364
365
366 3. unmount the filesystem
367
368
369 host% umount mnt
370
371
372
373
374
375
376 4. boot the kernel on it
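The first three steps above can be strung together in a small wrapper
script. What follows is only a sketch under stated assumptions: the
names run, install_uml_modules, and DRY_RUN are invented for
illustration and are not part of UML, and the image (root_fs) and mount
point (mnt) are assumed to sit in the top level of the kernel pool.
With DRY_RUN left at its default of 1 the script only prints the
commands; set it to 0 (as root, since loop mounts normally need root)
to actually run them.

```shell
#!/bin/sh
# Sketch: install UML modules into a root filesystem image.
# run, install_uml_modules and DRY_RUN are hypothetical names.

DRY_RUN=${DRY_RUN:-1}

# Print the command; execute it only when DRY_RUN is 0.
run() {
    echo "$*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@"
    fi
}

# $1 = root filesystem image, $2 = mount point in the kernel pool
install_uml_modules() {
    run mount "$1" "$2" -o loop &&
    run make modules_install "INSTALL_MOD_PATH=$(pwd)/$2" ARCH=um &&
    run umount "$2"
}

install_uml_modules root_fs mnt
```

Step 4, booting the kernel on the updated filesystem, is left out
because it is just the usual 'linux' invocation.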
377
378
379 When the system is booted, you can use insmod as usual to get the
380 modules into the kernel. A number of things have been loaded into UML
381 as modules, especially filesystems and network protocols and filters,
382 so most symbols which need to be exported probably already are.
383 However, if you do find symbols that need exporting, let us
384 <http://user-mode-linux.sourceforge.net/> know, and
385 they'll be "taken care of".
386
387
388
389 2.3. Compiling and installing uml_utilities
390
391 Many features of the UML kernel require a user-space helper program,
392 so a uml_utilities package is distributed separately from the kernel
393 patch which provides these helpers. Included within this is:
394
395 +o port-helper - Used by consoles which connect to xterms or ports
396
397 +o tunctl - Configuration tool to create and delete tap devices
398
399 +o uml_net - Setuid binary for automatic tap device configuration
400
401 +o uml_switch - User-space virtual switch required for daemon
402 transport
403
404 The uml_utilities tree is compiled with:
405
406
407 host#
408 make && make install
409
410
411
412
413 Note that UML kernel patches may require a specific version of the
414 uml_utilities distribution. If you don't keep up with the mailing
415 lists, ensure that you have the latest release of uml_utilities if you
416 are experiencing problems with your UML kernel, particularly when
417 dealing with consoles or command-line switches to the helper programs.
418
419
420
421
422
423
424
425
426 3. Running UML and logging in
427
428
429
430 3.1. Running UML
431
432 UML runs on hosts with a 2.2.15 or later kernel, and on all 2.4 kernels.
433
434
435 Booting UML is straightforward. Simply run 'linux': it will try to
436 mount the file `root_fs' in the current directory. You do not need to
437 run it as root. If your root filesystem is not named `root_fs', then
438 you need to put a `ubd0=root_fs_whatever' switch on the linux command
439 line.
440
441
442 You will need a filesystem to boot UML from. There are a number
443 available for download from here
444 <http://user-mode-linux.sourceforge.net/>. There are also several tools
445 <http://user-mode-linux.sourceforge.net/> which can be
446 used to generate UML-compatible filesystem images from media.
447 The kernel will boot up and present you with a login prompt.
448
449
450 Note: If the host is configured with a 2G/2G address space split
451 rather than the usual 3G/1G split, then the packaged UML binaries will
452 not run. They will immediately segfault. See ``UML on 2G/2G hosts''
453 for the scoop on running UML on your system.
454
455
456
457 3.2. Logging in
458
459
460
461 The prepackaged filesystems have a root account with password 'root'
462 and a user account with password 'user'. The login banner will
463 generally tell you how to log in. So, you log in and you will find
464 yourself inside a little virtual machine. Our filesystems have a
465 variety of commands and utilities installed (and it is fairly easy to
466 add more), so you will have a lot of tools with which to poke around
467 the system.
468
469 There are a couple of other ways to log in:
470
471 +o On a virtual console
472
473
474
475 Each virtual console that is configured (i.e. the device exists in
476 /dev and /etc/inittab runs a getty on it) will come up in its own
477 xterm. If you get tired of the xterms, read ``Setting up serial
478 lines and consoles'' to see how to attach the consoles to
479 something else, like host ptys.
480
481
482
483 +o Over the serial line
484
485
486 In the boot output, find a line that looks like:
487
488
489
490 serial line 0 assigned pty /dev/ptyp1
491
492
493
494
495 Attach your favorite terminal program to the corresponding tty. For
496 example, with minicom the command would be
497
498
499 host% minicom -o -p /dev/ttyp1
500
501
502
503
504
505
506 +o Over the net
507
508
509 If the network is running, then you can telnet to the virtual
510 machine and log in to it. See ``Setting up the network'' to learn
511 about setting up a virtual network.
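For the serial line case, going from the boot output to the tty you
attach to is mechanical. The sketch below assumes the boot line has
exactly the form shown above and that the host uses BSD-style pty names
(/dev/ptyXX paired with /dev/ttyXX); the function names pty_to_tty and
serial_tty are made up for this example.

```shell
#!/bin/sh
# Sketch: find the tty that corresponds to a UML serial line.
# pty_to_tty and serial_tty are hypothetical helper names.

# Map a BSD-style pty master to its slave: /dev/ptyp1 -> /dev/ttyp1
pty_to_tty() {
    echo "$1" | sed 's|^/dev/pty|/dev/tty|'
}

# Read UML boot output on stdin and print the tty for serial line $1.
# Returns non-zero if no matching line is found.
serial_tty() (
    pty=$(sed -n "s|.*serial line $1 assigned pty \(/dev/pty[^ ]*\).*|\1|p" | head -n 1)
    [ -n "$pty" ] && pty_to_tty "$pty"
)

echo "serial line 0 assigned pty /dev/ptyp1" | serial_tty 0
# prints /dev/ttyp1
```

The result can be handed straight to a terminal program, as in the
minicom command above.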
512
513 When you're done using it, run halt, and the kernel will bring itself
514 down and the process will exit.
515
516
517 3.3. Examples
518
519 Here are some examples of UML in action:
520
521 +o A login session <http://user-mode-linux.sourceforge.net/login.html>
522
523 +o A virtual network <http://user-mode-linux.sourceforge.net/net.html>
524
525
526
527
528
529
530
531 4. UML on 2G/2G hosts
532
533
534
535
536 4.1. Introduction
537
538
539 Most Linux machines are configured so that the kernel occupies the
540 upper 1G (0xc0000000 - 0xffffffff) of the 4G address space and
541 processes use the lower 3G (0x00000000 - 0xbfffffff). However, some
542 machines are configured with a 2G/2G split, with the kernel occupying
543 the upper 2G (0x80000000 - 0xffffffff) and processes using the lower
544 2G (0x00000000 - 0x7fffffff).
545
546
547
548
549 4.2. The problem
550
551
552 The prebuilt UML binaries on this site will not run on 2G/2G hosts
553 because UML occupies the upper .5G of the 3G process address space
554 (0xa0000000 - 0xbfffffff). Obviously, on 2G/2G hosts, this is right
555 in the middle of the kernel address space, so UML won't even load - it
556 will immediately segfault.
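The arithmetic behind this is easy to check. The sketch below uses only
the addresses quoted in the text; the function name
fits_in_process_space is invented for illustration, and the shell is
assumed to do 64-bit arithmetic, as current sh implementations do.

```shell
#!/bin/sh
# Sketch: does the prebuilt UML load range fit below the top of the
# process address space? fits_in_process_space is a hypothetical name.

UML_START=$((0xa0000000))
UML_END=$((0xbfffffff))

# $1 = highest address available to processes on the host
fits_in_process_space() {
    [ "$UML_END" -le "$(($1))" ]
}

echo "UML range size: $(( (UML_END - UML_START + 1) / 1048576 )) MiB"

for top in 0xbfffffff 0x7fffffff; do
    if fits_in_process_space "$top"; then
        echo "process top $top: UML range fits"
    else
        echo "process top $top: UML range lands in kernel space"
    fi
done
```

The first address is the top of process space on a 3G/1G host, the
second on a 2G/2G host.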
557
558
559
560
561 4.3. The solution
562
563
564 The fix for this is to rebuild UML from source after enabling
565 CONFIG_HOST_2G_2G (under 'General Setup'). This will cause UML to
566 load itself in the top .5G of that smaller process address space,
567 where it will run fine. See ``Compiling the kernel and modules'' if
568 you need help building UML from source.
569
570
571
572
573
574
575
576
577
578
579 5. Setting up serial lines and consoles
580
581
582 It is possible to attach UML serial lines and consoles to many types
583 of host I/O channels by specifying them on the command line.
584
585
586 You can attach them to host ptys, ttys, file descriptors, and ports.
587 This allows you to do things like
588
589 +o have a UML console appear on an unused host console,
590
591 +o hook two virtual machines together by having one attach to a pty
592 and having the other attach to the corresponding tty
593
594 +o make a virtual machine accessible from the net by attaching a
595 console to a port on the host.
596
597
598 The general format of the command line option is device=channel.
599
600
601
602 5.1. Specifying the device
603
604 Devices are specified with "con" or "ssl" (console or serial line,
605 respectively), optionally with a device number if you are talking
606 about a specific device.
607
608
609 Using just "con" or "ssl" describes all of the consoles or serial
610 lines. If you want to talk about console #3 or serial line #10, they
611 would be "con3" and "ssl10", respectively.
612
613
614 A specific device name will override a less general "con=" or "ssl=".
615 So, for example, you can assign a pty to each of the serial lines
616 except for the first two like this:
617
618
619 ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1
620
621
622
623
624 The specificity of the device name is all that matters; order on the
625 command line is irrelevant.
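The override rule can be modelled in a few lines of shell. This is a
sketch of the rule as described here, not of UML's actual option
parser; the function name channel_for is made up for the example.

```shell
#!/bin/sh
# Sketch: which channel applies to a given device?
# channel_for is a hypothetical name; it models the rule in the text.

# channel_for DEVICE ARG...
# Prints the channel for DEVICE (e.g. ssl0) given device=channel
# arguments. A specific name (ssl0=) beats the generic one (ssl=),
# whatever the order of the arguments.
channel_for() (
    dev=$1; shift
    class=$(echo "$dev" | sed 's/[0-9]*$//')    # ssl10 -> ssl
    generic=
    specific=
    for arg in "$@"; do
        case $arg in
            "$dev="*)   specific=${arg#*=} ;;
            "$class="*) generic=${arg#*=} ;;
        esac
    done
    echo "${specific:-$generic}"
)

channel_for ssl0 ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1
# prints tty:/dev/tty0
channel_for ssl5 ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1
# prints pty
```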
626
627
628
629 5.2. Specifying the channel
630
631 There are a number of different types of channels to attach a UML
632 device to, each with a different way of specifying exactly what to
633 attach to.
634
635 +o pseudo-terminals - device=pty; pts terminals - device=pts
636
637
638 This will cause UML to allocate a free host pseudo-terminal for the
639 device. The terminal that it got will be announced in the boot
640 log. You access it by attaching a terminal program to the
641 corresponding tty:
642
643 +o screen /dev/pts/n
644
645 +o screen /dev/ttyxx
646
647 +o minicom -o -p /dev/ttyxx - minicom seems not able to handle pts
648 devices
649
650 +o kermit - start it up, 'open' the device, then 'connect'
651
652
653
654
655
656 +o terminals - device=tty:tty device file
657
658
659 This will make UML attach the device to the specified tty. For example,
660
661
662 con1=tty:/dev/tty3
663
664
665
666
667 will attach UML's console 1 to the host's /dev/tty3. If the tty that
668 you specify is the slave end of a tty/pty pair, something else must
669 have already opened the corresponding pty in order for this to work.
670
671
672
673
674
675 +o xterms - device=xterm
676
677
678 UML will run an xterm and the device will be attached to it.
679
680
681
682
683
684 +o Port - device=port:port number
685
686
687 This will attach the UML devices to the specified host port.
688 Attaching console 1 to the host's port 9000 would be done like
689 this:
690
691
692 con1=port:9000
693
694
695
696
697 Attaching all the serial lines to that port would be done similarly:
698
699
700 ssl=port:9000
701
702
703
704
705 You access these devices by telnetting to that port. Each active
706 telnet session gets a different device. If there are more telnets to a
707 port than UML devices attached to it, then the extra telnet sessions
708 will block until an existing telnet detaches, or until another device
709 becomes active (e.g. by being activated in /etc/inittab).
710
711 This channel has the advantage that you can both attach multiple UML
712 devices to it and know how to access them without reading the UML boot
713 log. It is also unique in allowing access to a UML from remote
714 machines without requiring that the UML be networked. This could be
715 useful in allowing public access to UMLs because they would be
716 accessible from the net, but wouldn't need any kind of network
717 filtering or access control because they would have no network access.
718
719
720 If you attach the main console to a port, then the UML boot will
721 appear to hang. In reality, it's waiting for a telnet to connect, at
722 which point the boot will proceed.
723
724
725
726
727
728 +o already-existing file descriptors - device=fd:file descriptor
729
730
731 If you set up a file descriptor on the UML command line, you can
732 attach a UML device to it. This is most commonly used to put the
733 main console back on stdin and stdout after assigning all the other
734 consoles to something else:
735
736
737 con0=fd:0,fd:1 con=pts
738
739
740
741
742
743
744
745
746 +o Nothing - device=null
747
748
749 This allows the device to be opened, in contrast to 'none', but
750 reads will block, and writes will succeed and the data will be
751 thrown out.
752
753
754
755
756
757 +o None - device=none
758
759
760 This causes the device to disappear.
761
762
763
764 You can also specify different input and output channels for a device
765 by putting a comma between them:
766
767
768 ssl3=tty:/dev/tty2,xterm
769
770
771
772
773 will cause serial line 3 to accept input on the host's /dev/tty2 and
774 display output on an xterm. That's a silly example - the most common
775 use of this syntax is to reattach the main console to stdin and stdout
776 as shown above.
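To make the comma syntax concrete, here is a minimal sketch that splits
a channel specification into its input and output halves. The function
name split_channel is invented for illustration; it mirrors the
description above rather than UML's own parsing code.

```shell
#!/bin/sh
# Sketch: split "input,output" channel specs.
# split_channel is a hypothetical name.

# A spec without a comma uses the same channel in both directions.
split_channel() {
    case $1 in
        *,*) echo "in=${1%%,*} out=${1#*,}" ;;
        *)   echo "in=$1 out=$1" ;;
    esac
}

split_channel fd:0,fd:1     # prints in=fd:0 out=fd:1
split_channel pts           # prints in=pts out=pts
```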
777
778
779 If you decide to move the main console away from stdin/stdout, the
780 initial boot output will appear in the terminal that you're running
781 UML in. However, once the console driver has been officially
782 initialized, then the boot output will start appearing wherever you
783 specified that console 0 should be. That device will receive all
784 subsequent output.
785
786
787
788 5.3. Examples
789
790 There are a number of interesting things you can do with this
791 capability.
792
793
794 First, this is how you get rid of those bleeding console xterms by
795 attaching them to host ptys:
796
797
798 con=pty con0=fd:0,fd:1
799
800
801
802
803 This will make a UML console take over an unused host virtual console,
804 so that when you switch to it, you will see the UML login prompt
805 rather than the host login prompt:
806
807
808 con1=tty:/dev/tty6
809
810
811
812
813 You can attach two virtual machines together with what amounts to a
814 serial line as follows:
815
816 Run one UML with a serial line attached to a pty -
817
818
819 ssl1=pty
820
821
822
823
824 Look at the boot log to see what pty it got (this example will assume
825 that it got /dev/ptyp1).
826
827 Boot the other UML with a serial line attached to the corresponding
828 tty -
829
830
831 ssl1=tty:/dev/ttyp1
832
833
834
835
836 Log in, make sure that it has no getty on that serial line, attach a
837 terminal program like minicom to it, and you should see the login
838 prompt of the other virtual machine.
839
840
841 6. Setting up the network
842
843
844
845 This page describes how to set up the various transports and to
846 provide a UML instance with network access to the host, other machines
847 on the local net, and the rest of the net.
848
849
850 As of 2.4.5, UML networking has been completely redone to make it much
851 easier to set up, fix bugs, and add new features.
852
853
854 There is a new helper, uml_net, which does the host setup that
855 requires root privileges.
856
857
858 There are currently seven transport types available for a UML virtual
859 machine to exchange packets with other hosts:
860
861 +o ethertap
862
863 +o TUN/TAP
864
865 +o Multicast
866
867 +o a switch daemon
868
869 +o slip
870
871 +o slirp
872
873 +o pcap
874
875 The TUN/TAP, ethertap, slip, and slirp transports allow a UML
876 instance to exchange packets with the host. They may be directed
877 to the host or the host may just act as a router to provide access
878 to other physical or virtual machines.
879
880
881 The pcap transport is a synthetic read-only interface, using the
882 libpcap library to collect packets from interfaces on the host and
883 filter them. This is useful for building preconfigured traffic
884 monitors or sniffers.
885
886
887 The daemon and multicast transports provide a completely virtual
888 network to other virtual machines. This network is completely
889 disconnected from the physical network unless one of the virtual
890 machines on it is acting as a gateway.
891
892
893 With so many host transports, which one should you use? Here's when
894 you should use each one:
895
896 +o ethertap - if you want access to the host networking and it is
897 running 2.2
898
899 +o TUN/TAP - if you want access to the host networking and it is
900 running 2.4. Also, the TUN/TAP transport is able to use a
901 preconfigured device, allowing it to avoid using the setuid uml_net
902 helper, which is a security advantage.
903
904 +o Multicast - if you want a purely virtual network and you don't want
905 to set up anything but the UML
906
907 +o a switch daemon - if you want a purely virtual network and you
908 don't mind running the daemon in order to get somewhat better
909 performance
910
911 +o slip - there is no particular reason to run the slip backend unless
912 ethertap and TUN/TAP are just not available for some reason
913
914 +o slirp - if you don't have root access on the host to setup
915 networking, or if you don't want to allocate an IP to your UML
916
917 +o pcap - not much use for actual network connectivity, but great for
918 monitoring traffic on the host
919
920 Ethertap is available on 2.4 and works fine. TUN/TAP is preferred
921 to it because it has better performance and ethertap is officially
922 considered obsolete in 2.4. Also, the root helper only needs to
923 run occasionally for TUN/TAP, rather than handling every packet, as
924 it does with ethertap. This is a slight security advantage since
925 it provides fewer opportunities for a nasty UML user to somehow
926 exploit the helper's root privileges.
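The recommendations above boil down to a short decision procedure. The
sketch below covers only the four most common outcomes and deliberately
ignores the switch daemon, slip, and pcap; the function name
pick_transport and its three yes/no style arguments are invented for
this example.

```shell
#!/bin/sh
# Sketch: choose a transport from the recommendations in the text.
# pick_transport and its argument convention are hypothetical.

# pick_transport WANT_HOST_NET HOST_SERIES HAVE_ROOT
#   WANT_HOST_NET  yes if the UML must reach the host network
#   HOST_SERIES    host kernel series, 2.2 or 2.4
#   HAVE_ROOT      yes if root is available for host setup
pick_transport() {
    if [ "$1" != yes ]; then
        echo mcast        # purely virtual network, nothing else to set up
    elif [ "$3" != yes ]; then
        echo slirp        # no root access on the host
    elif [ "$2" = 2.2 ]; then
        echo ethertap     # host networking on a 2.2 host
    else
        echo tuntap       # host networking on a 2.4 host
    fi
}

pick_transport yes 2.4 yes    # prints tuntap
pick_transport yes 2.4 no     # prints slirp
```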
927
928
929 6.1. General setup
930
931 First, you must have the virtual network enabled in your UML. If you are
932 running a prebuilt kernel from this site, everything is already
933 enabled. If you build the kernel yourself, under the "Network device
934 support" menu, enable "Network device support", and then the
935 transports you want to use.
936
937
938 The next step is to provide a network device to the virtual machine.
939 This is done by describing it on the kernel command line.
940
941 The general format is
942
943
944 eth<n>=<transport>,<transport args>
945
946
947
948
949 For example, a virtual ethernet device may be attached to a host
950 ethertap device as follows:
951
952
953 eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
954
955
956
957
958 This sets up eth0 inside the virtual machine to attach itself to the
959 host /dev/tap0, assigns it an ethernet address, and assigns the host
960 tap0 interface an IP address.
961
962
963
964 Note that the IP address you assign to the host end of the tap device
965 must be different from the IP you assign to the eth device inside UML.
966 If you are short on IPs and don't want to consume two per UML, then
967 you can reuse the host's eth IP address for the host ends of the tap
968 devices. Internally, the UMLs must still get unique IPs for their eth
969 devices. You can also give the UMLs non-routable IPs (192.168.x.x or
970 10.x.x.x) and have the host masquerade them. This will let outgoing
971 connections work, but incoming connections won't without more work,
972 such as port forwarding from the host.
973 Also note that when you configure the host side of an interface, it is
974 only acting as a gateway. It will respond to pings sent to it
975 locally, but it is not useful to do that since it's a host interface.
976 You are not talking to the UML when you ping that interface and get a
977 response.
978
979
980 You can also add devices to a UML and remove them at runtime. See the
981 ``The Management Console'' page for details.
982
983
984 The sections below describe this in more detail.
985
986
987 Once you've decided how you're going to set up the devices, you boot
988 UML, log in, configure the UML side of the devices, and set up routes
989 to the outside world. At that point, you will be able to talk to any
990 other machines, physical or virtual, on the net.
991
992
993 If ifconfig inside UML fails and the network refuses to come up, run
994 dmesg inside UML; the kernel log will usually tell you what went wrong.
995
996
997
998 6.2. Userspace daemons
999
1000 You will likely need the setuid helper, or the switch daemon, or both.
1001 They are both installed with the RPM and deb, so if you've installed
1002 either, you can skip the rest of this section.
1003
1004
1005 If not, then you need to check them out of CVS, build them, and
1006 install them. The helper is uml_net, in CVS /tools/uml_net, and the
1007 daemon is uml_switch, in CVS /tools/uml_router. They are both built
1008 with a plain 'make'. Both need to be installed in a directory that's
1009 in your path - /usr/bin is recommended. On top of that, uml_net needs
1010 to be setuid root.
1011
1012
1013
1014 6.3. Specifying ethernet addresses
1015
1016 Below, you will see that the TUN/TAP, ethertap, and daemon interfaces
1017 allow you to specify hardware addresses for the virtual ethernet
1018 devices. This is generally not necessary. If you don't have a
1019 specific reason to do it, you probably shouldn't. If one is not
1020 specified on the command line, the driver will assign one based on the
1021 device IP address. It will provide the address fe:fd:nn:nn:nn:nn
1022 where nn.nn.nn.nn is the device IP address. This is nearly always
1023 sufficient to guarantee a unique hardware address for the device. A
1024 couple of exceptions are:
1025
1026 +o Another set of virtual ethernet devices are on the same network and
1027 they are assigned hardware addresses using a different scheme which
1028 may conflict with the UML IP address-based scheme
1029
1030 +o You aren't going to use the device for IP networking, so you don't
1031 assign the device an IP address
1032
1033 If you let the driver provide the hardware address, you should make
1034 sure that the device IP address is known before the interface is
1035 brought up. So, inside UML, this will guarantee that:
1036
1037
1038
1039 UML#
1040 ifconfig eth0 192.168.0.250 up
1041
1042
1043
1044
1045 If you decide to assign the hardware address yourself, make sure that
1046 the first byte of the address is even. Addresses with an odd first
1047 byte are multicast addresses (broadcast included), which you don't
1048 want assigned to a device.
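Both rules in this section are simple enough to check by hand or in a
script. The sketch below assumes that the nn bytes of the driver's
fe:fd:nn:nn:nn:nn address are the IP address bytes written in hex, as
is usual for ethernet addresses; the function names ip_to_mac and
first_byte_even are made up for illustration.

```shell
#!/bin/sh
# Sketch: the IP-based address scheme and the even-first-byte rule.
# ip_to_mac and first_byte_even are hypothetical names.

# ip_to_mac A.B.C.D: print fe:fd:aa:bb:cc:dd (IP bytes in hex, assumed)
ip_to_mac() (
    IFS=.
    set -- $1
    printf 'fe:fd:%02x:%02x:%02x:%02x\n' "$1" "$2" "$3" "$4"
)

# first_byte_even MAC: succeed if the first byte of MAC is even
first_byte_even() {
    [ $(( 0x${1%%:*} % 2 )) -eq 0 ]
}

ip_to_mac 192.168.0.250                  # prints fe:fd:c0:a8:00:fa
first_byte_even fe:fd:c0:a8:00:fa && echo "usable as a device address"
```

Since 0xfe is even, every address produced by the IP-based scheme
passes the check.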
1049
1050
1051
1052 6.4. UML interface setup
1053
1054 Once the network devices have been described on the command line, you
1055 should boot UML and log in.
1056
1057
1058 The first thing to do is bring the interface up:
1059
1060
1061 UML# ifconfig ethn ip-address up
1062
1063
1064
1065
1066 You should be able to ping the host at this point.
1067
1068
1069 To reach the rest of the world, you should set a default route to the
1070 host:
1071
1072
1073 UML# route add default gw host ip
1074
1075
1076
1077
1078 For example, with a host IP of 192.168.0.4:
1079
1080
1081 UML# route add default gw 192.168.0.4
1082
1083
1084
1085
1086 This page used to recommend setting a network route to your local net.
1087 This is wrong, because it will cause UML to try to figure out hardware
1088 addresses of the local machines by arping on the interface to the
1089 host. Since that interface is basically a single strand of ethernet
1090 with two nodes on it (UML and the host) and arp requests don't cross
1091 networks, they will fail to elicit any responses. So, what you want
1092 is for UML to just blindly throw all packets at the host and let it
1093 figure out what to do with them, which is what leaving out the network
1094 route and adding the default route does.
1095
1096
1097 Note: If you can't communicate with other hosts on your physical
1098 ethernet, it's probably because of a network route that's
1099 automatically set up. If you run 'route -n' and see a route that
1100 looks like this:
1101
1102
1103
1104
1105 Destination Gateway Genmask Flags Metric Ref Use Iface
1106 192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
1107
1108
1109
1110
1111 with a mask that's not 255.255.255.255, then replace it with a route
1112 to your host:
1113
1114
1115 UML#
1116 route del -net 192.168.0.0 dev eth0 netmask 255.255.255.0
1117
1118
1119
1120
1121
1122
1123 UML#
1124 route add -host 192.168.0.4 dev eth0
1125
1126
1127
1128
1129 This, plus the default route to the host, will allow UML to exchange
1130 packets with any machine on your ethernet.
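Put together, the commands in this section form a short sequence. The
sketch below only prints them, in the order they are introduced above,
so it is safe to run anywhere; paste the output into a UML root shell
to apply it. The function name uml_route_commands and its argument
order are invented for this example.

```shell
#!/bin/sh
# Sketch: print the interface and route setup to run inside UML.
# uml_route_commands is a hypothetical name.

# uml_route_commands DEV UML_IP HOST_IP NET NETMASK
uml_route_commands() {
    echo "ifconfig $1 $2 up"
    echo "route add default gw $3"
    # Replace the automatic network route with a host route.
    echo "route del -net $4 dev $1 netmask $5"
    echo "route add -host $3 dev $1"
}

uml_route_commands eth0 192.168.0.250 192.168.0.4 192.168.0.0 255.255.255.0
```

The last two commands are only needed if 'route -n' shows the
automatic network route described in the note above.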
1131
1132
1133
1134 6.5. Multicast
1135
1136 The simplest way to set up a virtual network between multiple UMLs is
1137 to use the mcast transport. This was written by Harald Welte and is
1138 present in UML version 2.4.5-5um and later. Your system must have
1139 multicast enabled in the kernel and there must be a multicast-capable
1140 network device on the host. Normally, this is eth0, but if there is
1141 no ethernet card on the host, then you will likely get strange error
1142 messages when you bring the device up inside UML.
1143
1144
1145 To use it, run two UMLs with
1146
1147
1148 eth0=mcast
1149
1150
1151
1152
1153 on their command lines. Log in, configure the ethernet device in each
1154 machine with different IP addresses:
1155
1156
1157 UML1# ifconfig eth0 192.168.0.254
1158
1159
1160
1161
1162
1163
1164 UML2# ifconfig eth0 192.168.0.253
1165
1166
1167
1168
1169 and they should be able to talk to each other.
1170
1171 The full set of command line options for this transport is
1172
1173
1174
1175 ethn=mcast,ethernet address,multicast address,multicast port,ttl
1177
1178
1179
1180
1181 Harald's original README is here
1182 <http://user-mode-linux.sourceforge.net/> and explains these in
1183 detail, as well as some other issues.
1184
1185
1186
1187 6.6. TUN/TAP with the uml_net helper
1188
1189 TUN/TAP is the preferred mechanism on 2.4 to exchange packets with the
1190 host. The TUN/TAP backend has been in UML since 2.4.9-3um.
1191
1192
1193 The easiest way to get up and running is to let the setuid uml_net
1194 helper do the host setup for you. This involves insmod-ing the tun.o
1195 module if necessary, configuring the device, and setting up IP
1196 forwarding, routing, and proxy arp. If you are new to UML networking,
1197 do this first. If you're concerned about the security implications of
1198 the setuid helper, use it to get up and running, then read the next
1199 section to see how to have UML use a preconfigured tap device, which
1200 avoids the use of uml_net.
1201
1202
1203 If you specify an IP address for the host side of the device, the
1204 uml_net helper will do all necessary setup on the host - the only
1205 requirement is that TUN/TAP be available, either built in to the host
1206 kernel or as the tun.o module.
1207
1208 The format of the command line switch to attach a device to a TUN/TAP
1209 device is
1210
1211
1212 eth<n>=tuntap,,,<IP address>
1213
1214
1215
1216
1217 For example, this argument will attach the UML's eth0 to the next
1218 available tap device and assign an ethernet address to it based on its
1219 IP address
1220
1221
1222 eth0=tuntap,,,192.168.0.254
1223
1224
1225
1226
1227
1228
1229 Note that the IP address that must be used for the eth device inside
1230 UML is fixed by the routing and proxy arp that is set up on the
1231 TUN/TAP device on the host. You can use a different one, but it won't
1232 work because reply packets won't reach the UML. This is a feature.
1233 It prevents a nasty UML user from doing things like setting the UML IP
1234 to the same as the network's nameserver or mail server.
1235
1236
1237 There are a couple of potential problems with running the TUN/TAP
1238 transport on a 2.4 host kernel:
1239
1240 +o TUN/TAP seems not to work on 2.4.3 and earlier. Upgrade the host
1241 kernel or use the ethertap transport.
1242
1243 +o With an upgraded kernel, TUN/TAP may fail with
1244
1245
1246 File descriptor in bad state
1247
1248
1249
1250
1251 This is due to a header mismatch between the upgraded kernel and the
1252 kernel that was originally installed on the machine. The fix is to
1253 make sure that /usr/src/linux points to the headers for the running
1254 kernel.
1255
1256 These were pointed out by Tim Robinson <timro at trkr dot net> in
1257 this uml-user post <http://www.geocrawler.com/>.
1259
1260
1261
1262 6.7. TUN/TAP with a preconfigured tap device
1263
1264 If you prefer not to have UML use uml_net (which is somewhat
1265 insecure), with UML 2.4.17-11, you can set up a TUN/TAP device
1266 beforehand. The setup needs to be done as root, but once that's done,
1267 there is no need for root assistance. Setting up the device is done
1268 as follows:
1269
1270 +o Create the device with tunctl (available from the UML utilities
1271 tarball)
1272
1273
1274
1275
1276 host# tunctl -u uid
1277
1278
1279
1280
1281 where uid is the user id or username that UML will be run as. This
1282 will tell you what device was created.
1283
1284 +o Configure the device IP (change IP addresses and device name to
1285 suit)
1286
1287
1288
1289
1290 host# ifconfig tap0 192.168.0.254 up
1291
1292
1293
1294
1295
1296 +o Set up routing and arping if desired - this is my recipe, there are
1297 other ways of doing the same thing
1298
1299
1300 host#
1301 bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
1302
1303 host#
1304 route add -host 192.168.0.253 dev tap0
1305
1306
1307
1308
1309
1310
1311 host#
1312 bash -c 'echo 1 > /proc/sys/net/ipv4/conf/tap0/proxy_arp'
1313
1314
1315
1316
1317
1318
1319 host#
1320 arp -Ds 192.168.0.253 eth0 pub
1321
1322
1323
1324
1325 Note that this must be done every time the host boots - this
1326 configuration is not stored across host reboots. So, it's probably a good
1327 idea to stick it in an rc file. An even better idea would be a little
1328 utility which reads the information from a config file and sets up
1329 devices at boot time.
1330
1331 +o Rather than using up two IPs and ARPing for one of them, you can
1332 also provide direct access to your LAN by the UML by using a
1333 bridge.
1334
1335
1336 host#
1337 brctl addbr br0
1338
1339
1340
1341
1342
1343
1344 host#
1345 ifconfig eth0 0.0.0.0 promisc up
1346
1347
1348
1349
1350
1351
1352 host#
1353 ifconfig tap0 0.0.0.0 promisc up
1354
1355
1356
1357
1358
1359
1360 host#
1361 ifconfig br0 192.168.0.1 netmask 255.255.255.0 up
1362
1363
1364
1365
1366
1367
1368
1369 host#
1370 brctl stp br0 off
1371
1372
1373
1374
1375
1376
1377 host#
1378 brctl setfd br0 1
1379
1380
1381
1382
1383
1384
1385 host#
1386 brctl sethello br0 1
1387
1388
1389
1390
1391
1392
1393 host#
1394 brctl addif br0 eth0
1395
1396
1397
1398
1399
1400
1401 host#
1402 brctl addif br0 tap0
1403
1404
1405
1406
1407 Note that 'br0' should be set up using ifconfig with the existing IP
1408 address of eth0, as eth0 no longer has its own IP.
1409
1410 +o Also, the /dev/net/tun device must be writable by the user running
1414 UML in order for the UML to use the device that's been configured
1415 for it. The simplest thing to do is
1416
1417
1418 host# chmod 666 /dev/net/tun
1419
1420
1421
1422
1423 Making it world-writable looks bad, but it seems not to be
1424 exploitable as a security hole. However, it does allow anyone to
1425 create useless tap devices (useless because they can't configure them),
1426 which is a DoS attack. A somewhat more secure alternative would be
1427 to create a group containing all the users who have preconfigured tap
1428 devices and chgrp /dev/net/tun to that group with mode 664 or 660.
1429
1430
1431 +o Once the device is set up, run UML with 'eth0=tuntap,device name'
1432 (e.g. 'eth0=tuntap,tap0') on the command line (or do it with the
1433 mconsole config command).
1434
1435 +o Bring the eth device up in UML and you're in business.
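The "little utility which reads the information from a config file"
suggested in the list above might look something like the following.
This is only a sketch: the function names, and the one-record-per-line
format (tap device, host-side IP, UML IP, host ethernet device), are
invented for illustration, and the tap devices are assumed to have been
created with tunctl already. It prints the routing and proxy arp recipe
from the list rather than running it, so the output can be reviewed and
then fed to a root shell from an rc file.

```shell
#!/bin/sh
# Sketch: generate host-side setup for preconfigured tap devices.
# tap_setup_commands, tap_setup_from_config and the record format
# are hypothetical.

# tap_setup_commands TAP TAP_IP UML_IP HOST_ETH
tap_setup_commands() {
    echo "ifconfig $1 $2 up"
    echo "echo 1 > /proc/sys/net/ipv4/ip_forward"
    echo "route add -host $3 dev $1"
    echo "echo 1 > /proc/sys/net/ipv4/conf/$1/proxy_arp"
    echo "arp -Ds $3 $4 pub"
}

# Read "TAP TAP_IP UML_IP HOST_ETH" records from stdin, skipping
# blank lines and lines starting with '#'.
tap_setup_from_config() {
    while read -r tap tap_ip uml_ip eth; do
        case $tap in
            ''|'#'*) continue ;;
        esac
        tap_setup_commands "$tap" "$tap_ip" "$uml_ip" "$eth"
    done
}

tap_setup_from_config <<'EOF'
# tap  tap-ip         uml-ip         host-eth
tap0   192.168.0.254  192.168.0.253  eth0
EOF
```

Printing instead of executing keeps the sketch harmless on a machine
without tap devices.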
1436
1437 If you don't want that tap device any more, you can make it non-
1438 persistent with
1439
1440
1441 host# tunctl -d tap device
1442
1443
1444
1445
1446 Finally, tunctl has a -b (for brief mode) switch which causes it to
1447 output only the name of the tap device it created. This makes it
1448 suitable for capture by a script:
1449
1450
1451 host# TAP=`tunctl -u 1000 -b`
1452
1453
1454
1455
1456
1457
1458 6.8. Ethertap
1459
1460 Ethertap is the general mechanism on 2.2 for userspace processes to
1461 exchange packets with the kernel.
1462
1463
1464
1465 To use this transport, you need to describe the virtual network device
1466 on the UML command line. The general format for this is
1467
1468
1469 eth<n>=ethertap,<device>,<ethernet address>,<tap IP address>
1470
1471
1472
1473
1474 So, the previous example
1475
1476
1477 eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
1478
1479
1480
1481
1482 attaches the UML eth0 device to the host /dev/tap0, assigns it the
1483 ethernet address fe:fd:0:0:0:1, and assigns the IP address
1484 192.168.0.254 to the tap device.
1485
1486
1487
1488 The tap device is mandatory, but the others are optional. If the
1489 ethernet address is omitted, one will be assigned to it.
1490
1491
1492 The presence of the tap IP address will cause the helper to run and do
1493 whatever host setup is needed to allow the virtual machine to
1494 communicate with the outside world. If you're not sure you know what
1495 you're doing, this is the way to go.
1496
1497
1498 If it is absent, then you must configure the tap device and whatever
1499 arping and routing you will need on the host. However, even in this
1500 case, the uml_net helper still needs to be in your path and it must be
1501 setuid root if you're not running UML as root. This is because the
1502 tap device doesn't support SIGIO, which UML needs in order to use
1503 something as a source of input. So, the helper is used as a
1504 convenient asynchronous IO thread.
1505
1506 If you're using the uml_net helper, you can ignore the following host
1507 setup - uml_net will do it for you. You just need to make sure you
1508 have ethertap available, either built in to the host kernel or
1509 available as a module.
1510
1511
1512 If you want to set things up yourself, you need to make sure that the
1513 appropriate /dev entry exists. If it doesn't, become root and create
1514 it as follows:
1515
1516
1517 mknod /dev/tap<minor> c 36 <minor + 16>
1518
1519
1520
1521
1522 For example, this is how to create /dev/tap0:
1523
1524
1525 mknod /dev/tap0 c 36 16
1526
1527
1528
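The device number rule above (major 36, minor equal to the tap number
plus 16) can be sketched as a small shell helper. It only prints the
mknod command instead of running it, so it is safe to try as a normal
user. The name ethertap_mknod_cmd is made up for this illustration and
is not part of UML.

```shell
# Print the mknod command for ethertap device tap<n>.
# Ethertap devices use character major 36 and minor n + 16.
# (ethertap_mknod_cmd is a hypothetical helper name.)
ethertap_mknod_cmd() {
    n=$1
    echo "mknod /dev/tap${n} c 36 $((n + 16))"
}

ethertap_mknod_cmd 0    # prints: mknod /dev/tap0 c 36 16
ethertap_mknod_cmd 1    # prints: mknod /dev/tap1 c 36 17
```

To actually create the node, run the printed command as root.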
1529
1530 You also need to make sure that the host kernel has ethertap support.
1531 If ethertap is enabled as a module, you apparently need to insmod
1532 ethertap once for each ethertap device you want to enable. So,
1533
1534
1535 host# insmod ethertap
1537
1538
1539
1540
1541 will give you the tap0 interface. To get the tap1 interface, you need
1542 to run
1543
1544
1545 host# insmod ethertap unit=1 -o ethertap1
1547
1548
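If you need several ethertap devices, the per-device insmod commands can
be generated rather than typed. This is a minimal sketch that assumes
the pattern shown for tap1 (unit=<n> -o ethertap<n>) carries on for
higher numbers. It prints the commands rather than running them, and
ethertap_insmod_cmd is a made-up name.

```shell
# Print the insmod command that enables ethertap device tap<n>.
# tap0 needs no arguments; later devices follow the tap1 pattern
# (assumed to hold for n > 1 as well).
ethertap_insmod_cmd() {
    n=$1
    if [ "$n" -eq 0 ]; then
        echo "insmod ethertap"
    else
        echo "insmod ethertap unit=${n} -o ethertap${n}"
    fi
}

# Commands for tap0 through tap2:
for i in 0 1 2; do
    ethertap_insmod_cmd "$i"
done
```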
1549
1550
1551
1552
1553
1554 6.9. The switch daemon
1555
1556 Note: This is the daemon formerly known as uml_router, but which was
1557 renamed so the network weenies of the world would stop growling at me.
1558
1559
1560 The switch daemon, uml_switch, provides a mechanism for creating a
1561 totally virtual network. By default, it provides no connection to the
1562 host network (but see -tap, below).
1563
1564
1565 The first thing you need to do is run the daemon. Running it with no
1566 arguments will make it listen on a default pair of unix domain
1567 sockets.
1568
1569
1570 If you want it to listen on a different pair of sockets, use
1571
1572
1573 -unix <control socket> <data socket>
1574
1575
1576
1577
1578
1579 If you want it to act as a hub rather than a switch, use
1580
1581
1582 -hub
1583
1584
1585
1586
1587
1588 If you want the switch to be connected to host networking (allowing
1589 the umls to get access to the outside world through the host), use
1590
1591
1592 -tap tap0
1593
1594
1595
1596
1597
1598 Note that the tap device must be preconfigured (see "TUN/TAP with a
1599 preconfigured tap device", above). If you're using a different tap
1600 device than tap0, specify that instead of tap0.
1601
1602
1603 uml_switch can be backgrounded as follows
1604
1605
1606 host% uml_switch [ options ] < /dev/null > /dev/null
1608
1609
1610
1611
1612 The reason it doesn't background by default is that it listens to
1613 stdin for EOF. When it sees that, it exits.
1614
1615
1616 The general format of the kernel command line switch is
1617
1618
1619
1620 eth<n>=daemon,<ethernet address>,<socket type>,<control socket>,<data socket>
1622
1623
1624
1625
1626 You can leave off everything except the 'daemon'. You only need to
1627 specify the ethernet address if the one that will be assigned to it
1628 isn't acceptable for some reason. The rest of the arguments describe
1629 how to communicate with the daemon. You should only specify them if
1630 you told the daemon to use different sockets than the default. So, if
1631 you ran the daemon with no arguments, running the UML on the same
1632 machine with
1633 eth0=daemon
1634
1635
1636
1637
1638 will cause the eth0 driver to attach itself to the daemon correctly.
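
Since every field after 'daemon' is optional, the switch can be built
up from whatever you happen to have. The helper below is only a sketch:
the name uml_daemon_arg is made up, and it assumes that UML accepts an
empty field (two commas in a row) when you want to skip the ethernet
address but still name the sockets. The socket type value 'unix' in the
test call is likewise a guess based on the daemon's -unix switch.

```shell
# Build an eth<n>=daemon,... switch from optional fields.
# Usage: uml_daemon_arg N [ETHER] [SOCKTYPE] [CONTROL] [DATA]
# Omitted trailing fields are dropped; an omitted middle field is left
# empty (assumption: UML then falls back to its default for it).
uml_daemon_arg() {
    n=$1
    arg="eth${n}=daemon,${2:-},${3:-},${4:-},${5:-}"
    # Strip the trailing commas left behind by omitted fields.
    while [ "${arg%,}" != "$arg" ]; do
        arg=${arg%,}
    done
    printf '%s\n' "$arg"
}

uml_daemon_arg 0    # prints: eth0=daemon
```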
1639
1640
1641
1642 6.10. Slip
1643
1644 Slip is another, less general, mechanism for a process to communicate
1645 with the host networking. In contrast to the ethertap interface,
1646 which exchanges ethernet frames with the host and can be used to
1647 transport any higher-level protocol, it can only be used to transport
1648 IP.
1649
1650
1651 The general format of the command line switch is
1652
1653
1654
1655 eth<n>=slip,<slip IP>
1656
1657
1658
1659
1660 The slip IP argument is the IP address that will be assigned to the
1661 host end of the slip device. If it is specified, the helper will run
1662 and will set up the host so that the virtual machine can reach it and
1663 the rest of the network.
1664
1665
1666 There are some oddities with this interface that you should be aware
1667 of. You should only specify one slip device on a given virtual
1668 machine, and its name inside UML will be 'umn', not 'eth0' or whatever
1669 you specified on the command line. These problems will be fixed at
1670 some point.
1671
1672
1673
1674 6.11. Slirp
1675
1676 slirp uses an external program, usually /usr/bin/slirp, to provide IP
1677 only networking connectivity through the host. This is similar to IP
1678 masquerading with a firewall, although the translation is performed in
1679 user-space, rather than by the kernel. As slirp does not set up any
1680 interfaces on the host or change routing, slirp does not require
1681 root access or setuid binaries on the host.
1682
1683
1684 The general format of the command line switch for slirp is:
1685
1686
1687
1688 eth<n>=slirp,<ethernet address>,<slirp path>
1689
1690
1691
1692
1693 The ethernet address is optional, as UML will set up the interface
1694 with an ethernet address based upon the initial IP address of the
1695 interface. The slirp path is generally /usr/bin/slirp, although it
1696 will depend on distribution.
1697
1698
1699 The slirp program can have a number of options passed to the command
1700 line and we can't add them to the UML command line, as they will be
1701 parsed incorrectly. Instead, a wrapper shell script can be written or
1702 the options inserted into the ~/.slirprc file. More information on
1703 all of the slirp options can be found in its man pages.
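
The wrapper script idea can be sketched as follows. The helper writes a
tiny script that execs the real slirp binary with a fixed set of
options. Its name, write_slirp_wrapper, is made up, and the way options
are passed (each one as a single quoted argument) is an assumption, so
check the slirp man page for the syntax your version expects.

```shell
# Write a wrapper script that runs slirp with fixed options.
# Usage: write_slirp_wrapper WRAPPER SLIRP_BIN [OPTION...]
# Assumes SLIRP_BIN and the options contain no single quotes.
# (write_slirp_wrapper is a hypothetical helper, not part of slirp or UML.)
write_slirp_wrapper() {
    wrapper=$1
    slirp_bin=$2
    shift 2
    {
        printf '#!/bin/sh\n'
        printf "exec '%s'" "$slirp_bin"
        # Each option becomes one single-quoted argument.
        for opt in "$@"; do
            printf " '%s'" "$opt"
        done
        printf '\n'
    } > "$wrapper"
    chmod +x "$wrapper"
}
```

The path of the generated wrapper then takes the place of the slirp path
on the UML command line.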
1704
1705
1706 The eth0 interface on UML should be set up with the IP 10.0.2.15,
1707 although you can use anything as long as it is not used by a network
1708 you will be connecting to. The default route on UML should be set to
1709 use
1710
1711
1712 UML# route add default dev eth0
1714
1715
1716
1717
1718 slirp provides a number of useful IP addresses which can be used by
1719 UML, such as 10.0.2.3 which is an alias for the DNS server specified
1720 in /etc/resolv.conf on the host or the IP given in the 'dns' option
1721 for slirp.
1722
1723
1724 Even with a baudrate setting higher than 115200, the slirp connection
1725 is limited to 115200. If you need it to go faster, the slirp binary
1726 needs to be compiled with FULL_BOLT defined in config.h.
1727
1728
1729
1730 6.12. pcap
1731
1732 The pcap transport is attached to a UML ethernet device on the command
1733 line or with uml_mconsole with the following syntax:
1734
1735
1736
1737 eth<n>=pcap,<host interface>,<filter expression>,<option1>,<option2>
1739
1740
1741
1742
1743 The expression and options are optional.
1744
1745
1746 The interface is whatever network device on the host you want to
1747 sniff. The expression is a pcap filter expression, which is also what
1748 tcpdump uses, so if you know how to specify tcpdump filters, you will
1749 use the same expressions here. The options are up to two of
1750 'promisc', 'optimize', and 'nooptimize'. 'promisc' controls whether pcap
1751 puts the host interface into promiscuous mode. 'optimize' and 'nooptimize'
1752 control whether the pcap expression optimizer is used.
1753
1754
1755 Example:
1756
1757
1758
1759 eth0=pcap,eth0,tcp
1760
1761 eth1=pcap,eth0,!tcp
1762
1763
1764
1765 will cause the UML eth0 to emit all tcp packets on the host eth0 and
1766 the UML eth1 to emit all non-tcp packets on the host eth0.
1767
1768
1769
1770 6.13. Setting up the host yourself
1771
1772 If you don't specify an address for the host side of the ethertap or
1773 slip device, UML won't do any setup on the host. So this is what is
1774 needed to get things working (the examples use a host-side IP of
1775 192.168.0.251 and a UML-side IP of 192.168.0.250 - adjust to suit your
1776 own network):
1777
1778 +o The device needs to be configured with its IP address. Tap devices
1779 are also configured with an mtu of 1484. Slip devices are
1780 configured with a point-to-point address pointing at the UML IP
1781 address.
1782
1783
1784 host# ifconfig tap0 arp mtu 1484 192.168.0.251 up
1785
1786
1787
1788
1789
1790
1791 host# ifconfig sl0 192.168.0.251 pointopoint 192.168.0.250 up
1793
1794
1795
1796
1797
1798 +o If a tap device is being set up, a route is set to the UML IP.
1799
1800
1801 host# route add -host 192.168.0.250 gw 192.168.0.251
1802
1803
1804
1805
1806
1807 +o To allow other hosts on your network to see the virtual machine,
1808 proxy arp is set up for it.
1809
1810
1811 host# arp -Ds 192.168.0.250 eth0 pub
1812
1813
1814
1815
1816
1817 +o Finally, the host is set up to route packets.
1818
1819
1820 host# echo 1 > /proc/sys/net/ipv4/ip_forward
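
For a tap device, the steps above can be collected into one helper that
prints the host-side commands for a given set of addresses. This is a
sketch only: tap_host_setup_cmds is a made-up name, the commands are the
ones listed above with the addresses substituted, and nothing is
executed until you choose to run the output as root.

```shell
# Print the host-side commands for manual tap setup.
# Usage: tap_host_setup_cmds TAP HOST_IP UML_IP HOST_IF
# (tap_host_setup_cmds is a hypothetical helper name.)
tap_host_setup_cmds() {
    tap=$1
    host_ip=$2
    uml_ip=$3
    host_if=$4
    echo "ifconfig ${tap} arp mtu 1484 ${host_ip} up"
    echo "route add -host ${uml_ip} gw ${host_ip}"
    echo "arp -Ds ${uml_ip} ${host_if} pub"
    echo "echo 1 > /proc/sys/net/ipv4/ip_forward"
}

tap_host_setup_cmds tap0 192.168.0.251 192.168.0.250 eth0
```

Review the printed commands first, then pipe them into a root shell if
they look right for your network.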
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831 7. Sharing Filesystems between Virtual Machines
1832
1833
1834
1835
1836 7.1. A warning
1837
1838 Don't attempt to share filesystems simply by booting two UMLs from the
1839 same file. That's the same thing as booting two physical machines
1840 from a shared disk. It will result in filesystem corruption.
1841
1842
1843
1844 7.2. Using layered block devices
1845
1846 The way to share a filesystem between two virtual machines is to use
1847 the copy-on-write (COW) layering capability of the ubd block driver.
1848 As of 2.4.6-2um, the driver supports layering a read-write private
1849 device over a read-only shared device. A machine's writes are stored
1850 in the private device, while reads come from either device - the
1851 private one if the requested block is valid in it, the shared one if
1852 not. Using this scheme, the majority of data which is unchanged is
1853 shared between an arbitrary number of virtual machines, each of which
1854 has a much smaller file containing the changes that it has made. With
1855 a large number of UMLs booting from a large root filesystem, this
1856 leads to a huge disk space saving. It will also help performance,
1857 since the host will be able to cache the shared data using a much
1858 smaller amount of memory, so UML disk requests will be served from the
1859 host's memory rather than its disks.
1860
1861
1862
1863
1864 To add a copy-on-write layer to an existing block device file, simply
1865 add the name of the COW file to the appropriate ubd switch:
1866
1867
1868 ubd0=root_fs_cow,root_fs_debian_22
1869
1870
1871
1872
1873 where 'root_fs_cow' is the private COW file and 'root_fs_debian_22' is
1874 the existing shared filesystem. The COW file need not exist. If it
1875 doesn't, the driver will create and initialize it. Once the COW file
1876 has been initialized, it can be used on its own on the command line:
1877
1878
1879 ubd0=root_fs_cow
1880
1881
1882
1883
1884 The name of the backing file is stored in the COW file header, so it
1885 would be redundant to continue specifying it on the command line.
1886
1887
1888
1889 7.3. Note!
1890
1891 When checking the size of the COW file in order to see the gobs of
1892 space that you're saving, make sure you use 'ls -ls' to see the actual
1893 disk consumption rather than the length of the file. The COW file is
1894 sparse, so the length will be very different from the disk usage.
1895 Here is a 'ls -l' of a COW file and backing file from one boot and
1896 shutdown:
1897 host% ls -l cow.debian debian2.2
1898 -rw-r--r-- 1 jdike jdike 492504064 Aug 6 21:16 cow.debian
1899 -rwxrw-rw- 1 jdike jdike 537919488 Aug 6 20:42 debian2.2
1900
1901
1902
1903
1904 Doesn't look like much saved space, does it? Well, here's 'ls -ls':
1905
1906
1907 host% ls -ls cow.debian debian2.2
1908 880 -rw-r--r-- 1 jdike jdike 492504064 Aug 6 21:16 cow.debian
1909 525832 -rwxrw-rw- 1 jdike jdike 537919488 Aug 6 20:42 debian2.2
1910
1911
1912
1913
1914 Now, you can see that the COW file has less than a meg of disk, rather
1915 than 492 meg.
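
If you check this often, the two numbers can be pulled out with a small
helper instead of reading them off 'ls -ls' by eye. This is a sketch
using wc and du, and file_sizes is a made-up name. The first number is
the length of the file and the second is the disk space actually
allocated, in kilobytes.

```shell
# Print "<apparent bytes> <allocated KB>" for a file.
# For a sparse COW file the allocated figure is far smaller than the
# apparent length would suggest.
# (file_sizes is a hypothetical helper name.)
file_sizes() {
    f=$1
    bytes=$(wc -c < "$f")
    kb=$(du -k "$f" | cut -f1)
    # Arithmetic expansion strips any padding wc or du may add.
    echo "$((bytes)) $((kb))"
}
```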
1916
1917
1918
1919 7.4. Another warning
1920
1921 Once a filesystem is being used as a readonly backing file for a COW
1922 file, do not boot directly from it or modify it in any way. Doing so
1923 will invalidate any COW files that are using it. The mtime and size
1924 of the backing file are stored in the COW file header at its creation,
1925 and they must continue to match. If they don't, the driver will
1926 refuse to use the COW file.
1927
1928
1929
1930
1931 If you attempt to evade this restriction by changing either the
1932 backing file or the COW header by hand, you will get a corrupted
1933 filesystem.
1934
1935
1936
1937
1938 Among other things, this means that upgrading the distribution in a
1939 backing file and expecting that all of the COW files using it will see
1940 the upgrade will not work.
1941
1942
1943
1944
1945 7.5. uml_moo : Merging a COW file with its backing file
1946
1947 Depending on how you use UML and COW devices, it may be advisable to
1948 merge the changes in the COW file into the backing file every once in
1949 a while.
1950
1951
1952
1953
1954 The utility that does this is uml_moo. Its usage is
1955
1956
1957 host% uml_moo <COW file> <new backing file>
1958
1959
1960
1961
1962 There's no need to specify the backing file since that information is
1963 already in the COW file header. If you're paranoid, boot the new
1964 merged file, and if you're happy with it, move it over the old backing
1965 file.
1966
1967
1968
1969
1970 uml_moo creates a new backing file by default as a safety measure. It
1971 also has a destructive merge option which will merge the COW file
1972 directly into its current backing file. This is really only usable
1973 when the backing file only has one COW file associated with it. If
1974 there are multiple COWs associated with a backing file, a -d merge of
1975 one of them will invalidate all of the others. However, it is
1976 convenient if you're short of disk space, and it should also be
1977 noticeably faster than a non-destructive merge.
1978
1979
1980
1981
1982 uml_moo is installed with the UML deb and RPM. If you didn't install
1983 UML from one of those packages, you can also get it from the UML
1984 utilities <http://user-mode-linux.sourceforge.net/
1985 utilities> tar file in tools/moo.
1986
1987
1988
1989
1990
1991
1992
1993
1994 8. Creating filesystems
1995
1996
1997 You may want to create and mount new UML filesystems, either because
1998 your root filesystem isn't large enough or because you want to use a
1999 filesystem other than ext2.
2000
2001
2002 This was written on the occasion of reiserfs being included in the
2003 2.4.1 kernel pool, and therefore the 2.4.1 UML, so the examples will
2004 talk about reiserfs. This information is generic, and the examples
2005 should be easy to translate to the filesystem of your choice.
2006
2007
2008 8.1. Create the filesystem file
2009
2010 dd is your friend. All you need to do is tell dd to create an empty
2011 file of the appropriate size. I usually make it sparse to save time
2012 and to avoid allocating disk space until it's actually used. For
2013 example, the following command will create a sparse 100 meg file full
2014 of zeroes.
2015
2016
2017 host% dd if=/dev/zero of=new_filesystem seek=100 count=1 bs=1M
2019
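The same dd trick can be wrapped in a helper that takes the size as an
argument. Note that dd writes its one block after the blocks it seeks
past, so seek=100 count=1 ends up one block longer than 100 meg. The
sketch below seeks one block less so the file comes out at exactly the
requested size. make_sparse_file is a made-up name, and bs=1M assumes a
dd that understands that suffix (GNU dd does).

```shell
# Create a sparse file of exactly SIZE_MB megabytes (1M = 1048576 bytes).
# Usage: make_sparse_file FILE SIZE_MB
# (make_sparse_file is a hypothetical helper name.)
make_sparse_file() {
    file=$1
    size_mb=$2
    # Skip SIZE_MB - 1 blocks, then write a single block of zeroes.
    dd if=/dev/zero of="$file" bs=1M seek=$((size_mb - 1)) count=1 2>/dev/null
}
```

For example, 'make_sparse_file new_filesystem 100' gives a 100 meg file
that takes up almost no disk space until it is written to.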
2020
2021
2022
2023
2024
2025 8.2. Assign the file to a UML device
2026
2027 Add an argument like the following to the UML command line:
2028
2029 ubd4=new_filesystem
2030
2031
2032
2033
2034 making sure that you use an unassigned ubd device number.
2035
2036
2037
2038 8.3. Creating and mounting the filesystem
2039
2040 Make sure that the filesystem is available, either by being built into
2041 the kernel, or available as a module, then boot up UML and log in. If
2042 the root filesystem doesn't have the filesystem utilities (mkfs, fsck,
2043 etc), then get them into UML by way of the net or hostfs.
2044
2045
2046 Make the new filesystem on the device assigned to the new file:
2047
2048
2049 UML# mkreiserfs /dev/ubd/4
2050
2051
2052 <----------- MKREISERFSv2 ----------->
2053
2054 ReiserFS version 3.6.25
2055 Block size 4096 bytes
2056 Block count 25856
2057 Used blocks 8212
2058 Journal - 8192 blocks (18-8209), journal header is in block 8210
2059 Bitmaps: 17
2060 Root block 8211
2061 Hash function "r5"
2062 ATTENTION: ALL DATA WILL BE LOST ON '/dev/ubd/4'! (y/n)y
2063 journal size 8192 (from 18)
2064 Initializing journal - 0%....20%....40%....60%....80%....100%
2065 Syncing..done.
2066
2067
2068
2069
2070 Now, mount it:
2071
2072
2073 UML# mount /dev/ubd/4 /mnt
2075
2076
2077
2078
2079 and you're in business.
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089 9. Host file access
2090
2091
2092 If you want to access files on the host machine from inside UML, you
2093 can treat it as a separate machine and either nfs mount directories
2094 from the host or copy files into the virtual machine with scp or rcp.
2095 However, since UML is running on the host, it can access those
2096 files just like any other process and make them available inside the
2097 virtual machine without needing to use the network.
2098
2099
2100 This is now possible with the hostfs virtual filesystem. With it, you
2101 can mount a host directory into the UML filesystem and access the
2102 files contained in it just as you would on the host.
2103
2104
2105 9.1. Using hostfs
2106
2107 To begin with, make sure that hostfs is available inside the virtual
2108 machine with
2109
2110
2111 UML# cat /proc/filesystems
2112
2113
2114
2115 hostfs should be listed. If it's not, either rebuild the kernel
2116 with hostfs configured into it or make sure that hostfs is built as a
2117 module and available inside the virtual machine, and insmod it.
2118
2119
2120 Now all you need to do is run mount:
2121
2122
2123 UML# mount none /mnt/host -t hostfs
2124
2125
2126
2127
2128 will mount the host's / on the virtual machine's /mnt/host.
2129
2130
2131 If you don't want to mount the host root directory, then you can
2132 specify a subdirectory to mount with the -o switch to mount:
2133
2134
2135 UML# mount none /mnt/home -t hostfs -o /home
2136
2137
2138
2139
2140 will mount the host's /home on the virtual machine's /mnt/home.
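
If you mount hostfs from a script, it is worth checking for the
filesystem first so that the failure is obvious. The sketch below does
the /proc/filesystems check mechanically. The name have_fs is made up,
and the optional second argument exists only so the check can be tried
against a sample file.

```shell
# Return success if filesystem type FS is listed in the given file
# (default: /proc/filesystems).
# (have_fs is a hypothetical helper name.)
have_fs() {
    fs=$1
    list=${2:-/proc/filesystems}
    # Each line is "[nodev]<TAB>name"; compare the last field exactly.
    awk -v fs="$fs" '$NF == fs { found = 1 } END { exit !found }' "$list"
}
```

Inside UML you could then run 'have_fs hostfs && mount none /mnt/host -t
hostfs', and fall back to insmod if the check fails.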
2141
2142
2143
2144 9.2. hostfs as the root filesystem
2145
2146 It's possible to boot from a directory hierarchy on the host using
2147 hostfs rather than using the standard filesystem in a file.
2148
2149 To start, you need that hierarchy. The easiest way is to loop mount
2150 an existing root_fs file:
2151
2152
2153 host# mount root_fs uml_root_dir -o loop
2154
2155
2156
2157
2158 You need to change the filesystem type of / in etc/fstab to be
2159 'hostfs', so that line looks like this:
2160
2161 /dev/ubd/0 / hostfs defaults 1 1
2162
2163
2164
2165
2166 Then you need to chown to yourself all the files in that directory
2167 that are owned by root. This worked for me:
2168
2169
2170 host# find . -uid 0 -exec chown jdike {} \;
2171
2172
2173
2174
2175 Next, make sure that your UML kernel has hostfs compiled in, not as a
2176 module. Then run UML with the boot device pointing at that directory:
2177
2178
2179 ubd0=/path/to/uml/root/directory
2180
2181
2182
2183
2184 UML should then boot as it does normally.
2185
2186
2187 9.3. Building hostfs
2188
2189 If you need to build hostfs because it's not in your kernel, you have
2190 two choices:
2191
2192
2193
2194 +o Compiling hostfs into the kernel:
2195
2196
2197 Reconfigure the kernel and enable the 'Host filesystem' option.
2198
2199
2200 +o Compiling hostfs as a module:
2201
2202
2203 Reconfigure the kernel and set the 'Host filesystem' option to build
2204 as a module. The module will be in arch/um/fs/hostfs/hostfs.o. Install
2205 that in /lib/modules/`uname -r`/fs in the virtual machine, boot it up, and run
2206
2207
2208 UML# insmod hostfs
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221 10. The Management Console
2222
2223
2224
2225 The UML management console is a low-level interface to the kernel,
2226 somewhat like the i386 SysRq interface. Since there is a full-blown
2227 operating system under UML, there is much greater flexibility possible
2228 than with the SysRq mechanism.
2229
2230
2231 There are a number of things you can do with the mconsole interface:
2232
2233 +o get the kernel version
2234
2235 +o add and remove devices
2236
2237 +o halt or reboot the machine
2238
2239 +o Send SysRq commands
2240
2241 +o Pause and resume the UML
2242
2243
2244 You need the mconsole client (uml_mconsole) which is present in CVS
2245 (/tools/mconsole) in 2.4.5-9um and later, and will be in the RPM in
2246 2.4.6.
2247
2248
2249 You also need CONFIG_MCONSOLE (under 'General Setup') enabled in UML.
2250 When you boot UML, you'll see a line like:
2251
2252
2253 mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole
2254
2255
2256
2257
2258 If you specify a unique machine id on the UML command line, e.g.
2259
2260
2261 umid=debian
2262
2263
2264
2265
2266 you'll see this
2267
2268
2269 mconsole initialized on /home/jdike/.uml/debian/mconsole
2270
2271
2272
2273
2274 That file is the socket that uml_mconsole will use to communicate with
2275 UML. Run it with either the umid or the full path as its argument:
2276
2277
2278 host% uml_mconsole debian
2279
2280
2281
2282
2283 or
2284
2285
2286 host% uml_mconsole /home/jdike/.uml/debian/mconsole
2287
2288
2289
2290
2291 You'll get a prompt, at which you can run one of these commands:
2292
2293 +o version
2294
2295 +o halt
2296
2297 +o reboot
2298
2299 +o config
2300
2301 +o remove
2302
2303 +o sysrq
2304
2305 +o help
2306
2307 +o cad
2308
2309 +o stop
2310
2311 +o go
2312
2313
2314 10.1. version
2315
2316 This takes no arguments. It prints the UML version.
2317
2318
2319 (mconsole) version
2320 OK Linux usermode 2.4.5-9um #1 Wed Jun 20 22:47:08 EDT 2001 i686
2321
2322
2323
2324
2325 There are a couple of actual uses for this. It's a simple no-op which
2326 can be used to check that a UML is running. It's also a way of
2327 sending an interrupt to the UML. This is sometimes useful on SMP
2328 hosts, where there's a bug which causes signals to UML to be lost,
2329 often causing it to appear to hang. Sending such a UML the mconsole
2330 version command is a good way to 'wake it up' before networking has
2331 been enabled, as it does not do anything to the function of the UML.
2332
2333
2334
2335 10.2. halt and reboot
2336
2337 These take no arguments. They shut the machine down immediately, with
2338 no syncing of disks and no clean shutdown of userspace. So, they are
2339 pretty close to crashing the machine.
2340
2341
2342 (mconsole) halt
2343 OK
2344
2345
2346
2347
2348
2349
2350 10.3. config
2351
2352 "config" adds a new device to the virtual machine. Currently the ubd
2353 and network drivers support this. It takes one argument, which is the
2354 device to add, with the same syntax as the kernel command line.
2355
2356
2357
2358
2359 (mconsole) config ubd3=/home/jdike/incoming/roots/root_fs_debian22
2362 OK
2363 (mconsole) config eth1=mcast
2364 OK
2365
2366
2367
2368
2369
2370
2371 10.4. remove
2372
2373 "remove" deletes a device from the system. Its argument is just the
2374 name of the device to be removed. The device must be idle in whatever
2375 sense the driver considers necessary. In the case of the ubd driver,
2376 the removed block device must not be mounted, swapped on, or otherwise
2377 open, and in the case of the network driver, the device must be down.
2378
2379
2380 (mconsole) remove ubd3
2381 OK
2382 (mconsole) remove eth1
2383 OK
2384
2385
2386
2387
2388
2389
2390 10.5. sysrq
2391
2392 This takes one argument, which is a single letter. It calls the
2393 generic kernel's SysRq driver, which does whatever is called for by
2394 that argument. See the SysRq documentation in Documentation/sysrq.txt
2395 in your favorite kernel tree to see what letters are valid and what
2396 they do.
2397
2398
2399
2400 10.6. help
2401
2402 "help" returns a string listing the valid commands and what each one
2403 does.
2404
2405
2406
2407 10.7. cad
2408
2409 This invokes the Ctl-Alt-Del action on init. What exactly this ends
2410 up doing is up to /etc/inittab. Normally, it reboots the machine.
2411 With UML, this is usually not desired, so if a halt would be better,
2412 then find the section of inittab that looks like this
2413
2414
2415 # What to do when CTRL-ALT-DEL is pressed.
2416 ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now
2417
2418
2419
2420
2421 and change the command to halt.
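
The edit can also be done with sed. The sketch below switches shutdown
from -r (reboot) to -h (halt) on the ctrlaltdel line only and leaves
every other line alone. It assumes your inittab entry ends in '-r now'
as in the example above, and cad_to_halt is a made-up name.

```shell
# Rewrite the ctrlaltdel entry of an inittab (read on stdin) so that
# shutdown halts (-h) instead of rebooting (-r).
# (cad_to_halt is a hypothetical helper name.)
cad_to_halt() {
    sed '/:ctrlaltdel:/s/ -r now/ -h now/'
}

echo 'ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now' | cad_to_halt
# prints: ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -h now
```

Run it inside UML on a copy of /etc/inittab and compare the result
before replacing the real file.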
2422
2423
2424
2425 10.8. stop
2426
2427 This puts the UML in a loop reading mconsole requests until a 'go'
2428 mconsole command is received. This is very useful for making backups
2429 of UML filesystems, as the UML can be stopped, then synced via 'sysrq
2430 s', so that everything is written to the filesystem. You can then copy
2431 the filesystem and then send the UML 'go' via mconsole.
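
The backup sequence described above can be written down as a short
plan. This sketch only prints the commands. It assumes that uml_mconsole
will take a command after the umid and run it non-interactively, which
you should confirm with your version, and that a plain cp is good enough
for the copy. uml_backup_cmds is a made-up name.

```shell
# Print the command sequence for backing up a UML filesystem file.
# Usage: uml_backup_cmds UMID FS_FILE BACKUP_FILE
# Assumption: uml_mconsole accepts "<umid> <command>" on its command line.
# (uml_backup_cmds is a hypothetical helper name.)
uml_backup_cmds() {
    umid=$1
    fs=$2
    backup=$3
    echo "uml_mconsole ${umid} stop"      # pause the UML
    echo "uml_mconsole ${umid} sysrq s"   # sync, as described above
    echo "cp ${fs} ${backup}"             # copy the filesystem file
    echo "uml_mconsole ${umid} go"        # resume the UML
}

uml_backup_cmds debian root_fs root_fs.bak
```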
2432
2433
2434 Note that a UML running with more than one CPU will have problems
2435 after you send the 'stop' command, as only one CPU will be held in a
2436 mconsole loop and all others will continue as normal. This is a bug,
2437 and will be fixed.
2438
2439
2440
2441 10.9. go
2442
2443 This resumes a UML after being paused by a 'stop' command. Note that
2444 when the UML has resumed, TCP connections may have timed out and if
2445 the UML is paused for a long period of time, crond might go a little
2446 crazy, running all the jobs it didn't do earlier.
2447
2448
2449
2450
2451
2452
2453
2454
2455 11. Kernel debugging
2456
2457
2458 Note: The interface that makes debugging, as described here, possible
2459 is present in 2.4.0-test6 kernels and later.
2460
2461
2462 Since the user-mode kernel runs as a normal Linux process, it is
2463 possible to debug it with gdb almost like any other process. It is
2464 slightly different because the kernel's threads are already being
2465 ptraced for system call interception, so gdb can't ptrace them.
2466 However, a mechanism has been added to work around that problem.
2467
2468
2469 In order to debug the kernel, you need to build it from source. See
2470 ``Compiling the kernel and modules'' for information on doing that.
2471 Make sure that you enable CONFIG_DEBUGSYM and CONFIG_PT_PROXY during
2472 the config. These will compile the kernel with -g, and enable the
2473 ptrace proxy so that gdb works with UML, respectively.
2474
2475
2476
2477
2478 11.1. Starting the kernel under gdb
2479
2480 You can have the kernel running under the control of gdb from the
2481 beginning by putting 'debug' on the command line. You will get an
2482 xterm with gdb running inside it. The kernel will send some commands
2483 to gdb which will leave it stopped at the beginning of start_kernel.
2484 At this point, you can get things going with 'next', 'step', or
2485 'cont'.
2486
2487
2488 There is a transcript of a debugging session here
2489 <debug-session.html>, with breakpoints being set in the scheduler and
2490 in an interrupt handler.
2491 11.2. Examining sleeping processes
2492
2493 Not every bug is evident in the currently running process. Sometimes,
2494 processes hang in the kernel when they shouldn't because they've
2495 deadlocked on a semaphore or something similar. In this case, when
2496 you ^C gdb and get a backtrace, you will see the idle thread, which
2497 isn't very relevant.
2498
2499
2500 What you want is the stack of whatever process is sleeping when it
2501 shouldn't be. You need to figure out which process that is, which is
2502 generally fairly easy. Then you need to get its host process id,
2503 which you can do either by looking at ps on the host or at
2504 task.thread.extern_pid in gdb.
2505
2506
2507 Now what you do is this:
2508
2509 +o detach from the current thread
2510
2511
2512 (UML gdb) det
2513
2514
2515
2516
2517
2518 +o attach to the thread you are interested in
2519
2520
2521 (UML gdb) att <host pid>
2522
2523
2524
2525
2526
2527 +o look at its stack and anything else of interest
2528
2529
2530 (UML gdb) bt
2531
2532
2533
2534
2535 Note that you can't do anything at this point that requires that a
2536 process execute, e.g. calling a function
2537
2538 +o when you're done looking at that process, reattach to the current
2539 thread and continue it
2540
2541
2542 (UML gdb) att 1
2544
2545
2546
2547
2548
2549
2550 (UML gdb) c
2552
2553
2554
2555
2556 Here, specifying any pid which is not the process id of a UML thread
2557 will cause gdb to reattach to the current thread. I commonly use 1,
2558 but any other invalid pid would work.
2559
2560
2561
2562 11.3. Running ddd on UML
2563
2564 ddd works on UML, but requires a special kludge. The process goes
2565 like this:
2566
2567 +o Start ddd
2568
2569
2570 host% ddd linux
2571
2572
2573
2574
2575
2576 +o With ps, get the pid of the gdb that ddd started. You can ask the
2577 gdb to tell you, but for some reason that confuses things and
2578 causes a hang.
2579
2580 +o run UML with 'debug=parent gdb-pid=<pid>' added to the command line
2581 - it will just sit there after you hit return
2582
2583 +o type 'att 1' to the ddd gdb and you will see something like
2584
2585
2586 0xa013dc51 in __kill ()
2587
2588
2589 (gdb)
2590
2591
2592
2593
2594
2595 +o At this point, type 'c', UML will boot up, and you can use ddd just
2596 as you do on any other process.
2597
2598
2599
2600 11.4. Debugging modules
2601
2602 gdb has support for debugging code which is dynamically loaded into
2603 the process. This support is what is needed to debug kernel modules
2604 under UML.
2605
2606
2607 Using that support is somewhat complicated. You have to tell gdb what
2608 object file you just loaded into UML and where in memory it is. Then,
2609 it can read the symbol table, and figure out where all the symbols are
2610 from the load address that you provided. It gets more interesting
2611 when you load the module again (i.e. after an rmmod). You have to
2612 tell gdb to forget about all its symbols, including the main UML ones
2613 for some reason, then load them all back in again.
2614
2615
2616 There's an easy way and a hard way to do this. The easy way is to use
2617 the umlgdb expect script written by Chandan Kudige. It basically
2618 automates the process for you.
2619
2620
2621 First, you must tell it where your modules are. There is a list in
2622 the script that looks like this:
2623 set MODULE_PATHS {
2624 "fat" "/usr/src/uml/linux-2.4.18/fs/fat/fat.o"
2625 "isofs" "/usr/src/uml/linux-2.4.18/fs/isofs/isofs.o"
2626 "minix" "/usr/src/uml/linux-2.4.18/fs/minix/minix.o"
2627 }
2628
2629
2630
2631
2632 You change that to list the names and paths of the modules that you
2633 are going to debug. Then you run it from the toplevel directory of
2634 your UML pool and it basically tells you what to do:
2635
2636
2637
2638
2639 ******** GDB pid is 21903 ********
2640 Start UML as: ./linux <kernel switches> debug gdb-pid=21903
2641
2642
2643
2644 GNU gdb 5.0rh-5 Red Hat Linux 7.1
2645 Copyright 2001 Free Software Foundation, Inc.
2646 GDB is free software, covered by the GNU General Public License, and you are
2647 welcome to change it and/or distribute copies of it under certain conditions.
2648 Type "show copying" to see the conditions.
2649 There is absolutely no warranty for GDB. Type "show warranty" for details.
2650 This GDB was configured as "i386-redhat-linux"...
2651 (gdb) b sys_init_module
2652 Breakpoint 1 at 0xa0011923: file module.c, line 349.
2653 (gdb) att 1
2654
2655
2656
2657
2658 After you run UML and it sits there doing nothing, you hit return at
2659 the 'att 1' and continue it:
2660
2661
2662 Attaching to program: /home/jdike/linux/2.4/um/./linux, process 1
2663 0xa00f4221 in __kill ()
2664 (UML gdb) c
2665 Continuing.
2666
2667
2668
2669
2670 At this point, you debug normally. When you insmod something, the
2671 expect magic will kick in and you'll see something like:
2672
2673
2689 *** Module hostfs loaded ***
2690 Breakpoint 1, sys_init_module (name_user=0x805abb0 "hostfs",
2691 mod_user=0x8070e00) at module.c:349
2692 349 char *name, *n_name, *name_tmp = NULL;
2693 (UML gdb) finish
2694 Run till exit from #0 sys_init_module (name_user=0x805abb0 "hostfs",
2695 mod_user=0x8070e00) at module.c:349
2696 0xa00e2e23 in execute_syscall (r=0xa8140284) at syscall_kern.c:411
2697 411 else res = EXECUTE_SYSCALL(syscall, regs);
2698 Value returned is $1 = 0
2699 (UML gdb)
2700 p/x (int)module_list + module_list->size_of_struct
2701
2702 $2 = 0xa9021054
2703 (UML gdb) symbol-file ./linux
2704 Load new symbol table from "./linux"? (y or n) y
2705 Reading symbols from ./linux...
2706 done.
2707 (UML gdb)
2708 add-symbol-file /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o 0xa9021054
2709
2710 add symbol table from file "/home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o" at
2711 .text_addr = 0xa9021054
2712 (y or n) y
2713
2714 Reading symbols from /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o...
2715 done.
2716 (UML gdb) p *module_list
2717 $1 = {size_of_struct = 84, next = 0xa0178720, name = 0xa9022de0 "hostfs",
2718 size = 9016, uc = {usecount = {counter = 0}, pad = 0}, flags = 1,
2719 nsyms = 57, ndeps = 0, syms = 0xa9023170, deps = 0x0, refs = 0x0,
2720 init = 0xa90221f0 <init_hostfs>, cleanup = 0xa902222c <exit_hostfs>,
2721 ex_table_start = 0x0, ex_table_end = 0x0, persist_start = 0x0,
2722 persist_end = 0x0, can_unload = 0, runsize = 0, kallsyms_start = 0x0,
2723 kallsyms_end = 0x0,
2724 archdata_start = 0x1b855 <Address 0x1b855 out of bounds>,
2725 archdata_end = 0xe5890000 <Address 0xe5890000 out of bounds>,
2726 kernel_data = 0xf689c35d <Address 0xf689c35d out of bounds>}
2727 >> Finished loading symbols for hostfs ...
2728
2729
2730
2731
2732 That's the easy way. It's highly recommended. The hard way is
2733 described below in case you're interested in what's going on.
2734
2735
2736 Boot the kernel under the debugger and load the module with insmod or
2737 modprobe. With gdb, do:
2738
2739
2740 (UML gdb) p module_list
2741
2742
2743
2744
2745 This is a list of modules that have been loaded into the kernel, with
2746 the most recently loaded module first. Normally, the module you want
2747 is at module_list. If it's not, walk down the next links, looking at
2748 the name fields until you find the module you want to debug. Take the
2749 address of that structure, and add module.size_of_struct (which in
2750 2.4.10 kernels is 96 (0x60)) to it. Gdb can make this hard addition
2751 for you :-):
2752
2753
2754
2755 (UML gdb) printf "%#x\n", (int)module_list + module_list->size_of_struct
2757
2758
2759
2760
2761 The offset from the module start occasionally changes (before 2.4.0,
2762 it was module.size_of_struct + 4), so it's a good idea to check the
2763 init and cleanup addresses once in a while, as described below. Now
2764 do:
2765
2766
2767 (UML gdb) add-symbol-file /path/to/module/on/host that_address
2769
2770
2771
2772
2773 Tell gdb you really want to do it, and you're in business.
2774
2775
2776 If there's any doubt that you got the offset right, for example if breakpoints
2777 appear not to work or land in the wrong place, you can
2778 check it by looking at the module structure. The init and cleanup
2779 fields should look like:
2780
2781
2782 init = 0x588066b0 <init_hostfs>, cleanup = 0x588066c0 <exit_hostfs>
2783
2784
2785
2786
2787 with no offsets on the symbol names. If the names are right, but they
2788 are offset, then the offset tells you how much you need to add to the
2789 address you gave to add-symbol-file.
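
The arithmetic described above can be sketched in a few lines of Python. This is only an illustration and not part of umlgdb: the Module class and both helper functions are made up for the example, the hostfs structure address 0xa9021000 is inferred from the umlgdb transcript earlier in this section (0xa9021054 minus size_of_struct = 84), and the second list entry's name is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Module:
    addr: int                          # address of the module structure itself
    name: str
    size_of_struct: int
    next: Optional["Module"] = None    # next (older) module in the list


def find_module(head, name):
    """Walk the next links, newest module first, until the name matches."""
    mod = head
    while mod is not None:
        if mod.name == name:
            return mod
        mod = mod.next
    return None


def text_address(mod, correction=0):
    """Address to give add-symbol-file: structure address + size_of_struct.

    correction is the offset gdb shows on the init/cleanup symbol names
    (0 when the names appear with no offset).
    """
    return (mod.addr + mod.size_of_struct + correction) & 0xFFFFFFFF


# Hypothetical older entry, placed at the address the transcript shows in 'next'.
older = Module(addr=0xa0178720, name="example", size_of_struct=84)
hostfs = Module(addr=0xa9021000, name="hostfs", size_of_struct=84, next=older)

print(hex(text_address(find_module(hostfs, "hostfs"))))  # 0xa9021054
```

If the init and cleanup fields showed up with an offset, passing that value as correction would give the adjusted address to retry with.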
2790
2791
2792 When you want to load in a new version of the module, you need to get
2793 gdb to forget about the old one. The only way I've found to do that
2794 is to tell gdb to forget about all symbols that it knows about:
2795
2796
2797 (UML gdb) symbol-file
2798
2799
2800
2801
2802 Then reload the symbols from the kernel binary:
2803
2804
2805 (UML gdb) symbol-file /path/to/kernel
2806
2807
2808
2809
2810 and repeat the process above. You'll also need to re-enable
2811 breakpoints. They were disabled when you dumped all the symbols because
2812 gdb couldn't figure out where they should go.
2813
2814
2815
2816 11.5. Attaching gdb to the kernel
2817
2818 If you don't have the kernel running under gdb, you can attach gdb to
2819 it later by sending the tracing thread a SIGUSR1. The first line of
2820 the console output identifies its pid:
2821 tracing thread pid = 20093
2822
2823
2824
2825
2826 When you send it the signal:
2827
2828
2829 host% kill -USR1 20093
2830
2831
2832
2833
2834 you will get an xterm with gdb running in it.
2835
2836
2837 If you have the mconsole compiled into UML, then the mconsole client
2838 can be used to start gdb:
2839
2840
2841 (mconsole) config gdb=xterm
2842
2843
2844
2845
2846 will fire up an xterm with gdb running in it.
2847
2848
2849
2850 11.6. Using alternate debuggers
2851
2852 UML has support for attaching to an already running debugger rather
2853 than starting gdb itself. This is present in CVS as of 17 Apr 2001.
2854 I sent it to Alan for inclusion in the ac tree, and it will be in my
2855 2.4.4 release.
2856
2857
2858 This is useful when gdb is a subprocess of some UI, such as emacs or
2859 ddd. It can also be used to run debuggers other than gdb on UML.
2860 Below is an example of using strace as an alternate debugger.
2861
2862
2863 To do this, you need to get the pid of the debugger and pass it in
2864 with the 'gdb-pid=<pid>' switch along with the 'debug' switch.
2865
2866
2867 If you are using gdb under some UI, then tell it to 'att 1', and
2868 you'll find yourself attached to UML.
2869
2870
2871 If you are using something other than gdb as your debugger, then
2872 you'll need to get it to do the equivalent of 'att 1' if it doesn't do
2873 it automatically.
2874
2875
2876 An example of an alternate debugger is strace. You can strace the
2877 actual kernel as follows:
2878
2879 o Run the following in a shell:
2880
2881
2882 host% sh -c 'echo pid=$$; echo -n hit return; read x; exec strace -p 1 -o strace.out'
2884
2885
2886
2887 o Run UML with 'debug' and 'gdb-pid=<pid>', using the pid printed out
2888 by the previous command
2889
2890 o Hit return in the shell, and UML will start running, and strace
2891 output will start accumulating in the output file.
2892
2893 Note that this is different from running
2894
2895
2896 host% strace ./linux
2897
2898
2899
2900
2901 That will strace only the main UML thread, the tracing thread, which
2902 doesn't do any of the actual kernel work. It just oversees the
2903 virtual machine. In contrast, using strace as described above will show
2904 you the low-level activity of the virtual machine.
2905
2906
2907
2908
2909
2910 12. Kernel debugging examples
2911
2912 12.1. The case of the hung fsck
2913
2914 When booting up the kernel, fsck failed, and dropped me into a shell
2915 to fix things up. I ran fsck -y, which hung:
2916
2917
2953 Setting hostname uml [ OK ]
2954 Checking root filesystem
2955 /dev/fhd0 was not cleanly unmounted, check forced.
2956 Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
2957
2958 /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
2959 (i.e., without -a or -p options)
2960 [ FAILED ]
2961
2962 *** An error occurred during the file system check.
2963 *** Dropping you to a shell; the system will reboot
2964 *** when you leave the shell.
2965 Give root password for maintenance
2966 (or type Control-D for normal startup):
2967
2968 [root@uml /root]# fsck -y /dev/fhd0
2969 fsck -y /dev/fhd0
2970 Parallelizing fsck version 1.14 (9-Jan-1999)
2971 e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
2972 /dev/fhd0 contains a file system with errors, check forced.
2973 Pass 1: Checking inodes, blocks, and sizes
2974 Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes
2975
2976 Inode 19780, i_blocks is 1548, should be 540. Fix? yes
2977
2978 Pass 2: Checking directory structure
2979 Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes
2980
2981 Directory inode 11858, block 0, offset 0: directory corrupted
2982 Salvage? yes
2983
2984 Missing '.' in directory inode 11858.
2985 Fix? yes
2986
2987 Missing '..' in directory inode 11858.
2988 Fix? yes
2989
2990
2991
2992
2993
2994 The standard drill in this sort of situation is to fire up gdb on the
2995 signal thread, which, in this case, was pid 1935. In another window,
2996 I run gdb and attach pid 1935.
2997
2998
2999
3000
3001 ~/linux/2.3.26/um 1016: gdb linux
3002 GNU gdb 4.17.0.11 with Linux support
3003 Copyright 1998 Free Software Foundation, Inc.
3004 GDB is free software, covered by the GNU General Public License, and you are
3005 welcome to change it and/or distribute copies of it under certain conditions.
3006 Type "show copying" to see the conditions.
3007 There is absolutely no warranty for GDB. Type "show warranty" for details.
3008 This GDB was configured as "i386-redhat-linux"...
3009
3010 (gdb) att 1935
3011 Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1935
3012 0x100756d9 in __wait4 ()
3013
3014
3015
3016
3017
3018
3019 Let's see what's currently running:
3020
3021
3022
3023 (gdb) p current_task.pid
3024 $1 = 0
3025
3026
3027
3028
3029
3030 It's the idle thread, which means that fsck went to sleep for some
3031 reason and never woke up.
3032
3033
3034 Let's guess that the last process in the process list is fsck:
3035
3036
3037
3038 (gdb) p current_task.prev_task.comm
3039 $13 = "fsck.ext2\000\000\000\000\000\000"
3040
3041
3042
3043
3044
3045 It is, so let's see what it thinks it's up to:
3046
3047
3048
3049 (gdb) p current_task.prev_task.thread
3050 $14 = {extern_pid = 1980, tracing = 0, want_tracing = 0, forking = 0,
3051 kernel_stack_page = 0, signal_stack = 1342627840, syscall = {id = 4, args = {
3052 3, 134973440, 1024, 0, 1024}, have_result = 0, result = 50590720},
3053 request = {op = 2, u = {exec = {ip = 1350467584, sp = 2952789424}, fork = {
3054 regs = {1350467584, 2952789424, 0 <repeats 15 times>}, sigstack = 0,
3055 pid = 0}, switch_to = 0x507e8000, thread = {proc = 0x507e8000,
3056 arg = 0xaffffdb0, flags = 0, new_pid = 0}, input_request = {
3057 op = 1350467584, fd = -1342177872, proc = 0, pid = 0}}}}
3058
3059
3060
3061
3062
3063 The interesting things here are the fact that its .thread.syscall.id
3064 is __NR_write (see the big switch in arch/um/kernel/syscall_kern.c or
3065 the defines in include/asm-um/arch/unistd.h), and that it never
3066 returned. Also, its .request.op is OP_SWITCH (see
3067 arch/um/include/user_util.h). These mean that it went into a write,
3068 and, for some reason, called schedule().
3069
3070
3071 The fact that it never returned from write means that its stack should
3072 be fairly interesting. Its pid is 1980 (.thread.extern_pid). That
3073 process is being ptraced by the signal thread, so it must be detached
3074 before gdb can attach it:
3075
3076
3085 (gdb) call detach(1980)
3086
3087 Program received signal SIGSEGV, Segmentation fault.
3088 <function called from gdb>
3089 The program being debugged stopped while in a function called from GDB.
3090 When the function (detach) is done executing, GDB will silently
3091 stop (instead of continuing to evaluate the expression containing
3092 the function call).
3093 (gdb) call detach(1980)
3094 $15 = 0
3095
3096
3097
3098
3099
3100 The first detach segfaults for some reason, and the second one
3101 succeeds.
3102
3103
3104 Now I detach from the signal thread, attach to the fsck thread, and
3105 look at its stack:
3106
3107
3108 (gdb) det
3109 Detaching from program: /home/dike/linux/2.3.26/um/linux Pid 1935
3110 (gdb) att 1980
3111 Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1980
3112 0x10070451 in __kill ()
3113 (gdb) bt
3114 #0 0x10070451 in __kill ()
3115 #1 0x10068ccd in usr1_pid (pid=1980) at process.c:30
3116 #2 0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
3117 at process_kern.c:156
3118 #3 0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
3119 at process_kern.c:161
3120 #4 0x10001d12 in schedule () at sched.c:777
3121 #5 0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
3122 #6 0x1006aa10 in __down_failed () at semaphore.c:157
3123 #7 0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
3124 #8 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
3125 #9 <signal handler called>
3126 #10 0x10155404 in errno ()
3127 #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
3128 #12 0x1006c5d8 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
3129 #13 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
3130 #14 <signal handler called>
3131 #15 0xc0fd in ?? ()
3132 #16 0x10016647 in sys_write (fd=3,
3133 buf=0x80b8800 <Address 0x80b8800 out of bounds>, count=1024)
3134 at read_write.c:159
3135 #17 0x1006d5b3 in execute_syscall (syscall=4, args=0x5006ef08)
3136 at syscall_kern.c:254
3137 #18 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
3138 #19 <signal handler called>
3139 #20 0x400dc8b0 in ?? ()
3140
3141
3142
3143
3144
3145 The interesting things here are:
3146
3147 o There are two segfaults on this stack (frames 9 and 14)
3148
3149 o The first faulting address (frame 11) is 0x50000800
3150
3151 (gdb) p (void *)1342179328
3152 $16 = (void *) 0x50000800
3153
3154
3155
3156
3157
3158 The initial faulting address is interesting because it is on the idle
3159 thread's stack. I had been seeing the idle thread segfault for no
3160 apparent reason, and the cause looked like stack corruption. In hopes
3161 of catching the culprit in the act, I had turned off all protections
3162 to that stack while the idle thread wasn't running. This apparently
3163 tripped that trap.
3164
3165
3166 However, the more immediate problem is that second segfault and I'm
3167 going to concentrate on that. First, I want to see where the fault
3168 happened, so I have to go look at the sigcontext struct in frame 8:
3169
3170
3171
3172 (gdb) up
3173 #1 0x10068ccd in usr1_pid (pid=1980) at process.c:30
3174 30 kill(pid, SIGUSR1);
3175 (gdb)
3176 #2 0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
3177 at process_kern.c:156
3178 156 usr1_pid(getpid());
3179 (gdb)
3180 #3 0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
3181 at process_kern.c:161
3182 161 _switch_to(prev, next);
3183 (gdb)
3184 #4 0x10001d12 in schedule () at sched.c:777
3185 777 switch_to(prev, next, prev);
3186 (gdb)
3187 #5 0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
3188 71 schedule();
3189 (gdb)
3190 #6 0x1006aa10 in __down_failed () at semaphore.c:157
3191 157 }
3192 (gdb)
3193 #7 0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
3194 174 segv(sc->cr2, sc->err & 2);
3195 (gdb)
3196 #8 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
3197 182 segv_handler(sc);
3198 (gdb) p *sc
3199 Cannot access memory at address 0x0.
3200
3201
3202
3203
3204 That's not very useful, so I'll try a more manual method:
3205
3206
3207 (gdb) p *((struct sigcontext *) (&sig + 1))
3208 $19 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
3209 __dsh = 0, edi = 1342179328, esi = 1350378548, ebp = 1342630440,
3210 esp = 1342630420, ebx = 1348150624, edx = 1280, ecx = 0, eax = 0,
3211 trapno = 14, err = 4, eip = 268480945, cs = 35, __csh = 0, eflags = 66118,
3212 esp_at_signal = 1342630420, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
3213 cr2 = 1280}
3214
3215
3216
3217 The ip is in handle_mm_fault:
3218
3219
3220 (gdb) p (void *)268480945
3221 $20 = (void *) 0x1000b1b1
3222 (gdb) i sym $20
3223 handle_mm_fault + 57 in section .text
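
What 'i sym' does here is find the nearest symbol at or below the address and report the distance from it. A minimal Python sketch of that lookup follows. It is not how gdb is implemented; the two-entry table is an assumption built from the gdb output in this chapter (each start address is the reported address minus the reported offset), and a real symbol table would also carry sizes so that addresses past the end of a function are not misattributed.

```python
import bisect

# (start address, name), sorted by address. Start addresses are derived
# from the transcripts: 0x1000b1b1 - 57 and 0x1001c2b2 - 1090.
SYMBOLS = [
    (0x1000b178, "handle_mm_fault"),
    (0x1001be70, "block_write"),
]


def lookup(addr):
    """Return (name, offset) for the nearest symbol at or below addr."""
    starts = [start for start, _ in SYMBOLS]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # address is below every known symbol
    start, name = SYMBOLS[i]
    return name, addr - start


print(lookup(268480945))  # ('handle_mm_fault', 57)
```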
3224
3225
3226
3227
3228
3229 Specifically, it's in pte_alloc:
3230
3231
3232 (gdb) i line *$20
3233 Line 124 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
3234 starts at address 0x1000b1b1 <handle_mm_fault+57>
3235 and ends at 0x1000b1b7 <handle_mm_fault+63>.
3236
3237
3238
3239
3240
3241 To find where in handle_mm_fault this is, I'll jump forward in the
3242 code until I see an address in that procedure:
3243
3244
3245
3246 (gdb) i line *0x1000b1c0
3247 Line 126 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
3248 starts at address 0x1000b1b7 <handle_mm_fault+63>
3249 and ends at 0x1000b1c3 <handle_mm_fault+75>.
3250 (gdb) i line *0x1000b1d0
3251 Line 131 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
3252 starts at address 0x1000b1d0 <handle_mm_fault+88>
3253 and ends at 0x1000b1da <handle_mm_fault+98>.
3254 (gdb) i line *0x1000b1e0
3255 Line 61 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
3256 starts at address 0x1000b1da <handle_mm_fault+98>
3257 and ends at 0x1000b1e1 <handle_mm_fault+105>.
3258 (gdb) i line *0x1000b1f0
3259 Line 134 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
3260 starts at address 0x1000b1f0 <handle_mm_fault+120>
3261 and ends at 0x1000b200 <handle_mm_fault+136>.
3262 (gdb) i line *0x1000b200
3263 Line 135 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
3264 starts at address 0x1000b200 <handle_mm_fault+136>
3265 and ends at 0x1000b208 <handle_mm_fault+144>.
3266 (gdb) i line *0x1000b210
3267 Line 139 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
3268 starts at address 0x1000b210 <handle_mm_fault+152>
3269 and ends at 0x1000b219 <handle_mm_fault+161>.
3270 (gdb) i line *0x1000b220
3271 Line 1168 of "memory.c" starts at address 0x1000b21e <handle_mm_fault+166>
3272 and ends at 0x1000b222 <handle_mm_fault+170>.
3273
3274
3275
3276
3277
3278 Something is apparently wrong with the page tables or vma_structs, so
3279 let's go back to frame 11 and have a look at them:
3280
3281
3282
3283 #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
3284 50 handle_mm_fault(current, vma, address, is_write);
3285 (gdb) call pgd_offset_proc(vma->vm_mm, address)
3286 $22 = (pgd_t *) 0x80a548c
3287
3288
3289
3290
3291
3292 That's pretty bogus. Page tables aren't supposed to be in process
3293 text or data areas. Let's see what's in the vma:
3294
3295
3296 (gdb) p *vma
3297 $23 = {vm_mm = 0x507d2434, vm_start = 0, vm_end = 134512640,
3298 vm_next = 0x80a4f8c, vm_page_prot = {pgprot = 0}, vm_flags = 31200,
3299 vm_avl_height = 2058, vm_avl_left = 0x80a8c94, vm_avl_right = 0x80d1000,
3300 vm_next_share = 0xaffffdb0, vm_pprev_share = 0xaffffe63,
3301 vm_ops = 0xaffffe7a, vm_pgoff = 2952789626, vm_file = 0xafffffec,
3302 vm_private_data = 0x62}
3303 (gdb) p *vma.vm_mm
3304 $24 = {mmap = 0x507d2434, mmap_avl = 0x0, mmap_cache = 0x8048000,
3305 pgd = 0x80a4f8c, mm_users = {counter = 0}, mm_count = {counter = 134904288},
3306 map_count = 134909076, mmap_sem = {count = {counter = 135073792},
3307 sleepers = -1342177872, wait = {lock = <optimized out or zero length>,
3308 task_list = {next = 0xaffffe63, prev = 0xaffffe7a},
3309 __magic = -1342177670, __creator = -1342177300}, __magic = 98},
3310 page_table_lock = {}, context = 138, start_code = 0, end_code = 0,
3311 start_data = 0, end_data = 0, start_brk = 0, brk = 0, start_stack = 0,
3312 arg_start = 0, arg_end = 0, env_start = 0, env_end = 0, rss = 1350381536,
3313 total_vm = 0, locked_vm = 0, def_flags = 0, cpu_vm_mask = 0, swap_cnt = 0,
3314 swap_address = 0, segments = 0x0}
3315
3316
3317
3318
3319
3320 This is also pretty bogus. With all of the 0x80xxxxx and 0xaffffxxx
3321 addresses, this is looking like a stack was plonked down on top of
3322 these structures. Maybe it's a stack overflow from the next page:
3323
3324
3325
3326 (gdb) p vma
3327 $25 = (struct vm_area_struct *) 0x507d2434
3328
3329
3330
3331
3332
3333 That's towards the lower quarter of the page, so that would have to
3334 have been pretty heavy stack overflow:
3335
3336
3349 (gdb) x/100x $25
3350 0x507d2434: 0x507d2434 0x00000000 0x08048000 0x080a4f8c
3351 0x507d2444: 0x00000000 0x080a79e0 0x080a8c94 0x080d1000
3352 0x507d2454: 0xaffffdb0 0xaffffe63 0xaffffe7a 0xaffffe7a
3353 0x507d2464: 0xafffffec 0x00000062 0x0000008a 0x00000000
3354 0x507d2474: 0x00000000 0x00000000 0x00000000 0x00000000
3355 0x507d2484: 0x00000000 0x00000000 0x00000000 0x00000000
3356 0x507d2494: 0x00000000 0x00000000 0x507d2fe0 0x00000000
3357 0x507d24a4: 0x00000000 0x00000000 0x00000000 0x00000000
3358 0x507d24b4: 0x00000000 0x00000000 0x00000000 0x00000000
3359 0x507d24c4: 0x00000000 0x00000000 0x00000000 0x00000000
3360 0x507d24d4: 0x00000000 0x00000000 0x00000000 0x00000000
3361 0x507d24e4: 0x00000000 0x00000000 0x00000000 0x00000000
3362 0x507d24f4: 0x00000000 0x00000000 0x00000000 0x00000000
3363 0x507d2504: 0x00000000 0x00000000 0x00000000 0x00000000
3364 0x507d2514: 0x00000000 0x00000000 0x00000000 0x00000000
3365 0x507d2524: 0x00000000 0x00000000 0x00000000 0x00000000
3366 0x507d2534: 0x00000000 0x00000000 0x507d25dc 0x00000000
3367 0x507d2544: 0x00000000 0x00000000 0x00000000 0x00000000
3368 0x507d2554: 0x00000000 0x00000000 0x00000000 0x00000000
3369 0x507d2564: 0x00000000 0x00000000 0x00000000 0x00000000
3370 0x507d2574: 0x00000000 0x00000000 0x00000000 0x00000000
3371 0x507d2584: 0x00000000 0x00000000 0x00000000 0x00000000
3372 0x507d2594: 0x00000000 0x00000000 0x00000000 0x00000000
3373 0x507d25a4: 0x00000000 0x00000000 0x00000000 0x00000000
3374 0x507d25b4: 0x00000000 0x00000000 0x00000000 0x00000000
3375
3376
3377
3378
3379
3380 It's not stack overflow. The only "stack-like" piece of this data is
3381 the vma_struct itself.
3382
3383
3384 At this point, I don't see any avenues to pursue, so I just have to
3385 admit that I have no idea what's going on. What I will do, though, is
3386 stick a trap on the segfault handler which will stop if it sees any
3387 writes to the idle thread's stack. That was the thing that happened
3388 first, and it may be that if I can catch it immediately, what's going
3389 on will be somewhat clearer.
3390
3391
3392 12.2. Episode 2: The case of the hung fsck
3393
3394 After setting a trap in the SEGV handler for accesses to the signal
3395 thread's stack, I reran the kernel.
3396
3397
3398 fsck hung again, this time by hitting the trap:
3399
3400
3415 Setting hostname uml [ OK ]
3416 Checking root filesystem
3417 /dev/fhd0 contains a file system with errors, check forced.
3418 Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
3419
3420 /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
3421 (i.e., without -a or -p options)
3422 [ FAILED ]
3423
3424 *** An error occurred during the file system check.
3425 *** Dropping you to a shell; the system will reboot
3426 *** when you leave the shell.
3427 Give root password for maintenance
3428 (or type Control-D for normal startup):
3429
3430 [root@uml /root]# fsck -y /dev/fhd0
3431 fsck -y /dev/fhd0
3432 Parallelizing fsck version 1.14 (9-Jan-1999)
3433 e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
3434 /dev/fhd0 contains a file system with errors, check forced.
3435 Pass 1: Checking inodes, blocks, and sizes
3436 Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes
3437
3438 Pass 2: Checking directory structure
3439 Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes
3440
3441 Directory inode 11858, block 0, offset 0: directory corrupted
3442 Salvage? yes
3443
3444 Missing '.' in directory inode 11858.
3445 Fix? yes
3446
3447 Missing '..' in directory inode 11858.
3448 Fix? yes
3449
3450 Untested (4127) [100fe44c]: trap_kern.c line 31
3451
3452
3453
3454
3455
3456 I need to get the signal thread to detach from pid 4127 so that I can
3457 attach to it with gdb. This is done by sending it a SIGUSR1, which is
3458 caught by the signal thread, which detaches the process:
3459
3460
3461 kill -USR1 4127
3462
3463
3464
3465
3466
3467 Now I can run gdb on it:
3468
3469
3481 ~/linux/2.3.26/um 1034: gdb linux
3482 GNU gdb 4.17.0.11 with Linux support
3483 Copyright 1998 Free Software Foundation, Inc.
3484 GDB is free software, covered by the GNU General Public License, and you are
3485 welcome to change it and/or distribute copies of it under certain conditions.
3486 Type "show copying" to see the conditions.
3487 There is absolutely no warranty for GDB. Type "show warranty" for details.
3488 This GDB was configured as "i386-redhat-linux"...
3489 (gdb) att 4127
3490 Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 4127
3491 0x10075891 in __libc_nanosleep ()
3492
3493
3494
3495
3496
3497 The backtrace shows that it was in a write and that the fault address
3498 (address in frame 3) is 0x50000800, which is right in the middle of
3499 the signal thread's stack page:
3500
3501
3502 (gdb) bt
3503 #0 0x10075891 in __libc_nanosleep ()
3504 #1 0x1007584d in __sleep (seconds=1000000)
3505 at ../sysdeps/unix/sysv/linux/sleep.c:78
3506 #2 0x1006ce9a in stop () at user_util.c:191
3507 #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
3508 #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
3509 #5 0x1006c63c in kern_segv_handler (sig=11) at trap_user.c:182
3510 #6 <signal handler called>
3511 #7 0xc0fd in ?? ()
3512 #8 0x10016647 in sys_write (fd=3, buf=0x80b8800 "R.", count=1024)
3513 at read_write.c:159
3514 #9 0x1006d603 in execute_syscall (syscall=4, args=0x5006ef08)
3515 at syscall_kern.c:254
3516 #10 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
3517 #11 <signal handler called>
3518 #12 0x400dc8b0 in ?? ()
3519 #13 <signal handler called>
3520 #14 0x400dc8b0 in ?? ()
3521 #15 0x80545fd in ?? ()
3522 #16 0x804daae in ?? ()
3523 #17 0x8054334 in ?? ()
3524 #18 0x804d23e in ?? ()
3525 #19 0x8049632 in ?? ()
3526 #20 0x80491d2 in ?? ()
3527 #21 0x80596b5 in ?? ()
3528 (gdb) p (void *)1342179328
3529 $3 = (void *) 0x50000800
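
To see why 0x50000800 counts as the middle of a page, the address can be split into a page base and an offset. The sketch below assumes 4 KB pages, as on i386; the function name is made up for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages, as on i386


def split_address(addr):
    """Return (page base, offset within the page) for a virtual address."""
    return addr & ~(PAGE_SIZE - 1), addr & (PAGE_SIZE - 1)


base, offset = split_address(1342179328)
print(hex(base), hex(offset), offset / PAGE_SIZE)  # 0x50000000 0x800 0.5
```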
3530
3531
3532
3533
3534
3535 Going up the stack to the segv_handler frame and looking at where in
3536 the code the access happened shows that it happened near line 110 of
3537 block_dev.c:
3538
3539
3547 (gdb) up
3548 #1 0x1007584d in __sleep (seconds=1000000)
3549 at ../sysdeps/unix/sysv/linux/sleep.c:78
3550 ../sysdeps/unix/sysv/linux/sleep.c:78: No such file or directory.
3551 (gdb)
3552 #2 0x1006ce9a in stop () at user_util.c:191
3553 191 while(1) sleep(1000000);
3554 (gdb)
3555 #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
3556 31 KERN_UNTESTED();
3557 (gdb)
3558 #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
3559 174 segv(sc->cr2, sc->err & 2);
3560 (gdb) p *sc
3561 $1 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
3562 __dsh = 0, edi = 1342179328, esi = 134973440, ebp = 1342631484,
3563 esp = 1342630864, ebx = 256, edx = 0, ecx = 256, eax = 1024, trapno = 14,
3564 err = 6, eip = 268550834, cs = 35, __csh = 0, eflags = 66070,
3565 esp_at_signal = 1342630864, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
3566 cr2 = 1342179328}
3567 (gdb) p (void *)268550834
3568 $2 = (void *) 0x1001c2b2
3569 (gdb) i sym $2
3570 block_write + 1090 in section .text
3571 (gdb) i line *$2
3572 Line 209 of "/home/dike/linux/2.3.26/um/include/asm/arch/string.h"
3573 starts at address 0x1001c2a1 <block_write+1073>
3574 and ends at 0x1001c2bf <block_write+1103>.
3575 (gdb) i line *0x1001c2c0
3576 Line 110 of "block_dev.c" starts at address 0x1001c2bf <block_write+1103>
3577 and ends at 0x1001c2e3 <block_write+1139>.
3578
3579
3580
3581
3582
3583 Looking at the source shows that the fault happened during a call to
3584 copy_from_user to copy the data into the kernel:
3585
3586
3587 107 count -= chars;
3588 108 copy_from_user(p,buf,chars);
3589 109 p += chars;
3590 110 buf += chars;
3591
3592
3593
3594
3595
3596 p is the pointer which must contain 0x50000800, since buf contains
3597 0x80b8800 (frame 8 above). It is defined as:
3598
3599
3600 p = offset + bh->b_data;
3601
3602
3603
3604
3605
3606 I need to figure out what bh is, and it just so happens that bh is
3607 passed as an argument to mark_buffer_uptodate and mark_buffer_dirty a
3608 few lines later, so I do a little disassembly:
3609
3610
3611
3612
3613 (gdb) disas 0x1001c2bf 0x1001c2e0
3614 Dump of assembler code from 0x1001c2bf to 0x1001c2d0:
3615 0x1001c2bf <block_write+1103>: addl %eax,0xc(%ebp)
3616 0x1001c2c2 <block_write+1106>: movl 0xfffffdd4(%ebp),%edx
3617 0x1001c2c8 <block_write+1112>: btsl $0x0,0x18(%edx)
3618 0x1001c2cd <block_write+1117>: btsl $0x1,0x18(%edx)
3619 0x1001c2d2 <block_write+1122>: sbbl %ecx,%ecx
3620 0x1001c2d4 <block_write+1124>: testl %ecx,%ecx
3621 0x1001c2d6 <block_write+1126>: jne 0x1001c2e3 <block_write+1139>
3622 0x1001c2d8 <block_write+1128>: pushl $0x0
3623 0x1001c2da <block_write+1130>: pushl %edx
3624 0x1001c2db <block_write+1131>: call 0x1001819c <__mark_buffer_dirty>
3625 End of assembler dump.
3626
3627
3628
3629
3630
3631 At that point, bh is in %edx (address 0x1001c2da), which is loaded
3632 at 0x1001c2c2 from the stack slot at %ebp + 0xfffffdd4, so I figure out
3633 exactly what that is, taking %ebp from the sigcontext_struct above:
3634
3635
3636 (gdb) p (void *)1342631484
3637 $5 = (void *) 0x5006ee3c
3638 (gdb) p 0x5006ee3c+0xfffffdd4
3639 $6 = 1342630928
3640 (gdb) p (void *)$6
3641 $7 = (void *) 0x5006ec10
3642 (gdb) p *((void **)$7)
3643 $8 = (void *) 0x50100200
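
The displacement 0xfffffdd4 is really a small negative number, and the addition only works because it wraps around at 32 bits. A short Python sketch of the same calculation, with helper names invented for the example:

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def effective_address(base, displacement):
    """base + displacement with 32-bit wraparound."""
    return (base + displacement) & MASK32


ebp = 1342631484                                  # ebp from the sigcontext above
print(to_signed32(0xfffffdd4))                    # -556
print(hex(effective_address(ebp, 0xfffffdd4)))    # 0x5006ec10
```

So the slot holding bh sits 556 bytes below %ebp, which is what gdb's 32-bit arithmetic produced in $6 and $7.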
3644
3645
3646
3647
3648
3649 Now, I look at the structure to see what's in it, and particularly,
3650 what its b_data field contains:
3651
3652
3653 (gdb) p *((struct buffer_head *)0x50100200)
3654 $13 = {b_next = 0x50289380, b_blocknr = 49405, b_size = 1024, b_list = 0,
3655 b_dev = 15872, b_count = {counter = 1}, b_rdev = 15872, b_state = 24,
3656 b_flushtime = 0, b_next_free = 0x501001a0, b_prev_free = 0x50100260,
3657 b_this_page = 0x501001a0, b_reqnext = 0x0, b_pprev = 0x507fcf58,
3658 b_data = 0x50000800 "", b_page = 0x50004000,
3659 b_end_io = 0x10017f60 <end_buffer_io_sync>, b_dev_id = 0x0,
3660 b_rsector = 98810, b_wait = {lock = <optimized out or zero length>,
3661 task_list = {next = 0x50100248, prev = 0x50100248}, __magic = 1343226448,
3662 __creator = 0}, b_kiobuf = 0x0}
3663
3664
3665
3666
3667
3668 The b_data field is indeed 0x50000800, so the question becomes how
3669 that happened. The rest of the structure looks fine, so this probably
3670 is not a case of data corruption. It happened on purpose somehow.
3671
3672
3673 The b_page field is a pointer to the page_struct representing the
3674 0x50000000 page. Looking at it shows the kernel's idea of the state
3675 of that page:
3676
3677
3678
3679 (gdb) p *$13.b_page
3680 $17 = {list = {next = 0x50004a5c, prev = 0x100c5174}, mapping = 0x0,
3681 index = 0, next_hash = 0x0, count = {counter = 1}, flags = 132, lru = {
3682 next = 0x50008460, prev = 0x50019350}, wait = {
3683 lock = <optimized out or zero length>, task_list = {next = 0x50004024,
3684 prev = 0x50004024}, __magic = 1342193708, __creator = 0},
3685 pprev_hash = 0x0, buffers = 0x501002c0, virtual = 1342177280,
3686 zone = 0x100c5160}
3687
3688
3689
3690
3691
3692 Some sanity-checking: the virtual field shows the "virtual" address of
3693 this page, which in this kernel is the same as its "physical" address,
3694 and the page_struct itself should be mem_map[0], since it represents
3695 the first page of memory:
3696
3697
3698
3699 (gdb) p (void *)1342177280
3700 $18 = (void *) 0x50000000
3701 (gdb) p mem_map
3702 $19 = (mem_map_t *) 0x50004000
3703
3704
3705
3706
3707
3708 These check out fine.
3709
3710
3711 Now to check out the page_struct itself. In particular, the flags
3712 field shows whether the page is considered free or not:
3713
3714
3715 (gdb) p (void *)132
3716 $21 = (void *) 0x84
3717
3718
3719
3720
3721
3722 The "reserved" bit is the high bit, which is definitely not set, so
3723 the kernel considers the signal stack page to be free and available to
3724 be used.
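
The same check can be done by decoding the flags value bit by bit. This
is a rough sketch: the helper names are invented here, and it assumes
that "the high bit" means bit 31 of a 32-bit flags word.

```shell
# set_bits VALUE: print the numbers of the bits set in VALUE, lowest first.
set_bits() {
    value=$(( $1 ))            # accepts decimal or 0x-prefixed hex
    bit=0
    out=""
    while [ "$value" -ne 0 ]; do
        if [ $(( value & 1 )) -eq 1 ]; then
            out="$out $bit"
        fi
        value=$(( value >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "${out# }"
}

# bit_is_set VALUE BIT: succeed if bit number BIT of VALUE is set.
bit_is_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

flags=132                      # 0x84, the flags value gdb printed
echo "bits set: $(set_bits "$flags")"

# Assumption: the "reserved" bit is bit 31, the high bit of a 32-bit word.
if bit_is_set "$flags" 31; then
    echo "page is reserved"
else
    echo "page is not reserved"
fi
```

Only bits 2 and 7 are set in 0x84, so the high bit is clear.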
3725
3726
3727 At this point, I jump to conclusions and start looking at my early
3728 boot code, because that's where that page is supposed to be reserved.
3729
3730
3731 In my setup_arch procedure, I have the following code which looks just
3732 fine:
3733
3734
3735
3736 bootmap_size = init_bootmem(start_pfn, end_pfn - start_pfn);
3737 free_bootmem(__pa(low_physmem) + bootmap_size, high_physmem - low_physmem);
3738
3739
3740
3741
3742
3743 Two stack pages have already been allocated, and low_physmem points to
3744 the third page, which is the beginning of free memory.
3745 The init_bootmem call declares the entire memory to the boot memory
3746 manager, which marks it all reserved. The free_bootmem call frees up
3747 all of it, except for the first two pages. This looks correct to me.
3748
3749
3750 So, I decide to see init_bootmem run and make sure that it is marking
3751 those first two pages as reserved. I never get that far.
3752
3753
3754 Stepping into init_bootmem and printing bootmem_map, before looking
3755 at what it contains, shows the following:
3756
3757
3758
3759 (gdb) p bootmem_map
3760 $3 = (void *) 0x50000000
3761
3762
3763
3764
3765
3766 Aha! The light dawns. That first page is doing double duty as a
3767 stack and as the boot memory map. The last thing that the boot memory
3768 manager does is to free the pages used by its memory map, so this page
3769 is getting freed even though it's marked as reserved.
3770
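The sequence that frees the page can be replayed in a toy model. This
is not kernel code: the page numbers, the one-page size of the map, and
the names below are all assumptions picked to mirror the description
above.

```shell
# Toy model only; none of these names exist in the kernel.
first_free_page=2    # low_physmem: pages 0 and 1 are the two stack pages
map_first_page=0     # bootmem_map is at 0x50000000, the first page
map_pages=1          # assumed size of the boot memory map

# final_state PAGE: print "reserved" or "free" for PAGE after boot.
final_state() {
    state=reserved                       # init_bootmem: everything reserved
    if [ "$1" -ge "$first_free_page" ]; then
        state=free                       # free_bootmem: third page onward
    fi
    if [ "$1" -ge "$map_first_page" ] &&
       [ "$1" -lt $(( map_first_page + map_pages )) ]; then
        state=free                       # map pages are released last
    fi
    echo "$state"
}

for page in 0 1 2 3; do
    echo "page $page: $(final_state "$page")"
done
```

Page 0 comes out free even though it was never handed to free_bootmem,
while page 1, the other stack page, stays reserved.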
3771
3772 The fix was to initialize the boot memory manager before allocating
3773 those two stack pages, and then allocate them through the boot memory
3774 manager. After doing this, and fixing a couple of subsequent buglets,
3775 the stack corruption problem disappeared.
3776
3777
3778
3779
3780
3781 13. What to do when UML doesn't work
3782
3783
3784
3785
3786 13.1. Strange compilation errors when you build from source
3787
3788 As of test11, it is necessary to have "ARCH=um" in the environment or
3789 on the make command line for all steps in building UML, including
3790 clean, distclean, or mrproper, config, menuconfig, or xconfig, dep,
3791 and linux. If you forget for any of them, the i386 build seems to
3792 contaminate the UML build. If this happens, start from scratch with
3793
3794
3795 host% make mrproper ARCH=um
3797
3798
3799
3800
3801 and repeat the build process with ARCH=um on all the steps.
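
One way to avoid forgetting it is a small shell wrapper. This is a
sketch, not something that ships with UML: umake and UMAKE_CMD are names
invented for this example, and UMAKE_CMD exists only so the wrapper can
be tried without running a real build.

```shell
# umake: run make with ARCH=um always on the command line.
# UMAKE_CMD is a hypothetical override, used for the dry run below.
umake() {
    ${UMAKE_CMD:-make} ARCH=um "$@"
}

# Dry run in a subshell: echo stands in for make, so this only prints
# the command lines for a from-scratch build.
(
    UMAKE_CMD=echo
    for target in mrproper menuconfig dep linux; do
        umake "$target"
    done
)
```

With UMAKE_CMD unset, "umake mrproper" runs "make ARCH=um mrproper" in
the current directory.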
3802
3803
3804 See ``Compiling the kernel and modules'' for more details.
3805
3806
3807 Another cause of strange compilation errors is building UML in
3808 /usr/src/linux. If you do this, the first thing you need to do is
3809 clean up the mess you made. The /usr/src/linux/include/asm link will
3810 now point to /usr/src/linux/include/asm-um. Make it point back to
3811 /usr/src/linux/include/asm-i386. Then, move your UML pool someplace
3812 else and build it there. Also see below, where a more specific set of
3813 symptoms is described.
3814
3815
3816
3817 13.3. A variety of panics and hangs with /tmp on a reiserfs
3818 filesystem
3819
3820 I saw this on reiserfs 3.5.21 and it seems to be fixed in 3.5.27.
3821 Panics preceded by
3822
3823
3824 Detaching pid nnnn
3825
3826
3827
3828 are diagnostic of this problem. This is a reiserfs bug which causes a
3829 thread to occasionally read stale data from a mmapped page shared with
3830 another thread. The fix is to upgrade the filesystem or to have /tmp
3831 be an ext2 filesystem.
3832
3833
3834
3835 13.4. The compile fails with errors about conflicting types for
3836 'open', 'dup', and 'waitpid'
3837
3838 This happens when you build in /usr/src/linux. The UML build makes
3839 the include/asm link point to include/asm-um. /usr/include/asm points
3840 to /usr/src/linux/include/asm, so when that link gets moved, files
3841 which need to include the asm-i386 versions of headers get the
3842 incompatible asm-um versions. The fix is to move the include/asm link
3843 back to include/asm-i386 and to do UML builds someplace else.
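
Moving the link back can be sketched as follows. The function name is
made up for this example, and the demonstration runs on a scratch
directory rather than the real /usr/src/linux, so it is safe to try.

```shell
# fix_asm_link TREE: point TREE/include/asm back at asm-i386.
# (fix_asm_link is a hypothetical helper name.)
fix_asm_link() {
    rm -f "$1/include/asm" && ln -s asm-i386 "$1/include/asm"
}

# Build a scratch tree in the state a UML build leaves behind.
tree=$(mktemp -d)
mkdir -p "$tree/include/asm-i386" "$tree/include/asm-um"
ln -s asm-um "$tree/include/asm"

fix_asm_link "$tree"
readlink "$tree/include/asm"    # prints asm-i386
rm -rf "$tree"
```

On a real host the argument would be /usr/src/linux, and the command
would most likely need to be run as root.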
3844
3845
3846
3847 13.5. UML doesn't work when /tmp is an NFS filesystem
3848
3849 This seems to be a similar situation to the ReiserFS problem above.
3850 Some versions of NFS seem not to handle mmap correctly, which UML
3851 depends on. The workaround is to have /tmp be a non-NFS directory.
3852
3853
3854 13.6. UML hangs on boot when compiled with gprof support
3855
3856 If you build UML with gprof support and, early in the boot, it does
3857 this
3858
3859
3860 kernel BUG at page_alloc.c:100!
3861
3862
3863
3864
3865 you have a buggy gcc. You can work around the problem by removing
3866 UM_FASTCALL from CFLAGS in arch/um/Makefile-i386. This will open up
3867 another bug, but that one is fairly hard to reproduce.
3868
3869
3870
3871 13.7. syslogd dies with a SIGTERM on startup
3872
3873 The exact boot error depends on the distribution that you're booting,
3874 but Debian produces this:
3875
3876
3877 /etc/rc2.d/S10sysklogd: line 49: 93 Terminated
3878 start-stop-daemon --start --quiet --exec /sbin/syslogd -- $SYSLOGD
3879
3880
3881
3882
3883 This is a syslogd bug. There's a race between a parent process
3884 installing a signal handler and its child sending the signal. See
3885 this uml-devel post
3886 <http://www.geocrawler.com/lists/3/SourceForge/709/0/6612801> for the details.
3887
3888
3889
3890 13.8. TUN/TAP networking doesn't work on a 2.4 host
3891
3892 There are a couple of problems which were pointed out
3893 <http://www.geocrawler.com/lists/3/SourceForge/597/0/> by Tim
3894 Robinson <timro at trkr dot net>:
3895
3896 +o It doesn't work on hosts running 2.4.7 (or thereabouts) or earlier.
3897 The fix is to upgrade to something more recent and then read the
3898 next item.
3899
3900 +o If you see
3901
3902
3903 File descriptor in bad state
3904
3905
3906
3907 when you bring up the device inside UML, you have a header mismatch
3908 between the original kernel and the upgraded one. Make /usr/src/linux
3909 point at the new headers. This will only be a problem if you build
3910 uml_net yourself.
3911
3912
3913
3914 13.9. You can network to the host but not to other machines on the
3915 net
3916
3917 If you can connect to the host, and the host can connect to UML, but
3918 you cannot connect to any other machines, then you may need to enable
3919 IP Masquerading on the host. Usually this is only experienced when
3920 using private IP addresses (192.168.x.x or 10.x.x.x) for host/UML
3921 networking, rather than the public address space that your host is
3922 connected to. UML does not enable IP Masquerading, so you will need
3923 to create a static rule to enable it:
3924
3925
3926 host% iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
3928
3929
3930
3931
3932 Replace eth0 with the interface that you use to talk to the rest of
3933 the world.
3934
3935
3936 Documentation on IP Masquerading, and SNAT, can be found at
3937 www.netfilter.org <http://www.netfilter.org> .
3938
3939
3940 If you can reach the local net, but not the outside Internet, then
3941 that is usually a routing problem. The UML needs a default route:
3942
3943
3944 UML# route add default gw gateway IP
3946
3947
3948
3949
3950 The gateway IP can be any machine on the local net that knows how to
3951 reach the outside world. Usually, this is the host or the local
3952 network's gateway.
3953
3954
3955 Occasionally, we hear from someone who can reach some machines, but
3956 not others on the same net, or who can reach some ports on other
3957 machines, but not others. These are usually caused by strange
3958 firewalling somewhere between the UML and the other box. You track
3959 this down by running tcpdump on every interface the packets travel
3960 over and seeing where they disappear. When you find a machine that takes
3961 the packets in, but does not send them onward, that's the culprit.
3962
3963
3964
3965 13.10. I have no root and I want to scream
3966
3967 Thanks to Birgit Wahlich for telling me about this strange one. It
3968 turns out that there's a limit of six environment variables on the
3969 kernel command line. When that limit is reached or exceeded, argument
3970 processing stops, which means that the 'root=' argument that UML
3971 usually adds is not seen. The filesystem then has no idea what the
3972 root device is, so it panics.
3973
3974
3975 The fix is to put less stuff on the command line. Glomming all your
3976 setup variables into one is probably the best way to go.
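
As a sketch of what glomming might look like, the settings can be
packed into one variable and unpacked by a startup script inside UML.
The SETUP name, the name:value,name:value format, and the setup_get
helper are all assumptions made up for this example.

```shell
# Hypothetical packing: name:value pairs separated by commas, all in
# one variable, so it counts once toward the limit.
SETUP="ip:192.168.0.2,gw:192.168.0.1,hostname:uml1"

# setup_get NAME: print the value stored under NAME in $SETUP.
# Returns 1 if NAME is not present.
setup_get() {
    rest=$SETUP
    while [ -n "$rest" ]; do
        pair=${rest%%,*}                  # first remaining pair
        case $rest in
            *,*) rest=${rest#*,} ;;       # drop that pair and its comma
            *)   rest="" ;;               # that was the last pair
        esac
        case $pair in
            "$1":*) echo "${pair#*:}"; return 0 ;;
        esac
    done
    return 1
}

setup_get gw          # prints 192.168.0.1
setup_get hostname    # prints uml1
```

On the UML command line, this would be a single SETUP=... argument in
place of three separate variables.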
3977
3978
3979
3980 13.11. UML build conflict between ptrace.h and ucontext.h
3981
3982 On some older systems, /usr/include/asm/ptrace.h and
3983 /usr/include/sys/ucontext.h define the same names. So, when they're
3984 included together, the defines from one completely mess up the parsing
3985 of the other, producing errors like:
3986
3987 /usr/include/sys/ucontext.h:47: parse error before `10'
3988
3989
3990
3991
3992 plus a pile of warnings.
3993
3994
3995 This is a libc botch, which has since been fixed, and I don't see any
3996 way around it besides upgrading.
3997
3998
3999
4000 13.12. The UML BogoMips is exactly half the host's BogoMips
4001
4002 On i386 kernels, there are two ways of running the loop that is used
4003 to calculate the BogoMips rating, using the TSC if it's there or using
4004 a one-instruction loop. The TSC produces twice the BogoMips as the
4005 loop. UML uses the loop, since it has nothing resembling a TSC, and
4006 will get almost exactly the same BogoMips as a host using the loop.
4007 However, on a host with a TSC, its BogoMips will be double the loop
4008 BogoMips, and therefore double the UML BogoMips.
4009
4010
4011
4012 13.13. When you run UML, it immediately segfaults
4013
4014 If the host is configured with the 2G/2G address space split, that's
4015 why. See ``UML on 2G/2G hosts'' for the details on getting UML to
4016 run on your host.
4017
4018
4019
4020 13.14. xterms appear, then immediately disappear
4021
4022 If you're running an up to date kernel with an old release of
4023 uml_utilities, the port-helper program will not work properly, so
4024 xterms will exit straight after they appear. The solution is to
4025 upgrade to the latest release of uml_utilities. Usually this problem
4026 occurs when you have installed a packaged release of UML then compiled
4027 your own development kernel without upgrading the uml_utilities from
4028 the source distribution.
4029
4030
4031
4032 13.15. Any other panic, hang, or strange behavior
4033
4034 If you're seeing truly strange behavior, such as hangs or panics that
4035 happen in random places, or you try running the debugger to see what's
4036 happening and it acts strangely, then it could be a problem in the
4037 host kernel. If you're not running a stock Linus or -ac kernel, then
4038 try that. An early version of the preemption patch and a 2.4.10 SuSE
4039 kernel have caused very strange problems in UML.
4040
4041
4042 Otherwise, let me know about it. Send a message to one of the UML
4043 mailing lists - either the developer list - user-mode-linux-devel at
4044 lists dot sourceforge dot net (subscription info) or the user list -
4045 user-mode-linux-user at lists dot sourceforge dot net (subscription
4046 info), whichever you prefer. Don't assume that everyone knows about
4047 it and that a fix is imminent.
4048
4049
4050 If you want to be super-helpful, read ``Diagnosing Problems'' and
4051 follow the instructions contained therein.
4052 14. Diagnosing Problems
4053
4054
4055 If you get UML to crash, hang, or otherwise misbehave, you should
4056 report this on one of the project mailing lists, either the developer
4057 list - user-mode-linux-devel at lists dot sourceforge dot net
4058 (subscription info) or the user list - user-mode-linux-user at lists
4059 dot sourceforge dot net (subscription info). When you do, it is
4060 likely that I will want more information. So, it would be helpful to
4061 read the stuff below, do whatever is applicable in your case, and
4062 report the results to the list.
4063
4064
4065 For any diagnosis, you're going to need to build a debugging kernel.
4066 The binaries from this site aren't debuggable. If you haven't done
4067 this before, read about ``Compiling the kernel and modules'' and
4068 ``Kernel debugging'' first.
4069
4070
4071 14.1. Case 1 : Normal kernel panics
4072
4073 The most common case is for a normal thread to panic. To debug this,
4074 you will need to run it under the debugger (add 'debug' to the command
4075 line). An xterm will start up with gdb running inside it. Continue
4076 it when it stops in start_kernel and make it crash. Now ^C gdb and
4077 get a backtrace.
4078
4079 If the panic was a "Kernel mode fault", then there will be a segv
4080 frame on the stack and I'm going to want some more information. The
4081 stack might look something like this:
4082
4083
4084 (UML gdb) backtrace
4085 #0 0x1009bf76 in __sigprocmask (how=1, set=0x5f347940, oset=0x0)
4086 at ../sysdeps/unix/sysv/linux/sigprocmask.c:49
4087 #1 0x10091411 in change_sig (signal=10, on=1) at process.c:218
4088 #2 0x10094785 in timer_handler (sig=26) at time_kern.c:32
4089 #3 0x1009bf38 in __restore ()
4090 at ../sysdeps/unix/sysv/linux/i386/sigaction.c:125
4091 #4 0x1009534c in segv (address=8, ip=268849158, is_write=2, is_user=0)
4092 at trap_kern.c:66
4093 #5 0x10095c04 in segv_handler (sig=11) at trap_user.c:285
4094 #6 0x1009bf38 in __restore ()
4095
4096
4097
4098
4099 I'm going to want to see the symbol and line information for the value
4100 of ip in the segv frame. In this case, you would do the following:
4101
4102
4103 (UML gdb) i sym 268849158
4104
4105
4106
4107
4108 and
4109
4110
4111 (UML gdb) i line *268849158
4112
4113
4114
4115
4116 The reason for this is that the __restore frame right above the
4117 segv_handler frame is hiding the frame that actually segfaulted. So,
4118 I have to get that information from the faulting ip.
4119
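gdb prints the ip argument of the segv frame in decimal, while the
frame addresses in the backtrace are in hex. Converting it makes the
two easier to compare. A small sketch, with to_hex being a helper name
invented here:

```shell
# to_hex DECIMAL: print a decimal address in the hex form used by the
# frame addresses in a gdb backtrace.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex 268849158    # the ip from the segv frame; prints 0x10065006
```

The result falls in the same 0x100xxxxx range as the other frames in
the backtrace above.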
4120
4121 14.2. Case 2 : Tracing thread panics
4122
4123 The less common and more painful case is when the tracing thread
4124 panics. In this case, the kernel debugger will be useless because it
4125 needs a healthy tracing thread in order to work. The first thing to
4126 do is get a backtrace from the tracing thread. This is done by
4127 figuring out what its pid is, firing up gdb, and attaching it to that
4128 pid. You can figure out the tracing thread pid by looking at the
4129 first line of the console output, which will look like this:
4130
4131
4132 tracing thread pid = 15851
4133
4134
4135
4136
4137 or by running ps on the host and finding the line that looks like
4138 this:
4139
4140
4141 jdike 15851 4.5 0.4 132568 1104 pts/0 S 21:34 0:05 ./linux [(tracing thread)]
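
Picking the pid out of that ps line can be scripted. This is a sketch
under the assumption that the ps output has the pid in the second
column, as in the line above; tracing_pid is a name made up for this
example.

```shell
# tracing_pid: read ps output on stdin and print the pid (second
# column) of the first line marked "(tracing thread)".
tracing_pid() {
    awk '/\(tracing thread\)/ { print $2; exit }'
}

# The sample line from the text above.
sample='jdike 15851 4.5 0.4 132568 1104 pts/0 S 21:34 0:05 ./linux [(tracing thread)]'
echo "$sample" | tracing_pid    # prints 15851
```

On a live host, the input would come from something like "ps aux"
piped into tracing_pid, and the result is the pid to attach gdb to.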
4142
4143
4144
4145
4146 If the panic was 'segfault in signals', then follow the instructions
4147 above for collecting information about the location of the seg fault.
4148
4149
4150 If the tracing thread flaked out all by itself, then send that
4151 backtrace in and wait for our crack debugging team to fix the problem.
4152
4153
4154 14.3. Case 3 : Tracing thread panics caused by other threads
4155
4156 However, there are cases where the misbehavior of another thread
4157 caused the problem. The most common panic of this type is:
4158
4159
4160 wait_for_stop failed to wait for <pid> to stop with <signal number>
4161
4162
4163
4164
4165 In this case, you'll need to get a backtrace from the process
4166 mentioned in the panic, which is complicated by the fact that the
4167 kernel debugger is defunct and, without some fancy footwork, another
4168 gdb can't attach to it. So, this is how the fancy footwork goes:
4169
4170 In a shell:
4171
4172
4173 host% kill -STOP pid
4174
4175
4176
4177
4178 Run gdb on the tracing thread as described in case 2 and do:
4179
4180
4181 (host gdb) call detach(pid)
4182
4183
4184 If you get a segfault, do it again. It always works the second time.
4185
4186 Detach from the tracing thread and attach to that other thread:
4187
4188
4189 (host gdb) detach
4190
4191
4192
4193
4194
4195
4196 (host gdb) attach pid
4197
4198
4199
4200
4201 If gdb hangs when attaching to that process, go back to a shell and
4202 do:
4203
4204
4205 host% kill -CONT pid
4207
4208
4209
4210
4211 And then get the backtrace:
4212
4213
4214 (host gdb) backtrace
4215
4216
4217
4218
4219
4220 14.4. Case 4 : Hangs
4221
4222 Hangs seem to be fairly rare, but they sometimes happen. When a hang
4223 happens, we need a backtrace from the offending process. Run the
4224 kernel debugger as described in case 1 and get a backtrace. If the
4225 current process is not the idle thread, then send in the backtrace.
4226 You can tell that it's the idle thread if the stack looks like this:
4227
4228
4229 #0 0x100b1401 in __libc_nanosleep ()
4230 #1 0x100a2885 in idle_sleep (secs=10) at time.c:122
4231 #2 0x100a546f in do_idle () at process_kern.c:445
4232 #3 0x100a5508 in cpu_idle () at process_kern.c:471
4233 #4 0x100ec18f in start_kernel () at init/main.c:592
4234 #5 0x100a3e10 in start_kernel_proc (unused=0x0) at um_arch.c:71
4235 #6 0x100a383f in signal_tramp (arg=0x100a3dd8) at trap_user.c:50
4236
4237
4238
4239
4240 If this is the case, then some other process is at fault, and went to
4241 sleep when it shouldn't have. Run ps on the host and figure out which
4242 process should not have gone to sleep and stayed asleep. Then attach
4243 to it with gdb and get a backtrace as described in case 3.
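
If the backtrace has been saved as text, the idle-thread check can be
automated. A rough sketch, assuming that a cpu_idle frame is enough to
identify the idle thread; is_idle_backtrace is a name invented here.

```shell
# is_idle_backtrace: read a gdb backtrace on stdin and succeed if any
# frame is in cpu_idle.
is_idle_backtrace() {
    grep -q ' in cpu_idle ()'
}

# The first frames of the idle thread stack shown above.
sample='#0 0x100b1401 in __libc_nanosleep ()
#1 0x100a2885 in idle_sleep (secs=10) at time.c:122
#2 0x100a546f in do_idle () at process_kern.c:445
#3 0x100a5508 in cpu_idle () at process_kern.c:471'

if printf '%s\n' "$sample" | is_idle_backtrace; then
    echo "idle thread: use ps to find the process that is stuck asleep"
else
    echo "not the idle thread: send in this backtrace"
fi
```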
4244
4245
4246
4247
4248
4249
4250 15. Thanks
4251
4252
4253 A number of people have helped this project in various ways, and this
4254 page gives recognition where recognition is due.
4255
4256
4257 If you're listed here and you would prefer a real link on your name,
4258 or no link at all, instead of the despammed email address pseudo-link,
4259 let me know.
4260
4261
4262 If you're not listed here and you think maybe you should be, please
4263 let me know that as well. I try to get everyone, but sometimes my
4264 bookkeeping lapses and I forget about contributions.
4265
4266
4267 15.1. Code and Documentation
4268
4269 Rusty Russell <rusty at linuxcare.com.au> -
4270
4271 +o wrote the HOWTO
4272 <http://user-mode-linux.sourceforge.net/UserModeLinux-HOWTO.html>
4273
4274 +o prodded me into making this project official and putting it on
4275 SourceForge
4276
4277 +o came up with the way cool UML logo
4278 <http://user-mode-linux.sourceforge.net/uml-small.png>
4279
4280 +o redid the config process
4281
4282
4283 Peter Moulder <reiter at netspace.net.au> - Fixed my config and build
4284 processes, and added some useful code to the block driver
4285
4286
4287 Bill Stearns <wstearns at pobox.com> -
4288
4289 +o HOWTO updates
4290
4291 +o lots of bug reports
4292
4293 +o lots of testing
4294
4295 +o dedicated a box (uml.ists.dartmouth.edu) to support UML development
4296
4297 +o wrote the mkrootfs script, which allows bootable filesystems of
4298 RPM-based distributions to be cranked out
4299
4300 +o cranked out a large number of filesystems with said script
4301
4302
4303 Jim Leu <jleu at mindspring.com> - Wrote the virtual ethernet driver
4304 and associated usermode tools
4305
4306 Lars Brinkhoff <http://lars.nocrew.org/> - Contributed the ptrace
4307 proxy from his own project <http://a386.nocrew.org/> to allow easier
4308 kernel debugging
4309
4310
4311 Andrea Arcangeli <andrea at suse.de> - Redid some of the early boot
4312 code so that it would work on machines with Large File Support
4313
4314
4315 Chris Emerson <http://www.chiark.greenend.org.uk/~cemerson/> - Did
4316 the first UML port to Linux/ppc
4317
4318
4319 Harald Welte <laforge at gnumonks.org> - Wrote the multicast
4320 transport for the network driver
4321
4322
4323 Jorgen Cederlof - Added special file support to hostfs
4324
4325
4326 Greg Lonnon <glonnon at ridgerun dot com> - Changed the ubd driver
4327 to allow it to layer a COW file on a shared read-only filesystem and
4328 wrote the iomem emulation support
4329
4330
4331 Henrik Nordstrom <http://hem.passagen.se/hno/> - Provided a variety
4332 of patches, fixes, and clues
4333
4334
4335 Lennert Buytenhek - Contributed various patches, a rewrite of the
4336 network driver, the first implementation of the mconsole driver, and
4337 did the bulk of the work needed to get SMP working again.
4338
4339
4340 Yon Uriarte - Fixed the TUN/TAP network backend while I slept.
4341
4342
4343 Adam Heath - Made a bunch of nice cleanups to the initialization code,
4344 plus various other small patches.
4345
4346
4347 Matt Zimmerman - Matt volunteered to be the UML Debian maintainer and
4348 is doing a real nice job of it. He also noticed and fixed a number of
4349 actually and potentially exploitable security holes in uml_net. Plus
4350 the occasional patch. I like patches.
4351
4352
4353 James McMechan - James seems to have taken over maintenance of the ubd
4354 driver and is doing a nice job of it.
4355
4356
4357 Chandan Kudige - wrote the umlgdb script which automates the reloading
4358 of module symbols.
4359
4360
4361 Steve Schmidtke - wrote the UML slirp transport and hostaudio drivers,
4362 enabling UML processes to access audio devices on the host. He also
4363 submitted patches for the slip transport and lots of other things.
4364
4365
4366 David Coulson <http://davidcoulson.net> -
4367
4368 +o Set up the usermodelinux.org <http://usermodelinux.org> site,
4369 which is a great way of keeping the UML user community on top of
4370 UML goings-on.
4371
4372 +o Site documentation and updates
4373
4374 +o Nifty little UML management daemon UMLd
4375 <http://uml.openconsultancy.com/umld/>
4376
4377 +o Lots of testing and bug reports
4378
4379
4380
4381
4382 15.2. Flushing out bugs
4383
4384
4385
4386 +o Yuri Pudgorodsky
4387
4388 +o Gerald Britton
4389
4390 +o Ian Wehrman
4391
4392 +o Gord Lamb
4393
4394 +o Eugene Koontz
4395
4396 +o John H. Hartman
4397
4398 +o Anders Karlsson
4399
4400 +o Daniel Phillips
4401
4402 +o John Fremlin
4403
4404 +o Rainer Burgstaller
4405
4406 +o James Stevenson
4407
4408 +o Matt Clay
4409
4410 +o Cliff Jefferies
4411
4412 +o Geoff Hoff
4413
4414 +o Lennert Buytenhek
4415
4416 +o Al Viro
4417
4418 +o Frank Klingenhoefer
4419
4420 +o Livio Baldini Soares
4421
4422 +o Jon Burgess
4423
4424 +o Petru Paler
4425
4426 +o Paul
4427
4428 +o Chris Reahard
4429
4430 +o Sverker Nilsson
4431
4432 +o Gong Su
4433
4434 +o johan verrept
4435
4436 +o Bjorn Eriksson
4437
4438 +o Lorenzo Allegrucci
4439
4440 +o Muli Ben-Yehuda
4441
4442 +o David Mansfield
4443
4444 +o Howard Goff
4445
4446 +o Mike Anderson
4447
4448 +o John Byrne
4449
4450 +o Sapan J. Batia
4451
4452 +o Iris Huang
4453
4454 +o Jan Hudec
4455
4456 +o Voluspa
4457
4458
4459
4460
4461 15.3. Buglets and clean-ups
4462
4463
4464
4465 +o Dave Zarzycki
4466
4467 +o Adam Lazur
4468
4469 +o Boria Feigin
4470
4471 +o Brian J. Murrell
4472
4473 +o JS
4474
4475 +o Roman Zippel
4476
4477 +o Wil Cooley
4478
4479 +o Ayelet Shemesh
4480
4481 +o Will Dyson
4482
4483 +o Sverker Nilsson
4484
4485 +o dvorak
4486
4487 +o v.naga srinivas
4488
4489 +o Shlomi Fish
4490
4491 +o Roger Binns
4492
4493 +o johan verrept
4494
4495 +o MrChuoi
4496
4497 +o Peter Cleve
4498
4499 +o Vincent Guffens
4500
4501 +o Nathan Scott
4502
4503 +o Patrick Caulfield
4504
4505 +o jbearce
4506
4507 +o Catalin Marinas
4508
4509 +o Shane Spencer
4510
4511 +o Zou Min
4512
4513
4514 +o Ryan Boder
4515
4516 +o Lorenzo Colitti
4517
4518 +o Gwendal Grignou
4519
4520 +o Andre' Breiler
4521
4522 +o Tsutomu Yasuda
4523
4524
4525
4526 15.4. Case Studies
4527
4528
4529 +o Jon Wright
4530
4531 +o William McEwan
4532
4533 +o Michael Richardson
4534
4535
4536
4537 15.5. Other contributions
4538
4539
4540 Bill Carr <Bill.Carr at compaq.com> made the Red Hat mkrootfs script
4541 work with RH 6.2.
4542
4543 Michael Jennings <mikejen at hevanet.com> sent in some material which
4544 is now gracing the top of the index page
4545 <http://user-mode-linux.sourceforge.net/> of this site.
4546
4547 SGI <http://www.sgi.com> (and more specifically Ralf Baechle <ralf at
4548 uni-koblenz.de> ) gave me an account on oss.sgi.com
4549 <http://www.oss.sgi.com> . The bandwidth there made it possible to
4550 produce most of the filesystems available on the project download
4551 page.
4552
4553 Laurent Bonnaud <Laurent.Bonnaud at inpg.fr> took the old grotty
4554 Debian filesystem that I've been distributing and updated it to 2.2.
4555 It is now available by itself here.
4556
4557 Rik van Riel gave me some ftp space on ftp.nl.linux.org so I can make
4558 releases even when Sourceforge is broken.
4559
4560 Rodrigo de Castro looked at my broken pte code and told me what was
4561 wrong with it, letting me fix a long-standing (several weeks) and
4562 serious set of bugs.
4563
4564 Chris Reahard built a specialized root filesystem for running a DNS
4565 server jailed inside UML. It's available from the download
4566 <http://user-mode-linux.sourceforge.net/dl-sf.html> page in the Jail
4567 Filesystems section.
4568
4569
4570
4571
4572
4573
4574
4575
4576
4577
4578
4579