aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorIngo Molnar <mingo@elte.hu>2009-08-25 04:04:27 -0400
committerIngo Molnar <mingo@elte.hu>2009-08-25 04:04:32 -0400
commitdaedc71836e5a398fd0cc0e12c5cb43539478485 (patch)
treec56567a92017679e57195cef992d4a5561c20e0e /Documentation
parentc36ba80ea01d0aecb652c26799a912e760ce8981 (diff)
parent422bef879e84104fee6dc68ded0e371dbeb5f88e (diff)
Merge commit 'v2.6.31-rc7' into irq/core
Merge reason: move from an -rc2 base to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/sysfs-block37
-rw-r--r--Documentation/DocBook/kernel-hacking.tmpl4
-rw-r--r--Documentation/DocBook/mac80211.tmpl2
-rw-r--r--Documentation/RCU/rculist_nulls.txt7
-rw-r--r--Documentation/arm/memory.txt2
-rw-r--r--Documentation/connector/cn_test.c4
-rw-r--r--Documentation/connector/ucon.c2
-rw-r--r--Documentation/driver-model/driver.txt4
-rw-r--r--Documentation/dvb/get_dvb_firmware53
-rw-r--r--Documentation/feature-removal-schedule.txt10
-rw-r--r--Documentation/filesystems/afs.txt26
-rw-r--r--Documentation/filesystems/proc.txt15
-rw-r--r--Documentation/filesystems/sysfs.txt3
-rw-r--r--Documentation/ioctl/ioctl-number.txt1
-rw-r--r--Documentation/kernel-parameters.txt8
-rw-r--r--Documentation/laptops/thinkpad-acpi.txt127
-rw-r--r--Documentation/lguest/lguest.c721
-rw-r--r--Documentation/lockdep-design.txt6
-rw-r--r--Documentation/networking/6pack.txt2
-rw-r--r--Documentation/scheduler/sched-rt-group.txt13
-rw-r--r--Documentation/sound/alsa/Procfile.txt5
-rw-r--r--Documentation/sysrq.txt7
-rw-r--r--Documentation/video4linux/CARDLIST.em28xx5
-rw-r--r--Documentation/video4linux/CARDLIST.saa71344
-rw-r--r--Documentation/video4linux/gspca.txt32
-rw-r--r--Documentation/x86/00-INDEX2
-rw-r--r--Documentation/x86/exception-tables.txt (renamed from Documentation/exception.txt)202
27 files changed, 769 insertions, 535 deletions
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index cbbd3e069945..5f3bedaf8e35 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -94,28 +94,37 @@ What: /sys/block/<disk>/queue/physical_block_size
94Date: May 2009 94Date: May 2009
95Contact: Martin K. Petersen <martin.petersen@oracle.com> 95Contact: Martin K. Petersen <martin.petersen@oracle.com>
96Description: 96Description:
97 This is the smallest unit the storage device can write 97 This is the smallest unit a physical storage device can
98 without resorting to read-modify-write operation. It is 98 write atomically. It is usually the same as the logical
99 usually the same as the logical block size but may be 99 block size but may be bigger. One example is SATA
100 bigger. One example is SATA drives with 4KB sectors 100 drives with 4KB sectors that expose a 512-byte logical
101 that expose a 512-byte logical block size to the 101 block size to the operating system. For stacked block
102 operating system. 102 devices the physical_block_size variable contains the
103 maximum physical_block_size of the component devices.
103 104
104What: /sys/block/<disk>/queue/minimum_io_size 105What: /sys/block/<disk>/queue/minimum_io_size
105Date: April 2009 106Date: April 2009
106Contact: Martin K. Petersen <martin.petersen@oracle.com> 107Contact: Martin K. Petersen <martin.petersen@oracle.com>
107Description: 108Description:
108 Storage devices may report a preferred minimum I/O size, 109 Storage devices may report a granularity or preferred
109 which is the smallest request the device can perform 110 minimum I/O size which is the smallest request the
110 without incurring a read-modify-write penalty. For disk 111 device can perform without incurring a performance
111 drives this is often the physical block size. For RAID 112 penalty. For disk drives this is often the physical
112 arrays it is often the stripe chunk size. 113 block size. For RAID arrays it is often the stripe
114 chunk size. A properly aligned multiple of
115 minimum_io_size is the preferred request size for
116 workloads where a high number of I/O operations is
117 desired.
113 118
114What: /sys/block/<disk>/queue/optimal_io_size 119What: /sys/block/<disk>/queue/optimal_io_size
115Date: April 2009 120Date: April 2009
116Contact: Martin K. Petersen <martin.petersen@oracle.com> 121Contact: Martin K. Petersen <martin.petersen@oracle.com>
117Description: 122Description:
118 Storage devices may report an optimal I/O size, which is 123 Storage devices may report an optimal I/O size, which is
119 the device's preferred unit of receiving I/O. This is 124 the device's preferred unit for sustained I/O. This is
120 rarely reported for disk drives. For RAID devices it is 125 rarely reported for disk drives. For RAID arrays it is
121 usually the stripe width or the internal block size. 126 usually the stripe width or the internal track size. A
127 properly aligned multiple of optimal_io_size is the
128 preferred request size for workloads where sustained
129 throughput is desired. If no optimal I/O size is
130 reported this file contains 0.
diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index a50d6cd58573..992e67e6be7f 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -449,8 +449,8 @@ printk(KERN_INFO "i = %u\n", i);
449 </para> 449 </para>
450 450
451 <programlisting> 451 <programlisting>
452__u32 ipaddress; 452__be32 ipaddress;
453printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); 453printk(KERN_INFO "my ip: %pI4\n", &amp;ipaddress);
454 </programlisting> 454 </programlisting>
455 455
456 <para> 456 <para>
diff --git a/Documentation/DocBook/mac80211.tmpl b/Documentation/DocBook/mac80211.tmpl
index e36986663570..f3f37f141dbd 100644
--- a/Documentation/DocBook/mac80211.tmpl
+++ b/Documentation/DocBook/mac80211.tmpl
@@ -184,8 +184,6 @@ usage should require reading the full document.
184!Finclude/net/mac80211.h ieee80211_ctstoself_get 184!Finclude/net/mac80211.h ieee80211_ctstoself_get
185!Finclude/net/mac80211.h ieee80211_ctstoself_duration 185!Finclude/net/mac80211.h ieee80211_ctstoself_duration
186!Finclude/net/mac80211.h ieee80211_generic_frame_duration 186!Finclude/net/mac80211.h ieee80211_generic_frame_duration
187!Finclude/net/mac80211.h ieee80211_get_hdrlen_from_skb
188!Finclude/net/mac80211.h ieee80211_hdrlen
189!Finclude/net/mac80211.h ieee80211_wake_queue 187!Finclude/net/mac80211.h ieee80211_wake_queue
190!Finclude/net/mac80211.h ieee80211_stop_queue 188!Finclude/net/mac80211.h ieee80211_stop_queue
191!Finclude/net/mac80211.h ieee80211_wake_queues 189!Finclude/net/mac80211.h ieee80211_wake_queues
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
index 93cb28d05dcd..18f9651ff23d 100644
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -83,11 +83,12 @@ not detect it missed following items in original chain.
83obj = kmem_cache_alloc(...); 83obj = kmem_cache_alloc(...);
84lock_chain(); // typically a spin_lock() 84lock_chain(); // typically a spin_lock()
85obj->key = key; 85obj->key = key;
86atomic_inc(&obj->refcnt);
87/* 86/*
88 * we need to make sure obj->key is updated before obj->next 87 * we need to make sure obj->key is updated before obj->next
88 * or obj->refcnt
89 */ 89 */
90smp_wmb(); 90smp_wmb();
91atomic_set(&obj->refcnt, 1);
91hlist_add_head_rcu(&obj->obj_node, list); 92hlist_add_head_rcu(&obj->obj_node, list);
92unlock_chain(); // typically a spin_unlock() 93unlock_chain(); // typically a spin_unlock()
93 94
@@ -159,6 +160,10 @@ out:
159obj = kmem_cache_alloc(cachep); 160obj = kmem_cache_alloc(cachep);
160lock_chain(); // typically a spin_lock() 161lock_chain(); // typically a spin_lock()
161obj->key = key; 162obj->key = key;
163/*
164 * changes to obj->key must be visible before refcnt one
165 */
166smp_wmb();
162atomic_set(&obj->refcnt, 1); 167atomic_set(&obj->refcnt, 1);
163/* 168/*
164 * insert obj in RCU way (readers might be traversing chain) 169 * insert obj in RCU way (readers might be traversing chain)
diff --git a/Documentation/arm/memory.txt b/Documentation/arm/memory.txt
index 43cb1004d35f..9d58c7c5eddd 100644
--- a/Documentation/arm/memory.txt
+++ b/Documentation/arm/memory.txt
@@ -21,6 +21,8 @@ ffff8000 ffffffff copy_user_page / clear_user_page use.
21 For SA11xx and Xscale, this is used to 21 For SA11xx and Xscale, this is used to
22 setup a minicache mapping. 22 setup a minicache mapping.
23 23
24ffff4000 ffffffff cache aliasing on ARMv6 and later CPUs.
25
24ffff1000 ffff7fff Reserved. 26ffff1000 ffff7fff Reserved.
25 Platforms must not use this address range. 27 Platforms must not use this address range.
26 28
diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c
index f688eba87704..6a5be5d5c8e4 100644
--- a/Documentation/connector/cn_test.c
+++ b/Documentation/connector/cn_test.c
@@ -1,7 +1,7 @@
1/* 1/*
2 * cn_test.c 2 * cn_test.c
3 * 3 *
4 * 2004-2005 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> 4 * 2004+ Copyright (c) Evgeniy Polyakov <zbr@ioremap.net>
5 * All rights reserved. 5 * All rights reserved.
6 * 6 *
7 * This program is free software; you can redistribute it and/or modify 7 * This program is free software; you can redistribute it and/or modify
@@ -194,5 +194,5 @@ module_init(cn_test_init);
194module_exit(cn_test_fini); 194module_exit(cn_test_fini);
195 195
196MODULE_LICENSE("GPL"); 196MODULE_LICENSE("GPL");
197MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>"); 197MODULE_AUTHOR("Evgeniy Polyakov <zbr@ioremap.net>");
198MODULE_DESCRIPTION("Connector's test module"); 198MODULE_DESCRIPTION("Connector's test module");
diff --git a/Documentation/connector/ucon.c b/Documentation/connector/ucon.c
index d738cde2a8d5..c5092ad0ce4b 100644
--- a/Documentation/connector/ucon.c
+++ b/Documentation/connector/ucon.c
@@ -1,7 +1,7 @@
1/* 1/*
2 * ucon.c 2 * ucon.c
3 * 3 *
4 * Copyright (c) 2004+ Evgeniy Polyakov <johnpol@2ka.mipt.ru> 4 * Copyright (c) 2004+ Evgeniy Polyakov <zbr@ioremap.net>
5 * 5 *
6 * 6 *
7 * This program is free software; you can redistribute it and/or modify 7 * This program is free software; you can redistribute it and/or modify
diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.txt
index 82132169d47a..60120fb3b961 100644
--- a/Documentation/driver-model/driver.txt
+++ b/Documentation/driver-model/driver.txt
@@ -207,8 +207,8 @@ Attributes
207~~~~~~~~~~ 207~~~~~~~~~~
208struct driver_attribute { 208struct driver_attribute {
209 struct attribute attr; 209 struct attribute attr;
210 ssize_t (*show)(struct device_driver *, char * buf, size_t count, loff_t off); 210 ssize_t (*show)(struct device_driver *driver, char *buf);
211 ssize_t (*store)(struct device_driver *, const char * buf, size_t count, loff_t off); 211 ssize_t (*store)(struct device_driver *, const char * buf, size_t count);
212}; 212};
213 213
214Device drivers can export attributes via their sysfs directories. 214Device drivers can export attributes via their sysfs directories.
diff --git a/Documentation/dvb/get_dvb_firmware b/Documentation/dvb/get_dvb_firmware
index a52adfc9a57f..3d1b0ab70c8e 100644
--- a/Documentation/dvb/get_dvb_firmware
+++ b/Documentation/dvb/get_dvb_firmware
@@ -25,7 +25,7 @@ use IO::Handle;
25 "tda10046lifeview", "av7110", "dec2000t", "dec2540t", 25 "tda10046lifeview", "av7110", "dec2000t", "dec2540t",
26 "dec3000s", "vp7041", "dibusb", "nxt2002", "nxt2004", 26 "dec3000s", "vp7041", "dibusb", "nxt2002", "nxt2004",
27 "or51211", "or51132_qam", "or51132_vsb", "bluebird", 27 "or51211", "or51132_qam", "or51132_vsb", "bluebird",
28 "opera1", "cx231xx", "cx18", "cx23885", "pvrusb2" ); 28 "opera1", "cx231xx", "cx18", "cx23885", "pvrusb2", "mpc718" );
29 29
30# Check args 30# Check args
31syntax() if (scalar(@ARGV) != 1); 31syntax() if (scalar(@ARGV) != 1);
@@ -381,6 +381,57 @@ sub cx18 {
381 $allfiles; 381 $allfiles;
382} 382}
383 383
384sub mpc718 {
385 my $archive = 'Yuan MPC718 TV Tuner Card 2.13.10.1016.zip';
386 my $url = "ftp://ftp.work.acer-euro.com/desktop/aspire_idea510/vista/Drivers/$archive";
387 my $fwfile = "dvb-cx18-mpc718-mt352.fw";
388 my $tmpdir = tempdir(DIR => "/tmp", CLEANUP => 1);
389
390 checkstandard();
391 wgetfile($archive, $url);
392 unzip($archive, $tmpdir);
393
394 my $sourcefile = "$tmpdir/Yuan MPC718 TV Tuner Card 2.13.10.1016/mpc718_32bit/yuanrap.sys";
395 my $found = 0;
396
397 open IN, '<', $sourcefile or die "Couldn't open $sourcefile to extract $fwfile data\n";
398 binmode IN;
399 open OUT, '>', $fwfile;
400 binmode OUT;
401 {
402 # Block scope because we change the line terminator variable $/
403 my $prevlen = 0;
404 my $currlen;
405
406 # Buried in the data segment are 3 runs of almost identical
407 # register-value pairs that end in 0x5d 0x01 which is a "TUNER GO"
408 # command for the MT352.
409 # Pull out the middle run (because it's easy) of register-value
410 # pairs to make the "firmware" file.
411
412 local $/ = "\x5d\x01"; # MT352 "TUNER GO"
413
414 while (<IN>) {
415 $currlen = length($_);
416 if ($prevlen == $currlen && $currlen <= 64) {
417 chop; chop; # Get rid of "TUNER GO"
418 s/^\0\0//; # get rid of leading 00 00 if it's there
419 printf OUT "$_";
420 $found = 1;
421 last;
422 }
423 $prevlen = $currlen;
424 }
425 }
426 close OUT;
427 close IN;
428 if (!$found) {
429 unlink $fwfile;
430 die "Couldn't find valid register-value sequence in $sourcefile for $fwfile\n";
431 }
432 $fwfile;
433}
434
384sub cx23885 { 435sub cx23885 {
385 my $url = "http://linuxtv.org/downloads/firmware/"; 436 my $url = "http://linuxtv.org/downloads/firmware/";
386 437
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 2af78126b053..82e8d1a54433 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -449,3 +449,13 @@ Why: Remove the old legacy 32bit machine check code. This has been
449 but the old version has been kept around for easier testing. Note this 449 but the old version has been kept around for easier testing. Note this
450 doesn't impact the old P5 and WinChip machine check handlers. 450 doesn't impact the old P5 and WinChip machine check handlers.
451Who: Andi Kleen <andi@firstfloor.org> 451Who: Andi Kleen <andi@firstfloor.org>
452
453----------------------------
454
455What: lock_policy_rwsem_* and unlock_policy_rwsem_* will not be
456 exported interface anymore.
457When: 2.6.33
458Why: cpu_policy_rwsem has a new cleaner definition making it local to
459 cpufreq core and contained inside cpufreq.c. Other dependent
460 drivers should not use it in order to safely avoid lockdep issues.
461Who: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt
index 12ad6c7f4e50..ffef91c4e0d6 100644
--- a/Documentation/filesystems/afs.txt
+++ b/Documentation/filesystems/afs.txt
@@ -23,15 +23,13 @@ it does support include:
23 23
24 (*) Security (currently only AFS kaserver and KerberosIV tickets). 24 (*) Security (currently only AFS kaserver and KerberosIV tickets).
25 25
26 (*) File reading. 26 (*) File reading and writing.
27 27
28 (*) Automounting. 28 (*) Automounting.
29 29
30It does not yet support the following AFS features: 30 (*) Local caching (via fscache).
31
32 (*) Write support.
33 31
34 (*) Local caching. 32It does not yet support the following AFS features:
35 33
36 (*) pioctl() system call. 34 (*) pioctl() system call.
37 35
@@ -56,7 +54,7 @@ They permit the debugging messages to be turned on dynamically by manipulating
56the masks in the following files: 54the masks in the following files:
57 55
58 /sys/module/af_rxrpc/parameters/debug 56 /sys/module/af_rxrpc/parameters/debug
59 /sys/module/afs/parameters/debug 57 /sys/module/kafs/parameters/debug
60 58
61 59
62===== 60=====
@@ -66,9 +64,9 @@ USAGE
66When inserting the driver modules the root cell must be specified along with a 64When inserting the driver modules the root cell must be specified along with a
67list of volume location server IP addresses: 65list of volume location server IP addresses:
68 66
69 insmod af_rxrpc.o 67 modprobe af_rxrpc
70 insmod rxkad.o 68 modprobe rxkad
71 insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 69 modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
72 70
73The first module is the AF_RXRPC network protocol driver. This provides the 71The first module is the AF_RXRPC network protocol driver. This provides the
74RxRPC remote operation protocol and may also be accessed from userspace. See: 72RxRPC remote operation protocol and may also be accessed from userspace. See:
@@ -81,7 +79,7 @@ is the actual filesystem driver for the AFS filesystem.
81Once the module has been loaded, more modules can be added by the following 79Once the module has been loaded, more modules can be added by the following
82procedure: 80procedure:
83 81
84 echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells 82 echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
85 83
86Where the parameters to the "add" command are the name of a cell and a list of 84Where the parameters to the "add" command are the name of a cell and a list of
87volume location servers within that cell, with the latter separated by colons. 85volume location servers within that cell, with the latter separated by colons.
@@ -101,7 +99,7 @@ The name of the volume can be suffixes with ".backup" or ".readonly" to
101specify connection to only volumes of those types. 99specify connection to only volumes of those types.
102 100
103The name of the cell is optional, and if not given during a mount, then the 101The name of the cell is optional, and if not given during a mount, then the
104named volume will be looked up in the cell specified during insmod. 102named volume will be looked up in the cell specified during modprobe.
105 103
106Additional cells can be added through /proc (see later section). 104Additional cells can be added through /proc (see later section).
107 105
@@ -163,14 +161,14 @@ THE CELL DATABASE
163 161
164The filesystem maintains an internal database of all the cells it knows and the 162The filesystem maintains an internal database of all the cells it knows and the
165IP addresses of the volume location servers for those cells. The cell to which 163IP addresses of the volume location servers for those cells. The cell to which
166the system belongs is added to the database when insmod is performed by the 164the system belongs is added to the database when modprobe is performed by the
167"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on 165"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on
168the kernel command line. 166the kernel command line.
169 167
170Further cells can be added by commands similar to the following: 168Further cells can be added by commands similar to the following:
171 169
172 echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells 170 echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
173 echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells 171 echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
174 172
175No other cell database operations are available at this time. 173No other cell database operations are available at this time.
176 174
@@ -233,7 +231,7 @@ insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91
233mount -t afs \%root.afs. /afs 231mount -t afs \%root.afs. /afs
234mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ 232mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
235 233
236echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells 234echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 > /proc/fs/afs/cells
237mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ 235mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
238mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive 236mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
239mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib 237mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fad18f9456e4..ffead13f9443 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1167,13 +1167,11 @@ CHAPTER 3: PER-PROCESS PARAMETERS
11673.1 /proc/<pid>/oom_adj - Adjust the oom-killer score 11673.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
1168------------------------------------------------------ 1168------------------------------------------------------
1169 1169
1170This file can be used to adjust the score used to select which processes should 1170This file can be used to adjust the score used to select which processes
1171be killed in an out-of-memory situation. The oom_adj value is a characteristic 1171should be killed in an out-of-memory situation. Giving it a high score will
1172of the task's mm, so all threads that share an mm with pid will have the same 1172increase the likelihood of this process being killed by the oom-killer. Valid
1173oom_adj value. A high value will increase the likelihood of this process being 1173values are in the range -16 to +15, plus the special value -17, which disables
1174killed by the oom-killer. Valid values are in the range -16 to +15 as 1174oom-killing altogether for this process.
1175explained below and a special value of -17, which disables oom-killing
1176altogether for threads sharing pid's mm.
1177 1175
1178The process to be killed in an out-of-memory situation is selected among all others 1176The process to be killed in an out-of-memory situation is selected among all others
1179based on its badness score. This value equals the original memory size of the process 1177based on its badness score. This value equals the original memory size of the process
@@ -1187,9 +1185,6 @@ the parent's score if they do not share the same memory. Thus forking servers
1187are the prime candidates to be killed. Having only one 'hungry' child will make 1185are the prime candidates to be killed. Having only one 'hungry' child will make
1188parent less preferable than the child. 1186parent less preferable than the child.
1189 1187
1190/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
1191oom-killing already.
1192
1193/proc/<pid>/oom_score shows process' current badness score. 1188/proc/<pid>/oom_score shows process' current badness score.
1194 1189
1195The following heuristics are then applied: 1190The following heuristics are then applied:
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index 7e81e37c0b1e..b245d524d568 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -23,7 +23,8 @@ interface.
23Using sysfs 23Using sysfs
24~~~~~~~~~~~ 24~~~~~~~~~~~
25 25
26sysfs is always compiled in. You can access it by doing: 26sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
27it by doing:
27 28
28 mount -t sysfs sysfs /sys 29 mount -t sysfs sysfs /sys
29 30
diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 7bb0d934b6d8..dbea4f95fc85 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -139,6 +139,7 @@ Code Seq# Include File Comments
139'm' all linux/synclink.h conflict! 139'm' all linux/synclink.h conflict!
140'm' 00-1F net/irda/irmod.h conflict! 140'm' 00-1F net/irda/irmod.h conflict!
141'n' 00-7F linux/ncp_fs.h 141'n' 00-7F linux/ncp_fs.h
142'n' 80-8F linux/nilfs2_fs.h NILFS2
142'n' E0-FF video/matrox.h matroxfb 143'n' E0-FF video/matrox.h matroxfb
143'o' 00-1F fs/ocfs2/ocfs2_fs.h OCFS2 144'o' 00-1F fs/ocfs2/ocfs2_fs.h OCFS2
144'o' 00-03 include/mtd/ubi-user.h conflict! (OCFS2 and UBI overlaps) 145'o' 00-03 include/mtd/ubi-user.h conflict! (OCFS2 and UBI overlaps)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index d77fbd8b79ac..7936b801fe6a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1115,6 +1115,10 @@ and is between 256 and 4096 characters. It is defined in the file
1115 libata.dma=4 Compact Flash DMA only 1115 libata.dma=4 Compact Flash DMA only
1116 Combinations also work, so libata.dma=3 enables DMA 1116 Combinations also work, so libata.dma=3 enables DMA
1117 for disks and CDROMs, but not CFs. 1117 for disks and CDROMs, but not CFs.
1118
1119 libata.ignore_hpa= [LIBATA] Ignore HPA limit
1120 libata.ignore_hpa=0 keep BIOS limits (default)
1121 libata.ignore_hpa=1 ignore limits, using full disk
1118 1122
1119 libata.noacpi [LIBATA] Disables use of ACPI in libata suspend/resume 1123 libata.noacpi [LIBATA] Disables use of ACPI in libata suspend/resume
1120 when set. 1124 when set.
@@ -1720,8 +1724,8 @@ and is between 256 and 4096 characters. It is defined in the file
1720 oprofile.cpu_type= Force an oprofile cpu type 1724 oprofile.cpu_type= Force an oprofile cpu type
1721 This might be useful if you have an older oprofile 1725 This might be useful if you have an older oprofile
1722 userland or if you want common events. 1726 userland or if you want common events.
1723 Format: { archperfmon } 1727 Format: { arch_perfmon }
1724 archperfmon: [X86] Force use of architectural 1728 arch_perfmon: [X86] Force use of architectural
1725 perfmon on Intel CPUs instead of the 1729 perfmon on Intel CPUs instead of the
1726 CPU specific event set. 1730 CPU specific event set.
1727 1731
diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt
index f2296ecedb89..e2ddcdeb61b6 100644
--- a/Documentation/laptops/thinkpad-acpi.txt
+++ b/Documentation/laptops/thinkpad-acpi.txt
@@ -36,8 +36,6 @@ detailed description):
36 - Bluetooth enable and disable 36 - Bluetooth enable and disable
37 - video output switching, expansion control 37 - video output switching, expansion control
38 - ThinkLight on and off 38 - ThinkLight on and off
39 - limited docking and undocking
40 - UltraBay eject
41 - CMOS/UCMS control 39 - CMOS/UCMS control
42 - LED control 40 - LED control
43 - ACPI sounds 41 - ACPI sounds
@@ -729,131 +727,6 @@ cannot be read or if it is unknown, thinkpad-acpi will report it as "off".
729It is impossible to know if the status returned through sysfs is valid. 727It is impossible to know if the status returned through sysfs is valid.
730 728
731 729
732Docking / undocking -- /proc/acpi/ibm/dock
733------------------------------------------
734
735Docking and undocking (e.g. with the X4 UltraBase) requires some
736actions to be taken by the operating system to safely make or break
737the electrical connections with the dock.
738
739The docking feature of this driver generates the following ACPI events:
740
741 ibm/dock GDCK 00000003 00000001 -- eject request
742 ibm/dock GDCK 00000003 00000002 -- undocked
743 ibm/dock GDCK 00000000 00000003 -- docked
744
745NOTE: These events will only be generated if the laptop was docked
746when originally booted. This is due to the current lack of support for
747hot plugging of devices in the Linux ACPI framework. If the laptop was
748booted while not in the dock, the following message is shown in the
749logs:
750
751 Mar 17 01:42:34 aero kernel: thinkpad_acpi: dock device not present
752
753In this case, no dock-related events are generated but the dock and
754undock commands described below still work. They can be executed
755manually or triggered by Fn key combinations (see the example acpid
756configuration files included in the driver tarball package available
757on the web site).
758
759When the eject request button on the dock is pressed, the first event
760above is generated. The handler for this event should issue the
761following command:
762
763 echo undock > /proc/acpi/ibm/dock
764
765After the LED on the dock goes off, it is safe to eject the laptop.
766Note: if you pressed this key by mistake, go ahead and eject the
767laptop, then dock it back in. Otherwise, the dock may not function as
768expected.
769
770When the laptop is docked, the third event above is generated. The
771handler for this event should issue the following command to fully
772enable the dock:
773
774 echo dock > /proc/acpi/ibm/dock
775
776The contents of the /proc/acpi/ibm/dock file shows the current status
777of the dock, as provided by the ACPI framework.
778
779The docking support in this driver does not take care of enabling or
780disabling any other devices you may have attached to the dock. For
781example, a CD drive plugged into the UltraBase needs to be disabled or
782enabled separately. See the provided example acpid configuration files
783for how this can be accomplished.
784
785There is no support yet for PCI devices that may be attached to a
786docking station, e.g. in the ThinkPad Dock II. The driver currently
787does not recognize, enable or disable such devices. This means that
788the only docking stations currently supported are the X-series
789UltraBase docks and "dumb" port replicators like the Mini Dock (the
790latter don't need any ACPI support, actually).
791
792
793UltraBay eject -- /proc/acpi/ibm/bay
794------------------------------------
795
796Inserting or ejecting an UltraBay device requires some actions to be
797taken by the operating system to safely make or break the electrical
798connections with the device.
799
800This feature generates the following ACPI events:
801
802 ibm/bay MSTR 00000003 00000000 -- eject request
803 ibm/bay MSTR 00000001 00000000 -- eject lever inserted
804
805NOTE: These events will only be generated if the UltraBay was present
806when the laptop was originally booted (on the X series, the UltraBay
807is in the dock, so it may not be present if the laptop was undocked).
808This is due to the current lack of support for hot plugging of devices
809in the Linux ACPI framework. If the laptop was booted without the
810UltraBay, the following message is shown in the logs:
811
812 Mar 17 01:42:34 aero kernel: thinkpad_acpi: bay device not present
813
814In this case, no bay-related events are generated but the eject
815command described below still works. It can be executed manually or
816triggered by a hot key combination.
817
818Sliding the eject lever generates the first event shown above. The
819handler for this event should take whatever actions are necessary to
820shut down the device in the UltraBay (e.g. call idectl), then issue
821the following command:
822
823 echo eject > /proc/acpi/ibm/bay
824
825After the LED on the UltraBay goes off, it is safe to pull out the
826device.
827
828When the eject lever is inserted, the second event above is
829generated. The handler for this event should take whatever actions are
830necessary to enable the UltraBay device (e.g. call idectl).
831
832The contents of the /proc/acpi/ibm/bay file shows the current status
833of the UltraBay, as provided by the ACPI framework.
834
835EXPERIMENTAL warm eject support on the 600e/x, A22p and A3x (To use
836this feature, you need to supply the experimental=1 parameter when
837loading the module):
838
839These models do not have a button near the UltraBay device to request
840a hot eject but rather require the laptop to be put to sleep
841(suspend-to-ram) before the bay device is ejected or inserted).
842The sequence of steps to eject the device is as follows:
843
844 echo eject > /proc/acpi/ibm/bay
845 put the ThinkPad to sleep
846 remove the drive
847 resume from sleep
848 cat /proc/acpi/ibm/bay should show that the drive was removed
849
850On the A3x, both the UltraBay 2000 and UltraBay Plus devices are
851supported. Use "eject2" instead of "eject" for the second bay.
852
853Note: the UltraBay eject support on the 600e/x, A22p and A3x is
854EXPERIMENTAL and may not work as expected. USE WITH CAUTION!
855
856
857CMOS/UCMS control 730CMOS/UCMS control
858----------------- 731-----------------
859 732
diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index 9ebcd6ef361b..950cde6d6e58 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1,7 +1,9 @@
1/*P:100 This is the Launcher code, a simple program which lays out the 1/*P:100
2 * "physical" memory for the new Guest by mapping the kernel image and 2 * This is the Launcher code, a simple program which lays out the "physical"
3 * the virtual devices, then opens /dev/lguest to tell the kernel 3 * memory for the new Guest by mapping the kernel image and the virtual
4 * about the Guest and control it. :*/ 4 * devices, then opens /dev/lguest to tell the kernel about the Guest and
5 * control it.
6:*/
5#define _LARGEFILE64_SOURCE 7#define _LARGEFILE64_SOURCE
6#define _GNU_SOURCE 8#define _GNU_SOURCE
7#include <stdio.h> 9#include <stdio.h>
@@ -46,13 +48,15 @@
46#include "linux/virtio_rng.h" 48#include "linux/virtio_rng.h"
47#include "linux/virtio_ring.h" 49#include "linux/virtio_ring.h"
48#include "asm/bootparam.h" 50#include "asm/bootparam.h"
49/*L:110 We can ignore the 39 include files we need for this program, but I do 51/*L:110
50 * want to draw attention to the use of kernel-style types. 52 * We can ignore the 42 include files we need for this program, but I do want
53 * to draw attention to the use of kernel-style types.
51 * 54 *
52 * As Linus said, "C is a Spartan language, and so should your naming be." I 55 * As Linus said, "C is a Spartan language, and so should your naming be." I
53 * like these abbreviations, so we define them here. Note that u64 is always 56 * like these abbreviations, so we define them here. Note that u64 is always
54 * unsigned long long, which works on all Linux systems: this means that we can 57 * unsigned long long, which works on all Linux systems: this means that we can
55 * use %llu in printf for any u64. */ 58 * use %llu in printf for any u64.
59 */
56typedef unsigned long long u64; 60typedef unsigned long long u64;
57typedef uint32_t u32; 61typedef uint32_t u32;
58typedef uint16_t u16; 62typedef uint16_t u16;
@@ -69,8 +73,10 @@ typedef uint8_t u8;
69/* This will occupy 3 pages: it must be a power of 2. */ 73/* This will occupy 3 pages: it must be a power of 2. */
70#define VIRTQUEUE_NUM 256 74#define VIRTQUEUE_NUM 256
71 75
72/*L:120 verbose is both a global flag and a macro. The C preprocessor allows 76/*L:120
73 * this, and although I wouldn't recommend it, it works quite nicely here. */ 77 * verbose is both a global flag and a macro. The C preprocessor allows
78 * this, and although I wouldn't recommend it, it works quite nicely here.
79 */
74static bool verbose; 80static bool verbose;
75#define verbose(args...) \ 81#define verbose(args...) \
76 do { if (verbose) printf(args); } while(0) 82 do { if (verbose) printf(args); } while(0)
@@ -87,8 +93,7 @@ static int lguest_fd;
87static unsigned int __thread cpu_id; 93static unsigned int __thread cpu_id;
88 94
89/* This is our list of devices. */ 95/* This is our list of devices. */
90struct device_list 96struct device_list {
91{
92 /* Counter to assign interrupt numbers. */ 97 /* Counter to assign interrupt numbers. */
93 unsigned int next_irq; 98 unsigned int next_irq;
94 99
@@ -100,8 +105,7 @@ struct device_list
100 105
101 /* A single linked list of devices. */ 106 /* A single linked list of devices. */
102 struct device *dev; 107 struct device *dev;
103 /* And a pointer to the last device for easy append and also for 108 /* And a pointer to the last device for easy append. */
104 * configuration appending. */
105 struct device *lastdev; 109 struct device *lastdev;
106}; 110};
107 111
@@ -109,8 +113,7 @@ struct device_list
109static struct device_list devices; 113static struct device_list devices;
110 114
111/* The device structure describes a single device. */ 115/* The device structure describes a single device. */
112struct device 116struct device {
113{
114 /* The linked-list pointer. */ 117 /* The linked-list pointer. */
115 struct device *next; 118 struct device *next;
116 119
@@ -135,8 +138,7 @@ struct device
135}; 138};
136 139
137/* The virtqueue structure describes a queue attached to a device. */ 140/* The virtqueue structure describes a queue attached to a device. */
138struct virtqueue 141struct virtqueue {
139{
140 struct virtqueue *next; 142 struct virtqueue *next;
141 143
142 /* Which device owns me. */ 144 /* Which device owns me. */
@@ -168,20 +170,24 @@ static char **main_args;
168/* The original tty settings to restore on exit. */ 170/* The original tty settings to restore on exit. */
169static struct termios orig_term; 171static struct termios orig_term;
170 172
171/* We have to be careful with barriers: our devices are all run in separate 173/*
174 * We have to be careful with barriers: our devices are all run in separate
172 * threads and so we need to make sure that changes visible to the Guest happen 175 * threads and so we need to make sure that changes visible to the Guest happen
173 * in precise order. */ 176 * in precise order.
177 */
174#define wmb() __asm__ __volatile__("" : : : "memory") 178#define wmb() __asm__ __volatile__("" : : : "memory")
175#define mb() __asm__ __volatile__("" : : : "memory") 179#define mb() __asm__ __volatile__("" : : : "memory")
176 180
177/* Convert an iovec element to the given type. 181/*
182 * Convert an iovec element to the given type.
178 * 183 *
179 * This is a fairly ugly trick: we need to know the size of the type and 184 * This is a fairly ugly trick: we need to know the size of the type and
180 * alignment requirement to check the pointer is kosher. It's also nice to 185 * alignment requirement to check the pointer is kosher. It's also nice to
181 * have the name of the type in case we report failure. 186 * have the name of the type in case we report failure.
182 * 187 *
183 * Typing those three things all the time is cumbersome and error prone, so we 188 * Typing those three things all the time is cumbersome and error prone, so we
184 * have a macro which sets them all up and passes to the real function. */ 189 * have a macro which sets them all up and passes to the real function.
190 */
185#define convert(iov, type) \ 191#define convert(iov, type) \
186 ((type *)_convert((iov), sizeof(type), __alignof__(type), #type)) 192 ((type *)_convert((iov), sizeof(type), __alignof__(type), #type))
187 193
@@ -198,8 +204,10 @@ static void *_convert(struct iovec *iov, size_t size, size_t align,
198/* Wrapper for the last available index. Makes it easier to change. */ 204/* Wrapper for the last available index. Makes it easier to change. */
199#define lg_last_avail(vq) ((vq)->last_avail_idx) 205#define lg_last_avail(vq) ((vq)->last_avail_idx)
200 206
201/* The virtio configuration space is defined to be little-endian. x86 is 207/*
202 * little-endian too, but it's nice to be explicit so we have these helpers. */ 208 * The virtio configuration space is defined to be little-endian. x86 is
209 * little-endian too, but it's nice to be explicit so we have these helpers.
210 */
203#define cpu_to_le16(v16) (v16) 211#define cpu_to_le16(v16) (v16)
204#define cpu_to_le32(v32) (v32) 212#define cpu_to_le32(v32) (v32)
205#define cpu_to_le64(v64) (v64) 213#define cpu_to_le64(v64) (v64)
@@ -241,11 +249,12 @@ static u8 *get_feature_bits(struct device *dev)
241 + dev->num_vq * sizeof(struct lguest_vqconfig); 249 + dev->num_vq * sizeof(struct lguest_vqconfig);
242} 250}
243 251
244/*L:100 The Launcher code itself takes us out into userspace, that scary place 252/*L:100
245 * where pointers run wild and free! Unfortunately, like most userspace 253 * The Launcher code itself takes us out into userspace, that scary place where
246 * programs, it's quite boring (which is why everyone likes to hack on the 254 * pointers run wild and free! Unfortunately, like most userspace programs,
247 * kernel!). Perhaps if you make up an Lguest Drinking Game at this point, it 255 * it's quite boring (which is why everyone likes to hack on the kernel!).
248 * will get you through this section. Or, maybe not. 256 * Perhaps if you make up an Lguest Drinking Game at this point, it will get
257 * you through this section. Or, maybe not.
249 * 258 *
250 * The Launcher sets up a big chunk of memory to be the Guest's "physical" 259 * The Launcher sets up a big chunk of memory to be the Guest's "physical"
251 * memory and stores it in "guest_base". In other words, Guest physical == 260 * memory and stores it in "guest_base". In other words, Guest physical ==
@@ -253,7 +262,8 @@ static u8 *get_feature_bits(struct device *dev)
253 * 262 *
254 * This can be tough to get your head around, but usually it just means that we 263 * This can be tough to get your head around, but usually it just means that we
255 * use these trivial conversion functions when the Guest gives us it's 264 * use these trivial conversion functions when the Guest gives us it's
256 * "physical" addresses: */ 265 * "physical" addresses:
266 */
257static void *from_guest_phys(unsigned long addr) 267static void *from_guest_phys(unsigned long addr)
258{ 268{
259 return guest_base + addr; 269 return guest_base + addr;
@@ -268,7 +278,8 @@ static unsigned long to_guest_phys(const void *addr)
268 * Loading the Kernel. 278 * Loading the Kernel.
269 * 279 *
270 * We start with couple of simple helper routines. open_or_die() avoids 280 * We start with couple of simple helper routines. open_or_die() avoids
271 * error-checking code cluttering the callers: */ 281 * error-checking code cluttering the callers:
282 */
272static int open_or_die(const char *name, int flags) 283static int open_or_die(const char *name, int flags)
273{ 284{
274 int fd = open(name, flags); 285 int fd = open(name, flags);
@@ -283,12 +294,19 @@ static void *map_zeroed_pages(unsigned int num)
283 int fd = open_or_die("/dev/zero", O_RDONLY); 294 int fd = open_or_die("/dev/zero", O_RDONLY);
284 void *addr; 295 void *addr;
285 296
286 /* We use a private mapping (ie. if we write to the page, it will be 297 /*
287 * copied). */ 298 * We use a private mapping (ie. if we write to the page, it will be
299 * copied).
300 */
288 addr = mmap(NULL, getpagesize() * num, 301 addr = mmap(NULL, getpagesize() * num,
289 PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, fd, 0); 302 PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, fd, 0);
290 if (addr == MAP_FAILED) 303 if (addr == MAP_FAILED)
291 err(1, "Mmaping %u pages of /dev/zero", num); 304 err(1, "Mmaping %u pages of /dev/zero", num);
305
306 /*
307 * One neat mmap feature is that you can close the fd, and it
308 * stays mapped.
309 */
292 close(fd); 310 close(fd);
293 311
294 return addr; 312 return addr;
@@ -305,20 +323,24 @@ static void *get_pages(unsigned int num)
305 return addr; 323 return addr;
306} 324}
307 325
308/* This routine is used to load the kernel or initrd. It tries mmap, but if 326/*
327 * This routine is used to load the kernel or initrd. It tries mmap, but if
309 * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries), 328 * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries),
310 * it falls back to reading the memory in. */ 329 * it falls back to reading the memory in.
330 */
311static void map_at(int fd, void *addr, unsigned long offset, unsigned long len) 331static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
312{ 332{
313 ssize_t r; 333 ssize_t r;
314 334
315 /* We map writable even though for some segments are marked read-only. 335 /*
336 * We map writable even though for some segments are marked read-only.
316 * The kernel really wants to be writable: it patches its own 337 * The kernel really wants to be writable: it patches its own
317 * instructions. 338 * instructions.
318 * 339 *
319 * MAP_PRIVATE means that the page won't be copied until a write is 340 * MAP_PRIVATE means that the page won't be copied until a write is
320 * done to it. This allows us to share untouched memory between 341 * done to it. This allows us to share untouched memory between
321 * Guests. */ 342 * Guests.
343 */
322 if (mmap(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC, 344 if (mmap(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC,
323 MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED) 345 MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED)
324 return; 346 return;
@@ -329,7 +351,8 @@ static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
329 err(1, "Reading offset %lu len %lu gave %zi", offset, len, r); 351 err(1, "Reading offset %lu len %lu gave %zi", offset, len, r);
330} 352}
331 353
332/* This routine takes an open vmlinux image, which is in ELF, and maps it into 354/*
355 * This routine takes an open vmlinux image, which is in ELF, and maps it into
333 * the Guest memory. ELF = Embedded Linking Format, which is the format used 356 * the Guest memory. ELF = Embedded Linking Format, which is the format used
334 * by all modern binaries on Linux including the kernel. 357 * by all modern binaries on Linux including the kernel.
335 * 358 *
@@ -337,23 +360,28 @@ static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
337 * address. We use the physical address; the Guest will map itself to the 360 * address. We use the physical address; the Guest will map itself to the
338 * virtual address. 361 * virtual address.
339 * 362 *
340 * We return the starting address. */ 363 * We return the starting address.
364 */
341static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr) 365static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
342{ 366{
343 Elf32_Phdr phdr[ehdr->e_phnum]; 367 Elf32_Phdr phdr[ehdr->e_phnum];
344 unsigned int i; 368 unsigned int i;
345 369
346 /* Sanity checks on the main ELF header: an x86 executable with a 370 /*
347 * reasonable number of correctly-sized program headers. */ 371 * Sanity checks on the main ELF header: an x86 executable with a
372 * reasonable number of correctly-sized program headers.
373 */
348 if (ehdr->e_type != ET_EXEC 374 if (ehdr->e_type != ET_EXEC
349 || ehdr->e_machine != EM_386 375 || ehdr->e_machine != EM_386
350 || ehdr->e_phentsize != sizeof(Elf32_Phdr) 376 || ehdr->e_phentsize != sizeof(Elf32_Phdr)
351 || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr)) 377 || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr))
352 errx(1, "Malformed elf header"); 378 errx(1, "Malformed elf header");
353 379
354 /* An ELF executable contains an ELF header and a number of "program" 380 /*
381 * An ELF executable contains an ELF header and a number of "program"
355 * headers which indicate which parts ("segments") of the program to 382 * headers which indicate which parts ("segments") of the program to
356 * load where. */ 383 * load where.
384 */
357 385
358 /* We read in all the program headers at once: */ 386 /* We read in all the program headers at once: */
359 if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0) 387 if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
@@ -361,8 +389,10 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
361 if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr)) 389 if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
362 err(1, "Reading program headers"); 390 err(1, "Reading program headers");
363 391
364 /* Try all the headers: there are usually only three. A read-only one, 392 /*
365 * a read-write one, and a "note" section which we don't load. */ 393 * Try all the headers: there are usually only three. A read-only one,
394 * a read-write one, and a "note" section which we don't load.
395 */
366 for (i = 0; i < ehdr->e_phnum; i++) { 396 for (i = 0; i < ehdr->e_phnum; i++) {
367 /* If this isn't a loadable segment, we ignore it */ 397 /* If this isn't a loadable segment, we ignore it */
368 if (phdr[i].p_type != PT_LOAD) 398 if (phdr[i].p_type != PT_LOAD)
@@ -380,13 +410,15 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
380 return ehdr->e_entry; 410 return ehdr->e_entry;
381} 411}
382 412
383/*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're 413/*L:150
384 * supposed to jump into it and it will unpack itself. We used to have to 414 * A bzImage, unlike an ELF file, is not meant to be loaded. You're supposed
385 * perform some hairy magic because the unpacking code scared me. 415 * to jump into it and it will unpack itself. We used to have to perform some
416 * hairy magic because the unpacking code scared me.
386 * 417 *
387 * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote 418 * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote
388 * a small patch to jump over the tricky bits in the Guest, so now we just read 419 * a small patch to jump over the tricky bits in the Guest, so now we just read
389 * the funky header so we know where in the file to load, and away we go! */ 420 * the funky header so we know where in the file to load, and away we go!
421 */
390static unsigned long load_bzimage(int fd) 422static unsigned long load_bzimage(int fd)
391{ 423{
392 struct boot_params boot; 424 struct boot_params boot;
@@ -394,8 +426,10 @@ static unsigned long load_bzimage(int fd)
394 /* Modern bzImages get loaded at 1M. */ 426 /* Modern bzImages get loaded at 1M. */
395 void *p = from_guest_phys(0x100000); 427 void *p = from_guest_phys(0x100000);
396 428
397 /* Go back to the start of the file and read the header. It should be 429 /*
398 * a Linux boot header (see Documentation/x86/i386/boot.txt) */ 430 * Go back to the start of the file and read the header. It should be
431 * a Linux boot header (see Documentation/x86/i386/boot.txt)
432 */
399 lseek(fd, 0, SEEK_SET); 433 lseek(fd, 0, SEEK_SET);
400 read(fd, &boot, sizeof(boot)); 434 read(fd, &boot, sizeof(boot));
401 435
@@ -414,9 +448,11 @@ static unsigned long load_bzimage(int fd)
414 return boot.hdr.code32_start; 448 return boot.hdr.code32_start;
415} 449}
416 450
417/*L:140 Loading the kernel is easy when it's a "vmlinux", but most kernels 451/*L:140
452 * Loading the kernel is easy when it's a "vmlinux", but most kernels
418 * come wrapped up in the self-decompressing "bzImage" format. With a little 453 * come wrapped up in the self-decompressing "bzImage" format. With a little
419 * work, we can load those, too. */ 454 * work, we can load those, too.
455 */
420static unsigned long load_kernel(int fd) 456static unsigned long load_kernel(int fd)
421{ 457{
422 Elf32_Ehdr hdr; 458 Elf32_Ehdr hdr;
@@ -433,24 +469,28 @@ static unsigned long load_kernel(int fd)
433 return load_bzimage(fd); 469 return load_bzimage(fd);
434} 470}
435 471
436/* This is a trivial little helper to align pages. Andi Kleen hated it because 472/*
473 * This is a trivial little helper to align pages. Andi Kleen hated it because
437 * it calls getpagesize() twice: "it's dumb code." 474 * it calls getpagesize() twice: "it's dumb code."
438 * 475 *
439 * Kernel guys get really het up about optimization, even when it's not 476 * Kernel guys get really het up about optimization, even when it's not
440 * necessary. I leave this code as a reaction against that. */ 477 * necessary. I leave this code as a reaction against that.
478 */
441static inline unsigned long page_align(unsigned long addr) 479static inline unsigned long page_align(unsigned long addr)
442{ 480{
443 /* Add upwards and truncate downwards. */ 481 /* Add upwards and truncate downwards. */
444 return ((addr + getpagesize()-1) & ~(getpagesize()-1)); 482 return ((addr + getpagesize()-1) & ~(getpagesize()-1));
445} 483}
446 484
447/*L:180 An "initial ram disk" is a disk image loaded into memory along with 485/*L:180
448 * the kernel which the kernel can use to boot from without needing any 486 * An "initial ram disk" is a disk image loaded into memory along with the
449 * drivers. Most distributions now use this as standard: the initrd contains 487 * kernel which the kernel can use to boot from without needing any drivers.
450 * the code to load the appropriate driver modules for the current machine. 488 * Most distributions now use this as standard: the initrd contains the code to
489 * load the appropriate driver modules for the current machine.
451 * 490 *
452 * Importantly, James Morris works for RedHat, and Fedora uses initrds for its 491 * Importantly, James Morris works for RedHat, and Fedora uses initrds for its
453 * kernels. He sent me this (and tells me when I break it). */ 492 * kernels. He sent me this (and tells me when I break it).
493 */
454static unsigned long load_initrd(const char *name, unsigned long mem) 494static unsigned long load_initrd(const char *name, unsigned long mem)
455{ 495{
456 int ifd; 496 int ifd;
@@ -462,12 +502,16 @@ static unsigned long load_initrd(const char *name, unsigned long mem)
462 if (fstat(ifd, &st) < 0) 502 if (fstat(ifd, &st) < 0)
463 err(1, "fstat() on initrd '%s'", name); 503 err(1, "fstat() on initrd '%s'", name);
464 504
465 /* We map the initrd at the top of memory, but mmap wants it to be 505 /*
466 * page-aligned, so we round the size up for that. */ 506 * We map the initrd at the top of memory, but mmap wants it to be
507 * page-aligned, so we round the size up for that.
508 */
467 len = page_align(st.st_size); 509 len = page_align(st.st_size);
468 map_at(ifd, from_guest_phys(mem - len), 0, st.st_size); 510 map_at(ifd, from_guest_phys(mem - len), 0, st.st_size);
469 /* Once a file is mapped, you can close the file descriptor. It's a 511 /*
470 * little odd, but quite useful. */ 512 * Once a file is mapped, you can close the file descriptor. It's a
513 * little odd, but quite useful.
514 */
471 close(ifd); 515 close(ifd);
472 verbose("mapped initrd %s size=%lu @ %p\n", name, len, (void*)mem-len); 516 verbose("mapped initrd %s size=%lu @ %p\n", name, len, (void*)mem-len);
473 517
@@ -476,8 +520,10 @@ static unsigned long load_initrd(const char *name, unsigned long mem)
476} 520}
477/*:*/ 521/*:*/
478 522
479/* Simple routine to roll all the commandline arguments together with spaces 523/*
480 * between them. */ 524 * Simple routine to roll all the commandline arguments together with spaces
525 * between them.
526 */
481static void concat(char *dst, char *args[]) 527static void concat(char *dst, char *args[])
482{ 528{
483 unsigned int i, len = 0; 529 unsigned int i, len = 0;
@@ -494,10 +540,12 @@ static void concat(char *dst, char *args[])
494 dst[len] = '\0'; 540 dst[len] = '\0';
495} 541}
496 542
497/*L:185 This is where we actually tell the kernel to initialize the Guest. We 543/*L:185
544 * This is where we actually tell the kernel to initialize the Guest. We
498 * saw the arguments it expects when we looked at initialize() in lguest_user.c: 545 * saw the arguments it expects when we looked at initialize() in lguest_user.c:
499 * the base of Guest "physical" memory, the top physical page to allow and the 546 * the base of Guest "physical" memory, the top physical page to allow and the
500 * entry point for the Guest. */ 547 * entry point for the Guest.
548 */
501static void tell_kernel(unsigned long start) 549static void tell_kernel(unsigned long start)
502{ 550{
503 unsigned long args[] = { LHREQ_INITIALIZE, 551 unsigned long args[] = { LHREQ_INITIALIZE,
@@ -511,7 +559,7 @@ static void tell_kernel(unsigned long start)
511} 559}
512/*:*/ 560/*:*/
513 561
514/* 562/*L:200
515 * Device Handling. 563 * Device Handling.
516 * 564 *
517 * When the Guest gives us a buffer, it sends an array of addresses and sizes. 565 * When the Guest gives us a buffer, it sends an array of addresses and sizes.
@@ -522,20 +570,26 @@ static void tell_kernel(unsigned long start)
522static void *_check_pointer(unsigned long addr, unsigned int size, 570static void *_check_pointer(unsigned long addr, unsigned int size,
523 unsigned int line) 571 unsigned int line)
524{ 572{
525 /* We have to separately check addr and addr+size, because size could 573 /*
526 * be huge and addr + size might wrap around. */ 574 * We have to separately check addr and addr+size, because size could
575 * be huge and addr + size might wrap around.
576 */
527 if (addr >= guest_limit || addr + size >= guest_limit) 577 if (addr >= guest_limit || addr + size >= guest_limit)
528 errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr); 578 errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr);
529 /* We return a pointer for the caller's convenience, now we know it's 579 /*
530 * safe to use. */ 580 * We return a pointer for the caller's convenience, now we know it's
581 * safe to use.
582 */
531 return from_guest_phys(addr); 583 return from_guest_phys(addr);
532} 584}
533/* A macro which transparently hands the line number to the real function. */ 585/* A macro which transparently hands the line number to the real function. */
534#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__) 586#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
535 587
536/* Each buffer in the virtqueues is actually a chain of descriptors. This 588/*
589 * Each buffer in the virtqueues is actually a chain of descriptors. This
537 * function returns the next descriptor in the chain, or vq->vring.num if we're 590 * function returns the next descriptor in the chain, or vq->vring.num if we're
538 * at the end. */ 591 * at the end.
592 */
539static unsigned next_desc(struct vring_desc *desc, 593static unsigned next_desc(struct vring_desc *desc,
540 unsigned int i, unsigned int max) 594 unsigned int i, unsigned int max)
541{ 595{
@@ -556,7 +610,10 @@ static unsigned next_desc(struct vring_desc *desc,
556 return next; 610 return next;
557} 611}
558 612
559/* This actually sends the interrupt for this virtqueue */ 613/*
614 * This actually sends the interrupt for this virtqueue, if we've used a
615 * buffer.
616 */
560static void trigger_irq(struct virtqueue *vq) 617static void trigger_irq(struct virtqueue *vq)
561{ 618{
562 unsigned long buf[] = { LHREQ_IRQ, vq->config.irq }; 619 unsigned long buf[] = { LHREQ_IRQ, vq->config.irq };
@@ -576,12 +633,14 @@ static void trigger_irq(struct virtqueue *vq)
576 err(1, "Triggering irq %i", vq->config.irq); 633 err(1, "Triggering irq %i", vq->config.irq);
577} 634}
578 635
579/* This looks in the virtqueue and for the first available buffer, and converts 636/*
637 * This looks in the virtqueue for the first available buffer, and converts
580 * it to an iovec for convenient access. Since descriptors consist of some 638 * it to an iovec for convenient access. Since descriptors consist of some
581 * number of output then some number of input descriptors, it's actually two 639 * number of output then some number of input descriptors, it's actually two
582 * iovecs, but we pack them into one and note how many of each there were. 640 * iovecs, but we pack them into one and note how many of each there were.
583 * 641 *
584 * This function returns the descriptor number found. */ 642 * This function waits if necessary, and returns the descriptor number found.
643 */
585static unsigned wait_for_vq_desc(struct virtqueue *vq, 644static unsigned wait_for_vq_desc(struct virtqueue *vq,
586 struct iovec iov[], 645 struct iovec iov[],
587 unsigned int *out_num, unsigned int *in_num) 646 unsigned int *out_num, unsigned int *in_num)
@@ -590,17 +649,23 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
590 struct vring_desc *desc; 649 struct vring_desc *desc;
591 u16 last_avail = lg_last_avail(vq); 650 u16 last_avail = lg_last_avail(vq);
592 651
652 /* There's nothing available? */
593 while (last_avail == vq->vring.avail->idx) { 653 while (last_avail == vq->vring.avail->idx) {
594 u64 event; 654 u64 event;
595 655
596 /* OK, tell Guest about progress up to now. */ 656 /*
657 * Since we're about to sleep, now is a good time to tell the
658 * Guest about what we've used up to now.
659 */
597 trigger_irq(vq); 660 trigger_irq(vq);
598 661
599 /* OK, now we need to know about added descriptors. */ 662 /* OK, now we need to know about added descriptors. */
600 vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY; 663 vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
601 664
602 /* They could have slipped one in as we were doing that: make 665 /*
603 * sure it's written, then check again. */ 666 * They could have slipped one in as we were doing that: make
667 * sure it's written, then check again.
668 */
604 mb(); 669 mb();
605 if (last_avail != vq->vring.avail->idx) { 670 if (last_avail != vq->vring.avail->idx) {
606 vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY; 671 vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
@@ -620,8 +685,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
620 errx(1, "Guest moved used index from %u to %u", 685 errx(1, "Guest moved used index from %u to %u",
621 last_avail, vq->vring.avail->idx); 686 last_avail, vq->vring.avail->idx);
622 687
623 /* Grab the next descriptor number they're advertising, and increment 688 /*
624 * the index we've seen. */ 689 * Grab the next descriptor number they're advertising, and increment
690 * the index we've seen.
691 */
625 head = vq->vring.avail->ring[last_avail % vq->vring.num]; 692 head = vq->vring.avail->ring[last_avail % vq->vring.num];
626 lg_last_avail(vq)++; 693 lg_last_avail(vq)++;
627 694
@@ -636,8 +703,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
636 desc = vq->vring.desc; 703 desc = vq->vring.desc;
637 i = head; 704 i = head;
638 705
639 /* If this is an indirect entry, then this buffer contains a descriptor 706 /*
640 * table which we handle as if it's any normal descriptor chain. */ 707 * If this is an indirect entry, then this buffer contains a descriptor
708 * table which we handle as if it's any normal descriptor chain.
709 */
641 if (desc[i].flags & VRING_DESC_F_INDIRECT) { 710 if (desc[i].flags & VRING_DESC_F_INDIRECT) {
642 if (desc[i].len % sizeof(struct vring_desc)) 711 if (desc[i].len % sizeof(struct vring_desc))
643 errx(1, "Invalid size for indirect buffer table"); 712 errx(1, "Invalid size for indirect buffer table");
@@ -656,8 +725,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
656 if (desc[i].flags & VRING_DESC_F_WRITE) 725 if (desc[i].flags & VRING_DESC_F_WRITE)
657 (*in_num)++; 726 (*in_num)++;
658 else { 727 else {
659 /* If it's an output descriptor, they're all supposed 728 /*
660 * to come before any input descriptors. */ 729 * If it's an output descriptor, they're all supposed
730 * to come before any input descriptors.
731 */
661 if (*in_num) 732 if (*in_num)
662 errx(1, "Descriptor has out after in"); 733 errx(1, "Descriptor has out after in");
663 (*out_num)++; 734 (*out_num)++;
@@ -671,14 +742,19 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
671 return head; 742 return head;
672} 743}
673 744
674/* After we've used one of their buffers, we tell them about it. We'll then 745/*
675 * want to send them an interrupt, using trigger_irq(). */ 746 * After we've used one of their buffers, we tell the Guest about it. Sometime
747 * later we'll want to send them an interrupt using trigger_irq(); note that
748 * wait_for_vq_desc() does that for us if it has to wait.
749 */
676static void add_used(struct virtqueue *vq, unsigned int head, int len) 750static void add_used(struct virtqueue *vq, unsigned int head, int len)
677{ 751{
678 struct vring_used_elem *used; 752 struct vring_used_elem *used;
679 753
680 /* The virtqueue contains a ring of used buffers. Get a pointer to the 754 /*
681 * next entry in that used ring. */ 755 * The virtqueue contains a ring of used buffers. Get a pointer to the
756 * next entry in that used ring.
757 */
682 used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num]; 758 used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num];
683 used->id = head; 759 used->id = head;
684 used->len = len; 760 used->len = len;
@@ -698,9 +774,9 @@ static void add_used_and_trigger(struct virtqueue *vq, unsigned head, int len)
698/* 774/*
699 * The Console 775 * The Console
700 * 776 *
701 * We associate some data with the console for our exit hack. */ 777 * We associate some data with the console for our exit hack.
702struct console_abort 778 */
703{ 779struct console_abort {
704 /* How many times have they hit ^C? */ 780 /* How many times have they hit ^C? */
705 int count; 781 int count;
706 /* When did they start? */ 782 /* When did they start? */
@@ -715,30 +791,35 @@ static void console_input(struct virtqueue *vq)
715 struct console_abort *abort = vq->dev->priv; 791 struct console_abort *abort = vq->dev->priv;
716 struct iovec iov[vq->vring.num]; 792 struct iovec iov[vq->vring.num];
717 793
718 /* Make sure there's a descriptor waiting. */ 794 /* Make sure there's a descriptor available. */
719 head = wait_for_vq_desc(vq, iov, &out_num, &in_num); 795 head = wait_for_vq_desc(vq, iov, &out_num, &in_num);
720 if (out_num) 796 if (out_num)
721 errx(1, "Output buffers in console in queue?"); 797 errx(1, "Output buffers in console in queue?");
722 798
723 /* Read it in. */ 799 /* Read into it. This is where we usually wait. */
724 len = readv(STDIN_FILENO, iov, in_num); 800 len = readv(STDIN_FILENO, iov, in_num);
725 if (len <= 0) { 801 if (len <= 0) {
726 /* Ran out of input? */ 802 /* Ran out of input? */
727 warnx("Failed to get console input, ignoring console."); 803 warnx("Failed to get console input, ignoring console.");
728 /* For simplicity, dying threads kill the whole Launcher. So 804 /*
729 * just nap here. */ 805 * For simplicity, dying threads kill the whole Launcher. So
806 * just nap here.
807 */
730 for (;;) 808 for (;;)
731 pause(); 809 pause();
732 } 810 }
733 811
812 /* Tell the Guest we used a buffer. */
734 add_used_and_trigger(vq, head, len); 813 add_used_and_trigger(vq, head, len);
735 814
736 /* Three ^C within one second? Exit. 815 /*
816 * Three ^C within one second? Exit.
737 * 817 *
738 * This is such a hack, but works surprisingly well. Each ^C has to 818 * This is such a hack, but works surprisingly well. Each ^C has to
739 * be in a buffer by itself, so they can't be too fast. But we check 819 * be in a buffer by itself, so they can't be too fast. But we check
740 * that we get three within about a second, so they can't be too 820 * that we get three within about a second, so they can't be too
741 * slow. */ 821 * slow.
822 */
742 if (len != 1 || ((char *)iov[0].iov_base)[0] != 3) { 823 if (len != 1 || ((char *)iov[0].iov_base)[0] != 3) {
743 abort->count = 0; 824 abort->count = 0;
744 return; 825 return;
@@ -763,15 +844,23 @@ static void console_output(struct virtqueue *vq)
763 unsigned int head, out, in; 844 unsigned int head, out, in;
764 struct iovec iov[vq->vring.num]; 845 struct iovec iov[vq->vring.num];
765 846
847 /* We usually wait in here, for the Guest to give us something. */
766 head = wait_for_vq_desc(vq, iov, &out, &in); 848 head = wait_for_vq_desc(vq, iov, &out, &in);
767 if (in) 849 if (in)
768 errx(1, "Input buffers in console output queue?"); 850 errx(1, "Input buffers in console output queue?");
851
852 /* writev can return a partial write, so we loop here. */
769 while (!iov_empty(iov, out)) { 853 while (!iov_empty(iov, out)) {
770 int len = writev(STDOUT_FILENO, iov, out); 854 int len = writev(STDOUT_FILENO, iov, out);
771 if (len <= 0) 855 if (len <= 0)
772 err(1, "Write to stdout gave %i", len); 856 err(1, "Write to stdout gave %i", len);
773 iov_consume(iov, out, len); 857 iov_consume(iov, out, len);
774 } 858 }
859
860 /*
861 * We're finished with that buffer: if we're going to sleep,
862 * wait_for_vq_desc() will prod the Guest with an interrupt.
863 */
775 add_used(vq, head, 0); 864 add_used(vq, head, 0);
776} 865}
777 866
@@ -791,15 +880,30 @@ static void net_output(struct virtqueue *vq)
791 unsigned int head, out, in; 880 unsigned int head, out, in;
792 struct iovec iov[vq->vring.num]; 881 struct iovec iov[vq->vring.num];
793 882
883 /* We usually wait in here for the Guest to give us a packet. */
794 head = wait_for_vq_desc(vq, iov, &out, &in); 884 head = wait_for_vq_desc(vq, iov, &out, &in);
795 if (in) 885 if (in)
796 errx(1, "Input buffers in net output queue?"); 886 errx(1, "Input buffers in net output queue?");
887 /*
888 * Send the whole thing through to /dev/net/tun. It expects the exact
889 * same format: what a coincidence!
890 */
797 if (writev(net_info->tunfd, iov, out) < 0) 891 if (writev(net_info->tunfd, iov, out) < 0)
798 errx(1, "Write to tun failed?"); 892 errx(1, "Write to tun failed?");
893
894 /*
895 * Done with that one; wait_for_vq_desc() will send the interrupt if
896 * all packets are processed.
897 */
799 add_used(vq, head, 0); 898 add_used(vq, head, 0);
800} 899}
801 900
802/* Will reading from this file descriptor block? */ 901/*
902 * Handling network input is a bit trickier, because I've tried to optimize it.
903 *
904 * First we have a helper routine which tells is if from this file descriptor
905 * (ie. the /dev/net/tun device) will block:
906 */
803static bool will_block(int fd) 907static bool will_block(int fd)
804{ 908{
805 fd_set fdset; 909 fd_set fdset;
@@ -809,8 +913,11 @@ static bool will_block(int fd)
809 return select(fd+1, &fdset, NULL, NULL, &zero) != 1; 913 return select(fd+1, &fdset, NULL, NULL, &zero) != 1;
810} 914}
811 915
812/* This is where we handle packets coming in from the tun device to our 916/*
813 * Guest. */ 917 * This handles packets coming in from the tun device to our Guest. Like all
918 * service routines, it gets called again as soon as it returns, so you don't
919 * see a while(1) loop here.
920 */
814static void net_input(struct virtqueue *vq) 921static void net_input(struct virtqueue *vq)
815{ 922{
816 int len; 923 int len;
@@ -818,21 +925,38 @@ static void net_input(struct virtqueue *vq)
818 struct iovec iov[vq->vring.num]; 925 struct iovec iov[vq->vring.num];
819 struct net_info *net_info = vq->dev->priv; 926 struct net_info *net_info = vq->dev->priv;
820 927
928 /*
929 * Get a descriptor to write an incoming packet into. This will also
930 * send an interrupt if they're out of descriptors.
931 */
821 head = wait_for_vq_desc(vq, iov, &out, &in); 932 head = wait_for_vq_desc(vq, iov, &out, &in);
822 if (out) 933 if (out)
823 errx(1, "Output buffers in net input queue?"); 934 errx(1, "Output buffers in net input queue?");
824 935
825 /* Deliver interrupt now, since we're about to sleep. */ 936 /*
937 * If it looks like we'll block reading from the tun device, send them
938 * an interrupt.
939 */
826 if (vq->pending_used && will_block(net_info->tunfd)) 940 if (vq->pending_used && will_block(net_info->tunfd))
827 trigger_irq(vq); 941 trigger_irq(vq);
828 942
943 /*
944 * Read in the packet. This is where we normally wait (when there's no
945 * incoming network traffic).
946 */
829 len = readv(net_info->tunfd, iov, in); 947 len = readv(net_info->tunfd, iov, in);
830 if (len <= 0) 948 if (len <= 0)
831 err(1, "Failed to read from tun."); 949 err(1, "Failed to read from tun.");
950
951 /*
952 * Mark that packet buffer as used, but don't interrupt here. We want
953 * to wait until we've done as much work as we can.
954 */
832 add_used(vq, head, len); 955 add_used(vq, head, len);
833} 956}
957/*:*/
834 958
835/* This is the helper to create threads. */ 959/* This is the helper to create threads: run the service routine in a loop. */
836static int do_thread(void *_vq) 960static int do_thread(void *_vq)
837{ 961{
838 struct virtqueue *vq = _vq; 962 struct virtqueue *vq = _vq;
@@ -842,8 +966,10 @@ static int do_thread(void *_vq)
842 return 0; 966 return 0;
843} 967}
844 968
845/* When a child dies, we kill our entire process group with SIGTERM. This 969/*
846 * also has the side effect that the shell restores the console for us! */ 970 * When a child dies, we kill our entire process group with SIGTERM. This
971 * also has the side effect that the shell restores the console for us!
972 */
847static void kill_launcher(int signal) 973static void kill_launcher(int signal)
848{ 974{
849 kill(0, SIGTERM); 975 kill(0, SIGTERM);
@@ -878,11 +1004,15 @@ static void reset_device(struct device *dev)
878 signal(SIGCHLD, (void *)kill_launcher); 1004 signal(SIGCHLD, (void *)kill_launcher);
879} 1005}
880 1006
1007/*L:216
1008 * This actually creates the thread which services the virtqueue for a device.
1009 */
881static void create_thread(struct virtqueue *vq) 1010static void create_thread(struct virtqueue *vq)
882{ 1011{
883 /* Create stack for thread and run it. Since stack grows 1012 /*
884 * upwards, we point the stack pointer to the end of this 1013 * Create stack for thread. Since the stack grows upwards, we point
885 * region. */ 1014 * the stack pointer to the end of this region.
1015 */
886 char *stack = malloc(32768); 1016 char *stack = malloc(32768);
887 unsigned long args[] = { LHREQ_EVENTFD, 1017 unsigned long args[] = { LHREQ_EVENTFD,
888 vq->config.pfn*getpagesize(), 0 }; 1018 vq->config.pfn*getpagesize(), 0 };
@@ -893,17 +1023,22 @@ static void create_thread(struct virtqueue *vq)
893 err(1, "Creating eventfd"); 1023 err(1, "Creating eventfd");
894 args[2] = vq->eventfd; 1024 args[2] = vq->eventfd;
895 1025
896 /* Attach an eventfd to this virtqueue: it will go off 1026 /*
897 * when the Guest does an LHCALL_NOTIFY for this vq. */ 1027 * Attach an eventfd to this virtqueue: it will go off when the Guest
1028 * does an LHCALL_NOTIFY for this vq.
1029 */
898 if (write(lguest_fd, &args, sizeof(args)) != 0) 1030 if (write(lguest_fd, &args, sizeof(args)) != 0)
899 err(1, "Attaching eventfd"); 1031 err(1, "Attaching eventfd");
900 1032
901 /* CLONE_VM: because it has to access the Guest memory, and 1033 /*
902 * SIGCHLD so we get a signal if it dies. */ 1034 * CLONE_VM: because it has to access the Guest memory, and SIGCHLD so
1035 * we get a signal if it dies.
1036 */
903 vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq); 1037 vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq);
904 if (vq->thread == (pid_t)-1) 1038 if (vq->thread == (pid_t)-1)
905 err(1, "Creating clone"); 1039 err(1, "Creating clone");
906 /* We close our local copy, now the child has it. */ 1040
1041 /* We close our local copy now the child has it. */
907 close(vq->eventfd); 1042 close(vq->eventfd);
908} 1043}
909 1044
@@ -955,7 +1090,10 @@ static void update_device_status(struct device *dev)
955 } 1090 }
956} 1091}
957 1092
958/* This is the generic routine we call when the Guest uses LHCALL_NOTIFY. */ 1093/*L:215
1094 * This is the generic routine we call when the Guest uses LHCALL_NOTIFY. In
1095 * particular, it's used to notify us of device status changes during boot.
1096 */
959static void handle_output(unsigned long addr) 1097static void handle_output(unsigned long addr)
960{ 1098{
961 struct device *i; 1099 struct device *i;
@@ -964,25 +1102,42 @@ static void handle_output(unsigned long addr)
964 for (i = devices.dev; i; i = i->next) { 1102 for (i = devices.dev; i; i = i->next) {
965 struct virtqueue *vq; 1103 struct virtqueue *vq;
966 1104
967 /* Notifications to device descriptors update device status. */ 1105 /*
1106 * Notifications to device descriptors mean they updated the
1107 * device status.
1108 */
968 if (from_guest_phys(addr) == i->desc) { 1109 if (from_guest_phys(addr) == i->desc) {
969 update_device_status(i); 1110 update_device_status(i);
970 return; 1111 return;
971 } 1112 }
972 1113
973 /* Devices *can* be used before status is set to DRIVER_OK. */ 1114 /*
1115 * Devices *can* be used before status is set to DRIVER_OK.
1116 * The original plan was that they would never do this: they
1117 * would always finish setting up their status bits before
1118 * actually touching the virtqueues. In practice, we allowed
1119 * them to, and they do (eg. the disk probes for partition
1120 * tables as part of initialization).
1121 *
1122 * If we see this, we start the device: once it's running, we
1123 * expect the device to catch all the notifications.
1124 */
974 for (vq = i->vq; vq; vq = vq->next) { 1125 for (vq = i->vq; vq; vq = vq->next) {
975 if (addr != vq->config.pfn*getpagesize()) 1126 if (addr != vq->config.pfn*getpagesize())
976 continue; 1127 continue;
977 if (i->running) 1128 if (i->running)
978 errx(1, "Notification on running %s", i->name); 1129 errx(1, "Notification on running %s", i->name);
1130 /* This just calls create_thread() for each virtqueue */
979 start_device(i); 1131 start_device(i);
980 return; 1132 return;
981 } 1133 }
982 } 1134 }
983 1135
984 /* Early console write is done using notify on a nul-terminated string 1136 /*
985 * in Guest memory. */ 1137 * Early console write is done using notify on a nul-terminated string
1138 * in Guest memory. It's also great for hacking debugging messages
1139 * into a Guest.
1140 */
986 if (addr >= guest_limit) 1141 if (addr >= guest_limit)
987 errx(1, "Bad NOTIFY %#lx", addr); 1142 errx(1, "Bad NOTIFY %#lx", addr);
988 1143
@@ -998,10 +1153,12 @@ static void handle_output(unsigned long addr)
998 * routines to allocate and manage them. 1153 * routines to allocate and manage them.
999 */ 1154 */
1000 1155
1001/* The layout of the device page is a "struct lguest_device_desc" followed by a 1156/*
1157 * The layout of the device page is a "struct lguest_device_desc" followed by a
1002 * number of virtqueue descriptors, then two sets of feature bits, then an 1158 * number of virtqueue descriptors, then two sets of feature bits, then an
1003 * array of configuration bytes. This routine returns the configuration 1159 * array of configuration bytes. This routine returns the configuration
1004 * pointer. */ 1160 * pointer.
1161 */
1005static u8 *device_config(const struct device *dev) 1162static u8 *device_config(const struct device *dev)
1006{ 1163{
1007 return (void *)(dev->desc + 1) 1164 return (void *)(dev->desc + 1)
@@ -1009,9 +1166,11 @@ static u8 *device_config(const struct device *dev)
1009 + dev->feature_len * 2; 1166 + dev->feature_len * 2;
1010} 1167}
1011 1168
1012/* This routine allocates a new "struct lguest_device_desc" from descriptor 1169/*
1170 * This routine allocates a new "struct lguest_device_desc" from descriptor
1013 * table page just above the Guest's normal memory. It returns a pointer to 1171 * table page just above the Guest's normal memory. It returns a pointer to
1014 * that descriptor. */ 1172 * that descriptor.
1173 */
1015static struct lguest_device_desc *new_dev_desc(u16 type) 1174static struct lguest_device_desc *new_dev_desc(u16 type)
1016{ 1175{
1017 struct lguest_device_desc d = { .type = type }; 1176 struct lguest_device_desc d = { .type = type };
@@ -1032,8 +1191,10 @@ static struct lguest_device_desc *new_dev_desc(u16 type)
1032 return memcpy(p, &d, sizeof(d)); 1191 return memcpy(p, &d, sizeof(d));
1033} 1192}
1034 1193
1035/* Each device descriptor is followed by the description of its virtqueues. We 1194/*
1036 * specify how many descriptors the virtqueue is to have. */ 1195 * Each device descriptor is followed by the description of its virtqueues. We
1196 * specify how many descriptors the virtqueue is to have.
1197 */
1037static void add_virtqueue(struct device *dev, unsigned int num_descs, 1198static void add_virtqueue(struct device *dev, unsigned int num_descs,
1038 void (*service)(struct virtqueue *)) 1199 void (*service)(struct virtqueue *))
1039{ 1200{
@@ -1050,6 +1211,11 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs,
1050 vq->next = NULL; 1211 vq->next = NULL;
1051 vq->last_avail_idx = 0; 1212 vq->last_avail_idx = 0;
1052 vq->dev = dev; 1213 vq->dev = dev;
1214
1215 /*
1216 * This is the routine the service thread will run, and its Process ID
1217 * once it's running.
1218 */
1053 vq->service = service; 1219 vq->service = service;
1054 vq->thread = (pid_t)-1; 1220 vq->thread = (pid_t)-1;
1055 1221
@@ -1061,10 +1227,12 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs,
1061 /* Initialize the vring. */ 1227 /* Initialize the vring. */
1062 vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN); 1228 vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN);
1063 1229
1064 /* Append virtqueue to this device's descriptor. We use 1230 /*
1231 * Append virtqueue to this device's descriptor. We use
1065 * device_config() to get the end of the device's current virtqueues; 1232 * device_config() to get the end of the device's current virtqueues;
1066 * we check that we haven't added any config or feature information 1233 * we check that we haven't added any config or feature information
1067 * yet, otherwise we'd be overwriting them. */ 1234 * yet, otherwise we'd be overwriting them.
1235 */
1068 assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0); 1236 assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0);
1069 memcpy(device_config(dev), &vq->config, sizeof(vq->config)); 1237 memcpy(device_config(dev), &vq->config, sizeof(vq->config));
1070 dev->num_vq++; 1238 dev->num_vq++;
@@ -1072,14 +1240,18 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs,
1072 1240
1073 verbose("Virtqueue page %#lx\n", to_guest_phys(p)); 1241 verbose("Virtqueue page %#lx\n", to_guest_phys(p));
1074 1242
1075 /* Add to tail of list, so dev->vq is first vq, dev->vq->next is 1243 /*
1076 * second. */ 1244 * Add to tail of list, so dev->vq is first vq, dev->vq->next is
1245 * second.
1246 */
1077 for (i = &dev->vq; *i; i = &(*i)->next); 1247 for (i = &dev->vq; *i; i = &(*i)->next);
1078 *i = vq; 1248 *i = vq;
1079} 1249}
1080 1250
1081/* The first half of the feature bitmask is for us to advertise features. The 1251/*
1082 * second half is for the Guest to accept features. */ 1252 * The first half of the feature bitmask is for us to advertise features. The
1253 * second half is for the Guest to accept features.
1254 */
1083static void add_feature(struct device *dev, unsigned bit) 1255static void add_feature(struct device *dev, unsigned bit)
1084{ 1256{
1085 u8 *features = get_feature_bits(dev); 1257 u8 *features = get_feature_bits(dev);
@@ -1093,9 +1265,11 @@ static void add_feature(struct device *dev, unsigned bit)
1093 features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT)); 1265 features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT));
1094} 1266}
1095 1267
1096/* This routine sets the configuration fields for an existing device's 1268/*
1269 * This routine sets the configuration fields for an existing device's
1097 * descriptor. It only works for the last device, but that's OK because that's 1270 * descriptor. It only works for the last device, but that's OK because that's
1098 * how we use it. */ 1271 * how we use it.
1272 */
1099static void set_config(struct device *dev, unsigned len, const void *conf) 1273static void set_config(struct device *dev, unsigned len, const void *conf)
1100{ 1274{
1101 /* Check we haven't overflowed our single page. */ 1275 /* Check we haven't overflowed our single page. */
@@ -1105,12 +1279,18 @@ static void set_config(struct device *dev, unsigned len, const void *conf)
1105 /* Copy in the config information, and store the length. */ 1279 /* Copy in the config information, and store the length. */
1106 memcpy(device_config(dev), conf, len); 1280 memcpy(device_config(dev), conf, len);
1107 dev->desc->config_len = len; 1281 dev->desc->config_len = len;
1282
1283 /* Size must fit in config_len field (8 bits)! */
1284 assert(dev->desc->config_len == len);
1108} 1285}
1109 1286
1110/* This routine does all the creation and setup of a new device, including 1287/*
1111 * calling new_dev_desc() to allocate the descriptor and device memory. 1288 * This routine does all the creation and setup of a new device, including
1289 * calling new_dev_desc() to allocate the descriptor and device memory. We
1290 * don't actually start the service threads until later.
1112 * 1291 *
1113 * See what I mean about userspace being boring? */ 1292 * See what I mean about userspace being boring?
1293 */
1114static struct device *new_device(const char *name, u16 type) 1294static struct device *new_device(const char *name, u16 type)
1115{ 1295{
1116 struct device *dev = malloc(sizeof(*dev)); 1296 struct device *dev = malloc(sizeof(*dev));
@@ -1123,10 +1303,12 @@ static struct device *new_device(const char *name, u16 type)
1123 dev->num_vq = 0; 1303 dev->num_vq = 0;
1124 dev->running = false; 1304 dev->running = false;
1125 1305
1126 /* Append to device list. Prepending to a single-linked list is 1306 /*
1307 * Append to device list. Prepending to a single-linked list is
1127 * easier, but the user expects the devices to be arranged on the bus 1308 * easier, but the user expects the devices to be arranged on the bus
1128 * in command-line order. The first network device on the command line 1309 * in command-line order. The first network device on the command line
1129 * is eth0, the first block device /dev/vda, etc. */ 1310 * is eth0, the first block device /dev/vda, etc.
1311 */
1130 if (devices.lastdev) 1312 if (devices.lastdev)
1131 devices.lastdev->next = dev; 1313 devices.lastdev->next = dev;
1132 else 1314 else
@@ -1136,8 +1318,10 @@ static struct device *new_device(const char *name, u16 type)
1136 return dev; 1318 return dev;
1137} 1319}
1138 1320
1139/* Our first setup routine is the console. It's a fairly simple device, but 1321/*
1140 * UNIX tty handling makes it uglier than it could be. */ 1322 * Our first setup routine is the console. It's a fairly simple device, but
1323 * UNIX tty handling makes it uglier than it could be.
1324 */
1141static void setup_console(void) 1325static void setup_console(void)
1142{ 1326{
1143 struct device *dev; 1327 struct device *dev;
@@ -1145,8 +1329,10 @@ static void setup_console(void)
1145 /* If we can save the initial standard input settings... */ 1329 /* If we can save the initial standard input settings... */
1146 if (tcgetattr(STDIN_FILENO, &orig_term) == 0) { 1330 if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
1147 struct termios term = orig_term; 1331 struct termios term = orig_term;
1148 /* Then we turn off echo, line buffering and ^C etc. We want a 1332 /*
1149 * raw input stream to the Guest. */ 1333 * Then we turn off echo, line buffering and ^C etc: We want a
1334 * raw input stream to the Guest.
1335 */
1150 term.c_lflag &= ~(ISIG|ICANON|ECHO); 1336 term.c_lflag &= ~(ISIG|ICANON|ECHO);
1151 tcsetattr(STDIN_FILENO, TCSANOW, &term); 1337 tcsetattr(STDIN_FILENO, TCSANOW, &term);
1152 } 1338 }
@@ -1157,10 +1343,12 @@ static void setup_console(void)
1157 dev->priv = malloc(sizeof(struct console_abort)); 1343 dev->priv = malloc(sizeof(struct console_abort));
1158 ((struct console_abort *)dev->priv)->count = 0; 1344 ((struct console_abort *)dev->priv)->count = 0;
1159 1345
1160 /* The console needs two virtqueues: the input then the output. When 1346 /*
1347 * The console needs two virtqueues: the input then the output. When
1161 * they put something the input queue, we make sure we're listening to 1348 * they put something the input queue, we make sure we're listening to
1162 * stdin. When they put something in the output queue, we write it to 1349 * stdin. When they put something in the output queue, we write it to
1163 * stdout. */ 1350 * stdout.
1351 */
1164 add_virtqueue(dev, VIRTQUEUE_NUM, console_input); 1352 add_virtqueue(dev, VIRTQUEUE_NUM, console_input);
1165 add_virtqueue(dev, VIRTQUEUE_NUM, console_output); 1353 add_virtqueue(dev, VIRTQUEUE_NUM, console_output);
1166 1354
@@ -1168,7 +1356,8 @@ static void setup_console(void)
1168} 1356}
1169/*:*/ 1357/*:*/
1170 1358
1171/*M:010 Inter-guest networking is an interesting area. Simplest is to have a 1359/*M:010
1360 * Inter-guest networking is an interesting area. Simplest is to have a
1172 * --sharenet=<name> option which opens or creates a named pipe. This can be 1361 * --sharenet=<name> option which opens or creates a named pipe. This can be
1173 * used to send packets to another guest in a 1:1 manner. 1362 * used to send packets to another guest in a 1:1 manner.
1174 * 1363 *
@@ -1182,7 +1371,8 @@ static void setup_console(void)
1182 * multiple inter-guest channels behind one interface, although it would 1371 * multiple inter-guest channels behind one interface, although it would
1183 * require some manner of hotplugging new virtio channels. 1372 * require some manner of hotplugging new virtio channels.
1184 * 1373 *
1185 * Finally, we could implement a virtio network switch in the kernel. :*/ 1374 * Finally, we could implement a virtio network switch in the kernel.
1375:*/
1186 1376
1187static u32 str2ip(const char *ipaddr) 1377static u32 str2ip(const char *ipaddr)
1188{ 1378{
@@ -1207,11 +1397,13 @@ static void str2mac(const char *macaddr, unsigned char mac[6])
1207 mac[5] = m[5]; 1397 mac[5] = m[5];
1208} 1398}
1209 1399
1210/* This code is "adapted" from libbridge: it attaches the Host end of the 1400/*
1401 * This code is "adapted" from libbridge: it attaches the Host end of the
1211 * network device to the bridge device specified by the command line. 1402 * network device to the bridge device specified by the command line.
1212 * 1403 *
1213 * This is yet another James Morris contribution (I'm an IP-level guy, so I 1404 * This is yet another James Morris contribution (I'm an IP-level guy, so I
1214 * dislike bridging), and I just try not to break it. */ 1405 * dislike bridging), and I just try not to break it.
1406 */
1215static void add_to_bridge(int fd, const char *if_name, const char *br_name) 1407static void add_to_bridge(int fd, const char *if_name, const char *br_name)
1216{ 1408{
1217 int ifidx; 1409 int ifidx;
@@ -1231,9 +1423,11 @@ static void add_to_bridge(int fd, const char *if_name, const char *br_name)
1231 err(1, "can't add %s to bridge %s", if_name, br_name); 1423 err(1, "can't add %s to bridge %s", if_name, br_name);
1232} 1424}
1233 1425
1234/* This sets up the Host end of the network device with an IP address, brings 1426/*
1427 * This sets up the Host end of the network device with an IP address, brings
1235 * it up so packets will flow, the copies the MAC address into the hwaddr 1428 * it up so packets will flow, the copies the MAC address into the hwaddr
1236 * pointer. */ 1429 * pointer.
1430 */
1237static void configure_device(int fd, const char *tapif, u32 ipaddr) 1431static void configure_device(int fd, const char *tapif, u32 ipaddr)
1238{ 1432{
1239 struct ifreq ifr; 1433 struct ifreq ifr;
@@ -1260,10 +1454,12 @@ static int get_tun_device(char tapif[IFNAMSIZ])
1260 /* Start with this zeroed. Messy but sure. */ 1454 /* Start with this zeroed. Messy but sure. */
1261 memset(&ifr, 0, sizeof(ifr)); 1455 memset(&ifr, 0, sizeof(ifr));
1262 1456
1263 /* We open the /dev/net/tun device and tell it we want a tap device. A 1457 /*
1458 * We open the /dev/net/tun device and tell it we want a tap device. A
1264 * tap device is like a tun device, only somehow different. To tell 1459 * tap device is like a tun device, only somehow different. To tell
1265 * the truth, I completely blundered my way through this code, but it 1460 * the truth, I completely blundered my way through this code, but it
1266 * works now! */ 1461 * works now!
1462 */
1267 netfd = open_or_die("/dev/net/tun", O_RDWR); 1463 netfd = open_or_die("/dev/net/tun", O_RDWR);
1268 ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; 1464 ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
1269 strcpy(ifr.ifr_name, "tap%d"); 1465 strcpy(ifr.ifr_name, "tap%d");
@@ -1274,18 +1470,22 @@ static int get_tun_device(char tapif[IFNAMSIZ])
1274 TUN_F_CSUM|TUN_F_TSO4|TUN_F_TSO6|TUN_F_TSO_ECN) != 0) 1470 TUN_F_CSUM|TUN_F_TSO4|TUN_F_TSO6|TUN_F_TSO_ECN) != 0)
1275 err(1, "Could not set features for tun device"); 1471 err(1, "Could not set features for tun device");
1276 1472
1277 /* We don't need checksums calculated for packets coming in this 1473 /*
1278 * device: trust us! */ 1474 * We don't need checksums calculated for packets coming in this
1475 * device: trust us!
1476 */
1279 ioctl(netfd, TUNSETNOCSUM, 1); 1477 ioctl(netfd, TUNSETNOCSUM, 1);
1280 1478
1281 memcpy(tapif, ifr.ifr_name, IFNAMSIZ); 1479 memcpy(tapif, ifr.ifr_name, IFNAMSIZ);
1282 return netfd; 1480 return netfd;
1283} 1481}
1284 1482
1285/*L:195 Our network is a Host<->Guest network. This can either use bridging or 1483/*L:195
1484 * Our network is a Host<->Guest network. This can either use bridging or
1286 * routing, but the principle is the same: it uses the "tun" device to inject 1485 * routing, but the principle is the same: it uses the "tun" device to inject
1287 * packets into the Host as if they came in from a normal network card. We 1486 * packets into the Host as if they came in from a normal network card. We
1288 * just shunt packets between the Guest and the tun device. */ 1487 * just shunt packets between the Guest and the tun device.
1488 */
1289static void setup_tun_net(char *arg) 1489static void setup_tun_net(char *arg)
1290{ 1490{
1291 struct device *dev; 1491 struct device *dev;
@@ -1302,13 +1502,14 @@ static void setup_tun_net(char *arg)
1302 dev = new_device("net", VIRTIO_ID_NET); 1502 dev = new_device("net", VIRTIO_ID_NET);
1303 dev->priv = net_info; 1503 dev->priv = net_info;
1304 1504
1305 /* Network devices need a receive and a send queue, just like 1505 /* Network devices need a recv and a send queue, just like console. */
1306 * console. */
1307 add_virtqueue(dev, VIRTQUEUE_NUM, net_input); 1506 add_virtqueue(dev, VIRTQUEUE_NUM, net_input);
1308 add_virtqueue(dev, VIRTQUEUE_NUM, net_output); 1507 add_virtqueue(dev, VIRTQUEUE_NUM, net_output);
1309 1508
1310 /* We need a socket to perform the magic network ioctls to bring up the 1509 /*
1311 * tap interface, connect to the bridge etc. Any socket will do! */ 1510 * We need a socket to perform the magic network ioctls to bring up the
1511 * tap interface, connect to the bridge etc. Any socket will do!
1512 */
1312 ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP); 1513 ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
1313 if (ipfd < 0) 1514 if (ipfd < 0)
1314 err(1, "opening IP socket"); 1515 err(1, "opening IP socket");
@@ -1362,39 +1563,31 @@ static void setup_tun_net(char *arg)
1362 verbose("device %u: tun %s: %s\n", 1563 verbose("device %u: tun %s: %s\n",
1363 devices.device_num, tapif, arg); 1564 devices.device_num, tapif, arg);
1364} 1565}
1365 1566/*:*/
1366/* Our block (disk) device should be really simple: the Guest asks for a block
1367 * number and we read or write that position in the file. Unfortunately, that
1368 * was amazingly slow: the Guest waits until the read is finished before
1369 * running anything else, even if it could have been doing useful work.
1370 *
1371 * We could use async I/O, except it's reputed to suck so hard that characters
1372 * actually go missing from your code when you try to use it.
1373 *
1374 * So we farm the I/O out to thread, and communicate with it via a pipe. */
1375 1567
1376/* This hangs off device->priv. */ 1568/* This hangs off device->priv. */
1377struct vblk_info 1569struct vblk_info {
1378{
1379 /* The size of the file. */ 1570 /* The size of the file. */
1380 off64_t len; 1571 off64_t len;
1381 1572
1382 /* The file descriptor for the file. */ 1573 /* The file descriptor for the file. */
1383 int fd; 1574 int fd;
1384 1575
1385 /* IO thread listens on this file descriptor [0]. */
1386 int workpipe[2];
1387
1388 /* IO thread writes to this file descriptor to mark it done, then
1389 * Launcher triggers interrupt to Guest. */
1390 int done_fd;
1391}; 1576};
1392 1577
1393/*L:210 1578/*L:210
1394 * The Disk 1579 * The Disk
1395 * 1580 *
1396 * Remember that the block device is handled by a separate I/O thread. We head 1581 * The disk only has one virtqueue, so it only has one thread. It is really
1397 * straight into the core of that thread here: 1582 * simple: the Guest asks for a block number and we read or write that position
1583 * in the file.
1584 *
1585 * Before we serviced each virtqueue in a separate thread, that was unacceptably
1586 * slow: the Guest waits until the read is finished before running anything
1587 * else, even if it could have been doing useful work.
1588 *
1589 * We could have used async I/O, except it's reputed to suck so hard that
1590 * characters actually go missing from your code when you try to use it.
1398 */ 1591 */
1399static void blk_request(struct virtqueue *vq) 1592static void blk_request(struct virtqueue *vq)
1400{ 1593{
@@ -1406,47 +1599,64 @@ static void blk_request(struct virtqueue *vq)
1406 struct iovec iov[vq->vring.num]; 1599 struct iovec iov[vq->vring.num];
1407 off64_t off; 1600 off64_t off;
1408 1601
1409 /* Get the next request. */ 1602 /*
1603 * Get the next request, where we normally wait. It triggers the
1604 * interrupt to acknowledge previously serviced requests (if any).
1605 */
1410 head = wait_for_vq_desc(vq, iov, &out_num, &in_num); 1606 head = wait_for_vq_desc(vq, iov, &out_num, &in_num);
1411 1607
1412 /* Every block request should contain at least one output buffer 1608 /*
1609 * Every block request should contain at least one output buffer
1413 * (detailing the location on disk and the type of request) and one 1610 * (detailing the location on disk and the type of request) and one
1414 * input buffer (to hold the result). */ 1611 * input buffer (to hold the result).
1612 */
1415 if (out_num == 0 || in_num == 0) 1613 if (out_num == 0 || in_num == 0)
1416 errx(1, "Bad virtblk cmd %u out=%u in=%u", 1614 errx(1, "Bad virtblk cmd %u out=%u in=%u",
1417 head, out_num, in_num); 1615 head, out_num, in_num);
1418 1616
1419 out = convert(&iov[0], struct virtio_blk_outhdr); 1617 out = convert(&iov[0], struct virtio_blk_outhdr);
1420 in = convert(&iov[out_num+in_num-1], u8); 1618 in = convert(&iov[out_num+in_num-1], u8);
1619 /*
1620 * For historical reasons, block operations are expressed in 512 byte
1621 * "sectors".
1622 */
1421 off = out->sector * 512; 1623 off = out->sector * 512;
1422 1624
1423 /* The block device implements "barriers", where the Guest indicates 1625 /*
1626 * The block device implements "barriers", where the Guest indicates
1424 * that it wants all previous writes to occur before this write. We 1627 * that it wants all previous writes to occur before this write. We
1425 * don't have a way of asking our kernel to do a barrier, so we just 1628 * don't have a way of asking our kernel to do a barrier, so we just
1426 * synchronize all the data in the file. Pretty poor, no? */ 1629 * synchronize all the data in the file. Pretty poor, no?
1630 */
1427 if (out->type & VIRTIO_BLK_T_BARRIER) 1631 if (out->type & VIRTIO_BLK_T_BARRIER)
1428 fdatasync(vblk->fd); 1632 fdatasync(vblk->fd);
1429 1633
1430 /* In general the virtio block driver is allowed to try SCSI commands. 1634 /*
1431 * It'd be nice if we supported eject, for example, but we don't. */ 1635 * In general the virtio block driver is allowed to try SCSI commands.
1636 * It'd be nice if we supported eject, for example, but we don't.
1637 */
1432 if (out->type & VIRTIO_BLK_T_SCSI_CMD) { 1638 if (out->type & VIRTIO_BLK_T_SCSI_CMD) {
1433 fprintf(stderr, "Scsi commands unsupported\n"); 1639 fprintf(stderr, "Scsi commands unsupported\n");
1434 *in = VIRTIO_BLK_S_UNSUPP; 1640 *in = VIRTIO_BLK_S_UNSUPP;
1435 wlen = sizeof(*in); 1641 wlen = sizeof(*in);
1436 } else if (out->type & VIRTIO_BLK_T_OUT) { 1642 } else if (out->type & VIRTIO_BLK_T_OUT) {
1437 /* Write */ 1643 /*
1438 1644 * Write
1439 /* Move to the right location in the block file. This can fail 1645 *
1440 * if they try to write past end. */ 1646 * Move to the right location in the block file. This can fail
1647 * if they try to write past end.
1648 */
1441 if (lseek64(vblk->fd, off, SEEK_SET) != off) 1649 if (lseek64(vblk->fd, off, SEEK_SET) != off)
1442 err(1, "Bad seek to sector %llu", out->sector); 1650 err(1, "Bad seek to sector %llu", out->sector);
1443 1651
1444 ret = writev(vblk->fd, iov+1, out_num-1); 1652 ret = writev(vblk->fd, iov+1, out_num-1);
1445 verbose("WRITE to sector %llu: %i\n", out->sector, ret); 1653 verbose("WRITE to sector %llu: %i\n", out->sector, ret);
1446 1654
1447 /* Grr... Now we know how long the descriptor they sent was, we 1655 /*
1656 * Grr... Now we know how long the descriptor they sent was, we
1448 * make sure they didn't try to write over the end of the block 1657 * make sure they didn't try to write over the end of the block
1449 * file (possibly extending it). */ 1658 * file (possibly extending it).
1659 */
1450 if (ret > 0 && off + ret > vblk->len) { 1660 if (ret > 0 && off + ret > vblk->len) {
1451 /* Trim it back to the correct length */ 1661 /* Trim it back to the correct length */
1452 ftruncate64(vblk->fd, vblk->len); 1662 ftruncate64(vblk->fd, vblk->len);
@@ -1456,10 +1666,12 @@ static void blk_request(struct virtqueue *vq)
1456 wlen = sizeof(*in); 1666 wlen = sizeof(*in);
1457 *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR); 1667 *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
1458 } else { 1668 } else {
1459 /* Read */ 1669 /*
1460 1670 * Read
1461 /* Move to the right location in the block file. This can fail 1671 *
1462 * if they try to read past end. */ 1672 * Move to the right location in the block file. This can fail
1673 * if they try to read past end.
1674 */
1463 if (lseek64(vblk->fd, off, SEEK_SET) != off) 1675 if (lseek64(vblk->fd, off, SEEK_SET) != off)
1464 err(1, "Bad seek to sector %llu", out->sector); 1676 err(1, "Bad seek to sector %llu", out->sector);
1465 1677
@@ -1474,13 +1686,16 @@ static void blk_request(struct virtqueue *vq)
1474 } 1686 }
1475 } 1687 }
1476 1688
1477 /* OK, so we noted that it was pretty poor to use an fdatasync as a 1689 /*
1690 * OK, so we noted that it was pretty poor to use an fdatasync as a
1478 * barrier. But Christoph Hellwig points out that we need a sync 1691 * barrier. But Christoph Hellwig points out that we need a sync
1479 * *afterwards* as well: "Barriers specify no reordering to the front 1692 * *afterwards* as well: "Barriers specify no reordering to the front
1480 * or the back." And Jens Axboe confirmed it, so here we are: */ 1693 * or the back." And Jens Axboe confirmed it, so here we are:
1694 */
1481 if (out->type & VIRTIO_BLK_T_BARRIER) 1695 if (out->type & VIRTIO_BLK_T_BARRIER)
1482 fdatasync(vblk->fd); 1696 fdatasync(vblk->fd);
1483 1697
1698 /* Finished that request. */
1484 add_used(vq, head, wlen); 1699 add_used(vq, head, wlen);
1485} 1700}
1486 1701
@@ -1491,7 +1706,7 @@ static void setup_block_file(const char *filename)
1491 struct vblk_info *vblk; 1706 struct vblk_info *vblk;
1492 struct virtio_blk_config conf; 1707 struct virtio_blk_config conf;
1493 1708
1494 /* The device responds to return from I/O thread. */ 1709 /* Creat the device. */
1495 dev = new_device("block", VIRTIO_ID_BLOCK); 1710 dev = new_device("block", VIRTIO_ID_BLOCK);
1496 1711
1497 /* The device has one virtqueue, where the Guest places requests. */ 1712 /* The device has one virtqueue, where the Guest places requests. */
@@ -1510,27 +1725,32 @@ static void setup_block_file(const char *filename)
1510 /* Tell Guest how many sectors this device has. */ 1725 /* Tell Guest how many sectors this device has. */
1511 conf.capacity = cpu_to_le64(vblk->len / 512); 1726 conf.capacity = cpu_to_le64(vblk->len / 512);
1512 1727
1513 /* Tell Guest not to put in too many descriptors at once: two are used 1728 /*
1514 * for the in and out elements. */ 1729 * Tell Guest not to put in too many descriptors at once: two are used
1730 * for the in and out elements.
1731 */
1515 add_feature(dev, VIRTIO_BLK_F_SEG_MAX); 1732 add_feature(dev, VIRTIO_BLK_F_SEG_MAX);
1516 conf.seg_max = cpu_to_le32(VIRTQUEUE_NUM - 2); 1733 conf.seg_max = cpu_to_le32(VIRTQUEUE_NUM - 2);
1517 1734
1518 set_config(dev, sizeof(conf), &conf); 1735 /* Don't try to put whole struct: we have 8 bit limit. */
1736 set_config(dev, offsetof(struct virtio_blk_config, geometry), &conf);
1519 1737
1520 verbose("device %u: virtblock %llu sectors\n", 1738 verbose("device %u: virtblock %llu sectors\n",
1521 ++devices.device_num, le64_to_cpu(conf.capacity)); 1739 ++devices.device_num, le64_to_cpu(conf.capacity));
1522} 1740}
1523 1741
1524struct rng_info { 1742/*L:211
1525 int rfd; 1743 * Our random number generator device reads from /dev/random into the Guest's
1526};
1527
1528/* Our random number generator device reads from /dev/random into the Guest's
1529 * input buffers. The usual case is that the Guest doesn't want random numbers 1744 * input buffers. The usual case is that the Guest doesn't want random numbers
1530 * and so has no buffers although /dev/random is still readable, whereas 1745 * and so has no buffers although /dev/random is still readable, whereas
1531 * console is the reverse. 1746 * console is the reverse.
1532 * 1747 *
1533 * The same logic applies, however. */ 1748 * The same logic applies, however.
1749 */
1750struct rng_info {
1751 int rfd;
1752};
1753
1534static void rng_input(struct virtqueue *vq) 1754static void rng_input(struct virtqueue *vq)
1535{ 1755{
1536 int len; 1756 int len;
@@ -1543,9 +1763,10 @@ static void rng_input(struct virtqueue *vq)
1543 if (out_num) 1763 if (out_num)
1544 errx(1, "Output buffers in rng?"); 1764 errx(1, "Output buffers in rng?");
1545 1765
1546 /* This is why we convert to iovecs: the readv() call uses them, and so 1766 /*
1547 * it reads straight into the Guest's buffer. We loop to make sure we 1767 * Just like the console write, we loop to cover the whole iovec.
1548 * fill it. */ 1768 * In this case, short reads actually happen quite a bit.
1769 */
1549 while (!iov_empty(iov, in_num)) { 1770 while (!iov_empty(iov, in_num)) {
1550 len = readv(rng_info->rfd, iov, in_num); 1771 len = readv(rng_info->rfd, iov, in_num);
1551 if (len <= 0) 1772 if (len <= 0)
@@ -1558,15 +1779,18 @@ static void rng_input(struct virtqueue *vq)
1558 add_used(vq, head, totlen); 1779 add_used(vq, head, totlen);
1559} 1780}
1560 1781
1561/* And this creates a "hardware" random number device for the Guest. */ 1782/*L:199
1783 * This creates a "hardware" random number device for the Guest.
1784 */
1562static void setup_rng(void) 1785static void setup_rng(void)
1563{ 1786{
1564 struct device *dev; 1787 struct device *dev;
1565 struct rng_info *rng_info = malloc(sizeof(*rng_info)); 1788 struct rng_info *rng_info = malloc(sizeof(*rng_info));
1566 1789
1790 /* Our device's privat info simply contains the /dev/random fd. */
1567 rng_info->rfd = open_or_die("/dev/random", O_RDONLY); 1791 rng_info->rfd = open_or_die("/dev/random", O_RDONLY);
1568 1792
1569 /* The device responds to return from I/O thread. */ 1793 /* Create the new device. */
1570 dev = new_device("rng", VIRTIO_ID_RNG); 1794 dev = new_device("rng", VIRTIO_ID_RNG);
1571 dev->priv = rng_info; 1795 dev->priv = rng_info;
1572 1796
@@ -1582,8 +1806,10 @@ static void __attribute__((noreturn)) restart_guest(void)
1582{ 1806{
1583 unsigned int i; 1807 unsigned int i;
1584 1808
1585 /* Since we don't track all open fds, we simply close everything beyond 1809 /*
1586 * stderr. */ 1810 * Since we don't track all open fds, we simply close everything beyond
1811 * stderr.
1812 */
1587 for (i = 3; i < FD_SETSIZE; i++) 1813 for (i = 3; i < FD_SETSIZE; i++)
1588 close(i); 1814 close(i);
1589 1815
@@ -1594,8 +1820,10 @@ static void __attribute__((noreturn)) restart_guest(void)
1594 err(1, "Could not exec %s", main_args[0]); 1820 err(1, "Could not exec %s", main_args[0]);
1595} 1821}
1596 1822
1597/*L:220 Finally we reach the core of the Launcher which runs the Guest, serves 1823/*L:220
1598 * its input and output, and finally, lays it to rest. */ 1824 * Finally we reach the core of the Launcher which runs the Guest, serves
1825 * its input and output, and finally, lays it to rest.
1826 */
1599static void __attribute__((noreturn)) run_guest(void) 1827static void __attribute__((noreturn)) run_guest(void)
1600{ 1828{
1601 for (;;) { 1829 for (;;) {
@@ -1630,7 +1858,7 @@ static void __attribute__((noreturn)) run_guest(void)
1630 * 1858 *
1631 * Are you ready? Take a deep breath and join me in the core of the Host, in 1859 * Are you ready? Take a deep breath and join me in the core of the Host, in
1632 * "make Host". 1860 * "make Host".
1633 :*/ 1861:*/
1634 1862
1635static struct option opts[] = { 1863static struct option opts[] = {
1636 { "verbose", 0, NULL, 'v' }, 1864 { "verbose", 0, NULL, 'v' },
@@ -1651,8 +1879,7 @@ static void usage(void)
1651/*L:105 The main routine is where the real work begins: */ 1879/*L:105 The main routine is where the real work begins: */
1652int main(int argc, char *argv[]) 1880int main(int argc, char *argv[])
1653{ 1881{
1654 /* Memory, top-level pagetable, code startpoint and size of the 1882 /* Memory, code startpoint and size of the (optional) initrd. */
1655 * (optional) initrd. */
1656 unsigned long mem = 0, start, initrd_size = 0; 1883 unsigned long mem = 0, start, initrd_size = 0;
1657 /* Two temporaries. */ 1884 /* Two temporaries. */
1658 int i, c; 1885 int i, c;
@@ -1664,24 +1891,32 @@ int main(int argc, char *argv[])
1664 /* Save the args: we "reboot" by execing ourselves again. */ 1891 /* Save the args: we "reboot" by execing ourselves again. */
1665 main_args = argv; 1892 main_args = argv;
1666 1893
1667 /* First we initialize the device list. We keep a pointer to the last 1894 /*
1895 * First we initialize the device list. We keep a pointer to the last
1668 * device, and the next interrupt number to use for devices (1: 1896 * device, and the next interrupt number to use for devices (1:
1669 * remember that 0 is used by the timer). */ 1897 * remember that 0 is used by the timer).
1898 */
1670 devices.lastdev = NULL; 1899 devices.lastdev = NULL;
1671 devices.next_irq = 1; 1900 devices.next_irq = 1;
1672 1901
1902 /* We're CPU 0. In fact, that's the only CPU possible right now. */
1673 cpu_id = 0; 1903 cpu_id = 0;
1674 /* We need to know how much memory so we can set up the device 1904
1905 /*
1906 * We need to know how much memory so we can set up the device
1675 * descriptor and memory pages for the devices as we parse the command 1907 * descriptor and memory pages for the devices as we parse the command
1676 * line. So we quickly look through the arguments to find the amount 1908 * line. So we quickly look through the arguments to find the amount
1677 * of memory now. */ 1909 * of memory now.
1910 */
1678 for (i = 1; i < argc; i++) { 1911 for (i = 1; i < argc; i++) {
1679 if (argv[i][0] != '-') { 1912 if (argv[i][0] != '-') {
1680 mem = atoi(argv[i]) * 1024 * 1024; 1913 mem = atoi(argv[i]) * 1024 * 1024;
1681 /* We start by mapping anonymous pages over all of 1914 /*
1915 * We start by mapping anonymous pages over all of
1682 * guest-physical memory range. This fills it with 0, 1916 * guest-physical memory range. This fills it with 0,
1683 * and ensures that the Guest won't be killed when it 1917 * and ensures that the Guest won't be killed when it
1684 * tries to access it. */ 1918 * tries to access it.
1919 */
1685 guest_base = map_zeroed_pages(mem / getpagesize() 1920 guest_base = map_zeroed_pages(mem / getpagesize()
1686 + DEVICE_PAGES); 1921 + DEVICE_PAGES);
1687 guest_limit = mem; 1922 guest_limit = mem;
@@ -1714,8 +1949,10 @@ int main(int argc, char *argv[])
1714 usage(); 1949 usage();
1715 } 1950 }
1716 } 1951 }
1717 /* After the other arguments we expect memory and kernel image name, 1952 /*
1718 * followed by command line arguments for the kernel. */ 1953 * After the other arguments we expect memory and kernel image name,
1954 * followed by command line arguments for the kernel.
1955 */
1719 if (optind + 2 > argc) 1956 if (optind + 2 > argc)
1720 usage(); 1957 usage();
1721 1958
@@ -1733,20 +1970,26 @@ int main(int argc, char *argv[])
1733 /* Map the initrd image if requested (at top of physical memory) */ 1970 /* Map the initrd image if requested (at top of physical memory) */
1734 if (initrd_name) { 1971 if (initrd_name) {
1735 initrd_size = load_initrd(initrd_name, mem); 1972 initrd_size = load_initrd(initrd_name, mem);
1736 /* These are the location in the Linux boot header where the 1973 /*
1737 * start and size of the initrd are expected to be found. */ 1974 * These are the location in the Linux boot header where the
1975 * start and size of the initrd are expected to be found.
1976 */
1738 boot->hdr.ramdisk_image = mem - initrd_size; 1977 boot->hdr.ramdisk_image = mem - initrd_size;
1739 boot->hdr.ramdisk_size = initrd_size; 1978 boot->hdr.ramdisk_size = initrd_size;
1740 /* The bootloader type 0xFF means "unknown"; that's OK. */ 1979 /* The bootloader type 0xFF means "unknown"; that's OK. */
1741 boot->hdr.type_of_loader = 0xFF; 1980 boot->hdr.type_of_loader = 0xFF;
1742 } 1981 }
1743 1982
1744 /* The Linux boot header contains an "E820" memory map: ours is a 1983 /*
1745 * simple, single region. */ 1984 * The Linux boot header contains an "E820" memory map: ours is a
1985 * simple, single region.
1986 */
1746 boot->e820_entries = 1; 1987 boot->e820_entries = 1;
1747 boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM }); 1988 boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM });
1748 /* The boot header contains a command line pointer: we put the command 1989 /*
1749 * line after the boot header. */ 1990 * The boot header contains a command line pointer: we put the command
1991 * line after the boot header.
1992 */
1750 boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1); 1993 boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1);
1751 /* We use a simple helper to copy the arguments separated by spaces. */ 1994 /* We use a simple helper to copy the arguments separated by spaces. */
1752 concat((char *)(boot + 1), argv+optind+2); 1995 concat((char *)(boot + 1), argv+optind+2);
@@ -1760,11 +2003,13 @@ int main(int argc, char *argv[])
1760 /* Tell the entry path not to try to reload segment registers. */ 2003 /* Tell the entry path not to try to reload segment registers. */
1761 boot->hdr.loadflags |= KEEP_SEGMENTS; 2004 boot->hdr.loadflags |= KEEP_SEGMENTS;
1762 2005
1763 /* We tell the kernel to initialize the Guest: this returns the open 2006 /*
1764 * /dev/lguest file descriptor. */ 2007 * We tell the kernel to initialize the Guest: this returns the open
2008 * /dev/lguest file descriptor.
2009 */
1765 tell_kernel(start); 2010 tell_kernel(start);
1766 2011
1767 /* Ensure that we terminate if a child dies. */ 2012 /* Ensure that we terminate if a device-servicing child dies. */
1768 signal(SIGCHLD, kill_launcher); 2013 signal(SIGCHLD, kill_launcher);
1769 2014
1770 /* If we exit via err(), this kills all the threads, restores tty. */ 2015 /* If we exit via err(), this kills all the threads, restores tty. */
diff --git a/Documentation/lockdep-design.txt b/Documentation/lockdep-design.txt
index e20d913d5914..abf768c681e2 100644
--- a/Documentation/lockdep-design.txt
+++ b/Documentation/lockdep-design.txt
@@ -30,9 +30,9 @@ State
30The validator tracks lock-class usage history into 4n + 1 separate state bits: 30The validator tracks lock-class usage history into 4n + 1 separate state bits:
31 31
32- 'ever held in STATE context' 32- 'ever held in STATE context'
33- 'ever head as readlock in STATE context' 33- 'ever held as readlock in STATE context'
34- 'ever head with STATE enabled' 34- 'ever held with STATE enabled'
35- 'ever head as readlock with STATE enabled' 35- 'ever held as readlock with STATE enabled'
36 36
37Where STATE can be either one of (kernel/lockdep_states.h) 37Where STATE can be either one of (kernel/lockdep_states.h)
38 - hardirq 38 - hardirq
diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt
index d0777a1200e1..8f339428fdf4 100644
--- a/Documentation/networking/6pack.txt
+++ b/Documentation/networking/6pack.txt
@@ -1,7 +1,7 @@
1This is the 6pack-mini-HOWTO, written by 1This is the 6pack-mini-HOWTO, written by
2 2
3Andreas Könsgen DG3KQ 3Andreas Könsgen DG3KQ
4Internet: ajk@iehk.rwth-aachen.de 4Internet: ajk@comnets.uni-bremen.de
5AMPR-net: dg3kq@db0pra.ampr.org 5AMPR-net: dg3kq@db0pra.ampr.org
6AX.25: dg3kq@db0ach.#nrw.deu.eu 6AX.25: dg3kq@db0ach.#nrw.deu.eu
7 7
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt
index 1df7f9cdab05..86eabe6c3419 100644
--- a/Documentation/scheduler/sched-rt-group.txt
+++ b/Documentation/scheduler/sched-rt-group.txt
@@ -73,7 +73,7 @@ The remaining CPU time will be used for user input and other tasks. Because
73realtime tasks have explicitly allocated the CPU time they need to perform 73realtime tasks have explicitly allocated the CPU time they need to perform
74their tasks, buffer underruns in the graphics or audio can be eliminated. 74their tasks, buffer underruns in the graphics or audio can be eliminated.
75 75
76NOTE: the above example is not fully implemented as of yet (2.6.25). We still 76NOTE: the above example is not fully implemented yet. We still
77lack an EDF scheduler to make non-uniform periods usable. 77lack an EDF scheduler to make non-uniform periods usable.
78 78
79 79
@@ -140,14 +140,15 @@ The other option is:
140 140
141.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") 141.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
142 142
143This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" 143This uses the /cgroup virtual file system and
144to control the CPU time reserved for each control group instead. 144"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each
145control group instead.
145 146
146For more information on working with control groups, you should read 147For more information on working with control groups, you should read
147Documentation/cgroups/cgroups.txt as well. 148Documentation/cgroups/cgroups.txt as well.
148 149
149Group settings are checked against the following limits in order to keep the configuration 150Group settings are checked against the following limits in order to keep the
150schedulable: 151configuration schedulable:
151 152
152 \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period 153 \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
153 154
@@ -189,7 +190,7 @@ Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
189the biggest challenge as the current linux PI infrastructure is geared towards 190the biggest challenge as the current linux PI infrastructure is geared towards
190the limited static priority levels 0-99. With deadline scheduling you need to 191the limited static priority levels 0-99. With deadline scheduling you need to
191do deadline inheritance (since priority is inversely proportional to the 192do deadline inheritance (since priority is inversely proportional to the
192deadline delta (deadline - now). 193deadline delta (deadline - now)).
193 194
194This means the whole PI machinery will have to be reworked - and that is one of 195This means the whole PI machinery will have to be reworked - and that is one of
195the most complex pieces of code we have. 196the most complex pieces of code we have.
diff --git a/Documentation/sound/alsa/Procfile.txt b/Documentation/sound/alsa/Procfile.txt
index 381908d8ca42..719a819f8cc2 100644
--- a/Documentation/sound/alsa/Procfile.txt
+++ b/Documentation/sound/alsa/Procfile.txt
@@ -101,6 +101,8 @@ card*/pcm*/xrun_debug
101 bit 0 = Enable XRUN/jiffies debug messages 101 bit 0 = Enable XRUN/jiffies debug messages
102 bit 1 = Show stack trace at XRUN / jiffies check 102 bit 1 = Show stack trace at XRUN / jiffies check
103 bit 2 = Enable additional jiffies check 103 bit 2 = Enable additional jiffies check
104 bit 3 = Log hwptr update at each period interrupt
105 bit 4 = Log hwptr update at each snd_pcm_update_hw_ptr()
104 106
105 When the bit 0 is set, the driver will show the messages to 107 When the bit 0 is set, the driver will show the messages to
106 kernel log when an xrun is detected. The debug message is 108 kernel log when an xrun is detected. The debug message is
@@ -117,6 +119,9 @@ card*/pcm*/xrun_debug
117 buggy) hardware that doesn't give smooth pointer updates. 119 buggy) hardware that doesn't give smooth pointer updates.
118 This feature is enabled via the bit 2. 120 This feature is enabled via the bit 2.
119 121
122 Bits 3 and 4 are for logging the hwptr records. Note that
123 these will give flood of kernel messages.
124
120card*/pcm*/sub*/info 125card*/pcm*/sub*/info
121 The general information of this PCM sub-stream. 126 The general information of this PCM sub-stream.
122 127
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt
index cf42b820ff9d..d56a01775423 100644
--- a/Documentation/sysrq.txt
+++ b/Documentation/sysrq.txt
@@ -66,7 +66,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.:
66'b' - Will immediately reboot the system without syncing or unmounting 66'b' - Will immediately reboot the system without syncing or unmounting
67 your disks. 67 your disks.
68 68
69'c' - Will perform a kexec reboot in order to take a crashdump. 69'c' - Will perform a system crash by a NULL pointer dereference.
70 A crashdump will be taken if configured.
70 71
71'd' - Shows all locks that are held. 72'd' - Shows all locks that are held.
72 73
@@ -141,8 +142,8 @@ useful when you want to exit a program that will not let you switch consoles.
141re'B'oot is good when you're unable to shut down. But you should also 'S'ync 142re'B'oot is good when you're unable to shut down. But you should also 'S'ync
142and 'U'mount first. 143and 'U'mount first.
143 144
144'C'rashdump can be used to manually trigger a crashdump when the system is hung. 145'C'rash can be used to manually trigger a crashdump when the system is hung.
145The kernel needs to have been built with CONFIG_KEXEC enabled. 146Note that this just triggers a crash if there is no dump mechanism available.
146 147
147'S'ync is great when your system is locked up, it allows you to sync your 148'S'ync is great when your system is locked up, it allows you to sync your
148disks and will certainly lessen the chance of data loss and fscking. Note 149disks and will certainly lessen the chance of data loss and fscking. Note
diff --git a/Documentation/video4linux/CARDLIST.em28xx b/Documentation/video4linux/CARDLIST.em28xx
index 873630e7e53e..e352d754875c 100644
--- a/Documentation/video4linux/CARDLIST.em28xx
+++ b/Documentation/video4linux/CARDLIST.em28xx
@@ -1,5 +1,5 @@
1 0 -> Unknown EM2800 video grabber (em2800) [eb1a:2800] 1 0 -> Unknown EM2800 video grabber (em2800) [eb1a:2800]
2 1 -> Unknown EM2750/28xx video grabber (em2820/em2840) [eb1a:2820,eb1a:2821,eb1a:2860,eb1a:2861,eb1a:2870,eb1a:2881,eb1a:2883] 2 1 -> Unknown EM2750/28xx video grabber (em2820/em2840) [eb1a:2710,eb1a:2820,eb1a:2821,eb1a:2860,eb1a:2861,eb1a:2870,eb1a:2881,eb1a:2883]
3 2 -> Terratec Cinergy 250 USB (em2820/em2840) [0ccd:0036] 3 2 -> Terratec Cinergy 250 USB (em2820/em2840) [0ccd:0036]
4 3 -> Pinnacle PCTV USB 2 (em2820/em2840) [2304:0208] 4 3 -> Pinnacle PCTV USB 2 (em2820/em2840) [2304:0208]
5 4 -> Hauppauge WinTV USB 2 (em2820/em2840) [2040:4200,2040:4201] 5 4 -> Hauppauge WinTV USB 2 (em2820/em2840) [2040:4200,2040:4201]
@@ -20,7 +20,7 @@
20 19 -> EM2860/SAA711X Reference Design (em2860) 20 19 -> EM2860/SAA711X Reference Design (em2860)
21 20 -> AMD ATI TV Wonder HD 600 (em2880) [0438:b002] 21 20 -> AMD ATI TV Wonder HD 600 (em2880) [0438:b002]
22 21 -> eMPIA Technology, Inc. GrabBeeX+ Video Encoder (em2800) [eb1a:2801] 22 21 -> eMPIA Technology, Inc. GrabBeeX+ Video Encoder (em2800) [eb1a:2801]
23 22 -> Unknown EM2750/EM2751 webcam grabber (em2750) [eb1a:2750,eb1a:2751] 23 22 -> EM2710/EM2750/EM2751 webcam grabber (em2750) [eb1a:2750,eb1a:2751]
24 23 -> Huaqi DLCW-130 (em2750) 24 23 -> Huaqi DLCW-130 (em2750)
25 24 -> D-Link DUB-T210 TV Tuner (em2820/em2840) [2001:f112] 25 24 -> D-Link DUB-T210 TV Tuner (em2820/em2840) [2001:f112]
26 25 -> Gadmei UTV310 (em2820/em2840) 26 25 -> Gadmei UTV310 (em2820/em2840)
@@ -66,3 +66,4 @@
66 68 -> Terratec AV350 (em2860) [0ccd:0084] 66 68 -> Terratec AV350 (em2860) [0ccd:0084]
67 69 -> KWorld ATSC 315U HDTV TV Box (em2882) [eb1a:a313] 67 69 -> KWorld ATSC 315U HDTV TV Box (em2882) [eb1a:a313]
68 70 -> Evga inDtube (em2882) 68 70 -> Evga inDtube (em2882)
69 71 -> Silvercrest Webcam 1.3mpix (em2820/em2840)
diff --git a/Documentation/video4linux/CARDLIST.saa7134 b/Documentation/video4linux/CARDLIST.saa7134
index 15562427e8a9..c913e5614195 100644
--- a/Documentation/video4linux/CARDLIST.saa7134
+++ b/Documentation/video4linux/CARDLIST.saa7134
@@ -153,8 +153,8 @@
153152 -> Asus Tiger Rev:1.00 [1043:4857] 153152 -> Asus Tiger Rev:1.00 [1043:4857]
154153 -> Kworld Plus TV Analog Lite PCI [17de:7128] 154153 -> Kworld Plus TV Analog Lite PCI [17de:7128]
155154 -> Avermedia AVerTV GO 007 FM Plus [1461:f31d] 155154 -> Avermedia AVerTV GO 007 FM Plus [1461:f31d]
156155 -> Hauppauge WinTV-HVR1120 ATSC/QAM-Hybrid [0070:6706,0070:6708] 156155 -> Hauppauge WinTV-HVR1150 ATSC/QAM-Hybrid [0070:6706,0070:6708]
157156 -> Hauppauge WinTV-HVR1110r3 DVB-T/Hybrid [0070:6707,0070:6709,0070:670a] 157156 -> Hauppauge WinTV-HVR1120 DVB-T/Hybrid [0070:6707,0070:6709,0070:670a]
158157 -> Avermedia AVerTV Studio 507UA [1461:a11b] 158157 -> Avermedia AVerTV Studio 507UA [1461:a11b]
159158 -> AVerMedia Cardbus TV/Radio (E501R) [1461:b7e9] 159158 -> AVerMedia Cardbus TV/Radio (E501R) [1461:b7e9]
160159 -> Beholder BeholdTV 505 RDS [0000:505B] 160159 -> Beholder BeholdTV 505 RDS [0000:505B]
diff --git a/Documentation/video4linux/gspca.txt b/Documentation/video4linux/gspca.txt
index 2bcf78896e22..573f95b58807 100644
--- a/Documentation/video4linux/gspca.txt
+++ b/Documentation/video4linux/gspca.txt
@@ -44,7 +44,9 @@ zc3xx 0458:7007 Genius VideoCam V2
44zc3xx 0458:700c Genius VideoCam V3 44zc3xx 0458:700c Genius VideoCam V3
45zc3xx 0458:700f Genius VideoCam Web V2 45zc3xx 0458:700f Genius VideoCam Web V2
46sonixj 0458:7025 Genius Eye 311Q 46sonixj 0458:7025 Genius Eye 311Q
47sn9c20x 0458:7029 Genius Look 320s
47sonixj 0458:702e Genius Slim 310 NB 48sonixj 0458:702e Genius Slim 310 NB
49sn9c20x 045e:00f4 LifeCam VX-6000 (SN9C20x + OV9650)
48sonixj 045e:00f5 MicroSoft VX3000 50sonixj 045e:00f5 MicroSoft VX3000
49sonixj 045e:00f7 MicroSoft VX1000 51sonixj 045e:00f7 MicroSoft VX1000
50ov519 045e:028c Micro$oft xbox cam 52ov519 045e:028c Micro$oft xbox cam
@@ -282,6 +284,28 @@ sonixj 0c45:613a Microdia Sonix PC Camera
282sonixj 0c45:613b Surfer SN-206 284sonixj 0c45:613b Surfer SN-206
283sonixj 0c45:613c Sonix Pccam168 285sonixj 0c45:613c Sonix Pccam168
284sonixj 0c45:6143 Sonix Pccam168 286sonixj 0c45:6143 Sonix Pccam168
287sn9c20x 0c45:6240 PC Camera (SN9C201 + MT9M001)
288sn9c20x 0c45:6242 PC Camera (SN9C201 + MT9M111)
289sn9c20x 0c45:6248 PC Camera (SN9C201 + OV9655)
290sn9c20x 0c45:624e PC Camera (SN9C201 + SOI968)
291sn9c20x 0c45:624f PC Camera (SN9C201 + OV9650)
292sn9c20x 0c45:6251 PC Camera (SN9C201 + OV9650)
293sn9c20x 0c45:6253 PC Camera (SN9C201 + OV9650)
294sn9c20x 0c45:6260 PC Camera (SN9C201 + OV7670)
295sn9c20x 0c45:6270 PC Camera (SN9C201 + MT9V011/MT9V111/MT9V112)
296sn9c20x 0c45:627b PC Camera (SN9C201 + OV7660)
297sn9c20x 0c45:627c PC Camera (SN9C201 + HV7131R)
298sn9c20x 0c45:627f PC Camera (SN9C201 + OV9650)
299sn9c20x 0c45:6280 PC Camera (SN9C202 + MT9M001)
300sn9c20x 0c45:6282 PC Camera (SN9C202 + MT9M111)
301sn9c20x 0c45:6288 PC Camera (SN9C202 + OV9655)
302sn9c20x 0c45:628e PC Camera (SN9C202 + SOI968)
303sn9c20x 0c45:628f PC Camera (SN9C202 + OV9650)
304sn9c20x 0c45:62a0 PC Camera (SN9C202 + OV7670)
305sn9c20x 0c45:62b0 PC Camera (SN9C202 + MT9V011/MT9V111/MT9V112)
306sn9c20x 0c45:62b3 PC Camera (SN9C202 + OV9655)
307sn9c20x 0c45:62bb PC Camera (SN9C202 + OV7660)
308sn9c20x 0c45:62bc PC Camera (SN9C202 + HV7131R)
285sunplus 0d64:0303 Sunplus FashionCam DXG 309sunplus 0d64:0303 Sunplus FashionCam DXG
286etoms 102c:6151 Qcam Sangha CIF 310etoms 102c:6151 Qcam Sangha CIF
287etoms 102c:6251 Qcam xxxxxx VGA 311etoms 102c:6251 Qcam xxxxxx VGA
@@ -290,6 +314,7 @@ spca561 10fd:7e50 FlyCam Usb 100
290zc3xx 10fd:8050 Typhoon Webshot II USB 300k 314zc3xx 10fd:8050 Typhoon Webshot II USB 300k
291ov534 1415:2000 Sony HD Eye for PS3 (SLEH 00201) 315ov534 1415:2000 Sony HD Eye for PS3 (SLEH 00201)
292pac207 145f:013a Trust WB-1300N 316pac207 145f:013a Trust WB-1300N
317sn9c20x 145f:013d Trust WB-3600R
293vc032x 15b8:6001 HP 2.0 Megapixel 318vc032x 15b8:6001 HP 2.0 Megapixel
294vc032x 15b8:6002 HP 2.0 Megapixel rz406aa 319vc032x 15b8:6002 HP 2.0 Megapixel rz406aa
295spca501 1776:501c Arowana 300K CMOS Camera 320spca501 1776:501c Arowana 300K CMOS Camera
@@ -300,4 +325,11 @@ spca500 2899:012c Toptro Industrial
300spca508 8086:0110 Intel Easy PC Camera 325spca508 8086:0110 Intel Easy PC Camera
301spca500 8086:0630 Intel Pocket PC Camera 326spca500 8086:0630 Intel Pocket PC Camera
302spca506 99fa:8988 Grandtec V.cap 327spca506 99fa:8988 Grandtec V.cap
328sn9c20x a168:0610 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
329sn9c20x a168:0611 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
330sn9c20x a168:0613 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
331sn9c20x a168:0618 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
332sn9c20x a168:0614 Dino-Lite Digital Microscope (SN9C201 + MT9M111)
333sn9c20x a168:0615 Dino-Lite Digital Microscope (SN9C201 + MT9M111)
334sn9c20x a168:0617 Dino-Lite Digital Microscope (SN9C201 + MT9M111)
303spca561 abcd:cdee Petcam 335spca561 abcd:cdee Petcam
diff --git a/Documentation/x86/00-INDEX b/Documentation/x86/00-INDEX
index dbe3377754af..f37b46d34861 100644
--- a/Documentation/x86/00-INDEX
+++ b/Documentation/x86/00-INDEX
@@ -2,3 +2,5 @@
2 - this file 2 - this file
3mtrr.txt 3mtrr.txt
4 - how to use x86 Memory Type Range Registers to increase performance 4 - how to use x86 Memory Type Range Registers to increase performance
5exception-tables.txt
6 - why and how Linux kernel uses exception tables on x86
diff --git a/Documentation/exception.txt b/Documentation/x86/exception-tables.txt
index 2d5aded64247..32901aa36f0a 100644
--- a/Documentation/exception.txt
+++ b/Documentation/x86/exception-tables.txt
@@ -1,123 +1,123 @@
1 Kernel level exception handling in Linux 2.1.8 1 Kernel level exception handling in Linux
2 Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> 2 Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
3 3
4When a process runs in kernel mode, it often has to access user 4When a process runs in kernel mode, it often has to access user
5mode memory whose address has been passed by an untrusted program. 5mode memory whose address has been passed by an untrusted program.
6To protect itself the kernel has to verify this address. 6To protect itself the kernel has to verify this address.
7 7
8In older versions of Linux this was done with the 8In older versions of Linux this was done with the
9int verify_area(int type, const void * addr, unsigned long size) 9int verify_area(int type, const void * addr, unsigned long size)
10function (which has since been replaced by access_ok()). 10function (which has since been replaced by access_ok()).
11 11
12This function verified that the memory area starting at address 12This function verified that the memory area starting at address
13'addr' and of size 'size' was accessible for the operation specified 13'addr' and of size 'size' was accessible for the operation specified
14in type (read or write). To do this, verify_read had to look up the 14in type (read or write). To do this, verify_read had to look up the
15virtual memory area (vma) that contained the address addr. In the 15virtual memory area (vma) that contained the address addr. In the
16normal case (correctly working program), this test was successful. 16normal case (correctly working program), this test was successful.
17It only failed for a few buggy programs. In some kernel profiling 17It only failed for a few buggy programs. In some kernel profiling
18tests, this normally unneeded verification used up a considerable 18tests, this normally unneeded verification used up a considerable
19amount of time. 19amount of time.
20 20
21To overcome this situation, Linus decided to let the virtual memory 21To overcome this situation, Linus decided to let the virtual memory
22hardware present in every Linux-capable CPU handle this test. 22hardware present in every Linux-capable CPU handle this test.
23 23
24How does this work? 24How does this work?
25 25
26Whenever the kernel tries to access an address that is currently not 26Whenever the kernel tries to access an address that is currently not
27accessible, the CPU generates a page fault exception and calls the 27accessible, the CPU generates a page fault exception and calls the
28page fault handler 28page fault handler
29 29
30void do_page_fault(struct pt_regs *regs, unsigned long error_code) 30void do_page_fault(struct pt_regs *regs, unsigned long error_code)
31 31
32in arch/i386/mm/fault.c. The parameters on the stack are set up by 32in arch/x86/mm/fault.c. The parameters on the stack are set up by
33the low level assembly glue in arch/i386/kernel/entry.S. The parameter 33the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
34regs is a pointer to the saved registers on the stack, error_code 34regs is a pointer to the saved registers on the stack, error_code
35contains a reason code for the exception. 35contains a reason code for the exception.
36 36
37do_page_fault first obtains the unaccessible address from the CPU 37do_page_fault first obtains the unaccessible address from the CPU
38control register CR2. If the address is within the virtual address 38control register CR2. If the address is within the virtual address
39space of the process, the fault probably occurred, because the page 39space of the process, the fault probably occurred, because the page
40was not swapped in, write protected or something similar. However, 40was not swapped in, write protected or something similar. However,
41we are interested in the other case: the address is not valid, there 41we are interested in the other case: the address is not valid, there
42is no vma that contains this address. In this case, the kernel jumps 42is no vma that contains this address. In this case, the kernel jumps
43to the bad_area label. 43to the bad_area label.
44 44
45There it uses the address of the instruction that caused the exception 45There it uses the address of the instruction that caused the exception
46(i.e. regs->eip) to find an address where the execution can continue 46(i.e. regs->eip) to find an address where the execution can continue
47(fixup). If this search is successful, the fault handler modifies the 47(fixup). If this search is successful, the fault handler modifies the
48return address (again regs->eip) and returns. The execution will 48return address (again regs->eip) and returns. The execution will
49continue at the address in fixup. 49continue at the address in fixup.
50 50
51Where does fixup point to? 51Where does fixup point to?
52 52
53Since we jump to the contents of fixup, fixup obviously points 53Since we jump to the contents of fixup, fixup obviously points
54to executable code. This code is hidden inside the user access macros. 54to executable code. This code is hidden inside the user access macros.
55I have picked the get_user macro defined in include/asm/uaccess.h as an 55I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
56example. The definition is somewhat hard to follow, so let's peek at 56as an example. The definition is somewhat hard to follow, so let's peek at
57the code generated by the preprocessor and the compiler. I selected 57the code generated by the preprocessor and the compiler. I selected
58the get_user call in drivers/char/console.c for a detailed examination. 58the get_user call in drivers/char/sysrq.c for a detailed examination.
59 59
60The original code in console.c line 1405: 60The original code in sysrq.c line 587:
61 get_user(c, buf); 61 get_user(c, buf);
62 62
63The preprocessor output (edited to become somewhat readable): 63The preprocessor output (edited to become somewhat readable):
64 64
65( 65(
66 { 66 {
67 long __gu_err = - 14 , __gu_val = 0; 67 long __gu_err = - 14 , __gu_val = 0;
68 const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); 68 const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
69 if (((((0 + current_set[0])->tss.segment) == 0x18 ) || 69 if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
70 (((sizeof(*(buf))) <= 0xC0000000UL) && 70 (((sizeof(*(buf))) <= 0xC0000000UL) &&
71 ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) 71 ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
72 do { 72 do {
73 __gu_err = 0; 73 __gu_err = 0;
74 switch ((sizeof(*(buf)))) { 74 switch ((sizeof(*(buf)))) {
75 case 1: 75 case 1:
76 __asm__ __volatile__( 76 __asm__ __volatile__(
77 "1: mov" "b" " %2,%" "b" "1\n" 77 "1: mov" "b" " %2,%" "b" "1\n"
78 "2:\n" 78 "2:\n"
79 ".section .fixup,\"ax\"\n" 79 ".section .fixup,\"ax\"\n"
80 "3: movl %3,%0\n" 80 "3: movl %3,%0\n"
81 " xor" "b" " %" "b" "1,%" "b" "1\n" 81 " xor" "b" " %" "b" "1,%" "b" "1\n"
82 " jmp 2b\n" 82 " jmp 2b\n"
83 ".section __ex_table,\"a\"\n" 83 ".section __ex_table,\"a\"\n"
84 " .align 4\n" 84 " .align 4\n"
85 " .long 1b,3b\n" 85 " .long 1b,3b\n"
86 ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) 86 ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
87 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; 87 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
88 break; 88 break;
89 case 2: 89 case 2:
90 __asm__ __volatile__( 90 __asm__ __volatile__(
91 "1: mov" "w" " %2,%" "w" "1\n" 91 "1: mov" "w" " %2,%" "w" "1\n"
92 "2:\n" 92 "2:\n"
93 ".section .fixup,\"ax\"\n" 93 ".section .fixup,\"ax\"\n"
94 "3: movl %3,%0\n" 94 "3: movl %3,%0\n"
95 " xor" "w" " %" "w" "1,%" "w" "1\n" 95 " xor" "w" " %" "w" "1,%" "w" "1\n"
96 " jmp 2b\n" 96 " jmp 2b\n"
97 ".section __ex_table,\"a\"\n" 97 ".section __ex_table,\"a\"\n"
98 " .align 4\n" 98 " .align 4\n"
99 " .long 1b,3b\n" 99 " .long 1b,3b\n"
100 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) 100 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
101 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); 101 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
102 break; 102 break;
103 case 4: 103 case 4:
104 __asm__ __volatile__( 104 __asm__ __volatile__(
105 "1: mov" "l" " %2,%" "" "1\n" 105 "1: mov" "l" " %2,%" "" "1\n"
106 "2:\n" 106 "2:\n"
107 ".section .fixup,\"ax\"\n" 107 ".section .fixup,\"ax\"\n"
108 "3: movl %3,%0\n" 108 "3: movl %3,%0\n"
109 " xor" "l" " %" "" "1,%" "" "1\n" 109 " xor" "l" " %" "" "1,%" "" "1\n"
110 " jmp 2b\n" 110 " jmp 2b\n"
111 ".section __ex_table,\"a\"\n" 111 ".section __ex_table,\"a\"\n"
112 " .align 4\n" " .long 1b,3b\n" 112 " .align 4\n" " .long 1b,3b\n"
113 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) 113 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
114 ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); 114 ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
115 break; 115 break;
116 default: 116 default:
117 (__gu_val) = __get_user_bad(); 117 (__gu_val) = __get_user_bad();
118 } 118 }
119 } while (0) ; 119 } while (0) ;
120 ((c)) = (__typeof__(*((buf))))__gu_val; 120 ((c)) = (__typeof__(*((buf))))__gu_val;
121 __gu_err; 121 __gu_err;
122 } 122 }
123); 123);
@@ -127,12 +127,12 @@ see what code gcc generates:
127 127
128 > xorl %edx,%edx 128 > xorl %edx,%edx
129 > movl current_set,%eax 129 > movl current_set,%eax
130 > cmpl $24,788(%eax) 130 > cmpl $24,788(%eax)
131 > je .L1424 131 > je .L1424
132 > cmpl $-1073741825,64(%esp) 132 > cmpl $-1073741825,64(%esp)
133 > ja .L1423 133 > ja .L1423
134 > .L1424: 134 > .L1424:
135 > movl %edx,%eax 135 > movl %edx,%eax
136 > movl 64(%esp),%ebx 136 > movl 64(%esp),%ebx
137 > #APP 137 > #APP
138 > 1: movb (%ebx),%dl /* this is the actual user access */ 138 > 1: movb (%ebx),%dl /* this is the actual user access */
@@ -149,17 +149,17 @@ see what code gcc generates:
149 > .L1423: 149 > .L1423:
150 > movzbl %dl,%esi 150 > movzbl %dl,%esi
151 151
152The optimizer does a good job and gives us something we can actually 152The optimizer does a good job and gives us something we can actually
153understand. Can we? The actual user access is quite obvious. Thanks 153understand. Can we? The actual user access is quite obvious. Thanks
154to the unified address space we can just access the address in user 154to the unified address space we can just access the address in user
155memory. But what does the .section stuff do????? 155memory. But what does the .section stuff do?????
156 156
157To understand this we have to look at the final kernel: 157To understand this we have to look at the final kernel:
158 158
159 > objdump --section-headers vmlinux 159 > objdump --section-headers vmlinux
160 > 160 >
161 > vmlinux: file format elf32-i386 161 > vmlinux: file format elf32-i386
162 > 162 >
163 > Sections: 163 > Sections:
164 > Idx Name Size VMA LMA File off Algn 164 > Idx Name Size VMA LMA File off Algn
165 > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 165 > 0 .text 00098f40 c0100000 c0100000 00001000 2**4
@@ -198,18 +198,18 @@ final kernel executable:
198 198
199The whole user memory access is reduced to 10 x86 machine instructions. 199The whole user memory access is reduced to 10 x86 machine instructions.
200The instructions bracketed in the .section directives are no longer 200The instructions bracketed in the .section directives are no longer
201in the normal execution path. They are located in a different section 201in the normal execution path. They are located in a different section
202of the executable file: 202of the executable file:
203 203
204 > objdump --disassemble --section=.fixup vmlinux 204 > objdump --disassemble --section=.fixup vmlinux
205 > 205 >
206 > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax 206 > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
207 > c0199ffa <.fixup+10ba> xorb %dl,%dl 207 > c0199ffa <.fixup+10ba> xorb %dl,%dl
208 > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> 208 > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3>
209 209
210And finally: 210And finally:
211 > objdump --full-contents --section=__ex_table vmlinux 211 > objdump --full-contents --section=__ex_table vmlinux
212 > 212 >
213 > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ 213 > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................
214 > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ 214 > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................
215 > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ 215 > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................
@@ -235,8 +235,8 @@ sections in the ELF object file. So the instructions
235ended up in the .fixup section of the object file and the addresses 235ended up in the .fixup section of the object file and the addresses
236 .long 1b,3b 236 .long 1b,3b
237ended up in the __ex_table section of the object file. 1b and 3b 237ended up in the __ex_table section of the object file. 1b and 3b
238are local labels. The local label 1b (1b stands for next label 1 238are local labels. The local label 1b (1b stands for next label 1
239backward) is the address of the instruction that might fault, i.e. 239backward) is the address of the instruction that might fault, i.e.
240in our case the address of the label 1 is c017e7a5: 240in our case the address of the label 1 is c017e7a5:
241the original assembly code: > 1: movb (%ebx),%dl 241the original assembly code: > 1: movb (%ebx),%dl
242and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl 242and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
@@ -254,7 +254,7 @@ The assembly code
254becomes the value pair 254becomes the value pair
255 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ 255 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
256 ^this is ^this is 256 ^this is ^this is
257 1b 3b 257 1b 3b
258c017e7a5,c0199ff5 in the exception table of the kernel. 258c017e7a5,c0199ff5 in the exception table of the kernel.
259 259
260So, what actually happens if a fault from kernel mode with no suitable 260So, what actually happens if a fault from kernel mode with no suitable
@@ -266,9 +266,9 @@ vma occurs?
2663.) CPU calls do_page_fault 2663.) CPU calls do_page_fault
2674.) do page fault calls search_exception_table (regs->eip == c017e7a5); 2674.) do page fault calls search_exception_table (regs->eip == c017e7a5);
2685.) search_exception_table looks up the address c017e7a5 in the 2685.) search_exception_table looks up the address c017e7a5 in the
269 exception table (i.e. the contents of the ELF section __ex_table) 269 exception table (i.e. the contents of the ELF section __ex_table)
270 and returns the address of the associated fault handle code c0199ff5. 270 and returns the address of the associated fault handle code c0199ff5.
2716.) do_page_fault modifies its own return address to point to the fault 2716.) do_page_fault modifies its own return address to point to the fault
272 handle code and returns. 272 handle code and returns.
2737.) execution continues in the fault handling code. 2737.) execution continues in the fault handling code.
2748.) 8a) EAX becomes -EFAULT (== -14) 2748.) 8a) EAX becomes -EFAULT (== -14)