author: Dan Williams <dan.j.williams@intel.com> 2009-09-08 20:55:54 -0400
committer: Dan Williams <dan.j.williams@intel.com> 2009-09-08 20:55:54 -0400
commit: 9134d02bc0af4a8747d448d1f811ec5f8eb96df6
tree: 704c3e5dcc10f360815c4868a74711f82fb62e27 /Documentation
parent: bbb20089a3275a19e475dbc21320c3742e3ca423
parent: 80ffb3cceaefa405f2ecd46d66500ed8d53efe74
Merge commit 'md/for-linus' into async-tx-next
Conflicts: drivers/md/raid5.c
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/ABI/testing/sysfs-block | 37
-rw-r--r-- Documentation/DocBook/kernel-hacking.tmpl | 4
-rw-r--r-- Documentation/DocBook/mac80211.tmpl | 2
-rw-r--r-- Documentation/RCU/rculist_nulls.txt | 7
-rw-r--r-- Documentation/arm/memory.txt | 2
-rw-r--r-- Documentation/block/data-integrity.txt | 4
-rw-r--r-- Documentation/cgroups/cpusets.txt | 12
-rw-r--r-- Documentation/connector/cn_test.c | 4
-rw-r--r-- Documentation/connector/ucon.c | 2
-rw-r--r-- Documentation/driver-model/driver.txt | 4
-rw-r--r-- Documentation/dvb/get_dvb_firmware | 53
-rw-r--r-- Documentation/feature-removal-schedule.txt | 10
-rw-r--r-- Documentation/filesystems/sysfs.txt | 3
-rw-r--r-- Documentation/gcov.txt | 25
-rw-r--r-- Documentation/kernel-parameters.txt | 13
-rw-r--r-- Documentation/kmemleak.txt | 23
-rw-r--r-- Documentation/laptops/thinkpad-acpi.txt | 127
-rw-r--r-- Documentation/leds-lp3944.txt | 50
-rw-r--r-- Documentation/lguest/lguest.c | 721
-rw-r--r-- Documentation/networking/6pack.txt | 2
-rw-r--r-- Documentation/powerpc/booting-without-of.txt | 1168
-rw-r--r-- Documentation/powerpc/dts-bindings/4xx/emac.txt | 148
-rw-r--r-- Documentation/powerpc/dts-bindings/gpio/gpio.txt | 50
-rw-r--r-- Documentation/powerpc/dts-bindings/gpio/led.txt | 17
-rw-r--r-- Documentation/powerpc/dts-bindings/gpio/mdio.txt | 19
-rw-r--r-- Documentation/powerpc/dts-bindings/marvell.txt | 521
-rw-r--r-- Documentation/powerpc/dts-bindings/phy.txt | 25
-rw-r--r-- Documentation/powerpc/dts-bindings/spi-bus.txt | 57
-rw-r--r-- Documentation/powerpc/dts-bindings/usb-ehci.txt | 25
-rw-r--r-- Documentation/powerpc/dts-bindings/xilinx.txt | 295
-rw-r--r-- Documentation/scheduler/sched-rt-group.txt | 13
-rw-r--r-- Documentation/sound/alsa/HD-Audio-Models.txt | 1
-rw-r--r-- Documentation/sound/alsa/Procfile.txt | 5
-rw-r--r-- Documentation/spi/spidev_test.c | 10
-rw-r--r-- Documentation/sysrq.txt | 7
-rw-r--r-- Documentation/video4linux/CARDLIST.em28xx | 3
-rw-r--r-- Documentation/video4linux/gspca.txt | 32
-rw-r--r-- Documentation/x86/00-INDEX | 2
-rw-r--r-- Documentation/x86/exception-tables.txt (renamed from Documentation/exception.txt) | 202
39 files changed, 2013 insertions, 1692 deletions
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index cbbd3e06994..5f3bedaf8e3 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -94,28 +94,37 @@ What: /sys/block/<disk>/queue/physical_block_size
 Date:		May 2009
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
 Description:
-		This is the smallest unit the storage device can write
-		without resorting to read-modify-write operation. It is
-		usually the same as the logical block size but may be
-		bigger. One example is SATA drives with 4KB sectors
-		that expose a 512-byte logical block size to the
-		operating system.
+		This is the smallest unit a physical storage device can
+		write atomically. It is usually the same as the logical
+		block size but may be bigger. One example is SATA
+		drives with 4KB sectors that expose a 512-byte logical
+		block size to the operating system. For stacked block
+		devices the physical_block_size variable contains the
+		maximum physical_block_size of the component devices.
 
 What:		/sys/block/<disk>/queue/minimum_io_size
 Date:		April 2009
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
 Description:
-		Storage devices may report a preferred minimum I/O size,
-		which is the smallest request the device can perform
-		without incurring a read-modify-write penalty. For disk
-		drives this is often the physical block size. For RAID
-		arrays it is often the stripe chunk size.
+		Storage devices may report a granularity or preferred
+		minimum I/O size which is the smallest request the
+		device can perform without incurring a performance
+		penalty. For disk drives this is often the physical
+		block size. For RAID arrays it is often the stripe
+		chunk size. A properly aligned multiple of
+		minimum_io_size is the preferred request size for
+		workloads where a high number of I/O operations is
+		desired.
 
 What:		/sys/block/<disk>/queue/optimal_io_size
 Date:		April 2009
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
 Description:
 		Storage devices may report an optimal I/O size, which is
-		the device's preferred unit of receiving I/O. This is
-		rarely reported for disk drives. For RAID devices it is
-		usually the stripe width or the internal block size.
+		the device's preferred unit for sustained I/O. This is
+		rarely reported for disk drives. For RAID arrays it is
+		usually the stripe width or the internal track size. A
+		properly aligned multiple of optimal_io_size is the
+		preferred request size for workloads where sustained
+		throughput is desired. If no optimal I/O size is
+		reported this file contains 0.
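
Reviewer's sketch (not part of the patch): one way a userspace tool might apply the minimum_io_size and optimal_io_size hints described in the hunk above. The helper names read_queue_limit, round_up and preferred_request_size are hypothetical, and falling back to minimum_io_size when optimal_io_size reads 0 is an assumption of this sketch rather than something the ABI text prescribes.

```python
from pathlib import Path


def read_queue_limit(disk, name, sysfs_root="/sys/block"):
    """Return the integer stored in <sysfs_root>/<disk>/queue/<name>."""
    return int(Path(sysfs_root, disk, "queue", name).read_text().strip())


def round_up(size, granule):
    """Round size up to the next multiple of granule (granule must be > 0)."""
    return -(-size // granule) * granule


def preferred_request_size(wanted, minimum_io_size, optimal_io_size,
                           sustained=True):
    """Pick a request size that is a multiple of the relevant hint."""
    # optimal_io_size reads as 0 when the device reports none. Falling
    # back to minimum_io_size in that case is an assumption of this sketch.
    if sustained and optimal_io_size:
        granule = optimal_io_size
    else:
        granule = minimum_io_size
    return round_up(wanted, granule)
```

A caller would typically feed it the two sysfs values, for example read_queue_limit("sda", "minimum_io_size"), where "sda" is only a placeholder disk name. The sketch covers the size of a request only; aligning the starting offset is left to the caller.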
diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index a50d6cd5857..992e67e6be7 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -449,8 +449,8 @@ printk(KERN_INFO "i = %u\n", i);
   </para>
 
   <programlisting>
-__u32 ipaddress;
-printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
+__be32 ipaddress;
+printk(KERN_INFO "my ip: %pI4\n", &amp;ipaddress);
   </programlisting>
 
   <para>
diff --git a/Documentation/DocBook/mac80211.tmpl b/Documentation/DocBook/mac80211.tmpl
index e3698666357..f3f37f141db 100644
--- a/Documentation/DocBook/mac80211.tmpl
+++ b/Documentation/DocBook/mac80211.tmpl
@@ -184,8 +184,6 @@ usage should require reading the full document.
 !Finclude/net/mac80211.h ieee80211_ctstoself_get
 !Finclude/net/mac80211.h ieee80211_ctstoself_duration
 !Finclude/net/mac80211.h ieee80211_generic_frame_duration
-!Finclude/net/mac80211.h ieee80211_get_hdrlen_from_skb
-!Finclude/net/mac80211.h ieee80211_hdrlen
 !Finclude/net/mac80211.h ieee80211_wake_queue
 !Finclude/net/mac80211.h ieee80211_stop_queue
 !Finclude/net/mac80211.h ieee80211_wake_queues
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
index 93cb28d05dc..18f9651ff23 100644
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -83,11 +83,12 @@ not detect it missed following items in original chain.
 obj = kmem_cache_alloc(...);
 lock_chain(); // typically a spin_lock()
 obj->key = key;
-atomic_inc(&obj->refcnt);
 /*
  * we need to make sure obj->key is updated before obj->next
+ * or obj->refcnt
  */
 smp_wmb();
+atomic_set(&obj->refcnt, 1);
 hlist_add_head_rcu(&obj->obj_node, list);
 unlock_chain(); // typically a spin_unlock()
 
@@ -159,6 +160,10 @@ out:
 obj = kmem_cache_alloc(cachep);
 lock_chain(); // typically a spin_lock()
 obj->key = key;
+/*
+ * changes to obj->key must be visible before refcnt one
+ */
+smp_wmb();
 atomic_set(&obj->refcnt, 1);
 /*
  * insert obj in RCU way (readers might be traversing chain)
diff --git a/Documentation/arm/memory.txt b/Documentation/arm/memory.txt
index 43cb1004d35..9d58c7c5edd 100644
--- a/Documentation/arm/memory.txt
+++ b/Documentation/arm/memory.txt
@@ -21,6 +21,8 @@ ffff8000 ffffffff copy_user_page / clear_user_page use.
 				For SA11xx and Xscale, this is used to
 				setup a minicache mapping.
 
+ffff4000	ffffffff	cache aliasing on ARMv6 and later CPUs.
+
 ffff1000	ffff7fff	Reserved.
 				Platforms must not use this address range.
 
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index e8ca040ba2c..2d735b0ae38 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -50,7 +50,7 @@ encouraged them to allow separation of the data and integrity metadata
 scatter-gather lists.
 
 The controller will interleave the buffers on write and split them on
-read. This means that the Linux can DMA the data buffers to and from
+read. This means that Linux can DMA the data buffers to and from
 host memory without changes to the page cache.
 
 Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
@@ -66,7 +66,7 @@ software RAID5).
 
 The IP checksum is weaker than the CRC in terms of detecting bit
 errors. However, the strength is really in the separation of the data
-buffers and the integrity metadata. These two distinct buffers much
+buffers and the integrity metadata. These two distinct buffers must
 match up for an I/O to complete.
 
 The separation of the data and integrity metadata buffers as well as
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index f9ca389dddf..1d7e9784439 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -777,6 +777,18 @@ in cpuset directories:
 # /bin/echo 1-4 > cpus		-> set cpus list to cpus 1,2,3,4
 # /bin/echo 1,2,3,4 > cpus	-> set cpus list to cpus 1,2,3,4
 
+To add a CPU to a cpuset, write the new list of CPUs including the
+CPU to be added. To add 6 to the above cpuset:
+
+# /bin/echo 1-4,6 > cpus	-> set cpus list to cpus 1,2,3,4,6
+
+Similarly to remove a CPU from a cpuset, write the new list of CPUs
+without the CPU to be removed.
+
+To remove all the CPUs:
+
+# /bin/echo "" > cpus		-> clear cpus list
+
 2.3 Setting flags
 -----------------
 
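
Reviewer's sketch (not part of the patch): because the cpus file takes the complete new list rather than a delta, a tool that adds or removes one CPU has to read the current list, edit it, and write the whole list back. The helpers below (parse_cpu_list, format_cpu_list, with_cpu_added, with_cpu_removed are hypothetical names) only compute the string to write; they assume the plain "1-4,6" list syntax shown in the hunk above and nothing more.

```python
def parse_cpu_list(text):
    """Turn a list such as "1-4,6" into a sorted list of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means "no CPUs"
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def format_cpu_list(cpus):
    """Turn CPU numbers back into the compact list syntax."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def with_cpu_added(current, cpu):
    """Return the full list to write in order to add one CPU."""
    return format_cpu_list(parse_cpu_list(current) + [cpu])


def with_cpu_removed(current, cpu):
    """Return the full list to write in order to remove one CPU."""
    return format_cpu_list(c for c in parse_cpu_list(current) if c != cpu)
```

The returned string is what would then be echoed into the cpus file; removing the last remaining CPU yields the empty string, which matches the "clear cpus list" case above.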
diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c
index f688eba8770..6a5be5d5c8e 100644
--- a/Documentation/connector/cn_test.c
+++ b/Documentation/connector/cn_test.c
@@ -1,7 +1,7 @@
 /*
  * 	cn_test.c
  *
- * 2004-2005 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * 2004+ Copyright (c) Evgeniy Polyakov <zbr@ioremap.net>
  * All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
@@ -194,5 +194,5 @@ module_init(cn_test_init);
 module_exit(cn_test_fini);
 
 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_AUTHOR("Evgeniy Polyakov <zbr@ioremap.net>");
 MODULE_DESCRIPTION("Connector's test module");
diff --git a/Documentation/connector/ucon.c b/Documentation/connector/ucon.c
index d738cde2a8d..c5092ad0ce4 100644
--- a/Documentation/connector/ucon.c
+++ b/Documentation/connector/ucon.c
@@ -1,7 +1,7 @@
 /*
  * 	ucon.c
  *
- * Copyright (c) 2004+ Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * Copyright (c) 2004+ Evgeniy Polyakov <zbr@ioremap.net>
  *
  *
  * This program is free software; you can redistribute it and/or modify
diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.txt
index 82132169d47..60120fb3b96 100644
--- a/Documentation/driver-model/driver.txt
+++ b/Documentation/driver-model/driver.txt
@@ -207,8 +207,8 @@ Attributes
 ~~~~~~~~~~
 struct driver_attribute {
         struct attribute        attr;
-        ssize_t (*show)(struct device_driver *, char * buf, size_t count, loff_t off);
-        ssize_t (*store)(struct device_driver *, const char * buf, size_t count, loff_t off);
+        ssize_t (*show)(struct device_driver *driver, char *buf);
+        ssize_t (*store)(struct device_driver *, const char * buf, size_t count);
 };
 
 Device drivers can export attributes via their sysfs directories.
diff --git a/Documentation/dvb/get_dvb_firmware b/Documentation/dvb/get_dvb_firmware
index a52adfc9a57..3d1b0ab70c8 100644
--- a/Documentation/dvb/get_dvb_firmware
+++ b/Documentation/dvb/get_dvb_firmware
@@ -25,7 +25,7 @@ use IO::Handle;
 		"tda10046lifeview", "av7110", "dec2000t", "dec2540t",
 		"dec3000s", "vp7041", "dibusb", "nxt2002", "nxt2004",
 		"or51211", "or51132_qam", "or51132_vsb", "bluebird",
-		"opera1", "cx231xx", "cx18", "cx23885", "pvrusb2" );
+		"opera1", "cx231xx", "cx18", "cx23885", "pvrusb2", "mpc718" );
 
 # Check args
 syntax() if (scalar(@ARGV) != 1);
@@ -381,6 +381,57 @@ sub cx18 {
     $allfiles;
 }
 
+sub mpc718 {
+    my $archive = 'Yuan MPC718 TV Tuner Card 2.13.10.1016.zip';
+    my $url = "ftp://ftp.work.acer-euro.com/desktop/aspire_idea510/vista/Drivers/$archive";
+    my $fwfile = "dvb-cx18-mpc718-mt352.fw";
+    my $tmpdir = tempdir(DIR => "/tmp", CLEANUP => 1);
+
+    checkstandard();
+    wgetfile($archive, $url);
+    unzip($archive, $tmpdir);
+
+    my $sourcefile = "$tmpdir/Yuan MPC718 TV Tuner Card 2.13.10.1016/mpc718_32bit/yuanrap.sys";
+    my $found = 0;
+
+    open IN, '<', $sourcefile or die "Couldn't open $sourcefile to extract $fwfile data\n";
+    binmode IN;
+    open OUT, '>', $fwfile;
+    binmode OUT;
+    {
+        # Block scope because we change the line terminator variable $/
+        my $prevlen = 0;
+        my $currlen;
+
+        # Buried in the data segment are 3 runs of almost identical
+        # register-value pairs that end in 0x5d 0x01 which is a "TUNER GO"
+        # command for the MT352.
+        # Pull out the middle run (because it's easy) of register-value
+        # pairs to make the "firmware" file.
+
+        local $/ = "\x5d\x01"; # MT352 "TUNER GO"
+
+        while (<IN>) {
+            $currlen = length($_);
+            if ($prevlen == $currlen && $currlen <= 64) {
+                chop; chop; # Get rid of "TUNER GO"
+                s/^\0\0//;  # get rid of leading 00 00 if it's there
+                printf OUT "$_";
+                $found = 1;
+                last;
+            }
+            $prevlen = $currlen;
+        }
+    }
+    close OUT;
+    close IN;
+    if (!$found) {
+        unlink $fwfile;
+        die "Couldn't find valid register-value sequence in $sourcefile for $fwfile\n";
+    }
+    $fwfile;
+}
+
 sub cx23885 {
     my $url = "http://linuxtv.org/downloads/firmware/";
 
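
Reviewer's sketch (not part of the patch): the record-matching loop in mpc718 above restated in Python for readers who want to try the heuristic on a buffer. It splits the data on the 0x5d 0x01 "TUNER GO" terminator, returns the first record whose length (terminator included) equals that of the preceding record and is at most 64 bytes, then drops the terminator and an optional leading 00 00. The names records and extract_mt352_run are hypothetical, and the sketch returns the bytes instead of writing a firmware file.

```python
TUNER_GO = b"\x5d\x01"  # MT352 "TUNER GO" command, used as record terminator


def records(data, terminator=TUNER_GO):
    """Yield successive records, each including its trailing terminator."""
    start = 0
    while start < len(data):
        end = data.find(terminator, start)
        # The final record may lack a terminator, as with Perl's $/.
        stop = len(data) if end < 0 else end + len(terminator)
        yield data[start:stop]
        start = stop


def extract_mt352_run(data):
    """Return the register-value run selected by the heuristic, or None."""
    prevlen = 0
    for record in records(data):
        if len(record) == prevlen and len(record) <= 64:
            body = record[:-2]  # drop "TUNER GO", like the two chops
            if body.startswith(b"\0\0"):
                body = body[2:]  # drop a leading 00 00 if present
            return body
        prevlen = len(record)
    return None
```

With three equal-length runs in the input, the second one is the first record that matches its predecessor, which is why the Perl comment speaks of pulling out the middle run.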
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index f8cd450be9a..09e031c5588 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -458,3 +458,13 @@ Why: Remove the old legacy 32bit machine check code. This has been
 	but the old version has been kept around for easier testing. Note this
 	doesn't impact the old P5 and WinChip machine check handlers.
 Who:	Andi Kleen <andi@firstfloor.org>
+
+----------------------------
+
+What:	lock_policy_rwsem_* and unlock_policy_rwsem_* will not be
+	exported interface anymore.
+When:	2.6.33
+Why:	cpu_policy_rwsem has a new cleaner definition making it local to
+	cpufreq core and contained inside cpufreq.c. Other dependent
+	drivers should not use it in order to safely avoid lockdep issues.
+Who:	Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index 7e81e37c0b1..b245d524d56 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -23,7 +23,8 @@ interface.
 Using sysfs
 ~~~~~~~~~~~
 
-sysfs is always compiled in. You can access it by doing:
+sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
+it by doing:
 
     mount -t sysfs sysfs /sys
 
diff --git a/Documentation/gcov.txt b/Documentation/gcov.txt
index e716aadb3a3..40ec6335276 100644
--- a/Documentation/gcov.txt
+++ b/Documentation/gcov.txt
@@ -188,13 +188,18 @@ Solution: Exclude affected source files from profiling by specifying
           GCOV_PROFILE := n or GCOV_PROFILE_basename.o := n in the
           corresponding Makefile.
 
+Problem:  Files copied from sysfs appear empty or incomplete.
+Cause:    Due to the way seq_file works, some tools such as cp or tar
+          may not correctly copy files from sysfs.
+Solution: Use 'cat' to read .gcda files and 'cp -d' to copy links.
+          Alternatively use the mechanism shown in Appendix B.
+
 
 Appendix A: gather_on_build.sh
 ==============================
 
 Sample script to gather coverage meta files on the build machine
 (see 6a):
-
 #!/bin/bash
 
 KSRC=$1
@@ -226,7 +231,7 @@ Appendix B: gather_on_test.sh
 Sample script to gather coverage data files on the test machine
 (see 6b):
 
-#!/bin/bash
+#!/bin/bash -e
 
 DEST=$1
 GCDA=/sys/kernel/debug/gcov
@@ -236,11 +241,13 @@ if [ -z "$DEST" ] ; then
   exit 1
 fi
 
-find $GCDA -name '*.gcno' -o -name '*.gcda' | tar cfz $DEST -T -
+TEMPDIR=$(mktemp -d)
+echo Collecting data..
+find $GCDA -type d -exec mkdir -p $TEMPDIR/\{\} \;
+find $GCDA -name '*.gcda' -exec sh -c 'cat < $0 > '$TEMPDIR'/$0' {} \;
+find $GCDA -name '*.gcno' -exec sh -c 'cp -d $0 '$TEMPDIR'/$0' {} \;
+tar czf $DEST -C $TEMPDIR sys
+rm -rf $TEMPDIR
 
-if [ $? -eq 0 ] ; then
-  echo "$DEST successfully created, copy to build system and unpack with:"
-  echo "  tar xfz $DEST"
-else
-  echo "Could not create file $DEST"
-fi
+echo "$DEST successfully created, copy to build system and unpack with:"
+echo "  tar xfz $DEST"
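
Reviewer's sketch (not part of the patch): the same collection idea as the revised gather_on_test.sh, written as a Python function. It reads each .gcda file in full (the equivalent of 'cat') and recreates .gcno symbolic links without following them (the equivalent of 'cp -d'). The function name gather_gcov is hypothetical, and unlike the shell script it mirrors paths relative to the source directory and leaves creating the tar archive to the caller.

```python
import os


def gather_gcov(src, dest):
    """Mirror src under dest and return the relative paths that were copied.

    dest is assumed to be a fresh directory outside src.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dest, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            source = os.path.join(dirpath, name)
            target = os.path.join(target_dir, name)
            if name.endswith(".gcno") and os.path.islink(source):
                # Recreate the link itself, like 'cp -d'.
                os.symlink(os.readlink(source), target)
            elif name.endswith((".gcda", ".gcno")):
                # Read the whole file and write it out, like 'cat'.
                with open(source, "rb") as fin, open(target, "wb") as fout:
                    fout.write(fin.read())
            else:
                continue  # other entries are not collected
            copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)
```

On a test machine the source directory would be /sys/kernel/debug/gcov, as in the script above; the destination can then be archived with tar and unpacked on the build system.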
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index d08759aa090..dd1a6d4bb74 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1720,8 +1720,8 @@ and is between 256 and 4096 characters. It is defined in the file
 	oprofile.cpu_type=	Force an oprofile cpu type
 			This might be useful if you have an older oprofile
 			userland or if you want common events.
-			Format: { archperfmon }
-			archperfmon: [X86] Force use of architectural
+			Format: { arch_perfmon }
+			arch_perfmon: [X86] Force use of architectural
 				perfmon on Intel CPUs instead of the
 				CPU specific event set.
 
@@ -1915,6 +1915,12 @@ and is between 256 and 4096 characters. It is defined in the file
 			Format: { 0 | 1 }
 			See arch/parisc/kernel/pdc_chassis.c
 
+	percpu_alloc=	[X86] Select which percpu first chunk allocator to use.
+			Allowed values are one of "lpage", "embed" and "4k".
+			See comments in arch/x86/kernel/setup_percpu.c for
+			details on each allocator. This parameter is primarily
+			for debugging and performance comparison.
+
 	pf.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
@@ -2467,7 +2473,8 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	tp720=		[HW,PS2]
 
-	trace_buf_size=nn[KMG] [ftrace] will set tracing buffer size.
+	trace_buf_size=nn[KMG]
+			[FTRACE] will set tracing buffer size.
 
 	trix=		[HW,OSS] MediaTrix AudioTrix Pro
 			Format:
diff --git a/Documentation/kmemleak.txt b/Documentation/kmemleak.txt
index 0112da3b9ab..89068030b01 100644
--- a/Documentation/kmemleak.txt
+++ b/Documentation/kmemleak.txt
@@ -16,13 +16,17 @@ Usage
 -----
 
 CONFIG_DEBUG_KMEMLEAK in "Kernel hacking" has to be enabled. A kernel
-thread scans the memory every 10 minutes (by default) and prints any new
-unreferenced objects found. To trigger an intermediate scan and display
-all the possible memory leaks:
+thread scans the memory every 10 minutes (by default) and prints the
+number of new unreferenced objects found. To display the details of all
+the possible memory leaks:
 
   # mount -t debugfs nodev /sys/kernel/debug/
   # cat /sys/kernel/debug/kmemleak
 
+To trigger an intermediate memory scan:
+
+  # echo scan > /sys/kernel/debug/kmemleak
+
 Note that the orphan objects are listed in the order they were allocated
 and one object at the beginning of the list may cause other subsequent
 objects to be reported as orphan.
@@ -31,16 +35,21 @@ Memory scanning parameters can be modified at run-time by writing to the
 /sys/kernel/debug/kmemleak file. The following parameters are supported:
 
   off		- disable kmemleak (irreversible)
-  stack=on	- enable the task stacks scanning
+  stack=on	- enable the task stacks scanning (default)
   stack=off	- disable the tasks stacks scanning
-  scan=on	- start the automatic memory scanning thread
+  scan=on	- start the automatic memory scanning thread (default)
   scan=off	- stop the automatic memory scanning thread
-  scan=<secs>	- set the automatic memory scanning period in seconds (0
-		  to disable it)
+  scan=<secs>	- set the automatic memory scanning period in seconds
+		  (default 600, 0 to stop the automatic scanning)
+  scan		- trigger a memory scan
 
 Kmemleak can also be disabled at boot-time by passing "kmemleak=off" on
 the kernel command line.
 
+Memory may be allocated or freed before kmemleak is initialised and
+these actions are stored in an early log buffer. The size of this buffer
+is configured via the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option.
+
 Basic Algorithm
 ---------------
 
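
Reviewer's sketch (not part of the patch): a small wrapper a test harness might use to send the run-time parameters listed above to the kmemleak debugfs file. It accepts only the commands named in this hunk (off, stack=on, stack=off, scan, scan=on, scan=off, scan=<secs>) and rejects anything else before touching the file. The names is_valid_command and send_command are hypothetical; writing the file needs debugfs mounted and sufficient privileges.

```python
import re

KMEMLEAK_FILE = "/sys/kernel/debug/kmemleak"

# Commands documented in the hunk above; nothing else is accepted.
_VALID = re.compile(r"off|stack=(?:on|off)|scan(?:=(?:on|off|\d+))?")


def is_valid_command(command):
    """Return True if command is one of the documented parameters."""
    return _VALID.fullmatch(command) is not None


def send_command(command, path=KMEMLEAK_FILE):
    """Write one command to the kmemleak control file."""
    if not is_valid_command(command):
        raise ValueError(f"unsupported kmemleak command: {command!r}")
    with open(path, "w") as control:
        control.write(command)
```

For example, send_command("scan") corresponds to the echo shown in the documentation, and send_command("scan=600") would set the documented default period explicitly.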
diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt
index f2296ecedb8..e2ddcdeb61b 100644
--- a/Documentation/laptops/thinkpad-acpi.txt
+++ b/Documentation/laptops/thinkpad-acpi.txt
@@ -36,8 +36,6 @@ detailed description):
 	- Bluetooth enable and disable
 	- video output switching, expansion control
 	- ThinkLight on and off
-	- limited docking and undocking
-	- UltraBay eject
 	- CMOS/UCMS control
 	- LED control
 	- ACPI sounds
@@ -729,131 +727,6 @@ cannot be read or if it is unknown, thinkpad-acpi will report it as "off".
 It is impossible to know if the status returned through sysfs is valid.
 
 
-Docking / undocking -- /proc/acpi/ibm/dock
-------------------------------------------
-
-Docking and undocking (e.g. with the X4 UltraBase) requires some
-actions to be taken by the operating system to safely make or break
-the electrical connections with the dock.
-
-The docking feature of this driver generates the following ACPI events:
-
-	ibm/dock GDCK 00000003 00000001 -- eject request
-	ibm/dock GDCK 00000003 00000002 -- undocked
-	ibm/dock GDCK 00000000 00000003 -- docked
-
-NOTE: These events will only be generated if the laptop was docked
-when originally booted. This is due to the current lack of support for
-hot plugging of devices in the Linux ACPI framework. If the laptop was
-booted while not in the dock, the following message is shown in the
-logs:
-
-	Mar 17 01:42:34 aero kernel: thinkpad_acpi: dock device not present
-
-In this case, no dock-related events are generated but the dock and
-undock commands described below still work. They can be executed
-manually or triggered by Fn key combinations (see the example acpid
-configuration files included in the driver tarball package available
-on the web site).
-
-When the eject request button on the dock is pressed, the first event
-above is generated. The handler for this event should issue the
-following command:
-
-	echo undock > /proc/acpi/ibm/dock
-
-After the LED on the dock goes off, it is safe to eject the laptop.
-Note: if you pressed this key by mistake, go ahead and eject the
-laptop, then dock it back in. Otherwise, the dock may not function as
-expected.
-
-When the laptop is docked, the third event above is generated. The
-handler for this event should issue the following command to fully
-enable the dock:
-
-	echo dock > /proc/acpi/ibm/dock
-
-The contents of the /proc/acpi/ibm/dock file shows the current status
-of the dock, as provided by the ACPI framework.
-
-The docking support in this driver does not take care of enabling or
-disabling any other devices you may have attached to the dock. For
-example, a CD drive plugged into the UltraBase needs to be disabled or
-enabled separately. See the provided example acpid configuration files
-for how this can be accomplished.
-
-There is no support yet for PCI devices that may be attached to a
-docking station, e.g. in the ThinkPad Dock II. The driver currently
-does not recognize, enable or disable such devices. This means that
-the only docking stations currently supported are the X-series
-UltraBase docks and "dumb" port replicators like the Mini Dock (the
-latter don't need any ACPI support, actually).
-
-
-UltraBay eject -- /proc/acpi/ibm/bay
-------------------------------------
-
-Inserting or ejecting an UltraBay device requires some actions to be
-taken by the operating system to safely make or break the electrical
-connections with the device.
-
-This feature generates the following ACPI events:
-
-	ibm/bay MSTR 00000003 00000000 -- eject request
-	ibm/bay MSTR 00000001 00000000 -- eject lever inserted
-
-NOTE: These events will only be generated if the UltraBay was present
-when the laptop was originally booted (on the X series, the UltraBay
-is in the dock, so it may not be present if the laptop was undocked).
-This is due to the current lack of support for hot plugging of devices
-in the Linux ACPI framework. If the laptop was booted without the
-UltraBay, the following message is shown in the logs:
-
-	Mar 17 01:42:34 aero kernel: thinkpad_acpi: bay device not present
-
-In this case, no bay-related events are generated but the eject
-command described below still works. It can be executed manually or
-triggered by a hot key combination.
-
-Sliding the eject lever generates the first event shown above. The
-handler for this event should take whatever actions are necessary to
-shut down the device in the UltraBay (e.g. call idectl), then issue
-the following command:
-
-	echo eject > /proc/acpi/ibm/bay
-
-After the LED on the UltraBay goes off, it is safe to pull out the
-device.
-
-When the eject lever is inserted, the second event above is
-generated. The handler for this event should take whatever actions are
-necessary to enable the UltraBay device (e.g. call idectl).
-
-The contents of the /proc/acpi/ibm/bay file shows the current status
-of the UltraBay, as provided by the ACPI framework.
-
-EXPERIMENTAL warm eject support on the 600e/x, A22p and A3x (To use
-this feature, you need to supply the experimental=1 parameter when
-loading the module):
-
-These models do not have a button near the UltraBay device to request
-a hot eject but rather require the laptop to be put to sleep
-(suspend-to-ram) before the bay device is ejected or inserted).
-The sequence of steps to eject the device is as follows:
-
-	echo eject > /proc/acpi/ibm/bay
-	put the ThinkPad to sleep
-	remove the drive
-	resume from sleep
-	cat /proc/acpi/ibm/bay should show that the drive was removed
-
-On the A3x, both the UltraBay 2000 and UltraBay Plus devices are
-supported. Use "eject2" instead of "eject" for the second bay.
-
-Note: the UltraBay eject support on the 600e/x, A22p and A3x is
-EXPERIMENTAL and may not work as expected. USE WITH CAUTION!
-
-
 CMOS/UCMS control
 -----------------
 
diff --git a/Documentation/leds-lp3944.txt b/Documentation/leds-lp3944.txt
new file mode 100644
index 00000000000..c6eda18b15e
--- /dev/null
+++ b/Documentation/leds-lp3944.txt
@@ -0,0 +1,50 @@
1Kernel driver lp3944
2====================
3
4 * National Semiconductor LP3944 Fun-light Chip
5 Prefix: 'lp3944'
6 Addresses scanned: None (see the Notes section below)
7 Datasheet: Publicly available at the National Semiconductor website
8 http://www.national.com/pf/LP/LP3944.html
9
10Authors:
11 Antonio Ospite <ospite@studenti.unina.it>
12
13
14Description
15-----------
16The LP3944 is a helper chip that can drive up to 8 leds, with two programmable
17DIM modes; it could even be used as a gpio expander but this driver assumes it
18is used as a led controller.
19
20The DIM modes are used to set _blink_ patterns for leds; the pattern is
21specified by supplying two parameters:
22 - period: from 0s to 1.6s
23 - duty cycle: percentage of the period the led is on, from 0 to 100
24
25Setting a led in DIM0 or DIM1 mode makes it blink according to the pattern.
26See the datasheet for details.
27
28The LP3944 can be found on the Motorola A910 smartphone, where it drives the
29RGB leds, the camera flash light and the LCD power.
30
31
32Notes
33-----
34The chip is used mainly in embedded contexts, so this driver expects it to be
35registered using the i2c_board_info mechanism.
36
37To register the chip at address 0x60 on adapter 0, set the platform data
38according to include/linux/leds-lp3944.h, then set the i2c board info:
39
40 static struct i2c_board_info __initdata a910_i2c_board_info[] = {
41 {
42 I2C_BOARD_INFO("lp3944", 0x60),
43 .platform_data = &a910_lp3944_leds,
44 },
45 };
46
47and register it in the platform init function:
48
49 i2c_register_board_info(0, a910_i2c_board_info,
50 ARRAY_SIZE(a910_i2c_board_info));
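As a rough illustration of the blink parameters listed in the Description section, the sketch below range-checks a pattern and derives how long the led is on in each period. The `blink_pattern` structure and both helper names are hypothetical; they are not part of the lp3944 driver or of include/linux/leds-lp3944.h. Only the limits (period up to 1.6s, duty cycle from 0 to 100 percent) are taken from the text.

```c
#include <stdbool.h>

/* Limits taken from the description: period 0s..1.6s, duty cycle 0..100%. */
#define PATTERN_PERIOD_MAX_MS 1600u
#define PATTERN_DUTY_MAX_PCT  100u

/* Hypothetical container for one DIM pattern (not a driver structure). */
struct blink_pattern {
	unsigned int period_ms;
	unsigned int duty_pct;
};

/* True when both parameters lie inside the documented ranges. */
static bool blink_pattern_valid(const struct blink_pattern *p)
{
	return p->period_ms <= PATTERN_PERIOD_MAX_MS &&
	       p->duty_pct <= PATTERN_DUTY_MAX_PCT;
}

/* Milliseconds per period during which the led is on. */
static unsigned int blink_on_time_ms(const struct blink_pattern *p)
{
	return p->period_ms * p->duty_pct / 100u;
}
```

A 1000 ms period with a 25 percent duty cycle would give 250 ms on and 750 ms off; how those values map onto chip registers is left to the datasheet.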
diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index 9ebcd6ef361..950cde6d6e5 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1,7 +1,9 @@
1/*P:100 This is the Launcher code, a simple program which lays out the 1/*P:100
2 * "physical" memory for the new Guest by mapping the kernel image and 2 * This is the Launcher code, a simple program which lays out the "physical"
3 * the virtual devices, then opens /dev/lguest to tell the kernel 3 * memory for the new Guest by mapping the kernel image and the virtual
4 * about the Guest and control it. :*/ 4 * devices, then opens /dev/lguest to tell the kernel about the Guest and
5 * control it.
6:*/
5#define _LARGEFILE64_SOURCE 7#define _LARGEFILE64_SOURCE
6#define _GNU_SOURCE 8#define _GNU_SOURCE
7#include <stdio.h> 9#include <stdio.h>
@@ -46,13 +48,15 @@
46#include "linux/virtio_rng.h" 48#include "linux/virtio_rng.h"
47#include "linux/virtio_ring.h" 49#include "linux/virtio_ring.h"
48#include "asm/bootparam.h" 50#include "asm/bootparam.h"
49/*L:110 We can ignore the 39 include files we need for this program, but I do 51/*L:110
50 * want to draw attention to the use of kernel-style types. 52 * We can ignore the 42 include files we need for this program, but I do want
53 * to draw attention to the use of kernel-style types.
51 * 54 *
52 * As Linus said, "C is a Spartan language, and so should your naming be." I 55 * As Linus said, "C is a Spartan language, and so should your naming be." I
53 * like these abbreviations, so we define them here. Note that u64 is always 56 * like these abbreviations, so we define them here. Note that u64 is always
54 * unsigned long long, which works on all Linux systems: this means that we can 57 * unsigned long long, which works on all Linux systems: this means that we can
55 * use %llu in printf for any u64. */ 58 * use %llu in printf for any u64.
59 */
56typedef unsigned long long u64; 60typedef unsigned long long u64;
57typedef uint32_t u32; 61typedef uint32_t u32;
58typedef uint16_t u16; 62typedef uint16_t u16;
@@ -69,8 +73,10 @@ typedef uint8_t u8;
69/* This will occupy 3 pages: it must be a power of 2. */ 73/* This will occupy 3 pages: it must be a power of 2. */
70#define VIRTQUEUE_NUM 256 74#define VIRTQUEUE_NUM 256
71 75
72/*L:120 verbose is both a global flag and a macro. The C preprocessor allows 76/*L:120
73 * this, and although I wouldn't recommend it, it works quite nicely here. */ 77 * verbose is both a global flag and a macro. The C preprocessor allows
78 * this, and although I wouldn't recommend it, it works quite nicely here.
79 */
74static bool verbose; 80static bool verbose;
75#define verbose(args...) \ 81#define verbose(args...) \
76 do { if (verbose) printf(args); } while(0) 82 do { if (verbose) printf(args); } while(0)
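The flag-and-macro trick in the L:120 comment can be tried in isolation. This is a minimal sketch, not the Launcher's code: it uses the standard `__VA_ARGS__` spelling instead of the GNU `args...` form, and the `printed` counter and `demo_verbose()` are added only so the effect can be observed.

```c
#include <stdbool.h>
#include <stdio.h>

/* The flag: a plain object named "verbose". */
static bool verbose;

/* Counts messages actually printed; only here to make the demo observable. */
static int printed;

/*
 * The macro: a function-like macro with the same name.  The preprocessor
 * only expands it where "verbose" is followed by "(", so the bare name in
 * the condition below still refers to the flag.
 */
#define verbose(...) \
	do { if (verbose) { printed++; printf(__VA_ARGS__); } } while (0)

static int demo_verbose(void)
{
	verbose = false;		/* object: no "(" follows the name */
	verbose("not shown\n");		/* macro: flag is off, prints nothing */
	verbose = true;
	verbose("shown: %d\n", 42);	/* macro: flag is on, prints */
	return printed;
}
```

Only the second call gets through, so `demo_verbose()` reports one printed message.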
@@ -87,8 +93,7 @@ static int lguest_fd;
87static unsigned int __thread cpu_id; 93static unsigned int __thread cpu_id;
88 94
89/* This is our list of devices. */ 95/* This is our list of devices. */
90struct device_list 96struct device_list {
91{
92 /* Counter to assign interrupt numbers. */ 97 /* Counter to assign interrupt numbers. */
93 unsigned int next_irq; 98 unsigned int next_irq;
94 99
@@ -100,8 +105,7 @@ struct device_list
100 105
101 /* A single linked list of devices. */ 106 /* A single linked list of devices. */
102 struct device *dev; 107 struct device *dev;
103 /* And a pointer to the last device for easy append and also for 108 /* And a pointer to the last device for easy append. */
104 * configuration appending. */
105 struct device *lastdev; 109 struct device *lastdev;
106}; 110};
107 111
@@ -109,8 +113,7 @@ struct device_list
109static struct device_list devices; 113static struct device_list devices;
110 114
111/* The device structure describes a single device. */ 115/* The device structure describes a single device. */
112struct device 116struct device {
113{
114 /* The linked-list pointer. */ 117 /* The linked-list pointer. */
115 struct device *next; 118 struct device *next;
116 119
@@ -135,8 +138,7 @@ struct device
135}; 138};
136 139
137/* The virtqueue structure describes a queue attached to a device. */ 140/* The virtqueue structure describes a queue attached to a device. */
138struct virtqueue 141struct virtqueue {
139{
140 struct virtqueue *next; 142 struct virtqueue *next;
141 143
142 /* Which device owns me. */ 144 /* Which device owns me. */
@@ -168,20 +170,24 @@ static char **main_args;
168/* The original tty settings to restore on exit. */ 170/* The original tty settings to restore on exit. */
169static struct termios orig_term; 171static struct termios orig_term;
170 172
171/* We have to be careful with barriers: our devices are all run in separate 173/*
174 * We have to be careful with barriers: our devices are all run in separate
172 * threads and so we need to make sure that changes visible to the Guest happen 175 * threads and so we need to make sure that changes visible to the Guest happen
173 * in precise order. */ 176 * in precise order.
177 */
174#define wmb() __asm__ __volatile__("" : : : "memory") 178#define wmb() __asm__ __volatile__("" : : : "memory")
175#define mb() __asm__ __volatile__("" : : : "memory") 179#define mb() __asm__ __volatile__("" : : : "memory")
176 180
177/* Convert an iovec element to the given type. 181/*
182 * Convert an iovec element to the given type.
178 * 183 *
179 * This is a fairly ugly trick: we need to know the size of the type and 184 * This is a fairly ugly trick: we need to know the size of the type and
180 * alignment requirement to check the pointer is kosher. It's also nice to 185 * alignment requirement to check the pointer is kosher. It's also nice to
181 * have the name of the type in case we report failure. 186 * have the name of the type in case we report failure.
182 * 187 *
183 * Typing those three things all the time is cumbersome and error prone, so we 188 * Typing those three things all the time is cumbersome and error prone, so we
184 * have a macro which sets them all up and passes to the real function. */ 189 * have a macro which sets them all up and passes to the real function.
190 */
185#define convert(iov, type) \ 191#define convert(iov, type) \
186 ((type *)_convert((iov), sizeof(type), __alignof__(type), #type)) 192 ((type *)_convert((iov), sizeof(type), __alignof__(type), #type))
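The pattern behind convert() is a macro that derives the size, the alignment and the printable name from a single type argument, then hands all three to an ordinary function. The sketch below shows that pattern under some assumptions: `struct chunk` stands in for `struct iovec`, the helper returns NULL instead of exiting, and C11 `_Alignof` replaces the GNU `__alignof__`. None of the names are from lguest.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct iovec: a pointer plus a length. */
struct chunk {
	void *base;
	size_t len;
};

/* Name of the type involved in the most recent failed conversion. */
static const char *failed_type;

/* The real worker: checks size and alignment, and knows the type's name. */
static void *checked_cast_impl(struct chunk *c, size_t size, size_t align,
			       const char *name)
{
	if (c->len != size || (uintptr_t)c->base % align != 0) {
		failed_type = name;
		return NULL;
	}
	return c->base;
}

/* One macro argument supplies the size, the alignment and the name. */
#define checked_cast(c, type) \
	((type *)checked_cast_impl((c), sizeof(type), _Alignof(type), #type))
```

Because `#type` stringifies the argument as written, a failure on `checked_cast(&c, uint32_t)` records the text "uint32_t" without the caller typing it a second time.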
187 193
@@ -198,8 +204,10 @@ static void *_convert(struct iovec *iov, size_t size, size_t align,
198/* Wrapper for the last available index. Makes it easier to change. */ 204/* Wrapper for the last available index. Makes it easier to change. */
199#define lg_last_avail(vq) ((vq)->last_avail_idx) 205#define lg_last_avail(vq) ((vq)->last_avail_idx)
200 206
201/* The virtio configuration space is defined to be little-endian. x86 is 207/*
202 * little-endian too, but it's nice to be explicit so we have these helpers. */ 208 * The virtio configuration space is defined to be little-endian. x86 is
209 * little-endian too, but it's nice to be explicit so we have these helpers.
210 */
203#define cpu_to_le16(v16) (v16) 211#define cpu_to_le16(v16) (v16)
204#define cpu_to_le32(v32) (v32) 212#define cpu_to_le32(v32) (v32)
205#define cpu_to_le64(v64) (v64) 213#define cpu_to_le64(v64) (v64)
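The helpers above are no-ops because the Launcher only runs on little-endian x86. For comparison, here is a sketch of what an explicit, host-independent little-endian encoding could look like. `put_le32()` and `get_le32()` are illustrative names, not functions from lguest.c or the kernel.

```c
#include <stdint.h>

/* Store v as four bytes, least significant byte first. */
static void put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Rebuild the value from four little-endian bytes. */
static uint32_t get_le32(const uint8_t in[4])
{
	return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}
```

Shifts operate on values rather than on memory layout, so this gives the same bytes on any host.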
@@ -241,11 +249,12 @@ static u8 *get_feature_bits(struct device *dev)
241 + dev->num_vq * sizeof(struct lguest_vqconfig); 249 + dev->num_vq * sizeof(struct lguest_vqconfig);
242} 250}
243 251
244/*L:100 The Launcher code itself takes us out into userspace, that scary place 252/*L:100
245 * where pointers run wild and free! Unfortunately, like most userspace 253 * The Launcher code itself takes us out into userspace, that scary place where
246 * programs, it's quite boring (which is why everyone likes to hack on the 254 * pointers run wild and free! Unfortunately, like most userspace programs,
247 * kernel!). Perhaps if you make up an Lguest Drinking Game at this point, it 255 * it's quite boring (which is why everyone likes to hack on the kernel!).
248 * will get you through this section. Or, maybe not. 256 * Perhaps if you make up an Lguest Drinking Game at this point, it will get
257 * you through this section. Or, maybe not.
249 * 258 *
250 * The Launcher sets up a big chunk of memory to be the Guest's "physical" 259 * The Launcher sets up a big chunk of memory to be the Guest's "physical"
251 * memory and stores it in "guest_base". In other words, Guest physical == 260 * memory and stores it in "guest_base". In other words, Guest physical ==
@@ -253,7 +262,8 @@ static u8 *get_feature_bits(struct device *dev)
253 * 262 *
254 * This can be tough to get your head around, but usually it just means that we 263 * This can be tough to get your head around, but usually it just means that we
255 * use these trivial conversion functions when the Guest gives us its 264 * use these trivial conversion functions when the Guest gives us its
256 * "physical" addresses: */ 265 * "physical" addresses:
266 */
257static void *from_guest_phys(unsigned long addr) 267static void *from_guest_phys(unsigned long addr)
258{ 268{
259 return guest_base + addr; 269 return guest_base + addr;
@@ -268,7 +278,8 @@ static unsigned long to_guest_phys(const void *addr)
268 * Loading the Kernel. 278 * Loading the Kernel.
269 * 279 *
270 * We start with a couple of simple helper routines. open_or_die() avoids 280 * We start with a couple of simple helper routines. open_or_die() avoids
271 * error-checking code cluttering the callers: */ 281 * error-checking code cluttering the callers:
282 */
272static int open_or_die(const char *name, int flags) 283static int open_or_die(const char *name, int flags)
273{ 284{
274 int fd = open(name, flags); 285 int fd = open(name, flags);
@@ -283,12 +294,19 @@ static void *map_zeroed_pages(unsigned int num)
283 int fd = open_or_die("/dev/zero", O_RDONLY); 294 int fd = open_or_die("/dev/zero", O_RDONLY);
284 void *addr; 295 void *addr;
285 296
286 /* We use a private mapping (ie. if we write to the page, it will be 297 /*
287 * copied). */ 298 * We use a private mapping (ie. if we write to the page, it will be
299 * copied).
300 */
288 addr = mmap(NULL, getpagesize() * num, 301 addr = mmap(NULL, getpagesize() * num,
289 PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, fd, 0); 302 PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, fd, 0);
290 if (addr == MAP_FAILED) 303 if (addr == MAP_FAILED)
291 err(1, "Mmaping %u pages of /dev/zero", num); 304 err(1, "Mmaping %u pages of /dev/zero", num);
305
306 /*
307 * One neat mmap feature is that you can close the fd, and it
308 * stays mapped.
309 */
292 close(fd); 310 close(fd);
293 311
294 return addr; 312 return addr;
@@ -305,20 +323,24 @@ static void *get_pages(unsigned int num)
305 return addr; 323 return addr;
306} 324}
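The two mmap properties that the comments in map_zeroed_pages() point out (a private mapping gets its own copy on write, and the mapping survives close()) can be checked with a small stand-alone sketch. It assumes a Linux-like system with /dev/zero. `zeroed_pages()` is an illustrative name; unlike the Launcher it omits PROT_EXEC and returns NULL rather than exiting on failure.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns num zero-filled, writable pages, or NULL on failure. */
static void *zeroed_pages(size_t num)
{
	long page = sysconf(_SC_PAGESIZE);
	void *addr;
	int fd;

	if (page <= 0)
		return NULL;
	fd = open("/dev/zero", O_RDONLY);
	if (fd < 0)
		return NULL;

	/* MAP_PRIVATE: writes go to our own copy of the page. */
	addr = mmap(NULL, (size_t)page * num, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE, fd, 0);

	/* The mapping does not depend on the descriptor staying open. */
	close(fd);

	return addr == MAP_FAILED ? NULL : addr;
}
```

The returned memory reads as zero and can be written even though the descriptor was opened read-only and is already closed.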
307 325
308/* This routine is used to load the kernel or initrd. It tries mmap, but if 326/*
327 * This routine is used to load the kernel or initrd. It tries mmap, but if
309 * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries), 328 * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries),
310 * it falls back to reading the memory in. */ 329 * it falls back to reading the memory in.
330 */
311static void map_at(int fd, void *addr, unsigned long offset, unsigned long len) 331static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
312{ 332{
313 ssize_t r; 333 ssize_t r;
314 334
315 /* We map writable even though some segments are marked read-only. 335 /*
336 * We map writable even though some segments are marked read-only.
316 * The kernel really wants to be writable: it patches its own 337 * The kernel really wants to be writable: it patches its own
317 * instructions. 338 * instructions.
318 * 339 *
319 * MAP_PRIVATE means that the page won't be copied until a write is 340 * MAP_PRIVATE means that the page won't be copied until a write is
320 * done to it. This allows us to share untouched memory between 341 * done to it. This allows us to share untouched memory between
321 * Guests. */ 342 * Guests.
343 */
322 if (mmap(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC, 344 if (mmap(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC,
323 MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED) 345 MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED)
324 return; 346 return;
@@ -329,7 +351,8 @@ static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
329 err(1, "Reading offset %lu len %lu gave %zi", offset, len, r); 351 err(1, "Reading offset %lu len %lu gave %zi", offset, len, r);
330} 352}
331 353
332/* This routine takes an open vmlinux image, which is in ELF, and maps it into 354/*
355 * This routine takes an open vmlinux image, which is in ELF, and maps it into
333 * the Guest memory. ELF = Executable and Linkable Format, which is the format used 356 * the Guest memory. ELF = Executable and Linkable Format, which is the format used
334 * by all modern binaries on Linux including the kernel. 357 * by all modern binaries on Linux including the kernel.
335 * 358 *
@@ -337,23 +360,28 @@ static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
337 * address. We use the physical address; the Guest will map itself to the 360 * address. We use the physical address; the Guest will map itself to the
338 * virtual address. 361 * virtual address.
339 * 362 *
340 * We return the starting address. */ 363 * We return the starting address.
364 */
341static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr) 365static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
342{ 366{
343 Elf32_Phdr phdr[ehdr->e_phnum]; 367 Elf32_Phdr phdr[ehdr->e_phnum];
344 unsigned int i; 368 unsigned int i;
345 369
346 /* Sanity checks on the main ELF header: an x86 executable with a 370 /*
347 * reasonable number of correctly-sized program headers. */ 371 * Sanity checks on the main ELF header: an x86 executable with a
372 * reasonable number of correctly-sized program headers.
373 */
348 if (ehdr->e_type != ET_EXEC 374 if (ehdr->e_type != ET_EXEC
349 || ehdr->e_machine != EM_386 375 || ehdr->e_machine != EM_386
350 || ehdr->e_phentsize != sizeof(Elf32_Phdr) 376 || ehdr->e_phentsize != sizeof(Elf32_Phdr)
351 || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr)) 377 || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr))
352 errx(1, "Malformed elf header"); 378 errx(1, "Malformed elf header");
353 379
354 /* An ELF executable contains an ELF header and a number of "program" 380 /*
381 * An ELF executable contains an ELF header and a number of "program"
355 * headers which indicate which parts ("segments") of the program to 382 * headers which indicate which parts ("segments") of the program to
356 * load where. */ 383 * load where.
384 */
357 385
358 /* We read in all the program headers at once: */ 386 /* We read in all the program headers at once: */
359 if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0) 387 if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
@@ -361,8 +389,10 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
361 if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr)) 389 if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
362 err(1, "Reading program headers"); 390 err(1, "Reading program headers");
363 391
364 /* Try all the headers: there are usually only three. A read-only one, 392 /*
365 * a read-write one, and a "note" section which we don't load. */ 393 * Try all the headers: there are usually only three. A read-only one,
394 * a read-write one, and a "note" section which we don't load.
395 */
366 for (i = 0; i < ehdr->e_phnum; i++) { 396 for (i = 0; i < ehdr->e_phnum; i++) {
367 /* If this isn't a loadable segment, we ignore it */ 397 /* If this isn't a loadable segment, we ignore it */
368 if (phdr[i].p_type != PT_LOAD) 398 if (phdr[i].p_type != PT_LOAD)
@@ -380,13 +410,15 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr)
380 return ehdr->e_entry; 410 return ehdr->e_entry;
381} 411}
382 412
383/*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're 413/*L:150
384 * supposed to jump into it and it will unpack itself. We used to have to 414 * A bzImage, unlike an ELF file, is not meant to be loaded. You're supposed
385 * perform some hairy magic because the unpacking code scared me. 415 * to jump into it and it will unpack itself. We used to have to perform some
416 * hairy magic because the unpacking code scared me.
386 * 417 *
387 * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote 418 * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote
388 * a small patch to jump over the tricky bits in the Guest, so now we just read 419 * a small patch to jump over the tricky bits in the Guest, so now we just read
389 * the funky header so we know where in the file to load, and away we go! */ 420 * the funky header so we know where in the file to load, and away we go!
421 */
390static unsigned long load_bzimage(int fd) 422static unsigned long load_bzimage(int fd)
391{ 423{
392 struct boot_params boot; 424 struct boot_params boot;
@@ -394,8 +426,10 @@ static unsigned long load_bzimage(int fd)
394 /* Modern bzImages get loaded at 1M. */ 426 /* Modern bzImages get loaded at 1M. */
395 void *p = from_guest_phys(0x100000); 427 void *p = from_guest_phys(0x100000);
396 428
397 /* Go back to the start of the file and read the header. It should be 429 /*
398 * a Linux boot header (see Documentation/x86/i386/boot.txt) */ 430 * Go back to the start of the file and read the header. It should be
431 * a Linux boot header (see Documentation/x86/i386/boot.txt)
432 */
399 lseek(fd, 0, SEEK_SET); 433 lseek(fd, 0, SEEK_SET);
400 read(fd, &boot, sizeof(boot)); 434 read(fd, &boot, sizeof(boot));
401 435
@@ -414,9 +448,11 @@ static unsigned long load_bzimage(int fd)
414 return boot.hdr.code32_start; 448 return boot.hdr.code32_start;
415} 449}
416 450
417/*L:140 Loading the kernel is easy when it's a "vmlinux", but most kernels 451/*L:140
452 * Loading the kernel is easy when it's a "vmlinux", but most kernels
418 * come wrapped up in the self-decompressing "bzImage" format. With a little 453 * come wrapped up in the self-decompressing "bzImage" format. With a little
419 * work, we can load those, too. */ 454 * work, we can load those, too.
455 */
420static unsigned long load_kernel(int fd) 456static unsigned long load_kernel(int fd)
421{ 457{
422 Elf32_Ehdr hdr; 458 Elf32_Ehdr hdr;
@@ -433,24 +469,28 @@ static unsigned long load_kernel(int fd)
433 return load_bzimage(fd); 469 return load_bzimage(fd);
434} 470}
435 471
436/* This is a trivial little helper to align pages. Andi Kleen hated it because 472/*
473 * This is a trivial little helper to align pages. Andi Kleen hated it because
437 * it calls getpagesize() twice: "it's dumb code." 474 * it calls getpagesize() twice: "it's dumb code."
438 * 475 *
439 * Kernel guys get really het up about optimization, even when it's not 476 * Kernel guys get really het up about optimization, even when it's not
440 * necessary. I leave this code as a reaction against that. */ 477 * necessary. I leave this code as a reaction against that.
478 */
441static inline unsigned long page_align(unsigned long addr) 479static inline unsigned long page_align(unsigned long addr)
442{ 480{
443 /* Add upwards and truncate downwards. */ 481 /* Add upwards and truncate downwards. */
444 return ((addr + getpagesize()-1) & ~(getpagesize()-1)); 482 return ((addr + getpagesize()-1) & ~(getpagesize()-1));
445} 483}
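"Add upwards and truncate downwards" is easier to follow with concrete numbers. The sketch below takes the page size as a parameter so the arithmetic can be checked without calling getpagesize(). `align_up()` is an illustrative name, and the mask trick assumes the page size is a power of two.

```c
/*
 * Round addr up to the next multiple of page_size.
 * Adding page_size - 1 pushes any address that is not already aligned
 * into the next page; masking off the low bits then truncates it down
 * to that page's start.  page_size must be a power of two.
 */
static unsigned long align_up(unsigned long addr, unsigned long page_size)
{
	return (addr + page_size - 1) & ~(page_size - 1);
}
```

With 4096-byte pages, 1 and 4096 both round to 4096, while 4097 rounds to 8192.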
446 484
447/*L:180 An "initial ram disk" is a disk image loaded into memory along with 485/*L:180
448 * the kernel which the kernel can use to boot from without needing any 486 * An "initial ram disk" is a disk image loaded into memory along with the
449 * drivers. Most distributions now use this as standard: the initrd contains 487 * kernel which the kernel can use to boot from without needing any drivers.
450 * the code to load the appropriate driver modules for the current machine. 488 * Most distributions now use this as standard: the initrd contains the code to
489 * load the appropriate driver modules for the current machine.
451 * 490 *
452 * Importantly, James Morris works for RedHat, and Fedora uses initrds for its 491 * Importantly, James Morris works for RedHat, and Fedora uses initrds for its
453 * kernels. He sent me this (and tells me when I break it). */ 492 * kernels. He sent me this (and tells me when I break it).
493 */
454static unsigned long load_initrd(const char *name, unsigned long mem) 494static unsigned long load_initrd(const char *name, unsigned long mem)
455{ 495{
456 int ifd; 496 int ifd;
@@ -462,12 +502,16 @@ static unsigned long load_initrd(const char *name, unsigned long mem)
462 if (fstat(ifd, &st) < 0) 502 if (fstat(ifd, &st) < 0)
463 err(1, "fstat() on initrd '%s'", name); 503 err(1, "fstat() on initrd '%s'", name);
464 504
465 /* We map the initrd at the top of memory, but mmap wants it to be 505 /*
466 * page-aligned, so we round the size up for that. */ 506 * We map the initrd at the top of memory, but mmap wants it to be
507 * page-aligned, so we round the size up for that.
508 */
467 len = page_align(st.st_size); 509 len = page_align(st.st_size);
468 map_at(ifd, from_guest_phys(mem - len), 0, st.st_size); 510 map_at(ifd, from_guest_phys(mem - len), 0, st.st_size);
469 /* Once a file is mapped, you can close the file descriptor. It's a 511 /*
470 * little odd, but quite useful. */ 512 * Once a file is mapped, you can close the file descriptor. It's a
513 * little odd, but quite useful.
514 */
471 close(ifd); 515 close(ifd);
472 verbose("mapped initrd %s size=%lu @ %p\n", name, len, (void*)mem-len); 516 verbose("mapped initrd %s size=%lu @ %p\n", name, len, (void*)mem-len);
473 517
@@ -476,8 +520,10 @@ static unsigned long load_initrd(const char *name, unsigned long mem)
476} 520}
477/*:*/ 521/*:*/
478 522
479/* Simple routine to roll all the commandline arguments together with spaces 523/*
480 * between them. */ 524 * Simple routine to roll all the commandline arguments together with spaces
525 * between them.
526 */
481static void concat(char *dst, char *args[]) 527static void concat(char *dst, char *args[])
482{ 528{
483 unsigned int i, len = 0; 529 unsigned int i, len = 0;
@@ -494,10 +540,12 @@ static void concat(char *dst, char *args[])
494 dst[len] = '\0'; 540 dst[len] = '\0';
495} 541}
496 542
497/*L:185 This is where we actually tell the kernel to initialize the Guest. We 543/*L:185
544 * This is where we actually tell the kernel to initialize the Guest. We
498 * saw the arguments it expects when we looked at initialize() in lguest_user.c: 545 * saw the arguments it expects when we looked at initialize() in lguest_user.c:
499 * the base of Guest "physical" memory, the top physical page to allow and the 546 * the base of Guest "physical" memory, the top physical page to allow and the
500 * entry point for the Guest. */ 547 * entry point for the Guest.
548 */
501static void tell_kernel(unsigned long start) 549static void tell_kernel(unsigned long start)
502{ 550{
503 unsigned long args[] = { LHREQ_INITIALIZE, 551 unsigned long args[] = { LHREQ_INITIALIZE,
@@ -511,7 +559,7 @@ static void tell_kernel(unsigned long start)
511} 559}
512/*:*/ 560/*:*/
513 561
514/* 562/*L:200
515 * Device Handling. 563 * Device Handling.
516 * 564 *
517 * When the Guest gives us a buffer, it sends an array of addresses and sizes. 565 * When the Guest gives us a buffer, it sends an array of addresses and sizes.
@@ -522,20 +570,26 @@ static void tell_kernel(unsigned long start)
522static void *_check_pointer(unsigned long addr, unsigned int size, 570static void *_check_pointer(unsigned long addr, unsigned int size,
523 unsigned int line) 571 unsigned int line)
524{ 572{
525 /* We have to separately check addr and addr+size, because size could 573 /*
526 * be huge and addr + size might wrap around. */ 574 * We have to separately check addr and addr+size, because size could
575 * be huge and addr + size might wrap around.
576 */
527 if (addr >= guest_limit || addr + size >= guest_limit) 577 if (addr >= guest_limit || addr + size >= guest_limit)
528 errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr); 578 errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr);
529 /* We return a pointer for the caller's convenience, now we know it's 579 /*
530 * safe to use. */ 580 * We return a pointer for the caller's convenience, now we know it's
581 * safe to use.
582 */
531 return from_guest_phys(addr); 583 return from_guest_phys(addr);
532} 584}
533/* A macro which transparently hands the line number to the real function. */ 585/* A macro which transparently hands the line number to the real function. */
534#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__) 586#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
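The wrap-around concern in the comment above can be made concrete with fixed-width arithmetic. The sketch below is not the Launcher's check: it uses 32-bit values throughout, and both function names are hypothetical. It contrasts a test that adds first with one that never forms `addr + size`, so nothing can overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Adds first: the sum can wrap past zero and look small again. */
static bool range_ok_naive(uint32_t addr, uint32_t size, uint32_t limit)
{
	return addr < limit && addr + size < limit;
}

/*
 * Same condition (addr + size < limit) rearranged so that no addition
 * happens: limit - size cannot wrap once size < limit has been checked.
 */
static bool range_ok(uint32_t addr, uint32_t size, uint32_t limit)
{
	return size < limit && addr < limit - size;
}
```

With a limit of 0x100000, address 0x1000 and size 0xFFFFFFFF, the naive form accepts the range because the 32-bit sum wraps to 0xFFF, while the rearranged form rejects it.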
535 587
536/* Each buffer in the virtqueues is actually a chain of descriptors. This 588/*
589 * Each buffer in the virtqueues is actually a chain of descriptors. This
537 * function returns the next descriptor in the chain, or vq->vring.num if we're 590 * function returns the next descriptor in the chain, or vq->vring.num if we're
538 * at the end. */ 591 * at the end.
592 */
539static unsigned next_desc(struct vring_desc *desc, 593static unsigned next_desc(struct vring_desc *desc,
540 unsigned int i, unsigned int max) 594 unsigned int i, unsigned int max)
541{ 595{
@@ -556,7 +610,10 @@ static unsigned next_desc(struct vring_desc *desc,
556 return next; 610 return next;
557} 611}
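To see how a buffer is spread over a chain of descriptors, the following sketch walks such a chain and counts its entries. It is an illustration only: `struct desc` keeps just the two fields needed here, `chain_length()` is a hypothetical helper, and malformed chains yield 0 where the Launcher would exit. The flag value matches `VRING_DESC_F_NEXT`.

```c
#include <stdint.h>

#define DESC_F_NEXT 1	/* same value as VRING_DESC_F_NEXT */

/* Cut-down stand-in for struct vring_desc: only the fields used here. */
struct desc {
	uint16_t flags;
	uint16_t next;
};

/* Number of descriptors in the chain starting at head, or 0 if malformed. */
static unsigned int chain_length(const struct desc *desc, unsigned int head,
				 unsigned int max)
{
	unsigned int i = head, n = 0;

	for (;;) {
		/* An index outside the table, or more links than entries. */
		if (i >= max || ++n > max)
			return 0;
		/* No NEXT flag: this descriptor ends the chain. */
		if (!(desc[i].flags & DESC_F_NEXT))
			return n;
		i = desc[i].next;
	}
}
```

Bounding the walk by the table size means a Guest-supplied loop cannot make the Launcher spin forever.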
558 612
559/* This actually sends the interrupt for this virtqueue */ 613/*
614 * This actually sends the interrupt for this virtqueue, if we've used a
615 * buffer.
616 */
560static void trigger_irq(struct virtqueue *vq) 617static void trigger_irq(struct virtqueue *vq)
561{ 618{
562 unsigned long buf[] = { LHREQ_IRQ, vq->config.irq }; 619 unsigned long buf[] = { LHREQ_IRQ, vq->config.irq };
@@ -576,12 +633,14 @@ static void trigger_irq(struct virtqueue *vq)
576 err(1, "Triggering irq %i", vq->config.irq); 633 err(1, "Triggering irq %i", vq->config.irq);
577} 634}
578 635
579/* This looks in the virtqueue and for the first available buffer, and converts 636/*
637 * This looks in the virtqueue for the first available buffer, and converts
580 * it to an iovec for convenient access. Since descriptors consist of some 638 * it to an iovec for convenient access. Since descriptors consist of some
581 * number of output then some number of input descriptors, it's actually two 639 * number of output then some number of input descriptors, it's actually two
582 * iovecs, but we pack them into one and note how many of each there were. 640 * iovecs, but we pack them into one and note how many of each there were.
583 * 641 *
584 * This function returns the descriptor number found. */ 642 * This function waits if necessary, and returns the descriptor number found.
643 */
585static unsigned wait_for_vq_desc(struct virtqueue *vq, 644static unsigned wait_for_vq_desc(struct virtqueue *vq,
586 struct iovec iov[], 645 struct iovec iov[],
587 unsigned int *out_num, unsigned int *in_num) 646 unsigned int *out_num, unsigned int *in_num)
@@ -590,17 +649,23 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
590 struct vring_desc *desc; 649 struct vring_desc *desc;
591 u16 last_avail = lg_last_avail(vq); 650 u16 last_avail = lg_last_avail(vq);
592 651
652 /* There's nothing available? */
593 while (last_avail == vq->vring.avail->idx) { 653 while (last_avail == vq->vring.avail->idx) {
594 u64 event; 654 u64 event;
595 655
596 /* OK, tell Guest about progress up to now. */ 656 /*
657 * Since we're about to sleep, now is a good time to tell the
658 * Guest about what we've used up to now.
659 */
597 trigger_irq(vq); 660 trigger_irq(vq);
598 661
599 /* OK, now we need to know about added descriptors. */ 662 /* OK, now we need to know about added descriptors. */
600 vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY; 663 vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
601 664
602 /* They could have slipped one in as we were doing that: make 665 /*
603 * sure it's written, then check again. */ 666 * They could have slipped one in as we were doing that: make
667 * sure it's written, then check again.
668 */
604 mb(); 669 mb();
605 if (last_avail != vq->vring.avail->idx) { 670 if (last_avail != vq->vring.avail->idx) {
606 vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY; 671 vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
@@ -620,8 +685,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
620 errx(1, "Guest moved used index from %u to %u", 685 errx(1, "Guest moved used index from %u to %u",
621 last_avail, vq->vring.avail->idx); 686 last_avail, vq->vring.avail->idx);
622 687
623 /* Grab the next descriptor number they're advertising, and increment 688 /*
624 * the index we've seen. */ 689 * Grab the next descriptor number they're advertising, and increment
690 * the index we've seen.
691 */
625 head = vq->vring.avail->ring[last_avail % vq->vring.num]; 692 head = vq->vring.avail->ring[last_avail % vq->vring.num];
626 lg_last_avail(vq)++; 693 lg_last_avail(vq)++;
627 694
@@ -636,8 +703,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
636 desc = vq->vring.desc; 703 desc = vq->vring.desc;
637 i = head; 704 i = head;
638 705
639 /* If this is an indirect entry, then this buffer contains a descriptor 706 /*
640 * table which we handle as if it's any normal descriptor chain. */ 707 * If this is an indirect entry, then this buffer contains a descriptor
708 * table which we handle as if it's any normal descriptor chain.
709 */
641 if (desc[i].flags & VRING_DESC_F_INDIRECT) { 710 if (desc[i].flags & VRING_DESC_F_INDIRECT) {
642 if (desc[i].len % sizeof(struct vring_desc)) 711 if (desc[i].len % sizeof(struct vring_desc))
643 errx(1, "Invalid size for indirect buffer table"); 712 errx(1, "Invalid size for indirect buffer table");
@@ -656,8 +725,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
656 if (desc[i].flags & VRING_DESC_F_WRITE) 725 if (desc[i].flags & VRING_DESC_F_WRITE)
657 (*in_num)++; 726 (*in_num)++;
658 else { 727 else {
659 /* If it's an output descriptor, they're all supposed 728 /*
660 * to come before any input descriptors. */ 729 * If it's an output descriptor, they're all supposed
730 * to come before any input descriptors.
731 */
661 if (*in_num) 732 if (*in_num)
662 errx(1, "Descriptor has out after in"); 733 errx(1, "Descriptor has out after in");
663 (*out_num)++; 734 (*out_num)++;
@@ -671,14 +742,19 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq,
671 return head; 742 return head;
672} 743}
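
The indexing used in wait_for_vq_desc() and add_used() relies on free-running 16-bit counters that are reduced modulo the ring size only when a slot is looked up. The following is a minimal sketch of that idea, not the Launcher's code: `toy_ring`, `TOY_RING_NUM` and the `toy_*` helpers are hypothetical names, and the ring size is assumed to be a power of two so that the slot sequence stays continuous when the counter wraps at 65536.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_NUM 8	/* hypothetical ring size; assumed to be a power of two */

/* Hypothetical stand-in for an avail ring: this is not the real struct vring. */
struct toy_ring {
	uint16_t idx;			/* free-running producer index */
	uint16_t ring[TOY_RING_NUM];	/* descriptor heads */
};

/* Producer side: store a head in the next slot, then bump the index. */
static void toy_publish(struct toy_ring *r, uint16_t head)
{
	/* The modulo trick only survives wraparound for power-of-two sizes. */
	assert((TOY_RING_NUM & (TOY_RING_NUM - 1)) == 0);
	r->ring[r->idx % TOY_RING_NUM] = head;
	r->idx++;
}

/* Number of entries the consumer has not seen yet; correct across the wrap. */
static uint16_t toy_pending(const struct toy_ring *r, uint16_t last_seen)
{
	return (uint16_t)(r->idx - last_seen);
}

/* Consumer side: fetch the next head and advance our private counter. */
static uint16_t toy_consume(const struct toy_ring *r, uint16_t *last_seen)
{
	uint16_t head = r->ring[*last_seen % TOY_RING_NUM];

	(*last_seen)++;
	return head;
}
```

A consumer would typically check that toy_pending() is non-zero before calling toy_consume(); a pending count larger than the ring size indicates a misbehaving producer, which is the condition the errx() call in wait_for_vq_desc() guards against.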
673 744
674/* After we've used one of their buffers, we tell them about it. We'll then 745/*
675 * want to send them an interrupt, using trigger_irq(). */ 746 * After we've used one of their buffers, we tell the Guest about it. Sometime
747 * later we'll want to send them an interrupt using trigger_irq(); note that
748 * wait_for_vq_desc() does that for us if it has to wait.
749 */
676static void add_used(struct virtqueue *vq, unsigned int head, int len) 750static void add_used(struct virtqueue *vq, unsigned int head, int len)
677{ 751{
678 struct vring_used_elem *used; 752 struct vring_used_elem *used;
679 753
680 /* The virtqueue contains a ring of used buffers. Get a pointer to the 754 /*
681 * next entry in that used ring. */ 755 * The virtqueue contains a ring of used buffers. Get a pointer to the
756 * next entry in that used ring.
757 */
682 used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num]; 758 used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num];
683 used->id = head; 759 used->id = head;
684 used->len = len; 760 used->len = len;
@@ -698,9 +774,9 @@ static void add_used_and_trigger(struct virtqueue *vq, unsigned head, int len)
698/* 774/*
699 * The Console 775 * The Console
700 * 776 *
701 * We associate some data with the console for our exit hack. */ 777 * We associate some data with the console for our exit hack.
702struct console_abort 778 */
703{ 779struct console_abort {
704 /* How many times have they hit ^C? */ 780 /* How many times have they hit ^C? */
705 int count; 781 int count;
706 /* When did they start? */ 782 /* When did they start? */
@@ -715,30 +791,35 @@ static void console_input(struct virtqueue *vq)
715 struct console_abort *abort = vq->dev->priv; 791 struct console_abort *abort = vq->dev->priv;
716 struct iovec iov[vq->vring.num]; 792 struct iovec iov[vq->vring.num];
717 793
718 /* Make sure there's a descriptor waiting. */ 794 /* Make sure there's a descriptor available. */
719 head = wait_for_vq_desc(vq, iov, &out_num, &in_num); 795 head = wait_for_vq_desc(vq, iov, &out_num, &in_num);
720 if (out_num) 796 if (out_num)
721 errx(1, "Output buffers in console in queue?"); 797 errx(1, "Output buffers in console in queue?");
722 798
723 /* Read it in. */ 799 /* Read into it. This is where we usually wait. */
724 len = readv(STDIN_FILENO, iov, in_num); 800 len = readv(STDIN_FILENO, iov, in_num);
725 if (len <= 0) { 801 if (len <= 0) {
726 /* Ran out of input? */ 802 /* Ran out of input? */
727 warnx("Failed to get console input, ignoring console."); 803 warnx("Failed to get console input, ignoring console.");
728 /* For simplicity, dying threads kill the whole Launcher. So 804 /*
729 * just nap here. */ 805 * For simplicity, dying threads kill the whole Launcher. So
806 * just nap here.
807 */
730 for (;;) 808 for (;;)
731 pause(); 809 pause();
732 } 810 }
733 811
812 /* Tell the Guest we used a buffer. */
734 add_used_and_trigger(vq, head, len); 813 add_used_and_trigger(vq, head, len);
735 814
736 /* Three ^C within one second? Exit. 815 /*
816 * Three ^C within one second? Exit.
737 * 817 *
738 * This is such a hack, but works surprisingly well. Each ^C has to 818 * This is such a hack, but works surprisingly well. Each ^C has to
739 * be in a buffer by itself, so they can't be too fast. But we check 819 * be in a buffer by itself, so they can't be too fast. But we check
740 * that we get three within about a second, so they can't be too 820 * that we get three within about a second, so they can't be too
741 * slow. */ 821 * slow.
822 */
742 if (len != 1 || ((char *)iov[0].iov_base)[0] != 3) { 823 if (len != 1 || ((char *)iov[0].iov_base)[0] != 3) {
743 abort->count = 0; 824 abort->count = 0;
744 return; 825 return;
@@ -763,15 +844,23 @@ static void console_output(struct virtqueue *vq)
763 unsigned int head, out, in; 844 unsigned int head, out, in;
764 struct iovec iov[vq->vring.num]; 845 struct iovec iov[vq->vring.num];
765 846
847 /* We usually wait in here, for the Guest to give us something. */
766 head = wait_for_vq_desc(vq, iov, &out, &in); 848 head = wait_for_vq_desc(vq, iov, &out, &in);
767 if (in) 849 if (in)
768 errx(1, "Input buffers in console output queue?"); 850 errx(1, "Input buffers in console output queue?");
851
852 /* writev can return a partial write, so we loop here. */
769 while (!iov_empty(iov, out)) { 853 while (!iov_empty(iov, out)) {
770 int len = writev(STDOUT_FILENO, iov, out); 854 int len = writev(STDOUT_FILENO, iov, out);
771 if (len <= 0) 855 if (len <= 0)
772 err(1, "Write to stdout gave %i", len); 856 err(1, "Write to stdout gave %i", len);
773 iov_consume(iov, out, len); 857 iov_consume(iov, out, len);
774 } 858 }
859
860 /*
861 * We're finished with that buffer: if we're going to sleep,
862 * wait_for_vq_desc() will prod the Guest with an interrupt.
863 */
775 add_used(vq, head, 0); 864 add_used(vq, head, 0);
776} 865}
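
The partial-write loop in console_output() depends on two helpers, iov_empty() and iov_consume(), whose bodies are not part of this diff. A minimal sketch of what such helpers might look like follows; the `sketch_` names and the exact signatures are assumptions made for illustration, not the Launcher's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* True once every element of the iovec array has been fully consumed. */
static bool sketch_iov_empty(const struct iovec iov[], unsigned int num)
{
	unsigned int i;

	for (i = 0; i < num; i++)
		if (iov[i].iov_len)
			return false;
	return true;
}

/* Skip the first len bytes of the array, e.g. after a partial writev(). */
static void sketch_iov_consume(struct iovec iov[], unsigned int num, size_t len)
{
	unsigned int i;

	for (i = 0; i < num && len; i++) {
		size_t used = len < iov[i].iov_len ? len : iov[i].iov_len;

		iov[i].iov_base = (char *)iov[i].iov_base + used;
		iov[i].iov_len -= used;
		len -= used;
	}
	/* The caller must not claim more bytes than the array held. */
	assert(len == 0);
}
```

Because the helpers shrink the array in place, the loop can simply pass the same iov to writev() again until sketch_iov_empty() reports that nothing is left.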
777 866
@@ -791,15 +880,30 @@ static void net_output(struct virtqueue *vq)
791 unsigned int head, out, in; 880 unsigned int head, out, in;
792 struct iovec iov[vq->vring.num]; 881 struct iovec iov[vq->vring.num];
793 882
883 /* We usually wait in here for the Guest to give us a packet. */
794 head = wait_for_vq_desc(vq, iov, &out, &in); 884 head = wait_for_vq_desc(vq, iov, &out, &in);
795 if (in) 885 if (in)
796 errx(1, "Input buffers in net output queue?"); 886 errx(1, "Input buffers in net output queue?");
887 /*
888 * Send the whole thing through to /dev/net/tun. It expects the exact
889 * same format: what a coincidence!
890 */
797 if (writev(net_info->tunfd, iov, out) < 0) 891 if (writev(net_info->tunfd, iov, out) < 0)
798 errx(1, "Write to tun failed?"); 892 errx(1, "Write to tun failed?");
893
894 /*
895 * Done with that one; wait_for_vq_desc() will send the interrupt if
896 * all packets are processed.
897 */
799 add_used(vq, head, 0); 898 add_used(vq, head, 0);
800} 899}
801 900
802/* Will reading from this file descriptor block? */ 901/*
902 * Handling network input is a bit trickier, because I've tried to optimize it.
903 *
903 * First we have a helper routine which tells us if reading from this file
904 * descriptor (ie. the /dev/net/tun device) will block:
906 */
803static bool will_block(int fd) 907static bool will_block(int fd)
804{ 908{
805 fd_set fdset; 909 fd_set fdset;
@@ -809,8 +913,11 @@ static bool will_block(int fd)
809 return select(fd+1, &fdset, NULL, NULL, &zero) != 1; 913 return select(fd+1, &fdset, NULL, NULL, &zero) != 1;
810} 914}
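
Only the first and last lines of will_block() are visible in this hunk. As a rough illustration of the technique (a select() call with a zero timeout, used as a non-sleeping readiness probe), a self-contained version could look like the sketch below. The `sketch_will_block` name and the middle of the function are assumptions; only the final select() expression is taken from the diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>

/* Returns true if a read() on fd would block right now. */
static bool sketch_will_block(int fd)
{
	fd_set fdset;
	struct timeval zero = { 0, 0 };	/* zero timeout: probe, don't sleep */

	FD_ZERO(&fdset);
	FD_SET(fd, &fdset);

	/*
	 * select() returns the number of ready descriptors. Anything other
	 * than 1 (nothing readable, or an error) is treated as "will block".
	 */
	return select(fd + 1, &fdset, NULL, NULL, &zero) != 1;
}
```

On an empty pipe this reports true for the read end, and false once a byte has been written to the other end.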
811 915
812/* This is where we handle packets coming in from the tun device to our 916/*
813 * Guest. */ 917 * This handles packets coming in from the tun device to our Guest. Like all
918 * service routines, it gets called again as soon as it returns, so you don't
919 * see a while(1) loop here.
920 */
814static void net_input(struct virtqueue *vq) 921static void net_input(struct virtqueue *vq)
815{ 922{
816 int len; 923 int len;
@@ -818,21 +925,38 @@ static void net_input(struct virtqueue *vq)
818 struct iovec iov[vq->vring.num]; 925 struct iovec iov[vq->vring.num];
819 struct net_info *net_info = vq->dev->priv; 926 struct net_info *net_info = vq->dev->priv;
820 927
928 /*
929 * Get a descriptor to write an incoming packet into. This will also
930 * send an interrupt if they're out of descriptors.
931 */
821 head = wait_for_vq_desc(vq, iov, &out, &in); 932 head = wait_for_vq_desc(vq, iov, &out, &in);
822 if (out) 933 if (out)
823 errx(1, "Output buffers in net input queue?"); 934 errx(1, "Output buffers in net input queue?");
824 935
825 /* Deliver interrupt now, since we're about to sleep. */ 936 /*
937 * If it looks like we'll block reading from the tun device, send them
938 * an interrupt.
939 */
826 if (vq->pending_used && will_block(net_info->tunfd)) 940 if (vq->pending_used && will_block(net_info->tunfd))
827 trigger_irq(vq); 941 trigger_irq(vq);
828 942
943 /*
944 * Read in the packet. This is where we normally wait (when there's no
945 * incoming network traffic).
946 */
829 len = readv(net_info->tunfd, iov, in); 947 len = readv(net_info->tunfd, iov, in);
830 if (len <= 0) 948 if (len <= 0)
831 err(1, "Failed to read from tun."); 949 err(1, "Failed to read from tun.");
950
951 /*
952 * Mark that packet buffer as used, but don't interrupt here. We want
953 * to wait until we've done as much work as we can.
954 */
832 add_used(vq, head, len); 955 add_used(vq, head, len);
833} 956}
957/*:*/
834 958
835/* This is the helper to create threads. */ 959/* This is the helper to create threads: run the service routine in a loop. */
836static int do_thread(void *_vq) 960static int do_thread(void *_vq)
837{ 961{
838 struct virtqueue *vq = _vq; 962 struct virtqueue *vq = _vq;
@@ -842,8 +966,10 @@ static int do_thread(void *_vq)
842 return 0; 966 return 0;
843} 967}
844 968
845/* When a child dies, we kill our entire process group with SIGTERM. This 969/*
846 * also has the side effect that the shell restores the console for us! */ 970 * When a child dies, we kill our entire process group with SIGTERM. This
971 * also has the side effect that the shell restores the console for us!
972 */
847static void kill_launcher(int signal) 973static void kill_launcher(int signal)
848{ 974{
849 kill(0, SIGTERM); 975 kill(0, SIGTERM);
@@ -878,11 +1004,15 @@ static void reset_device(struct device *dev)
878 signal(SIGCHLD, (void *)kill_launcher); 1004 signal(SIGCHLD, (void *)kill_launcher);
879} 1005}
880 1006
1007/*L:216
1008 * This actually creates the thread which services the virtqueue for a device.
1009 */
881static void create_thread(struct virtqueue *vq) 1010static void create_thread(struct virtqueue *vq)
882{ 1011{
883 /* Create stack for thread and run it. Since stack grows 1012 /*
884 * upwards, we point the stack pointer to the end of this 1013 * Create stack for thread. Since the stack grows downwards, we point
885 * region. */ 1014 * the stack pointer to the end of this region.
1015 */
886 char *stack = malloc(32768); 1016 char *stack = malloc(32768);
887 unsigned long args[] = { LHREQ_EVENTFD, 1017 unsigned long args[] = { LHREQ_EVENTFD,
888 vq->config.pfn*getpagesize(), 0 }; 1018 vq->config.pfn*getpagesize(), 0 };
@@ -893,17 +1023,22 @@ static void create_thread(struct virtqueue *vq)
893 err(1, "Creating eventfd"); 1023 err(1, "Creating eventfd");
894 args[2] = vq->eventfd; 1024 args[2] = vq->eventfd;
895 1025
896 /* Attach an eventfd to this virtqueue: it will go off 1026 /*
897 * when the Guest does an LHCALL_NOTIFY for this vq. */ 1027 * Attach an eventfd to this virtqueue: it will go off when the Guest
1028 * does an LHCALL_NOTIFY for this vq.
1029 */
898 if (write(lguest_fd, &args, sizeof(args)) != 0) 1030 if (write(lguest_fd, &args, sizeof(args)) != 0)
899 err(1, "Attaching eventfd"); 1031 err(1, "Attaching eventfd");
900 1032
901 /* CLONE_VM: because it has to access the Guest memory, and 1033 /*
902 * SIGCHLD so we get a signal if it dies. */ 1034 * CLONE_VM: because it has to access the Guest memory, and SIGCHLD so
1035 * we get a signal if it dies.
1036 */
903 vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq); 1037 vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq);
904 if (vq->thread == (pid_t)-1) 1038 if (vq->thread == (pid_t)-1)
905 err(1, "Creating clone"); 1039 err(1, "Creating clone");
906 /* We close our local copy, now the child has it. */ 1040
1041 /* We close our local copy now the child has it. */
907 close(vq->eventfd); 1042 close(vq->eventfd);
908} 1043}
909 1044
@@ -955,7 +1090,10 @@ static void update_device_status(struct device *dev)
955 } 1090 }
956} 1091}
957 1092
958/* This is the generic routine we call when the Guest uses LHCALL_NOTIFY. */ 1093/*L:215
1094 * This is the generic routine we call when the Guest uses LHCALL_NOTIFY. In
1095 * particular, it's used to notify us of device status changes during boot.
1096 */
959static void handle_output(unsigned long addr) 1097static void handle_output(unsigned long addr)
960{ 1098{
961 struct device *i; 1099 struct device *i;
@@ -964,25 +1102,42 @@ static void handle_output(unsigned long addr)
964 for (i = devices.dev; i; i = i->next) { 1102 for (i = devices.dev; i; i = i->next) {
965 struct virtqueue *vq; 1103 struct virtqueue *vq;
966 1104
967 /* Notifications to device descriptors update device status. */ 1105 /*
1106 * Notifications to device descriptors mean they updated the
1107 * device status.
1108 */
968 if (from_guest_phys(addr) == i->desc) { 1109 if (from_guest_phys(addr) == i->desc) {
969 update_device_status(i); 1110 update_device_status(i);
970 return; 1111 return;
971 } 1112 }
972 1113
973 /* Devices *can* be used before status is set to DRIVER_OK. */ 1114 /*
1115 * Devices *can* be used before status is set to DRIVER_OK.
1116 * The original plan was that they would never do this: they
1117 * would always finish setting up their status bits before
1118 * actually touching the virtqueues. In practice, we allowed
1119 * them to, and they do (eg. the disk probes for partition
1120 * tables as part of initialization).
1121 *
1122 * If we see this, we start the device: once it's running, we
1123 * expect the device to catch all the notifications.
1124 */
974 for (vq = i->vq; vq; vq = vq->next) { 1125 for (vq = i->vq; vq; vq = vq->next) {
975 if (addr != vq->config.pfn*getpagesize()) 1126 if (addr != vq->config.pfn*getpagesize())
976 continue; 1127 continue;
977 if (i->running) 1128 if (i->running)
978 errx(1, "Notification on running %s", i->name); 1129 errx(1, "Notification on running %s", i->name);
1130 /* This just calls create_thread() for each virtqueue */
979 start_device(i); 1131 start_device(i);
980 return; 1132 return;
981 } 1133 }
982 } 1134 }
983 1135
984 /* Early console write is done using notify on a nul-terminated string 1136 /*
985 * in Guest memory. */ 1137 * Early console write is done using notify on a nul-terminated string
1138 * in Guest memory. It's also great for hacking debugging messages
1139 * into a Guest.
1140 */
986 if (addr >= guest_limit) 1141 if (addr >= guest_limit)
987 errx(1, "Bad NOTIFY %#lx", addr); 1142 errx(1, "Bad NOTIFY %#lx", addr);
988 1143
@@ -998,10 +1153,12 @@ static void handle_output(unsigned long addr)
998 * routines to allocate and manage them. 1153 * routines to allocate and manage them.
999 */ 1154 */
1000 1155
1001/* The layout of the device page is a "struct lguest_device_desc" followed by a 1156/*
1157 * The layout of the device page is a "struct lguest_device_desc" followed by a
1002 * number of virtqueue descriptors, then two sets of feature bits, then an 1158 * number of virtqueue descriptors, then two sets of feature bits, then an
1003 * array of configuration bytes. This routine returns the configuration 1159 * array of configuration bytes. This routine returns the configuration
1004 * pointer. */ 1160 * pointer.
1161 */
1005static u8 *device_config(const struct device *dev) 1162static u8 *device_config(const struct device *dev)
1006{ 1163{
1007 return (void *)(dev->desc + 1) 1164 return (void *)(dev->desc + 1)
@@ -1009,9 +1166,11 @@ static u8 *device_config(const struct device *dev)
1009 + dev->feature_len * 2; 1166 + dev->feature_len * 2;
1010} 1167}
1011 1168
1012/* This routine allocates a new "struct lguest_device_desc" from descriptor 1169/*
1170 * This routine allocates a new "struct lguest_device_desc" from descriptor
1013 * table page just above the Guest's normal memory. It returns a pointer to 1171 * table page just above the Guest's normal memory. It returns a pointer to
1014 * that descriptor. */ 1172 * that descriptor.
1173 */
1015static struct lguest_device_desc *new_dev_desc(u16 type) 1174static struct lguest_device_desc *new_dev_desc(u16 type)
1016{ 1175{
1017 struct lguest_device_desc d = { .type = type }; 1176 struct lguest_device_desc d = { .type = type };
@@ -1032,8 +1191,10 @@ static struct lguest_device_desc *new_dev_desc(u16 type)
1032 return memcpy(p, &d, sizeof(d)); 1191 return memcpy(p, &d, sizeof(d));
1033} 1192}
1034 1193
1035/* Each device descriptor is followed by the description of its virtqueues. We 1194/*
1036 * specify how many descriptors the virtqueue is to have. */ 1195 * Each device descriptor is followed by the description of its virtqueues. We
1196 * specify how many descriptors the virtqueue is to have.
1197 */
1037static void add_virtqueue(struct device *dev, unsigned int num_descs, 1198static void add_virtqueue(struct device *dev, unsigned int num_descs,
1038 void (*service)(struct virtqueue *)) 1199 void (*service)(struct virtqueue *))
1039{ 1200{
@@ -1050,6 +1211,11 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs,
1050 vq->next = NULL; 1211 vq->next = NULL;
1051 vq->last_avail_idx = 0; 1212 vq->last_avail_idx = 0;
1052 vq->dev = dev; 1213 vq->dev = dev;
1214
1215 /*
1216 * This is the routine the service thread will run, and its Process ID
1217 * once it's running.
1218 */
1053 vq->service = service; 1219 vq->service = service;
1054 vq->thread = (pid_t)-1; 1220 vq->thread = (pid_t)-1;
1055 1221
@@ -1061,10 +1227,12 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs,
1061 /* Initialize the vring. */ 1227 /* Initialize the vring. */
1062 vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN); 1228 vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN);
1063 1229
1064 /* Append virtqueue to this device's descriptor. We use 1230 /*
1231 * Append virtqueue to this device's descriptor. We use
1065 * device_config() to get the end of the device's current virtqueues; 1232 * device_config() to get the end of the device's current virtqueues;
1066 * we check that we haven't added any config or feature information 1233 * we check that we haven't added any config or feature information
1067 * yet, otherwise we'd be overwriting them. */ 1234 * yet, otherwise we'd be overwriting them.
1235 */
1068 assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0); 1236 assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0);
1069 memcpy(device_config(dev), &vq->config, sizeof(vq->config)); 1237 memcpy(device_config(dev), &vq->config, sizeof(vq->config));
1070 dev->num_vq++; 1238 dev->num_vq++;
@@ -1072,14 +1240,18 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs,
1072 1240
1073 verbose("Virtqueue page %#lx\n", to_guest_phys(p)); 1241 verbose("Virtqueue page %#lx\n", to_guest_phys(p));
1074 1242
1075 /* Add to tail of list, so dev->vq is first vq, dev->vq->next is 1243 /*
1076 * second. */ 1244 * Add to tail of list, so dev->vq is first vq, dev->vq->next is
1245 * second.
1246 */
1077 for (i = &dev->vq; *i; i = &(*i)->next); 1247 for (i = &dev->vq; *i; i = &(*i)->next);
1078 *i = vq; 1248 *i = vq;
1079} 1249}
1080 1250
1081/* The first half of the feature bitmask is for us to advertise features. The 1251/*
1082 * second half is for the Guest to accept features. */ 1252 * The first half of the feature bitmask is for us to advertise features. The
1253 * second half is for the Guest to accept features.
1254 */
1083static void add_feature(struct device *dev, unsigned bit) 1255static void add_feature(struct device *dev, unsigned bit)
1084{ 1256{
1085 u8 *features = get_feature_bits(dev); 1257 u8 *features = get_feature_bits(dev);
@@ -1093,9 +1265,11 @@ static void add_feature(struct device *dev, unsigned bit)
1093 features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT)); 1265 features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT));
1094} 1266}
1095 1267
1096/* This routine sets the configuration fields for an existing device's 1268/*
1269 * This routine sets the configuration fields for an existing device's
1097 * descriptor. It only works for the last device, but that's OK because that's 1270 * descriptor. It only works for the last device, but that's OK because that's
1098 * how we use it. */ 1271 * how we use it.
1272 */
1099static void set_config(struct device *dev, unsigned len, const void *conf) 1273static void set_config(struct device *dev, unsigned len, const void *conf)
1100{ 1274{
1101 /* Check we haven't overflowed our single page. */ 1275 /* Check we haven't overflowed our single page. */
@@ -1105,12 +1279,18 @@ static void set_config(struct device *dev, unsigned len, const void *conf)
1105 /* Copy in the config information, and store the length. */ 1279 /* Copy in the config information, and store the length. */
1106 memcpy(device_config(dev), conf, len); 1280 memcpy(device_config(dev), conf, len);
1107 dev->desc->config_len = len; 1281 dev->desc->config_len = len;
1282
1283 /* Size must fit in config_len field (8 bits)! */
1284 assert(dev->desc->config_len == len);
1108} 1285}
1109 1286
1110/* This routine does all the creation and setup of a new device, including 1287/*
1111 * calling new_dev_desc() to allocate the descriptor and device memory. 1288 * This routine does all the creation and setup of a new device, including
1289 * calling new_dev_desc() to allocate the descriptor and device memory. We
1290 * don't actually start the service threads until later.
1112 * 1291 *
1113 * See what I mean about userspace being boring? */ 1292 * See what I mean about userspace being boring?
1293 */
1114static struct device *new_device(const char *name, u16 type) 1294static struct device *new_device(const char *name, u16 type)
1115{ 1295{
1116 struct device *dev = malloc(sizeof(*dev)); 1296 struct device *dev = malloc(sizeof(*dev));
@@ -1123,10 +1303,12 @@ static struct device *new_device(const char *name, u16 type)
1123 dev->num_vq = 0; 1303 dev->num_vq = 0;
1124 dev->running = false; 1304 dev->running = false;
1125 1305
1126 /* Append to device list. Prepending to a single-linked list is 1306 /*
1307 * Append to device list. Prepending to a single-linked list is
1127 * easier, but the user expects the devices to be arranged on the bus 1308 * easier, but the user expects the devices to be arranged on the bus
1128 * in command-line order. The first network device on the command line 1309 * in command-line order. The first network device on the command line
1129 * is eth0, the first block device /dev/vda, etc. */ 1310 * is eth0, the first block device /dev/vda, etc.
1311 */
1130 if (devices.lastdev) 1312 if (devices.lastdev)
1131 devices.lastdev->next = dev; 1313 devices.lastdev->next = dev;
1132 else 1314 else
@@ -1136,8 +1318,10 @@ static struct device *new_device(const char *name, u16 type)
1136 return dev; 1318 return dev;
1137} 1319}
1138 1320
1139/* Our first setup routine is the console. It's a fairly simple device, but 1321/*
1140 * UNIX tty handling makes it uglier than it could be. */ 1322 * Our first setup routine is the console. It's a fairly simple device, but
1323 * UNIX tty handling makes it uglier than it could be.
1324 */
1141static void setup_console(void) 1325static void setup_console(void)
1142{ 1326{
1143 struct device *dev; 1327 struct device *dev;
@@ -1145,8 +1329,10 @@ static void setup_console(void)
1145 /* If we can save the initial standard input settings... */ 1329 /* If we can save the initial standard input settings... */
1146 if (tcgetattr(STDIN_FILENO, &orig_term) == 0) { 1330 if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
1147 struct termios term = orig_term; 1331 struct termios term = orig_term;
1148 /* Then we turn off echo, line buffering and ^C etc. We want a 1332 /*
1149 * raw input stream to the Guest. */ 1333 * Then we turn off echo, line buffering and ^C etc: We want a
1334 * raw input stream to the Guest.
1335 */
1150 term.c_lflag &= ~(ISIG|ICANON|ECHO); 1336 term.c_lflag &= ~(ISIG|ICANON|ECHO);
1151 tcsetattr(STDIN_FILENO, TCSANOW, &term); 1337 tcsetattr(STDIN_FILENO, TCSANOW, &term);
1152 } 1338 }
@@ -1157,10 +1343,12 @@ static void setup_console(void)
1157 dev->priv = malloc(sizeof(struct console_abort)); 1343 dev->priv = malloc(sizeof(struct console_abort));
1158 ((struct console_abort *)dev->priv)->count = 0; 1344 ((struct console_abort *)dev->priv)->count = 0;
1159 1345
1160 /* The console needs two virtqueues: the input then the output. When 1346 /*
1347 * The console needs two virtqueues: the input then the output. When
1161 * they put something in the input queue, we make sure we're listening to 1348 * they put something in the input queue, we make sure we're listening to
1162 * stdin. When they put something in the output queue, we write it to 1349 * stdin. When they put something in the output queue, we write it to
1163 * stdout. */ 1350 * stdout.
1351 */
1164 add_virtqueue(dev, VIRTQUEUE_NUM, console_input); 1352 add_virtqueue(dev, VIRTQUEUE_NUM, console_input);
1165 add_virtqueue(dev, VIRTQUEUE_NUM, console_output); 1353 add_virtqueue(dev, VIRTQUEUE_NUM, console_output);
1166 1354
@@ -1168,7 +1356,8 @@ static void setup_console(void)
1168} 1356}
1169/*:*/ 1357/*:*/
1170 1358
1171/*M:010 Inter-guest networking is an interesting area. Simplest is to have a 1359/*M:010
1360 * Inter-guest networking is an interesting area. Simplest is to have a
1172 * --sharenet=<name> option which opens or creates a named pipe. This can be 1361 * --sharenet=<name> option which opens or creates a named pipe. This can be
1173 * used to send packets to another guest in a 1:1 manner. 1362 * used to send packets to another guest in a 1:1 manner.
1174 * 1363 *
@@ -1182,7 +1371,8 @@ static void setup_console(void)
1182 * multiple inter-guest channels behind one interface, although it would 1371 * multiple inter-guest channels behind one interface, although it would
1183 * require some manner of hotplugging new virtio channels. 1372 * require some manner of hotplugging new virtio channels.
1184 * 1373 *
1185 * Finally, we could implement a virtio network switch in the kernel. :*/ 1374 * Finally, we could implement a virtio network switch in the kernel.
1375:*/
1186 1376
1187static u32 str2ip(const char *ipaddr) 1377static u32 str2ip(const char *ipaddr)
1188{ 1378{
@@ -1207,11 +1397,13 @@ static void str2mac(const char *macaddr, unsigned char mac[6])
1207 mac[5] = m[5]; 1397 mac[5] = m[5];
1208} 1398}
1209 1399
1210/* This code is "adapted" from libbridge: it attaches the Host end of the 1400/*
1401 * This code is "adapted" from libbridge: it attaches the Host end of the
1211 * network device to the bridge device specified by the command line. 1402 * network device to the bridge device specified by the command line.
1212 * 1403 *
1213 * This is yet another James Morris contribution (I'm an IP-level guy, so I 1404 * This is yet another James Morris contribution (I'm an IP-level guy, so I
1214 * dislike bridging), and I just try not to break it. */ 1405 * dislike bridging), and I just try not to break it.
1406 */
1215static void add_to_bridge(int fd, const char *if_name, const char *br_name) 1407static void add_to_bridge(int fd, const char *if_name, const char *br_name)
1216{ 1408{
1217 int ifidx; 1409 int ifidx;
@@ -1231,9 +1423,11 @@ static void add_to_bridge(int fd, const char *if_name, const char *br_name)
1231 err(1, "can't add %s to bridge %s", if_name, br_name); 1423 err(1, "can't add %s to bridge %s", if_name, br_name);
1232} 1424}
1233 1425
1234/* This sets up the Host end of the network device with an IP address, brings 1426/*
1427 * This sets up the Host end of the network device with an IP address, brings
1235 * it up so packets will flow, then copies the MAC address into the hwaddr 1428 * it up so packets will flow, then copies the MAC address into the hwaddr
1236 * pointer. */ 1429 * pointer.
1430 */
1237static void configure_device(int fd, const char *tapif, u32 ipaddr) 1431static void configure_device(int fd, const char *tapif, u32 ipaddr)
1238{ 1432{
1239 struct ifreq ifr; 1433 struct ifreq ifr;
@@ -1260,10 +1454,12 @@ static int get_tun_device(char tapif[IFNAMSIZ])
1260 /* Start with this zeroed. Messy but sure. */ 1454 /* Start with this zeroed. Messy but sure. */
1261 memset(&ifr, 0, sizeof(ifr)); 1455 memset(&ifr, 0, sizeof(ifr));
1262 1456
1263 /* We open the /dev/net/tun device and tell it we want a tap device. A 1457 /*
1458 * We open the /dev/net/tun device and tell it we want a tap device. A
1264 * tap device is like a tun device, only somehow different. To tell 1459 * tap device is like a tun device, only somehow different. To tell
1265 * the truth, I completely blundered my way through this code, but it 1460 * the truth, I completely blundered my way through this code, but it
1266 * works now! */ 1461 * works now!
1462 */
1267 netfd = open_or_die("/dev/net/tun", O_RDWR); 1463 netfd = open_or_die("/dev/net/tun", O_RDWR);
1268 ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; 1464 ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
1269 strcpy(ifr.ifr_name, "tap%d"); 1465 strcpy(ifr.ifr_name, "tap%d");
@@ -1274,18 +1470,22 @@ static int get_tun_device(char tapif[IFNAMSIZ])
1274 TUN_F_CSUM|TUN_F_TSO4|TUN_F_TSO6|TUN_F_TSO_ECN) != 0) 1470 TUN_F_CSUM|TUN_F_TSO4|TUN_F_TSO6|TUN_F_TSO_ECN) != 0)
1275 err(1, "Could not set features for tun device"); 1471 err(1, "Could not set features for tun device");
1276 1472
1277 /* We don't need checksums calculated for packets coming in this 1473 /*
1278 * device: trust us! */ 1474 * We don't need checksums calculated for packets coming in this
1475 * device: trust us!
1476 */
1279 ioctl(netfd, TUNSETNOCSUM, 1); 1477 ioctl(netfd, TUNSETNOCSUM, 1);
1280 1478
1281 memcpy(tapif, ifr.ifr_name, IFNAMSIZ); 1479 memcpy(tapif, ifr.ifr_name, IFNAMSIZ);
1282 return netfd; 1480 return netfd;
1283} 1481}
1284 1482
1285/*L:195 Our network is a Host<->Guest network. This can either use bridging or 1483/*L:195
1484 * Our network is a Host<->Guest network. This can either use bridging or
1286 * routing, but the principle is the same: it uses the "tun" device to inject 1485 * routing, but the principle is the same: it uses the "tun" device to inject
1287 * packets into the Host as if they came in from a normal network card. We 1486 * packets into the Host as if they came in from a normal network card. We
1288 * just shunt packets between the Guest and the tun device. */ 1487 * just shunt packets between the Guest and the tun device.
1488 */
1289static void setup_tun_net(char *arg) 1489static void setup_tun_net(char *arg)
1290{ 1490{
1291 struct device *dev; 1491 struct device *dev;
@@ -1302,13 +1502,14 @@ static void setup_tun_net(char *arg)
1302 dev = new_device("net", VIRTIO_ID_NET); 1502 dev = new_device("net", VIRTIO_ID_NET);
1303 dev->priv = net_info; 1503 dev->priv = net_info;
1304 1504
1305 /* Network devices need a receive and a send queue, just like 1505 /* Network devices need a recv and a send queue, just like console. */
1306 * console. */
1307 add_virtqueue(dev, VIRTQUEUE_NUM, net_input); 1506 add_virtqueue(dev, VIRTQUEUE_NUM, net_input);
1308 add_virtqueue(dev, VIRTQUEUE_NUM, net_output); 1507 add_virtqueue(dev, VIRTQUEUE_NUM, net_output);
1309 1508
1310 /* We need a socket to perform the magic network ioctls to bring up the 1509 /*
1311 * tap interface, connect to the bridge etc. Any socket will do! */ 1510 * We need a socket to perform the magic network ioctls to bring up the
1511 * tap interface, connect to the bridge etc. Any socket will do!
1512 */
1312 ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP); 1513 ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
1313 if (ipfd < 0) 1514 if (ipfd < 0)
1314 err(1, "opening IP socket"); 1515 err(1, "opening IP socket");
@@ -1362,39 +1563,31 @@ static void setup_tun_net(char *arg)
1362 verbose("device %u: tun %s: %s\n", 1563 verbose("device %u: tun %s: %s\n",
1363 devices.device_num, tapif, arg); 1564 devices.device_num, tapif, arg);
1364} 1565}
1365 1566/*:*/
1366/* Our block (disk) device should be really simple: the Guest asks for a block
1367 * number and we read or write that position in the file. Unfortunately, that
1368 * was amazingly slow: the Guest waits until the read is finished before
1369 * running anything else, even if it could have been doing useful work.
1370 *
1371 * We could use async I/O, except it's reputed to suck so hard that characters
1372 * actually go missing from your code when you try to use it.
1373 *
1374 * So we farm the I/O out to a thread, and communicate with it via a pipe. */
1375 1567
1376/* This hangs off device->priv. */ 1568/* This hangs off device->priv. */
1377struct vblk_info 1569struct vblk_info {
1378{
1379 /* The size of the file. */ 1570 /* The size of the file. */
1380 off64_t len; 1571 off64_t len;
1381 1572
1382 /* The file descriptor for the file. */ 1573 /* The file descriptor for the file. */
1383 int fd; 1574 int fd;
1384 1575
1385 /* IO thread listens on this file descriptor [0]. */
1386 int workpipe[2];
1387
1388 /* IO thread writes to this file descriptor to mark it done, then
1389 * Launcher triggers interrupt to Guest. */
1390 int done_fd;
1391}; 1576};
1392 1577
1393/*L:210 1578/*L:210
1394 * The Disk 1579 * The Disk
1395 * 1580 *
1396 * Remember that the block device is handled by a separate I/O thread. We head 1581 * The disk only has one virtqueue, so it only has one thread. It is really
1397 * straight into the core of that thread here: 1582 * simple: the Guest asks for a block number and we read or write that position
1583 * in the file.
1584 *
1585 * Before we serviced each virtqueue in a separate thread, that was unacceptably
1586 * slow: the Guest waits until the read is finished before running anything
1587 * else, even if it could have been doing useful work.
1588 *
1589 * We could have used async I/O, except it's reputed to suck so hard that
1590 * characters actually go missing from your code when you try to use it.
1398 */ 1591 */
1399static void blk_request(struct virtqueue *vq) 1592static void blk_request(struct virtqueue *vq)
1400{ 1593{
@@ -1406,47 +1599,64 @@ static void blk_request(struct virtqueue *vq)
1406 struct iovec iov[vq->vring.num]; 1599 struct iovec iov[vq->vring.num];
1407 off64_t off; 1600 off64_t off;
1408 1601
1409 /* Get the next request. */ 1602 /*
1603 * Get the next request, where we normally wait. It triggers the
1604 * interrupt to acknowledge previously serviced requests (if any).
1605 */
1410 head = wait_for_vq_desc(vq, iov, &out_num, &in_num); 1606 head = wait_for_vq_desc(vq, iov, &out_num, &in_num);
1411 1607
1412 /* Every block request should contain at least one output buffer 1608 /*
1609 * Every block request should contain at least one output buffer
1413 * (detailing the location on disk and the type of request) and one 1610 * (detailing the location on disk and the type of request) and one
1414 * input buffer (to hold the result). */ 1611 * input buffer (to hold the result).
1612 */
1415 if (out_num == 0 || in_num == 0) 1613 if (out_num == 0 || in_num == 0)
1416 errx(1, "Bad virtblk cmd %u out=%u in=%u", 1614 errx(1, "Bad virtblk cmd %u out=%u in=%u",
1417 head, out_num, in_num); 1615 head, out_num, in_num);
1418 1616
1419 out = convert(&iov[0], struct virtio_blk_outhdr); 1617 out = convert(&iov[0], struct virtio_blk_outhdr);
1420 in = convert(&iov[out_num+in_num-1], u8); 1618 in = convert(&iov[out_num+in_num-1], u8);
1619 /*
1620 * For historical reasons, block operations are expressed in 512 byte
1621 * "sectors".
1622 */
1421 off = out->sector * 512; 1623 off = out->sector * 512;
1422 1624
1423 /* The block device implements "barriers", where the Guest indicates 1625 /*
1626 * The block device implements "barriers", where the Guest indicates
1424 * that it wants all previous writes to occur before this write. We 1627 * that it wants all previous writes to occur before this write. We
1425 * don't have a way of asking our kernel to do a barrier, so we just 1628 * don't have a way of asking our kernel to do a barrier, so we just
1426 * synchronize all the data in the file. Pretty poor, no? */ 1629 * synchronize all the data in the file. Pretty poor, no?
1630 */
1427 if (out->type & VIRTIO_BLK_T_BARRIER) 1631 if (out->type & VIRTIO_BLK_T_BARRIER)
1428 fdatasync(vblk->fd); 1632 fdatasync(vblk->fd);
1429 1633
1430 /* In general the virtio block driver is allowed to try SCSI commands. 1634 /*
1431 * It'd be nice if we supported eject, for example, but we don't. */ 1635 * In general the virtio block driver is allowed to try SCSI commands.
1636 * It'd be nice if we supported eject, for example, but we don't.
1637 */
1432 if (out->type & VIRTIO_BLK_T_SCSI_CMD) { 1638 if (out->type & VIRTIO_BLK_T_SCSI_CMD) {
1433 fprintf(stderr, "Scsi commands unsupported\n"); 1639 fprintf(stderr, "Scsi commands unsupported\n");
1434 *in = VIRTIO_BLK_S_UNSUPP; 1640 *in = VIRTIO_BLK_S_UNSUPP;
1435 wlen = sizeof(*in); 1641 wlen = sizeof(*in);
1436 } else if (out->type & VIRTIO_BLK_T_OUT) { 1642 } else if (out->type & VIRTIO_BLK_T_OUT) {
1437 /* Write */ 1643 /*
1438 1644 * Write
1439 /* Move to the right location in the block file. This can fail 1645 *
1440 * if they try to write past end. */ 1646 * Move to the right location in the block file. This can fail
1647 * if they try to write past end.
1648 */
1441 if (lseek64(vblk->fd, off, SEEK_SET) != off) 1649 if (lseek64(vblk->fd, off, SEEK_SET) != off)
1442 err(1, "Bad seek to sector %llu", out->sector); 1650 err(1, "Bad seek to sector %llu", out->sector);
1443 1651
1444 ret = writev(vblk->fd, iov+1, out_num-1); 1652 ret = writev(vblk->fd, iov+1, out_num-1);
1445 verbose("WRITE to sector %llu: %i\n", out->sector, ret); 1653 verbose("WRITE to sector %llu: %i\n", out->sector, ret);
1446 1654
1447 /* Grr... Now we know how long the descriptor they sent was, we 1655 /*
1656 * Grr... Now we know how long the descriptor they sent was, we
1448 * make sure they didn't try to write over the end of the block 1657 * make sure they didn't try to write over the end of the block
1449 * file (possibly extending it). */ 1658 * file (possibly extending it).
1659 */
1450 if (ret > 0 && off + ret > vblk->len) { 1660 if (ret > 0 && off + ret > vblk->len) {
1451 /* Trim it back to the correct length */ 1661 /* Trim it back to the correct length */
1452 ftruncate64(vblk->fd, vblk->len); 1662 ftruncate64(vblk->fd, vblk->len);
@@ -1456,10 +1666,12 @@ static void blk_request(struct virtqueue *vq)
1456 wlen = sizeof(*in); 1666 wlen = sizeof(*in);
1457 *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR); 1667 *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
1458 } else { 1668 } else {
1459 /* Read */ 1669 /*
1460 1670 * Read
1461 /* Move to the right location in the block file. This can fail 1671 *
1462 * if they try to read past end. */ 1672 * Move to the right location in the block file. This can fail
1673 * if they try to read past end.
1674 */
1463 if (lseek64(vblk->fd, off, SEEK_SET) != off) 1675 if (lseek64(vblk->fd, off, SEEK_SET) != off)
1464 err(1, "Bad seek to sector %llu", out->sector); 1676 err(1, "Bad seek to sector %llu", out->sector);
1465 1677
@@ -1474,13 +1686,16 @@ static void blk_request(struct virtqueue *vq)
1474 } 1686 }
1475 } 1687 }
1476 1688
1477 /* OK, so we noted that it was pretty poor to use an fdatasync as a 1689 /*
1690 * OK, so we noted that it was pretty poor to use an fdatasync as a
1478 * barrier. But Christoph Hellwig points out that we need a sync 1691 * barrier. But Christoph Hellwig points out that we need a sync
1479 * *afterwards* as well: "Barriers specify no reordering to the front 1692 * *afterwards* as well: "Barriers specify no reordering to the front
1480 * or the back." And Jens Axboe confirmed it, so here we are: */ 1693 * or the back." And Jens Axboe confirmed it, so here we are:
1694 */
1481 if (out->type & VIRTIO_BLK_T_BARRIER) 1695 if (out->type & VIRTIO_BLK_T_BARRIER)
1482 fdatasync(vblk->fd); 1696 fdatasync(vblk->fd);
1483 1697
1698 /* Finished that request. */
1484 add_used(vq, head, wlen); 1699 add_used(vq, head, wlen);
1485} 1700}
1486 1701
@@ -1491,7 +1706,7 @@ static void setup_block_file(const char *filename)
1491 struct vblk_info *vblk; 1706 struct vblk_info *vblk;
1492 struct virtio_blk_config conf; 1707 struct virtio_blk_config conf;
1493 1708
1494 /* The device responds to return from I/O thread. */ 1709 /* Create the device. */
1495 dev = new_device("block", VIRTIO_ID_BLOCK); 1710 dev = new_device("block", VIRTIO_ID_BLOCK);
1496 1711
1497 /* The device has one virtqueue, where the Guest places requests. */ 1712 /* The device has one virtqueue, where the Guest places requests. */
@@ -1510,27 +1725,32 @@ static void setup_block_file(const char *filename)
1510 /* Tell Guest how many sectors this device has. */ 1725 /* Tell Guest how many sectors this device has. */
1511 conf.capacity = cpu_to_le64(vblk->len / 512); 1726 conf.capacity = cpu_to_le64(vblk->len / 512);
1512 1727
1513 /* Tell Guest not to put in too many descriptors at once: two are used 1728 /*
1514 * for the in and out elements. */ 1729 * Tell Guest not to put in too many descriptors at once: two are used
1730 * for the in and out elements.
1731 */
1515 add_feature(dev, VIRTIO_BLK_F_SEG_MAX); 1732 add_feature(dev, VIRTIO_BLK_F_SEG_MAX);
1516 conf.seg_max = cpu_to_le32(VIRTQUEUE_NUM - 2); 1733 conf.seg_max = cpu_to_le32(VIRTQUEUE_NUM - 2);
1517 1734
1518 set_config(dev, sizeof(conf), &conf); 1735 /* Don't try to put whole struct: we have 8 bit limit. */
1736 set_config(dev, offsetof(struct virtio_blk_config, geometry), &conf);
1519 1737
1520 verbose("device %u: virtblock %llu sectors\n", 1738 verbose("device %u: virtblock %llu sectors\n",
1521 ++devices.device_num, le64_to_cpu(conf.capacity)); 1739 ++devices.device_num, le64_to_cpu(conf.capacity));
1522} 1740}
1523 1741
1524struct rng_info { 1742/*L:211
1525 int rfd; 1743 * Our random number generator device reads from /dev/random into the Guest's
1526};
1527
1528/* Our random number generator device reads from /dev/random into the Guest's
1529 * input buffers. The usual case is that the Guest doesn't want random numbers 1744 * input buffers. The usual case is that the Guest doesn't want random numbers
1530 * and so has no buffers although /dev/random is still readable, whereas 1745 * and so has no buffers although /dev/random is still readable, whereas
1531 * console is the reverse. 1746 * console is the reverse.
1532 * 1747 *
1533 * The same logic applies, however. */ 1748 * The same logic applies, however.
1749 */
1750struct rng_info {
1751 int rfd;
1752};
1753
1534static void rng_input(struct virtqueue *vq) 1754static void rng_input(struct virtqueue *vq)
1535{ 1755{
1536 int len; 1756 int len;
@@ -1543,9 +1763,10 @@ static void rng_input(struct virtqueue *vq)
1543 if (out_num) 1763 if (out_num)
1544 errx(1, "Output buffers in rng?"); 1764 errx(1, "Output buffers in rng?");
1545 1765
1546 /* This is why we convert to iovecs: the readv() call uses them, and so 1766 /*
1547 * it reads straight into the Guest's buffer. We loop to make sure we 1767 * Just like the console write, we loop to cover the whole iovec.
1548 * fill it. */ 1768 * In this case, short reads actually happen quite a bit.
1769 */
1549 while (!iov_empty(iov, in_num)) { 1770 while (!iov_empty(iov, in_num)) {
1550 len = readv(rng_info->rfd, iov, in_num); 1771 len = readv(rng_info->rfd, iov, in_num);
1551 if (len <= 0) 1772 if (len <= 0)
@@ -1558,15 +1779,18 @@ static void rng_input(struct virtqueue *vq)
1558 add_used(vq, head, totlen); 1779 add_used(vq, head, totlen);
1559} 1780}
1560 1781
1561/* And this creates a "hardware" random number device for the Guest. */ 1782/*L:199
1783 * This creates a "hardware" random number device for the Guest.
1784 */
1562static void setup_rng(void) 1785static void setup_rng(void)
1563{ 1786{
1564 struct device *dev; 1787 struct device *dev;
1565 struct rng_info *rng_info = malloc(sizeof(*rng_info)); 1788 struct rng_info *rng_info = malloc(sizeof(*rng_info));
1566 1789
1790 /* Our device's private info simply contains the /dev/random fd. */
1567 rng_info->rfd = open_or_die("/dev/random", O_RDONLY); 1791 rng_info->rfd = open_or_die("/dev/random", O_RDONLY);
1568 1792
1569 /* The device responds to return from I/O thread. */ 1793 /* Create the new device. */
1570 dev = new_device("rng", VIRTIO_ID_RNG); 1794 dev = new_device("rng", VIRTIO_ID_RNG);
1571 dev->priv = rng_info; 1795 dev->priv = rng_info;
1572 1796
@@ -1582,8 +1806,10 @@ static void __attribute__((noreturn)) restart_guest(void)
1582{ 1806{
1583 unsigned int i; 1807 unsigned int i;
1584 1808
1585 /* Since we don't track all open fds, we simply close everything beyond 1809 /*
1586 * stderr. */ 1810 * Since we don't track all open fds, we simply close everything beyond
1811 * stderr.
1812 */
1587 for (i = 3; i < FD_SETSIZE; i++) 1813 for (i = 3; i < FD_SETSIZE; i++)
1588 close(i); 1814 close(i);
1589 1815
@@ -1594,8 +1820,10 @@ static void __attribute__((noreturn)) restart_guest(void)
1594 err(1, "Could not exec %s", main_args[0]); 1820 err(1, "Could not exec %s", main_args[0]);
1595} 1821}
1596 1822
1597/*L:220 Finally we reach the core of the Launcher which runs the Guest, serves 1823/*L:220
1598 * its input and output, and finally, lays it to rest. */ 1824 * Finally we reach the core of the Launcher which runs the Guest, serves
1825 * its input and output, and finally, lays it to rest.
1826 */
1599static void __attribute__((noreturn)) run_guest(void) 1827static void __attribute__((noreturn)) run_guest(void)
1600{ 1828{
1601 for (;;) { 1829 for (;;) {
@@ -1630,7 +1858,7 @@ static void __attribute__((noreturn)) run_guest(void)
1630 * 1858 *
1631 * Are you ready? Take a deep breath and join me in the core of the Host, in 1859 * Are you ready? Take a deep breath and join me in the core of the Host, in
1632 * "make Host". 1860 * "make Host".
1633 :*/ 1861:*/
1634 1862
1635static struct option opts[] = { 1863static struct option opts[] = {
1636 { "verbose", 0, NULL, 'v' }, 1864 { "verbose", 0, NULL, 'v' },
@@ -1651,8 +1879,7 @@ static void usage(void)
1651/*L:105 The main routine is where the real work begins: */ 1879/*L:105 The main routine is where the real work begins: */
1652int main(int argc, char *argv[]) 1880int main(int argc, char *argv[])
1653{ 1881{
1654 /* Memory, top-level pagetable, code startpoint and size of the 1882 /* Memory, code startpoint and size of the (optional) initrd. */
1655 * (optional) initrd. */
1656 unsigned long mem = 0, start, initrd_size = 0; 1883 unsigned long mem = 0, start, initrd_size = 0;
1657 /* Two temporaries. */ 1884 /* Two temporaries. */
1658 int i, c; 1885 int i, c;
@@ -1664,24 +1891,32 @@ int main(int argc, char *argv[])
1664 /* Save the args: we "reboot" by execing ourselves again. */ 1891 /* Save the args: we "reboot" by execing ourselves again. */
1665 main_args = argv; 1892 main_args = argv;
1666 1893
1667 /* First we initialize the device list. We keep a pointer to the last 1894 /*
1895 * First we initialize the device list. We keep a pointer to the last
1668 * device, and the next interrupt number to use for devices (1: 1896 * device, and the next interrupt number to use for devices (1:
1669 * remember that 0 is used by the timer). */ 1897 * remember that 0 is used by the timer).
1898 */
1670 devices.lastdev = NULL; 1899 devices.lastdev = NULL;
1671 devices.next_irq = 1; 1900 devices.next_irq = 1;
1672 1901
1902 /* We're CPU 0. In fact, that's the only CPU possible right now. */
1673 cpu_id = 0; 1903 cpu_id = 0;
1674 /* We need to know how much memory so we can set up the device 1904
1905 /*
1906 * We need to know how much memory so we can set up the device
1675 * descriptor and memory pages for the devices as we parse the command 1907 * descriptor and memory pages for the devices as we parse the command
1676 * line. So we quickly look through the arguments to find the amount 1908 * line. So we quickly look through the arguments to find the amount
1677 * of memory now. */ 1909 * of memory now.
1910 */
1678 for (i = 1; i < argc; i++) { 1911 for (i = 1; i < argc; i++) {
1679 if (argv[i][0] != '-') { 1912 if (argv[i][0] != '-') {
1680 mem = atoi(argv[i]) * 1024 * 1024; 1913 mem = atoi(argv[i]) * 1024 * 1024;
1681 /* We start by mapping anonymous pages over all of 1914 /*
1915 * We start by mapping anonymous pages over all of
1682 * guest-physical memory range. This fills it with 0, 1916 * guest-physical memory range. This fills it with 0,
1683 * and ensures that the Guest won't be killed when it 1917 * and ensures that the Guest won't be killed when it
1684 * tries to access it. */ 1918 * tries to access it.
1919 */
1685 guest_base = map_zeroed_pages(mem / getpagesize() 1920 guest_base = map_zeroed_pages(mem / getpagesize()
1686 + DEVICE_PAGES); 1921 + DEVICE_PAGES);
1687 guest_limit = mem; 1922 guest_limit = mem;
@@ -1714,8 +1949,10 @@ int main(int argc, char *argv[])
1714 usage(); 1949 usage();
1715 } 1950 }
1716 } 1951 }
1717 /* After the other arguments we expect memory and kernel image name, 1952 /*
1718 * followed by command line arguments for the kernel. */ 1953 * After the other arguments we expect memory and kernel image name,
1954 * followed by command line arguments for the kernel.
1955 */
1719 if (optind + 2 > argc) 1956 if (optind + 2 > argc)
1720 usage(); 1957 usage();
1721 1958
@@ -1733,20 +1970,26 @@ int main(int argc, char *argv[])
1733 /* Map the initrd image if requested (at top of physical memory) */ 1970 /* Map the initrd image if requested (at top of physical memory) */
1734 if (initrd_name) { 1971 if (initrd_name) {
1735 initrd_size = load_initrd(initrd_name, mem); 1972 initrd_size = load_initrd(initrd_name, mem);
1736 /* These are the locations in the Linux boot header where the 1973 /*
1737 * start and size of the initrd are expected to be found. */ 1974 * These are the locations in the Linux boot header where the
1975 * start and size of the initrd are expected to be found.
1976 */
1738 boot->hdr.ramdisk_image = mem - initrd_size; 1977 boot->hdr.ramdisk_image = mem - initrd_size;
1739 boot->hdr.ramdisk_size = initrd_size; 1978 boot->hdr.ramdisk_size = initrd_size;
1740 /* The bootloader type 0xFF means "unknown"; that's OK. */ 1979 /* The bootloader type 0xFF means "unknown"; that's OK. */
1741 boot->hdr.type_of_loader = 0xFF; 1980 boot->hdr.type_of_loader = 0xFF;
1742 } 1981 }
1743 1982
1744 /* The Linux boot header contains an "E820" memory map: ours is a 1983 /*
1745 * simple, single region. */ 1984 * The Linux boot header contains an "E820" memory map: ours is a
1985 * simple, single region.
1986 */
1746 boot->e820_entries = 1; 1987 boot->e820_entries = 1;
1747 boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM }); 1988 boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM });
1748 /* The boot header contains a command line pointer: we put the command 1989 /*
1749 * line after the boot header. */ 1990 * The boot header contains a command line pointer: we put the command
1991 * line after the boot header.
1992 */
1750 boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1); 1993 boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1);
1751 /* We use a simple helper to copy the arguments separated by spaces. */ 1994 /* We use a simple helper to copy the arguments separated by spaces. */
1752 concat((char *)(boot + 1), argv+optind+2); 1995 concat((char *)(boot + 1), argv+optind+2);
@@ -1760,11 +2003,13 @@ int main(int argc, char *argv[])
1760 /* Tell the entry path not to try to reload segment registers. */ 2003 /* Tell the entry path not to try to reload segment registers. */
1761 boot->hdr.loadflags |= KEEP_SEGMENTS; 2004 boot->hdr.loadflags |= KEEP_SEGMENTS;
1762 2005
1763 /* We tell the kernel to initialize the Guest: this returns the open 2006 /*
1764 * /dev/lguest file descriptor. */ 2007 * We tell the kernel to initialize the Guest: this returns the open
2008 * /dev/lguest file descriptor.
2009 */
1765 tell_kernel(start); 2010 tell_kernel(start);
1766 2011
1767 /* Ensure that we terminate if a child dies. */ 2012 /* Ensure that we terminate if a device-servicing child dies. */
1768 signal(SIGCHLD, kill_launcher); 2013 signal(SIGCHLD, kill_launcher);
1769 2014
1770 /* If we exit via err(), this kills all the threads, restores tty. */ 2015 /* If we exit via err(), this kills all the threads, restores tty. */
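The blk_request() and setup_block_file() hunks above rely on three pieces of arithmetic: a sector number becomes a byte offset by multiplying by 512, a write that runs past the end of the backing file is detected by comparing off + ret against the file length, and the capacity reported to the Guest is the file length in whole sectors. A minimal standalone sketch of that arithmetic follows. The helper names (sector_to_offset, bytes_past_end, capacity_sectors) and the SECTOR_SIZE macro are assumptions made for illustration and do not appear in lguest.c.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of lguest.c. */
#define SECTOR_SIZE 512ULL

/* Block operations are expressed in 512-byte sectors. */
static uint64_t sector_to_offset(uint64_t sector)
{
	return sector * SECTOR_SIZE;
}

/*
 * Number of bytes by which a transfer of 'len' bytes starting at 'off'
 * overruns a file of 'file_len' bytes, or 0 if it fits.
 */
static uint64_t bytes_past_end(uint64_t off, uint64_t len, uint64_t file_len)
{
	uint64_t end = off + len;

	return end > file_len ? end - file_len : 0;
}

/* Capacity in whole sectors; a trailing partial sector is not counted. */
static uint64_t capacity_sectors(uint64_t file_len)
{
	return file_len / SECTOR_SIZE;
}
```

In the Launcher itself a non-zero overrun on a write is handled by truncating the file back to its original length and failing the request; the sketch only shows how the overrun would be computed.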
diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt
index d0777a1200e..8f339428fdf 100644
--- a/Documentation/networking/6pack.txt
+++ b/Documentation/networking/6pack.txt
@@ -1,7 +1,7 @@
1This is the 6pack-mini-HOWTO, written by 1This is the 6pack-mini-HOWTO, written by
2 2
3Andreas Könsgen DG3KQ 3Andreas Könsgen DG3KQ
4Internet: ajk@iehk.rwth-aachen.de 4Internet: ajk@comnets.uni-bremen.de
5AMPR-net: dg3kq@db0pra.ampr.org 5AMPR-net: dg3kq@db0pra.ampr.org
6AX.25: dg3kq@db0ach.#nrw.deu.eu 6AX.25: dg3kq@db0ach.#nrw.deu.eu
7 7
diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt
index 8d999d862d0..79f533f38c6 100644
--- a/Documentation/powerpc/booting-without-of.txt
+++ b/Documentation/powerpc/booting-without-of.txt
@@ -1238,1122 +1238,7 @@ descriptions for the SOC devices for which new nodes have been
1238defined; this list will expand as more and more SOC-containing 1238defined; this list will expand as more and more SOC-containing
1239platforms are moved over to use the flattened-device-tree model. 1239platforms are moved over to use the flattened-device-tree model.
1240 1240
1241 a) PHY nodes 1241VII - Specifying interrupt information for devices
1242
1243 Required properties:
1244
1245 - device_type : Should be "ethernet-phy"
1246 - interrupts : <a b> where a is the interrupt number and b is a
1247 field that represents an encoding of the sense and level
1248 information for the interrupt. This should be encoded based on
1249 the information in section 2) depending on the type of interrupt
1250 controller you have.
1251 - interrupt-parent : the phandle for the interrupt controller that
1252 services interrupts for this device.
1253 - reg : The ID number for the phy, usually a small integer
1254 - linux,phandle : phandle for this node; likely referenced by an
1255 ethernet controller node.
1256
1257
1258 Example:
1259
1260 ethernet-phy@0 {
1261 linux,phandle = <2452000>;
1262 interrupt-parent = <40000>;
1263 interrupts = <35 1>;
1264 reg = <0>;
1265 device_type = "ethernet-phy";
1266 };
1267
1268
1269 b) Interrupt controllers
1270
1271 Some SOC devices contain interrupt controllers that are different
1272 from the standard Open PIC specification. The SOC device nodes for
1273 these types of controllers should be specified just like a standard
1274 OpenPIC controller. Sense and level information should be encoded
1275 as specified in section 2) of this chapter for each device that
1276 specifies an interrupt.
1277
1278 Example :
1279
1280 pic@40000 {
1281 linux,phandle = <40000>;
1282 interrupt-controller;
1283 #address-cells = <0>;
1284 reg = <40000 40000>;
1285 compatible = "chrp,open-pic";
1286 device_type = "open-pic";
1287 };
1288
1289 c) 4xx/Axon EMAC ethernet nodes
1290
1291 The EMAC ethernet controller in IBM and AMCC 4xx chips, and also
1292 the Axon bridge. To operate, this needs to interact with a
1293 special McMAL DMA controller, and sometimes an RGMII or ZMII
1294 interface. In addition to the nodes and properties described
1295 below, the node for the OPB bus on which the EMAC sits must have a
1296 correct clock-frequency property.
1297
1298 i) The EMAC node itself
1299
1300 Required properties:
1301 - device_type : "network"
1302
1303 - compatible : compatible list, contains 2 entries, first is
1304 "ibm,emac-CHIP" where CHIP is the host ASIC (440gx,
1305 405gp, Axon) and second is either "ibm,emac" or
1306 "ibm,emac4". For Axon, thus, we have: "ibm,emac-axon",
1307 "ibm,emac4"
1308 - interrupts : <interrupt mapping for EMAC IRQ and WOL IRQ>
1309 - interrupt-parent : optional, if needed for interrupt mapping
1310 - reg : <registers mapping>
1311 - local-mac-address : 6 bytes, MAC address
1312 - mal-device : phandle of the associated McMAL node
1313 - mal-tx-channel : 1 cell, index of the tx channel on McMAL associated
1314 with this EMAC
1315 - mal-rx-channel : 1 cell, index of the rx channel on McMAL associated
1316 with this EMAC
1317 - cell-index : 1 cell, hardware index of the EMAC cell on a given
1318 ASIC (typically 0x0 and 0x1 for EMAC0 and EMAC1 on
1319 each Axon chip)
1320 - max-frame-size : 1 cell, maximum frame size supported in bytes
1321 - rx-fifo-size : 1 cell, Rx fifo size in bytes for 10 and 100 Mb/sec
1322 operations.
1323 For Axon, 2048
1324 - tx-fifo-size : 1 cell, Tx fifo size in bytes for 10 and 100 Mb/sec
1325 operations.
1326 For Axon, 2048.
1327 - fifo-entry-size : 1 cell, size of a fifo entry (used to calculate
1328 thresholds).
1329 For Axon, 0x00000010
1330 - mal-burst-size : 1 cell, MAL burst size (used to calculate thresholds)
1331 in bytes.
1332 For Axon, 0x00000100 (I think ...)
1333 - phy-mode : string, mode of operations of the PHY interface.
1334 Supported values are: "mii", "rmii", "smii", "rgmii",
1335 "tbi", "gmii", "rtbi", "sgmii".
1336 For Axon on CAB, it is "rgmii"
1337 - mdio-device : 1 cell, required iff using shared MDIO registers
1338 (440EP). phandle of the EMAC to use to drive the
1339 MDIO lines for the PHY used by this EMAC.
1340 - zmii-device : 1 cell, required iff connected to a ZMII. phandle of
1341 the ZMII device node
1342 - zmii-channel : 1 cell, required iff connected to a ZMII. Which ZMII
1343 channel or 0xffffffff if ZMII is only used for MDIO.
1344 - rgmii-device : 1 cell, required iff connected to an RGMII. phandle
1345 of the RGMII device node.
1346 For Axon: phandle of plb5/plb4/opb/rgmii
1347 - rgmii-channel : 1 cell, required iff connected to an RGMII. Which
1348 RGMII channel is used by this EMAC.
1349 For Axon: present, whatever value is appropriate for each
1350 EMAC, that is the content of the current (bogus) "phy-port"
1351 property.
1352
1353 Optional properties:
1354 - phy-address : 1 cell, optional, MDIO address of the PHY. If absent,
1355 a search is performed.
1356 - phy-map : 1 cell, optional, bitmap of addresses to probe the PHY
1357 for, used if phy-address is absent. bit 0x00000001 is
1358 MDIO address 0.
1359 For Axon it can be absent, though my current driver
1360 doesn't handle phy-address yet so for now, keep
1361 0x00ffffff in it.
1362 - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
1363 operations (if absent the value is the same as
1364 rx-fifo-size). For Axon, either absent or 2048.
1365 - tx-fifo-size-gige : 1 cell, Tx fifo size in bytes for 1000 Mb/sec
1366 operations (if absent the value is the same as
1367 tx-fifo-size). For Axon, either absent or 2048.
1368 - tah-device : 1 cell, optional. If connected to a TAH engine for
1369 offload, phandle of the TAH device node.
1370 - tah-channel : 1 cell, optional. If appropriate, channel used on the
1371 TAH engine.
1372
1373 Example:
1374
1375 EMAC0: ethernet@40000800 {
1376 device_type = "network";
1377 compatible = "ibm,emac-440gp", "ibm,emac";
1378 interrupt-parent = <&UIC1>;
1379 interrupts = <1c 4 1d 4>;
1380 reg = <40000800 70>;
1381 local-mac-address = [00 04 AC E3 1B 1E];
1382 mal-device = <&MAL0>;
1383 mal-tx-channel = <0 1>;
1384 mal-rx-channel = <0>;
1385 cell-index = <0>;
1386 max-frame-size = <5dc>;
1387 rx-fifo-size = <1000>;
1388 tx-fifo-size = <800>;
1389 phy-mode = "rmii";
1390 phy-map = <00000001>;
1391 zmii-device = <&ZMII0>;
1392 zmii-channel = <0>;
1393 };
1394
1395 ii) McMAL node
1396
1397 Required properties:
1398 - device_type : "dma-controller"
1399 - compatible : compatible list, containing 2 entries, first is
1400 "ibm,mcmal-CHIP" where CHIP is the host ASIC (like
1401 emac) and the second is either "ibm,mcmal" or
1402 "ibm,mcmal2".
1403 For Axon, "ibm,mcmal-axon","ibm,mcmal2"
1404 - interrupts : <interrupt mapping for the MAL interrupts sources:
1405 5 sources: tx_eob, rx_eob, serr, txde, rxde>.
1406 For Axon: This is _different_ from the current
1407 firmware. We use the "delayed" interrupts for txeob
1408 and rxeob. Thus we end up with mapping those 5 MPIC
1409 interrupts, all level positive sensitive: 10, 11, 32,
1410 33, 34 (in decimal)
1411 - dcr-reg : < DCR registers range >
1412 - dcr-parent : if needed for dcr-reg
1413 - num-tx-chans : 1 cell, number of Tx channels
1414 - num-rx-chans : 1 cell, number of Rx channels
1415
1416 iii) ZMII node
1417
1418 Required properties:
1419 - compatible : compatible list, containing 2 entries, first is
1420 "ibm,zmii-CHIP" where CHIP is the host ASIC (like
1421 EMAC) and the second is "ibm,zmii".
1422 For Axon, there is no ZMII node.
1423 - reg : <registers mapping>
1424
1425 iv) RGMII node
1426
1427 Required properties:
1428 - compatible : compatible list, containing 2 entries, first is
1429 "ibm,rgmii-CHIP" where CHIP is the host ASIC (like
1430 EMAC) and the second is "ibm,rgmii".
1431 For Axon, "ibm,rgmii-axon","ibm,rgmii"
1432 - reg : <registers mapping>
1433 - revision : as provided by the RGMII new version register if
1434 available.
1435 For Axon: 0x0000012a
1436
1437 d) Xilinx IP cores
1438
1439 The Xilinx EDK toolchain ships with a set of IP cores (devices) for use
1440 in Xilinx Spartan and Virtex FPGAs. The devices cover the whole range
1441 of standard device types (network, serial, etc.) and miscellaneous
1442 devices (gpio, LCD, spi, etc). Also, since these devices are
1443 implemented within the fpga fabric every instance of the device can be
1444 synthesised with different options that change the behaviour.
1445
1446 Each IP-core has a set of parameters which the FPGA designer can use to
1447 control how the core is synthesized. Historically, the EDK tool would
1448 extract the device parameters relevant to device drivers and copy them
1449 into an 'xparameters.h' in the form of #define symbols. This tells the
1450 device drivers how the IP cores are configured, but it requires the kernel
1451 to be recompiled every time the FPGA bitstream is resynthesized.
1452
1453 The new approach is to export the parameters into the device tree and
1454 generate a new device tree each time the FPGA bitstream changes. The
1455 parameters which used to be exported as #defines will now become
1456 properties of the device node. In general, device nodes for IP-cores
1457 will take the following form:
1458
1459 (name): (generic-name)@(base-address) {
1460 compatible = "xlnx,(ip-core-name)-(HW_VER)"
1461 [, (list of compatible devices), ...];
1462 reg = <(baseaddr) (size)>;
1463 interrupt-parent = <&interrupt-controller-phandle>;
1464 interrupts = < ... >;
1465 xlnx,(parameter1) = "(string-value)";
1466 xlnx,(parameter2) = <(int-value)>;
1467 };
1468
1469 (generic-name): an open firmware-style name that describes the
1470 generic class of device. Preferably, this is one word, such
1471 as 'serial' or 'ethernet'.
1472 (ip-core-name): the name of the ip block (given after the BEGIN
1473 directive in system.mhs). Should be in lowercase
1474 and all underscores '_' converted to dashes '-'.
1475 (name): is derived from the "PARAMETER INSTANCE" value.
1476 (parameter#): C_* parameters from system.mhs. The C_ prefix is
1477 dropped from the parameter name, the name is converted
1478 to lowercase and all underscore '_' characters are
1479 converted to dashes '-'.
1480 (baseaddr): the baseaddr parameter value (often named C_BASEADDR).
1481 (HW_VER): from the HW_VER parameter.
1482 (size): the address range size (often C_HIGHADDR - C_BASEADDR + 1).
1483
1484 Typically, the compatible list will include the exact IP core version
1485 followed by an older IP core version which implements the same
1486 interface or any other device with the same interface.
1487
1488 'reg', 'interrupt-parent' and 'interrupts' are all optional properties.
1489
1490 For example, the following block from system.mhs:
1491
1492 BEGIN opb_uartlite
1493 PARAMETER INSTANCE = opb_uartlite_0
1494 PARAMETER HW_VER = 1.00.b
1495 PARAMETER C_BAUDRATE = 115200
1496 PARAMETER C_DATA_BITS = 8
1497 PARAMETER C_ODD_PARITY = 0
1498 PARAMETER C_USE_PARITY = 0
1499 PARAMETER C_CLK_FREQ = 50000000
1500 PARAMETER C_BASEADDR = 0xEC100000
1501 PARAMETER C_HIGHADDR = 0xEC10FFFF
1502 BUS_INTERFACE SOPB = opb_7
1503 PORT OPB_Clk = CLK_50MHz
1504 PORT Interrupt = opb_uartlite_0_Interrupt
1505 PORT RX = opb_uartlite_0_RX
1506 PORT TX = opb_uartlite_0_TX
1507 PORT OPB_Rst = sys_bus_reset_0
1508 END
1509
1510 becomes the following device tree node:
1511
1512 opb_uartlite_0: serial@ec100000 {
1513 device_type = "serial";
1514 compatible = "xlnx,opb-uartlite-1.00.b";
1515 reg = <ec100000 10000>;
1516 interrupt-parent = <&opb_intc_0>;
1517 interrupts = <1 0>; // got this from the opb_intc parameters
1518 current-speed = <d#115200>; // standard serial device prop
1519 clock-frequency = <d#50000000>; // standard serial device prop
1520 xlnx,data-bits = <8>;
1521 xlnx,odd-parity = <0>;
1522 xlnx,use-parity = <0>;
1523 };
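
	As a worked check of the mapping above: the second 'reg' cell is the
	address range size,

		C_HIGHADDR - C_BASEADDR + 1 = 0xEC10FFFF - 0xEC100000 + 1
		                            = 0x10000

	which appears as 10000 because bare cell values in this example are
	hexadecimal (decimal values carry the d# prefix). In the same way,
	C_DATA_BITS loses its C_ prefix, is lowercased and has '_' replaced
	by '-', giving xlnx,data-bits.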
1524
1525 Some IP cores actually implement 2 or more logical devices. In
1526 this case, the device should still describe the whole IP core with
1527 a single node and add a child node for each logical device. The
1528 ranges property can be used to translate from parent IP-core to the
1529 registers of each device. In addition, the parent node should be
1530 compatible with the bus type 'xlnx,compound', and should contain
1531 #address-cells and #size-cells, as with any other bus. (Note: this
1532 makes the assumption that both logical devices have the same bus
1533 binding. If this is not true, then separate nodes should be used
1534 for each logical device). The 'cell-index' property can be used to
1535 enumerate logical devices within an IP core. For example, the
1536 following is the system.mhs entry for the dual ps2 controller found
1537 on the ml403 reference design.
1538
1539 BEGIN opb_ps2_dual_ref
1540 PARAMETER INSTANCE = opb_ps2_dual_ref_0
1541 PARAMETER HW_VER = 1.00.a
1542 PARAMETER C_BASEADDR = 0xA9000000
1543 PARAMETER C_HIGHADDR = 0xA9001FFF
1544 BUS_INTERFACE SOPB = opb_v20_0
1545 PORT Sys_Intr1 = ps2_1_intr
1546 PORT Sys_Intr2 = ps2_2_intr
1547 PORT Clkin1 = ps2_clk_rx_1
1548 PORT Clkin2 = ps2_clk_rx_2
1549 PORT Clkpd1 = ps2_clk_tx_1
1550 PORT Clkpd2 = ps2_clk_tx_2
1551 PORT Rx1 = ps2_d_rx_1
1552 PORT Rx2 = ps2_d_rx_2
1553 PORT Txpd1 = ps2_d_tx_1
1554 PORT Txpd2 = ps2_d_tx_2
1555 END
1556
1557 It would result in the following device tree nodes:
1558
1559 opb_ps2_dual_ref_0: opb-ps2-dual-ref@a9000000 {
1560 #address-cells = <1>;
1561 #size-cells = <1>;
1562 compatible = "xlnx,compound";
1563 ranges = <0 a9000000 2000>;
1564 // If this device had extra parameters, then they would
1565 // go here.
1566 ps2@0 {
1567 compatible = "xlnx,opb-ps2-dual-ref-1.00.a";
1568 reg = <0 40>;
1569 interrupt-parent = <&opb_intc_0>;
1570 interrupts = <3 0>;
1571 cell-index = <0>;
1572 };
1573 ps2@1000 {
1574 compatible = "xlnx,opb-ps2-dual-ref-1.00.a";
1575 reg = <1000 40>;
1576 interrupt-parent = <&opb_intc_0>;
1577 interrupts = <3 0>;
1578 cell-index = <1>;
1579 };
1580 };
1581
1582 Also, the system.mhs file defines bus attachments from the processor
1583 to the devices. The device tree structure should reflect the bus
1584 attachments. Again an example; this system.mhs fragment:
1585
1586 BEGIN ppc405_virtex4
1587 PARAMETER INSTANCE = ppc405_0
1588 PARAMETER HW_VER = 1.01.a
1589 BUS_INTERFACE DPLB = plb_v34_0
1590 BUS_INTERFACE IPLB = plb_v34_0
1591 END
1592
1593 BEGIN opb_intc
1594 PARAMETER INSTANCE = opb_intc_0
1595 PARAMETER HW_VER = 1.00.c
1596 PARAMETER C_BASEADDR = 0xD1000FC0
1597 PARAMETER C_HIGHADDR = 0xD1000FDF
1598 BUS_INTERFACE SOPB = opb_v20_0
1599 END
1600
1601 BEGIN opb_uart16550
1602 PARAMETER INSTANCE = opb_uart16550_0
1603 PARAMETER HW_VER = 1.00.d
1604 PARAMETER C_BASEADDR = 0xa0000000
1605 PARAMETER C_HIGHADDR = 0xa0001FFF
1606 BUS_INTERFACE SOPB = opb_v20_0
1607 END
1608
1609 BEGIN plb_v34
1610 PARAMETER INSTANCE = plb_v34_0
1611 PARAMETER HW_VER = 1.02.a
1612 END
1613
1614 BEGIN plb_bram_if_cntlr
1615 PARAMETER INSTANCE = plb_bram_if_cntlr_0
1616 PARAMETER HW_VER = 1.00.b
1617 PARAMETER C_BASEADDR = 0xFFFF0000
1618 PARAMETER C_HIGHADDR = 0xFFFFFFFF
1619 BUS_INTERFACE SPLB = plb_v34_0
1620 END
1621
1622 BEGIN plb2opb_bridge
1623 PARAMETER INSTANCE = plb2opb_bridge_0
1624 PARAMETER HW_VER = 1.01.a
1625 PARAMETER C_RNG0_BASEADDR = 0x20000000
1626 PARAMETER C_RNG0_HIGHADDR = 0x3FFFFFFF
1627 PARAMETER C_RNG1_BASEADDR = 0x60000000
1628 PARAMETER C_RNG1_HIGHADDR = 0x7FFFFFFF
1629 PARAMETER C_RNG2_BASEADDR = 0x80000000
1630 PARAMETER C_RNG2_HIGHADDR = 0xBFFFFFFF
1631 PARAMETER C_RNG3_BASEADDR = 0xC0000000
1632 PARAMETER C_RNG3_HIGHADDR = 0xDFFFFFFF
1633 BUS_INTERFACE SPLB = plb_v34_0
1634 BUS_INTERFACE MOPB = opb_v20_0
1635 END
1636
1637 Gives this device tree (some properties removed for clarity):
1638
1639 plb@0 {
1640 #address-cells = <1>;
1641 #size-cells = <1>;
1642 compatible = "xlnx,plb-v34-1.02.a";
1643 device_type = "ibm,plb";
1644 ranges; // 1:1 translation
1645
1646 plb_bram_if_cntlr_0: bram@ffff0000 {
1647 reg = <ffff0000 10000>;
1648 };
1649
1650 opb@20000000 {
1651 #address-cells = <1>;
1652 #size-cells = <1>;
1653 ranges = <20000000 20000000 20000000
1654 60000000 60000000 20000000
1655 80000000 80000000 40000000
1656 c0000000 c0000000 20000000>;
1657
1658 opb_uart16550_0: serial@a0000000 {
1659 reg = <a0000000 2000>;
1660 };
1661
1662 opb_intc_0: interrupt-controller@d1000fc0 {
1663 reg = <d1000fc0 20>;
1664 };
1665 };
1666 };
1667
1668 That covers the general approach to binding xilinx IP cores into the
1669 device tree. The following are bindings for specific devices:
1670
1671 i) Xilinx ML300 Framebuffer
1672
1673 Simple framebuffer device from the ML300 reference design (also on the
1674 ML403 reference design as well as others).
1675
1676 Optional properties:
1677 - resolution = <xres yres> : pixel resolution of framebuffer. Some
1678 implementations use a different resolution.
1679 Default is <d#640 d#480>
1680 - virt-resolution = <xvirt yvirt> : Size of framebuffer in memory.
1681 Default is <d#1024 d#480>.
1682 - rotate-display (empty) : rotate display 180 degrees.
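
	A minimal sketch of a framebuffer node that sets all three optional
	properties, written with the placeholders of the general form above.
	The generic name 'display' and the 800x600 geometry are assumptions
	chosen only for illustration:

		(name): display@(base-address) {
			compatible = "xlnx,(ip-core-name)-(HW_VER)";
			reg = <(baseaddr) (size)>;
			resolution = <d#800 d#600>;		// visible pixels
			virt-resolution = <d#1024 d#600>;	// layout in memory
			rotate-display;				// empty property
		};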
1683
1684 ii) Xilinx SystemACE
1685
1686 The Xilinx SystemACE device is used to program FPGAs from an FPGA
1687 bitstream stored on a CF card. It can also be used as a generic CF
1688 interface device.
1689
1690 Optional properties:
1691 - 8-bit (empty) : Set this property for SystemACE in 8 bit mode
1692
1693 iii) Xilinx EMAC and Xilinx TEMAC
1694
1695 Xilinx Ethernet devices. In addition to general xilinx properties
1696 listed above, nodes for these devices should include a phy-handle
1697 property, and may include other common network device properties
1698 like local-mac-address.
1699
1700 iv) Xilinx Uartlite
1701
1702 Xilinx uartlite devices are simple fixed speed serial ports.
1703
1704 Required properties:
1705 - current-speed : Baud rate of uartlite
1706
1707 v) Xilinx hwicap
1708
1709 Xilinx hwicap devices provide access to the configuration logic
1710 of the FPGA through the Internal Configuration Access Port
1711 (ICAP). The ICAP enables partial reconfiguration of the FPGA,
1712 readback of the configuration information, and some control over
1713 'warm boots' of the FPGA fabric.
1714
1715 Required properties:
1716 - xlnx,family : The family of the FPGA, necessary since the
1717 capabilities of the underlying ICAP hardware
1718 differ between different families. May be
1719 'virtex2p', 'virtex4', or 'virtex5'.
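
	A minimal sketch of a hwicap node, written with the placeholders of
	the general form above. The generic name 'icap' and the choice of
	'virtex4' are assumptions made only for illustration:

		(name): icap@(base-address) {
			compatible = "xlnx,(ip-core-name)-(HW_VER)";
			reg = <(baseaddr) (size)>;
			xlnx,family = "virtex4";
		};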
1720
1721 vi) Xilinx Uart 16550
1722
1723 Xilinx UART 16550 devices are very similar to the NS16550 but with
1724 different register spacing and an offset from the base address.
1725
1726 Required properties:
1727 - clock-frequency : Frequency of the clock input
1728 - reg-offset : A value of 3 is required
1729 - reg-shift : A value of 2 is required
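
	As a sketch, the opb_uart16550_0 instance from the bus attachment
	example above might be described as follows. The 'reg' value follows
	from C_BASEADDR and C_HIGHADDR (0xa0001FFF - 0xa0000000 + 1 = 0x2000).
	The clock frequency, the interrupt specifier and the trailing
	"ns16550" compatible entry are assumptions made for illustration:

		opb_uart16550_0: serial@a0000000 {
			device_type = "serial";
			compatible = "xlnx,opb-uart16550-1.00.d", "ns16550";
			reg = <a0000000 2000>;
			interrupt-parent = <&opb_intc_0>;
			interrupts = <0 0>;			// hypothetical
			clock-frequency = <d#100000000>;	// hypothetical
			reg-offset = <3>;
			reg-shift = <2>;
		};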
1730
1731 e) USB EHCI controllers
1732
1733 Required properties:
1734 - compatible : should be "usb-ehci".
1735 - reg : should contain at least address and length of the standard EHCI
1736 register set for the device. Optional platform-dependent registers
1737 (debug-port or other) can also be specified here, but only after
1738 definition of standard EHCI registers.
1739 - interrupts : one EHCI interrupt should be described here.
1740 If device registers are implemented in big endian mode, the device
1741 node should have "big-endian-regs" property.
1742 If controller implementation operates with big endian descriptors,
1743 "big-endian-desc" property should be specified.
1744 If both big endian registers and descriptors are used by the controller
1745 implementation, "big-endian" property can be specified instead of having
1746 both "big-endian-regs" and "big-endian-desc".
1747
1748 Example (Sequoia 440EPx):
1749 ehci@e0000300 {
1750 compatible = "ibm,usb-ehci-440epx", "usb-ehci";
1751 interrupt-parent = <&UIC0>;
1752 interrupts = <1a 4>;
1753 reg = <0 e0000300 90 0 e0000390 70>;
1754 big-endian;
1755 };
1756
1757 f) MDIO on GPIOs
1758
1759 Currently defined compatibles:
1760 - virtual,mdio-gpio
1761
1762 MDC and MDIO lines connected to GPIO controllers are listed in the
1763 gpios property as described in section VIII.1 in the following order:
1764
1765 MDC, MDIO.
1766
1767 Example:
1768
1769 mdio {
1770 compatible = "virtual,mdio-gpio";
1771 #address-cells = <1>;
1772 #size-cells = <0>;
1773 gpios = <&qe_pio_a 11
1774 &qe_pio_c 6>;
1775 };
1776
1777 g) SPI (Serial Peripheral Interface) busses
1778
1779 SPI busses can be described with a node for the SPI master device
1780 and a set of child nodes for each SPI slave on the bus. For this
1781 discussion, it is assumed that the system's SPI controller is in
1782 SPI master mode. This binding does not describe SPI controllers
1783 in slave mode.
1784
1785 The SPI master node requires the following properties:
1786 - #address-cells - number of cells required to define a chip select
1787 address on the SPI bus.
1788 - #size-cells - should be zero.
1789 - compatible - name of SPI bus controller following generic names
1790 recommended practice.
1791 No other properties are required in the SPI bus node. It is assumed
1792 that a driver for an SPI bus device will understand that it is an SPI bus.
1793 However, the binding does not attempt to define the specific method for
1794 assigning chip select numbers. Since SPI chip select configuration is
1795 flexible and non-standardized, it is left out of this binding with the
1796 assumption that board specific platform code will be used to manage
1797 chip selects. Individual drivers can define additional properties to
1798 support describing the chip select layout.
1799
1800 SPI slave nodes must be children of the SPI master node and can
1801 contain the following properties.
1802 - reg - (required) chip select address of device.
1803 - compatible - (required) name of SPI device following generic names
1804 recommended practice
1805 - spi-max-frequency - (required) Maximum SPI clocking speed of device in Hz
1806 - spi-cpol - (optional) Empty property indicating device requires
1807 inverse clock polarity (CPOL) mode
1808 - spi-cpha - (optional) Empty property indicating device requires
1809 shifted clock phase (CPHA) mode
1810 - spi-cs-high - (optional) Empty property indicating device requires
1811 chip select active high
1812
1813 SPI example for an MPC5200 SPI bus:
1814 spi@f00 {
1815 #address-cells = <1>;
1816 #size-cells = <0>;
1817 compatible = "fsl,mpc5200b-spi","fsl,mpc5200-spi";
1818 reg = <0xf00 0x20>;
1819 interrupts = <2 13 0 2 14 0>;
1820 interrupt-parent = <&mpc5200_pic>;
1821
1822 ethernet-switch@0 {
1823 compatible = "micrel,ks8995m";
1824 spi-max-frequency = <1000000>;
1825 reg = <0>;
1826 };
1827
1828 codec@1 {
1829 compatible = "ti,tlv320aic26";
1830 spi-max-frequency = <100000>;
1831 reg = <1>;
1832 };
1833 };
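
	A slave that needs inverse clock polarity, shifted clock phase (SPI
	mode 3) and an active-high chip select adds the optional empty
	properties. The child node below is only a sketch; the node name,
	compatible string, chip select number and frequency are hypothetical:

		sensor@2 {
			compatible = "vendor,hypothetical-sensor";
			spi-max-frequency = <500000>;
			reg = <2>;
			spi-cpol;
			spi-cpha;
			spi-cs-high;
		};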
1834
1835VII - Marvell Discovery mv64[345]6x System Controller chips
1836===========================================================
1837
1838The Marvell mv64[345]60 series of system controller chips contain
1839many of the peripherals needed to implement a complete computer
1840system. In this section, we define device tree nodes to describe
1841the system controller chip itself and each of the peripherals
1842which it contains. Compatible string values for each node are
1843prefixed with the string "marvell,", for Marvell Technology Group Ltd.
1844
18451) The /system-controller node
1846
1847 This node is used to represent the system-controller and must be
1848 present when the system uses a system controller chip. The top-level
1849 system-controller node contains information that is global to all
1850 devices within the system controller chip. The node name begins
1851 with "system-controller" followed by the unit address, which is
1852 the base address of the memory-mapped register set for the system
1853 controller chip.
1854
1855 Required properties:
1856
1857 - ranges : Describes the translation of system controller addresses
1858 for memory mapped registers.
1859 - clock-frequency: Contains the main clock frequency for the system
1860 controller chip.
1861 - reg : This property defines the address and size of the
1862 memory-mapped registers contained within the system controller
1863 chip. The address specified in the "reg" property should match
1864 the unit address of the system-controller node.
1865 - #address-cells : Address representation for system controller
1866 devices. This field represents the number of cells needed to
1867 represent the address of the memory-mapped registers of devices
1868 within the system controller chip.
1869 - #size-cells : Size representation for the memory-mapped
1870 registers within the system controller chip.
1871 - #interrupt-cells : Defines the width of cells used to represent
1872 interrupts.
1873
1874 Optional properties:
1875
1876 - model : The specific model of the system controller chip, such
1877 as "mv64360", "mv64460", or "mv64560".
1878 - compatible : A string identifying the compatibility identifiers
1879 of the system controller chip.
1880
1881 The system-controller node contains child nodes for each system
1882 controller device that the platform uses. Nodes should not be created
1883 for devices which exist on the system controller chip but are not used
1884
1885 Example Marvell Discovery mv64360 system-controller node:
1886
1887 system-controller@f1000000 { /* Marvell Discovery mv64360 */
1888 #address-cells = <1>;
1889 #size-cells = <1>;
1890 model = "mv64360"; /* Default */
1891 compatible = "marvell,mv64360";
1892 clock-frequency = <133333333>;
1893 reg = <0xf1000000 0x10000>;
1894 virtual-reg = <0xf1000000>;
1895 ranges = <0x88000000 0x88000000 0x1000000 /* PCI 0 I/O Space */
1896 0x80000000 0x80000000 0x8000000 /* PCI 0 MEM Space */
1897 0xa0000000 0xa0000000 0x4000000 /* User FLASH */
1898 0x00000000 0xf1000000 0x0010000 /* Bridge's regs */
1899 0xf2000000 0xf2000000 0x0040000>;/* Integrated SRAM */
1900
1901 [ child node definitions... ]
1902 };
1903
19042) Child nodes of /system-controller
1905
1906 a) Marvell Discovery MDIO bus
1907
1908 The MDIO is a bus to which the PHY devices are connected. For each
1909 device that exists on this bus, a child node should be created. See
1910 the definition of the PHY node below for an example of how to define
1911 a PHY.
1912
1913 Required properties:
1914 - #address-cells : Should be <1>
1915 - #size-cells : Should be <0>
1916 - device_type : Should be "mdio"
1917 - compatible : Should be "marvell,mv64360-mdio"
1918
1919 Example:
1920
1921 mdio {
1922 #address-cells = <1>;
1923 #size-cells = <0>;
1924 device_type = "mdio";
1925 compatible = "marvell,mv64360-mdio";
1926
1927 ethernet-phy@0 {
1928 ......
1929 };
1930 };
1931
1932
1933 b) Marvell Discovery ethernet controller
1934
1935 The Discovery ethernet controller is described with two levels
1936 of nodes. The first level describes an ethernet silicon block
1937 and the second level describes up to 3 ethernet nodes within
1938 that block. The reason for the multiple levels is that the
1939 registers for the node are interleaved within a single set
1940 of registers. The "ethernet-block" level describes the
1941 shared register set, and the "ethernet" nodes describe ethernet
1942 port-specific properties.
1943
1944 Ethernet block node
1945
1946 Required properties:
1947 - #address-cells : <1>
1948 - #size-cells : <0>
1949 - compatible : "marvell,mv64360-eth-block"
1950 - reg : Offset and length of the register set for this block
1951
1952 Example Discovery Ethernet block node:
1953 ethernet-block@2000 {
1954 #address-cells = <1>;
1955 #size-cells = <0>;
1956 compatible = "marvell,mv64360-eth-block";
1957 reg = <0x2000 0x2000>;
1958 ethernet@0 {
1959 .......
1960 };
1961 };
1962
1963 Ethernet port node
1964
1965 Required properties:
1966 - device_type : Should be "network".
1967 - compatible : Should be "marvell,mv64360-eth".
1968 - reg : Should be <0>, <1>, or <2>, according to which registers
1969 within the silicon block the device uses.
1970 - interrupts : <a> where a is the interrupt number for the port.
1971 - interrupt-parent : the phandle for the interrupt controller
1972 that services interrupts for this device.
1973 - phy : the phandle for the PHY connected to this ethernet
1974 controller.
1975 - local-mac-address : 6 bytes, MAC address
1976
1977 Example Discovery Ethernet port node:
1978 ethernet@0 {
1979 device_type = "network";
1980 compatible = "marvell,mv64360-eth";
1981 reg = <0>;
1982 interrupts = <32>;
1983 interrupt-parent = <&PIC>;
1984 phy = <&PHY0>;
1985 local-mac-address = [ 00 00 00 00 00 00 ];
1986 };
1987
1988
1989
1990 c) Marvell Discovery PHY nodes
1991
1992 Required properties:
1993 - device_type : Should be "ethernet-phy"
1994 - interrupts : <a> where a is the interrupt number for this phy.
1995 - interrupt-parent : the phandle for the interrupt controller that
1996 services interrupts for this device.
1997 - reg : The ID number for the phy, usually a small integer
1998
1999 Example Discovery PHY node:
2000 ethernet-phy@1 {
2001 device_type = "ethernet-phy";
2002 compatible = "broadcom,bcm5421";
2003 interrupts = <76>; /* GPP 12 */
2004 interrupt-parent = <&PIC>;
2005 reg = <1>;
2006 };
2007
2008
2009 d) Marvell Discovery SDMA nodes
2010
2011 Represent DMA hardware associated with the MPSC (multiprotocol
2012 serial controllers).
2013
2014 Required properties:
2015 - compatible : "marvell,mv64360-sdma"
2016 - reg : Offset and length of the register set for this device
2017 - interrupts : <a> where a is the interrupt number for the DMA
2018 device.
2019 - interrupt-parent : the phandle for the interrupt controller
2020 that services interrupts for this device.
2021
2022 Example Discovery SDMA node:
2023 sdma@4000 {
2024 compatible = "marvell,mv64360-sdma";
2025 reg = <0x4000 0xc18>;
2026 virtual-reg = <0xf1004000>;
2027 interrupts = <36>;
2028 interrupt-parent = <&PIC>;
2029 };
2030
2031
2032 e) Marvell Discovery BRG nodes
2033
2034 Represent baud rate generator hardware associated with the MPSC
2035 (multiprotocol serial controllers).
2036
2037 Required properties:
2038 - compatible : "marvell,mv64360-brg"
2039 - reg : Offset and length of the register set for this device
2040 - clock-src : A value from 0 to 15 which selects the clock
2041 source for the baud rate generator. This value corresponds
2042 to the CLKS value in the BRGx configuration register. See
2043 the mv64x60 User's Manual.
2044 - clock-frequency : The frequency (in Hz) of the baud rate
2045 generator's input clock.
2046 - current-speed : The current speed setting (presumably by
2047 firmware) of the baud rate generator.
2048
2049 Example Discovery BRG node:
2050 brg@b200 {
2051 compatible = "marvell,mv64360-brg";
2052 reg = <0xb200 0x8>;
2053 clock-src = <8>;
2054 clock-frequency = <133333333>;
2055 current-speed = <9600>;
2056 };
2057
2058
2059 f) Marvell Discovery CUNIT nodes
2060
2061 Represent the Serial Communications Unit device hardware.
2062
2063 Required properties:
2064 - reg : Offset and length of the register set for this device
2065
2066 Example Discovery CUNIT node:
2067 cunit@f200 {
2068 reg = <0xf200 0x200>;
2069 };
2070
2071
2072 g) Marvell Discovery MPSCROUTING nodes
2073
2074 Represent the Discovery's MPSC routing hardware
2075
2076 Required properties:
2077 - reg : Offset and length of the register set for this device
2078
2079 Example Discovery MPSCROUTING node:
2080 mpscrouting@b400 {
2081 reg = <0xb400 0xc>;
2082 };
2083
2084
2085 h) Marvell Discovery MPSCINTR nodes
2086
2087 Represent the Discovery's MPSC DMA interrupt hardware registers
2088 (SDMA cause and mask registers).
2089
2090 Required properties:
2091 - reg : Offset and length of the register set for this device
2092
2093 Example Discovery MPSCINTR node:
2094 mpscintr@b800 {
2095 reg = <0xb800 0x100>;
2096 };
2097
2098
2099 i) Marvell Discovery MPSC nodes
2100
2101 Represent the Discovery's MPSC (Multiprotocol Serial Controller)
2102 serial port.
2103
2104 Required properties:
2105 - device_type : "serial"
2106 - compatible : "marvell,mv64360-mpsc"
2107 - reg : Offset and length of the register set for this device
2108 - sdma : the phandle for the SDMA node used by this port
2109 - brg : the phandle for the BRG node used by this port
2110 - cunit : the phandle for the CUNIT node used by this port
2111 - mpscrouting : the phandle for the MPSCROUTING node used by this port
2112 - mpscintr : the phandle for the MPSCINTR node used by this port
2113 - cell-index : the hardware index of this cell in the MPSC core
2114 - max_idle : value needed for MPSC CHR3 (Maximum Frame Length)
2115 register
2116 - interrupts : <a> where a is the interrupt number for the MPSC.
2117 - interrupt-parent : the phandle for the interrupt controller
2118 that services interrupts for this device.
2119
2120 Example Discovery MPSC node:
2121 mpsc@8000 {
2122 device_type = "serial";
2123 compatible = "marvell,mv64360-mpsc";
2124 reg = <0x8000 0x38>;
2125 virtual-reg = <0xf1008000>;
2126 sdma = <&SDMA0>;
2127 brg = <&BRG0>;
2128 cunit = <&CUNIT>;
2129 mpscrouting = <&MPSCROUTING>;
2130 mpscintr = <&MPSCINTR>;
2131 cell-index = <0>;
2132 max_idle = <40>;
2133 interrupts = <40>;
2134 interrupt-parent = <&PIC>;
2135 };
2136
2137
2138 j) Marvell Discovery Watch Dog Timer nodes
2139
2140 Represent the Discovery's watchdog timer hardware
2141
2142 Required properties:
2143 - compatible : "marvell,mv64360-wdt"
2144 - reg : Offset and length of the register set for this device
2145
2146 Example Discovery Watch Dog Timer node:
2147 wdt@b410 {
2148 compatible = "marvell,mv64360-wdt";
2149 reg = <0xb410 0x8>;
2150 };
2151
2152
2153 k) Marvell Discovery I2C nodes
2154
2155 Represent the Discovery's I2C hardware
2156
2157 Required properties:
2158 - device_type : "i2c"
2159 - compatible : "marvell,mv64360-i2c"
2160 - reg : Offset and length of the register set for this device
2161 - interrupts : <a> where a is the interrupt number for the I2C.
2162 - interrupt-parent : the phandle for the interrupt controller
2163 that services interrupts for this device.
2164
2165 Example Discovery I2C node:
 i2c@c000 {
2166 compatible = "marvell,mv64360-i2c";
2167 reg = <0xc000 0x20>;
2168 virtual-reg = <0xf100c000>;
2169 interrupts = <37>;
2170 interrupt-parent = <&PIC>;
2171 };
2172
2173
2174 l) Marvell Discovery PIC (Programmable Interrupt Controller) nodes
2175
2176 Represent the Discovery's PIC hardware
2177
2178 Required properties:
2179 - #interrupt-cells : <1>
2180 - #address-cells : <0>
2181 - compatible : "marvell,mv64360-pic"
2182 - reg : Offset and length of the register set for this device
2183 - interrupt-controller
2184
2185 Example Discovery PIC node:
2186 pic {
2187 #interrupt-cells = <1>;
2188 #address-cells = <0>;
2189 compatible = "marvell,mv64360-pic";
2190 reg = <0x0 0x88>;
2191 interrupt-controller;
2192 };
2193
2194
2195 m) Marvell Discovery MPP (Multipurpose Pins) multiplexing nodes
2196
2197 Represent the Discovery's MPP hardware
2198
2199 Required properties:
2200 - compatible : "marvell,mv64360-mpp"
2201 - reg : Offset and length of the register set for this device
2202
2203 Example Discovery MPP node:
2204 mpp@f000 {
2205 compatible = "marvell,mv64360-mpp";
2206 reg = <0xf000 0x10>;
2207 };
2208
2209
2210 n) Marvell Discovery GPP (General Purpose Pins) nodes
2211
2212 Represent the Discovery's GPP hardware
2213
2214 Required properties:
2215 - compatible : "marvell,mv64360-gpp"
2216 - reg : Offset and length of the register set for this device
2217
2218 Example Discovery GPP node:
2219 gpp@f100 {
2220 compatible = "marvell,mv64360-gpp";
2221 reg = <0xf100 0x20>;
2222 };
2223
2224
2225 o) Marvell Discovery PCI host bridge node
2226
2227 Represents the Discovery's PCI host bridge device. The properties
2228 for this node conform to Rev 2.1 of the PCI Bus Binding to IEEE
2229 1275-1994. A typical value for the compatible property is
2230 "marvell,mv64360-pci".
2231
2232 Example Discovery PCI host bridge node:
2233 pci@80000000 {
2234 #address-cells = <3>;
2235 #size-cells = <2>;
2236 #interrupt-cells = <1>;
2237 device_type = "pci";
2238 compatible = "marvell,mv64360-pci";
2239 reg = <0xcf8 0x8>;
2240 ranges = <0x01000000 0x0 0x0
2241 0x88000000 0x0 0x01000000
2242 0x02000000 0x0 0x80000000
2243 0x80000000 0x0 0x08000000>;
2244 bus-range = <0 255>;
2245 clock-frequency = <66000000>;
2246 interrupt-parent = <&PIC>;
2247 interrupt-map-mask = <0xf800 0x0 0x0 0x7>;
2248 interrupt-map = <
2249 /* IDSEL 0x0a */
2250 0x5000 0 0 1 &PIC 80
2251 0x5000 0 0 2 &PIC 81
2252 0x5000 0 0 3 &PIC 91
2253 0x5000 0 0 4 &PIC 93
2254
2255 /* IDSEL 0x0b */
2256 0x5800 0 0 1 &PIC 91
2257 0x5800 0 0 2 &PIC 93
2258 0x5800 0 0 3 &PIC 80
2259 0x5800 0 0 4 &PIC 81
2260
2261 /* IDSEL 0x0c */
2262 0x6000 0 0 1 &PIC 91
2263 0x6000 0 0 2 &PIC 93
2264 0x6000 0 0 3 &PIC 80
2265 0x6000 0 0 4 &PIC 81
2266
2267 /* IDSEL 0x0d */
2268 0x6800 0 0 1 &PIC 93
2269 0x6800 0 0 2 &PIC 80
2270 0x6800 0 0 3 &PIC 81
2271 0x6800 0 0 4 &PIC 91
2272 >;
2273 };
2274
2275
2276 p) Marvell Discovery CPU Error nodes
2277
2278 Represent the Discovery's CPU error handler device.
2279
2280 Required properties:
2281 - compatible : "marvell,mv64360-cpu-error"
2282 - reg : Offset and length of the register set for this device
2283 - interrupts : the interrupt number for this device
2284 - interrupt-parent : the phandle for the interrupt controller
2285 that services interrupts for this device.
2286
2287 Example Discovery CPU Error node:
2288 cpu-error@0070 {
2289 compatible = "marvell,mv64360-cpu-error";
2290 reg = <0x70 0x10 0x128 0x28>;
2291 interrupts = <3>;
2292 interrupt-parent = <&PIC>;
2293 };
2294
2295
2296 q) Marvell Discovery SRAM Controller nodes
2297
2298 Represent the Discovery's SRAM controller device.
2299
2300 Required properties:
2301 - compatible : "marvell,mv64360-sram-ctrl"
2302 - reg : Offset and length of the register set for this device
2303 - interrupts : the interrupt number for this device
2304 - interrupt-parent : the phandle for the interrupt controller
2305 that services interrupts for this device.
2306
2307 Example Discovery SRAM Controller node:
2308 sram-ctrl@0380 {
2309 compatible = "marvell,mv64360-sram-ctrl";
2310 reg = <0x380 0x80>;
2311 interrupts = <13>;
2312 interrupt-parent = <&PIC>;
2313 };
2314
2315
2316 r) Marvell Discovery PCI Error Handler nodes
2317
2318 Represent the Discovery's PCI error handler device.
2319
2320 Required properties:
2321 - compatible : "marvell,mv64360-pci-error"
2322 - reg : Offset and length of the register set for this device
2323 - interrupts : the interrupt number for this device
2324 - interrupt-parent : the phandle for the interrupt controller
2325 that services interrupts for this device.
2326
2327 Example Discovery PCI Error Handler node:
2328 pci-error@1d40 {
2329 compatible = "marvell,mv64360-pci-error";
2330 reg = <0x1d40 0x40 0xc28 0x4>;
2331 interrupts = <12>;
2332 interrupt-parent = <&PIC>;
2333 };
2334
2335
2336 s) Marvell Discovery Memory Controller nodes
2337
2338 Represent the Discovery's memory controller device.
2339
2340 Required properties:
2341 - compatible : "marvell,mv64360-mem-ctrl"
2342 - reg : Offset and length of the register set for this device
2343 - interrupts : the interrupt number for this device
2344 - interrupt-parent : the phandle for the interrupt controller
2345 that services interrupts for this device.
2346
2347 Example Discovery Memory Controller node:
2348 mem-ctrl@1400 {
2349 compatible = "marvell,mv64360-mem-ctrl";
2350 reg = <0x1400 0x60>;
2351 interrupts = <17>;
2352 interrupt-parent = <&PIC>;
2353 };
2354
2355
2356VIII - Specifying interrupt information for devices
2357=================================================== 1242===================================================
2358 1243
2359The device tree represents the busses and devices of a hardware 1244The device tree represents the busses and devices of a hardware
@@ -2439,56 +1324,7 @@ encodings listed below:
2439 2 = high to low edge sensitive type enabled 1324 2 = high to low edge sensitive type enabled
2440 3 = low to high edge sensitive type enabled 1325 3 = low to high edge sensitive type enabled
2441 1326
2442IX - Specifying GPIO information for devices 1327VIII - Specifying Device Power Management Information (sleep property)
2443============================================
2444
24451) gpios property
2446-----------------
2447
2448Nodes that makes use of GPIOs should define them using `gpios' property,
2449format of which is: <&gpio-controller1-phandle gpio1-specifier
2450 &gpio-controller2-phandle gpio2-specifier
2451 0 /* holes are permitted, means no GPIO 3 */
2452 &gpio-controller4-phandle gpio4-specifier
2453 ...>;
2454
2455Note that gpio-specifier length is controller dependent.
2456
2457gpio-specifier may encode: bank, pin position inside the bank,
2458whether pin is open-drain and whether pin is logically inverted.
2459
2460Example of the node using GPIOs:
2461
2462 node {
2463 gpios = <&qe_pio_e 18 0>;
2464 };
2465
2466In this example gpio-specifier is "18 0" and encodes GPIO pin number,
2467and empty GPIO flags as accepted by the "qe_pio_e" gpio-controller.
2468
24692) gpio-controller nodes
2470------------------------
2471
2472Every GPIO controller node must have #gpio-cells property defined,
2473this information will be used to translate gpio-specifiers.
2474
2475Example of two SOC GPIO banks defined as gpio-controller nodes:
2476
2477 qe_pio_a: gpio-controller@1400 {
2478 #gpio-cells = <2>;
2479 compatible = "fsl,qe-pario-bank-a", "fsl,qe-pario-bank";
2480 reg = <0x1400 0x18>;
2481 gpio-controller;
2482 };
2483
2484 qe_pio_e: gpio-controller@1460 {
2485 #gpio-cells = <2>;
2486 compatible = "fsl,qe-pario-bank-e", "fsl,qe-pario-bank";
2487 reg = <0x1460 0x18>;
2488 gpio-controller;
2489 };
2490
2491X - Specifying Device Power Management Information (sleep property)
2492=================================================================== 1328===================================================================
2493 1329
2494Devices on SOCs often have mechanisms for placing devices into low-power 1330Devices on SOCs often have mechanisms for placing devices into low-power
diff --git a/Documentation/powerpc/dts-bindings/4xx/emac.txt b/Documentation/powerpc/dts-bindings/4xx/emac.txt
new file mode 100644
index 00000000000..2161334a7ca
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/4xx/emac.txt
@@ -0,0 +1,148 @@
1 4xx/Axon EMAC ethernet nodes
2
3 The EMAC ethernet controller in IBM and AMCC 4xx chips, and also
4 the Axon bridge. To operate, this needs to interact with a
5 special McMAL DMA controller, and sometimes an RGMII or ZMII
6 interface. In addition to the nodes and properties described
7 below, the node for the OPB bus on which the EMAC sits must have a
8 correct clock-frequency property.
9
10 i) The EMAC node itself
11
12 Required properties:
13 - device_type : "network"
14
15 - compatible : compatible list, contains 2 entries, first is
16 "ibm,emac-CHIP" where CHIP is the host ASIC (440gx,
17 405gp, Axon) and second is either "ibm,emac" or
18 "ibm,emac4". For Axon, thus, we have: "ibm,emac-axon",
19 "ibm,emac4"
20 - interrupts : <interrupt mapping for EMAC IRQ and WOL IRQ>
21 - interrupt-parent : optional, if needed for interrupt mapping
22 - reg : <registers mapping>
23 - local-mac-address : 6 bytes, MAC address
24 - mal-device : phandle of the associated McMAL node
25 - mal-tx-channel : 1 cell, index of the tx channel on McMAL associated
26 with this EMAC
27 - mal-rx-channel : 1 cell, index of the rx channel on McMAL associated
28 with this EMAC
29 - cell-index : 1 cell, hardware index of the EMAC cell on a given
30 ASIC (typically 0x0 and 0x1 for EMAC0 and EMAC1 on
31 each Axon chip)
32 - max-frame-size : 1 cell, maximum frame size supported in bytes
33 - rx-fifo-size : 1 cell, Rx fifo size in bytes for 10 and 100 Mb/sec
34 operations.
35 For Axon, 2048
36 - tx-fifo-size : 1 cell, Tx fifo size in bytes for 10 and 100 Mb/sec
37 operations.
38 For Axon, 2048.
39 - fifo-entry-size : 1 cell, size of a fifo entry (used to calculate
40 thresholds).
41 For Axon, 0x00000010
42 - mal-burst-size : 1 cell, MAL burst size (used to calculate thresholds)
43 in bytes.
44 For Axon, 0x00000100 (I think ...)
45 - phy-mode : string, mode of operations of the PHY interface.
46 Supported values are: "mii", "rmii", "smii", "rgmii",
47 "tbi", "gmii", "rtbi", "sgmii".
48 For Axon on CAB, it is "rgmii"
49 - mdio-device : 1 cell, required iff using shared MDIO registers
50 (440EP). phandle of the EMAC to use to drive the
51 MDIO lines for the PHY used by this EMAC.
52 - zmii-device : 1 cell, required iff connected to a ZMII. phandle of
53 the ZMII device node
54 - zmii-channel : 1 cell, required iff connected to a ZMII. Which ZMII
55 channel or 0xffffffff if ZMII is only used for MDIO.
56 - rgmii-device : 1 cell, required iff connected to an RGMII. phandle
57 of the RGMII device node.
58 For Axon: phandle of plb5/plb4/opb/rgmii
59 - rgmii-channel : 1 cell, required iff connected to an RGMII. Which
60 RGMII channel is used by this EMAC.
61 For Axon: present, whatever value is appropriate for each
62 EMAC, that is the content of the current (bogus) "phy-port"
63 property.
64
65 Optional properties:
66 - phy-address : 1 cell, optional, MDIO address of the PHY. If absent,
67 a search is performed.
68 - phy-map : 1 cell, optional, bitmap of addresses to probe the PHY
69 for, used if phy-address is absent. bit 0x00000001 is
70 MDIO address 0.
71 For Axon it can be absent, though my current driver
72 doesn't handle phy-address yet so for now, keep
73 0x00ffffff in it.
74 - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
75 operations (if absent the value is the same as
76 rx-fifo-size). For Axon, either absent or 2048.
77 - tx-fifo-size-gige : 1 cell, Tx fifo size in bytes for 1000 Mb/sec
78 operations (if absent the value is the same as
79 tx-fifo-size). For Axon, either absent or 2048.
80 - tah-device : 1 cell, optional. If connected to a TAH engine for
81 offload, phandle of the TAH device node.
82 - tah-channel : 1 cell, optional. If appropriate, channel used on the
83 TAH engine.
84
85 Example:
86
87 EMAC0: ethernet@40000800 {
88 device_type = "network";
89 compatible = "ibm,emac-440gp", "ibm,emac";
90 interrupt-parent = <&UIC1>;
91 interrupts = <1c 4 1d 4>;
92 reg = <40000800 70>;
93 local-mac-address = [00 04 AC E3 1B 1E];
94 mal-device = <&MAL0>;
95 mal-tx-channel = <0 1>;
96 mal-rx-channel = <0>;
97 cell-index = <0>;
98 max-frame-size = <5dc>;
99 rx-fifo-size = <1000>;
100 tx-fifo-size = <800>;
101 phy-mode = "rmii";
102 phy-map = <00000001>;
103 zmii-device = <&ZMII0>;
104 zmii-channel = <0>;
105 };
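
For comparison, an EMAC on Axon connected through an RGMII bridge might
look like the sketch below. The compatible strings, FIFO sizes,
fifo-entry-size, mal-burst-size, phy-mode and phy-map follow the Axon
values quoted in the property list above. The unit address, reg,
interrupt cells, MAL channel numbers, MAC address, rgmii-channel and the
labels MPIC, MAL0 and RGMII0 are placeholders for illustration, not
values taken from real hardware. Cells are written in dts-v1 syntax with
explicit 0x prefixes, unlike the example above.

```dts
/* Sketch only: addresses, interrupts, channels and labels are hypothetical. */
EMAC0: ethernet@1000 {
	device_type = "network";
	compatible = "ibm,emac-axon", "ibm,emac4";
	interrupt-parent = <&MPIC>;		/* hypothetical label */
	interrupts = <0x10 0x4 0x11 0x4>;	/* EMAC IRQ, WOL IRQ (placeholders) */
	reg = <0x1000 0x70>;			/* placeholder */
	local-mac-address = [00 00 00 00 00 00];	/* placeholder */
	mal-device = <&MAL0>;
	mal-tx-channel = <0>;			/* placeholder */
	mal-rx-channel = <0>;			/* placeholder */
	cell-index = <0>;
	max-frame-size = <0x5dc>;		/* same value as the example above */
	rx-fifo-size = <2048>;
	tx-fifo-size = <2048>;
	fifo-entry-size = <0x10>;
	mal-burst-size = <0x100>;
	phy-mode = "rgmii";
	phy-map = <0x00ffffff>;
	rgmii-device = <&RGMII0>;		/* hypothetical label */
	rgmii-channel = <0>;			/* placeholder */
};
```

The number of cells per interrupt depends on the interrupt controller in
use, so the two-cell form shown here is an assumption.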
106
107 ii) McMAL node
108
109 Required properties:
110 - device_type : "dma-controller"
111 - compatible : compatible list, containing 2 entries, first is
112 "ibm,mcmal-CHIP" where CHIP is the host ASIC (like
113 emac) and the second is either "ibm,mcmal" or
114 "ibm,mcmal2".
115 For Axon, "ibm,mcmal-axon","ibm,mcmal2"
116 - interrupts : <interrupt mapping for the MAL interrupts sources:
117 5 sources: tx_eob, rx_eob, serr, txde, rxde>.
118 For Axon: This is _different_ from the current
119 firmware. We use the "delayed" interrupts for txeob
120 and rxeob. Thus we end up with mapping those 5 MPIC
121 interrupts, all level positive sensitive: 10, 11, 32,
122 33, 34 (in decimal)
123 - dcr-reg : < DCR registers range >
124 - dcr-parent : if needed for dcr-reg
125 - num-tx-chans : 1 cell, number of Tx channels
126 - num-rx-chans : 1 cell, number of Rx channels
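
A McMAL node matching the EMAC example in section i) could be sketched as
follows. Only the property names and the compatible pattern come from the
list above; the DCR range, channel counts, interrupt numbers and sense
cells are illustrative assumptions, as are the MAL0 and UIC1 labels.

```dts
/* Sketch only: all numeric values below are hypothetical. */
MAL0: mcmal {
	device_type = "dma-controller";
	compatible = "ibm,mcmal-440gp", "ibm,mcmal";
	dcr-reg = <0x180 0x62>;		/* placeholder DCR range */
	num-tx-chans = <4>;		/* placeholder */
	num-rx-chans = <4>;		/* placeholder */
	interrupt-parent = <&UIC1>;
	/* Five sources in order: tx_eob, rx_eob, serr, txde, rxde */
	interrupts = <0xa 0x4 0xb 0x4 0x20 0x4 0x21 0x4 0x22 0x4>;
};
```

dcr-parent is left out here on the assumption that the DCR bus can be
found without it; add it if the platform needs it.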
127
128 iii) ZMII node
129
130 Required properties:
131 - compatible : compatible list, containing 2 entries, first is
132 "ibm,zmii-CHIP" where CHIP is the host ASIC (like
133 EMAC) and the second is "ibm,zmii".
134 For Axon, there is no ZMII node.
135 - reg : <registers mapping>
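
A minimal ZMII node, as the zmii-device property of the EMAC example
would reference it, might look like this. The unit address, register
range and the ZMII0 label are assumptions made for the sketch.

```dts
/* Sketch only: address and size are hypothetical. */
ZMII0: emac-zmii@40000780 {
	compatible = "ibm,zmii-440gp", "ibm,zmii";
	reg = <0x40000780 0xc>;
};
```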
136
137 iv) RGMII node
138
139 Required properties:
140 - compatible : compatible list, containing 2 entries, first is
141 "ibm,rgmii-CHIP" where CHIP is the host ASIC (like
142 EMAC) and the second is "ibm,rgmii".
143 For Axon, "ibm,rgmii-axon","ibm,rgmii"
144 - reg : <registers mapping>
145 - revision : as provided by the RGMII new version register if
146 available.
147 For Axon: 0x0000012a
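
Putting the Axon values above together, an RGMII node might be written
as in the following sketch. The compatible list and the revision value
are the ones quoted for Axon; the node name, unit address, register
range and RGMII0 label are hypothetical.

```dts
/* Sketch only: node name, address and size are hypothetical. */
RGMII0: emac-rgmii@2000 {
	compatible = "ibm,rgmii-axon", "ibm,rgmii";
	reg = <0x2000 0x10>;
	revision = <0x0000012a>;
};
```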
148
diff --git a/Documentation/powerpc/dts-bindings/gpio/gpio.txt b/Documentation/powerpc/dts-bindings/gpio/gpio.txt
new file mode 100644
index 00000000000..edaa84d288a
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/gpio/gpio.txt
@@ -0,0 +1,50 @@
1Specifying GPIO information for devices
2============================================
3
41) gpios property
5-----------------
6
7Nodes that make use of GPIOs should define them using the `gpios' property,
8format of which is: <&gpio-controller1-phandle gpio1-specifier
9 &gpio-controller2-phandle gpio2-specifier
10 0 /* holes are permitted, means no GPIO 3 */
11 &gpio-controller4-phandle gpio4-specifier
12 ...>;
13
14Note that gpio-specifier length is controller dependent.
15
16gpio-specifier may encode: bank, pin position inside the bank,
17whether pin is open-drain and whether pin is logically inverted.
18
19Example of the node using GPIOs:
20
21 node {
22 gpios = <&qe_pio_e 18 0>;
23 };
24
25In this example gpio-specifier is "18 0" and encodes GPIO pin number,
26and empty GPIO flags as accepted by the "qe_pio_e" gpio-controller.
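
A node that uses several GPIOs, including a hole, can be sketched in the
same way. This assumes both controllers take two-cell specifiers (pin
number and flags), as the qe_pio_a and qe_pio_e banks in this document
do; the pin numbers and the node name are made up for illustration.

```dts
/* Sketch only: pin numbers are hypothetical. */
node {
	gpios = <&qe_pio_a 11 0		/* GPIO 1: pin 11, no flags */
		 0			/* hole: there is no GPIO 2 */
		 &qe_pio_e 18 0>;	/* GPIO 3: pin 18, no flags */
};
```

Because the hole is a single zero cell while each real entry is a phandle
followed by #gpio-cells cells, a consumer has to look up the controller's
#gpio-cells value to know where the next entry starts.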
27
282) gpio-controller nodes
29------------------------
30
31Every GPIO controller node must have a #gpio-cells property defined;
32this information is used to translate gpio-specifiers.
33
34Example of two SOC GPIO banks defined as gpio-controller nodes:
35
36 qe_pio_a: gpio-controller@1400 {
37 #gpio-cells = <2>;
38 compatible = "fsl,qe-pario-bank-a", "fsl,qe-pario-bank";
39 reg = <0x1400 0x18>;
40 gpio-controller;
41 };
42
43 qe_pio_e: gpio-controller@1460 {
44 #gpio-cells = <2>;
45 compatible = "fsl,qe-pario-bank-e", "fsl,qe-pario-bank";
46 reg = <0x1460 0x18>;
47 gpio-controller;
48 };
49
50
diff --git a/Documentation/powerpc/dts-bindings/gpio/led.txt b/Documentation/powerpc/dts-bindings/gpio/led.txt
index 4fe14deedc0..064db928c3c 100644
--- a/Documentation/powerpc/dts-bindings/gpio/led.txt
+++ b/Documentation/powerpc/dts-bindings/gpio/led.txt
@@ -16,10 +16,17 @@ LED sub-node properties:
 	string defining the trigger assigned to the LED. Current triggers are:
 	"backlight" - LED will act as a back-light, controlled by the framebuffer
 	system
-	"default-on" - LED will turn on
+	"default-on" - LED will turn on, but see "default-state" below
 	"heartbeat" - LED "double" flashes at a load average based rate
 	"ide-disk" - LED indicates disk activity
 	"timer" - LED flashes at a fixed, configurable rate
+- default-state: (optional) The initial state of the LED. Valid
+  values are "on", "off", and "keep". If the LED is already on or off
+  and the default-state property is set to the same value, then no
+  glitch should be produced where the LED momentarily turns off (or
+  on). The "keep" setting will keep the LED at whatever its current
+  state is, without producing a glitch. The default is off if this
+  property is not present.
 
 Examples:
 
@@ -30,14 +37,22 @@ leds {
 		gpios = <&mcu_pio 0 1>; /* Active low */
 		linux,default-trigger = "ide-disk";
 	};
+
+	fault {
+		gpios = <&mcu_pio 1 0>;
+		/* Keep LED on if BIOS detected hardware fault */
+		default-state = "keep";
+	};
 };
 
 run-control {
 	compatible = "gpio-leds";
 	red {
 		gpios = <&mpc8572 6 0>;
+		default-state = "off";
 	};
 	green {
 		gpios = <&mpc8572 7 0>;
+		default-state = "on";
 	};
 }
diff --git a/Documentation/powerpc/dts-bindings/gpio/mdio.txt b/Documentation/powerpc/dts-bindings/gpio/mdio.txt
new file mode 100644
index 00000000000..bc954952901
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/gpio/mdio.txt
@@ -0,0 +1,19 @@
1MDIO on GPIOs
2
3Currently defined compatibles:
4- virtual,mdio-gpio
5
6MDC and MDIO lines connected to GPIO controllers are listed in the
7gpios property as described in section VIII.1 in the following order:
8
9MDC, MDIO.
10
11Example:
12
13mdio {
14 compatible = "virtual,mdio-gpio";
15 #address-cells = <1>;
16 #size-cells = <0>;
17 gpios = <&qe_pio_a 11
18 &qe_pio_c 6>;
19};
diff --git a/Documentation/powerpc/dts-bindings/marvell.txt b/Documentation/powerpc/dts-bindings/marvell.txt
new file mode 100644
index 00000000000..3708a2fd474
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/marvell.txt
@@ -0,0 +1,521 @@
1Marvell Discovery mv64[345]6x System Controller chips
2===========================================================
3
4The Marvell mv64[345]60 series of system controller chips contain
5many of the peripherals needed to implement a complete computer
6system. In this section, we define device tree nodes to describe
7the system controller chip itself and each of the peripherals
8which it contains. Compatible string values for each node are
9prefixed with the string "marvell,", for Marvell Technology Group Ltd.
10
111) The /system-controller node
12
13 This node is used to represent the system-controller and must be
14 present when the system uses a system controller chip. The top-level
15 system-controller node contains information that is global to all
16 devices within the system controller chip. The node name begins
17 with "system-controller" followed by the unit address, which is
18 the base address of the memory-mapped register set for the system
19 controller chip.
20
21 Required properties:
22
23 - ranges : Describes the translation of system controller addresses
24 for memory mapped registers.
25 - clock-frequency: Contains the main clock frequency for the system
26 controller chip.
27 - reg : This property defines the address and size of the
28 memory-mapped registers contained within the system controller
29 chip. The address specified in the "reg" property should match
30 the unit address of the system-controller node.
31 - #address-cells : Address representation for system controller
32 devices. This field represents the number of cells needed to
33 represent the address of the memory-mapped registers of devices
34 within the system controller chip.
35 - #size-cells : Size representation for the memory-mapped
36 registers within the system controller chip.
37 - #interrupt-cells : Defines the width of cells used to represent
38 interrupts.
39
40 Optional properties:
41
42 - model : The specific model of the system controller chip. Such
43 as, "mv64360", "mv64460", or "mv64560".
44 - compatible : A string identifying the compatibility identifiers
45 of the system controller chip.
46
47 The system-controller node contains child nodes for each system
48 controller device that the platform uses. Nodes should not be created
49 for devices which exist on the system controller chip but are not used.
50
51 Example Marvell Discovery mv64360 system-controller node:
52
53 system-controller@f1000000 { /* Marvell Discovery mv64360 */
54 #address-cells = <1>;
55 #size-cells = <1>;
56 model = "mv64360"; /* Default */
57 compatible = "marvell,mv64360";
58 clock-frequency = <133333333>;
59 reg = <0xf1000000 0x10000>;
60 virtual-reg = <0xf1000000>;
61 ranges = <0x88000000 0x88000000 0x1000000 /* PCI 0 I/O Space */
62 0x80000000 0x80000000 0x8000000 /* PCI 0 MEM Space */
63 0xa0000000 0xa0000000 0x4000000 /* User FLASH */
64 0x00000000 0xf1000000 0x0010000 /* Bridge's regs */
65 0xf2000000 0xf2000000 0x0040000>;/* Integrated SRAM */
66
67 [ child node definitions... ]
68 };
69
702) Child nodes of /system-controller
71
72 a) Marvell Discovery MDIO bus
73
74 The MDIO is a bus to which the PHY devices are connected. For each
75 device that exists on this bus, a child node should be created. See
76 the definition of the PHY node below for an example of how to define
77 a PHY.
78
79 Required properties:
80 - #address-cells : Should be <1>
81 - #size-cells : Should be <0>
82 - device_type : Should be "mdio"
83 - compatible : Should be "marvell,mv64360-mdio"
84
85 Example:
86
87 mdio {
88 #address-cells = <1>;
89 #size-cells = <0>;
90 device_type = "mdio";
91 compatible = "marvell,mv64360-mdio";
92
93 ethernet-phy@0 {
94 ......
95 };
96 };
97
98
99 b) Marvell Discovery ethernet controller
100
101 The Discovery ethernet controller is described with two levels
102 of nodes. The first level describes an ethernet silicon block
103 and the second level describes up to 3 ethernet nodes within
104 that block. The reason for the multiple levels is that the
105 registers for the node are interleaved within a single set
106 of registers. The "ethernet-block" level describes the
107 shared register set, and the "ethernet" nodes describe ethernet
108 port-specific properties.
109
110 Ethernet block node
111
112 Required properties:
113 - #address-cells : <1>
114 - #size-cells : <0>
115 - compatible : "marvell,mv64360-eth-block"
116 - reg : Offset and length of the register set for this block
117
118 Example Discovery Ethernet block node:
119 ethernet-block@2000 {
120 #address-cells = <1>;
121 #size-cells = <0>;
122 compatible = "marvell,mv64360-eth-block";
123 reg = <0x2000 0x2000>;
124 ethernet@0 {
125 .......
126 };
127 };
128
129 Ethernet port node
130
131 Required properties:
132 - device_type : Should be "network".
133 - compatible : Should be "marvell,mv64360-eth".
134 - reg : Should be <0>, <1>, or <2>, according to which registers
135 within the silicon block the device uses.
136 - interrupts : <a> where a is the interrupt number for the port.
137 - interrupt-parent : the phandle for the interrupt controller
138 that services interrupts for this device.
139 - phy : the phandle for the PHY connected to this ethernet
140 controller.
141 - local-mac-address : 6 bytes, MAC address
142
143 Example Discovery Ethernet port node:
144 ethernet@0 {
145 device_type = "network";
146 compatible = "marvell,mv64360-eth";
147 reg = <0>;
148 interrupts = <32>;
149 interrupt-parent = <&PIC>;
150 phy = <&PHY0>;
151 local-mac-address = [ 00 00 00 00 00 00 ];
152 };
153
154
155
156 c) Marvell Discovery PHY nodes
157
158 Required properties:
159 - device_type : Should be "ethernet-phy"
160 - interrupts : <a> where a is the interrupt number for this phy.
161 - interrupt-parent : the phandle for the interrupt controller that
162 services interrupts for this device.
163 - reg : The ID number for the phy, usually a small integer
164
165 Example Discovery PHY node:
166 ethernet-phy@1 {
167 device_type = "ethernet-phy";
168 compatible = "broadcom,bcm5421";
169 interrupts = <76>; /* GPP 12 */
170 interrupt-parent = <&PIC>;
171 reg = <1>;
172 };
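
The three examples above fit together as sketched below: the PHY sits
under the MDIO bus node and carries a label, and the ethernet port node
points at it through its "phy" property. All property values are copied
from the examples above; placing the PHY0 label on ethernet-phy@1 and
assuming a PIC label on the interrupt controller node are choices made
for this sketch.

```dts
/* Sketch only: label placement is an assumption. */
mdio {
	#address-cells = <1>;
	#size-cells = <0>;
	device_type = "mdio";
	compatible = "marvell,mv64360-mdio";

	PHY0: ethernet-phy@1 {
		device_type = "ethernet-phy";
		compatible = "broadcom,bcm5421";
		interrupts = <76>;	/* GPP 12 */
		interrupt-parent = <&PIC>;
		reg = <1>;
	};
};

ethernet-block@2000 {
	#address-cells = <1>;
	#size-cells = <0>;
	compatible = "marvell,mv64360-eth-block";
	reg = <0x2000 0x2000>;

	ethernet@0 {
		device_type = "network";
		compatible = "marvell,mv64360-eth";
		reg = <0>;
		interrupts = <32>;
		interrupt-parent = <&PIC>;
		phy = <&PHY0>;		/* resolves to the node labelled above */
		local-mac-address = [ 00 00 00 00 00 00 ];
	};
};
```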
173
174
175 d) Marvell Discovery SDMA nodes
176
177 Represent DMA hardware associated with the MPSC (multiprotocol
178 serial controllers).
179
180 Required properties:
181 - compatible : "marvell,mv64360-sdma"
182 - reg : Offset and length of the register set for this device
183 - interrupts : <a> where a is the interrupt number for the DMA
184 device.
185 - interrupt-parent : the phandle for the interrupt controller
186 that services interrupts for this device.
187
188 Example Discovery SDMA node:
189 sdma@4000 {
190 compatible = "marvell,mv64360-sdma";
191 reg = <0x4000 0xc18>;
192 virtual-reg = <0xf1004000>;
193 interrupts = <36>;
194 interrupt-parent = <&PIC>;
195 };
196
197
198 e) Marvell Discovery BRG nodes
199
200 Represent baud rate generator hardware associated with the MPSC
201 (multiprotocol serial controllers).
202
203 Required properties:
204 - compatible : "marvell,mv64360-brg"
205 - reg : Offset and length of the register set for this device
206 - clock-src : A value from 0 to 15 which selects the clock
207 source for the baud rate generator. This value corresponds
208 to the CLKS value in the BRGx configuration register. See
209 the mv64x60 User's Manual.
210 - clock-frequency : The frequency (in Hz) of the baud rate
211 generator's input clock.
212 - current-speed : The current speed setting (presumably by
213 firmware) of the baud rate generator.
214
215 Example Discovery BRG node:
216 brg@b200 {
217 compatible = "marvell,mv64360-brg";
218 reg = <0xb200 0x8>;
219 clock-src = <8>;
220 clock-frequency = <133333333>;
221 current-speed = <9600>;
222 };
223
224
225 f) Marvell Discovery CUNIT nodes
226
227 Represent the Serial Communications Unit device hardware.
228
229 Required properties:
230 - reg : Offset and length of the register set for this device
231
232 Example Discovery CUNIT node:
233 cunit@f200 {
234 reg = <0xf200 0x200>;
235 };
236
237
238 g) Marvell Discovery MPSCROUTING nodes
239
240 Represent the Discovery's MPSC routing hardware
241
242 Required properties:
243 - reg : Offset and length of the register set for this device
244
245 Example Discovery MPSCROUTING node:
246 mpscrouting@b400 {
247 reg = <0xb400 0xc>;
248 };
249
250
251 h) Marvell Discovery MPSCINTR nodes
252
253 Represent the Discovery's MPSC DMA interrupt hardware registers
254 (SDMA cause and mask registers).
255
256 Required properties:
257 - reg : Offset and length of the register set for this device
258
259 Example Discovery MPSCINTR node:
260 mpscintr@b800 {
261 reg = <0xb800 0x100>;
262 };
263
264
265 i) Marvell Discovery MPSC nodes
266
267 Represent the Discovery's MPSC (Multiprotocol Serial Controller)
268 serial port.
269
270 Required properties:
271 - device_type : "serial"
272 - compatible : "marvell,mv64360-mpsc"
273 - reg : Offset and length of the register set for this device
274 - sdma : the phandle for the SDMA node used by this port
275 - brg : the phandle for the BRG node used by this port
276 - cunit : the phandle for the CUNIT node used by this port
277 - mpscrouting : the phandle for the MPSCROUTING node used by this port
278 - mpscintr : the phandle for the MPSCINTR node used by this port
279 - cell-index : the hardware index of this cell in the MPSC core
280 - max_idle : value needed for MPSC CHR3 (Maximum Frame Length)
281 register
282 - interrupts : <a> where a is the interrupt number for the MPSC.
283 - interrupt-parent : the phandle for the interrupt controller
284 that services interrupts for this device.
285
286 Example Discovery MPSC node:
287 mpsc@8000 {
288 device_type = "serial";
289 compatible = "marvell,mv64360-mpsc";
290 reg = <0x8000 0x38>;
291 virtual-reg = <0xf1008000>;
292 sdma = <&SDMA0>;
293 brg = <&BRG0>;
294 cunit = <&CUNIT>;
295 mpscrouting = <&MPSCROUTING>;
296 mpscintr = <&MPSCINTR>;
297 cell-index = <0>;
298 max_idle = <40>;
299 interrupts = <40>;
300 interrupt-parent = <&PIC>;
301 };
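
The phandles in the MPSC node refer to the helper nodes described in
sections d) through h). A sketch of how those nodes might be labelled so
that the references resolve is given below. The register values are the
ones from the earlier examples; the label names are inferred from the
references in the MPSC example, and the unit addresses are written to
match each node's reg property.

```dts
/* Sketch only: label names are inferred, not mandated by the binding. */
SDMA0: sdma@4000 {
	compatible = "marvell,mv64360-sdma";
	reg = <0x4000 0xc18>;
	virtual-reg = <0xf1004000>;
	interrupts = <36>;
	interrupt-parent = <&PIC>;
};

BRG0: brg@b200 {
	compatible = "marvell,mv64360-brg";
	reg = <0xb200 0x8>;
	clock-src = <8>;
	clock-frequency = <133333333>;
	current-speed = <9600>;
};

CUNIT: cunit@f200 {
	reg = <0xf200 0x200>;
};

MPSCROUTING: mpscrouting@b400 {
	reg = <0xb400 0xc>;
};

MPSCINTR: mpscintr@b800 {
	reg = <0xb800 0x100>;
};
```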
302
303
304 j) Marvell Discovery Watch Dog Timer nodes
305
306 Represent the Discovery's watchdog timer hardware
307
308 Required properties:
309 - compatible : "marvell,mv64360-wdt"
310 - reg : Offset and length of the register set for this device
311
312 Example Discovery Watch Dog Timer node:
313 wdt@b410 {
314 compatible = "marvell,mv64360-wdt";
315 reg = <0xb410 0x8>;
316 };
317
318
319 k) Marvell Discovery I2C nodes
320
321 Represent the Discovery's I2C hardware
322
323 Required properties:
324 - device_type : "i2c"
325 - compatible : "marvell,mv64360-i2c"
326 - reg : Offset and length of the register set for this device
327 - interrupts : <a> where a is the interrupt number for the I2C.
328 - interrupt-parent : the phandle for the interrupt controller
329 that services interrupts for this device.
330
331 Example Discovery I2C node:
 i2c@c000 {
332 compatible = "marvell,mv64360-i2c";
333 reg = <0xc000 0x20>;
334 virtual-reg = <0xf100c000>;
335 interrupts = <37>;
336 interrupt-parent = <&PIC>;
337 };
338
339
340 l) Marvell Discovery PIC (Programmable Interrupt Controller) nodes
341
342 Represent the Discovery's PIC hardware
343
344 Required properties:
345 - #interrupt-cells : <1>
346 - #address-cells : <0>
347 - compatible : "marvell,mv64360-pic"
348 - reg : Offset and length of the register set for this device
349 - interrupt-controller
350
351 Example Discovery PIC node:
352 pic {
353 #interrupt-cells = <1>;
354 #address-cells = <0>;
355 compatible = "marvell,mv64360-pic";
356 reg = <0x0 0x88>;
357 interrupt-controller;
358 };
359
360
361 m) Marvell Discovery MPP (Multipurpose Pins) multiplexing nodes
362
363 Represent the Discovery's MPP hardware
364
365 Required properties:
366 - compatible : "marvell,mv64360-mpp"
367 - reg : Offset and length of the register set for this device
368
369 Example Discovery MPP node:
370 mpp@f000 {
371 compatible = "marvell,mv64360-mpp";
372 reg = <0xf000 0x10>;
373 };
374
375
376 n) Marvell Discovery GPP (General Purpose Pins) nodes
377
378 Represent the Discovery's GPP hardware
379
380 Required properties:
381 - compatible : "marvell,mv64360-gpp"
382 - reg : Offset and length of the register set for this device
383
384 Example Discovery GPP node:
385 gpp@f100 {
386 compatible = "marvell,mv64360-gpp";
387 reg = <0xf100 0x20>;
388 };
389
390
391 o) Marvell Discovery PCI host bridge node
392
393 Represents the Discovery's PCI host bridge device. The properties
394 for this node conform to Rev 2.1 of the PCI Bus Binding to IEEE
395 1275-1994. A typical value for the compatible property is
396 "marvell,mv64360-pci".
397
398 Example Discovery PCI host bridge node:
399 pci@80000000 {
400 #address-cells = <3>;
401 #size-cells = <2>;
402 #interrupt-cells = <1>;
403 device_type = "pci";
404 compatible = "marvell,mv64360-pci";
405 reg = <0xcf8 0x8>;
406 ranges = <0x01000000 0x0 0x0
407 0x88000000 0x0 0x01000000
408 0x02000000 0x0 0x80000000
409 0x80000000 0x0 0x08000000>;
410 bus-range = <0 255>;
411 clock-frequency = <66000000>;
412 interrupt-parent = <&PIC>;
413 interrupt-map-mask = <0xf800 0x0 0x0 0x7>;
414 interrupt-map = <
415 /* IDSEL 0x0a */
416 0x5000 0 0 1 &PIC 80
417 0x5000 0 0 2 &PIC 81
418 0x5000 0 0 3 &PIC 91
419 0x5000 0 0 4 &PIC 93
420
421 /* IDSEL 0x0b */
422 0x5800 0 0 1 &PIC 91
423 0x5800 0 0 2 &PIC 93
424 0x5800 0 0 3 &PIC 80
425 0x5800 0 0 4 &PIC 81
426
427 /* IDSEL 0x0c */
428 0x6000 0 0 1 &PIC 91
429 0x6000 0 0 2 &PIC 93
430 0x6000 0 0 3 &PIC 80
431 0x6000 0 0 4 &PIC 81
432
433 /* IDSEL 0x0d */
434 0x6800 0 0 1 &PIC 93
435 0x6800 0 0 2 &PIC 80
436 0x6800 0 0 3 &PIC 81
437 0x6800 0 0 4 &PIC 91
438 >;
439 };
440
441
442 p) Marvell Discovery CPU Error nodes
443
444 Represent the Discovery's CPU error handler device.
445
446 Required properties:
447 - compatible : "marvell,mv64360-cpu-error"
448 - reg : Offset and length of the register set for this device
449 - interrupts : the interrupt number for this device
450 - interrupt-parent : the phandle for the interrupt controller
451 that services interrupts for this device.
452
453 Example Discovery CPU Error node:
454 cpu-error@0070 {
455 compatible = "marvell,mv64360-cpu-error";
456 reg = <0x70 0x10 0x128 0x28>;
457 interrupts = <3>;
458 interrupt-parent = <&PIC>;
459 };
460
461
462 q) Marvell Discovery SRAM Controller nodes
463
464 Represent the Discovery's SRAM controller device.
465
466 Required properties:
467 - compatible : "marvell,mv64360-sram-ctrl"
468 - reg : Offset and length of the register set for this device
469 - interrupts : the interrupt number for this device
470 - interrupt-parent : the phandle for the interrupt controller
471 that services interrupts for this device.
472
473 Example Discovery SRAM Controller node:
474 sram-ctrl@0380 {
475 compatible = "marvell,mv64360-sram-ctrl";
476 reg = <0x380 0x80>;
477 interrupts = <13>;
478 interrupt-parent = <&PIC>;
479 };
480
481
482 r) Marvell Discovery PCI Error Handler nodes
483
484 Represent the Discovery's PCI error handler device.
485
486 Required properties:
487 - compatible : "marvell,mv64360-pci-error"
488 - reg : Offset and length of the register set for this device
489 - interrupts : the interrupt number for this device
490 - interrupt-parent : the phandle for the interrupt controller
491 that services interrupts for this device.
492
493 Example Discovery PCI Error Handler node:
494 pci-error@1d40 {
495 compatible = "marvell,mv64360-pci-error";
496 reg = <0x1d40 0x40 0xc28 0x4>;
497 interrupts = <12>;
498 interrupt-parent = <&PIC>;
499 };
500
501
502 s) Marvell Discovery Memory Controller nodes
503
504 Represent the Discovery's memory controller device.
505
506 Required properties:
507 - compatible : "marvell,mv64360-mem-ctrl"
508 - reg : Offset and length of the register set for this device
509 - interrupts : the interrupt number for this device
510 - interrupt-parent : the phandle for the interrupt controller
511 that services interrupts for this device.
512
513 Example Discovery Memory Controller node:
514 mem-ctrl@1400 {
515 compatible = "marvell,mv64360-mem-ctrl";
516 reg = <0x1400 0x60>;
517 interrupts = <17>;
518 interrupt-parent = <&PIC>;
519 };
520
521
diff --git a/Documentation/powerpc/dts-bindings/phy.txt b/Documentation/powerpc/dts-bindings/phy.txt
new file mode 100644
index 00000000000..bb8c742eb8c
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/phy.txt
@@ -0,0 +1,25 @@
1PHY nodes
2
3Required properties:
4
5 - device_type : Should be "ethernet-phy"
6 - interrupts : <a b> where a is the interrupt number and b is a
7 field that represents an encoding of the sense and level
8 information for the interrupt. This should be encoded based on
9 the information in section 2) depending on the type of interrupt
10 controller you have.
11 - interrupt-parent : the phandle for the interrupt controller that
12 services interrupts for this device.
13 - reg : The ID number for the phy, usually a small integer
14 - linux,phandle : phandle for this node; likely referenced by an
15 ethernet controller node.
16
17Example:
18
19ethernet-phy@0 {
20 linux,phandle = <2452000>;
21 interrupt-parent = <40000>;
22 interrupts = <35 1>;
23 reg = <0>;
24 device_type = "ethernet-phy";
25};
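
When the tree is written as dts source rather than generated by firmware,
the same node can be expressed with labels instead of literal phandle
numbers; dtc assigns a phandle to every node whose label is referenced.
In the sketch below the PHY0 and MPIC labels are hypothetical, and the
interrupt number is written as 0x35 on the assumption that the bare
numbers in the example above are hexadecimal.

```dts
/* Sketch only: labels are hypothetical. */
PHY0: ethernet-phy@0 {
	interrupt-parent = <&MPIC>;	/* label on the interrupt controller node */
	interrupts = <0x35 1>;
	reg = <0>;
	device_type = "ethernet-phy";
};
```

An ethernet controller node can then refer to this PHY as <&PHY0> in
whichever property its own binding defines for that purpose.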
diff --git a/Documentation/powerpc/dts-bindings/spi-bus.txt b/Documentation/powerpc/dts-bindings/spi-bus.txt
new file mode 100644
index 00000000000..e782add2e45
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/spi-bus.txt
@@ -0,0 +1,57 @@
1SPI (Serial Peripheral Interface) busses
2
3SPI busses can be described with a node for the SPI master device
4and a set of child nodes for each SPI slave on the bus. For this
5discussion, it is assumed that the system's SPI controller is in
6SPI master mode. This binding does not describe SPI controllers
7in slave mode.
8
9The SPI master node requires the following properties:
10- #address-cells - number of cells required to define a chip select
11 address on the SPI bus.
12- #size-cells - should be zero.
13- compatible - name of SPI bus controller following generic names
14 recommended practice.
15No other properties are required in the SPI bus node. It is assumed
16that a driver for an SPI bus device will understand that it is an SPI bus.
17However, the binding does not attempt to define the specific method for
18assigning chip select numbers. Since SPI chip select configuration is
19flexible and non-standardized, it is left out of this binding with the
20assumption that board specific platform code will be used to manage
21chip selects. Individual drivers can define additional properties to
22support describing the chip select layout.
23
24SPI slave nodes must be children of the SPI master node and can
25contain the following properties.
26- reg - (required) chip select address of device.
27- compatible - (required) name of SPI device following generic names
28 recommended practice
29- spi-max-frequency - (required) Maximum SPI clocking speed of device in Hz
30- spi-cpol - (optional) Empty property indicating device requires
31 inverse clock polarity (CPOL) mode
32- spi-cpha - (optional) Empty property indicating device requires
33 shifted clock phase (CPHA) mode
34- spi-cs-high - (optional) Empty property indicating device requires
35 chip select active high
36
37SPI example for an MPC5200 SPI bus:
38 spi@f00 {
39 #address-cells = <1>;
40 #size-cells = <0>;
41 compatible = "fsl,mpc5200b-spi","fsl,mpc5200-spi";
42 reg = <0xf00 0x20>;
43 interrupts = <2 13 0 2 14 0>;
44 interrupt-parent = <&mpc5200_pic>;
45
46 ethernet-switch@0 {
47 compatible = "micrel,ks8995m";
48 spi-max-frequency = <1000000>;
49 reg = <0>;
50 };
51
52 codec@1 {
53 compatible = "ti,tlv320aic26";
54 spi-max-frequency = <100000>;
55 reg = <1>;
56 };
57 };
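
The example above does not use the optional mode properties. A slave
that needs inverse clock polarity, shifted clock phase and an active-high
chip select might be described as in the following sketch. The device
name, compatible string, chip select number and frequency are
hypothetical.

```dts
/* Sketch only: the slave device is hypothetical. */
spi@f00 {
	#address-cells = <1>;
	#size-cells = <0>;
	compatible = "fsl,mpc5200b-spi","fsl,mpc5200-spi";
	reg = <0xf00 0x20>;

	sensor@2 {
		compatible = "vendor,hypothetical-sensor";
		spi-max-frequency = <500000>;
		reg = <2>;
		spi-cpol;	/* empty property: inverse clock polarity */
		spi-cpha;	/* empty property: shifted clock phase */
		spi-cs-high;	/* empty property: chip select is active high */
	};
};
```

All three mode properties are empty: their presence alone selects the
behaviour, and leaving one out selects the default.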
diff --git a/Documentation/powerpc/dts-bindings/usb-ehci.txt b/Documentation/powerpc/dts-bindings/usb-ehci.txt
new file mode 100644
index 00000000000..fa18612f757
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/usb-ehci.txt
@@ -0,0 +1,25 @@
1USB EHCI controllers
2
3Required properties:
4 - compatible : should be "usb-ehci".
5 - reg : should contain at least address and length of the standard EHCI
6 register set for the device. Optional platform-dependent registers
7 (debug-port or other) can be also specified here, but only after
8 definition of standard EHCI registers.
9 - interrupts : one EHCI interrupt should be described here.
10If device registers are implemented in big endian mode, the device
11node should have "big-endian-regs" property.
12If controller implementation operates with big endian descriptors,
13"big-endian-desc" property should be specified.
14If both big endian registers and descriptors are used by the controller
15implementation, "big-endian" property can be specified instead of having
16both "big-endian-regs" and "big-endian-desc".
17
18Example (Sequoia 440EPx):
19 ehci@e0000300 {
20 compatible = "ibm,usb-ehci-440epx", "usb-ehci";
21 interrupt-parent = <&UIC0>;
22 interrupts = <1a 4>;
23 reg = <0 e0000300 90 0 e0000390 70>;
24 big-endian;
25 };
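
If only the registers are big endian while the descriptors are not, the
two properties are specified independently, as in the sketch below. The
node address, interrupt cells and the UIC0 label are placeholders, and
cells are written in dts-v1 syntax with 0x prefixes.

```dts
/* Sketch only: address and interrupt values are hypothetical. */
ehci@d0000300 {
	compatible = "usb-ehci";
	interrupt-parent = <&UIC0>;
	interrupts = <0x1a 0x4>;
	reg = <0xd0000300 0x90>;
	big-endian-regs;	/* registers are big endian */
	/* big-endian-desc is absent, so descriptors are not big endian */
};
```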
diff --git a/Documentation/powerpc/dts-bindings/xilinx.txt b/Documentation/powerpc/dts-bindings/xilinx.txt
new file mode 100644
index 00000000000..80339fe4300
--- /dev/null
+++ b/Documentation/powerpc/dts-bindings/xilinx.txt
@@ -0,0 +1,295 @@
1 d) Xilinx IP cores
2
3 The Xilinx EDK toolchain ships with a set of IP cores (devices) for use
4 in Xilinx Spartan and Virtex FPGAs. The devices cover the whole range
5 of standard device types (network, serial, etc.) and miscellaneous
6 devices (gpio, LCD, spi, etc). Also, since these devices are
7 implemented within the fpga fabric every instance of the device can be
8 synthesised with different options that change the behaviour.
9
10 Each IP-core has a set of parameters which the FPGA designer can use to
11 control how the core is synthesized. Historically, the EDK tool would
12 extract the device parameters relevant to device drivers and copy them
13 into an 'xparameters.h' in the form of #define symbols. This tells the
14 device drivers how the IP cores are configured, but it requires the kernel
15 to be recompiled every time the FPGA bitstream is resynthesized.
16
17 The new approach is to export the parameters into the device tree and
18 generate a new device tree each time the FPGA bitstream changes. The
19 parameters which used to be exported as #defines will now become
20 properties of the device node. In general, device nodes for IP-cores
21 will take the following form:
22
23 (name): (generic-name)@(base-address) {
24 compatible = "xlnx,(ip-core-name)-(HW_VER)"
25 [, (list of compatible devices), ...];
26 reg = <(baseaddr) (size)>;
27 interrupt-parent = <&interrupt-controller-phandle>;
28 interrupts = < ... >;
29 xlnx,(parameter1) = "(string-value)";
30 xlnx,(parameter2) = <(int-value)>;
31 };
32
33 (generic-name): an open firmware-style name that describes the
34 generic class of device. Preferably, this is one word, such
35 as 'serial' or 'ethernet'.
36 (ip-core-name): the name of the ip block (given after the BEGIN
37 directive in system.mhs). Should be in lowercase
38 and all underscores '_' converted to dashes '-'.
39 (name): is derived from the "PARAMETER INSTANCE" value.
40 (parameter#): C_* parameters from system.mhs. The C_ prefix is
41 dropped from the parameter name, the name is converted
42 to lowercase and all underscore '_' characters are
43 converted to dashes '-'.
44 (baseaddr): the baseaddr parameter value (often named C_BASEADDR).
45 (HW_VER): from the HW_VER parameter.
46 (size): the address range size (often C_HIGHADDR - C_BASEADDR + 1).
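
The (parameter#) naming rule above can be sketched in C as follows. This is
an illustration only, not code from the EDK or the kernel; the function name
xlnx_prop_name is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a system.mhs parameter name such as "C_DATA_BITS" into the
 * device tree property name "xlnx,data-bits":
 *   - drop the leading "C_"
 *   - convert to lowercase
 *   - convert '_' to '-'
 * Returns 0 on success, -1 if 'out' is too small.
 */
int xlnx_prop_name(const char *param, char *out, size_t outlen)
{
	static const char prefix[] = "xlnx,";
	size_t n = strlen(prefix);
	size_t i;

	if (param[0] == 'C' && param[1] == '_')
		param += 2;

	if (n + strlen(param) + 1 > outlen)
		return -1;

	memcpy(out, prefix, n);
	for (i = 0; param[i] != '\0'; i++) {
		unsigned char c = (unsigned char)param[i];

		out[n++] = (c == '_') ? '-' : (char)tolower(c);
	}
	out[n] = '\0';
	return 0;
}
```

With this sketch, "C_ODD_PARITY" becomes "xlnx,odd-parity". Parameters that
map onto standard properties (for example a baud rate becoming
current-speed) are outside the scope of the rule and of this sketch.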
47
48 Typically, the compatible list will include the exact IP core version
49 followed by an older IP core version which implements the same
50 interface or any other device with the same interface.
51
52 'reg', 'interrupt-parent' and 'interrupts' are all optional properties.
53
54 For example, the following block from system.mhs:
55
56 BEGIN opb_uartlite
57 PARAMETER INSTANCE = opb_uartlite_0
58 PARAMETER HW_VER = 1.00.b
59 PARAMETER C_BAUDRATE = 115200
60 PARAMETER C_DATA_BITS = 8
61 PARAMETER C_ODD_PARITY = 0
62 PARAMETER C_USE_PARITY = 0
63 PARAMETER C_CLK_FREQ = 50000000
64 PARAMETER C_BASEADDR = 0xEC100000
65 PARAMETER C_HIGHADDR = 0xEC10FFFF
66 BUS_INTERFACE SOPB = opb_7
67 PORT OPB_Clk = CLK_50MHz
68 PORT Interrupt = opb_uartlite_0_Interrupt
69 PORT RX = opb_uartlite_0_RX
70 PORT TX = opb_uartlite_0_TX
71 PORT OPB_Rst = sys_bus_reset_0
72 END
73
74 becomes the following device tree node:
75
76 opb_uartlite_0: serial@ec100000 {
77 device_type = "serial";
78 compatible = "xlnx,opb-uartlite-1.00.b";
79 reg = <ec100000 10000>;
80 interrupt-parent = <&opb_intc_0>;
81 interrupts = <1 0>; // got this from the opb_intc parameters
82 current-speed = <d#115200>; // standard serial device prop
83 clock-frequency = <d#50000000>; // standard serial device prop
84 xlnx,data-bits = <8>;
85 xlnx,odd-parity = <0>;
86 xlnx,use-parity = <0>;
87 };
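
The size cell of 'reg' in this node follows from the (size) rule given
earlier. A minimal sketch of the arithmetic, with a hypothetical helper
name:

```c
#include <stdint.h>

/* (size) = C_HIGHADDR - C_BASEADDR + 1, for 32-bit addresses. */
uint32_t xlnx_reg_size(uint32_t baseaddr, uint32_t highaddr)
{
	return highaddr - baseaddr + 1u;
}
```

For the uartlite above, xlnx_reg_size(0xEC100000, 0xEC10FFFF) yields 0x10000,
which matches reg = <ec100000 10000>.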
88
89 Some IP cores actually implement 2 or more logical devices. In
90 this case, the device should still describe the whole IP core with
91 a single node and add a child node for each logical device. The
92 ranges property can be used to translate from parent IP-core to the
93 registers of each device. In addition, the parent node should be
94 compatible with the bus type 'xlnx,compound', and should contain
95 #address-cells and #size-cells, as with any other bus. (Note: this
96 makes the assumption that both logical devices have the same bus
97 binding. If this is not true, then separate nodes should be used
98 for each logical device). The 'cell-index' property can be used to
99 enumerate logical devices within an IP core. For example, the
100 following is the system.mhs entry for the dual ps2 controller found
101 on the ml403 reference design.
102
103 BEGIN opb_ps2_dual_ref
104 PARAMETER INSTANCE = opb_ps2_dual_ref_0
105 PARAMETER HW_VER = 1.00.a
106 PARAMETER C_BASEADDR = 0xA9000000
107 PARAMETER C_HIGHADDR = 0xA9001FFF
108 BUS_INTERFACE SOPB = opb_v20_0
109 PORT Sys_Intr1 = ps2_1_intr
110 PORT Sys_Intr2 = ps2_2_intr
111 PORT Clkin1 = ps2_clk_rx_1
112 PORT Clkin2 = ps2_clk_rx_2
113 PORT Clkpd1 = ps2_clk_tx_1
114 PORT Clkpd2 = ps2_clk_tx_2
115 PORT Rx1 = ps2_d_rx_1
116 PORT Rx2 = ps2_d_rx_2
117 PORT Txpd1 = ps2_d_tx_1
118 PORT Txpd2 = ps2_d_tx_2
119 END
120
121 It would result in the following device tree nodes:
122
123 opb_ps2_dual_ref_0: opb-ps2-dual-ref@a9000000 {
124 #address-cells = <1>;
125 #size-cells = <1>;
126 compatible = "xlnx,compound";
127 ranges = <0 a9000000 2000>;
128 // If this device had extra parameters, then they would
129 // go here.
130 ps2@0 {
131 compatible = "xlnx,opb-ps2-dual-ref-1.00.a";
132 reg = <0 40>;
133 interrupt-parent = <&opb_intc_0>;
134 interrupts = <3 0>;
135 cell-index = <0>;
136 };
137 ps2@1000 {
138 compatible = "xlnx,opb-ps2-dual-ref-1.00.a";
139 reg = <1000 40>;
140 interrupt-parent = <&opb_intc_0>;
141 interrupts = <3 0>;
142 cell-index = <1>;
143 };
144 };
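
How the 'ranges' property maps a child 'reg' address to a parent bus address
can be sketched as below. This assumes a single ranges entry with one
address cell and one size cell, as in the node above; the type and function
names are hypothetical and this is not the kernel's translation code.

```c
#include <stdbool.h>
#include <stdint.h>

/* One entry of a ranges property: <child-base parent-base size>. */
struct dt_range {
	uint32_t child_base;
	uint32_t parent_base;
	uint32_t size;
};

/*
 * Translate a child bus address to a parent bus address.
 * Returns false if the address is not covered by the range.
 */
bool dt_translate(const struct dt_range *r, uint32_t child_addr,
		  uint32_t *parent_addr)
{
	if (child_addr < r->child_base ||
	    child_addr - r->child_base >= r->size)
		return false;

	*parent_addr = r->parent_base + (child_addr - r->child_base);
	return true;
}
```

With ranges = <0 a9000000 2000>, the child address 1000 of the second ps2
node translates to a9001000, while 2000 falls outside the range.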
145
146 Also, the system.mhs file defines bus attachments from the processor
147 to the devices. The device tree structure should reflect the bus
148 attachments. Again an example; this system.mhs fragment:
149
150 BEGIN ppc405_virtex4
151 PARAMETER INSTANCE = ppc405_0
152 PARAMETER HW_VER = 1.01.a
153 BUS_INTERFACE DPLB = plb_v34_0
154 BUS_INTERFACE IPLB = plb_v34_0
155 END
156
157 BEGIN opb_intc
158 PARAMETER INSTANCE = opb_intc_0
159 PARAMETER HW_VER = 1.00.c
160 PARAMETER C_BASEADDR = 0xD1000FC0
161 PARAMETER C_HIGHADDR = 0xD1000FDF
162 BUS_INTERFACE SOPB = opb_v20_0
163 END
164
165 BEGIN opb_uart16550
166 PARAMETER INSTANCE = opb_uart16550_0
167 PARAMETER HW_VER = 1.00.d
168 PARAMETER C_BASEADDR = 0xa0000000
169 PARAMETER C_HIGHADDR = 0xa0001FFF
170 BUS_INTERFACE SOPB = opb_v20_0
171 END
172
173 BEGIN plb_v34
174 PARAMETER INSTANCE = plb_v34_0
175 PARAMETER HW_VER = 1.02.a
176 END
177
178 BEGIN plb_bram_if_cntlr
179 PARAMETER INSTANCE = plb_bram_if_cntlr_0
180 PARAMETER HW_VER = 1.00.b
181 PARAMETER C_BASEADDR = 0xFFFF0000
182 PARAMETER C_HIGHADDR = 0xFFFFFFFF
183 BUS_INTERFACE SPLB = plb_v34_0
184 END
185
186 BEGIN plb2opb_bridge
187 PARAMETER INSTANCE = plb2opb_bridge_0
188 PARAMETER HW_VER = 1.01.a
189 PARAMETER C_RNG0_BASEADDR = 0x20000000
190 PARAMETER C_RNG0_HIGHADDR = 0x3FFFFFFF
191 PARAMETER C_RNG1_BASEADDR = 0x60000000
192 PARAMETER C_RNG1_HIGHADDR = 0x7FFFFFFF
193 PARAMETER C_RNG2_BASEADDR = 0x80000000
194 PARAMETER C_RNG2_HIGHADDR = 0xBFFFFFFF
195 PARAMETER C_RNG3_BASEADDR = 0xC0000000
196 PARAMETER C_RNG3_HIGHADDR = 0xDFFFFFFF
197 BUS_INTERFACE SPLB = plb_v34_0
198 BUS_INTERFACE MOPB = opb_v20_0
199 END
200
201 Gives this device tree (some properties removed for clarity):
202
203 plb@0 {
204 #address-cells = <1>;
205 #size-cells = <1>;
206 compatible = "xlnx,plb-v34-1.02.a";
207 device_type = "ibm,plb";
208 ranges; // 1:1 translation
209
210 plb_bram_if_cntlr_0: bram@ffff0000 {
211 reg = <ffff0000 10000>;
212 };
213
214 opb@20000000 {
215 #address-cells = <1>;
216 #size-cells = <1>;
217 ranges = <20000000 20000000 20000000
218 60000000 60000000 20000000
219 80000000 80000000 40000000
220 c0000000 c0000000 20000000>;
221
222 opb_uart16550_0: serial@a0000000 {
223 reg = <a0000000 2000>;
224 };
225
226 opb_intc_0: interrupt-controller@d1000fc0 {
227 reg = <d1000fc0 20>;
228 };
229 };
230 };
231
232 That covers the general approach to binding xilinx IP cores into the
233 device tree. The following are bindings for specific devices:
234
235 i) Xilinx ML300 Framebuffer
236
237 Simple framebuffer device from the ML300 reference design (also found on
238 the ML403 reference design and others).
239
240 Optional properties:
241 - resolution = <xres yres> : pixel resolution of framebuffer. Some
242 implementations use a different resolution.
243 Default is <d#640 d#480>
244 - virt-resolution = <xvirt yvirt> : Size of framebuffer in memory.
245 Default is <d#1024 d#480>.
246 - rotate-display (empty) : rotate display 180 degrees.
247
248 ii) Xilinx SystemACE
249
250 The Xilinx SystemACE device is used to program FPGAs from an FPGA
251 bitstream stored on a CF card. It can also be used as a generic CF
252 interface device.
253
254 Optional properties:
255 - 8-bit (empty) : Set this property for SystemACE in 8 bit mode
256
257 iii) Xilinx EMAC and Xilinx TEMAC
258
259 Xilinx Ethernet devices. In addition to general xilinx properties
260 listed above, nodes for these devices should include a phy-handle
261 property, and may include other common network device properties
262 like local-mac-address.
263
264 iv) Xilinx Uartlite
265
266 Xilinx uartlite devices are simple fixed speed serial ports.
267
268 Required properties:
269 - current-speed : Baud rate of uartlite
270
271 v) Xilinx hwicap
272
273 Xilinx hwicap devices provide access to the configuration logic
274 of the FPGA through the Internal Configuration Access Port
275 (ICAP). The ICAP enables partial reconfiguration of the FPGA,
276 readback of the configuration information, and some control over
277 'warm boots' of the FPGA fabric.
278
279 Required properties:
280 - xlnx,family : The family of the FPGA, necessary since the
281 capabilities of the underlying ICAP hardware
282 differ between different families. May be
283 'virtex2p', 'virtex4', or 'virtex5'.
284
285 vi) Xilinx Uart 16550
286
287 Xilinx UART 16550 devices are very similar to the NS16550 but with
288 different register spacing and an offset from the base address.
289
290 Required properties:
291 - clock-frequency : Frequency of the clock input
292 - reg-offset : A value of 3 is required
293 - reg-shift : A value of 2 is required
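
The effect of reg-offset and reg-shift on register addressing can be
sketched as follows. This assumes the conventional ns16550 interpretation
of the two properties (offset added to the base, register index shifted
left by reg-shift); the helper name is hypothetical.

```c
#include <stdint.h>

/*
 * Address of 16550 register 'index' (0..7):
 *   base + reg-offset + (index << reg-shift)
 */
uint32_t uart16550_reg_addr(uint32_t base, uint32_t reg_offset,
			    uint32_t reg_shift, uint32_t index)
{
	return base + reg_offset + (index << reg_shift);
}
```

With a base of 0xa0000000, reg-offset 3 and reg-shift 2, register 0 sits at
0xa0000003 and register 5 at 0xa0000017, i.e. one byte-wide register in the
last byte of each 32-bit word.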
294
295
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt
index 1df7f9cdab0..86eabe6c341 100644
--- a/Documentation/scheduler/sched-rt-group.txt
+++ b/Documentation/scheduler/sched-rt-group.txt
@@ -73,7 +73,7 @@ The remaining CPU time will be used for user input and other tasks. Because
73realtime tasks have explicitly allocated the CPU time they need to perform 73realtime tasks have explicitly allocated the CPU time they need to perform
74their tasks, buffer underruns in the graphics or audio can be eliminated. 74their tasks, buffer underruns in the graphics or audio can be eliminated.
75 75
76NOTE: the above example is not fully implemented as of yet (2.6.25). We still 76NOTE: the above example is not fully implemented yet. We still
77lack an EDF scheduler to make non-uniform periods usable. 77lack an EDF scheduler to make non-uniform periods usable.
78 78
79 79
@@ -140,14 +140,15 @@ The other option is:
140 140
141.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") 141.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
142 142
143This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" 143This uses the /cgroup virtual file system and
144to control the CPU time reserved for each control group instead. 144"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each
145control group instead.
145 146
146For more information on working with control groups, you should read 147For more information on working with control groups, you should read
147Documentation/cgroups/cgroups.txt as well. 148Documentation/cgroups/cgroups.txt as well.
148 149
149Group settings are checked against the following limits in order to keep the configuration 150Group settings are checked against the following limits in order to keep the
150schedulable: 151configuration schedulable:
151 152
152 \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period 153 \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
153 154
@@ -189,7 +190,7 @@ Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
189the biggest challenge as the current linux PI infrastructure is geared towards 190the biggest challenge as the current linux PI infrastructure is geared towards
190the limited static priority levels 0-99. With deadline scheduling you need to 191the limited static priority levels 0-99. With deadline scheduling you need to
191do deadline inheritance (since priority is inversely proportional to the 192do deadline inheritance (since priority is inversely proportional to the
192deadline delta (deadline - now). 193deadline delta (deadline - now)).
193 194
194This means the whole PI machinery will have to be reworked - and that is one of 195This means the whole PI machinery will have to be reworked - and that is one of
195the most complex pieces of code we have. 196the most complex pieces of code we have.
diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt
index 0d8d23581c4..939a3dd5814 100644
--- a/Documentation/sound/alsa/HD-Audio-Models.txt
+++ b/Documentation/sound/alsa/HD-Audio-Models.txt
@@ -240,6 +240,7 @@ AD1986A
240 laptop-automute 2-channel with EAPD and HP-automute (Lenovo N100) 240 laptop-automute 2-channel with EAPD and HP-automute (Lenovo N100)
241 ultra 2-channel with EAPD (Samsung Ultra tablet PC) 241 ultra 2-channel with EAPD (Samsung Ultra tablet PC)
242 samsung 2-channel with EAPD (Samsung R65) 242 samsung 2-channel with EAPD (Samsung R65)
243 samsung-p50 2-channel with HP-automute (Samsung P50)
243 244
244AD1988/AD1988B/AD1989A/AD1989B 245AD1988/AD1988B/AD1989A/AD1989B
245============================== 246==============================
diff --git a/Documentation/sound/alsa/Procfile.txt b/Documentation/sound/alsa/Procfile.txt
index 381908d8ca4..719a819f8cc 100644
--- a/Documentation/sound/alsa/Procfile.txt
+++ b/Documentation/sound/alsa/Procfile.txt
@@ -101,6 +101,8 @@ card*/pcm*/xrun_debug
101 bit 0 = Enable XRUN/jiffies debug messages 101 bit 0 = Enable XRUN/jiffies debug messages
102 bit 1 = Show stack trace at XRUN / jiffies check 102 bit 1 = Show stack trace at XRUN / jiffies check
103 bit 2 = Enable additional jiffies check 103 bit 2 = Enable additional jiffies check
104 bit 3 = Log hwptr update at each period interrupt
105 bit 4 = Log hwptr update at each snd_pcm_update_hw_ptr()
104 106
105 When the bit 0 is set, the driver will show the messages to 107 When the bit 0 is set, the driver will show the messages to
106 kernel log when an xrun is detected. The debug message is 108 kernel log when an xrun is detected. The debug message is
@@ -117,6 +119,9 @@ card*/pcm*/xrun_debug
117 buggy) hardware that doesn't give smooth pointer updates. 119 buggy) hardware that doesn't give smooth pointer updates.
118 This feature is enabled via the bit 2. 120 This feature is enabled via the bit 2.
119 121
122 Bits 3 and 4 are for logging the hwptr records. Note that
123 these will flood the kernel log with messages.
124
120card*/pcm*/sub*/info 125card*/pcm*/sub*/info
121 The general information of this PCM sub-stream. 126 The general information of this PCM sub-stream.
122 127
diff --git a/Documentation/spi/spidev_test.c b/Documentation/spi/spidev_test.c
index cf0e3ce0d52..c1a5aad3c75 100644
--- a/Documentation/spi/spidev_test.c
+++ b/Documentation/spi/spidev_test.c
@@ -99,11 +99,13 @@ void parse_opts(int argc, char *argv[])
99 { "lsb", 0, 0, 'L' }, 99 { "lsb", 0, 0, 'L' },
100 { "cs-high", 0, 0, 'C' }, 100 { "cs-high", 0, 0, 'C' },
101 { "3wire", 0, 0, '3' }, 101 { "3wire", 0, 0, '3' },
102 { "no-cs", 0, 0, 'N' },
103 { "ready", 0, 0, 'R' },
102 { NULL, 0, 0, 0 }, 104 { NULL, 0, 0, 0 },
103 }; 105 };
104 int c; 106 int c;
105 107
106 c = getopt_long(argc, argv, "D:s:d:b:lHOLC3", lopts, NULL); 108 c = getopt_long(argc, argv, "D:s:d:b:lHOLC3NR", lopts, NULL);
107 109
108 if (c == -1) 110 if (c == -1)
109 break; 111 break;
@@ -139,6 +141,12 @@ void parse_opts(int argc, char *argv[])
139 case '3': 141 case '3':
140 mode |= SPI_3WIRE; 142 mode |= SPI_3WIRE;
141 break; 143 break;
144 case 'N':
145 mode |= SPI_NO_CS;
146 break;
147 case 'R':
148 mode |= SPI_READY;
149 break;
142 default: 150 default:
143 print_usage(argv[0]); 151 print_usage(argv[0]);
144 break; 152 break;
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt
index cf42b820ff9..d56a0177542 100644
--- a/Documentation/sysrq.txt
+++ b/Documentation/sysrq.txt
@@ -66,7 +66,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.:
66'b' - Will immediately reboot the system without syncing or unmounting 66'b' - Will immediately reboot the system without syncing or unmounting
67 your disks. 67 your disks.
68 68
69'c' - Will perform a kexec reboot in order to take a crashdump. 69'c' - Will perform a system crash by a NULL pointer dereference.
70 A crashdump will be taken if configured.
70 71
71'd' - Shows all locks that are held. 72'd' - Shows all locks that are held.
72 73
@@ -141,8 +142,8 @@ useful when you want to exit a program that will not let you switch consoles.
141re'B'oot is good when you're unable to shut down. But you should also 'S'ync 142re'B'oot is good when you're unable to shut down. But you should also 'S'ync
142and 'U'mount first. 143and 'U'mount first.
143 144
144'C'rashdump can be used to manually trigger a crashdump when the system is hung. 145'C'rash can be used to manually trigger a crashdump when the system is hung.
145The kernel needs to have been built with CONFIG_KEXEC enabled. 146Note that this just triggers a crash if there is no dump mechanism available.
146 147
147'S'ync is great when your system is locked up, it allows you to sync your 148'S'ync is great when your system is locked up, it allows you to sync your
148disks and will certainly lessen the chance of data loss and fscking. Note 149disks and will certainly lessen the chance of data loss and fscking. Note
diff --git a/Documentation/video4linux/CARDLIST.em28xx b/Documentation/video4linux/CARDLIST.em28xx
index 873630e7e53..68c236c0184 100644
--- a/Documentation/video4linux/CARDLIST.em28xx
+++ b/Documentation/video4linux/CARDLIST.em28xx
@@ -20,7 +20,7 @@
20 19 -> EM2860/SAA711X Reference Design (em2860) 20 19 -> EM2860/SAA711X Reference Design (em2860)
21 20 -> AMD ATI TV Wonder HD 600 (em2880) [0438:b002] 21 20 -> AMD ATI TV Wonder HD 600 (em2880) [0438:b002]
22 21 -> eMPIA Technology, Inc. GrabBeeX+ Video Encoder (em2800) [eb1a:2801] 22 21 -> eMPIA Technology, Inc. GrabBeeX+ Video Encoder (em2800) [eb1a:2801]
23 22 -> Unknown EM2750/EM2751 webcam grabber (em2750) [eb1a:2750,eb1a:2751] 23 22 -> EM2710/EM2750/EM2751 webcam grabber (em2750) [eb1a:2750,eb1a:2751]
24 23 -> Huaqi DLCW-130 (em2750) 24 23 -> Huaqi DLCW-130 (em2750)
25 24 -> D-Link DUB-T210 TV Tuner (em2820/em2840) [2001:f112] 25 24 -> D-Link DUB-T210 TV Tuner (em2820/em2840) [2001:f112]
26 25 -> Gadmei UTV310 (em2820/em2840) 26 25 -> Gadmei UTV310 (em2820/em2840)
@@ -66,3 +66,4 @@
66 68 -> Terratec AV350 (em2860) [0ccd:0084] 66 68 -> Terratec AV350 (em2860) [0ccd:0084]
67 69 -> KWorld ATSC 315U HDTV TV Box (em2882) [eb1a:a313] 67 69 -> KWorld ATSC 315U HDTV TV Box (em2882) [eb1a:a313]
68 70 -> Evga inDtube (em2882) 68 70 -> Evga inDtube (em2882)
69 71 -> Silvercrest Webcam 1.3mpix (em2820/em2840)
diff --git a/Documentation/video4linux/gspca.txt b/Documentation/video4linux/gspca.txt
index 2bcf78896e2..573f95b5880 100644
--- a/Documentation/video4linux/gspca.txt
+++ b/Documentation/video4linux/gspca.txt
@@ -44,7 +44,9 @@ zc3xx 0458:7007 Genius VideoCam V2
44zc3xx 0458:700c Genius VideoCam V3 44zc3xx 0458:700c Genius VideoCam V3
45zc3xx 0458:700f Genius VideoCam Web V2 45zc3xx 0458:700f Genius VideoCam Web V2
46sonixj 0458:7025 Genius Eye 311Q 46sonixj 0458:7025 Genius Eye 311Q
47sn9c20x 0458:7029 Genius Look 320s
47sonixj 0458:702e Genius Slim 310 NB 48sonixj 0458:702e Genius Slim 310 NB
49sn9c20x 045e:00f4 LifeCam VX-6000 (SN9C20x + OV9650)
48sonixj 045e:00f5 MicroSoft VX3000 50sonixj 045e:00f5 MicroSoft VX3000
49sonixj 045e:00f7 MicroSoft VX1000 51sonixj 045e:00f7 MicroSoft VX1000
50ov519 045e:028c Micro$oft xbox cam 52ov519 045e:028c Micro$oft xbox cam
@@ -282,6 +284,28 @@ sonixj 0c45:613a Microdia Sonix PC Camera
282sonixj 0c45:613b Surfer SN-206 284sonixj 0c45:613b Surfer SN-206
283sonixj 0c45:613c Sonix Pccam168 285sonixj 0c45:613c Sonix Pccam168
284sonixj 0c45:6143 Sonix Pccam168 286sonixj 0c45:6143 Sonix Pccam168
287sn9c20x 0c45:6240 PC Camera (SN9C201 + MT9M001)
288sn9c20x 0c45:6242 PC Camera (SN9C201 + MT9M111)
289sn9c20x 0c45:6248 PC Camera (SN9C201 + OV9655)
290sn9c20x 0c45:624e PC Camera (SN9C201 + SOI968)
291sn9c20x 0c45:624f PC Camera (SN9C201 + OV9650)
292sn9c20x 0c45:6251 PC Camera (SN9C201 + OV9650)
293sn9c20x 0c45:6253 PC Camera (SN9C201 + OV9650)
294sn9c20x 0c45:6260 PC Camera (SN9C201 + OV7670)
295sn9c20x 0c45:6270 PC Camera (SN9C201 + MT9V011/MT9V111/MT9V112)
296sn9c20x 0c45:627b PC Camera (SN9C201 + OV7660)
297sn9c20x 0c45:627c PC Camera (SN9C201 + HV7131R)
298sn9c20x 0c45:627f PC Camera (SN9C201 + OV9650)
299sn9c20x 0c45:6280 PC Camera (SN9C202 + MT9M001)
300sn9c20x 0c45:6282 PC Camera (SN9C202 + MT9M111)
301sn9c20x 0c45:6288 PC Camera (SN9C202 + OV9655)
302sn9c20x 0c45:628e PC Camera (SN9C202 + SOI968)
303sn9c20x 0c45:628f PC Camera (SN9C202 + OV9650)
304sn9c20x 0c45:62a0 PC Camera (SN9C202 + OV7670)
305sn9c20x 0c45:62b0 PC Camera (SN9C202 + MT9V011/MT9V111/MT9V112)
306sn9c20x 0c45:62b3 PC Camera (SN9C202 + OV9655)
307sn9c20x 0c45:62bb PC Camera (SN9C202 + OV7660)
308sn9c20x 0c45:62bc PC Camera (SN9C202 + HV7131R)
285sunplus 0d64:0303 Sunplus FashionCam DXG 309sunplus 0d64:0303 Sunplus FashionCam DXG
286etoms 102c:6151 Qcam Sangha CIF 310etoms 102c:6151 Qcam Sangha CIF
287etoms 102c:6251 Qcam xxxxxx VGA 311etoms 102c:6251 Qcam xxxxxx VGA
@@ -290,6 +314,7 @@ spca561 10fd:7e50 FlyCam Usb 100
290zc3xx 10fd:8050 Typhoon Webshot II USB 300k 314zc3xx 10fd:8050 Typhoon Webshot II USB 300k
291ov534 1415:2000 Sony HD Eye for PS3 (SLEH 00201) 315ov534 1415:2000 Sony HD Eye for PS3 (SLEH 00201)
292pac207 145f:013a Trust WB-1300N 316pac207 145f:013a Trust WB-1300N
317sn9c20x 145f:013d Trust WB-3600R
293vc032x 15b8:6001 HP 2.0 Megapixel 318vc032x 15b8:6001 HP 2.0 Megapixel
294vc032x 15b8:6002 HP 2.0 Megapixel rz406aa 319vc032x 15b8:6002 HP 2.0 Megapixel rz406aa
295spca501 1776:501c Arowana 300K CMOS Camera 320spca501 1776:501c Arowana 300K CMOS Camera
@@ -300,4 +325,11 @@ spca500 2899:012c Toptro Industrial
300spca508 8086:0110 Intel Easy PC Camera 325spca508 8086:0110 Intel Easy PC Camera
301spca500 8086:0630 Intel Pocket PC Camera 326spca500 8086:0630 Intel Pocket PC Camera
302spca506 99fa:8988 Grandtec V.cap 327spca506 99fa:8988 Grandtec V.cap
328sn9c20x a168:0610 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
329sn9c20x a168:0611 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
330sn9c20x a168:0613 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
331sn9c20x a168:0618 Dino-Lite Digital Microscope (SN9C201 + HV7131R)
332sn9c20x a168:0614 Dino-Lite Digital Microscope (SN9C201 + MT9M111)
333sn9c20x a168:0615 Dino-Lite Digital Microscope (SN9C201 + MT9M111)
334sn9c20x a168:0617 Dino-Lite Digital Microscope (SN9C201 + MT9M111)
303spca561 abcd:cdee Petcam 335spca561 abcd:cdee Petcam
diff --git a/Documentation/x86/00-INDEX b/Documentation/x86/00-INDEX
index dbe3377754a..f37b46d3486 100644
--- a/Documentation/x86/00-INDEX
+++ b/Documentation/x86/00-INDEX
@@ -2,3 +2,5 @@
2 - this file 2 - this file
3mtrr.txt 3mtrr.txt
4 - how to use x86 Memory Type Range Registers to increase performance 4 - how to use x86 Memory Type Range Registers to increase performance
5exception-tables.txt
6 - why and how Linux kernel uses exception tables on x86
diff --git a/Documentation/exception.txt b/Documentation/x86/exception-tables.txt
index 2d5aded6424..32901aa36f0 100644
--- a/Documentation/exception.txt
+++ b/Documentation/x86/exception-tables.txt
@@ -1,123 +1,123 @@
1 Kernel level exception handling in Linux 2.1.8 1 Kernel level exception handling in Linux
2 Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> 2 Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
3 3
4When a process runs in kernel mode, it often has to access user 4When a process runs in kernel mode, it often has to access user
5mode memory whose address has been passed by an untrusted program. 5mode memory whose address has been passed by an untrusted program.
6To protect itself the kernel has to verify this address. 6To protect itself the kernel has to verify this address.
7 7
8In older versions of Linux this was done with the 8In older versions of Linux this was done with the
9int verify_area(int type, const void * addr, unsigned long size) 9int verify_area(int type, const void * addr, unsigned long size)
10function (which has since been replaced by access_ok()). 10function (which has since been replaced by access_ok()).
11 11
12This function verified that the memory area starting at address 12This function verified that the memory area starting at address
13'addr' and of size 'size' was accessible for the operation specified 13'addr' and of size 'size' was accessible for the operation specified
14in type (read or write). To do this, verify_area had to look up the 14in type (read or write). To do this, verify_area had to look up the
15virtual memory area (vma) that contained the address addr. In the 15virtual memory area (vma) that contained the address addr. In the
16normal case (correctly working program), this test was successful. 16normal case (correctly working program), this test was successful.
17It only failed for a few buggy programs. In some kernel profiling 17It only failed for a few buggy programs. In some kernel profiling
18tests, this normally unneeded verification used up a considerable 18tests, this normally unneeded verification used up a considerable
19amount of time. 19amount of time.
20 20
21To overcome this situation, Linus decided to let the virtual memory 21To overcome this situation, Linus decided to let the virtual memory
22hardware present in every Linux-capable CPU handle this test. 22hardware present in every Linux-capable CPU handle this test.
23 23
24How does this work? 24How does this work?
25 25
26Whenever the kernel tries to access an address that is currently not 26Whenever the kernel tries to access an address that is currently not
27accessible, the CPU generates a page fault exception and calls the 27accessible, the CPU generates a page fault exception and calls the
28page fault handler 28page fault handler
29 29
30void do_page_fault(struct pt_regs *regs, unsigned long error_code) 30void do_page_fault(struct pt_regs *regs, unsigned long error_code)
31 31
32in arch/i386/mm/fault.c. The parameters on the stack are set up by 32in arch/x86/mm/fault.c. The parameters on the stack are set up by
33the low level assembly glue in arch/i386/kernel/entry.S. The parameter 33the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
34regs is a pointer to the saved registers on the stack, error_code 34regs is a pointer to the saved registers on the stack, error_code
35contains a reason code for the exception. 35contains a reason code for the exception.
36 36
37do_page_fault first obtains the unaccessible address from the CPU 37do_page_fault first obtains the unaccessible address from the CPU
38control register CR2. If the address is within the virtual address 38control register CR2. If the address is within the virtual address
39space of the process, the fault probably occurred, because the page 39space of the process, the fault probably occurred, because the page
40was not swapped in, write protected or something similar. However, 40was not swapped in, write protected or something similar. However,
41we are interested in the other case: the address is not valid, there 41we are interested in the other case: the address is not valid, there
42is no vma that contains this address. In this case, the kernel jumps 42is no vma that contains this address. In this case, the kernel jumps
43to the bad_area label. 43to the bad_area label.
44 44
45There it uses the address of the instruction that caused the exception 45There it uses the address of the instruction that caused the exception
46(i.e. regs->eip) to find an address where the execution can continue 46(i.e. regs->eip) to find an address where the execution can continue
47(fixup). If this search is successful, the fault handler modifies the 47(fixup). If this search is successful, the fault handler modifies the
48return address (again regs->eip) and returns. The execution will 48return address (again regs->eip) and returns. The execution will
49continue at the address in fixup. 49continue at the address in fixup.
50 50
51Where does fixup point to? 51Where does fixup point to?
52 52
53Since we jump to the contents of fixup, fixup obviously points 53Since we jump to the contents of fixup, fixup obviously points
54to executable code. This code is hidden inside the user access macros. 54to executable code. This code is hidden inside the user access macros.
55I have picked the get_user macro defined in include/asm/uaccess.h as an 55I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
56example. The definition is somewhat hard to follow, so let's peek at 56as an example. The definition is somewhat hard to follow, so let's peek at
57the code generated by the preprocessor and the compiler. I selected 57the code generated by the preprocessor and the compiler. I selected
58the get_user call in drivers/char/console.c for a detailed examination. 58the get_user call in drivers/char/sysrq.c for a detailed examination.
59 59
60The original code in console.c line 1405: 60The original code in sysrq.c line 587:
61 get_user(c, buf); 61 get_user(c, buf);
62 62
63The preprocessor output (edited to become somewhat readable): 63The preprocessor output (edited to become somewhat readable):
64 64
65( 65(
66 { 66 {
67 long __gu_err = - 14 , __gu_val = 0; 67 long __gu_err = - 14 , __gu_val = 0;
68 const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); 68 const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
69 if (((((0 + current_set[0])->tss.segment) == 0x18 ) || 69 if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
70 (((sizeof(*(buf))) <= 0xC0000000UL) && 70 (((sizeof(*(buf))) <= 0xC0000000UL) &&
71 ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) 71 ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
72 do { 72 do {
73 __gu_err = 0; 73 __gu_err = 0;
74 switch ((sizeof(*(buf)))) { 74 switch ((sizeof(*(buf)))) {
75 case 1: 75 case 1:
76 __asm__ __volatile__( 76 __asm__ __volatile__(
77 "1: mov" "b" " %2,%" "b" "1\n" 77 "1: mov" "b" " %2,%" "b" "1\n"
78 "2:\n" 78 "2:\n"
79 ".section .fixup,\"ax\"\n" 79 ".section .fixup,\"ax\"\n"
80 "3: movl %3,%0\n" 80 "3: movl %3,%0\n"
81 " xor" "b" " %" "b" "1,%" "b" "1\n" 81 " xor" "b" " %" "b" "1,%" "b" "1\n"
82 " jmp 2b\n" 82 " jmp 2b\n"
83 ".section __ex_table,\"a\"\n" 83 ".section __ex_table,\"a\"\n"
84 " .align 4\n" 84 " .align 4\n"
85 " .long 1b,3b\n" 85 " .long 1b,3b\n"
86 ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) 86 ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
87 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; 87 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
88 break; 88 break;
89 case 2: 89 case 2:
90 __asm__ __volatile__( 90 __asm__ __volatile__(
91 "1: mov" "w" " %2,%" "w" "1\n" 91 "1: mov" "w" " %2,%" "w" "1\n"
92 "2:\n" 92 "2:\n"
93 ".section .fixup,\"ax\"\n" 93 ".section .fixup,\"ax\"\n"
94 "3: movl %3,%0\n" 94 "3: movl %3,%0\n"
95 " xor" "w" " %" "w" "1,%" "w" "1\n" 95 " xor" "w" " %" "w" "1,%" "w" "1\n"
96 " jmp 2b\n" 96 " jmp 2b\n"
97 ".section __ex_table,\"a\"\n" 97 ".section __ex_table,\"a\"\n"
98 " .align 4\n" 98 " .align 4\n"
99 " .long 1b,3b\n" 99 " .long 1b,3b\n"
100 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) 100 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
101 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); 101 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
102 break; 102 break;
103 case 4: 103 case 4:
104 __asm__ __volatile__( 104 __asm__ __volatile__(
105 "1: mov" "l" " %2,%" "" "1\n" 105 "1: mov" "l" " %2,%" "" "1\n"
106 "2:\n" 106 "2:\n"
107 ".section .fixup,\"ax\"\n" 107 ".section .fixup,\"ax\"\n"
108 "3: movl %3,%0\n" 108 "3: movl %3,%0\n"
109 " xor" "l" " %" "" "1,%" "" "1\n" 109 " xor" "l" " %" "" "1,%" "" "1\n"
110 " jmp 2b\n" 110 " jmp 2b\n"
111 ".section __ex_table,\"a\"\n" 111 ".section __ex_table,\"a\"\n"
112 " .align 4\n" " .long 1b,3b\n" 112 " .align 4\n" " .long 1b,3b\n"
113 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) 113 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
114 ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); 114 ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
115 break; 115 break;
116 default: 116 default:
117 (__gu_val) = __get_user_bad(); 117 (__gu_val) = __get_user_bad();
118 } 118 }
119 } while (0) ; 119 } while (0) ;
120 ((c)) = (__typeof__(*((buf))))__gu_val; 120 ((c)) = (__typeof__(*((buf))))__gu_val;
121 __gu_err; 121 __gu_err;
122 } 122 }
123); 123);
@@ -127,12 +127,12 @@ see what code gcc generates:
127 127
128 > xorl %edx,%edx 128 > xorl %edx,%edx
129 > movl current_set,%eax 129 > movl current_set,%eax
130 > cmpl $24,788(%eax) 130 > cmpl $24,788(%eax)
131 > je .L1424 131 > je .L1424
132 > cmpl $-1073741825,64(%esp) 132 > cmpl $-1073741825,64(%esp)
133 > ja .L1423 133 > ja .L1423
134 > .L1424: 134 > .L1424:
135 > movl %edx,%eax 135 > movl %edx,%eax
136 > movl 64(%esp),%ebx 136 > movl 64(%esp),%ebx
137 > #APP 137 > #APP
138 > 1: movb (%ebx),%dl /* this is the actual user access */ 138 > 1: movb (%ebx),%dl /* this is the actual user access */
@@ -149,17 +149,17 @@ see what code gcc generates:
149 > .L1423: 149 > .L1423:
150 > movzbl %dl,%esi 150 > movzbl %dl,%esi
151 151
152The optimizer does a good job and gives us something we can actually 152The optimizer does a good job and gives us something we can actually
153understand. Can we? The actual user access is quite obvious. Thanks 153understand. Can we? The actual user access is quite obvious. Thanks
154to the unified address space we can just access the address in user 154to the unified address space we can just access the address in user
155memory. But what does the .section stuff do????? 155memory. But what does the .section stuff do?????
156 156
157To understand this we have to look at the final kernel: 157To understand this we have to look at the final kernel:
158 158
159 > objdump --section-headers vmlinux 159 > objdump --section-headers vmlinux
160 > 160 >
161 > vmlinux: file format elf32-i386 161 > vmlinux: file format elf32-i386
162 > 162 >
163 > Sections: 163 > Sections:
164 > Idx Name Size VMA LMA File off Algn 164 > Idx Name Size VMA LMA File off Algn
165 > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 165 > 0 .text 00098f40 c0100000 c0100000 00001000 2**4
@@ -198,18 +198,18 @@ final kernel executable:
198 198
199The whole user memory access is reduced to 10 x86 machine instructions. 199The whole user memory access is reduced to 10 x86 machine instructions.
200The instructions bracketed in the .section directives are no longer 200The instructions bracketed in the .section directives are no longer
201in the normal execution path. They are located in a different section 201in the normal execution path. They are located in a different section
202of the executable file: 202of the executable file:
203 203
204 > objdump --disassemble --section=.fixup vmlinux 204 > objdump --disassemble --section=.fixup vmlinux
205 > 205 >
206 > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax 206 > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
207 > c0199ffa <.fixup+10ba> xorb %dl,%dl 207 > c0199ffa <.fixup+10ba> xorb %dl,%dl
208 > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> 208 > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3>
209 209
210And finally: 210And finally:
211 > objdump --full-contents --section=__ex_table vmlinux 211 > objdump --full-contents --section=__ex_table vmlinux
212 > 212 >
213 > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ 213 > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................
214 > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ 214 > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................
215 > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ 215 > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................
@@ -235,8 +235,8 @@ sections in the ELF object file. So the instructions
235ended up in the .fixup section of the object file and the addresses 235ended up in the .fixup section of the object file and the addresses
236 .long 1b,3b 236 .long 1b,3b
237ended up in the __ex_table section of the object file. 1b and 3b 237ended up in the __ex_table section of the object file. 1b and 3b
238are local labels. The local label 1b (1b stands for next label 1 238are local labels. The local label 1b (1b stands for next label 1
239backward) is the address of the instruction that might fault, i.e. 239backward) is the address of the instruction that might fault, i.e.
240in our case the address of the label 1 is c017e7a5: 240in our case the address of the label 1 is c017e7a5:
241the original assembly code: > 1: movb (%ebx),%dl 241the original assembly code: > 1: movb (%ebx),%dl
242and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl 242and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
@@ -254,7 +254,7 @@ The assembly code
254becomes the value pair 254becomes the value pair
255 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ 255 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
256 ^this is ^this is 256 ^this is ^this is
257 1b 3b 257 1b 3b
258c017e7a5,c0199ff5 in the exception table of the kernel. 258c017e7a5,c0199ff5 in the exception table of the kernel.
259 259
260So, what actually happens if a fault from kernel mode with no suitable 260So, what actually happens if a fault from kernel mode with no suitable
@@ -266,9 +266,9 @@ vma occurs?
2663.) CPU calls do_page_fault 2663.) CPU calls do_page_fault
2674.) do_page_fault calls search_exception_table (regs->eip == c017e7a5); 2674.) do_page_fault calls search_exception_table (regs->eip == c017e7a5);
2685.) search_exception_table looks up the address c017e7a5 in the 2685.) search_exception_table looks up the address c017e7a5 in the
269 exception table (i.e. the contents of the ELF section __ex_table) 269 exception table (i.e. the contents of the ELF section __ex_table)
270 and returns the address of the associated fault handling code c0199ff5. 270 and returns the address of the associated fault handling code c0199ff5.
2716.) do_page_fault modifies its own return address to point to the fault 2716.) do_page_fault modifies its own return address to point to the fault
272 handling code and returns. 272 handling code and returns.
2737.) execution continues in the fault handling code. 2737.) execution continues in the fault handling code.
2748.) 8a) EAX becomes -EFAULT (== -14) 2748.) 8a) EAX becomes -EFAULT (== -14)