2 files changed, 189 insertions, 75 deletions
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
index 33f74310d161..a584f05403a4 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -18,6 +18,14 @@ Non-privileged mount (or user mount):
  user.  NOTE: this is not the same as mounts allowed with the "user"
  option in /etc/fstab, which is not discussed here.
+Filesystem connection:
+  A connection between the filesystem daemon and the kernel.  The
+  connection exists until either the daemon dies, or the filesystem is
+  umounted.  Note that detaching (or lazy umounting) the filesystem
+  does _not_ break the connection; in this case it will exist until
+  the last reference to the filesystem is released.
 Mount owner:
  The user who does the mounting.
@@ -86,16 +94,20 @@ Mount options
  The default is infinite.  Note that the size of read requests is
  limited anyway to 32 pages (which is 128kbyte on i386).
-Sysfs
+Control filesystem
-~~~~~
+~~~~~~~~~~~~~~~~~~
+There's a control filesystem for FUSE, which can be mounted by:
-FUSE sets up the following hierarchy in sysfs:
+  mount -t fusectl none /sys/fs/fuse/connections
-  /sys/fs/fuse/connections/N/
+Mounting it under the '/sys/fs/fuse/connections' directory makes it
+backwards compatible with earlier versions.
-where N is an increasing number allocated to each new connection.
+Under the fuse control filesystem each connection has a directory
+named by a unique number.
-For each connection the following attributes are defined:
+For each connection the following files exist within this directory:
 'waiting'
@@ -110,7 +122,47 @@ For each connection the following attributes are defined:
  connection.  This means that all waiting requests will be aborted an
  error returned for all aborted and new requests.
-Only a privileged user may read or write these attributes.
+Only the owner of the mount may read or write these files.
+Interrupting filesystem operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If a process issuing a FUSE filesystem request is interrupted, the
+following will happen:
+  1) If the request is not yet sent to userspace AND the signal is
+     fatal (SIGKILL or unhandled fatal signal), then the request is
+     dequeued and returns immediately.
+  2) If the request is not yet sent to userspace AND the signal is not
+     fatal, then an 'interrupted' flag is set for the request.  When
+     the request has been successfully transferred to userspace and
+     this flag is set, an INTERRUPT request is queued.
+  3) If the request is already sent to userspace, then an INTERRUPT
+     request is queued.
+INTERRUPT requests take precedence over other requests, so the
+userspace filesystem will receive queued INTERRUPTs before any others.
+The userspace filesystem may ignore the INTERRUPT requests entirely,
+or may honor them by sending a reply to the _original_ request, with
+the error set to EINTR.
+It is also possible that there's a race between processing the
+original request and its INTERRUPT request.  There are two possibilities:
+  1) The INTERRUPT request is processed before the original request is
+     processed
+  2) The INTERRUPT request is processed after the original request has
+     been answered
+If the filesystem cannot find the original request, it should wait for
+some timeout and/or a number of new requests to arrive, after which it
+should reply to the INTERRUPT request with an EAGAIN error.  In case
+1) the INTERRUPT request will be requeued.  In case 2) the INTERRUPT
+reply will be ignored.
 Aborting a filesystem connection
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -139,8 +191,8 @@ the filesystem.  There are several ways to do this:
  - Use forced umount (umount -f).  Works in all cases but only if
    filesystem is still attached (it hasn't been lazy unmounted)
-  - Abort filesystem through the sysfs interface.  Most powerful
+  - Abort filesystem through the FUSE control filesystem.  Most
-    method, always works.
+    powerful method, always works.
 How do non-privileged mounts work?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,25 +356,7 @@ Scenario 1 -  Simple deadlock
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*
-The solution for this is to allow requests to be interrupted while
+The solution for this is to allow the filesystem to be aborted.
-they are in userspace:
- |      [interrupted by signal]       |
- |    <fuse_unlink()                  |
- |    [release semaphore]             |    [semaphore acquired]
- |  <sys_unlink()                     |
- |                                    |    >fuse_unlink()
- |                                    |      [queue req on fc->pending]
- |                                    |      [wake up fc->waitq]
- |                                    |      [sleep on req->waitq]
-If the filesystem daemon was single threaded, this will stop here,
-since there's no other thread to dequeue and execute the request.
-In this case the solution is to kill the FUSE daemon as well.  If
-there are multiple serving threads, you just have to kill them as
-long as any remain.
-Moral: a filesystem which deadlocks, can soon find itself dead.
 Scenario 2 - Tricky deadlock
 ----------------------------
@@ -355,24 +389,14 @@ but is caused by a pagefault.
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *
-Solution is again to let the the request be interrupted (not
+Solution is basically the same as above.
-elaborated further).
-An additional problem is that while the write buffer is being
-copied to the request, the request must not be interrupted.  This
-is because the destination address of the copy may not be valid
-after the request is interrupted.
-This is solved with doing the copy atomically, and allowing
-interruption while the page(s) belonging to the write buffer are
-faulted with get_user_pages().  The 'req->locked' flag indicates
-when the copy is taking place, and interruption is delayed until
-this flag is unset.
-Scenario 3 - Tricky deadlock with asynchronous read
+An additional problem is that while the write buffer is being copied
---------------------------------------------------
+to the request, the request must not be interrupted/aborted.  This is
+because the destination address of the copy may not be valid after the
+request has returned.
-The same situation as above, except thread-1 will wait on page lock
+This is solved by doing the copy atomically, and allowing abort
-and hence it will be uninterruptible as well.  The solution is to
+while the page(s) belonging to the write buffer are faulted with
-abort the connection with forced umount (if mount is attached) or
+get_user_pages().  The 'req->locked' flag indicates when the copy is
-through the abort attribute in sysfs.
+taking place, and abort is delayed until this flag is unset.
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
index 60ab61e54e8a..25981e2e51be 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -70,11 +70,13 @@ tmpfs mounts.  See Documentation/filesystems/tmpfs.txt for more information.
 What is rootfs?
 ---------------
-Rootfs is a special instance of ramfs, which is always present in 2.6 systems.
+Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
-(It's used internally as the starting and stopping point for searches of the
+always present in 2.6 systems.  You can't unmount rootfs for approximately the
-kernel's doubly-linked list of mount points.)
+same reason you can't kill the init process; rather than having special code
+to check for and handle an empty list, it's smaller and simpler for the kernel
+to just make sure certain lists can't become empty.
-Most systems just mount another filesystem over it and ignore it.  The
+Most systems just mount another filesystem over rootfs and ignore it.  The
 amount of space an empty instance of ramfs takes up is tiny.
 What is initramfs?
@@ -92,14 +94,16 @@ out of that.
 All this differs from the old initrd in several ways:
-  - The old initrd was a separate file, while the initramfs archive is linked
+  - The old initrd was always a separate file, while the initramfs archive is
-    into the linux kernel image.  (The directory linux-*/usr is devoted to
+    linked into the linux kernel image.  (The directory linux-*/usr is devoted
-    generating this archive during the build.)
+    to generating this archive during the build.)
  - The old initrd file was a gzipped filesystem image (in some file format,
-    such as ext2, that had to be built into the kernel), while the new
+    such as ext2, that needed a driver built into the kernel), while the new
    initramfs archive is a gzipped cpio archive (like tar only simpler,
-    see cpio(1) and Documentation/early-userspace/buffer-format.txt).
+    see cpio(1) and Documentation/early-userspace/buffer-format.txt).  The
+    kernel's cpio extraction code is not only extremely small, it's also
+    __init data that can be discarded during the boot process.
  - The program run by the old initrd (which was called /initrd, not /init) did
    some setup and then returned to the kernel, while the init program from
@@ -124,13 +128,14 @@ Populating initramfs:
 The 2.6 kernel build process always creates a gzipped cpio format initramfs
 archive and links it into the resulting kernel binary.  By default, this
-archive is empty (consuming 134 bytes on x86).  The config option
+archive is empty (consuming 134 bytes on x86).
-CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices
-in menuconfig, and living in usr/Kconfig) can be used to specify a source for
+The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
-the initramfs archive, which will automatically be incorporated into the
+devices->block devices in menuconfig, and living in usr/Kconfig) can be used
-resulting binary.  This option can point to an existing gzipped cpio archive, a
+to specify a source for the initramfs archive, which will automatically be
-directory containing files to be archived, or a text file specification such
+incorporated into the resulting binary.  This option can point to an existing
-as the following example:
+gzipped cpio archive, a directory containing files to be archived, or a text
+file specification such as the following example:
  dir /dev 755 0 0
  nod /dev/console 644 0 0 c 5 1
@@ -146,23 +151,84 @@ as the following example:
 Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
 documenting the above file format.
-One advantage of the text file is that root access is not required to
+One advantage of the configuration file is that root access is not required to
 set permissions or create device nodes in the new archive.  (Note that those
 two example "file" entries expect to find files named "init.sh" and "busybox" in
 a directory called "initramfs", under the linux-2.6.* directory.  See
 Documentation/early-userspace/README for more details.)
-The kernel does not depend on external cpio tools, gen_init_cpio is created
+The kernel does not depend on external cpio tools.  If you specify a
-from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's
+directory instead of a configuration file, the kernel's build infrastructure
-boot-time extractor is also (obviously) self-contained.  However, if you _do_
+creates a configuration file from that directory (usr/Makefile calls
-happen to have cpio installed, the following command line can extract the
+scripts/gen_initramfs_list.sh), and proceeds to package up that directory
-generated cpio image back into its component files:
+using the config file (by feeding it to usr/gen_init_cpio, which is created
+from usr/gen_init_cpio.c).  The kernel's build-time cpio creation code is
+entirely self-contained, and the kernel's boot-time extractor is also
+(obviously) self-contained.
+The one thing you might need external cpio utilities installed for is creating
+or extracting your own preprepared cpio files to feed to the kernel build
+(instead of a config file or directory).
+The following command line can extract a cpio image (created either by the
+script below or by the kernel build) back into its component files:
  cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
+The following shell script can create a prebuilt cpio archive you can
+use in place of the above config file:
+  #!/bin/sh
+  # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
+  # Licensed under GPL version 2
+  if [ $# -ne 2 ]
+  then
+    echo "usage: mkinitramfs directory imagename.cpio.gz"
+    exit 1
+  fi
+  if [ -d "$1" ]
+  then
+    echo "creating $2 from $1"
+    (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
+  else
+    echo "First argument must be a directory"
+    exit 1
+  fi
+Note: The cpio man page contains some bad advice that will break your initramfs
+archive if you follow it.  It says "A typical way to generate the list
+of filenames is with the find command; you should give find the -depth option
+to minimize problems with permissions on directories that are unwritable or not
+searchable."  Don't do this when creating initramfs.cpio.gz images; it won't
+work.  The Linux kernel cpio extractor won't create files in a directory that
+doesn't exist, so the directory entries must go before the files that go in
+those directories.  The above script gets them in the right order.
+External initramfs images:
+--------------------------
+If the kernel has initrd support enabled, an external cpio.gz archive can also
+be passed into a 2.6 kernel in place of an initrd.  In this case, the kernel
+will autodetect the type (initramfs, not initrd) and extract the external cpio
+archive into rootfs before trying to run /init.
+This has the memory efficiency advantages of initramfs (no ramdisk block
+device) but the separate packaging of initrd (which is nice if you have
+non-GPL code you'd like to run from initramfs, without conflating it with
+the GPL licensed Linux kernel binary).
+It can also be used to supplement the kernel's built-in initramfs image.  The
+files in the external archive will overwrite any conflicting files in
+the built-in initramfs archive.  Some distributors also prefer to customize
+a single kernel image with task-specific initramfs images, without recompiling.
 Contents of initramfs:
 ----------------------
+An initramfs archive is a complete self-contained root filesystem for Linux.
 If you don't already understand what shared libraries, devices, and paths
 you need to get a minimal root filesystem up and running, here are some
 references:
@@ -176,13 +242,36 @@ code against, along with some related utilities.  It is BSD licensed.
 I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
 myself.  These are LGPL and GPL, respectively.  (A self-contained initramfs
-package is planned for the busybox 1.2 release.)
+package is planned for the busybox 1.3 release.)
 In theory you could use glibc, but that's not well suited for small embedded
 uses like this.  (A "hello world" program statically linked against glibc is
 over 400k.  With uClibc it's 7k.  Also note that glibc dlopens libnss to do
 name lookups, even when otherwise statically linked.)
+A good first step is to get initramfs to run a statically linked "hello world"
+program as init, and test it under an emulator like qemu (www.qemu.org) or
+User Mode Linux, like so:
+  cat > hello.c << EOF
+  #include <stdio.h>
+  #include <unistd.h>
+  int main(int argc, char *argv[])
+  {
+    printf("Hello world!\n");
+    sleep(999999999);
+  }
+  EOF
+  gcc -static hello.c -o init
+  echo init | cpio -o -H newc | gzip > test.cpio.gz
+  # Testing external initramfs using the initrd loading mechanism.
+  qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
+When debugging a normal root filesystem, it's nice to be able to boot with
+"init=/bin/sh".  The initramfs equivalent is "rdinit=/bin/sh", and it's
+just as useful.
 Why cpio rather than tar?
 -------------------------
@@ -241,7 +330,7 @@ the above threads) is:
 Future directions:
 ------------------
-Today (2.6.14), initramfs is always compiled in, but not always used.  The
+Today (2.6.16), initramfs is always compiled in, but not always used.  The
 kernel falls back to legacy boot code that is reached only if initramfs does
 not contain an /init program.  The fallback is legacy code, there to ensure a
 smooth transition and allowing early boot functionality to gradually move to
@@ -258,8 +347,9 @@ and so on.
 This kind of complexity (which inevitably includes policy) is rightly handled
 in userspace.  Both klibc and busybox/uClibc are working on simple initramfs
-packages to drop into a kernel build, and when standard solutions are ready
+packages to drop into a kernel build.
-and widely deployed, the kernel's legacy early boot code will become obsolete
-and a candidate for the feature removal schedule.
-But that's a while off yet.
+The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
+The kernel's current early boot code (partition detection, etc.) will probably
+be migrated into a default initramfs, automatically created and used by the
+kernel build.