6 files changed, 624 insertions, 327 deletions
diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt
new file mode 100644
index 000000000000..4c0c575a4012
--- /dev/null
+++ b/Documentation/filesystems/dentry-locking.txt
@@ -0,0 +1,173 @@
+RCU-based dcache locking model
+==============================
+On many workloads, the most common operation on dcache is to look up a
+dentry, given a parent dentry and the name of the child. Typically,
+for every open(), stat() etc., the dentry corresponding to the
+pathname will be looked up by walking the tree starting with the first
+component of the pathname and using that dentry along with the next
+component to look up the next level and so on. Since it is a frequent
+operation for workloads like multiuser environments and web servers,
+it is important to optimize this path.
+Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in
+every component during path look-up. From 2.5.10 onwards, the fast-walk
+algorithm changed this by holding the dcache_lock at the beginning and
+walking as many cached path component dentries as possible. This
+significantly decreases the number of acquisitions of
+dcache_lock. However it also increases the lock hold time
+significantly and affects performance in large SMP machines. Since
+2.5.62 kernel, dcache has been using a new locking model that uses RCU
+to make dcache look-up lock-free.
+The current dcache locking model is not very different from the
+previous dcache locking model. Prior to 2.5.62 kernel, dcache_lock
+protected the hash chain, d_child, d_alias, d_lru lists as well as
+d_inode and several other things like mount look-up. RCU-based changes
+affect only the way the hash chain is protected. For everything else
+the dcache_lock must be taken for both traversing as well as
+updating. The hash chain updates too take the dcache_lock.  The
+significant change is the way d_lookup traverses the hash chain: it
+doesn't acquire the dcache_lock for this and relies on RCU to ensure
+that the dentry has not been *freed*.
+Dcache locking details
+======================
+For many multi-user workloads, open() and stat() on files are very
+frequently occurring operations. Both involve walking of path names to
+find the dentry corresponding to the concerned file. In 2.4 kernel,
+dcache_lock was held during look-up of each path component. Contention
+and cache-line bouncing of this global lock caused significant
+scalability problems. With the introduction of RCU in Linux kernel,
+this was worked around by making the look-up of path components during
+path walking lock-free.
+Safe lock-free look-up of dcache hash table
+===========================================
+Dcache is a complex data structure with the hash table entries also
+linked together in other lists. In 2.4 kernel, dcache_lock protected
+all the lists. We applied RCU only on hash chain walking. The rest of
+the lists are still protected by dcache_lock.  Some of the important
+changes are :
+1. The deletion from hash chain is done using hlist_del_rcu() macro
+   which doesn't initialize next pointer of the deleted dentry and
+   this allows us to walk safely lock-free while a deletion is
+   happening.
+2. Insertion of a dentry into the hash table is done using
+   hlist_add_head_rcu() which takes care of ordering the writes - the
+   writes to the dentry must be visible before the dentry is
+   inserted. This works in conjunction with hlist_for_each_rcu() while
+   walking the hash chain. The only requirement is that all
+   initialization to the dentry must be done before
+   hlist_add_head_rcu() since we don't have dcache_lock protection
+   while traversing the hash chain. This isn't different from the
+   existing code.
+3. The dentry looked up without holding dcache_lock cannot be
+   returned for walking if it is unhashed. It then may have a NULL
+   d_inode or other bogosity since RCU doesn't protect the other
+   fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
+   indicate unhashed dentries and use this in conjunction with a
+   per-dentry lock (d_lock). Once looked up without the dcache_lock,
+   we acquire the per-dentry lock (d_lock) and check if the dentry is
+   unhashed. If so, the look-up is failed. If not, the reference count
+   of the dentry is increased and the dentry is returned.
+4. Once a dentry is looked up, it must be ensured during the path walk
+   for that component it doesn't go away. In pre-2.5.10 code, this was
+   done holding a reference to the dentry. dcache_rcu does the same.
+   In some sense, dcache_rcu path walking looks like the pre-2.5.10
+   version.
+5. All dentry hash chain updates must take the dcache_lock as well as
+   the per-dentry lock in that order. dput() does this to ensure that
+   a dentry that has just been looked up in another CPU doesn't get
+   deleted before dget() can be done on it.
+6. There are several ways to do reference counting of RCU protected
+   objects. One such example is in ipv4 route cache where deferred
+   freeing (using call_rcu()) is done as soon as the reference count
+   goes to zero. This cannot be done in the case of dentries because
+   tearing down of dentries requires blocking (dentry_iput()) which
+   isn't supported from RCU callbacks. Instead, tearing down of
+   dentries happens synchronously in dput(), but actual freeing happens
+   later when RCU grace period is over. This allows safe lock-free
+   walking of the hash chains, but a matched dentry may have been
+   partially torn down. The checking of DCACHE_UNHASHED flag with
+   d_lock held detects such dentries and prevents them from being
+   returned from look-up.
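As an illustration of items 3 and 6 above, here is a minimal userspace sketch of the "walk without the global lock, then validate under the per-dentry lock" pattern. It is not kernel code and not part of this patch. Every name in it (sketch_dentry, sketch_lookup, SKETCH_UNHASHED and so on) is hypothetical, a C11 atomic_flag spinlock stands in for d_lock, and the RCU read-side protection and memory ordering that hlist_for_each_rcu() provides are deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_UNHASHED 0x1u	/* hypothetical stand-in for DCACHE_UNHASHED */

struct sketch_dentry {
	struct sketch_dentry *next;	/* null-terminated hash chain */
	struct sketch_dentry *parent;
	const char *name;
	unsigned int flags;		/* protected by lock */
	int count;			/* protected by lock */
	atomic_flag lock;		/* stand-in for the per-dentry d_lock */
};

static void sketch_lock(struct sketch_dentry *d)
{
	while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void sketch_unlock(struct sketch_dentry *d)
{
	atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

static void sketch_init(struct sketch_dentry *d, struct sketch_dentry *parent,
			const char *name)
{
	assert(name != NULL);
	d->next = NULL;
	d->parent = parent;
	d->name = name;
	d->flags = 0;
	d->count = 0;
	atomic_flag_clear(&d->lock);
}

/*
 * Walk the chain without any global lock.  Each candidate is validated
 * under its own lock: parent and name are compared there, and an entry
 * that matches but is already unhashed makes the look-up fail instead
 * of being returned.  A usable entry has its reference count raised
 * before the lock is dropped.
 */
static struct sketch_dentry *sketch_lookup(struct sketch_dentry *head,
					   struct sketch_dentry *parent,
					   const char *name)
{
	struct sketch_dentry *d;

	for (d = head; d != NULL; d = d->next) {
		int match, usable = 0;

		sketch_lock(d);
		match = d->parent == parent && strcmp(d->name, name) == 0;
		if (match && !(d->flags & SKETCH_UNHASHED)) {
			d->count++;
			usable = 1;
		}
		sketch_unlock(d);

		if (match)
			return usable ? d : NULL;
	}
	return NULL;
}
```

A caller would build a chain with sketch_init(), link entries through next, and call sketch_lookup(head, parent, "name"). Setting SKETCH_UNHASHED on an entry models a dentry that dput() has already torn down but whose memory has not yet been freed.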
+Maintaining POSIX rename semantics
+==================================
+Since look-up of dentries is lock-free, it can race against a
+concurrent rename operation. For example, during rename of file A to
+B, look-up of either A or B must succeed.  So, if look-up of B happens
+after A has been removed from the hash chain but not added to the new
+hash chain, it may fail.  Also, a comparison while the name is being
+written concurrently by a rename may result in false positive matches
+violating rename semantics.  Issues related to race with rename are
+handled as described below :
+1. Look-up can be done in two ways - d_lookup() which is safe from
+   simultaneous renames and __d_lookup() which is not.  If
+   __d_lookup() fails, it must be followed up by a d_lookup() to
+   correctly determine whether a dentry is in the hash table or
+   not. d_lookup() protects look-ups using a sequence lock
+   (rename_lock).
+2. The name associated with a dentry (d_name) may be changed if a
+   rename is allowed to happen simultaneously. To keep memcmp() in
+   __d_lookup() from going out of bounds due to a rename, and to avoid
+   false positive comparisons, the name comparison is done while holding the
+   per-dentry lock. This prevents concurrent renames during this
+   operation.
+3. Hash table walking during look-up may move to a different bucket as
+   the current dentry is moved to a different bucket due to rename.
+   But we use hlists in dcache hash table and they are
+   null-terminated.  So, even if a dentry moves to a different bucket,
+   hash chain walk will terminate. [with a list_head list, it may not
+   since termination is when the list_head in the original bucket is
+   reached].  Since we redo the d_parent check and compare name while
+   holding d_lock, lock-free look-up will not race against d_move().
+4. There can be a theoretical race when a dentry keeps coming back to
+   original bucket due to double moves. Due to this look-up may
+   consider that it has never moved and can end up in an infinite loop.
+   But this is not any worse than the theoretical livelocks we already
+   have in the kernel.
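Item 1 above says that a failed lock-free look-up has to be retried under a sequence lock. The following userspace sketch shows that retry loop under stated assumptions. It is not kernel code and not part of this patch: all names (sketch_seqlock, sketch_safe_lookup, sketch_demo_lookup and so on) are hypothetical, writers are assumed to be serialized by some other lock, and the demo callback only simulates a rename that completes during the first pass.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sequence lock such as rename_lock. */
struct sketch_seqlock {
	atomic_uint seq;	/* odd while a writer is active */
};

static unsigned int sketch_read_begin(struct sketch_seqlock *sl)
{
	unsigned int s;

	/* Wait until no writer is in progress, then remember the count. */
	while ((s = atomic_load_explicit(&sl->seq, memory_order_acquire)) & 1u)
		;
	return s;
}

static bool sketch_read_retry(struct sketch_seqlock *sl, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&sl->seq, memory_order_relaxed) != start;
}

/* Writers are assumed to be serialized by some other lock. */
static void sketch_write_begin(struct sketch_seqlock *sl)
{
	atomic_fetch_add_explicit(&sl->seq, 1u, memory_order_acq_rel);
}

static void sketch_write_end(struct sketch_seqlock *sl)
{
	atomic_fetch_add_explicit(&sl->seq, 1u, memory_order_release);
}

typedef void *(*sketch_fast_lookup_fn)(void *ctx);

/*
 * A hit from the fast look-up is trusted immediately.  A miss is only
 * trusted if no writer (rename) ran while the look-up was in progress;
 * otherwise the look-up is repeated.
 */
static void *sketch_safe_lookup(struct sketch_seqlock *sl,
				sketch_fast_lookup_fn fast, void *ctx)
{
	void *found;
	unsigned int start;

	do {
		start = sketch_read_begin(sl);
		found = fast(ctx);
		if (found)
			break;
	} while (sketch_read_retry(sl, start));

	return found;
}

/* Demo only: the first pass misses because a "rename" ran during it. */
struct sketch_demo_ctx {
	struct sketch_seqlock *rename_lock;
	int calls;
	int target;
};

static void *sketch_demo_lookup(void *p)
{
	struct sketch_demo_ctx *c = p;

	if (c->calls++ == 0) {
		sketch_write_begin(c->rename_lock);
		sketch_write_end(c->rename_lock);
		return NULL;
	}
	return &c->target;
}
```

Calling sketch_safe_lookup(&sl, sketch_demo_lookup, &ctx) with a zero-initialized sequence count runs the fast look-up twice: the first miss is discarded because the count changed, and the second pass returns the target. A miss is reported to the caller only when the count is unchanged across the whole look-up.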
+Important guidelines for filesystem developers related to dcache_rcu
+====================================================================
+1. Existing dcache interfaces (pre-2.5.62) exported to filesystem
+   don't change. Only dcache internal implementation changes. However
+   filesystems *must not* delete from the dentry hash chains directly
+   using the list macros as was allowed earlier. They must use dcache
+   APIs like d_drop() or __d_drop() depending on the situation.
+2. d_flags is now protected by a per-dentry lock (d_lock). All access
+   to d_flags must be protected by it.
+3. For a hashed dentry, checking of d_count needs to be protected by
+   d_lock.
+Papers and other documentation on dcache locking
+================================================
+1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
+2. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README
index 54366ecc241f..aabfba24bc2e 100644
--- a/Documentation/filesystems/devfs/README
+++ b/Documentation/filesystems/devfs/README
@@ -1812,11 +1812,6 @@ it may overflow the messages buffer, but try to get as much of it as
 you can
-if you get an Oops, run ksymoops to decode it so that the
-names of the offending functions are provided. A non-decoded Oops is
-pretty useless
 send a copy of your devfsd configuration file(s)
 send the bug report to me first.
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index d16334ec48ba..a8edb376b041 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -17,8 +17,6 @@ set using tune2fs(8). Kernel-determined defaults are indicated by (*).
 bsddf                   (*)     Makes `df' act like BSD.
 minixdf                         Makes `df' act like Minix.
-check                           Check block and inode bitmaps at mount time
-                                (requires CONFIG_EXT2_CHECK).
 check=none, nocheck     (*)     Don't do extra checking of bitmaps on mount
                                (check=normal and check=strict options removed)
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
new file mode 100644
index 000000000000..b3404a032596
--- /dev/null
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -0,0 +1,195 @@
+ramfs, rootfs and initramfs
+October 17, 2005
+Rob Landley <rob@landley.net>
+=============================
+What is ramfs?
+--------------
+Ramfs is a very simple filesystem that exports Linux's disk caching
+mechanisms (the page cache and dentry cache) as a dynamically resizable
+ram-based filesystem.
+Normally all files are cached in memory by Linux.  Pages of data read from
+backing store (usually the block device the filesystem is mounted on) are kept
+around in case they're needed again, but marked as clean (freeable) in case the
+Virtual Memory system needs the memory for something else.  Similarly, data
+written to files is marked clean as soon as it has been written to backing
+store, but kept around for caching purposes until the VM reallocates the
+memory.  A similar mechanism (the dentry cache) greatly speeds up access to
+directories.
+With ramfs, there is no backing store.  Files written into ramfs allocate
+dentries and page cache as usual, but there's nowhere to write them to.
+This means the pages are never marked clean, so they can't be freed by the
+VM when it's looking to recycle memory.
+The amount of code required to implement ramfs is tiny, because all the
+work is done by the existing Linux caching infrastructure.  Basically,
+you're mounting the disk cache as a filesystem.  Because of this, ramfs is not
+an optional component removable via menuconfig, since there would be negligible
+space savings.
+ramfs and ramdisk:
+------------------
+The older "ram disk" mechanism created a synthetic block device out of
+an area of ram and used it as backing store for a filesystem.  This block
+device was of fixed size, so the filesystem mounted on it was of fixed
+size.  Using a ram disk also required unnecessarily copying memory from the
+fake block device into the page cache (and copying changes back out), as well
+as creating and destroying dentries.  Plus it needed a filesystem driver
+(such as ext2) to format and interpret this data.
+Compared to ramfs, this wastes memory (and memory bus bandwidth), creates
+unnecessary work for the CPU, and pollutes the CPU caches.  (There are tricks
+to avoid this copying by playing with the page tables, but they're unpleasantly
+complicated and turn out to be about as expensive as the copying anyway.)
+More to the point, all the work ramfs is doing has to happen _anyway_,
+since all file access goes through the page and dentry caches.  The ram
+disk is simply unnecessary; ramfs is internally much simpler.
+Another reason ramdisks are semi-obsolete is that the introduction of
+loopback devices offered a more flexible and convenient way to create
+synthetic block devices, now from files instead of from chunks of memory.
+See losetup (8) for details.
+ramfs and tmpfs:
+----------------
+One downside of ramfs is you can keep writing data into it until you fill
+up all memory, and the VM can't free it because the VM thinks that files
+should get written to backing store (rather than swap space), but ramfs hasn't
+got any backing store.  Because of this, only root (or a trusted user) should
+be allowed write access to a ramfs mount.
+A ramfs derivative called tmpfs was created to add size limits, and the ability
+to write the data to swap space.  Normal users can be allowed write access to
+tmpfs mounts.  See Documentation/filesystems/tmpfs.txt for more information.
+What is rootfs?
+---------------
+Rootfs is a special instance of ramfs, which is always present in 2.6 systems.
+(It's used internally as the starting and stopping point for searches of the
+kernel's doubly-linked list of mount points.)
+Most systems just mount another filesystem over it and ignore it.  The
+amount of space an empty instance of ramfs takes up is tiny.
+What is initramfs?
+------------------
+All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is
+extracted into rootfs when the kernel boots up.  After extracting, the kernel
+checks to see if rootfs contains a file "init", and if so it executes it as PID
+1.  If found, this init process is responsible for bringing the system the
+rest of the way up, including locating and mounting the real root device (if
+any).  If rootfs does not contain an init program after the embedded cpio
+archive is extracted into it, the kernel will fall through to the older code
+to locate and mount a root partition, then exec some variant of /sbin/init
+out of that.
+All this differs from the old initrd in several ways:
+  - The old initrd was a separate file, while the initramfs archive is linked
+    into the linux kernel image.  (The directory linux-*/usr is devoted to
+    generating this archive during the build.)
+  - The old initrd file was a gzipped filesystem image (in some file format,
+    such as ext2, that had to be built into the kernel), while the new
+    initramfs archive is a gzipped cpio archive (like tar only simpler,
+    see cpio(1) and Documentation/early-userspace/buffer-format.txt).
+  - The program run by the old initrd (which was called /initrd, not /init) did
+    some setup and then returned to the kernel, while the init program from
+    initramfs is not expected to return to the kernel.  (If /init needs to hand
+    off control it can overmount / with a new root device and exec another init
+    program.  See the switch_root utility, below.)
+  - When switching to another root device, initrd would pivot_root and then
+    umount the ramdisk.  But initramfs is rootfs: you can neither pivot_root
+    rootfs, nor unmount it.  Instead delete everything out of rootfs to
+    free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs
+    with the new root (cd /newmount; mount --move . /; chroot .), attach
+    stdin/stdout/stderr to the new /dev/console, and exec the new init.
+    Since this is a remarkably persnickety process (and involves deleting
+    commands before you can run them), the klibc package introduced a helper
+    program (utils/run_init.c) to do all this for you.  Most other packages
+    (such as busybox) have named this command "switch_root".
+Populating initramfs:
+---------------------
+The 2.6 kernel build process always creates a gzipped cpio format initramfs
+archive and links it into the resulting kernel binary.  By default, this
+archive is empty (consuming 134 bytes on x86).  The config option
+CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices
+in menuconfig, and living in usr/Kconfig) can be used to specify a source for
+the initramfs archive, which will automatically be incorporated into the
+resulting binary.  This option can point to an existing gzipped cpio archive, a
+directory containing files to be archived, or a text file specification such
+as the following example:
+  dir /dev 755 0 0
+  nod /dev/console 644 0 0 c 5 1
+  nod /dev/loop0 644 0 0 b 7 0
+  dir /bin 755 1000 1000
+  slink /bin/sh busybox 777 0 0
+  file /bin/busybox initramfs/busybox 755 0 0
+  dir /proc 755 0 0
+  dir /sys 755 0 0
+  dir /mnt 755 0 0
+  file /init initramfs/init.sh 755 0 0
+One advantage of the text file is that root access is not required to
+set permissions or create device nodes in the new archive.  (Note that those
+two example "file" entries expect to find files named "init.sh" and "busybox" in
+a directory called "initramfs", under the linux-2.6.* directory.  See
+Documentation/early-userspace/README for more details.)
+If you don't already understand what shared libraries, devices, and paths
+you need to get a minimal root filesystem up and running, here are some
+references:
+http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
+http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
+http://www.linuxfromscratch.org/lfs/view/stable/
+The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
+designed to be a tiny C library to statically link early userspace
+code against, along with some related utilities.  It is BSD licensed.
+I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
+myself.  These are LGPL and GPL, respectively.
+In theory you could use glibc, but that's not well suited for small embedded
+uses like this.  (A "hello world" program statically linked against glibc is
+over 400k.  With uClibc it's 7k.  Also note that glibc dlopens libnss to do
+name lookups, even when otherwise statically linked.)
+Future directions:
+------------------
+Today (2.6.14), initramfs is always compiled in, but not always used.  The
+kernel falls back to legacy boot code that is reached only if initramfs does
+not contain an /init program.  The fallback is legacy code, there to ensure a
+smooth transition and to allow early boot functionality to gradually move to
+"early userspace" (i.e. initramfs).
+The move to early userspace is necessary because finding and mounting the real
+root device is complex.  Root partitions can span multiple devices (raid or
+separate journal).  They can be out on the network (requiring dhcp, setting a
+specific mac address, logging into a server, etc).  They can live on removable
+media, with dynamically allocated major/minor numbers and persistent naming
+issues requiring a full udev implementation to sort out.  They can be
+compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned,
+and so on.
+This kind of complexity (which inevitably includes policy) is rightly handled
+in userspace.  Both klibc and busybox/uClibc are working on simple initramfs
+packages to drop into a kernel build, and when standard solutions are ready
+and widely deployed, the kernel's legacy early boot code will become obsolete
+and a candidate for the feature removal schedule.
+But that's a while off yet.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f042c12e0ed2..ee4c0a8b8db7 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -3,7 +3,7 @@
        Original author: Richard Gooch <rgooch@atnf.csiro.au>
-                  Last updated on August 25, 2005
+                  Last updated on October 28, 2005
  Copyright (C) 1999 Richard Gooch
  Copyright (C) 2005 Pekka Enberg
@@ -11,62 +11,61 @@
  This file is released under the GPLv2.
-What is it?
+Introduction
-===========
+============
-The Virtual File System (otherwise known as the Virtual Filesystem
+The Virtual File System (also known as the Virtual Filesystem Switch)
-Switch) is the software layer in the kernel that provides the
+is the software layer in the kernel that provides the filesystem
-filesystem interface to userspace programs. It also provides an
+interface to userspace programs. It also provides an abstraction
-abstraction within the kernel which allows different filesystem
+within the kernel which allows different filesystem implementations to
-implementations to coexist.
+coexist.
+VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
+on are called from a process context. Filesystem locking is described
+in the document Documentation/filesystems/Locking.
-A Quick Look At How It Works
-============================
-In this section I'll briefly describe how things work, before
+Directory Entry Cache (dcache)
-launching into the details. I'll start with describing what happens
+------------------------------
-when user programs open and manipulate files, and then look from the
-other view which is how a filesystem is supported and subsequently
-mounted.
-Opening a File
--------------
-The VFS implements the open(2), stat(2), chmod(2) and similar system
-calls. The pathname argument is used by the VFS to search through the
-directory entry cache (dentry cache or "dcache"). This provides a very
-fast look-up mechanism to translate a pathname (filename) into a
-specific dentry.
-An individual dentry usually has a pointer to an inode. Inodes are the
-things that live on disc drives, and can be regular files (you know:
-those things that you write data into), directories, FIFOs and other
-beasts. Dentries live in RAM and are never saved to disc: they exist
-only for performance. Inodes live on disc and are copied into memory
-when required. Later any changes are written back to disc. The inode
-that lives in RAM is a VFS inode, and it is this which the dentry
-points to. A single inode can be pointed to by multiple dentries
-(think about hardlinks).
-The dcache is meant to be a view into your entire filespace. Unlike
-Linus, most of us losers can't fit enough dentries into RAM to cover
-all of our filespace, so the dcache has bits missing. In order to
-resolve your pathname into a dentry, the VFS may have to resort to
-creating dentries along the way, and then loading the inode. This is
-done by looking up the inode.
-To look up an inode (usually read from disc) requires that the VFS
-calls the lookup() method of the parent directory inode. This method
-is installed by the specific filesystem implementation that the inode
-lives in. There will be more on this later.
-Once the VFS has the required dentry (and hence the inode), we can do
+The VFS implements the open(2), stat(2), chmod(2), and similar system
-all those boring things like open(2) the file, or stat(2) it to peek
+calls. The pathname argument that is passed to them is used by the VFS
-at the inode data. The stat(2) operation is fairly simple: once the
+to search through the directory entry cache (also known as the dentry
-VFS has the dentry, it peeks at the inode data and passes some of it
+cache or dcache). This provides a very fast look-up mechanism to
-back to userspace.
+translate a pathname (filename) into a specific dentry. Dentries live
+in RAM and are never saved to disc: they exist only for performance.
+The dentry cache is meant to be a view into your entire filespace. As
+most computers cannot fit all dentries in the RAM at the same time,
+some bits of the cache are missing. In order to resolve your pathname
+into a dentry, the VFS may have to resort to creating dentries along
+the way, and then loading the inode. This is done by looking up the
+inode.
+The Inode Object
+----------------
+An individual dentry usually has a pointer to an inode. Inodes are
+filesystem objects such as regular files, directories, FIFOs and other
+beasts.  They live either on the disc (for block device filesystems)
+or in the memory (for pseudo filesystems). Inodes that live on the
+disc are copied into the memory when required and changes to the inode
+are written back to disc. A single inode can be pointed to by multiple
+dentries (hard links, for example, do this).
+To look up an inode requires that the VFS calls the lookup() method of
+the parent directory inode. This method is installed by the specific
+filesystem implementation that the inode lives in. Once the VFS has
+the required dentry (and hence the inode), we can do all those boring
+things like open(2) the file, or stat(2) it to peek at the inode
+data. The stat(2) operation is fairly simple: once the VFS has the
+dentry, it peeks at the inode data and passes some of it back to
+userspace.
+The File Object
+---------------
 Opening a file requires another operation: allocation of a file
 structure (this is the kernel-side implementation of file
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with
 a pointer to the dentry and a set of file operation member functions.
 These are taken from the inode data. The open() file method is then
 called so the specific filesystem implementation can do it's work. You
-can see that this is another switch performed by the VFS.
+can see that this is another switch performed by the VFS. The file
+structure is placed into the file descriptor table for the process.
-The file structure is placed into the file descriptor table for the
-process.
 Reading, writing and closing files (and other assorted VFS operations)
 is done by using the userspace file descriptor to grab the appropriate
-file structure, and then calling the required file structure method
+file structure, and then calling the required file structure method to
-function to do whatever is required.
+do whatever is required. For as long as the file is open, it keeps the
+dentry in use, which in turn means that the VFS inode is still in use.
-For as long as the file is open, it keeps the dentry "open" (in use),
-which in turn means that the VFS inode is still in use.
-All VFS system calls (i.e. open(2), stat(2), read(2), write(2),
-chmod(2) and so on) are called from a process context. You should
-assume that these calls are made without any kernel locks being
-held. This means that the processes may be executing the same piece of
-filesystem or driver code at the same time, on different
-processors. You should ensure that access to shared resources is
-protected by appropriate locks.
 Registering and Mounting a Filesystem
-------------------------------------
+=====================================
-If you want to support a new kind of filesystem in the kernel, all you
+To register and unregister a filesystem, use the following API
-need to do is call register_filesystem(). You pass a structure
+functions:
-describing the filesystem implementation (struct file_system_type)
-which is then added to an internal table of supported filesystems. You
-can do:
-% cat /proc/filesystems
+   #include <linux/fs.h>
-to see what filesystems are currently available on your system.
+   extern int register_filesystem(struct file_system_type *);
+   extern int unregister_filesystem(struct file_system_type *);
-When a request is made to mount a block device onto a directory in
+The passed struct file_system_type describes your filesystem. When a
-your filespace the VFS will call the appropriate method for the
+request is made to mount a device onto a directory in your filespace,
-specific filesystem. The dentry for the mount point will then be
+the VFS will call the appropriate get_sb() method for the specific
-updated to point to the root inode for the new filesystem.
+filesystem. The dentry for the mount point will then be updated to
+point to the root inode for the new filesystem.
-It's now time to look at things in more detail.
+You can see all filesystems that are registered to the kernel in the
+file /proc/filesystems.
 struct file_system_type
-=======================
+-----------------------
 This describes the filesystem. As of kernel 2.6.13, the following
 members are defined:
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments:
  int silent: whether or not to be silent on error
+The Superblock Object
+=====================
+A superblock object represents a mounted filesystem.
 struct super_operations
-=======================
+-----------------------
 This describes how the VFS can manipulate the superblock of your
 filesystem. As of kernel 2.6.13, the following members are defined:
@@ -286,9 +279,9 @@ or bottom half).
        a superblock. The second parameter indicates whether the method
        should wait until the write out has been completed. Optional.
-  write_super_lockfs: called when VFS is locking a filesystem and forcing
+  write_super_lockfs: called when VFS is locking a filesystem and
-        it into a consistent state.  This function is currently used by the
+        forcing it into a consistent state.  This method is currently
-        Logical Volume Manager (LVM).
+        used by the Logical Volume Manager (LVM).
  unlockfs: called when VFS is unlocking a filesystem and making it writable
        again.
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which
 describes the methods that can be performed on individual inodes.
+The Inode Object
+================
+An inode object represents an object within the filesystem.
 struct inode_operations
-=======================
+-----------------------
 This describes how the VFS can manipulate an inode in your
 filesystem. As of kernel 2.6.13, the following members are defined:
@@ -394,51 +393,62 @@ otherwise noted.
        will probably need to call d_instantiate() just as you would
        in the create() method
+  rename: called by the rename(2) system call to rename the object to
+        have the parent and name given by the second inode and dentry.
  readlink: called by the readlink(2) system call. Only required if
        you want to support reading symbolic links
  follow_link: called by the VFS to follow a symbolic link to the
        inode it points to.  Only required if you want to support
-        symbolic links.  This function returns a void pointer cookie
+        symbolic links.  This method returns a void pointer cookie
        that is passed to put_link().
  put_link: called by the VFS to release resources allocated by
-        follow_link().  The cookie returned by follow_link() is passed to
+        follow_link().  The cookie returned by follow_link() is passed
-        to this function as the last parameter.  It is used by filesystems
+        to this method as the last parameter.  It is used by
-        such as NFS where page cache is not stable (i.e. page that was
+        filesystems such as NFS where page cache is not stable
-        installed when the symbolic link walk started might not be in the
+        (i.e. page that was installed when the symbolic link walk
-        page cache at the end of the walk).
+        started might not be in the page cache at the end of the
+        walk).
-  truncate: called by the VFS to change the size of a file.  The i_size
-        field of the inode is set to the desired size by the VFS before
+  truncate: called by the VFS to change the size of a file.  The
-        this function is called.  This function is called by the truncate(2)
+        i_size field of the inode is set to the desired size by the
-        system call and related functionality.
+        VFS before this method is called.  This method is called by
+        the truncate(2) system call and related functionality.
  permission: called by the VFS to check for access rights on a POSIX-like
        filesystem.
-  setattr: called by the VFS to set attributes for a file.  This function is
+  setattr: called by the VFS to set attributes for a file. This method
-        called by chmod(2) and related system calls.
+        is called by chmod(2) and related system calls.
-  getattr: called by the VFS to get attributes of a file.  This function is
+  getattr: called by the VFS to get attributes of a file. This method
-        called by stat(2) and related system calls.
+        is called by stat(2) and related system calls.
  setxattr: called by the VFS to set an extended attribute for a file.
-        Extended attribute is a name:value pair associated with an inode. This
+        An extended attribute is a name:value pair associated with an
-        function is called by setxattr(2) system call.
+        inode. This method is called by setxattr(2) system call.
+  getxattr: called by the VFS to retrieve the value of an extended
+        attribute name. This method is called by getxattr(2) system
+        call.
-  getxattr: called by the VFS to retrieve the value of an extended attribute
+  listxattr: called by the VFS to list all extended attributes for a
-        name.  This function is called by getxattr(2) function call.
+        given file. This method is called by listxattr(2) system call.
-  listxattr: called by the VFS to list all extended attributes for a given
+  removexattr: called by the VFS to remove an extended attribute from
-        file.  This function is called by listxattr(2) system call.
+        a file. This method is called by removexattr(2) system call.
-  removexattr: called by the VFS to remove an extended attribute from a file.
-        This function is called by removexattr(2) system call.
+The Address Space Object
+========================
+The address space object is used to identify pages in the page cache.
 struct address_space_operations
-===============================
+-------------------------------
 This describes how the VFS can manipulate mapping of a file to page cache in
 your filesystem. As of kernel 2.6.13, the following members are defined:
@@ -502,8 +512,14 @@ struct address_space_operations {
        it.  An example implementation can be found in fs/ext2/xip.c.
+The File Object
+===============
+A file object represents a file opened by a process.
 struct file_operations
-======================
+----------------------
 This describes how the VFS can manipulate an open file. As of kernel
 2.6.13, the following members are defined:
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a
 directory.
-Directory Entry Cache APIs
+Directory Entry Cache API
 --------------------------
 There are a number of functions defined which permit a filesystem to
@@ -705,178 +721,24 @@ manipulate dentries:
        and the dentry is returned. The caller must use d_put()
        to free the dentry when it finishes using it.
+For further information on dentry locking, please refer to the document
+Documentation/filesystems/dentry-locking.txt.
-RCU-based dcache locking model
------------------------------
-On many workloads, the most common operation on dcache is
+Resources
-to look up a dentry, given a parent dentry and the name
+=========
-of the child. Typically, for every open(), stat() etc.,
-the dentry corresponding to the pathname will be looked
+(Note some of these resources are not up-to-date with the latest kernel
-up by walking the tree starting with the first component
+ version.)
-of the pathname and using that dentry along with the next
-component to look up the next level and so on. Since it
+Creating Linux virtual filesystems. 2002
-is a frequent operation for workloads like multiuser
+    <http://lwn.net/Articles/13325/>
-environments and web servers, it is important to optimize
-this path.
+The Linux Virtual File-system Layer by Neil Brown. 1999
+    <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
-Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
-in every component during path look-up. Since 2.5.10 onwards,
+A tour of the Linux VFS by Michael K. Johnson. 1996
-fast-walk algorithm changed this by holding the dcache_lock
+    <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
-at the beginning and walking as many cached path component
-dentries as possible. This significantly decreases the number
-of acquisition of dcache_lock. However it also increases the
-lock hold time significantly and affects performance in large
-SMP machines. Since 2.5.62 kernel, dcache has been using
-a new locking model that uses RCU to make dcache look-up
-lock-free.
-The current dcache locking model is not very different from the existing
-dcache locking model. Prior to 2.5.62 kernel, dcache_lock
-protected the hash chain, d_child, d_alias, d_lru lists as well
-as d_inode and several other things like mount look-up. RCU-based
-changes affect only the way the hash chain is protected. For everything
-else the dcache_lock must be taken for both traversing as well as
-updating. The hash chain updates too take the dcache_lock.
-The significant change is the way d_lookup traverses the hash chain,
-it doesn't acquire the dcache_lock for this and rely on RCU to
-ensure that the dentry has not been *freed*.
-Dcache locking details
----------------------
-For many multi-user workloads, open() and stat() on files are
+A small trail through the Linux kernel by Andries Brouwer. 2001
-very frequently occurring operations. Both involve walking
+    <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
-of path names to find the dentry corresponding to the
-concerned file. In 2.4 kernel, dcache_lock was held
-during look-up of each path component. Contention and
-cache-line bouncing of this global lock caused significant
-scalability problems. With the introduction of RCU
-in Linux kernel, this was worked around by making
-the look-up of path components during path walking lock-free.
-Safe lock-free look-up of dcache hash table
-===========================================
-Dcache is a complex data structure with the hash table entries
-also linked together in other lists. In 2.4 kernel, dcache_lock
-protected all the lists. We applied RCU only on hash chain
-walking. The rest of the lists are still protected by dcache_lock.
-Some of the important changes are :
-1. The deletion from hash chain is done using hlist_del_rcu() macro which
-   doesn't initialize next pointer of the deleted dentry and this
-   allows us to walk safely lock-free while a deletion is happening.
-2. Insertion of a dentry into the hash table is done using
-   hlist_add_head_rcu() which take care of ordering the writes -
-   the writes to the dentry must be visible before the dentry
-   is inserted. This works in conjunction with hlist_for_each_rcu()
-   while walking the hash chain. The only requirement is that
-   all initialization to the dentry must be done before hlist_add_head_rcu()
-   since we don't have dcache_lock protection while traversing
-   the hash chain. This isn't different from the existing code.
-3. The dentry looked up without holding dcache_lock by cannot be
-   returned for walking if it is unhashed. It then may have a NULL
-   d_inode or other bogosity since RCU doesn't protect the other
-   fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
-   indicate unhashed  dentries and use this in conjunction with a
-   per-dentry lock (d_lock). Once looked up without the dcache_lock,
-   we acquire the per-dentry lock (d_lock) and check if the
-   dentry is unhashed. If so, the look-up is failed. If not, the
-   reference count of the dentry is increased and the dentry is returned.
-4. Once a dentry is looked up, it must be ensured during the path
-   walk for that component it doesn't go away. In pre-2.5.10 code,
-   this was done holding a reference to the dentry. dcache_rcu does
-   the same.  In some sense, dcache_rcu path walking looks like
-   the pre-2.5.10 version.
-5. All dentry hash chain updates must take the dcache_lock as well as
-   the per-dentry lock in that order. dput() does this to ensure
-   that a dentry that has just been looked up in another CPU
-   doesn't get deleted before dget() can be done on it.
-6. There are several ways to do reference counting of RCU protected
-   objects. One such example is in ipv4 route cache where
-   deferred freeing (using call_rcu()) is done as soon as
-   the reference count goes to zero. This cannot be done in
-   the case of dentries because tearing down of dentries
-   require blocking (dentry_iput()) which isn't supported from
-   RCU callbacks. Instead, tearing down of dentries happen
-   synchronously in dput(), but actual freeing happens later
-   when RCU grace period is over. This allows safe lock-free
-   walking of the hash chains, but a matched dentry may have
-   been partially torn down. The checking of DCACHE_UNHASHED
-   flag with d_lock held detects such dentries and prevents
-   them from being returned from look-up.
-Maintaining POSIX rename semantics
-==================================
-Since look-up of dentries is lock-free, it can race against
-a concurrent rename operation. For example, during rename
-of file A to B, look-up of either A or B must succeed.
-So, if look-up of B happens after A has been removed from the
-hash chain but not added to the new hash chain, it may fail.
-Also, a comparison while the name is being written concurrently
-by a rename may result in false positive matches violating
-rename semantics.  Issues related to race with rename are
-handled as described below :
-1. Look-up can be done in two ways - d_lookup() which is safe
-   from simultaneous renames and __d_lookup() which is not.
-   If __d_lookup() fails, it must be followed up by a d_lookup()
-   to correctly determine whether a dentry is in the hash table
-   or not. d_lookup() protects look-ups using a sequence
-   lock (rename_lock).
-2. The name associated with a dentry (d_name) may be changed if
-   a rename is allowed to happen simultaneously. To avoid memcmp()
-   in __d_lookup() go out of bounds due to a rename and false
-   positive comparison, the name comparison is done while holding the
-   per-dentry lock. This prevents concurrent renames during this
-   operation.
-3. Hash table walking during look-up may move to a different bucket as
-   the current dentry is moved to a different bucket due to rename.
-   But we use hlists in dcache hash table and they are null-terminated.
-   So, even if a dentry moves to a different bucket, hash chain
-   walk will terminate. [with a list_head list, it may not since
-   termination is when the list_head in the original bucket is reached].
-   Since we redo the d_parent check and compare name while holding
-   d_lock, lock-free look-up will not race against d_move().
-4. There can be a theoretical race when a dentry keeps coming back
-   to original bucket due to double moves. Due to this look-up may
-   consider that it has never moved and can end up in a infinite loop.
-   But this is not any worse that theoretical livelocks we already
-   have in the kernel.
-Important guidelines for filesystem developers related to dcache_rcu
-====================================================================
-1. Existing dcache interfaces (pre-2.5.62) exported to filesystem
-   don't change. Only dcache internal implementation changes. However
-   filesystems *must not* delete from the dentry hash chains directly
-   using the list macros like allowed earlier. They must use dcache
-   APIs like d_drop() or __d_drop() depending on the situation.
-2. d_flags is now protected by a per-dentry lock (d_lock). All
-   access to d_flags must be protected by it.
-3. For a hashed dentry, checking of d_count needs to be protected
-   by d_lock.
-Papers and other documentation on dcache locking
-================================================
-1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
-2. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index c7d5d0c7067d..74aeb142ae5f 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -19,15 +19,43 @@ Mount Options
 When mounting an XFS filesystem, the following options are accepted.
-  biosize=size
+  allocsize=size
-        Sets the preferred buffered I/O size (default size is 64K).
+        Sets the buffered I/O end-of-file preallocation size when
-        "size" must be expressed as the logarithm (base2) of the
+        doing delayed allocation writeout (default size is 64KiB).
-        desired I/O size.
+        Valid values for this option are page size (typically 4KiB)
-        Valid values for this option are 14 through 16, inclusive
+        through to 1GiB, inclusive, in power-of-2 increments.
-        (i.e. 16K, 32K, and 64K bytes).  On machines with a 4K
-        pagesize, 13 (8K bytes) is also a valid size.
+  attr2/noattr2
-        The preferred buffered I/O size can also be altered on an
+        The options enable/disable (default is disabled for backward
-        individual file basis using the ioctl(2) system call.
+        compatibility on-disk) an "opportunistic" improvement to be
+        made in the way inline extended attributes are stored on-disk.
+        When the new form is used for the first time (by setting or
+        removing extended attributes) the on-disk superblock feature
+        bit field will be updated to reflect this format being in use.
+  barrier
+        Enables the use of block layer write barriers for writes into
+        the journal and unwritten extent conversion.  This allows for
+        drive level write caching to be enabled, for devices that
+        support write barriers.
+  dmapi
+        Enable the DMAPI (Data Management API) event callouts.
+        Use with the "mtpt" option.
+  grpid/bsdgroups and nogrpid/sysvgroups
+        These options define what group ID a newly created file gets.
+        When grpid is set, it takes the group ID of the directory in
+        which it is created; otherwise (the default) it takes the fsgid
+        of the current process, unless the directory has the setgid bit
+        set, in which case it takes the gid from the parent directory,
+        and also gets the setgid bit set if it is a directory itself.
+  ihashsize=value
+        Sets the number of hash buckets available for hashing the
+        in-memory inodes of the specified mount point.  If a value
+        of zero is used, the value selected by the default algorithm
+        will be displayed in /proc/mounts.
  ikeep/noikeep
        When inode clusters are emptied of inodes, keep them around
@@ -35,12 +63,31 @@ When mounting an XFS filesystem, the following options are accepted.
        and is still the default for now.  Using the noikeep option,
        inode clusters are returned to the free space pool.
+  inode64
+        Indicates that XFS is allowed to create inodes at any location
+        in the filesystem, including those which will result in inode
+        numbers occupying more than 32 bits of significance.  This is
+        provided for backwards compatibility, but causes problems for
+        backup applications that cannot handle large inode numbers.
+  largeio/nolargeio
+        If "nolargeio" is specified, the optimal I/O reported in
+        st_blksize by stat(2) will be as small as possible to allow user
+        applications to avoid inefficient read/modify/write I/O.
+        If "largeio" is specified, a filesystem that has a "swidth" specified
+        will return the "swidth" value (in bytes) in st_blksize. If the
+        filesystem does not have a "swidth" specified but does specify
+        an "allocsize" then "allocsize" (in bytes) will be returned
+        instead.
+        If neither of these two options is specified, then the filesystem
+        will behave as if "nolargeio" was specified.
  logbufs=value
        Set the number of in-memory log buffers.  Valid numbers range
        from 2-8 inclusive.
        The default value is 8 buffers for filesystems with a
-        blocksize of 64K, 4 buffers for filesystems with a blocksize
+        blocksize of 64KiB, 4 buffers for filesystems with a blocksize
-        of 32K, 3 buffers for filesystems with a blocksize of 16K
+        of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
        and 2 buffers for all other configurations.  Increasing the
        number of buffers may increase performance on some workloads
        at the cost of the memory used for the additional log buffers
@@ -49,10 +96,10 @@ When mounting an XFS filesystem, the following options are accepted.
  logbsize=value
        Set the size of each in-memory log buffer.
        Size may be specified in bytes, or in kilobytes with a "k" suffix.
-        Valid sizes for version 1 and version 2 logs are 16384 (16k) and 
+        Valid sizes for version 1 and version 2 logs are 16384 (16k) and
-        32768 (32k).  Valid sizes for version 2 logs also include 
+        32768 (32k).  Valid sizes for version 2 logs also include
        65536 (64k), 131072 (128k) and 262144 (256k).
-        The default value for machines with more than 32MB of memory
+        The default value for machines with more than 32MiB of memory
        is 32768, machines with less memory use 16384 by default.
  logdev=device and rtdev=device
@@ -62,6 +109,11 @@ When mounting an XFS filesystem, the following options are accepted.
        optional, and the log section can be separate from the data
        section or contained within it.
+  mtpt=mountpoint
+        Use with the "dmapi" option.  The value specified here will be
+        included in the DMAPI mount event, and should be the path of
+        the actual mountpoint that is used.
  noalign
        Data allocations will not be aligned at stripe unit boundaries.
@@ -91,13 +143,17 @@ When mounting an XFS filesystem, the following options are accepted.
        O_SYNC writes can be lost if the system crashes.
        If timestamp updates are critical, use the osyncisosync option.
-  quota/usrquota/uqnoenforce
+  uquota/usrquota/uqnoenforce/quota
        User disk quota accounting enabled, and limits (optionally)
-        enforced.
+        enforced.  Refer to xfs_quota(8) for further details.
-  grpquota/gqnoenforce
+  gquota/grpquota/gqnoenforce
        Group disk quota accounting enabled and limits (optionally)
-        enforced.
+        enforced.  Refer to xfs_quota(8) for further details.
+  pquota/prjquota/pqnoenforce
+        Project disk quota accounting enabled and limits (optionally)
+        enforced.  Refer to xfs_quota(8) for further details.
  sunit=value and swidth=value
        Used to specify the stripe unit and width for a RAID device or
@@ -113,15 +169,21 @@ When mounting an XFS filesystem, the following options are accepted.
        The "swidth" option is required if the "sunit" option has been
        specified, and must be a multiple of the "sunit" value.
+  swalloc
+        Data allocations will be rounded up to stripe width boundaries
+        when the current end of file is being extended and the file
+        size is larger than the stripe width size.
 sysctls
 =======
 The following sysctls are available for the XFS filesystem:
  fs.xfs.stats_clear            (Min: 0  Default: 0  Max: 1)
-        Setting this to "1" clears accumulated XFS statistics 
+        Setting this to "1" clears accumulated XFS statistics
        in /proc/fs/xfs/stat.  It then immediately resets to "0".
-  
  fs.xfs.xfssyncd_centisecs     (Min: 100  Default: 3000  Max: 720000)
        The interval at which the xfssyncd thread flushes metadata
        out to disk.  This thread will flush log activity out, and
@@ -143,9 +205,9 @@ The following sysctls are available for the XFS filesystem:
                XFS_ERRLEVEL_HIGH:      5
  fs.xfs.panic_mask             (Min: 0  Default: 0  Max: 127)
-        Causes certain error conditions to call BUG(). Value is a bitmask; 
+        Causes certain error conditions to call BUG(). Value is a bitmask;
        AND together the tags which represent errors which should cause panics:
-        
                XFS_NO_PTAG                     0
                XFS_PTAG_IFLUSH                 0x00000001
                XFS_PTAG_LOGRES                 0x00000002
@@ -155,7 +217,7 @@ The following sysctls are available for the XFS filesystem:
                XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
                XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
-        This option is intended for debugging only.             
+        This option is intended for debugging only.
  fs.xfs.irix_symlink_mode      (Min: 0  Default: 0  Max: 1)
        Controls whether symlinks are created with mode 0777 (default)
@@ -164,25 +226,37 @@ The following sysctls are available for the XFS filesystem:
  fs.xfs.irix_sgid_inherit      (Min: 0  Default: 0  Max: 1)
        Controls files created in SGID directories.
        If the group ID of the new file does not match the effective group
-        ID or one of the supplementary group IDs of the parent dir, the 
+        ID or one of the supplementary group IDs of the parent dir, the
-        ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl 
+        ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
        is set.
  fs.xfs.restrict_chown         (Min: 0  Default: 1  Max: 1)
        Controls whether unprivileged users can use chown to "give away"
        a file to another user.
-  fs.xfs.inherit_sync           (Min: 0  Default: 1  Max 1)
+  fs.xfs.inherit_sync           (Min: 0  Default: 1  Max: 1)
-        Setting this to "1" will cause the "sync" flag set 
+        Setting this to "1" will cause the "sync" flag set
-        by the chattr(1) command on a directory to be
+        by the xfs_io(8) chattr command on a directory to be
        inherited by files in that directory.
-  fs.xfs.inherit_nodump         (Min: 0  Default: 1  Max 1)
+  fs.xfs.inherit_nodump         (Min: 0  Default: 1  Max: 1)
-        Setting this to "1" will cause the "nodump" flag set 
+        Setting this to "1" will cause the "nodump" flag set
-        by the chattr(1) command on a directory to be
+        by the xfs_io(8) chattr command on a directory to be
        inherited by files in that directory.
-  fs.xfs.inherit_noatime        (Min: 0  Default: 1  Max 1)
+  fs.xfs.inherit_noatime        (Min: 0  Default: 1  Max: 1)
-        Setting this to "1" will cause the "noatime" flag set 
+        Setting this to "1" will cause the "noatime" flag set
-        by the chattr(1) command on a directory to be
+        by the xfs_io(8) chattr command on a directory to be
        inherited by files in that directory.
+  fs.xfs.inherit_nosymlinks     (Min: 0  Default: 1  Max: 1)
+        Setting this to "1" will cause the "nosymlinks" flag set
+        by the xfs_io(8) chattr command on a directory to be
+        inherited by files in that directory.
+  fs.xfs.rotorstep              (Min: 1  Default: 1  Max: 256)
+        In "inode32" allocation mode, this option determines how many
+        files the allocator attempts to allocate in the same allocation
+        group before moving to the next allocation group.  The intent
+        is to control the rate at which the allocator moves between
+        allocation groups when allocating extents for new files.