1 files changed, 148 insertions, 286 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f042c12e0ed2..ee4c0a8b8db7 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -3,7 +3,7 @@
        Original author: Richard Gooch <rgooch@atnf.csiro.au>
-                  Last updated on August 25, 2005
+                  Last updated on October 28, 2005
  Copyright (C) 1999 Richard Gooch
  Copyright (C) 2005 Pekka Enberg
@@ -11,62 +11,61 @@
  This file is released under the GPLv2.
-What is it?
+Introduction
-===========
+============
-The Virtual File System (otherwise known as the Virtual Filesystem
+The Virtual File System (also known as the Virtual Filesystem Switch)
-Switch) is the software layer in the kernel that provides the
+is the software layer in the kernel that provides the filesystem
-filesystem interface to userspace programs. It also provides an
+interface to userspace programs. It also provides an abstraction
-abstraction within the kernel which allows different filesystem
+within the kernel which allows different filesystem implementations to
-implementations to coexist.
+coexist.
+VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
+on are called from a process context. Filesystem locking is described
+in the document Documentation/filesystems/Locking.
-A Quick Look At How It Works
-============================
-In this section I'll briefly describe how things work, before
+Directory Entry Cache (dcache)
-launching into the details. I'll start with describing what happens
+------------------------------
-when user programs open and manipulate files, and then look from the
-other view which is how a filesystem is supported and subsequently
-mounted.
-Opening a File
--------------
-The VFS implements the open(2), stat(2), chmod(2) and similar system
-calls. The pathname argument is used by the VFS to search through the
-directory entry cache (dentry cache or "dcache"). This provides a very
-fast look-up mechanism to translate a pathname (filename) into a
-specific dentry.
-An individual dentry usually has a pointer to an inode. Inodes are the
-things that live on disc drives, and can be regular files (you know:
-those things that you write data into), directories, FIFOs and other
-beasts. Dentries live in RAM and are never saved to disc: they exist
-only for performance. Inodes live on disc and are copied into memory
-when required. Later any changes are written back to disc. The inode
-that lives in RAM is a VFS inode, and it is this which the dentry
-points to. A single inode can be pointed to by multiple dentries
-(think about hardlinks).
-The dcache is meant to be a view into your entire filespace. Unlike
-Linus, most of us losers can't fit enough dentries into RAM to cover
-all of our filespace, so the dcache has bits missing. In order to
-resolve your pathname into a dentry, the VFS may have to resort to
-creating dentries along the way, and then loading the inode. This is
-done by looking up the inode.
-To look up an inode (usually read from disc) requires that the VFS
-calls the lookup() method of the parent directory inode. This method
-is installed by the specific filesystem implementation that the inode
-lives in. There will be more on this later.
-Once the VFS has the required dentry (and hence the inode), we can do
+The VFS implements the open(2), stat(2), chmod(2), and similar system
-all those boring things like open(2) the file, or stat(2) it to peek
+calls. The pathname argument that is passed to them is used by the VFS
-at the inode data. The stat(2) operation is fairly simple: once the
+to search through the directory entry cache (also known as the dentry
-VFS has the dentry, it peeks at the inode data and passes some of it
+cache or dcache). This provides a very fast look-up mechanism to
-back to userspace.
+translate a pathname (filename) into a specific dentry. Dentries live
+in RAM and are never saved to disc: they exist only for performance.
+The dentry cache is meant to be a view into your entire filespace. As
+most computers cannot fit all dentries in RAM at the same time,
+some bits of the cache are missing. In order to resolve your pathname
+into a dentry, the VFS may have to resort to creating dentries along
+the way, and then loading the inode. This is done by looking up the
+inode.
+The Inode Object
+----------------
+An individual dentry usually has a pointer to an inode. Inodes are
+filesystem objects such as regular files, directories, FIFOs and other
+beasts.  They live either on disc (for block device filesystems)
+or in memory (for pseudo filesystems). Inodes that live on disc
+are copied into memory when required and changes to the inode
+are written back to disc. A single inode can be pointed to by multiple
+dentries (hard links, for example, do this).
+To look up an inode requires that the VFS calls the lookup() method of
+the parent directory inode. This method is installed by the specific
+filesystem implementation that the inode lives in. Once the VFS has
+the required dentry (and hence the inode), we can do all those boring
+things like open(2) the file, or stat(2) it to peek at the inode
+data. The stat(2) operation is fairly simple: once the VFS has the
+dentry, it peeks at the inode data and passes some of it back to
+userspace.
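The lookup path described in the two sections above can be modelled in userspace. The sketch below is not kernel code: every toy_* and demo_* name is hypothetical, the structures are much smaller than the real struct dentry and struct inode, and locking and reference counting are left out. It only shows the order of operations: check the cache for each pathname component, and on a miss call the lookup() method of the parent directory inode and cache the result as a new dentry.

```c
/* Userspace model only, not kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_inode {
    int ino;
    /* Installed by the "filesystem": find child `name` in directory `dir`. */
    struct toy_inode *(*lookup)(struct toy_inode *dir, const char *name);
};

struct toy_dentry {
    char name[32];
    struct toy_dentry *parent;
    struct toy_dentry *children;  /* cached children of this directory */
    struct toy_dentry *next;      /* next sibling in the parent's list */
    struct toy_inode *inode;
};

static int toy_lookup_calls;      /* counts cache misses that reached lookup() */

/* Resolve one component: use the cache, else ask the parent inode. */
static struct toy_dentry *toy_walk_component(struct toy_dentry *dir,
                                             const char *name)
{
    struct toy_dentry *d;
    struct toy_inode *inode;

    for (d = dir->children; d; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;                         /* cache hit */

    if (!dir->inode || !dir->inode->lookup)
        return NULL;                          /* not a directory */
    toy_lookup_calls++;
    inode = dir->inode->lookup(dir->inode, name);
    if (!inode)
        return NULL;                          /* no such entry */

    d = calloc(1, sizeof(*d));                /* create the missing dentry */
    if (!d)
        return NULL;
    strncpy(d->name, name, sizeof(d->name) - 1);
    d->parent = dir;
    d->inode = inode;
    d->next = dir->children;
    dir->children = d;
    return d;
}

/* Translate a pathname into a dentry, one component at a time. */
struct toy_dentry *toy_path_walk(struct toy_dentry *root, const char *path)
{
    char buf[256];
    struct toy_dentry *d = root;
    char *comp;

    assert(strlen(path) < sizeof(buf));
    strcpy(buf, path);
    for (comp = strtok(buf, "/"); comp && d; comp = strtok(NULL, "/"))
        d = toy_walk_component(d, comp);
    return d;
}

/* A tiny demo "filesystem": /etc/passwd, stored as a static table. */
static struct toy_inode *demo_lookup(struct toy_inode *dir, const char *name);

static struct toy_inode demo_inodes[] = {
    { 1, demo_lookup },   /* "/"           */
    { 2, demo_lookup },   /* "/etc"        */
    { 3, NULL },          /* "/etc/passwd" */
};

static const struct { int parent; const char *name; int child; } demo_entries[] = {
    { 1, "etc", 2 },
    { 2, "passwd", 3 },
};

static struct toy_inode *demo_lookup(struct toy_inode *dir, const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(demo_entries) / sizeof(demo_entries[0]); i++)
        if (demo_entries[i].parent == dir->ino &&
            strcmp(demo_entries[i].name, name) == 0)
            return &demo_inodes[demo_entries[i].child - 1];
    return NULL;
}
```

Walking "/etc/passwd" from a root dentry that points at demo_inodes[0] calls demo_lookup() once per component the first time; a second walk of the same path is served from the cached dentries without calling lookup() again.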
+The File Object
+---------------
 Opening a file requires another operation: allocation of a file
 structure (this is the kernel-side implementation of file
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with
 a pointer to the dentry and a set of file operation member functions.
 These are taken from the inode data. The open() file method is then
 called so the specific filesystem implementation can do it's work. You
-can see that this is another switch performed by the VFS.
+can see that this is another switch performed by the VFS. The file
+structure is placed into the file descriptor table for the process.
-The file structure is placed into the file descriptor table for the
-process.
 Reading, writing and closing files (and other assorted VFS operations)
 is done by using the userspace file descriptor to grab the appropriate
-file structure, and then calling the required file structure method
+file structure, and then calling the required file structure method to
-function to do whatever is required.
+do whatever is required. For as long as the file is open, it keeps the
+dentry in use, which in turn means that the VFS inode is still in use.
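The "switch" described in this section can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: the toy_* and demo_* names are hypothetical, the file descriptor table is a single global array rather than per-process state, and error handling is reduced to returning -1. It shows a file structure being initialized with a pointer to the dentry and the operations taken from the inode, the open() method being called, and a later read going through the descriptor to the file structure's method.

```c
/* Userspace model only, not kernel code. All names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct toy_file;

struct toy_file_operations {
    int  (*open)(struct toy_file *file);
    long (*read)(struct toy_file *file, char *buf, size_t len);
};

struct toy_inode {
    const struct toy_file_operations *i_fop;  /* supplied by the filesystem */
    const char *data;                         /* demo file contents */
    int open_count;
};

struct toy_dentry {
    struct toy_inode *inode;
};

struct toy_file {
    struct toy_dentry *dentry;                /* keeps the dentry in use */
    const struct toy_file_operations *f_op;   /* copied from the inode */
    size_t pos;
};

#define TOY_MAX_FDS 8
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* Allocate a file structure, call open(), install it in the fd table. */
int toy_open(struct toy_dentry *dentry)
{
    int fd;

    for (fd = 0; fd < TOY_MAX_FDS; fd++) {
        struct toy_file *file;

        if (toy_fd_table[fd])
            continue;                         /* slot already in use */
        file = calloc(1, sizeof(*file));
        if (!file)
            return -1;
        file->dentry = dentry;
        file->f_op = dentry->inode->i_fop;
        if (file->f_op->open && file->f_op->open(file) != 0) {
            free(file);
            return -1;
        }
        toy_fd_table[fd] = file;
        return fd;
    }
    return -1;                                /* table is full */
}

/* Use the descriptor to find the file structure, then call its method. */
long toy_read(int fd, char *buf, size_t len)
{
    struct toy_file *file;

    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -1;
    file = toy_fd_table[fd];
    if (!file->f_op->read)
        return -1;
    return file->f_op->read(file, buf, len);
}

int toy_close(int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -1;
    free(toy_fd_table[fd]);
    toy_fd_table[fd] = NULL;
    return 0;
}

/* Demo "filesystem" methods: read from a string stored in the inode. */
static int demo_open(struct toy_file *file)
{
    file->dentry->inode->open_count++;
    return 0;
}

static long demo_read(struct toy_file *file, char *buf, size_t len)
{
    const char *data = file->dentry->inode->data;
    size_t left = strlen(data) - file->pos;

    if (len > left)
        len = left;
    memcpy(buf, data + file->pos, len);
    file->pos += len;
    return (long)len;
}

const struct toy_file_operations demo_fops = { demo_open, demo_read };
```

The generic toy_open() and toy_read() never look at the file contents themselves; everything filesystem-specific happens behind the function pointers, which is the point the text makes about the VFS performing another switch.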
-For as long as the file is open, it keeps the dentry "open" (in use),
-which in turn means that the VFS inode is still in use.
-All VFS system calls (i.e. open(2), stat(2), read(2), write(2),
-chmod(2) and so on) are called from a process context. You should
-assume that these calls are made without any kernel locks being
-held. This means that the processes may be executing the same piece of
-filesystem or driver code at the same time, on different
-processors. You should ensure that access to shared resources is
-protected by appropriate locks.
 Registering and Mounting a Filesystem
-------------------------------------
+=====================================
-If you want to support a new kind of filesystem in the kernel, all you
+To register and unregister a filesystem, use the following API
-need to do is call register_filesystem(). You pass a structure
+functions:
-describing the filesystem implementation (struct file_system_type)
-which is then added to an internal table of supported filesystems. You
-can do:
-% cat /proc/filesystems
+   #include <linux/fs.h>
-to see what filesystems are currently available on your system.
+   extern int register_filesystem(struct file_system_type *);
+   extern int unregister_filesystem(struct file_system_type *);
-When a request is made to mount a block device onto a directory in
+The passed struct file_system_type describes your filesystem. When a
-your filespace the VFS will call the appropriate method for the
+request is made to mount a device onto a directory in your filespace,
-specific filesystem. The dentry for the mount point will then be
+the VFS will call the appropriate get_sb() method for the specific
-updated to point to the root inode for the new filesystem.
+filesystem. The dentry for the mount point will then be updated to
+point to the root inode for the new filesystem.
-It's now time to look at things in more detail.
+You can see all filesystems that are registered to the kernel in the
+file /proc/filesystems.
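As an illustration of the registration step, here is a userspace model of a table of filesystem types. It is a sketch only: the toy_* and demo_* names are hypothetical, the get_sb() signature is heavily simplified compared with the one in struct file_system_type, and the error codes are choices made for this model rather than a statement about what the kernel returns.

```c
/* Userspace model only, not kernel code. All names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    int (*get_sb)(const char *dev_name);   /* called at mount time */
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;   /* head of the table */

/*
 * Return the link that points at the entry called `name`, or the
 * terminating NULL link if no such entry exists.
 */
static struct toy_fs_type **toy_find_fs(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find_fs(fs->name);

    if (*p)
        return -EBUSY;          /* a type with this name already exists */
    fs->next = NULL;
    *p = fs;                    /* append at the end of the list */
    return 0;
}

int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find_fs(fs->name);

    if (*p != fs)
        return -EINVAL;         /* not registered */
    *p = fs->next;
    fs->next = NULL;
    return 0;
}

/* Mount request: find the type by name and call its get_sb() method. */
int toy_mount(const char *fstype, const char *dev_name)
{
    struct toy_fs_type *fs = *toy_find_fs(fstype);

    if (!fs)
        return -ENODEV;
    return fs->get_sb(dev_name);
}

/* Demo filesystem type that only counts how often it was mounted. */
static int demo_mounts;

static int demo_get_sb(const char *dev_name)
{
    (void)dev_name;
    demo_mounts++;
    return 0;
}

struct toy_fs_type demo_fs = { "demofs", demo_get_sb, NULL };
```

In this model a mount request for "demofs" fails until toy_register_filesystem(&demo_fs) has been called, and fails again once the type has been unregistered.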
 struct file_system_type
-=======================
+-----------------------
 This describes the filesystem. As of kernel 2.6.13, the following
 members are defined:
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments:
  int silent: whether or not to be silent on error
+The Superblock Object
+=====================
+A superblock object represents a mounted filesystem.
 struct super_operations
-=======================
+-----------------------
 This describes how the VFS can manipulate the superblock of your
 filesystem. As of kernel 2.6.13, the following members are defined:
@@ -286,9 +279,9 @@ or bottom half).
        a superblock. The second parameter indicates whether the method
        should wait until the write out has been completed. Optional.
-  write_super_lockfs: called when VFS is locking a filesystem and forcing
+  write_super_lockfs: called when VFS is locking a filesystem and
-        it into a consistent state.  This function is currently used by the
+        forcing it into a consistent state.  This method is currently
-        Logical Volume Manager (LVM).
+        used by the Logical Volume Manager (LVM).
  unlockfs: called when VFS is unlocking a filesystem and making it writable
        again.
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which
 describes the methods that can be performed on individual inodes.
+The Inode Object
+================
+An inode object represents an object within the filesystem.
 struct inode_operations
-=======================
+-----------------------
 This describes how the VFS can manipulate an inode in your
 filesystem. As of kernel 2.6.13, the following members are defined:
@@ -394,51 +393,62 @@ otherwise noted.
        will probably need to call d_instantiate() just as you would
        in the create() method
+  rename: called by the rename(2) system call to rename the object to
+        have the parent and name given by the second inode and dentry.
  readlink: called by the readlink(2) system call. Only required if
        you want to support reading symbolic links
  follow_link: called by the VFS to follow a symbolic link to the
        inode it points to.  Only required if you want to support
-        symbolic links.  This function returns a void pointer cookie
+        symbolic links.  This method returns a void pointer cookie
        that is passed to put_link().
  put_link: called by the VFS to release resources allocated by
-        follow_link().  The cookie returned by follow_link() is passed to
+        follow_link().  The cookie returned by follow_link() is passed
-        to this function as the last parameter.  It is used by filesystems
+        to this method as the last parameter.  It is used by
-        such as NFS where page cache is not stable (i.e. page that was
+        filesystems such as NFS where page cache is not stable
-        installed when the symbolic link walk started might not be in the
+        (i.e. page that was installed when the symbolic link walk
-        page cache at the end of the walk).
+        started might not be in the page cache at the end of the
+        walk).
-  truncate: called by the VFS to change the size of a file.  The i_size
-        field of the inode is set to the desired size by the VFS before
+  truncate: called by the VFS to change the size of a file.  The
-        this function is called.  This function is called by the truncate(2)
+        i_size field of the inode is set to the desired size by the
-        system call and related functionality.
+        VFS before this method is called.  This method is called by
+        the truncate(2) system call and related functionality.
  permission: called by the VFS to check for access rights on a POSIX-like
        filesystem.
-  setattr: called by the VFS to set attributes for a file.  This function is
+  setattr: called by the VFS to set attributes for a file. This method
-        called by chmod(2) and related system calls.
+        is called by chmod(2) and related system calls.
-  getattr: called by the VFS to get attributes of a file.  This function is
+  getattr: called by the VFS to get attributes of a file. This method
-        called by stat(2) and related system calls.
+        is called by stat(2) and related system calls.
  setxattr: called by the VFS to set an extended attribute for a file.
-        Extended attribute is a name:value pair associated with an inode. This
+        An extended attribute is a name:value pair associated with an
-        function is called by setxattr(2) system call.
+        inode. This method is called by setxattr(2) system call.
+  getxattr: called by the VFS to retrieve the value of an extended
+        attribute name. This method is called by getxattr(2) system
+        call.
-  getxattr: called by the VFS to retrieve the value of an extended attribute
+  listxattr: called by the VFS to list all extended attributes for a
-        name.  This function is called by getxattr(2) function call.
+        given file. This method is called by listxattr(2) system call.
-  listxattr: called by the VFS to list all extended attributes for a given
+  removexattr: called by the VFS to remove an extended attribute from
-        file.  This function is called by listxattr(2) system call.
+        a file. This method is called by removexattr(2) system call.
-  removexattr: called by the VFS to remove an extended attribute from a file.
-        This function is called by removexattr(2) system call.
+The Address Space Object
+========================
+The address space object is used to identify pages in the page cache.
 struct address_space_operations
-===============================
+-------------------------------
 This describes how the VFS can manipulate mapping of a file to page cache in
 your filesystem. As of kernel 2.6.13, the following members are defined:
@@ -502,8 +512,14 @@ struct address_space_operations {
        it.  An example implementation can be found in fs/ext2/xip.c.
+The File Object
+===============
+A file object represents a file opened by a process.
 struct file_operations
-======================
+----------------------
 This describes how the VFS can manipulate an open file. As of kernel
 2.6.13, the following members are defined:
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a
 directory.
-Directory Entry Cache APIs
+Directory Entry Cache API
---------------------------
+-------------------------
 There are a number of functions defined which permit a filesystem to
@@ -705,178 +721,24 @@ manipulate dentries:
        and the dentry is returned. The caller must use d_put()
        to free the dentry when it finishes using it.
+For further information on dentry locking, please refer to the document
+Documentation/filesystems/dentry-locking.txt.
-RCU-based dcache locking model
------------------------------
-On many workloads, the most common operation on dcache is
+Resources
-to look up a dentry, given a parent dentry and the name
+=========
-of the child. Typically, for every open(), stat() etc.,
-the dentry corresponding to the pathname will be looked
+(Note some of these resources are not up-to-date with the latest kernel
-up by walking the tree starting with the first component
+ version.)
-of the pathname and using that dentry along with the next
-component to look up the next level and so on. Since it
+Creating Linux virtual filesystems. 2002
-is a frequent operation for workloads like multiuser
+    <http://lwn.net/Articles/13325/>
-environments and web servers, it is important to optimize
-this path.
+The Linux Virtual File-system Layer by Neil Brown. 1999
+    <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
-Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
-in every component during path look-up. Since 2.5.10 onwards,
+A tour of the Linux VFS by Michael K. Johnson. 1996
-fast-walk algorithm changed this by holding the dcache_lock
+    <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
-at the beginning and walking as many cached path component
-dentries as possible. This significantly decreases the number
-of acquisition of dcache_lock. However it also increases the
-lock hold time significantly and affects performance in large
-SMP machines. Since 2.5.62 kernel, dcache has been using
-a new locking model that uses RCU to make dcache look-up
-lock-free.
-The current dcache locking model is not very different from the existing
-dcache locking model. Prior to 2.5.62 kernel, dcache_lock
-protected the hash chain, d_child, d_alias, d_lru lists as well
-as d_inode and several other things like mount look-up. RCU-based
-changes affect only the way the hash chain is protected. For everything
-else the dcache_lock must be taken for both traversing as well as
-updating. The hash chain updates too take the dcache_lock.
-The significant change is the way d_lookup traverses the hash chain,
-it doesn't acquire the dcache_lock for this and rely on RCU to
-ensure that the dentry has not been *freed*.
-Dcache locking details
----------------------
-For many multi-user workloads, open() and stat() on files are
+A small trail through the Linux kernel by Andries Brouwer. 2001
-very frequently occurring operations. Both involve walking
+    <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
-of path names to find the dentry corresponding to the
-concerned file. In 2.4 kernel, dcache_lock was held
-during look-up of each path component. Contention and
-cache-line bouncing of this global lock caused significant
-scalability problems. With the introduction of RCU
-in Linux kernel, this was worked around by making
-the look-up of path components during path walking lock-free.
-Safe lock-free look-up of dcache hash table
-===========================================
-Dcache is a complex data structure with the hash table entries
-also linked together in other lists. In 2.4 kernel, dcache_lock
-protected all the lists. We applied RCU only on hash chain
-walking. The rest of the lists are still protected by dcache_lock.
-Some of the important changes are :
-1. The deletion from hash chain is done using hlist_del_rcu() macro which
-   doesn't initialize next pointer of the deleted dentry and this
-   allows us to walk safely lock-free while a deletion is happening.
-2. Insertion of a dentry into the hash table is done using
-   hlist_add_head_rcu() which take care of ordering the writes -
-   the writes to the dentry must be visible before the dentry
-   is inserted. This works in conjunction with hlist_for_each_rcu()
-   while walking the hash chain. The only requirement is that
-   all initialization to the dentry must be done before hlist_add_head_rcu()
-   since we don't have dcache_lock protection while traversing
-   the hash chain. This isn't different from the existing code.
-3. The dentry looked up without holding dcache_lock by cannot be
-   returned for walking if it is unhashed. It then may have a NULL
-   d_inode or other bogosity since RCU doesn't protect the other
-   fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
-   indicate unhashed  dentries and use this in conjunction with a
-   per-dentry lock (d_lock). Once looked up without the dcache_lock,
-   we acquire the per-dentry lock (d_lock) and check if the
-   dentry is unhashed. If so, the look-up is failed. If not, the
-   reference count of the dentry is increased and the dentry is returned.
-4. Once a dentry is looked up, it must be ensured during the path
-   walk for that component it doesn't go away. In pre-2.5.10 code,
-   this was done holding a reference to the dentry. dcache_rcu does
-   the same.  In some sense, dcache_rcu path walking looks like
-   the pre-2.5.10 version.
-5. All dentry hash chain updates must take the dcache_lock as well as
-   the per-dentry lock in that order. dput() does this to ensure
-   that a dentry that has just been looked up in another CPU
-   doesn't get deleted before dget() can be done on it.
-6. There are several ways to do reference counting of RCU protected
-   objects. One such example is in ipv4 route cache where
-   deferred freeing (using call_rcu()) is done as soon as
-   the reference count goes to zero. This cannot be done in
-   the case of dentries because tearing down of dentries
-   require blocking (dentry_iput()) which isn't supported from
-   RCU callbacks. Instead, tearing down of dentries happen
-   synchronously in dput(), but actual freeing happens later
-   when RCU grace period is over. This allows safe lock-free
-   walking of the hash chains, but a matched dentry may have
-   been partially torn down. The checking of DCACHE_UNHASHED
-   flag with d_lock held detects such dentries and prevents
-   them from being returned from look-up.
-Maintaining POSIX rename semantics
-==================================
-Since look-up of dentries is lock-free, it can race against
-a concurrent rename operation. For example, during rename
-of file A to B, look-up of either A or B must succeed.
-So, if look-up of B happens after A has been removed from the
-hash chain but not added to the new hash chain, it may fail.
-Also, a comparison while the name is being written concurrently
-by a rename may result in false positive matches violating
-rename semantics.  Issues related to race with rename are
-handled as described below :
-1. Look-up can be done in two ways - d_lookup() which is safe
-   from simultaneous renames and __d_lookup() which is not.
-   If __d_lookup() fails, it must be followed up by a d_lookup()
-   to correctly determine whether a dentry is in the hash table
-   or not. d_lookup() protects look-ups using a sequence
-   lock (rename_lock).
-2. The name associated with a dentry (d_name) may be changed if
-   a rename is allowed to happen simultaneously. To avoid memcmp()
-   in __d_lookup() go out of bounds due to a rename and false
-   positive comparison, the name comparison is done while holding the
-   per-dentry lock. This prevents concurrent renames during this
-   operation.
-3. Hash table walking during look-up may move to a different bucket as
-   the current dentry is moved to a different bucket due to rename.
-   But we use hlists in dcache hash table and they are null-terminated.
-   So, even if a dentry moves to a different bucket, hash chain
-   walk will terminate. [with a list_head list, it may not since
-   termination is when the list_head in the original bucket is reached].
-   Since we redo the d_parent check and compare name while holding
-   d_lock, lock-free look-up will not race against d_move().
-4. There can be a theoretical race when a dentry keeps coming back
-   to original bucket due to double moves. Due to this look-up may
-   consider that it has never moved and can end up in a infinite loop.
-   But this is not any worse that theoretical livelocks we already
-   have in the kernel.
-Important guidelines for filesystem developers related to dcache_rcu
-====================================================================
-1. Existing dcache interfaces (pre-2.5.62) exported to filesystem
-   don't change. Only dcache internal implementation changes. However
-   filesystems *must not* delete from the dentry hash chains directly
-   using the list macros like allowed earlier. They must use dcache
-   APIs like d_drop() or __d_drop() depending on the situation.
-2. d_flags is now protected by a per-dentry lock (d_lock). All
-   access to d_flags must be protected by it.
-3. For a hashed dentry, checking of d_count needs to be protected
-   by d_lock.
-Papers and other documentation on dcache locking
-================================================
-1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
-2. http://lse.sourceforge.net/locking/dcache/dcache.html