Merge branch 'master' of /usr/src/ntfs-2.6/

author: Anton Altaparmakov <aia21@cantab.net> 2006-01-19 11:39:33 -0500
committer: Anton Altaparmakov <aia21@cantab.net> 2006-01-19 11:39:33 -0500
commit: 944d79559d154c12becde0dab327016cf438f46c
tree: 50c101806f4d3b6585222dda060559eb4f3e005a /Documentation/filesystems
parent: d087e4bdd24ebe3ae3d0b265b6573ec901af4b4b
parent: 0f36b018b2e314d45af86449f1a97facb1fbe300
14 files changed, 1998 insertions, 123 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index bcfbab899b37..74052d22d868 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -12,14 +12,16 @@ cifs.txt
        - description of the CIFS filesystem
 coda.txt
        - description of the CODA filesystem.
+configfs/
+        - directory containing configfs documentation and example code.
 cramfs.txt
        - info on the cram filesystem for small storage (ROMs etc)
 devfs/
        - directory containing devfs documentation.
+dlmfs.txt
+        - info on the userspace interface to the OCFS2 DLM.
 ext2.txt
        - info, mount options and specifications for the Ext2 filesystem.
-fat_cvf.txt
-        - info on the Compressed Volume Files extension to the FAT filesystem
 hpfs.txt
        - info and mount options for the OS/2 HPFS.
 isofs.txt
@@ -32,6 +34,8 @@ ntfs.txt
        - info and mount options for the NTFS filesystem (Windows NT).
 proc.txt
        - info on Linux's /proc filesystem.
+ocfs2.txt
+        - info and mount options for the OCFS2 clustered filesystem.
 romfs.txt
        - Description of the ROMFS filesystem.
 smbfs.txt
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt
new file mode 100644
index 000000000000..c4ff96b7c4e0
--- /dev/null
+++ b/Documentation/filesystems/configfs/configfs.txt
@@ -0,0 +1,434 @@
+configfs - Userspace-driven kernel object configuration.
+Joel Becker <joel.becker@oracle.com>
+Updated: 31 March 2005
+Copyright (c) 2005 Oracle Corporation,
+        Joel Becker <joel.becker@oracle.com>
+[What is configfs?]
+configfs is a ram-based filesystem that provides the converse of
+sysfs's functionality.  Where sysfs is a filesystem-based view of
+kernel objects, configfs is a filesystem-based manager of kernel
+objects, or config_items.
+With sysfs, an object is created in kernel (for example, when a device
+is discovered) and it is registered with sysfs.  Its attributes then
+appear in sysfs, allowing userspace to read the attributes via
+readdir(3)/read(2).  It may allow some attributes to be modified via
+write(2).  The important point is that the object is created and
+destroyed in kernel, the kernel controls the lifecycle of the sysfs
+representation, and sysfs is merely a window on all this.
+A configfs config_item is created via an explicit userspace operation:
+mkdir(2).  It is destroyed via rmdir(2).  The attributes appear at
+mkdir(2) time, and can be read or modified via read(2) and write(2).
+As with sysfs, readdir(3) queries the list of items and/or attributes.
+symlink(2) can be used to group items together.  Unlike sysfs, the
+lifetime of the representation is completely driven by userspace.  The
+kernel modules backing the items must respond to this.
+Both sysfs and configfs can and should exist together on the same
+system.  One is not a replacement for the other.
+[Using configfs]
+configfs can be compiled as a module or into the kernel.  You can access
+it by doing
+        mount -t configfs none /config
+The configfs tree will be empty unless client modules are also loaded.
+These are modules that register their item types with configfs as
+subsystems.  Once a client subsystem is loaded, it will appear as a
+subdirectory (or more than one) under /config.  Like sysfs, the
+configfs tree is always there, whether mounted on /config or not.
+An item is created via mkdir(2).  The item's attributes will also
+appear at this time.  readdir(3) can determine what the attributes are,
+read(2) can query their default values, and write(2) can store new
+values.  Like sysfs, attributes should be ASCII text files, preferably
+with only one value per file.  The same efficiency caveats from sysfs
+apply.  Don't mix more than one attribute in one attribute file.
+Like sysfs, configfs expects write(2) to store the entire buffer at
+once.  When writing to configfs attributes, userspace processes should
+first read the entire file, modify the portions they wish to change, and
+then write the entire buffer back.  Attribute files have a maximum size
+of one page (PAGE_SIZE, 4096 on i386).
+When an item needs to be destroyed, remove it with rmdir(2).  An
+item cannot be destroyed if any other item has a link to it (via
+symlink(2)).  Links can be removed via unlink(2).
+[Configuring FakeNBD: an Example]
+Imagine there's a Network Block Device (NBD) driver that allows you to
+access remote block devices.  Call it FakeNBD.  FakeNBD uses configfs
+for its configuration.  Obviously, there will be a nice program that
+sysadmins use to configure FakeNBD, but somehow that program has to tell
+the driver about it.  Here's where configfs comes in.
+When the FakeNBD driver is loaded, it registers itself with configfs.
+readdir(3) sees this just fine:
+        # ls /config
+        fakenbd
+A fakenbd connection can be created with mkdir(2).  The name is
+arbitrary, but likely the tool will make some use of the name.  Perhaps
+it is a uuid or a disk name:
+        # mkdir /config/fakenbd/disk1
+        # ls /config/fakenbd/disk1
+        target device rw
+The target attribute contains the IP address of the server FakeNBD will
+connect to.  The device attribute is the device on the server.
+Predictably, the rw attribute determines whether the connection is
+read-only or read-write.
+        # echo 10.0.0.1 > /config/fakenbd/disk1/target
+        # echo /dev/sda1 > /config/fakenbd/disk1/device
+        # echo 1 > /config/fakenbd/disk1/rw
+That's it.  That's all there is.  Now the device is configured, via the
+shell no less.
+[Coding With configfs]
+Every object in configfs is a config_item.  A config_item reflects an
+object in the subsystem.  It has attributes that match values on that
+object.  configfs handles the filesystem representation of that object
+and its attributes, allowing the subsystem to ignore all but the
+basic show/store interaction.
+Items are created and destroyed inside a config_group.  A group is a
+collection of items that share the same attributes and operations.
+Items are created by mkdir(2) and removed by rmdir(2), but configfs
+handles that.  The group has a set of operations to perform these tasks.
+A subsystem is the top level of a client module.  During initialization,
+the client module registers the subsystem with configfs, and the
+subsystem appears as a directory at the top of the configfs filesystem.  A
+subsystem is also a config_group, and can do everything a config_group
+can.
+[struct config_item]
+        struct config_item {
+                char                    *ci_name;
+                char                    ci_namebuf[UOBJ_NAME_LEN];
+                struct kref             ci_kref;
+                struct list_head        ci_entry;
+                struct config_item      *ci_parent;
+                struct config_group     *ci_group;
+                struct config_item_type *ci_type;
+                struct dentry           *ci_dentry;
+        };
+        void config_item_init(struct config_item *);
+        void config_item_init_type_name(struct config_item *,
+                                        const char *name,
+                                        struct config_item_type *type);
+        struct config_item *config_item_get(struct config_item *);
+        void config_item_put(struct config_item *);
+Generally, struct config_item is embedded in a container structure, a
+structure that actually represents what the subsystem is doing.  The
+config_item portion of that structure is how the object interacts with
+configfs.
+Whether statically defined in a source file or created by a parent
+config_group, a config_item must have one of the _init() functions
+called on it.  This initializes the reference count and sets up the
+appropriate fields.
+All users of a config_item should have a reference on it via
+config_item_get(), and drop the reference when they are done via
+config_item_put().
+By itself, a config_item cannot do much more than appear in configfs.
+Usually a subsystem wants the item to display and/or store attributes,
+among other things.  For that, it needs a type.
+[struct config_item_type]
+        struct configfs_item_operations {
+                void (*release)(struct config_item *);
+                ssize_t (*show_attribute)(struct config_item *,
+                                          struct configfs_attribute *,
+                                          char *);
+                ssize_t (*store_attribute)(struct config_item *,
+                                           struct configfs_attribute *,
+                                           const char *, size_t);
+                int (*allow_link)(struct config_item *src,
+                                  struct config_item *target);
+                int (*drop_link)(struct config_item *src,
+                                 struct config_item *target);
+        };
+        struct config_item_type {
+                struct module                           *ct_owner;
+                struct configfs_item_operations         *ct_item_ops;
+                struct configfs_group_operations        *ct_group_ops;
+                struct configfs_attribute               **ct_attrs;
+        };
+The most basic function of a config_item_type is to define what
+operations can be performed on a config_item.  All items that have been
+allocated dynamically will need to provide the ct_item_ops->release()
+method.  This method is called when the config_item's reference count
+reaches zero.  Items that wish to display an attribute need to provide
+the ct_item_ops->show_attribute() method.  Similarly, storing a new
+attribute value uses the store_attribute() method.
+[struct configfs_attribute]
+        struct configfs_attribute {
+                char                    *ca_name;
+                struct module           *ca_owner;
+                mode_t                  ca_mode;
+        };
+When a config_item wants an attribute to appear as a file in the item's
+configfs directory, it must define a configfs_attribute describing it.
+It then adds the attribute to the NULL-terminated array
+config_item_type->ct_attrs.  When the item appears in configfs, the
+attribute file will appear with the configfs_attribute->ca_name
+filename.  configfs_attribute->ca_mode specifies the file permissions.
+If an attribute is readable and the config_item provides a
+ct_item_ops->show_attribute() method, that method will be called
+whenever userspace asks for a read(2) on the attribute.  The converse
+will happen for write(2).
+[struct config_group]
+A config_item cannot live in a vacuum.  The only way one can be created
+is via mkdir(2) on a config_group.  This will trigger creation of a
+child item.
+        struct config_group {
+                struct config_item              cg_item;
+                struct list_head                cg_children;
+                struct configfs_subsystem       *cg_subsys;
+                struct config_group             **default_groups;
+        };
+        void config_group_init(struct config_group *group);
+        void config_group_init_type_name(struct config_group *group,
+                                         const char *name,
+                                         struct config_item_type *type);
+The config_group structure contains a config_item.  Properly configuring
+that item means that a group can behave as an item in its own right.
+However, it can do more: it can create child items or groups.  This is
+accomplished via the group operations specified on the group's
+config_item_type.
+        struct configfs_group_operations {
+                struct config_item *(*make_item)(struct config_group *group,
+                                                 const char *name);
+                struct config_group *(*make_group)(struct config_group *group,
+                                                   const char *name);
+                int (*commit_item)(struct config_item *item);
+                void (*drop_item)(struct config_group *group,
+                                  struct config_item *item);
+        };
+A group creates child items by providing the
+ct_group_ops->make_item() method.  If provided, this method is called
+from mkdir(2) in the group's directory.  The subsystem allocates a new
+config_item (or more likely, its container structure), initializes it,
+and returns it to configfs.  Configfs will then populate the filesystem
+tree to reflect the new item.
+If the subsystem wants the child to be a group itself, the subsystem
+provides ct_group_ops->make_group().  Everything else behaves the same,
+using the group _init() functions on the group.
+Finally, when userspace calls rmdir(2) on the item or group,
+ct_group_ops->drop_item() is called.  As a config_group is also a
+config_item, a separate drop_group() method is not necessary.
+The subsystem must config_item_put() the reference that was initialized
+upon item allocation.  If a subsystem has no work to do, it may omit
+the ct_group_ops->drop_item() method, and configfs will call
+config_item_put() on the item on behalf of the subsystem.
+IMPORTANT: drop_item() is void, and as such cannot fail.  When rmdir(2)
+is called, configfs WILL remove the item from the filesystem tree
+(assuming that it has no children to keep it busy).  The subsystem is
+responsible for responding to this.  If the subsystem has references to
+the item in other threads, the memory is safe.  It may take some time
+for the item to actually disappear from the subsystem's usage.  But it
+is gone from configfs.
+A config_group cannot be removed while it still has child items.  This
+is implemented in the configfs rmdir(2) code.  ->drop_item() will not be
+called, as the item has not been dropped.  rmdir(2) will fail, as the
+directory is not empty.
+[struct configfs_subsystem]
+A subsystem must register itself, usually at module_init time.  This
+tells configfs to make the subsystem appear in the file tree.
+        struct configfs_subsystem {
+                struct config_group     su_group;
+                struct semaphore        su_sem;
+        };
+        int configfs_register_subsystem(struct configfs_subsystem *subsys);
+        void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
+        A subsystem consists of a toplevel config_group and a semaphore.
+The group is where child config_items are created.  For a subsystem,
+this group is usually defined statically.  Before calling
+configfs_register_subsystem(), the subsystem must have initialized the
+group via the usual group _init() functions, and it must also have
+initialized the semaphore.
+        When the register call returns, the subsystem is live, and it
+will be visible via configfs.  At that point, mkdir(2) can be called and
+the subsystem must be ready for it.
+[An Example]
+The best example of these basic concepts is the simple_children
+subsystem/group and the simple_child item in configfs_example.c.  It
+shows a trivial object displaying and storing an attribute, and a simple
+group creating and destroying these children.
+[Hierarchy Navigation and the Subsystem Semaphore]
+There is an extra bonus that configfs provides.  The config_groups and
+config_items are arranged in a hierarchy due to the fact that they
+appear in a filesystem.  A subsystem is NEVER to touch the filesystem
+parts, but the subsystem might be interested in this hierarchy.  For
+this reason, the hierarchy is mirrored via the config_group->cg_children
+and config_item->ci_parent structure members.
+A subsystem can navigate the cg_children list and the ci_parent pointer
+to see the tree created by the subsystem.  This can race with configfs'
+management of the hierarchy, so configfs uses the subsystem semaphore to
+protect modifications.  Whenever a subsystem wants to navigate the
+hierarchy, it must do so under the protection of the subsystem
+semaphore.
+A subsystem will be prevented from acquiring the semaphore while a newly
+allocated item has not been linked into this hierarchy.   Similarly, it
+will not be able to acquire the semaphore while a dropping item has not
+yet been unlinked.  This means that an item's ci_parent pointer will
+never be NULL while the item is in configfs, and that an item will only
+be in its parent's cg_children list for the same duration.  This allows
+a subsystem to trust ci_parent and cg_children while it holds the
+semaphore.
+[Item Aggregation Via symlink(2)]
+configfs provides a simple group via the group->item parent/child
+relationship.  Often, however, a larger environment requires aggregation
+outside of the parent/child connection.  This is implemented via
+symlink(2).
+A config_item may provide the ct_item_ops->allow_link() and
+ct_item_ops->drop_link() methods.  If the ->allow_link() method exists,
+symlink(2) may be called with the config_item as the source of the link.
+These links are only allowed between configfs config_items.  Any
+symlink(2) attempt outside the configfs filesystem will be denied.
+When symlink(2) is called, the source config_item's ->allow_link()
+method is called with itself and a target item.  If the source item
+allows linking to target item, it returns 0.  A source item may wish to
+reject a link if it only wants links to a certain type of object (say,
+in its own subsystem).
+When unlink(2) is called on the symbolic link, the source item is
+notified via the ->drop_link() method.  Like the ->drop_item() method,
+this is a void function and cannot return failure.  The subsystem is
+responsible for responding to the change.
+A config_item cannot be removed while it links to any other item, nor
+can it be removed while an item links to it.  Dangling symlinks are not
+allowed in configfs.
+[Automatically Created Subgroups]
+A new config_group may want to have two types of child config_items.
+While this could be codified by magic names in ->make_item(), it is much
+more explicit to have a method whereby userspace sees this divergence.
+Rather than have a group where some items behave differently than
+others, configfs provides a method whereby one or many subgroups are
+automatically created inside the parent at its creation.  Thus,
+mkdir("parent") results in "parent", "parent/subgroup1", up through
+"parent/subgroupN".  Items of type 1 can now be created in
+"parent/subgroup1", and items of type N can be created in
+"parent/subgroupN".
+These automatic subgroups, or default groups, do not preclude other
+children of the parent group.  If ct_group_ops->make_group() exists,
+other child groups can be created on the parent group directly.
+A configfs subsystem specifies default groups by filling in the
+NULL-terminated array default_groups on the config_group structure.
+Each group in that array is populated in the configfs tree at the same
+time as the parent group.  Similarly, they are removed at the same time
+as the parent.  No extra notification is provided.  When a ->drop_item()
+method call notifies the subsystem the parent group is going away, it
+also means every default group child associated with that parent group.
+As a consequence of this, default_groups cannot be removed directly via
+rmdir(2).  They also are not considered when rmdir(2) on the parent
+group is checking for children.
+[Committable Items]
+NOTE: Committable items are currently unimplemented.
+Some config_items cannot have a valid initial state.  That is, no
+default values can be specified for the item's attributes such that the
+item can do its work.  Userspace must configure one or more attributes,
+after which the subsystem can start whatever entity this item
+represents.
+Consider the FakeNBD device from above.  Without a target address *and*
+a target device, the subsystem has no idea what block device to import.
+The simple example assumes that the subsystem merely waits until all the
+appropriate attributes are configured, and then connects.  This will,
+indeed, work, but now every attribute store must check if the attributes
+are initialized.  Every attribute store must fire off the connection if
+that condition is met.
+Far better would be an explicit action notifying the subsystem that the
+config_item is ready to go.  More importantly, an explicit action allows
+the subsystem to provide feedback as to whether the attributes are
+initialized in a way that makes sense.  configfs provides this as
+committable items.
+configfs still uses only normal filesystem operations.  An item is
+committed via rename(2).  The item is moved from a directory where it
+can be modified to a directory where it cannot.
+Any group that provides the ct_group_ops->commit_item() method has
+committable items.  When this group appears in configfs, mkdir(2) will
+not work directly in the group.  Instead, the group will have two
+subdirectories: "live" and "pending".  The "live" directory does not
+support mkdir(2) or rmdir(2) either.  It only allows rename(2).  The
+"pending" directory does allow mkdir(2) and rmdir(2).  An item is
+created in the "pending" directory.  Its attributes can be modified at
+will.  Userspace commits the item by renaming it into the "live"
+directory.  At this point, the subsystem receives the ->commit_item()
+callback.  If all required attributes are filled to satisfaction, the
+method returns zero and the item is moved to the "live" directory.
+As rmdir(2) does not work in the "live" directory, an item must be
+shut down, or "uncommitted".  Again, this is done via rename(2), this
+time from the "live" directory back to the "pending" one.  The subsystem
+is notified by the ct_group_ops->uncommit_object() method.
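
The container-structure pattern described under [struct config_item], and the show/store handling used by configfs_example.c, can be tried outside the kernel.  What follows is a minimal user-space sketch, not kernel code: struct fake_item, struct fake_child and the fake_child_*() helpers are hypothetical names invented for illustration, the container_of() macro is a local stand-in for the kernel's, and strtoul(3) stands in for simple_strtoul().

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* User-space stand-in for the kernel's container_of(); an assumption. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature of struct config_item: only a name. */
struct fake_item {
	const char *name;
};

/* The container: the subsystem's real object embeds the item. */
struct fake_child {
	struct fake_item item;
	int storeme;
};

/* Recover the container from the embedded item, like to_simple_child(). */
static struct fake_child *to_fake_child(struct fake_item *item)
{
	return item ? container_of(item, struct fake_child, item) : NULL;
}

/* show: write the value as one line of ASCII text into the page buffer. */
static ssize_t fake_child_show(struct fake_item *item, char *page)
{
	return sprintf(page, "%d\n", to_fake_child(item)->storeme);
}

/* store: parse one decimal value, tolerating a trailing newline. */
static ssize_t fake_child_store(struct fake_item *item, const char *page,
				size_t count)
{
	char *end;
	unsigned long tmp;

	errno = 0;
	tmp = strtoul(page, &end, 10);
	if (end == page || (*end && *end != '\n'))
		return -EINVAL;		/* no digits, or trailing junk */
	if (errno == ERANGE || tmp > INT_MAX)
		return -ERANGE;		/* does not fit in an int */
	to_fake_child(item)->storeme = (int)tmp;
	return (ssize_t)count;		/* whole buffer consumed */
}
```

The callbacks receive only the embedded item, as configfs would pass it, and recover the surrounding object through container_of().  Returning count from the store helper mirrors the convention that a write(2) to an attribute stores the entire buffer at once.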
diff --git a/Documentation/filesystems/configfs/configfs_example.c b/Documentation/filesystems/configfs/configfs_example.c
new file mode 100644
index 000000000000..f3c6e4946f98
--- /dev/null
+++ b/Documentation/filesystems/configfs/configfs_example.c
@@ -0,0 +1,474 @@
+/*
+ * vim: noexpandtab ts=8 sts=0 sw=8:
+ *
+ * configfs_example.c - This file is a demonstration module containing
+ *      a number of configfs subsystems.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Based on sysfs:
+ *      sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/configfs.h>
+/*
+ * 01-childless
+ *
+ * This first example is a childless subsystem.  It cannot create
+ * any config_items.  It just has attributes.
+ *
+ * Note that we are enclosing the configfs_subsystem inside a container.
+ * This is not necessary if a subsystem has no attributes directly
+ * on the subsystem.  See the next example, 02-simple-children, for
+ * such a subsystem.
+ */
+struct childless {
+        struct configfs_subsystem subsys;
+        int showme;
+        int storeme;
+};
+struct childless_attribute {
+        struct configfs_attribute attr;
+        ssize_t (*show)(struct childless *, char *);
+        ssize_t (*store)(struct childless *, const char *, size_t);
+};
+static inline struct childless *to_childless(struct config_item *item)
+{
+        return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
+}
+static ssize_t childless_showme_read(struct childless *childless,
+                                     char *page)
+{
+        ssize_t pos;
+        pos = sprintf(page, "%d\n", childless->showme);
+        childless->showme++;
+        return pos;
+}
+static ssize_t childless_storeme_read(struct childless *childless,
+                                      char *page)
+{
+        return sprintf(page, "%d\n", childless->storeme);
+}
+static ssize_t childless_storeme_write(struct childless *childless,
+                                       const char *page,
+                                       size_t count)
+{
+        unsigned long tmp;
+        char *p = (char *) page;
+        tmp = simple_strtoul(p, &p, 10);
+        if (!p || (*p && (*p != '\n')))
+                return -EINVAL;
+        if (tmp > INT_MAX)
+                return -ERANGE;
+        childless->storeme = tmp;
+        return count;
+}
+static ssize_t childless_description_read(struct childless *childless,
+                                          char *page)
+{
+        return sprintf(page,
+"[01-childless]\n"
+"\n"
+"The childless subsystem is the simplest possible subsystem in\n"
+"configfs.  It does not support the creation of child config_items.\n"
+"It only has a few attributes.  In fact, it isn't much different\n"
+"than a directory in /proc.\n");
+}
+static struct childless_attribute childless_attr_showme = {
+        .attr   = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
+        .show   = childless_showme_read,
+};
+static struct childless_attribute childless_attr_storeme = {
+        .attr   = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
+        .show   = childless_storeme_read,
+        .store  = childless_storeme_write,
+};
+static struct childless_attribute childless_attr_description = {
+        .attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
+        .show = childless_description_read,
+};
+static struct configfs_attribute *childless_attrs[] = {
+        &childless_attr_showme.attr,
+        &childless_attr_storeme.attr,
+        &childless_attr_description.attr,
+        NULL,
+};
+static ssize_t childless_attr_show(struct config_item *item,
+                                   struct configfs_attribute *attr,
+                                   char *page)
+{
+        struct childless *childless = to_childless(item);
+        struct childless_attribute *childless_attr =
+                container_of(attr, struct childless_attribute, attr);
+        ssize_t ret = 0;
+        if (childless_attr->show)
+                ret = childless_attr->show(childless, page);
+        return ret;
+}
+static ssize_t childless_attr_store(struct config_item *item,
+                                    struct configfs_attribute *attr,
+                                    const char *page, size_t count)
+{
+        struct childless *childless = to_childless(item);
+        struct childless_attribute *childless_attr =
+                container_of(attr, struct childless_attribute, attr);
+        ssize_t ret = -EINVAL;
+        if (childless_attr->store)
+                ret = childless_attr->store(childless, page, count);
+        return ret;
+}
+static struct configfs_item_operations childless_item_ops = {
+        .show_attribute         = childless_attr_show,
+        .store_attribute        = childless_attr_store,
+};
+static struct config_item_type childless_type = {
+        .ct_item_ops    = &childless_item_ops,
+        .ct_attrs       = childless_attrs,
+        .ct_owner       = THIS_MODULE,
+};
+static struct childless childless_subsys = {
+        .subsys = {
+                .su_group = {
+                        .cg_item = {
+                                .ci_namebuf = "01-childless",
+                                .ci_type = &childless_type,
+                        },
+                },
+        },
+};
+/* ----------------------------------------------------------------- */
+/*
+ * 02-simple-children
+ *
+ * This example merely has a simple one-attribute child.  Note that
+ * there is no extra attribute structure, as the child's attribute is
+ * known from the get-go.  Also, there is no container for the
+ * subsystem, as it has no attributes of its own.
+ */
+struct simple_child {
+        struct config_item item;
+        int storeme;
+};
+static inline struct simple_child *to_simple_child(struct config_item *item)
+{
+        return item ? container_of(item, struct simple_child, item) : NULL;
+}
+static struct configfs_attribute simple_child_attr_storeme = {
+        .ca_owner = THIS_MODULE,
+        .ca_name = "storeme",
+        .ca_mode = S_IRUGO | S_IWUSR,
+};
+static struct configfs_attribute *simple_child_attrs[] = {
+        &simple_child_attr_storeme,
+        NULL,
+};
+static ssize_t simple_child_attr_show(struct config_item *item,
+                                      struct configfs_attribute *attr,
+                                      char *page)
+{
+        ssize_t count;
+        struct simple_child *simple_child = to_simple_child(item);
+        count = sprintf(page, "%d\n", simple_child->storeme);
+        return count;
+}
+static ssize_t simple_child_attr_store(struct config_item *item,
+                                       struct configfs_attribute *attr,
+                                       const char *page, size_t count)
+{
+        struct simple_child *simple_child = to_simple_child(item);
+        unsigned long tmp;
+        char *p = (char *) page;
+        tmp = simple_strtoul(p, &p, 10);
+        if (!p || (*p && (*p != '\n')))
+                return -EINVAL;
+        if (tmp > INT_MAX)
+                return -ERANGE;
+        simple_child->storeme = tmp;
+        return count;
+}
+static void simple_child_release(struct config_item *item)
+{
+        kfree(to_simple_child(item));
+}
+static struct configfs_item_operations simple_child_item_ops = {
+        .release                = simple_child_release,
+        .show_attribute         = simple_child_attr_show,
+        .store_attribute        = simple_child_attr_store,
+};
+static struct config_item_type simple_child_type = {
+        .ct_item_ops    = &simple_child_item_ops,
+        .ct_attrs       = simple_child_attrs,
+        .ct_owner       = THIS_MODULE,
+};
+static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
+{
+        struct simple_child *simple_child;
+        simple_child = kmalloc(sizeof(struct simple_child), GFP_KERNEL);
+        if (!simple_child)
+                return NULL;
+        memset(simple_child, 0, sizeof(struct simple_child));
+        config_item_init_type_name(&simple_child->item, name,
+                                   &simple_child_type);
+        simple_child->storeme = 0;
+        return &simple_child->item;
+}
+static struct configfs_attribute simple_children_attr_description = {
+        .ca_owner = THIS_MODULE,
+        .ca_name = "description",
+        .ca_mode = S_IRUGO,
+};
+static struct configfs_attribute *simple_children_attrs[] = {
+        &simple_children_attr_description,
+        NULL,
+};
+static ssize_t simple_children_attr_show(struct config_item *item,
+                                         struct configfs_attribute *attr,
+                                         char *page)
+{
+        return sprintf(page,
+"[02-simple-children]\n"
+"\n"
+"This subsystem allows the creation of child config_items.  These\n"
+"items have only one attribute that is readable and writeable.\n");
+}
+static struct configfs_item_operations simple_children_item_ops = {
+        .show_attribute = simple_children_attr_show,
+};
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations simple_children_group_ops = {
+        .make_item      = simple_children_make_item,
+};
+static struct config_item_type simple_children_type = {
+        .ct_item_ops    = &simple_children_item_ops,
+        .ct_group_ops   = &simple_children_group_ops,
+        .ct_attrs       = simple_children_attrs,
+};
+static struct configfs_subsystem simple_children_subsys = {
+        .su_group = {
+                .cg_item = {
+                        .ci_namebuf = "02-simple-children",
+                        .ci_type = &simple_children_type,
+                },
+        },
+};
+/* ----------------------------------------------------------------- */
+/*
+ * 03-group-children
+ *
+ * This example reuses the simple_children group from above.  However,
+ * the simple_children group is not the subsystem itself, it is a
+ * child of the subsystem.  Creation of a group in the subsystem creates
+ * a new simple_children group.  That group can then have simple_child
+ * children of its own.
+ */
+struct simple_children {
+        struct config_group group;
+};
+static struct config_group *group_children_make_group(struct config_group *group, const char *name)
+{
+        struct simple_children *simple_children;
+        simple_children = kmalloc(sizeof(struct simple_children),
+                                  GFP_KERNEL);
+        if (!simple_children)
+                return NULL;
+        memset(simple_children, 0, sizeof(struct simple_children));
+        config_group_init_type_name(&simple_children->group, name,
+                                    &simple_children_type);
+        return &simple_children->group;
+}
+static struct configfs_attribute group_children_attr_description = {
+        .ca_owner = THIS_MODULE,
+        .ca_name = "description",
+        .ca_mode = S_IRUGO,
+};
+static struct configfs_attribute *group_children_attrs[] = {
+        &group_children_attr_description,
+        NULL,
+};
+static ssize_t group_children_attr_show(struct config_item *item,
+                                        struct configfs_attribute *attr,
+                                        char *page)
+{
+        return sprintf(page,
+"[03-group-children]\n"
+"\n"
+"This subsystem allows the creation of child config_groups.  These\n"
+"groups are like the subsystem simple-children.\n");
+}
+static struct configfs_item_operations group_children_item_ops = {
+        .show_attribute = group_children_attr_show,
+};
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations group_children_group_ops = {
+        .make_group     = group_children_make_group,
+};
+static struct config_item_type group_children_type = {
+        .ct_item_ops    = &group_children_item_ops,
+        .ct_group_ops   = &group_children_group_ops,
+        .ct_attrs       = group_children_attrs,
+};
+static struct configfs_subsystem group_children_subsys = {
+        .su_group = {
+                .cg_item = {
+                        .ci_namebuf = "03-group-children",
+                        .ci_type = &group_children_type,
+                },
+        },
+};
+/* ----------------------------------------------------------------- */
+/*
+ * We're now done with our subsystem definitions.
+ * For convenience in this module, here's a list of them all.  It
+ * allows the init function to easily register them.  Most modules
+ * will only have one subsystem, and will only call register_subsystem
+ * on it directly.
+ */
+static struct configfs_subsystem *example_subsys[] = {
+        &childless_subsys.subsys,
+        &simple_children_subsys,
+        &group_children_subsys,
+        NULL,
+};
+static int __init configfs_example_init(void)
+{
+        int ret;
+        int i;
+        struct configfs_subsystem *subsys;
+        for (i = 0; example_subsys[i]; i++) {
+                subsys = example_subsys[i];
+                config_group_init(&subsys->su_group);
+                init_MUTEX(&subsys->su_sem);
+                ret = configfs_register_subsystem(subsys);
+                if (ret) {
+                        printk(KERN_ERR "Error %d while registering subsystem %s\n",
+                               ret,
+                               subsys->su_group.cg_item.ci_namebuf);
+                        goto out_unregister;
+                }
+        }
+        return 0;
+out_unregister:
+        for (i--; i >= 0; i--) {
+                configfs_unregister_subsystem(example_subsys[i]);
+        }
+        return ret;
+}
+static void __exit configfs_example_exit(void)
+{
+        int i;
+        for (i = 0; example_subsys[i]; i++) {
+                configfs_unregister_subsystem(example_subsys[i]);
+        }
+}
+module_init(configfs_example_init);
+module_exit(configfs_example_exit);
+MODULE_LICENSE("GPL");
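The init function in the example module registers a NULL-terminated list of
subsystems and unwinds on failure.  A minimal userspace sketch of that
register-then-roll-back pattern follows.  The fake_subsys type and the
fake_register()/fake_unregister() helpers are hypothetical stand-ins for
illustration only; they are not part of the configfs API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a subsystem; not a configfs type. */
struct fake_subsys {
	const char *name;
	int fail;		/* non-zero: make registration fail */
	int registered;		/* current state */
	int unregister_calls;	/* how often we were unregistered */
};

static int fake_register(struct fake_subsys *s)
{
	if (s->fail)
		return -1;
	s->registered = 1;
	return 0;
}

static void fake_unregister(struct fake_subsys *s)
{
	s->registered = 0;
	s->unregister_calls++;
}

/*
 * Register every entry of a NULL-terminated list.  If one entry
 * fails, undo only the entries that were registered before it.
 */
static int register_all(struct fake_subsys **list)
{
	int ret = 0;
	int i;

	for (i = 0; list[i]; i++) {
		ret = fake_register(list[i]);
		if (ret)
			goto out_unregister;
	}
	return 0;

out_unregister:
	/* list[i] never registered, so the rollback starts at i - 1. */
	for (i--; i >= 0; i--)
		fake_unregister(list[i]);
	return ret;
}
```

The point of the sketch is the starting index of the rollback loop: the
entry whose registration failed is skipped, and earlier entries are undone
in reverse order.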
diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt
new file mode 100644
index 000000000000..9afab845a906
--- /dev/null
+++ b/Documentation/filesystems/dlmfs.txt
@@ -0,0 +1,130 @@
+dlmfs
+==================
+A minimal DLM userspace interface implemented via a virtual file
+system.
+dlmfs is built with OCFS2 as it requires most of its infrastructure.
+Project web page:    http://oss.oracle.com/projects/ocfs2
+Tools web page:      http://oss.oracle.com/projects/ocfs2-tools
+OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+All code copyright 2005 Oracle except when otherwise noted.
+CREDITS
+=======
+Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
+and Transmeta Corp.
+Mark Fasheh <mark.fasheh@oracle.com>
+Caveats
+=======
+- Right now it only works with the OCFS2 DLM, though support for other
+  DLM implementations should not be a major issue.
+Mount options
+=============
+None
+Usage
+=====
+If you're just interested in OCFS2, then please see ocfs2.txt. The
+rest of this document will be geared towards those who want to use
+dlmfs for easy to set up and easy to use clustered locking in
+userspace.
+Setup
+=====
+dlmfs requires that the OCFS2 cluster infrastructure be in
+place. Please download ocfs2-tools from the above url and configure a
+cluster.
+You'll want to start heartbeating on a volume which all the nodes in
+your lockspace can access. The easiest way to do this is via
+ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
+that an OCFS2 file system be in place so that it can automatically
+find its heartbeat area, though it will eventually support heartbeat
+against raw disks.
+Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
+with ocfs2-tools.
+Once you're heartbeating, DLM lock 'domains' can be easily created /
+destroyed and locks within them accessed.
+Locking
+=======
+Users may access dlmfs via standard file system calls, or they can use
+'libo2dlm' (distributed with ocfs2-tools) which abstracts the file
+system calls and presents a more traditional locking api.
+dlmfs handles lock caching automatically for the user, so a lock
+request for an already acquired lock will not generate another DLM
+call. Userspace programs are assumed to handle their own local
+locking.
+Two levels of locks are supported - Shared Read, and Exclusive.
+Also supported is a Trylock operation.
+For information on the libo2dlm interface, please see o2dlm.h,
+distributed with ocfs2-tools.
+Lock value blocks can be read and written to a resource via read(2)
+and write(2) against the fd obtained via your open(2) call. The
+maximum currently supported LVB length is 64 bytes (though that is an
+OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share
+small amounts of data amongst their nodes.
+mkdir(2) signals dlmfs to join a domain (which will have the same name
+as the resulting directory).
+rmdir(2) signals dlmfs to leave the domain.
+Locks for a given domain are represented by regular inodes inside the
+domain directory.  Locking against them is done via the open(2) system
+call.
+The open(2) call will not return until your lock has been granted or
+an error has occurred, unless it has been instructed to do a trylock
+operation. If the lock succeeds, you'll get an fd.
+Use open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
+not automatically create inodes for existing lock resources.
+Open Flag     Lock Request Type
+---------     -----------------
+O_RDONLY      Shared Read
+O_RDWR        Exclusive
+Open Flag     Resulting Locking Behavior
+---------     --------------------------
+O_NONBLOCK    Trylock operation
+You must provide exactly one of O_RDONLY or O_RDWR.
+If O_NONBLOCK is also provided and the trylock operation was valid but
+could not lock the resource then open(2) will return ETXTBSY.
+close(2) drops the lock associated with your fd.
+Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is
+supported locally as well. This means you can use them to restrict
+access to the resources via dlmfs on your local node only.
+The resource LVB may be read from the fd in either Shared Read or
+Exclusive modes via the read(2) system call. It can be written via
+write(2) only when open in Exclusive mode.
+Once written, an LVB will be visible to other nodes who obtain Shared
+Read or higher level locks on the resource.
+See Also
+========
+http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
+For more information on the VMS distributed locking API.
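The open(2) flag tables in dlmfs.txt can be sketched in userspace C roughly
as follows.  This is a minimal sketch, not the libo2dlm interface: the
function names, the dlmfs_level enum and the 0600 creation mode are
assumptions made for illustration, and the caller supplies the path of the
resource inside a domain directory.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

#define DLMFS_LVB_MAX 64	/* maximum LVB length given in dlmfs.txt */

/* Hypothetical names for the two lock levels the text describes. */
enum dlmfs_level { DLMFS_SHARED_READ, DLMFS_EXCLUSIVE };

/* Translate a lock request into open(2) flags, per the flag tables. */
int dlmfs_open_flags(enum dlmfs_level level, int trylock)
{
	/* Exactly one of O_RDONLY or O_RDWR selects the lock level. */
	int flags = (level == DLMFS_EXCLUSIVE) ? O_RDWR : O_RDONLY;

	/* O_CREAT makes sure the resource inode exists locally. */
	flags |= O_CREAT;

	/* O_NONBLOCK turns the request into a trylock. */
	if (trylock)
		flags |= O_NONBLOCK;
	return flags;
}

/*
 * Take the lock.  Returns an fd that holds the lock, or -1 with errno
 * set.  Without trylock, open(2) blocks until the lock is granted.
 */
int dlmfs_lock(const char *resource_path, enum dlmfs_level level, int trylock)
{
	/* The 0600 mode is an assumption; modes only apply locally. */
	return open(resource_path, dlmfs_open_flags(level, trylock), 0600);
}

/* Read the lock value block; at most DLMFS_LVB_MAX bytes are requested. */
ssize_t dlmfs_read_lvb(int fd, char *buf, size_t len)
{
	if (len > DLMFS_LVB_MAX)
		len = DLMFS_LVB_MAX;
	return read(fd, buf, len);
}

/* close(2) drops the lock associated with the fd. */
int dlmfs_unlock(int fd)
{
	return close(fd);
}
```

A caller would mkdir(2) the domain directory first, then call dlmfs_lock()
on a file name inside it, and finally dlmfs_unlock() the returned fd.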
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9840d5b8d5b9..afb1335c05d6 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -2,11 +2,11 @@
 Ext3 Filesystem
 ===============
-ext3 was originally released in September 1999. Written by Stephen Tweedie
+Ext3 was originally released in September 1999. Written by Stephen Tweedie
-for 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, 
+for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
 Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
-ext3 is ext2 filesystem enhanced with journalling capabilities. 
+Ext3 is the ext2 filesystem enhanced with journalling capabilities.
 Options
 =======
@@ -14,76 +14,81 @@ Options
 When mounting an ext3 filesystem, the following option are accepted:
 (*) == default
-jounal=update           Update the ext3 file system's journal to the 
+journal=update          Update the ext3 file system's journal to the current
-                        current format.
+                        format.
-journal=inum            When a journal already exists, this option is 
+journal=inum            When a journal already exists, this option is ignored.
-                        ignored. Otherwise, it specifies the number of
+                        Otherwise, it specifies the number of the inode which
-                        the inode which will represent the ext3 file
+                        will represent the ext3 file system's journal file.
-                        system's journal file.
+journal_dev=devnum      When the external journal device's major/minor numbers
+                        have changed, this option allows the user to specify
+                        the new journal location.  The journal device is
+                        identified through its new major/minor numbers encoded
+                        in devnum.
 noload                  Don't load the journal on mounting.
-data=journal            All data are committed into the journal prior
+data=journal            All data are committed into the journal prior to being
-                        to being written into the main file system.
+                        written into the main file system.
 data=ordered    (*)     All data are forced directly out to the main file
-                        system prior to its metadata being committed to
+                        system prior to its metadata being committed to the
-                        the journal.
+                        journal.
-data=writeback          Data ordering is not preserved, data may be
+data=writeback          Data ordering is not preserved, data may be written
-                        written into the main file system after its
+                        into the main file system after its metadata has been
-                        metadata has been committed to the journal.
+                        committed to the journal.
 commit=nrsec    (*)     Ext3 can be told to sync all its data and metadata
                        every 'nrsec' seconds. The default value is 5 seconds.
-                        This means that if you lose your power, you will lose,
+                        This means that if you lose your power, you will lose
-                        as much, the latest 5 seconds of work (your filesystem
+                        as much as the latest 5 seconds of work (your
-                        will not be damaged though, thanks to journaling). This
+                        filesystem will not be damaged though, thanks to the
-                        default value (or any low value) will hurt performance,
+                        journaling).  This default value (or any low value)
-                        but it's good for data-safety. Setting it to 0 will
+                        will hurt performance, but it's good for data-safety.
-                        have the same effect than leaving the default 5 sec.
+                        Setting it to 0 will have the same effect as leaving
+                        it at the default (5 seconds).
                        Setting it to very large values will improve
                        performance.
-barrier=1               This enables/disables barriers. barrier=0 disables it,
+barrier=1               This enables/disables barriers.  barrier=0 disables
-                        barrier=1 enables it.
+                        it, barrier=1 enables it.
-orlov           (*)     This enables the new Orlov block allocator. It's enabled
+orlov           (*)     This enables the new Orlov block allocator. It is
-                        by default.
+                        enabled by default.
-oldalloc                This disables the Orlov block allocator and enables the
+oldalloc                This disables the Orlov block allocator and enables
-                        old block allocator. Orlov should have better performance,
+                        the old block allocator.  Orlov should have better
-                        we'd like to get some feedback if it's the contrary for
+                        performance - we'd like to get some feedback if it's
-                        you.
+                        the contrary for you.
-user_xattr              Enables Extended User Attributes. Additionally, you need
+user_xattr              Enables Extended User Attributes.  Additionally, you
-                        to have extended attribute support enabled in the kernel
+                        need to have extended attribute support enabled in the
-                        configuration (CONFIG_EXT3_FS_XATTR). See the attr(5)
+                        kernel configuration (CONFIG_EXT3_FS_XATTR).  See the
-                        manual page and http://acl.bestbits.at to learn more
+                        attr(5) manual page and http://acl.bestbits.at/ to
-                        about extended attributes.
+                        learn more about extended attributes.
 nouser_xattr            Disables Extended User Attributes.
-acl                     Enables POSIX Access Control Lists support.  Additionally,
+acl                     Enables POSIX Access Control Lists support.
-                        you need to have ACL support enabled in the kernel
+                        Additionally, you need to have ACL support enabled in
-                        configuration (CONFIG_EXT3_FS_POSIX_ACL). See the acl(5)
+                        the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
-                        manual page and http://acl.bestbits.at for more
+                        See the acl(5) manual page and http://acl.bestbits.at/
-                        information.
+                        for more information.
-noacl                   This option disables POSIX Access Control List support.
+noacl                   This option disables POSIX Access Control List
+                        support.
 reservation
 noreservation
-resize=
 bsddf           (*)     Make 'df' act like BSD.
 minixdf                 Make 'df' act like Minix.
 check=none              Don't do extra checking of bitmaps on mount.
-nocheck         
+nocheck
 debug                   Extra debugging information is sent to syslog.
@@ -92,7 +97,7 @@ errors=continue		Keep going on a filesystem error.
 errors=panic            Panic and halt the machine if an error occurs.
 grpid                   Give objects the same group ID as their creator.
-bsdgroups               
+bsdgroups
 nogrpid         (*)     New objects have the group ID of their creator.
 sysvgroups
@@ -103,81 +108,83 @@ resuid=n		The user ID which may use the reserved blocks.
 sb=n                    Use alternate superblock at this location.
-quota                   Quota options are currently silently ignored.
+quota
-noquota                 (see fs/ext3/super.c, line 594)
+noquota
 grpquota
 usrquota
 Specification
 =============
-ext3 shares all disk implementation with ext2 filesystem, and add
+Ext3 shares all disk implementation with the ext2 filesystem, and adds
-transactions capabilities to ext2.  Journaling is done by the
+transaction capabilities to ext2.  Journaling is done by the Journaling Block
-Journaling block device layer.
+Device layer.
 Journaling Block Device layer
 -----------------------------
-The Journaling Block Device layer (JBD) isn't ext3 specific.  It was
+The Journaling Block Device layer (JBD) isn't ext3 specific.  It was designed to
-design to add journaling capabilities on a block device.  The ext3
+add journaling capabilities on a block device.  The ext3 filesystem code will
-filesystem code will inform the JBD of modifications it is performing
+inform the JBD of modifications it is performing (called a transaction).  The
-(Call a transaction).  the journal support the transactions start and
+journal supports transaction start and stop, and in case of a crash, the
-stop, and in case of crash, the journal can replayed the transactions
+journal can replay the transactions to put the partition back in a
-to put the partition on a consistent state fastly.
+consistent state fast.
-handles represent a single atomic update to a filesystem.  JBD can
+Handles represent a single atomic update to a filesystem.  JBD can handle an
-handle external journal on a block device.
+external journal on a block device.
 Data Mode
 ---------
-There's 3 different data modes:
+There are 3 different data modes:
 * writeback mode
-In data=writeback mode, ext3 does not journal data at all.  This mode
+In data=writeback mode, ext3 does not journal data at all.  This mode provides
-provides a similar level of journaling as XFS, JFS, and ReiserFS in its
+a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
-default mode - metadata journaling.  A crash+recovery can cause
+mode - metadata journaling.  A crash+recovery can cause incorrect data to
-incorrect data to appear in files which were written shortly before the
+appear in files which were written shortly before the crash.  This mode will
-crash.  This mode will typically provide the best ext3 performance.
+typically provide the best ext3 performance.
 * ordered mode
-In data=ordered mode, ext3 only officially journals metadata, but it
+In data=ordered mode, ext3 only officially journals metadata, but it logically
-logically groups metadata and data blocks into a single unit called a
+groups metadata and data blocks into a single unit called a transaction.  When
-transaction.  When it's time to write the new metadata out to disk, the
+it's time to write the new metadata out to disk, the associated data blocks
-associated data blocks are written first.  In general, this mode
+are written first.  In general, this mode performs slightly slower than
-perform slightly slower than writeback but significantly faster than
+writeback but significantly faster than journal mode.
-journal mode.
 * journal mode
-data=journal mode provides full data and metadata journaling.  All new
+data=journal mode provides full data and metadata journaling.  All new data is
-data is written to the journal first, and then to its final location. 
+written to the journal first, and then to its final location.
-In the event of a crash, the journal can be replayed, bringing both
+In the event of a crash, the journal can be replayed, bringing both data and
-data and metadata into a consistent state.  This mode is the slowest
+metadata into a consistent state.  This mode is the slowest except when data
-except when data needs to be read from and written to disk at the same
+needs to be read from and written to disk at the same time where it
-time where it outperform all others mode.
+outperforms all other modes.
 Compatibility
 -------------
 Ext2 partitions can be easily convert to ext3, with `tune2fs -j <dev>`.
-Ext3 is fully compatible with Ext2.  Ext3 partitions can easily be
+Ext3 is fully compatible with Ext2.  Ext3 partitions can easily be mounted as
-mounted as Ext2.
+Ext2.
 External Tools
 ==============
-see manual pages to know more.
+See manual pages to learn more.
+tune2fs:        create an ext3 journal on an ext2 partition with the -j flag.
+mke2fs:         create an ext3 partition with the -j flag.
+debugfs:        ext2 and ext3 file system debugger.
+ext2online:     online (mounted) ext2 and ext3 filesystem resizer
-tune2fs:        create a ext3 journal on a ext2 partition with the -j flags
-mke2fs:         create a ext3 partition with the -j flags
-debugfs:        ext2 and ext3 file system debugger
 References
 ==========
-kernel source:  file:/usr/src/linux/fs/ext3
+kernel source:  <file:fs/ext3/>
-                file:/usr/src/linux/fs/jbd
+                <file:fs/jbd/>
-programs:       http://e2fsprogs.sourceforge.net
+programs:       http://e2fsprogs.sourceforge.net/
+                http://ext2resize.sourceforge.net
-useful link:
+useful links:   http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
-                http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
                http://www-106.ibm.com/developerworks/linux/library/l-fs7/
                http://www-106.ibm.com/developerworks/linux/library/l-fs8/
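The data= and commit= options described in ext3.txt are passed to the
kernel as a comma-separated string.  The following is a minimal sketch of
building such a string in C; ext3_build_opts() is a hypothetical helper
written for this illustration and does not exist in any library.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "data=<mode>,commit=<seconds>" into buf.
 * Accepts only the three data modes listed in ext3.txt.
 * Returns 0 on success, -1 on an unknown mode or a too-small buffer.
 */
int ext3_build_opts(char *buf, size_t size, const char *data_mode,
		    unsigned int commit_secs)
{
	int n;

	if (strcmp(data_mode, "journal") != 0 &&
	    strcmp(data_mode, "ordered") != 0 &&
	    strcmp(data_mode, "writeback") != 0)
		return -1;

	n = snprintf(buf, size, "data=%s,commit=%u", data_mode, commit_secs);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return 0;
}
```

The resulting string can be given as the last (data) argument of mount(2)
with filesystem type "ext3", or after -o on the mount command line.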
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
index 6b5741e651a2..33f74310d161 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -86,6 +86,62 @@ Mount options
  The default is infinite.  Note that the size of read requests is
  limited anyway to 32 pages (which is 128kbyte on i386).
+Sysfs
+~~~~~
+FUSE sets up the following hierarchy in sysfs:
+  /sys/fs/fuse/connections/N/
+where N is an increasing number allocated to each new connection.
+For each connection the following attributes are defined:
+ 'waiting'
+  The number of requests which are waiting to be transferred to
+  userspace or being processed by the filesystem daemon.  If there is
+  no filesystem activity and 'waiting' is non-zero, then the
+  filesystem is hung or deadlocked.
+ 'abort'
+  Writing anything into this file will abort the filesystem
+  connection.  This means that all waiting requests will be aborted and an
+  error returned for all aborted and new requests.
+Only a privileged user may read or write these attributes.
+Aborting a filesystem connection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+It is possible to get into certain situations where the filesystem is
+not responding.  Reasons for this may be:
+  a) Broken userspace filesystem implementation
+  b) Network connection down
+  c) Accidental deadlock
+  d) Malicious deadlock
+(For more on c) and d) see later sections)
+In any of these cases it may be useful to abort the connection to
+the filesystem.  There are several ways to do this:
+  - Kill the filesystem daemon.  Works in case of a) and b)
+  - Kill the filesystem daemon and all users of the filesystem.  Works
+    in all cases except some malicious deadlocks
+  - Use forced umount (umount -f).  Works in all cases but only if
+    filesystem is still attached (it hasn't been lazy unmounted)
+  - Abort filesystem through the sysfs interface.  Most powerful
+    method, always works.
 How do non-privileged mounts work?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -313,3 +369,10 @@ faulted with get_user_pages().  The 'req->locked' flag indicates
 when the copy is taking place, and interruption is delayed until
 this flag is unset.
+Scenario 3 - Tricky deadlock with asynchronous read
+---------------------------------------------------
+The same situation as above, except thread-1 will wait on page lock
+and hence it will be uninterruptible as well.  The solution is to
+abort the connection with forced umount (if mount is attached) or
+through the abort attribute in sysfs.
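Aborting a connection through the sysfs 'abort' attribute described in
fuse.txt could look roughly like this from C.  It is a sketch under the
assumption that the hierarchy is exactly /sys/fs/fuse/connections/N/ as the
text states; the function names are made up for this example.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Build the path of a per-connection attribute, for example
 * /sys/fs/fuse/connections/3/abort.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int fuse_conn_attr_path(char *buf, size_t size, unsigned int conn,
			const char *attr)
{
	int n = snprintf(buf, size, "/sys/fs/fuse/connections/%u/%s",
			 conn, attr);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Abort connection number conn.  Needs privilege, since only a
 * privileged user may write the attribute.
 * Returns 0 on success, -1 on error.
 */
int fuse_abort_connection(unsigned int conn)
{
	char path[128];
	ssize_t written;
	int fd;

	if (fuse_conn_attr_path(path, sizeof(path), conn, "abort"))
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	/* Writing anything aborts the connection; "1" is arbitrary. */
	written = write(fd, "1", 1);
	close(fd);
	return written == 1 ? 0 : -1;
}
```

Reading the 'waiting' attribute first, with a path built the same way, is
one way to check whether requests are actually stuck before aborting.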
diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
new file mode 100644
index 000000000000..f2595caf052e
--- /dev/null
+++ b/Documentation/filesystems/ocfs2.txt
@@ -0,0 +1,55 @@
+OCFS2 filesystem
+==================
+OCFS2 is a general purpose extent based shared disk cluster file
+system with many similarities to ext3. It supports 64 bit inode
+numbers, and has automatically extending metadata groups which may
+also make it attractive for non-clustered use.
+You'll want to install the ocfs2-tools package in order to at least
+get "mount.ocfs2" and "ocfs2_hb_ctl".
+Project web page:    http://oss.oracle.com/projects/ocfs2
+Tools web page:      http://oss.oracle.com/projects/ocfs2-tools
+OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+All code copyright 2005 Oracle except when otherwise noted.
+CREDITS:
+Lots of code taken from ext3 and other projects.
+Authors in alphabetical order:
+Joel Becker   <joel.becker@oracle.com>
+Zach Brown    <zach.brown@oracle.com>
+Mark Fasheh   <mark.fasheh@oracle.com>
+Kurt Hackel   <kurt.hackel@oracle.com>
+Sunil Mushran <sunil.mushran@oracle.com>
+Manish Singh  <manish.singh@oracle.com>
+Caveats
+=======
+Features which OCFS2 does not support yet:
+        - sparse files
+        - extended attributes
+        - shared writeable mmap
+        - loopback is supported, but data written will not
+          be cluster coherent.
+        - quotas
+        - cluster aware flock
+        - Directory change notification (F_NOTIFY)
+        - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
+        - POSIX ACLs
+        - readpages / writepages (not user visible)
+Mount options
+=============
+OCFS2 supports the following mount options:
+(*) == default
+barrier=1               This enables/disables barriers. barrier=0 disables it,
+                        barrier=1 enables it.
+errors=remount-ro(*)    Remount the filesystem read-only on an error.
+errors=panic            Panic and halt the machine if an error occurs.
+intr            (*)     Allow signals to interrupt cluster operations.
+nointr                  Do not allow signals to interrupt cluster
+                        operations.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d4773565ea2f..944cf109a6f5 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -418,7 +418,7 @@ VmallocChunk:   111088 kB
       Dirty: Memory which is waiting to get written back to the disk
   Writeback: Memory which is actively being written back to the disk
      Mapped: files which have been mmaped, such as libraries
-              Slab: in-kernel data structures cache
+        Slab: in-kernel data structures cache
 CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
              this is the total amount of  memory currently available to
              be allocated on the system. This limit is only adhered to
@@ -1302,6 +1302,23 @@ VM has token based thrashing control mechanism and uses the token to prevent
 unnecessary page faults in thrashing situation. The unit of the value is
 second. The value would be useful to tune thrashing behavior.
+drop_caches
+-----------
+Writing to this will cause the kernel to drop clean caches, dentries and
+inodes from memory, causing that memory to become free.
+To free pagecache:
+        echo 1 > /proc/sys/vm/drop_caches
+To free dentries and inodes:
+        echo 2 > /proc/sys/vm/drop_caches
+To free pagecache, dentries and inodes:
+        echo 3 > /proc/sys/vm/drop_caches
+As this is a non-destructive operation and dirty objects are not freeable, the
+user should run `sync' first.
 2.5 /proc/sys/dev - Device specific parameters
 ----------------------------------------------
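The drop_caches values listed in proc.txt (1, 2 and 3) are consistent with
two flag bits, one for pagecache and one for dentries and inodes.  A minimal
C sketch of the documented procedure, sync first and then write the value,
follows.  The helper names are hypothetical, and reading the values as bit
flags is an interpretation of the list rather than something the text
states.

```c
#include <fcntl.h>
#include <unistd.h>

/* Map the two kinds of cache onto the documented values 1, 2 and 3. */
int drop_caches_value(int pagecache, int dentries_inodes)
{
	return (pagecache ? 1 : 0) | (dentries_inodes ? 2 : 0);
}

/*
 * Run sync, then write value (1..3) to /proc/sys/vm/drop_caches.
 * Needs root.  Returns 0 on success, -1 on error.
 */
int drop_caches(int value)
{
	ssize_t written;
	char digit;
	int fd;

	if (value < 1 || value > 3)
		return -1;

	/* Dirty objects are not freeable, so write them back first. */
	sync();

	fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (fd < 0)
		return -1;

	digit = (char)('0' + value);
	written = write(fd, &digit, 1);
	close(fd);
	return written == 1 ? 0 : -1;
}
```

For example, drop_caches(drop_caches_value(1, 1)) corresponds to
"echo 3 > /proc/sys/vm/drop_caches" preceded by sync.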
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
index b3404a032596..60ab61e54e8a 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -143,12 +143,26 @@ as the following example:
  dir /mnt 755 0 0
  file /init initramfs/init.sh 755 0 0
+Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
+documenting the above file format.
 One advantage of the text file is that root access is not required to
 set permissions or create device nodes in the new archive.  (Note that those
 two example "file" entries expect to find files named "init.sh" and "busybox" in
 a directory called "initramfs", under the linux-2.6.* directory.  See
 Documentation/early-userspace/README for more details.)
+The kernel does not depend on external cpio tools.  gen_init_cpio is created
+from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's
+boot-time extractor is also (obviously) self-contained.  However, if you _do_
+happen to have cpio installed, the following command line can extract the
+generated cpio image back into its component files:
+  cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
+Contents of initramfs:
+----------------------
 If you don't already understand what shared libraries, devices, and paths
 you need to get a minimal root filesystem up and running, here are some
 references:
@@ -161,13 +175,69 @@ designed to be a tiny C library to statically link early userspace
 code against, along with some related utilities.  It is BSD licensed.
 I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
-myself.  These are LGPL and GPL, respectively.
+myself.  These are LGPL and GPL, respectively.  (A self-contained initramfs
+package is planned for the busybox 1.2 release.)
 In theory you could use glibc, but that's not well suited for small embedded
 uses like this.  (A "hello world" program statically linked against glibc is
 over 400k.  With uClibc it's 7k.  Also note that glibc dlopens libnss to do
 name lookups, even when otherwise statically linked.)
+Why cpio rather than tar?
+-------------------------
+This decision was made back in December, 2001.  The discussion started here:
+  http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
+And spawned a second thread (specifically on tar vs cpio), starting here:
+  http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
+The quick and dirty summary version (which is no substitute for reading
+the above threads) is:
+1) cpio is a standard.  It's decades old (from the AT&T days), and already
+   widely used on Linux (inside RPM, Red Hat's device driver disks).  Here's
+   a Linux Journal article about it from 1996:
+      http://www.linuxjournal.com/article/1213
+   It's not as popular as tar because the traditional cpio command line tools
+   require _truly_hideous_ command line arguments.  But that says nothing
+   either way about the archive format, and there are alternative tools,
+   such as:
+     http://freshmeat.net/projects/afio/
+2) The cpio archive format chosen by the kernel is simpler and cleaner (and
+   thus easier to create and parse) than any of the (literally dozens of)
+   various tar archive formats.  The complete initramfs archive format is
+   explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+   extracted in init/initramfs.c.  All three together come to less than 26k
+   total of human-readable text.
+3) The GNU project standardizing on tar is approximately as relevant as
+   Windows standardizing on zip.  Linux is not part of either, and is free
+   to make its own technical decisions.
+4) Since this is a kernel internal format, it could easily have been
+   something brand new.  The kernel provides its own tools to create and
+   extract this format anyway.  Using an existing standard was preferable,
+   but not essential.
+5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
+   supported on the kernel side"):
+      http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
+   explained his reasoning:
+      http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
+      http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
+   and, most importantly, designed and implemented the initramfs code.
 Future directions:
 ------------------
diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt
index d803abed29f0..5832377b7340 100644
--- a/Documentation/filesystems/relayfs.txt
+++ b/Documentation/filesystems/relayfs.txt
@@ -44,30 +44,41 @@ relayfs can operate in a mode where it will overwrite data not yet
 collected by userspace, and not wait for it to consume it.
 relayfs itself does not provide for communication of such data between
-userspace and kernel, allowing the kernel side to remain simple and not
+userspace and kernel, allowing the kernel side to remain simple and
-impose a single interface on userspace. It does provide a separate
+not impose a single interface on userspace. It does provide a set of
-helper though, described below.
+examples and a separate helper though, described below.
+klog and relay-apps example code
+================================
+relayfs itself is ready to use, but to make things easier, a couple
+simple utility functions and a set of examples are provided.
+The relay-apps example tarball, available on the relayfs sourceforge
+site, contains a set of self-contained examples, each consisting of a
+pair of .c files containing boilerplate code for each of the user and
+kernel sides of a relayfs application; combined these two sets of
+boilerplate code provide glue to easily stream data to disk, without
+having to bother with mundane housekeeping chores.
+The 'klog debugging functions' patch (klog.patch in the relay-apps
+tarball) provides a couple of high-level logging functions to the
+kernel which allow writing formatted text or raw data to a channel,
+regardless of whether a channel to write into exists or not, or
+whether relayfs is compiled into the kernel or is configured as a
+module.  These functions allow you to put unconditional 'trace'
+statements anywhere in the kernel or kernel modules; only when there
+is a 'klog handler' registered will data actually be logged (see the
+klog and kleak examples for details).
+It is of course possible to use relayfs from scratch i.e. without
+using any of the relay-apps example code or klog, but you'll have to
+implement communication between userspace and kernel, allowing both to
+convey the state of buffers (full, empty, amount of padding).
+klog and the relay-apps examples can be found in the relay-apps
+tarball on http://relayfs.sourceforge.net
-klog, relay-app & librelay
-==========================
-relayfs itself is ready to use, but to make things easier, two
-additional systems are provided.  klog is a simple wrapper to make
-writing formatted text or raw data to a channel simpler, regardless of
-whether a channel to write into exists or not, or whether relayfs is
-compiled into the kernel or is configured as a module.  relay-app is
-the kernel counterpart of userspace librelay.c, combined these two
-files provide glue to easily stream data to disk, without having to
-bother with housekeeping.  klog and relay-app can be used together,
-with klog providing high-level logging functions to the kernel and
-relay-app taking care of kernel-user control and disk-logging chores.
-It is possible to use relayfs without relay-app & librelay, but you'll
-have to implement communication between userspace and kernel, allowing
-both to convey the state of buffers (full, empty, amount of padding).
-klog, relay-app and librelay can be found in the relay-apps tarball on
-http://relayfs.sourceforge.net
 The relayfs user space API
 ==========================
@@ -125,6 +136,8 @@ Here's a summary of the API relayfs provides to in-kernel clients:
    relay_reset(chan)
    relayfs_create_dir(name, parent)
    relayfs_remove_dir(dentry)
+    relayfs_create_file(name, parent, mode, fops, data)
+    relayfs_remove_file(dentry)
  channel management typically called on instigation of userspace:
@@ -141,6 +154,8 @@ Here's a summary of the API relayfs provides to in-kernel clients:
    subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
    buf_mapped(buf, filp)
    buf_unmapped(buf, filp)
+    create_buf_file(filename, parent, mode, buf, is_global)
+    remove_buf_file(dentry)
  helper functions:
@@ -320,6 +335,71 @@ forces a sub-buffer switch on all the channel buffers, and can be used
 to finalize and process the last sub-buffers before the channel is
 closed.
+Creating non-relay files
+------------------------
+relay_open() automatically creates files in the relayfs filesystem to
+represent the per-cpu kernel buffers; it's often useful for
+applications to be able to create their own files alongside the relay
+files in the relayfs filesystem as well e.g. 'control' files much like
+those created in /proc or debugfs for similar purposes, used to
+communicate control information between the kernel and user sides of a
+relayfs application.  For this purpose the relayfs_create_file() and
+relayfs_remove_file() API functions exist.  For relayfs_create_file(),
+the caller passes in a set of user-defined file operations to be used
+for the file and an optional void * to a user-specified data item,
+which will be accessible via inode->u.generic_ip (see the relay-apps
+tarball for examples).  The file_operations are a required parameter
+to relayfs_create_file() and thus the semantics of these files are
+completely defined by the caller.
+See the relay-apps tarball at http://relayfs.sourceforge.net for
+examples of how these non-relay files are meant to be used.
+Creating relay files in other filesystems
+-----------------------------------------
+By default of course, relay_open() creates relay files in the relayfs
+filesystem.  Because relay_file_operations is exported, however, it's
+also possible to create and use relay files in other pseudo-filesystems
+such as debugfs.
+For this purpose, two callback functions are provided,
+create_buf_file() and remove_buf_file().  create_buf_file() is called
+once for each per-cpu buffer from relay_open() to allow the client to
+create a file to be used to represent the corresponding buffer; if
+this callback is not defined, the default implementation will create
+and return a file in the relayfs filesystem to represent the buffer.
+The callback should return the dentry of the file created to represent
+the relay buffer.  Note that the parent directory passed to
+relay_open() (and passed along to the callback), if specified, must
+exist in the same filesystem the new relay file is created in.  If
+create_buf_file() is defined, remove_buf_file() must also be defined;
+it's responsible for deleting the file(s) created in create_buf_file()
+and is called during relay_close().
+The create_buf_file() implementation can also be defined in such a way
+as to allow the creation of a single 'global' buffer instead of the
+default per-cpu set.  This can be useful for applications interested
+mainly in seeing the relative ordering of system-wide events without
+the need to bother with saving explicit timestamps for the purpose of
+merging/sorting per-cpu files in a postprocessing step.
+To have relay_open() create a global buffer, the create_buf_file()
+implementation should set the value of the is_global outparam to a
+non-zero value in addition to creating the file that will be used to
+represent the single buffer.  In the case of a global buffer,
+create_buf_file() and remove_buf_file() will be called only once.  The
+normal channel-writing functions e.g. relay_write() can still be used
+- writes from any cpu will transparently end up in the global buffer -
+but since it is a global buffer, callers should make sure they use the
+proper locking for such a buffer, either by wrapping writes in a
+spinlock, or by copying a write function from relayfs_fs.h and
+creating a local version that internally does the proper locking.
+See the 'exported-relayfile' examples in the relay-apps tarball for
+examples of creating and using relay files in debugfs.
 Misc
 ----
diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt
new file mode 100644
index 000000000000..8edc3952eff4
--- /dev/null
+++ b/Documentation/filesystems/spufs.txt
@@ -0,0 +1,521 @@
+SPUFS(2)                   Linux Programmer's Manual                  SPUFS(2)
+NAME
+       spufs - the SPU file system
+DESCRIPTION
+       The SPU file system is used on PowerPC machines that implement the Cell
+       Broadband Engine Architecture in order to access Synergistic  Processor
+       Units (SPUs).
+       The file system provides a name space similar to posix shared memory or
+       message queues. Users that have write permissions on  the  file  system
+       can use spu_create(2) to establish SPU contexts in the spufs root.
+       Every SPU context is represented by a directory containing a predefined
+       set of files. These files can be used for manipulating the state of the
+       logical SPU. Users can change permissions on those files, but not actu-
+       ally add or remove files.
+MOUNT OPTIONS
+       uid=<uid>
+              set the user owning the mount point, the default is 0 (root).
+       gid=<gid>
+              set the group owning the mount point, the default is 0 (root).
+FILES
+       The files in spufs mostly follow the standard behavior for regular sys-
+       tem  calls like read(2) or write(2), but often support only a subset of
+       the operations supported on regular file systems. This list details the
+       supported  operations  and  the  deviations  from  the behaviour in the
+       respective man pages.
+       All files that support the read(2) operation also support readv(2)  and
+       all  files  that support the write(2) operation also support writev(2).
+       All files support the access(2) and stat(2) family of  operations,  but
+       only  the  st_mode,  st_nlink,  st_uid and st_gid fields of struct stat
+       contain reliable information.
+       All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2)  opera-
+       tions,  but  will  not be able to grant permissions that contradict the
+       possible operations, e.g. read access on the wbox file.
+       The current set of files is:
+   /mem
+       the contents of the local storage memory  of  the  SPU.   This  can  be
+       accessed  like  a regular shared memory file and contains both code and
+       data in the address space of the SPU.  The possible  operations  on  an
+       open mem file are:
+       read(2), pread(2), write(2), pwrite(2), lseek(2)
+              These  operate  as  documented, with the exception that lseek(2),
+              write(2) and pwrite(2) are not supported beyond the end  of  the
+              file. The file size is the size of the local storage of the SPU,
+              which normally is 256 kilobytes.
+       mmap(2)
+              Mapping mem into the process address space gives access  to  the
+              SPU  local  storage  within  the  process  address  space.  Only
+              MAP_SHARED mappings are allowed.
+   /mbox
+       The first SPU to CPU communication mailbox. This file is read-only  and
+       can  be  read  in  units of 32 bits.  The file can only be used in non-
+       blocking mode and even poll() will not block on  it.   The  possible
+       operations on an open mbox file are:
+       read(2)
+              If  a  count smaller than four is requested, read returns -1 and
+              sets errno to EINVAL.  If there is no data available in the mail
+              box,  the  return  value  is set to -1 and errno becomes EAGAIN.
+              When data has been read successfully, four bytes are  placed  in
+              the data buffer and the value four is returned.
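The mbox read semantics described above can be sketched in C as follows. This is only an illustration under stated assumptions: the helper name mbox_try_read() is hypothetical, and the caller is assumed to have already opened the mbox file of some SPU context. The helper always asks for four bytes, so the EINVAL case for short counts cannot occur, and it reports an empty mailbox (EAGAIN) separately from real errors.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to fetch one 32-bit word from an already-open
 * mbox file descriptor.
 *
 * Returns 1 if a word was stored in *word, 0 if the mailbox was empty
 * (read failed with EAGAIN), and -1 on any other failure, including a
 * short read.
 */
static int mbox_try_read(int fd, uint32_t *word)
{
	ssize_t n;

	assert(word != NULL);

	/* Always request exactly four bytes; smaller counts give EINVAL. */
	n = read(fd, word, sizeof(*word));
	if (n == (ssize_t)sizeof(*word))
		return 1;
	if (n < 0 && errno == EAGAIN)
		return 0;
	return -1;
}
```

Because mbox never blocks, a caller that wants to wait for data would have to retry this helper itself or use the ibox file instead.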
+   /ibox
+       The  second  SPU  to CPU communication mailbox. This file is similar to
+       the first mailbox file, but can be read in blocking I/O mode,  and  the
+       poll  family of system calls can be used to wait for it.  The possible
+       operations on an open ibox file are:
+       read(2)
+              If a count smaller than four is requested, read returns  -1  and
+              sets errno to EINVAL.  If there is no data available in the mail
+              box and the file descriptor has been opened with O_NONBLOCK, the
+              return value is set to -1 and errno becomes EAGAIN.
+              If  there  is  no  data  available  in the mail box and the file
+              descriptor has been opened without  O_NONBLOCK,  the  call  will
+              block  until  the  SPU  writes to its interrupt mailbox channel.
+              When data has been read successfully, four bytes are  placed  in
+              the data buffer and the value four is returned.
+       poll(2)
+              Poll  on  the  ibox  file returns (POLLIN | POLLRDNORM) whenever
+              data is available for reading.
+   /wbox
+       The CPU to SPU communication mailbox. It is write-only and can be
+       written in units of 32 bits. If the mailbox is full, write() will
+       block and poll can be used to wait for it becoming empty again.  The
+       possible operations on an open wbox file are:
+       write(2)
+              If a count smaller than four is requested, write returns -1 and
+              sets errno to EINVAL.  If there is no space available in the mail
+              box and the file descriptor has been opened with O_NONBLOCK, the
+              return value is set to -1 and errno becomes EAGAIN.
+              If there is no space available in the mail box and the file
+              descriptor has been opened without O_NONBLOCK, the call will
+              block until the SPU reads from its PPE mailbox channel.  When
+              data has been written successfully, the value four is returned.
+       poll(2)
+              Poll  on  the  wbox file returns (POLLOUT | POLLWRNORM) whenever
+              space is available for writing.
+   /mbox_stat
+   /ibox_stat
+   /wbox_stat
+       Read-only files that contain the length of the current queue, i.e.  how
+       many  words  can  be  read  from  mbox or ibox or how many words can be
+       written to wbox without blocking.  The files can be read only in 4-byte
+       units  and  return  a  big-endian  binary integer number.  The possible
+       operations on an open *box_stat file are:
+       read(2)
+              If a count smaller than four is requested, read returns  -1  and
+              sets errno to EINVAL.  Otherwise, a four byte value is placed in
+              the data buffer, containing the number of elements that  can  be
+              read  from  (for  mbox_stat  and  ibox_stat)  or written to (for
+              wbox_stat) the respective mail box without blocking or resulting
+              in EAGAIN.
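Since the *box_stat files return a big-endian binary integer, a portable reader can assemble the value byte by byte rather than relying on host byte order. The following is a minimal sketch; box_stat_decode() is a hypothetical name, and the four input bytes are assumed to come from a successful four-byte read(2) on one of the *box_stat files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode the four bytes read from a *box_stat file
 * into a host-order count.  The bytes are big-endian, so buf[0] is the
 * most significant byte.
 */
static uint32_t box_stat_decode(const unsigned char buf[4])
{
	assert(buf != NULL);

	return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

On the big-endian PowerPC host itself, reading straight into a 32-bit integer would give the same result; the explicit decode only matters if the bytes are interpreted elsewhere.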
+   /npc
+   /decr
+   /decr_status
+   /spu_tag_mask
+   /event_mask
+   /srr0
+       Internal  registers  of  the SPU. The representation is an ASCII string
+       with the numeric value of the next instruction to  be  executed.  These
+       can  be  used in read/write mode for debugging, but normal operation of
+       programs should not rely on them because access to any of  them  except
+       npc requires an SPU context save and is therefore very inefficient.
+       The contents of these files are:
+       npc                 Next Program Counter
+       decr                SPU Decrementer
+       decr_status         Decrementer Status
+       spu_tag_mask        MFC tag mask for SPU DMA
+       event_mask          Event mask for SPU interrupts
+       srr0                Interrupt Return address register
+       The   possible   operations   on   an   open  npc,  decr,  decr_status,
+       spu_tag_mask, event_mask or srr0 file are:
+       read(2)
+              When the count supplied to the read call  is  shorter  than  the
+              required  length for the pointer value plus a newline character,
+              subsequent reads from the same file descriptor  will  result  in
+              completing  the string, regardless of changes to the register by
+              a running SPU task.  When a complete string has been  read,  all
+              subsequent read operations will return zero bytes and a new file
+              descriptor needs to be opened to read the value again.
+       write(2)
+              A write operation on the file results in setting the register to
+              the  value  given  in  the string. The string is parsed from the
+              beginning to the first non-numeric character or the end  of  the
+              buffer.  Subsequent writes to the same file descriptor overwrite
+              the previous setting.
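The read behaviour of the register files (partial reads continue the same string, and a finished string is followed by zero-byte reads) suggests a simple loop that reads until end of file and then parses the text. The sketch below makes several assumptions that the text above does not confirm: the helper name spu_reg_read() is hypothetical, the string is assumed to fit in 31 characters, and strtoul() with base 0 is used so that either a decimal or a 0x-prefixed value would be accepted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the complete ASCII value from an open
 * register file (npc, decr, ...) and convert it to a number.
 *
 * Returns 0 on success with the result in *value, or -1 on a read
 * error, an empty or overlong string, or unparsable text.
 */
static int spu_reg_read(int fd, unsigned long *value)
{
	char text[32];
	char *end;
	size_t used = 0;
	ssize_t n;

	assert(value != NULL);

	/* Keep reading until read(2) returns zero bytes. */
	for (;;) {
		if (used == sizeof(text) - 1)
			return -1;	/* longer than this sketch expects */
		n = read(fd, text + used, sizeof(text) - 1 - used);
		if (n < 0)
			return -1;
		if (n == 0)
			break;
		used += (size_t)n;
	}
	if (used == 0)
		return -1;
	text[used] = '\0';

	errno = 0;
	*value = strtoul(text, &end, 0);	/* base 0 is an assumption */
	if (end == text || errno != 0)
		return -1;
	return 0;
}
```

To read the register again, the caller would open a fresh file descriptor, as the read(2) description above requires.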
+   /fpcr
+       This file gives access to the Floating Point Status and Control  Regis-
+       ter as a four byte long file. The operations on the fpcr file are:
+       read(2)
+              If  a  count smaller than four is requested, read returns -1 and
+              sets errno to EINVAL.  Otherwise, a four byte value is placed in
+              the data buffer, containing the current value of the fpcr regis-
+              ter.
+       write(2)
+              If a count smaller than four is requested, write returns -1  and
+              sets  errno  to  EINVAL.  Otherwise, a four byte value is copied
+              from the data buffer, updating the value of the fpcr register.
+   /signal1
+   /signal2
+       The two signal notification channels of an SPU.  These  are  read-write
+       files  that  operate  on  a 32 bit word.  Writing to one of these files
+       triggers an interrupt on the SPU. The  value  written  to  the  signal
+       files can be read from the SPU through a channel read or from host user
+       space through the file.  After the value has been read by the  SPU,  it
+       is  reset  to zero.  The possible operations on an open signal1 or sig-
+       nal2 file are:
+       read(2)
+              If a count smaller than four is requested, read returns  -1  and
+              sets errno to EINVAL.  Otherwise, a four byte value is placed in
+              the data buffer, containing the current value of  the  specified
+              signal notification register.
+       write(2)
+              If  a count smaller than four is requested, write returns -1 and
+              sets errno to EINVAL.  Otherwise, a four byte  value  is  copied
+              from the data buffer, updating the value of the specified signal
+              notification register.  The signal  notification  register  will
+              either be replaced with the input data or will be updated to the
+              bitwise OR of the old value and the input data, depending on the
+              contents  of  the  signal1_type,  or  signal2_type respectively,
+              file.
+   /signal1_type
+   /signal2_type
+       These two files change the behavior of the signal1 and signal2
+       notification files.  They contain a numerical ASCII string which is
+       read as either "1" or "0".  In mode 0 (overwrite), the hardware
+       replaces the contents of the signal channel with the data that is
+       written to it.  In mode 1 (logical OR), the hardware accumulates the
+       bits that are subsequently written to it.  The possible operations on
+       an open signal1_type or signal2_type file are:
+       read(2)
+              When the count supplied to the read call  is  shorter  than  the
+              required  length  for the digit plus a newline character, subse-
+              quent reads from the same file descriptor will  result  in  com-
+              pleting  the  string.  When a complete string has been read, all
+              subsequent read operations will return zero bytes and a new file
+              descriptor needs to be opened to read the value again.
+       write(2)
+              A write operation on the file results in setting the register to
+              the value given in the string. The string  is  parsed  from  the
+              beginning  to  the first non-numeric character or the end of the
+              buffer.  Subsequent writes to the same file descriptor overwrite
+              the previous setting.
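The difference between the two signal modes can be illustrated with a small model of the register update. This is not driver code: signal_apply() and the enum names are hypothetical, and the function only mirrors the overwrite and logical OR behaviour described for signal1_type and signal2_type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two values of signal1_type/signal2_type. */
enum signal_mode {
	SIGNAL_OVERWRITE = 0,	/* "0": new data replaces the register */
	SIGNAL_OR        = 1	/* "1": new data is OR-ed into the register */
};

/*
 * Model of what a write to signal1 or signal2 does to the signal
 * notification register, given the currently selected mode.
 */
static uint32_t signal_apply(uint32_t old, uint32_t data,
			     enum signal_mode mode)
{
	assert(mode == SIGNAL_OVERWRITE || mode == SIGNAL_OR);

	return mode == SIGNAL_OR ? (old | data) : data;
}
```

In OR mode several writers can each set their own bit without losing bits set by others, until the SPU reads the register and it is reset to zero.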
+EXAMPLES
+       /etc/fstab entry
+              none      /spu      spufs     gid=spu   0    0
+AUTHORS
+       Arnd  Bergmann  <arndb@de.ibm.com>,  Mark  Nutter <mnutter@us.ibm.com>,
+       Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
+SEE ALSO
+       capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
+Linux                             2005-09-28                          SPUFS(2)
+------------------------------------------------------------------------------
+SPU_RUN(2)                 Linux Programmer's Manual                SPU_RUN(2)
+NAME
+       spu_run - execute an spu context
+SYNOPSIS
+       #include <sys/spu.h>
+       int spu_run(int fd, unsigned int *npc, unsigned int *event);
+DESCRIPTION
+       The  spu_run system call is used on PowerPC machines that implement the
+       Cell Broadband Engine Architecture in order to access Synergistic  Pro-
+       cessor  Units  (SPUs).  It  uses the fd that was returned from spu_cre-
+       ate(2) to address a specific SPU context. When the context gets  sched-
+       uled  to a physical SPU, it starts execution at the instruction pointer
+       passed in npc.
+       Execution of SPU code happens synchronously, meaning that spu_run  does
+       not  return  while the SPU is still running. If there is a need to exe-
+       cute SPU code in parallel with other code on either  the  main  CPU  or
+       other  SPUs,  you  need to create a new thread of execution first, e.g.
+       using the pthread_create(3) call.
+       When spu_run returns, the current value of the SPU instruction  pointer
+       is  written back to npc, so you can call spu_run again without updating
+       the pointers.
+       event can be a NULL pointer or point to an extended  status  code  that
+       gets  filled  when spu_run returns. It can be one of the following con-
+       stants:
+       SPE_EVENT_DMA_ALIGNMENT
+              A DMA alignment error
+       SPE_EVENT_SPE_DATA_SEGMENT
+              A DMA segmentation error
+       SPE_EVENT_SPE_DATA_STORAGE
+              A DMA storage error
+       If NULL is passed as the event argument, these errors will result in  a
+       signal delivered to the calling process.
+RETURN VALUE
+       spu_run  returns the value of the spu_status register or -1 to indicate
+       an error and set errno to one of the error  codes  listed  below.   The
+       spu_status  register  value  contains  a  bit  mask of status codes and
+       optionally a 14 bit code returned from the stop-and-signal  instruction
+       on the SPU. The bit masks for the status codes are:
+       0x02   SPU was stopped by stop-and-signal.
+       0x04   SPU was stopped by halt.
+       0x08   SPU is waiting for a channel.
+       0x10   SPU is in single-step mode.
+       0x20   SPU has tried to execute an invalid instruction.
+       0x40   SPU has tried to access an invalid channel.
+       0x3fff0000
+              The  bits  masked with this value contain the code returned from
+              stop-and-signal.
+       There are always one or more of the lower eight bits set  or  an  error
+       code is returned from spu_run.
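The bit masks listed above can be applied to the value returned by spu_run roughly as follows. The macro and function names are hypothetical and are not taken from any header; only the numeric masks come from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for two of the masks documented above. */
#define SPU_ST_STOP_AND_SIGNAL	0x02u
#define SPU_ST_STOP_CODE_MASK	0x3fff0000u

/*
 * Hypothetical helper: extract the 14 bit stop-and-signal code from a
 * non-negative spu_run return value.
 *
 * Returns 1 and stores the code in *code if the SPU was stopped by
 * stop-and-signal, or 0 if that status bit is not set.
 */
static int spu_stop_code(uint32_t status, unsigned int *code)
{
	assert(code != NULL);

	if (!(status & SPU_ST_STOP_AND_SIGNAL))
		return 0;
	*code = (status & SPU_ST_STOP_CODE_MASK) >> 16;
	return 1;
}
```

A caller would check for a return value of -1 from spu_run first and only pass non-error values to a helper like this.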
+ERRORS
+       EAGAIN or EWOULDBLOCK
+              fd is in non-blocking mode and spu_run would block.
+       EBADF  fd is not a valid file descriptor.
+       EFAULT npc is not a valid pointer or event is neither NULL nor a valid
+              pointer.
+       EINTR  A signal occurred while spu_run was in progress.  The  npc  value
+              has  been updated to the new program counter value if necessary.
+       EINVAL fd is not a file descriptor returned from spu_create(2).
+       ENOMEM Insufficient memory was available to handle a page fault result-
+              ing from an MFC direct memory access.
+       ENOSYS the functionality is not provided by the current system, because
+              either the hardware does not provide SPUs or the spufs module is
+              not loaded.
+NOTES
+       spu_run  is  meant  to  be  used  from  libraries that implement a more
+       abstract interface to SPUs, not to be used from  regular  applications.
+       See  http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+       ommended libraries.
+CONFORMING TO
+       This call is Linux specific and only implemented by the ppc64 architec-
+       ture. Programs using this system call are not portable.
+BUGS
+       The code does not yet fully implement all features outlined here.
+AUTHOR
+       Arnd Bergmann <arndb@de.ibm.com>
+SEE ALSO
+       capabilities(7), close(2), spu_create(2), spufs(7)
+Linux                             2005-09-28                        SPU_RUN(2)
+------------------------------------------------------------------------------
+SPU_CREATE(2)              Linux Programmer's Manual             SPU_CREATE(2)
+NAME
+       spu_create - create a new spu context
+SYNOPSIS
+       #include <sys/types.h>
+       #include <sys/spu.h>
+       int spu_create(const char *pathname, int flags, mode_t mode);
+DESCRIPTION
+       The  spu_create  system call is used on PowerPC machines that implement
+       the Cell Broadband Engine Architecture in order to  access  Synergistic
+       Processor  Units (SPUs). It creates a new logical context for an SPU in
+       pathname and returns a handle associated  with  it.   pathname  must
+       point  to  a  non-existing directory in the mount point of the SPU file
+       system (spufs).  When spu_create is successful, a directory  gets  cre-
+       ated on pathname and it is populated with files.
+       The  returned  file  handle can only be passed to spu_run(2) or closed,
+       other operations are not defined on it. When it is closed, all  associ-
+       ated  directory entries in spufs are removed. When the last file handle
+       pointing either inside  of  the  context  directory  or  to  this  file
+       descriptor is closed, the logical SPU context is destroyed.
+       The  parameter flags can be zero or any bitwise or'd combination of the
+       following constants:
+       SPU_RAWIO
+              Allow mapping of some of the hardware registers of the SPU  into
+              user space. This flag requires the CAP_SYS_RAWIO capability, see
+              capabilities(7).
+       The mode parameter specifies the permissions used for creating the  new
+       directory  in  spufs.   mode is modified with the user's umask(2) value
+       and then used for both the directory and the files contained in it. The
+       file permissions mask out some more bits of mode because they typically
+       support only read or write access. See stat(2) for a full list  of  the
+       possible mode values.
+RETURN VALUE
+       spu_create  returns a new file descriptor. It may return -1 to indicate
+       an error condition and set errno to  one  of  the  error  codes  listed
+       below.
+ERRORS
+       EACCES
+              The  current  user does not have write access on the spufs mount
+              point.
+       EEXIST An SPU context already exists at the given path name.
+       EFAULT pathname is not a valid string pointer in  the  current  address
+              space.
+       EINVAL pathname is not a directory in the spufs mount point.
+       ELOOP  Too many symlinks were found while resolving pathname.
+       EMFILE The process has reached its maximum open file limit.
+       ENAMETOOLONG
+              pathname was too long.
+       ENFILE The system has reached the global open file limit.
+       ENOENT Part of pathname could not be resolved.
+       ENOMEM The kernel could not allocate all resources required.
+       ENOSPC There  are  not  enough  SPU resources available to create a new
+              context or the user specific limit for the number  of  SPU  con-
+              texts has been reached.
+       ENOSYS the functionality is not provided by the current system, because
+              either the hardware does not provide SPUs or the spufs module is
+              not loaded.
+       ENOTDIR
+              A part of pathname is not a directory.
+NOTES
+       spu_create  is  meant  to  be used from libraries that implement a more
+       abstract interface to SPUs, not to be used from  regular  applications.
+       See  http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+       ommended libraries.
+FILES
+       pathname must point to a location beneath the mount point of spufs.  By
+       convention, it gets mounted in /spu.
+CONFORMING TO
+       This call is Linux specific and only implemented by the ppc64 architec-
+       ture. Programs using this system call are not portable.
+BUGS
+       The code does not yet fully implement all features outlined here.
+AUTHOR
+       Arnd Bergmann <arndb@de.ibm.com>
+SEE ALSO
+       capabilities(7), close(2), spu_run(2), spufs(7)
+Linux                             2005-09-28                     SPU_CREATE(2)
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
index 988a62fae11f..7ba2baa165ff 100644
--- a/Documentation/filesystems/sysfs-pci.txt
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -1,4 +1,5 @@
 Accessing PCI device resources through sysfs
+--------------------------------------------
 sysfs, usually mounted at /sys, provides access to PCI resources on platforms
 that support it.  For example, a given bus might look like this:
@@ -47,14 +48,21 @@ files, each with their own function.
  binary - file contains binary data
  cpumask - file contains a cpumask type
-The read only files are informational, writes to them will be ignored.
-Writable files can be used to perform actions on the device (e.g. changing
-config space, detaching a device).  mmapable files are available via an
-mmap of the file at offset 0 and can be used to do actual device programming
-from userspace.  Note that some platforms don't support mmapping of certain
-resources, so be sure to check the return value from any attempted mmap.
+The read only files are informational, writes to them will be ignored, with
+the exception of the 'rom' file.  Writable files can be used to perform
+actions on the device (e.g. changing config space, detaching a device).
+mmapable files are available via an mmap of the file at offset 0 and can be
+used to do actual device programming from userspace.  Note that some platforms
+don't support mmapping of certain resources, so be sure to check the return
+value from any attempted mmap.
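Annotation: the mmap advice in the paragraph above can be sketched in userspace as follows. This is a minimal illustration, not part of the patch. The helper name map_resource and the choice to return None on refusal are assumptions made here; the caller is expected to supply the path of an mmapable sysfs file and the length to map.

```python
import mmap
import os


def map_resource(path, length):
    """Map `length` bytes of `path` at offset 0.

    Returns an mmap object, or None if the mapping is refused
    (hypothetical helper; name and return convention are assumptions).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # The documentation says mmapable files are mapped at offset 0.
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE, offset=0)
    except (OSError, ValueError):
        # Some platforms do not support mmapping certain resources,
        # so a failed mapping is reported to the caller, not ignored.
        return None
    finally:
        # Python's mmap keeps its own duplicate of the descriptor.
        os.close(fd)
```

A caller would test the result before touching the mapping, for example `m = map_resource(path, size)` followed by `if m is None:` and a fallback.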
+The 'rom' file is special in that it provides read-only access to the device's
+ROM file, if available.  It's disabled by default, however, so applications
+should write the string "1" to the file to enable it before attempting a read
+call, and disable it following the access by writing "0" to the file.
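Annotation: the enable, read, disable sequence described for the 'rom' file might look like the sketch below. It is an illustration only, not part of the patch. The helper names read_pci_rom and _write_flag are hypothetical, and the caller supplies the path of the device's 'rom' file under sysfs.

```python
import os


def _write_flag(path, flag):
    # Open without truncation and write the flag bytes directly.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, flag)
    finally:
        os.close(fd)


def read_pci_rom(rom_path):
    """Return the ROM contents read from `rom_path` (hypothetical helper)."""
    _write_flag(rom_path, b"1")      # enable ROM access before reading
    try:
        with open(rom_path, "rb") as rom:
            return rom.read()
    finally:
        _write_flag(rom_path, b"0")  # disable again, even if the read failed
```

The try/finally keeps the disable step paired with the enable step, which matches the documented requirement to write "0" after the access.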
 Accessing legacy resources through sysfs
+----------------------------------------
 Legacy I/O port and ISA memory resources are also provided in sysfs if the
 underlying platform supports them.  They're located in the PCI class hierarchy,
@@ -75,6 +83,7 @@ simply dereference the returned pointer (after checking for errors of course)
 to access legacy memory space.
 Supporting PCI access on new platforms
+--------------------------------------
 In order to support PCI resource mapping as described above, Linux platform
 code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 0d783c504ead..dbe4d87d2615 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -78,6 +78,18 @@ use up all the memory on the machine; but enhances the scalability of
 that instance in a system with many cpus making intensive use of it.
+tmpfs has a mount option to set the NUMA memory allocation policy for
+all files in that instance:
+mpol=interleave         prefers to allocate memory from each node in turn
+mpol=default            prefers to allocate memory from the local node
+mpol=bind               prefers to allocate from mpol_nodelist
+mpol=preferred          prefers to allocate from first node in mpol_nodelist
+The following mount option is used in conjunction with mpol=interleave,
+mpol=bind or mpol=preferred:
+mpol_nodelist:  nodelist suitable for parsing with nodelist_parse.
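Annotation: the options above combine into a single comma-separated mount option string. The sketch below builds one, under two assumptions that the text does not state: that mpol_nodelist is given in key=value form like mpol, and that the nodelist itself must not contain commas because commas separate mount options. The helper name tmpfs_mpol_options is hypothetical.

```python
# Policies the text says are used in conjunction with mpol_nodelist.
NODELIST_POLICIES = {"interleave", "bind", "preferred"}
ALL_POLICIES = NODELIST_POLICIES | {"default"}


def tmpfs_mpol_options(policy, nodelist=None):
    """Build a tmpfs mount option string for a NUMA policy (hypothetical helper)."""
    if policy not in ALL_POLICIES:
        raise ValueError("unknown mpol value: %r" % policy)
    opts = ["mpol=" + policy]
    if nodelist is not None:
        if policy not in NODELIST_POLICIES:
            raise ValueError("mpol_nodelist is not used with mpol=" + policy)
        if "," in nodelist:
            # Assumption: a comma would be read as an option separator.
            raise ValueError("nodelist must not contain commas in this sketch")
        # Assumption: key=value syntax, mirroring the mpol option.
        opts.append("mpol_nodelist=" + nodelist)
    return ",".join(opts)
```

The resulting string would be passed as the option argument when mounting the tmpfs instance.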
 To specify the initial root directory you can use the following mount
 options:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ee4c0a8b8db7..e56e842847d3 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -162,9 +162,8 @@ get_sb() method fills in is the "s_op" field. This is a pointer to
 a "struct super_operations" which describes the next level of the
 filesystem implementation.
-Usually, a filesystem uses generic one of the generic get_sb()
-implementations and provides a fill_super() method instead. The
-generic methods are:
+Usually, a filesystem uses one of the generic get_sb() implementations
+and provides a fill_super() method instead. The generic methods are:
  get_sb_bdev: mount a filesystem residing on a block device