Merge branch 'master' into for-next

author: Jiri Kosina <jkosina@suse.cz> 2010-04-22 20:08:44 -0400
committer: Jiri Kosina <jkosina@suse.cz> 2010-04-22 20:08:44 -0400
commit: 6c9468e9eb1252eaefd94ce7f06e1be9b0b641b1 (patch)
tree: 797676a336b050bfa1ef879377c07e541b9075d6 /Documentation/filesystems
parent: 4cb3ca7cd7e2cae8d1daf5345ec99a1e8502cf3f (diff)
parent: c81eddb0e3728661d1585fbc564449c94165cc36 (diff)
4 files changed, 163 insertions, 3 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 3bae418c6ad3..4303614b5add 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -16,6 +16,8 @@ befs.txt
        - information about the BeOS filesystem for Linux.
 bfs.txt
        - info for the SCO UnixWare Boot Filesystem (BFS).
+ceph.txt
+        - info for the Ceph Distributed File System
 cifs.txt
        - description of the CIFS filesystem.
 coda.txt
diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt
index 57e0b80a5274..c0236e753bc8 100644
--- a/Documentation/filesystems/9p.txt
+++ b/Documentation/filesystems/9p.txt
@@ -37,6 +37,15 @@ For Plan 9 From User Space applications (http://swtch.com/plan9)
        mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
+For a server running on a QEMU host with virtio transport:
+        mount -t 9p -o trans=virtio <mount_tag> /mnt/9
+where mount_tag is the tag associated by the server to each of the exported
+mount points. Each 9P export is seen by the client as a virtio device with an
+associated "mount_tag" property. Available mount tags can be
+seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
 OPTIONS
 =======
@@ -47,7 +56,7 @@ OPTIONS
                        fd      - used passed file descriptors for connection
                                (see rfdno and wfdno)
                        virtio  - connect to the next virtio channel available
-                                (from lguest or KVM with trans_virtio module)
+                                (from QEMU with trans_virtio module)
                        rdma    - connect to a specified RDMA channel
  uname=name    user name to attempt mount as on the remote server.  The
@@ -85,7 +94,12 @@ OPTIONS
  port=n        port to connect to on the remote server
-  noextend      force legacy mode (no 9p2000.u semantics)
+  noextend      force legacy mode (no 9p2000.u or 9p2000.L semantics)
+  version=name  Select 9P protocol version. Valid options are:
+                        9p2000          - Legacy mode (same as noextend)
+                        9p2000.u        - Use 9P2000.u protocol
+                        9p2000.L        - Use 9P2000.L protocol
  dfltuid       attempt to mount as a particular uid
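The 9p.txt hunk above says that available mount tags can be read from the mount_tag files under /sys/bus/virtio/drivers/9pnet_virtio/. A minimal sketch of that lookup follows. The function name list_9p_mount_tags and the sysfs_root parameter are hypothetical, introduced only for illustration, and the stripping of trailing NUL or newline bytes is a defensive assumption rather than something the patch states.

```python
import glob
import os


def list_9p_mount_tags(sysfs_root="/sys/bus/virtio/drivers/9pnet_virtio"):
    """Return {virtio device name: mount tag} for each 9P export found.

    sysfs_root is a parameter only so the sketch can be pointed at a
    test directory; the default is the path given in 9p.txt.
    """
    tags = {}
    pattern = os.path.join(sysfs_root, "virtio*", "mount_tag")
    for path in sorted(glob.glob(pattern)):
        device = os.path.basename(os.path.dirname(path))
        with open(path, "rb") as tag_file:
            raw = tag_file.read()
        # Assumption: the tag may carry trailing NUL or newline bytes.
        tags[device] = raw.rstrip(b"\x00\n").decode("utf-8", errors="replace")
    return tags


if __name__ == "__main__":
    for device, tag in list_9p_mount_tags().items():
        print(f"{device}: mount -t 9p -o trans=virtio {tag} /mnt/9")
```

Each tag found this way is the <mount_tag> argument of the virtio mount command shown in the hunk.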
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt
new file mode 100644
index 000000000000..0660c9f5deef
--- /dev/null
+++ b/Documentation/filesystems/ceph.txt
@@ -0,0 +1,140 @@
+Ceph Distributed File System
+============================
+Ceph is a distributed network file system designed to provide good
+performance, reliability, and scalability.
+Basic features include:
+ * POSIX semantics
+ * Seamless scaling from 1 to many thousands of nodes
+ * High availability and reliability.  No single point of failure.
+ * N-way replication of data across storage nodes
+ * Fast recovery from node failures
+ * Automatic rebalancing of data on node addition/removal
+ * Easy deployment: most FS components are userspace daemons
+Also,
+ * Flexible snapshots (on any directory)
+ * Recursive accounting (nested files, directories, bytes)
+In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
+on symmetric access by all clients to shared block devices, Ceph
+separates data and metadata management into independent server
+clusters, similar to Lustre.  Unlike Lustre, however, metadata and
+storage nodes run entirely as user space daemons.  Storage nodes
+utilize btrfs to store data objects, leveraging its advanced features
+(checksumming, metadata replication, etc.).  File data is striped
+across storage nodes in large chunks to distribute workload and
+facilitate high throughputs.  When storage nodes fail, data is
+re-replicated in a distributed fashion by the storage nodes themselves
+(with some minimal coordination from a cluster monitor), making the
+system extremely efficient and scalable.
+Metadata servers effectively form a large, consistent, distributed
+in-memory cache above the file namespace that is extremely scalable,
+dynamically redistributes metadata in response to workload changes,
+and can tolerate arbitrary (well, non-Byzantine) node failures.  The
+metadata server takes a somewhat unconventional approach to metadata
+storage to significantly improve performance for common workloads.  In
+particular, inodes with only a single link are embedded in
+directories, allowing entire directories of dentries and inodes to be
+loaded into its cache with a single I/O operation.  The contents of
+extremely large directories can be fragmented and managed by
+independent metadata servers, allowing scalable concurrent access.
+The system offers automatic data rebalancing/migration when scaling
+from a small cluster of just a few nodes to many hundreds, without
+requiring an administrator to carve the data set into static volumes or
+go through the tedious process of migrating data between servers.
+When the file system approaches full, new nodes can be easily added
+and things will "just work."
+Ceph includes a flexible snapshot mechanism that allows a user to create
+a snapshot on any subdirectory (and its nested contents) in the
+system.  Snapshot creation and deletion are as simple as 'mkdir
+.snap/foo' and 'rmdir .snap/foo'.
+Ceph also provides some recursive accounting on directories for nested
+files and bytes.  That is, a 'getfattr -d foo' on any directory in the
+system will reveal the total number of nested regular files and
+subdirectories, and a summation of all nested file sizes.  This makes
+the identification of large disk space consumers relatively quick, as
+no 'du' or similar recursive scan of the file system is required.
+Mount Syntax
+============
+The basic mount syntax is:
+ # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
+You only need to specify a single monitor, as the client will get the
+full list when it connects.  (However, if the monitor you specify
+happens to be down, the mount won't succeed.)  The port can be left
+off if the monitor is using the default.  So if the monitor is at
+1.2.3.4,
+ # mount -t ceph 1.2.3.4:/ /mnt/ceph
+is sufficient.  If /sbin/mount.ceph is installed, a hostname can be
+used instead of an IP address.
+Mount Options
+=============
+  ip=A.B.C.D[:N]
+        Specify the IP and/or port the client should bind to locally.
+        There is normally not much reason to do this.  If the IP is not
+        specified, the client's IP address is determined by looking at the
+        address its connection to the monitor originates from.
+  wsize=X
+        Specify the maximum write size in bytes.  By default there is no
+        maximum.  Ceph will normally size writes based on the file stripe
+        size.
+  rsize=X
+        Specify the maximum readahead.
+  mount_timeout=X
+        Specify the timeout value for mount (in seconds), in the case
+        of a non-responsive Ceph file system.  The default is 30
+        seconds.
+  rbytes
+        When stat() is called on a directory, set st_size to 'rbytes',
+        the summation of file sizes over all files nested beneath that
+        directory.  This is the default.
+  norbytes
+        When stat() is called on a directory, set st_size to the
+        number of entries in that directory.
+  nocrc
+        Disable CRC32C calculation for data writes.  If set, the storage node
+        must rely on TCP's error correction to detect data corruption
+        in the data payload.
+  noasyncreaddir
+        Disable the client's use of its local cache to satisfy readdir
+        requests.  (This does not change correctness; the client uses
+        cached metadata only when a lease or capability ensures it is
+        valid.)
+More Information
+================
+For more information on Ceph, see the home page at
+        http://ceph.newdream.net/
+The Linux kernel client source tree is available at
+        git://ceph.newdream.net/git/ceph-client.git
+        git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
+and the source for the full system is at
+        git://ceph.newdream.net/git/ceph.git
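The mount syntax in ceph.txt, monip[:port][,monip2[:port]...]:/[subdir], can be assembled mechanically. The sketch below builds only the source argument. The helper name ceph_mount_source and its argument layout are hypothetical, and the port 6789 in the example is just an illustrative value.

```python
def ceph_mount_source(monitors, subdir=""):
    """Build the device string for 'mount -t ceph <source> <mountpoint>'.

    monitors is a list of (ip, port) pairs; use None as the port to
    leave it off, as ceph.txt allows when the monitor uses the default.
    """
    if not monitors:
        raise ValueError("at least one monitor address is required")
    parts = []
    for ip, port in monitors:
        parts.append(ip if port is None else f"{ip}:{port}")
    # The monitor list is followed by ':/' and an optional subdirectory.
    return ",".join(parts) + ":/" + subdir.lstrip("/")


if __name__ == "__main__":
    # Single monitor on the default port, matching the example in ceph.txt.
    print(ceph_mount_source([("1.2.3.4", None)]))
    # Two monitors, one with an explicit (illustrative) port, plus a subdir.
    print(ceph_mount_source([("1.2.3.4", 6789), ("1.2.3.5", None)], "data"))
```

The first call yields the same source string as the single-monitor example in the hunk, 1.2.3.4:/.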
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 3015da0c6b2a..fe09a2cb1858 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -82,11 +82,13 @@ tmpfs has a mount option to set the NUMA memory allocation policy for
 all files in that instance (if CONFIG_NUMA is enabled) - which can be
 adjusted on the fly via 'mount -o remount ...'
-mpol=default             prefers to allocate memory from the local node
+mpol=default             use the process allocation policy
+                         (see set_mempolicy(2))
 mpol=prefer:Node         prefers to allocate memory from the given Node
 mpol=bind:NodeList       allocates memory only from nodes in NodeList
 mpol=interleave          prefers to allocate from each node in turn
 mpol=interleave:NodeList allocates from each node of NodeList in turn
+mpol=local               prefers to allocate memory from the local node
 NodeList format is a comma-separated list of decimal numbers and ranges,
 a range being two hyphen-separated decimal numbers, the smallest and
@@ -134,3 +136,5 @@ Author:
   Christoph Rohland <cr@sap.com>, 1.12.01
 Updated:
   Hugh Dickins, 4 June 2007
+Updated:
+   KOSAKI Motohiro, 16 Mar 2010
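The tmpfs.txt context above describes the NodeList format as a comma-separated list of decimal numbers and hyphen-separated ranges. A minimal parsing sketch follows, assuming each range is inclusive and written low-high. The function name parse_nodelist is hypothetical, and this is not the kernel's own parser.

```python
def parse_nodelist(nodelist):
    """Expand a NodeList string such as '0-3,5,7' into sorted node numbers.

    Assumption: a range 'a-b' is inclusive and requires a <= b.
    """
    nodes = set()
    for item in nodelist.split(","):
        item = item.strip()
        if "-" in item:
            low_text, high_text = item.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"range out of order: {item!r}")
            nodes.update(range(low, high + 1))
        else:
            # int() raises ValueError for empty or non-decimal items.
            nodes.add(int(item))
    return sorted(nodes)


if __name__ == "__main__":
    print(parse_nodelist("0-3,5,7"))
```

A list expanded this way corresponds to the nodes named in an option such as mpol=bind:NodeList or mpol=interleave:NodeList.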