diff options

Diffstat (limited to 'Documentation/filesystems')

 -rw-r--r--  Documentation/filesystems/ext4.txt        | 125
 -rw-r--r--  Documentation/filesystems/gfs2-glocks.txt | 114
 -rw-r--r--  Documentation/filesystems/proc.txt        |  29

 3 files changed, 207 insertions, 61 deletions

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 0c5086db8352..80e193d82e2e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -13,72 +13,93 @@ Mailing list: linux-ext4@vger.kernel.org
 1. Quick usage instructions:
 ===========================
 
-  - Grab updated e2fsprogs from
-    ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
-    This is a patchset on top of e2fsprogs-1.39, which can be found at
+  - Compile and install the latest version of e2fsprogs (as of this
+    writing version 1.41) from:
+
+    http://sourceforge.net/project/showfiles.php?group_id=2406
+
+    or
+
     ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
 
-  - It's still mke2fs -j /dev/hda1
+    or grab the latest git repository from:
+
+    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+
+  - Create a new filesystem using the ext4dev filesystem type:
+
+    # mke2fs -t ext4dev /dev/hda1
+
+    Or configure an existing ext3 filesystem to support extents and set
+    the test_fs flag to indicate that it's ok for an in-development
+    filesystem to touch this filesystem:
 
-  - mount /dev/hda1 /wherever -t ext4dev
+    # tune2fs -O extents -E test_fs /dev/hda1
 
-  - To enable extents,
+    If the filesystem was created with 128 byte inodes, it can be
+    converted to use 256 byte inodes for greater efficiency via:
 
-	mount /dev/hda1 /wherever -t ext4dev -o extents
+    # tune2fs -I 256 /dev/hda1
 
-  - The filesystem is compatible with the ext3 driver until you add a file
-    which has extents (ie: `mount -o extents', then create a file).
+    (Note: we currently do not have tools to convert an ext4dev
+    filesystem back to ext3, so please do not try this on production
+    filesystems.)
 
-    NOTE: The "extents" mount flag is temporary.  It will soon go away and
-    extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
+  - Mounting:
+
+    # mount -t ext4dev /dev/hda1 /wherever
 
   - When comparing performance with other filesystems, remember that
-    ext3/4 by default offers higher data integrity guarantees than most. So
-    when comparing with a metadata-only journalling filesystem, use `mount -o
-    data=writeback'. And you might as well use `mount -o nobh' too along
-    with it. Making the journal larger than the mke2fs default often helps
-    performance with metadata-intensive workloads.
+    ext3/4 by default offers higher data integrity guarantees than most.
+    So when comparing with a metadata-only journalling filesystem, such
+    as ext3, use `mount -o data=writeback'.  And you might as well use
+    `mount -o nobh' too along with it.  Making the journal larger than
+    the mke2fs default often helps performance with metadata-intensive
+    workloads.
 
 2. Features
 ===========
 
 2.1 Currently available
 
-* ability to use filesystems > 16TB
+* ability to use filesystems > 16TB (e2fsprogs support not available yet)
 * extent format reduces metadata overhead (RAM, IO for access, transactions)
 * extent format more robust in face of on-disk corruption due to magics,
 * internal redundancy in tree
-
-2.1 Previously available, soon to be enabled by default by "mkefs.ext4":
-
-* dir_index and resize inode will be on by default
-* large inodes will be used by default for fast EAs, nsec timestamps, etc
+* improved file allocation (multi-block alloc)
+* fix 32000 subdirectory limit
+* nsec timestamps for mtime, atime, ctime, create time
+* inode version field on disk (NFSv4, Lustre)
+* reduced e2fsck time via uninit_bg feature
+* journal checksumming for robustness, performance
+* persistent file preallocation (e.g. for streaming media, databases)
+* ability to pack bitmaps and inode tables into larger virtual groups via the
+  flex_bg feature
+* large file support
+* inode allocation using large virtual block groups via flex_bg
+* delayed allocation
+* large block (up to pagesize) support
+* efficient new ordered mode in JBD2 and ext4 (avoids using buffer heads to
+  force the ordering)
 
 2.2 Candidate features for future inclusion
 
-There are several under discussion, whether they all make it in is
-partly a function of how much time everyone has to work on them:
+* online defrag (patches available but not well tested)
+* reduced mke2fs time via lazy itable initialization in conjunction with
+  the uninit_bg feature (the capability to do this is available in
+  e2fsprogs, but a kernel thread to do lazy zeroing of unused inode table
+  blocks after the filesystem is first mounted is required for safety)
 
-* improved file allocation (multi-block alloc, delayed alloc; basically done)
-* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
-* nsec timestamps for mtime, atime, ctime, create time (patch exists,
-  needs some e2fsck work)
-* inode version field on disk (NFSv4, Lustre; prototype exists)
-* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
-* journal checksumming for robustness, performance (prototype exists)
-* persistent file preallocation (e.g for streaming media, databases)
+There are several others under discussion; whether they all make it in is
+partly a function of how much time everyone has to work on them.  Features
+like metadata checksumming have been discussed and planned for a bit, but
+no patches exist yet, so I'm not sure they're in the near-term roadmap.
 
-Features like metadata checksumming have been discussed and planned for
-a bit but no patches exist yet so I'm not sure they're in the near-term
-roadmap.
+The big performance win will come with mballoc, delalloc and flex_bg
+grouping of bitmaps and inode tables.  Some test results are available here:
 
-The big performance win will come with mballoc and delalloc.  CFS has
-been using mballoc for a few years already with Lustre, and IBM + Bull
-did a lot of benchmarking on it.  The reason it isn't in the first set of
-patches is partly a manageability issue, and partly because it doesn't
-directly affect the on-disk format (outside of much better allocation)
-so it isn't critical to get into the first round of changes.  I believe
-Alex is working on a new set of patches right now.
+ - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
+ - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html
 
 3. Options
 ==========
@@ -222,9 +243,11 @@ stripe=n	Number of filesystem blocks that mballoc will try
 		to use for allocation size and alignment. For RAID5/6
 		systems this should be the number of data
 		disks * RAID chunk size in file system blocks.
-
+delalloc (*)	Defer block allocation until write-out time.
+nodelalloc	Disable delayed allocation. Blocks are allocated
+		when data is copied from user space to the page cache.
 Data Mode
----------
+=========
 There are 3 different data modes:
 
 * writeback mode
@@ -236,10 +259,10 @@ typically provide the best ext4 performance.
 
 * ordered mode
 In data=ordered mode, ext4 only officially journals metadata, but it logically
-groups metadata and data blocks into a single unit called a transaction.  When
-it's time to write the new metadata out to disk, the associated data blocks
-are written first.  In general, this mode performs slightly slower than
-writeback but significantly faster than journal mode.
+groups metadata information related to data changes with the data blocks into
+a single unit called a transaction.  When it's time to write the new metadata
+out to disk, the associated data blocks are written first.  In general, this
+mode performs slightly slower than writeback but significantly faster than journal mode.
 
 * journal mode
 data=journal mode provides full data and metadata journaling.  All new data is
@@ -247,7 +270,8 @@ written to the journal first, and then to its final location.
 In the event of a crash, the journal can be replayed, bringing both data and
 metadata into a consistent state.  This mode is the slowest except when data
 needs to be read from and written to disk at the same time where it
-outperforms all others modes.
+outperforms all other modes.  Currently ext4 does not have delayed
+allocation support if this data journalling mode is selected.
 
 References
 ==========
@@ -256,7 +280,8 @@ kernel source: <file:fs/ext4/>
 	       <file:fs/jbd2/>
 
 programs: 	http://e2fsprogs.sourceforge.net/
-		http://ext2resize.sourceforge.net
 
 useful links:	http://fedoraproject.org/wiki/ext3-devel
 		http://www.bullopensource.org/ext4/
+		http://ext4.wiki.kernel.org/index.php/Main_Page
+		http://fedoraproject.org/wiki/Features/Ext4
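The stripe=n formula in the ext4 options table above is easy to get wrong, so here is a small shell sketch of the arithmetic. The RAID geometry, device name and mount point are hypothetical, chosen only for illustration:

```shell
# Hypothetical geometry: RAID5 over 5 disks (4 data + 1 parity),
# 64KiB RAID chunk, 4KiB filesystem blocks.
data_disks=4
chunk_bytes=$((64 * 1024))
block_bytes=$((4 * 1024))

# stripe = number of data disks * RAID chunk size, expressed in fs blocks
stripe=$(( data_disks * chunk_bytes / block_bytes ))
echo "$stripe"    # 64

# Illustrative mount using the computed value (device/mountpoint made up):
# mount -t ext4dev -o stripe=$stripe /dev/md0 /mnt
```

For RAID6 two disks hold parity, so an 8-disk RAID6 array with the same chunk size would use data_disks=6.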
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt
new file mode 100644
index 000000000000..4dae9a3840bf
--- /dev/null
+++ b/Documentation/filesystems/gfs2-glocks.txt
@@ -0,0 +1,114 @@
+                  Glock internal locking rules
+                 ------------------------------
+
+This documents the basic principles of the glock state machine
+internals.  Each glock (struct gfs2_glock in fs/gfs2/incore.h)
+has two main (internal) locks:
+
+ 1. A spinlock (gl_spin) which protects the internal state such
+    as gl_state, gl_target and the list of holders (gl_holders)
+ 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
+    threads from making calls to the DLM, etc. at the same time.  If a
+    thread takes this lock, it must then call run_queue (usually via the
+    workqueue) when it releases it in order to ensure any pending tasks
+    are completed.
+
+The gl_holders list contains all the queued lock requests (not
+just the holders) associated with the glock.  If there are any
+held locks, then they will be contiguous entries at the head
+of the list.  Locks are granted in strictly the order that they
+are queued, except for those marked LM_FLAG_PRIORITY, which are
+used only during recovery, and even then only for journal locks.
+
+There are three lock states that users of the glock layer can request,
+namely shared (SH), deferred (DF) and exclusive (EX).  Those translate
+to the following DLM lock modes:
+
+Glock mode | DLM lock mode
+------------------------------
+    UN     | IV/NL  Unlocked (no DLM lock associated with glock) or NL
+    SH     | PR     (Protected read)
+    DF     | CW     (Concurrent write)
+    EX     | EX     (Exclusive)
+
+Thus DF is basically a shared mode which is incompatible with the "normal"
+shared lock mode, SH.  In GFS2 the DF mode is used exclusively for direct I/O
+operations.  The glocks are basically a lock plus some routines which deal
+with cache management.  The following rules apply for the cache:
+
+Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
+--------------------------------------------------------------------------
+    UN     |     No     |       No       |     No     |       No
+    SH     |     Yes    |       Yes      |     No     |       No
+    DF     |     No     |       Yes      |     No     |       No
+    EX     |     Yes    |       Yes      |     Yes    |       Yes
+
+These rules are implemented using the various glock operations which
+are defined for each type of glock.  Not all types of glocks use
+all the modes.  Only inode glocks use the DF mode, for example.
+
+Table of glock operations and per type constants:
+
+Field            | Purpose
+----------------------------------------------------------------------------
+go_xmote_th      | Called before remote state change (e.g. to sync dirty data)
+go_xmote_bh      | Called after remote state change (e.g. to refill cache)
+go_inval         | Called if remote state change requires invalidating the cache
+go_demote_ok     | Returns boolean value of whether it's ok to demote a glock
+                 | (e.g. checks timeout, and that there is no cached data)
+go_lock          | Called for the first local holder of a lock
+go_unlock        | Called on the final local unlock of a lock
+go_dump          | Called to print content of object for debugfs file, or on
+                 | error to dump glock to the log.
+go_type          | The type of the glock, LM_TYPE_.....
+go_min_hold_time | The minimum hold time
+
+The minimum hold time for each lock is the time after a remote lock
+grant for which we ignore remote demote requests.  This is in order to
+prevent a situation where locks are being bounced around the cluster
+from node to node with none of the nodes making any progress.  This
+tends to show up most with shared mmaped files which are being written
+to by multiple nodes.  By delaying the demotion in response to a
+remote callback, that gives the userspace program time to make
+some progress before the pages are unmapped.
+
+There is a plan to try and remove the go_lock and go_unlock callbacks
+if possible, in order to try and speed up the fast path through the locking.
+Also, eventually we hope to make the glock "EX" mode locally shared
+such that any local locking will be done with the i_mutex as required
+rather than via the glock.
+
+Locking rules for glock operations:
+
+Operation    | GLF_LOCK bit lock held | gl_spin spinlock held
+-----------------------------------------------------------------
+go_xmote_th  |       Yes              |       No
+go_xmote_bh  |       Yes              |       No
+go_inval     |       Yes              |       No
+go_demote_ok |       Sometimes        |       Yes
+go_lock      |       Yes              |       No
+go_unlock    |       Yes              |       No
+go_dump      |       Sometimes        |       Yes
+
+N.B. Operations must not drop either the bit lock or the spinlock
+if they are held on entry.  go_dump and go_demote_ok must never block.
+Note that go_dump will only be called if the glock's state
+indicates that it is caching uptodate data.
+
+Glock locking order within GFS2:
+
+ 1. i_mutex (if required)
+ 2. Rename glock (for rename only)
+ 3. Inode glock(s)
+    (Parents before children, inodes at "same level" with same parent in
+     lock number order)
+ 4. Rgrp glock(s) (for (de)allocation operations)
+ 5. Transaction glock (via gfs2_trans_begin) for non-read operations
+ 6. Page lock (always last, very important!)
+
+There are two glocks per inode.  One deals with access to the inode
+itself (locking order as above), and the other, known as the iopen
+glock, is used in conjunction with the i_nlink field in the inode to
+determine the lifetime of the inode in question.  Locking of inodes
+is on a per-inode basis.  Locking of rgrps is on a per-rgrp basis.
+
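As a cross-check of the two tables in the new gfs2-glocks.txt above, here is a small shell sketch that mirrors the glock-to-DLM mode mapping and the cache rules. This is purely a documentation aid for checking the tables against each other, not kernel code:

```shell
# Mirror of the "Glock mode | DLM lock mode" table (documentation aid only).
glock_to_dlm() {
    case "$1" in
        UN) echo "NL" ;;   # or no DLM lock at all (IV)
        SH) echo "PR" ;;   # protected read
        DF) echo "CW" ;;   # concurrent write
        EX) echo "EX" ;;   # exclusive
        *)  echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Mirror of the cache rules table; output order is:
# "cache-data cache-metadata dirty-data dirty-metadata"
cache_rules() {
    case "$1" in
        UN) echo "no no no no" ;;
        SH) echo "yes yes no no" ;;
        DF) echo "no yes no no" ;;
        EX) echo "yes yes yes yes" ;;
        *)  echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

glock_to_dlm DF    # prints: CW
cache_rules DF     # prints: no yes no no
```

This makes the DF-vs-SH distinction easy to see: both are shared modes, but DF (CW) caches no data, which is why GFS2 uses it only for direct I/O.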
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dbc3c6a3650f..7f268f327d75 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.
 Of some interest is the introduction of the /proc/irq directory to 2.4.
 It could be used to set IRQ to CPU affinity, this means that you can "hook" an
 IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
-irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask
+irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
+prof_cpu_mask.
 
 For example
 > ls /proc/irq/
 0  10  12  14  16  18  2  4  6  8  prof_cpu_mask
-1  11  13  15  17  19  3  5  7  9
+1  11  13  15  17  19  3  5  7  9  default_smp_affinity
 > ls /proc/irq/0/
 smp_affinity
 
-The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ
-is the same by default:
+smp_affinity is a bitmask in which you can specify which CPUs can handle the
+IRQ; you can set it by doing:
 
-> cat /proc/irq/0/smp_affinity
-ffffffff
+> echo 1 > /proc/irq/10/smp_affinity
+
+This means that only the first CPU will handle the IRQ, but you can also echo
+5, which means that only the first and third CPU can handle the IRQ.
 
-It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can
-set it by doing:
+The contents of each smp_affinity file is the same by default:
+
+> cat /proc/irq/0/smp_affinity
+ffffffff
 
-> echo 1 > /proc/irq/prof_cpu_mask
+The default_smp_affinity mask applies to all non-active IRQs, which are the
+IRQs which have not yet been allocated/activated, and hence which lack a
+/proc/irq/[0-9]* directory.
 
-This means that only the first CPU will handle the IRQ, but you can also echo 5
-which means that only the first and fourth CPU can handle the IRQ.
+prof_cpu_mask specifies which CPUs are to be profiled by the system-wide
+profiler.  The default value is ffffffff (all CPUs).
 
 The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
 between all the CPUs which are allowed to handle it. As usual the kernel has
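The smp_affinity bitmask arithmetic described in the proc.txt hunk above can be sketched in shell. The CPU list and IRQ number below are made up for illustration, and the final write needs root, so it is left commented out:

```shell
# Build an smp_affinity mask for CPUs 0 and 2 (binary 101 -> hex 5).
mask=0
for cpu in 0 2; do
    mask=$(( mask | (1 << cpu) ))
done

hexmask=$(printf '%x' "$mask")
echo "$hexmask"    # prints: 5

# Apply it (requires root; IRQ 10 is just an example):
# echo "$hexmask" > /proc/irq/10/smp_affinity
```

Reading the value back with `cat /proc/irq/10/smp_affinity` should then show the same mask, padded to the kernel's cpumask width.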
