aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/ext4.txt125
-rw-r--r--Documentation/filesystems/gfs2-glocks.txt114
-rw-r--r--Documentation/filesystems/proc.txt29
3 files changed, 207 insertions, 61 deletions
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 0c5086db8352..80e193d82e2e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -13,72 +13,93 @@ Mailing list: linux-ext4@vger.kernel.org
131. Quick usage instructions: 131. Quick usage instructions:
14=========================== 14===========================
15 15
16 - Grab updated e2fsprogs from 16 - Compile and install the latest version of e2fsprogs (as of this
17 ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/ 17 writing version 1.41) from:
18 This is a patchset on top of e2fsprogs-1.39, which can be found at 18
19 http://sourceforge.net/project/showfiles.php?group_id=2406
20
21 or
22
19 ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ 23 ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
20 24
21 - It's still mke2fs -j /dev/hda1 25 or grab the latest git repository from:
26
27 git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
28
29 - Create a new filesystem using the ext4dev filesystem type:
30
31 # mke2fs -t ext4dev /dev/hda1
32
33 Or configure an existing ext3 filesystem to support extents and set
34 the test_fs flag to indicate that it's ok for an in-development
35 filesystem to touch this filesystem:
22 36
23 - mount /dev/hda1 /wherever -t ext4dev 37 # tune2fs -O extents -E test_fs /dev/hda1
24 38
25 - To enable extents, 39 If the filesystem was created with 128 byte inodes, it can be
40 converted to use 256 byte for greater efficiency via:
26 41
27 mount /dev/hda1 /wherever -t ext4dev -o extents 42 # tune2fs -I 256 /dev/hda1
28 43
29 - The filesystem is compatible with the ext3 driver until you add a file 44 (Note: we currently do not have tools to convert an ext4dev
30 which has extents (ie: `mount -o extents', then create a file). 45 filesystem back to ext3; so please do not do try this on production
46 filesystems.)
31 47
32 NOTE: The "extents" mount flag is temporary. It will soon go away and 48 - Mounting:
33 extents will be enabled by the "-o extents" flag to mke2fs or tune2fs 49
50 # mount -t ext4dev /dev/hda1 /wherever
34 51
35 - When comparing performance with other filesystems, remember that 52 - When comparing performance with other filesystems, remember that
36 ext3/4 by default offers higher data integrity guarantees than most. So 53 ext3/4 by default offers higher data integrity guarantees than most.
37 when comparing with a metadata-only journalling filesystem, use `mount -o 54 So when comparing with a metadata-only journalling filesystem, such
38 data=writeback'. And you might as well use `mount -o nobh' too along 55 as ext3, use `mount -o data=writeback'. And you might as well use
39 with it. Making the journal larger than the mke2fs default often helps 56 `mount -o nobh' too along with it. Making the journal larger than
40 performance with metadata-intensive workloads. 57 the mke2fs default often helps performance with metadata-intensive
58 workloads.
41 59
422. Features 602. Features
43=========== 61===========
44 62
452.1 Currently available 632.1 Currently available
46 64
47* ability to use filesystems > 16TB 65* ability to use filesystems > 16TB (e2fsprogs support not available yet)
48* extent format reduces metadata overhead (RAM, IO for access, transactions) 66* extent format reduces metadata overhead (RAM, IO for access, transactions)
49* extent format more robust in face of on-disk corruption due to magics, 67* extent format more robust in face of on-disk corruption due to magics,
50* internal redunancy in tree 68* internal redunancy in tree
51 69* improved file allocation (multi-block alloc)
522.1 Previously available, soon to be enabled by default by "mkefs.ext4": 70* fix 32000 subdirectory limit
53 71* nsec timestamps for mtime, atime, ctime, create time
54* dir_index and resize inode will be on by default 72* inode version field on disk (NFSv4, Lustre)
55* large inodes will be used by default for fast EAs, nsec timestamps, etc 73* reduced e2fsck time via uninit_bg feature
74* journal checksumming for robustness, performance
75* persistent file preallocation (e.g for streaming media, databases)
76* ability to pack bitmaps and inode tables into larger virtual groups via the
77 flex_bg feature
78* large file support
79* Inode allocation using large virtual block groups via flex_bg
80* delayed allocation
81* large block (up to pagesize) support
82* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
83 the ordering)
56 84
572.2 Candidate features for future inclusion 852.2 Candidate features for future inclusion
58 86
59There are several under discussion, whether they all make it in is 87* Online defrag (patches available but not well tested)
60partly a function of how much time everyone has to work on them: 88* reduced mke2fs time via lazy itable initialization in conjuction with
89 the uninit_bg feature (capability to do this is available in e2fsprogs
90 but a kernel thread to do lazy zeroing of unused inode table blocks
91 after filesystem is first mounted is required for safety)
61 92
62* improved file allocation (multi-block alloc, delayed alloc; basically done) 93There are several others under discussion, whether they all make it in is
63* fix 32000 subdirectory limit (patch exists, needs some e2fsck work) 94partly a function of how much time everyone has to work on them. Features like
64* nsec timestamps for mtime, atime, ctime, create time (patch exists, 95metadata checksumming have been discussed and planned for a bit but no patches
65 needs some e2fsck work) 96exist yet so I'm not sure they're in the near-term roadmap.
66* inode version field on disk (NFSv4, Lustre; prototype exists)
67* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
68* journal checksumming for robustness, performance (prototype exists)
69* persistent file preallocation (e.g for streaming media, databases)
70 97
71Features like metadata checksumming have been discussed and planned for 98The big performance win will come with mballoc, delalloc and flex_bg
72a bit but no patches exist yet so I'm not sure they're in the near-term 99grouping of bitmaps and inode tables. Some test results available here:
73roadmap.
74 100
75The big performance win will come with mballoc and delalloc. CFS has 101 - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
76been using mballoc for a few years already with Lustre, and IBM + Bull 102 - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html
77did a lot of benchmarking on it. The reason it isn't in the first set of
78patches is partly a manageability issue, and partly because it doesn't
79directly affect the on-disk format (outside of much better allocation)
80so it isn't critical to get into the first round of changes. I believe
81Alex is working on a new set of patches right now.
82 103
833. Options 1043. Options
84========== 105==========
@@ -222,9 +243,11 @@ stripe=n Number of filesystem blocks that mballoc will try
222 to use for allocation size and alignment. For RAID5/6 243 to use for allocation size and alignment. For RAID5/6
223 systems this should be the number of data 244 systems this should be the number of data
224 disks * RAID chunk size in file system blocks. 245 disks * RAID chunk size in file system blocks.
225 246delalloc (*) Deferring block allocation until write-out time.
247nodelalloc Disable delayed allocation. Blocks are allocation
248 when data is copied from user to page cache.
226Data Mode 249Data Mode
227--------- 250=========
228There are 3 different data modes: 251There are 3 different data modes:
229 252
230* writeback mode 253* writeback mode
@@ -236,10 +259,10 @@ typically provide the best ext4 performance.
236 259
237* ordered mode 260* ordered mode
238In data=ordered mode, ext4 only officially journals metadata, but it logically 261In data=ordered mode, ext4 only officially journals metadata, but it logically
239groups metadata and data blocks into a single unit called a transaction. When 262groups metadata information related to data changes with the data blocks into a
240it's time to write the new metadata out to disk, the associated data blocks 263single unit called a transaction. When it's time to write the new metadata
241are written first. In general, this mode performs slightly slower than 264out to disk, the associated data blocks are written first. In general,
242writeback but significantly faster than journal mode. 265this mode performs slightly slower than writeback but significantly faster than journal mode.
243 266
244* journal mode 267* journal mode
245data=journal mode provides full data and metadata journaling. All new data is 268data=journal mode provides full data and metadata journaling. All new data is
@@ -247,7 +270,8 @@ written to the journal first, and then to its final location.
247In the event of a crash, the journal can be replayed, bringing both data and 270In the event of a crash, the journal can be replayed, bringing both data and
248metadata into a consistent state. This mode is the slowest except when data 271metadata into a consistent state. This mode is the slowest except when data
249needs to be read from and written to disk at the same time where it 272needs to be read from and written to disk at the same time where it
250outperforms all others modes. 273outperforms all others modes. Curently ext4 does not have delayed
274allocation support if this data journalling mode is selected.
251 275
252References 276References
253========== 277==========
@@ -256,7 +280,8 @@ kernel source: <file:fs/ext4/>
256 <file:fs/jbd2/> 280 <file:fs/jbd2/>
257 281
258programs: http://e2fsprogs.sourceforge.net/ 282programs: http://e2fsprogs.sourceforge.net/
259 http://ext2resize.sourceforge.net
260 283
261useful links: http://fedoraproject.org/wiki/ext3-devel 284useful links: http://fedoraproject.org/wiki/ext3-devel
262 http://www.bullopensource.org/ext4/ 285 http://www.bullopensource.org/ext4/
286 http://ext4.wiki.kernel.org/index.php/Main_Page
287 http://fedoraproject.org/wiki/Features/Ext4
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt
new file mode 100644
index 000000000000..4dae9a3840bf
--- /dev/null
+++ b/Documentation/filesystems/gfs2-glocks.txt
@@ -0,0 +1,114 @@
1 Glock internal locking rules
2 ------------------------------
3
4This documents the basic principles of the glock state machine
5internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
6has two main (internal) locks:
7
8 1. A spinlock (gl_spin) which protects the internal state such
9 as gl_state, gl_target and the list of holders (gl_holders)
10 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
11 threads from making calls to the DLM, etc. at the same time. If a
12 thread takes this lock, it must then call run_queue (usually via the
13 workqueue) when it releases it in order to ensure any pending tasks
14 are completed.
15
16The gl_holders list contains all the queued lock requests (not
17just the holders) associated with the glock. If there are any
18held locks, then they will be contiguous entries at the head
19of the list. Locks are granted in strictly the order that they
20are queued, except for those marked LM_FLAG_PRIORITY which are
21used only during recovery, and even then only for journal locks.
22
23There are three lock states that users of the glock layer can request,
24namely shared (SH), deferred (DF) and exclusive (EX). Those translate
25to the following DLM lock modes:
26
27Glock mode | DLM lock mode
28------------------------------
29 UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
30 SH | PR (Protected read)
31 DF | CW (Concurrent write)
32 EX | EX (Exclusive)
33
34Thus DF is basically a shared mode which is incompatible with the "normal"
35shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
36operations. The glocks are basically a lock plus some routines which deal
37with cache management. The following rules apply for the cache:
38
39Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
40--------------------------------------------------------------------------
41 UN | No | No | No | No
42 SH | Yes | Yes | No | No
43 DF | No | Yes | No | No
44 EX | Yes | Yes | Yes | Yes
45
46These rules are implemented using the various glock operations which
47are defined for each type of glock. Not all types of glocks use
48all the modes. Only inode glocks use the DF mode for example.
49
50Table of glock operations and per type constants:
51
52Field | Purpose
53----------------------------------------------------------------------------
54go_xmote_th | Called before remote state change (e.g. to sync dirty data)
55go_xmote_bh | Called after remote state change (e.g. to refill cache)
56go_inval | Called if remote state change requires invalidating the cache
57go_demote_ok | Returns boolean value of whether its ok to demote a glock
58 | (e.g. checks timeout, and that there is no cached data)
59go_lock | Called for the first local holder of a lock
60go_unlock | Called on the final local unlock of a lock
61go_dump | Called to print content of object for debugfs file, or on
62 | error to dump glock to the log.
63go_type; | The type of the glock, LM_TYPE_.....
64go_min_hold_time | The minimum hold time
65
66The minimum hold time for each lock is the time after a remote lock
67grant for which we ignore remote demote requests. This is in order to
68prevent a situation where locks are being bounced around the cluster
69from node to node with none of the nodes making any progress. This
70tends to show up most with shared mmaped files which are being written
71to by multiple nodes. By delaying the demotion in response to a
72remote callback, that gives the userspace program time to make
73some progress before the pages are unmapped.
74
75There is a plan to try and remove the go_lock and go_unlock callbacks
76if possible, in order to try and speed up the fast path though the locking.
77Also, eventually we hope to make the glock "EX" mode locally shared
78such that any local locking will be done with the i_mutex as required
79rather than via the glock.
80
81Locking rules for glock operations:
82
83Operation | GLF_LOCK bit lock held | gl_spin spinlock held
84-----------------------------------------------------------------
85go_xmote_th | Yes | No
86go_xmote_bh | Yes | No
87go_inval | Yes | No
88go_demote_ok | Sometimes | Yes
89go_lock | Yes | No
90go_unlock | Yes | No
91go_dump | Sometimes | Yes
92
93N.B. Operations must not drop either the bit lock or the spinlock
94if its held on entry. go_dump and do_demote_ok must never block.
95Note that go_dump will only be called if the glock's state
96indicates that it is caching uptodate data.
97
98Glock locking order within GFS2:
99
100 1. i_mutex (if required)
101 2. Rename glock (for rename only)
102 3. Inode glock(s)
103 (Parents before children, inodes at "same level" with same parent in
104 lock number order)
105 4. Rgrp glock(s) (for (de)allocation operations)
106 5. Transaction glock (via gfs2_trans_begin) for non-read operations
107 6. Page lock (always last, very important!)
108
109There are two glocks per inode. One deals with access to the inode
110itself (locking order as above), and the other, known as the iopen
111glock is used in conjunction with the i_nlink field in the inode to
112determine the lifetime of the inode in question. Locking of inodes
113is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
114
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dbc3c6a3650f..7f268f327d75 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.
380Of some interest is the introduction of the /proc/irq directory to 2.4. 380Of some interest is the introduction of the /proc/irq directory to 2.4.
381It could be used to set IRQ to CPU affinity, this means that you can "hook" an 381It could be used to set IRQ to CPU affinity, this means that you can "hook" an
382IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the 382IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
383irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask 383irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
384prof_cpu_mask.
384 385
385For example 386For example
386 > ls /proc/irq/ 387 > ls /proc/irq/
387 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask 388 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
388 1 11 13 15 17 19 3 5 7 9 389 1 11 13 15 17 19 3 5 7 9 default_smp_affinity
389 > ls /proc/irq/0/ 390 > ls /proc/irq/0/
390 smp_affinity 391 smp_affinity
391 392
392The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ 393smp_affinity is a bitmask, in which you can specify which CPUs can handle the
393is the same by default: 394IRQ, you can set it by doing:
394 395
395 > cat /proc/irq/0/smp_affinity 396 > echo 1 > /proc/irq/10/smp_affinity
396 ffffffff 397
398This means that only the first CPU will handle the IRQ, but you can also echo
3995 which means that only the first and fourth CPU can handle the IRQ.
397 400
398It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can 401The contents of each smp_affinity file is the same by default:
399set it by doing: 402
403 > cat /proc/irq/0/smp_affinity
404 ffffffff
400 405
401 > echo 1 > /proc/irq/prof_cpu_mask 406The default_smp_affinity mask applies to all non-active IRQs, which are the
407IRQs which have not yet been allocated/activated, and hence which lack a
408/proc/irq/[0-9]* directory.
402 409
403This means that only the first CPU will handle the IRQ, but you can also echo 5 410prof_cpu_mask specifies which CPUs are to be profiled by the system wide
404which means that only the first and fourth CPU can handle the IRQ. 411profiler. Default value is ffffffff (all cpus).
405 412
406The way IRQs are routed is handled by the IO-APIC, and it's Round Robin 413The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
407between all the CPUs which are allowed to handle it. As usual the kernel has 414between all the CPUs which are allowed to handle it. As usual the kernel has