diff options

Diffstat (limited to 'Documentation/filesystems')

 -rw-r--r--  Documentation/filesystems/ext4.txt        | 125
 -rw-r--r--  Documentation/filesystems/gfs2-glocks.txt | 114
 -rw-r--r--  Documentation/filesystems/proc.txt        |  29

 3 files changed, 207 insertions, 61 deletions

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 0c5086db8352..80e193d82e2e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -13,72 +13,93 @@ Mailing list: linux-ext4@vger.kernel.org
 1. Quick usage instructions:
 ===========================
 
-  - Grab updated e2fsprogs from
-    ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
-    This is a patchset on top of e2fsprogs-1.39, which can be found at
+  - Compile and install the latest version of e2fsprogs (as of this
+    writing version 1.41) from:
+
+    http://sourceforge.net/project/showfiles.php?group_id=2406
+
+    or
+
     ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
 
-  - It's still mke2fs -j /dev/hda1
+    or grab the latest git repository from:
+
+    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+
+  - Create a new filesystem using the ext4dev filesystem type:
+
+    # mke2fs -t ext4dev /dev/hda1
+
+    Or configure an existing ext3 filesystem to support extents and set
+    the test_fs flag to indicate that it's ok for an in-development
+    filesystem to touch this filesystem:
 
-  - mount /dev/hda1 /wherever -t ext4dev
+    # tune2fs -O extents -E test_fs /dev/hda1
 
-  - To enable extents,
+    If the filesystem was created with 128 byte inodes, it can be
+    converted to use 256 byte inodes for greater efficiency via:
 
-	mount /dev/hda1 /wherever -t ext4dev -o extents
+    # tune2fs -I 256 /dev/hda1
 
-  - The filesystem is compatible with the ext3 driver until you add a file
-    which has extents (ie: `mount -o extents', then create a file).
+    (Note: we currently do not have tools to convert an ext4dev
+    filesystem back to ext3, so please do not try this on production
+    filesystems.)
 
-    NOTE: The "extents" mount flag is temporary.  It will soon go away and
-    extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
+  - Mounting:
+
+    # mount -t ext4dev /dev/hda1 /wherever
 
   - When comparing performance with other filesystems, remember that
-    ext3/4 by default offers higher data integrity guarantees than most. So
-    when comparing with a metadata-only journalling filesystem, use `mount -o
-    data=writeback'. And you might as well use `mount -o nobh' too along
-    with it. Making the journal larger than the mke2fs default often helps
-    performance with metadata-intensive workloads.
+    ext3/4 by default offers higher data integrity guarantees than most.
+    So when comparing with a metadata-only journalling filesystem, such
+    as ext3, use `mount -o data=writeback'.  And you might as well use
+    `mount -o nobh' too along with it.  Making the journal larger than
+    the mke2fs default often helps performance with metadata-intensive
+    workloads.
 
 2. Features
 ===========
 
 2.1 Currently available
 
-* ability to use filesystems > 16TB
+* ability to use filesystems > 16TB (e2fsprogs support not available yet)
 * extent format reduces metadata overhead (RAM, IO for access, transactions)
 * extent format more robust in face of on-disk corruption due to magics,
 * internal redundancy in tree
-
-2.1 Previously available, soon to be enabled by default by "mkefs.ext4":
-
-* dir_index and resize inode will be on by default
-* large inodes will be used by default for fast EAs, nsec timestamps, etc
+* improved file allocation (multi-block alloc)
+* fix 32000 subdirectory limit
+* nsec timestamps for mtime, atime, ctime, create time
+* inode version field on disk (NFSv4, Lustre)
+* reduced e2fsck time via uninit_bg feature
+* journal checksumming for robustness, performance
+* persistent file preallocation (e.g. for streaming media, databases)
+* ability to pack bitmaps and inode tables into larger virtual groups via the
+  flex_bg feature
+* large file support
+* inode allocation using large virtual block groups via flex_bg
+* delayed allocation
+* large block (up to pagesize) support
+* efficient new ordered mode in JBD2 and ext4 (avoids using buffer heads to
+  force the ordering)
 
 2.2 Candidate features for future inclusion
 
-There are several under discussion, whether they all make it in is
-partly a function of how much time everyone has to work on them:
+* online defrag (patches available but not well tested)
+* reduced mke2fs time via lazy itable initialization in conjunction with
+  the uninit_bg feature (the capability to do this is available in
+  e2fsprogs, but a kernel thread to do lazy zeroing of unused inode table
+  blocks after the filesystem is first mounted is required for safety)
 
-* improved file allocation (multi-block alloc, delayed alloc; basically done)
-* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
-* nsec timestamps for mtime, atime, ctime, create time (patch exists,
-  needs some e2fsck work)
-* inode version field on disk (NFSv4, Lustre; prototype exists)
-* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
-* journal checksumming for robustness, performance (prototype exists)
-* persistent file preallocation (e.g for streaming media, databases)
+There are several others under discussion; whether they all make it in is
+partly a function of how much time everyone has to work on them.  Features
+like metadata checksumming have been discussed and planned for a bit, but
+no patches exist yet, so I'm not sure they're in the near-term roadmap.
 
-Features like metadata checksumming have been discussed and planned for
-a bit but no patches exist yet so I'm not sure they're in the near-term
-roadmap.
+The big performance win will come with mballoc, delalloc and flex_bg
+grouping of bitmaps and inode tables.  Some test results are available here:
 
-The big performance win will come with mballoc and delalloc.  CFS has
-been using mballoc for a few years already with Lustre, and IBM + Bull
-did a lot of benchmarking on it.  The reason it isn't in the first set of
-patches is partly a manageability issue, and partly because it doesn't
-directly affect the on-disk format (outside of much better allocation)
-so it isn't critical to get into the first round of changes.  I believe
-Alex is working on a new set of patches right now.
+ - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
+ - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html
 
 3. Options
 ==========
@@ -222,9 +243,11 @@ stripe=n	Number of filesystem blocks that mballoc will try
 		to use for allocation size and alignment. For RAID5/6
 		systems this should be the number of data
 		disks * RAID chunk size in file system blocks.
-
+delalloc (*)	Defer block allocation until write-out time.
+nodelalloc	Disable delayed allocation. Blocks are allocated
+		when data is copied from user space to the page cache.
 Data Mode
----------
+=========
 There are 3 different data modes:
 
 * writeback mode
@@ -236,10 +259,10 @@ typically provide the best ext4 performance.
 
 * ordered mode
 In data=ordered mode, ext4 only officially journals metadata, but it logically
-groups metadata and data blocks into a single unit called a transaction.  When
-it's time to write the new metadata out to disk, the associated data blocks
-are written first.  In general, this mode performs slightly slower than
-writeback but significantly faster than journal mode.
+groups metadata information related to data changes with the data blocks into
+a single unit called a transaction.  When it's time to write the new metadata
+out to disk, the associated data blocks are written first.  In general, this
+mode performs slightly slower than writeback but significantly faster than journal mode.
 
 * journal mode
 data=journal mode provides full data and metadata journaling.  All new data is
@@ -247,7 +270,8 @@ written to the journal first, and then to its final location.
 In the event of a crash, the journal can be replayed, bringing both data and
 metadata into a consistent state.  This mode is the slowest except when data
 needs to be read from and written to disk at the same time where it
-outperforms all others modes.
+outperforms all other modes.  Currently ext4 does not have delayed
+allocation support if this data journalling mode is selected.
 
 References
 ==========
@@ -256,7 +280,8 @@ kernel source: <file:fs/ext4/>
 	       <file:fs/jbd2/>
 
 programs: 	http://e2fsprogs.sourceforge.net/
-		http://ext2resize.sourceforge.net
 
 useful links:	http://fedoraproject.org/wiki/ext3-devel
 		http://www.bullopensource.org/ext4/
+		http://ext4.wiki.kernel.org/index.php/Main_Page
+		http://fedoraproject.org/wiki/Features/Ext4
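The stripe=n formula in the ext4 options table above is easy to get wrong, so here is a small shell sketch of the arithmetic. The RAID geometry, device name and mount point are hypothetical, chosen only for illustration:

```shell
# Hypothetical geometry: RAID5 over 5 disks (4 data + 1 parity),
# 64KiB RAID chunk, 4KiB filesystem blocks.
data_disks=4
chunk_bytes=$((64 * 1024))
block_bytes=$((4 * 1024))

# stripe = number of data disks * RAID chunk size, expressed in fs blocks
stripe=$(( data_disks * chunk_bytes / block_bytes ))
echo "$stripe"    # 64

# Illustrative mount using the computed value (device/mountpoint made up):
# mount -t ext4dev -o stripe=$stripe /dev/md0 /mnt
```

For RAID6 two disks hold parity, so an 8-disk RAID6 array with the same chunk size would use data_disks=6.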
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt
new file mode 100644
index 000000000000..4dae9a3840bf
--- /dev/null
+++ b/Documentation/filesystems/gfs2-glocks.txt
@@ -0,0 +1,114 @@
+                  Glock internal locking rules
+                 ------------------------------
+
+This documents the basic principles of the glock state machine
+internals.  Each glock (struct gfs2_glock in fs/gfs2/incore.h)
+has two main (internal) locks:
+
+ 1. A spinlock (gl_spin) which protects the internal state such
+    as gl_state, gl_target and the list of holders (gl_holders)
+ 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
+    threads from making calls to the DLM, etc. at the same time.  If a
+    thread takes this lock, it must then call run_queue (usually via the
+    workqueue) when it releases it in order to ensure any pending tasks
+    are completed.
+
+The gl_holders list contains all the queued lock requests (not
+just the holders) associated with the glock.  If there are any
+held locks, then they will be contiguous entries at the head
+of the list.  Locks are granted in strictly the order that they
+are queued, except for those marked LM_FLAG_PRIORITY, which are
+used only during recovery, and even then only for journal locks.
+
+There are three lock states that users of the glock layer can request,
+namely shared (SH), deferred (DF) and exclusive (EX).  Those translate
+to the following DLM lock modes:
+
+Glock mode | DLM lock mode
+------------------------------
+    UN     | IV/NL  Unlocked (no DLM lock associated with glock) or NL
+    SH     | PR     (Protected read)
+    DF     | CW     (Concurrent write)
+    EX     | EX     (Exclusive)
+
+Thus DF is basically a shared mode which is incompatible with the "normal"
+shared lock mode, SH.  In GFS2 the DF mode is used exclusively for direct I/O
+operations.  The glocks are basically a lock plus some routines which deal
+with cache management.  The following rules apply for the cache:
+
+Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
+--------------------------------------------------------------------------
+    UN     |     No     |       No       |     No     |       No
+    SH     |     Yes    |       Yes      |     No     |       No
+    DF     |     No     |       Yes      |     No     |       No
+    EX     |     Yes    |       Yes      |     Yes    |       Yes
+
+These rules are implemented using the various glock operations which
+are defined for each type of glock.  Not all types of glocks use
+all the modes.  Only inode glocks use the DF mode, for example.
+
+Table of glock operations and per type constants:
+
+Field            | Purpose
+----------------------------------------------------------------------------
+go_xmote_th      | Called before remote state change (e.g. to sync dirty data)
+go_xmote_bh      | Called after remote state change (e.g. to refill cache)
+go_inval         | Called if remote state change requires invalidating the cache
+go_demote_ok     | Returns boolean value of whether it's ok to demote a glock
+                 | (e.g. checks timeout, and that there is no cached data)
+go_lock          | Called for the first local holder of a lock
+go_unlock        | Called on the final local unlock of a lock
+go_dump          | Called to print content of object for debugfs file, or on
+                 | error to dump glock to the log.
+go_type          | The type of the glock, LM_TYPE_.....
+go_min_hold_time | The minimum hold time
+
+The minimum hold time for each lock is the time after a remote lock
+grant for which we ignore remote demote requests.  This is in order to
+prevent a situation where locks are being bounced around the cluster
+from node to node with none of the nodes making any progress.  This
+tends to show up most with shared mmaped files which are being written
+to by multiple nodes.  By delaying the demotion in response to a
+remote callback, that gives the userspace program time to make
+some progress before the pages are unmapped.
+
+There is a plan to try and remove the go_lock and go_unlock callbacks
+if possible, in order to try and speed up the fast path through the locking.
+Also, eventually we hope to make the glock "EX" mode locally shared
+such that any local locking will be done with the i_mutex as required
+rather than via the glock.
+
+Locking rules for glock operations:
+
+Operation    | GLF_LOCK bit lock held | gl_spin spinlock held
+-----------------------------------------------------------------
+go_xmote_th  |       Yes              |       No
+go_xmote_bh  |       Yes              |       No
+go_inval     |       Yes              |       No
+go_demote_ok |       Sometimes        |       Yes
+go_lock      |       Yes              |       No
+go_unlock    |       Yes              |       No
+go_dump      |       Sometimes        |       Yes
+
+N.B. Operations must not drop either the bit lock or the spinlock
+if they are held on entry.  go_dump and go_demote_ok must never block.
+Note that go_dump will only be called if the glock's state
+indicates that it is caching uptodate data.
+
+Glock locking order within GFS2:
+
+ 1. i_mutex (if required)
+ 2. Rename glock (for rename only)
+ 3. Inode glock(s)
+    (Parents before children, inodes at "same level" with same parent in
+     lock number order)
+ 4. Rgrp glock(s) (for (de)allocation operations)
+ 5. Transaction glock (via gfs2_trans_begin) for non-read operations
+ 6. Page lock (always last, very important!)
+
+There are two glocks per inode.  One deals with access to the inode
+itself (locking order as above), and the other, known as the iopen
+glock, is used in conjunction with the i_nlink field in the inode to
+determine the lifetime of the inode in question.  Locking of inodes
+is on a per-inode basis.  Locking of rgrps is on a per-rgrp basis.
+
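As a cross-check of the two tables in the new gfs2-glocks.txt above, here is a small shell sketch that mirrors the glock-to-DLM mode mapping and the cache rules. This is purely a documentation aid for checking the tables against each other, not kernel code:

```shell
# Mirror of the "Glock mode | DLM lock mode" table (documentation aid only).
glock_to_dlm() {
    case "$1" in
        UN) echo "NL" ;;   # or no DLM lock at all (IV)
        SH) echo "PR" ;;   # protected read
        DF) echo "CW" ;;   # concurrent write
        EX) echo "EX" ;;   # exclusive
        *)  echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Mirror of the cache rules table; output order is:
# "cache-data cache-metadata dirty-data dirty-metadata"
cache_rules() {
    case "$1" in
        UN) echo "no no no no" ;;
        SH) echo "yes yes no no" ;;
        DF) echo "no yes no no" ;;
        EX) echo "yes yes yes yes" ;;
        *)  echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

glock_to_dlm DF    # prints: CW
cache_rules DF     # prints: no yes no no
```

This makes the DF-vs-SH distinction easy to see: both are shared modes, but DF (CW) caches no data, which is why GFS2 uses it only for direct I/O.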
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dbc3c6a3650f..7f268f327d75 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.
 Of some interest is the introduction of the /proc/irq directory to 2.4.
 It could be used to set IRQ to CPU affinity, this means that you can "hook" an
 IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
-irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask
+irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
+prof_cpu_mask.
 
 For example
 > ls /proc/irq/
 0  10  12  14  16  18  2  4  6  8  prof_cpu_mask
-1  11  13  15  17  19  3  5  7  9
+1  11  13  15  17  19  3  5  7  9  default_smp_affinity
 > ls /proc/irq/0/
 smp_affinity
 
-The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ
-is the same by default:
+smp_affinity is a bitmask in which you can specify which CPUs can handle the
+IRQ; you can set it by doing:
 
-> cat /proc/irq/0/smp_affinity
-ffffffff
+> echo 1 > /proc/irq/10/smp_affinity
+
+This means that only the first CPU will handle the IRQ, but you can also echo
+5, which means that only the first and third CPU can handle the IRQ.
 
-It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can
-set it by doing:
+The contents of each smp_affinity file is the same by default:
+
+> cat /proc/irq/0/smp_affinity
+ffffffff
 
-> echo 1 > /proc/irq/prof_cpu_mask
+The default_smp_affinity mask applies to all non-active IRQs, which are the
+IRQs which have not yet been allocated/activated, and hence which lack a
+/proc/irq/[0-9]* directory.
 
-This means that only the first CPU will handle the IRQ, but you can also echo 5
-which means that only the first and fourth CPU can handle the IRQ.
+prof_cpu_mask specifies which CPUs are to be profiled by the system-wide
+profiler.  The default value is ffffffff (all CPUs).
 
 The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
 between all the CPUs which are allowed to handle it. As usual the kernel has
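The smp_affinity bitmask arithmetic described in the proc.txt hunk above can be sketched in shell. The CPU list and IRQ number below are made up for illustration, and the final write needs root, so it is left commented out:

```shell
# Build an smp_affinity mask for CPUs 0 and 2 (binary 101 -> hex 5).
mask=0
for cpu in 0 2; do
    mask=$(( mask | (1 << cpu) ))
done

hexmask=$(printf '%x' "$mask")
echo "$hexmask"    # prints: 5

# Apply it (requires root; IRQ 10 is just an example):
# echo "$hexmask" > /proc/irq/10/smp_affinity
```

Reading the value back with `cat /proc/irq/10/smp_affinity` should then show the same mask, padded to the kernel's cpumask width.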
