diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/dentry-locking.txt | 173 | ||||
-rw-r--r-- | Documentation/filesystems/devfs/README | 5 | ||||
-rw-r--r-- | Documentation/filesystems/ext2.txt | 2 | ||||
-rw-r--r-- | Documentation/filesystems/ramfs-rootfs-initramfs.txt | 195 | ||||
-rw-r--r-- | Documentation/filesystems/vfs.txt | 434 | ||||
-rw-r--r-- | Documentation/filesystems/xfs.txt | 142 |
6 files changed, 624 insertions, 327 deletions
diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt new file mode 100644 index 000000000000..4c0c575a4012 --- /dev/null +++ b/Documentation/filesystems/dentry-locking.txt | |||
@@ -0,0 +1,173 @@ | |||
1 | RCU-based dcache locking model | ||
2 | ============================== | ||
3 | |||
4 | On many workloads, the most common operation on dcache is to look up a | ||
5 | dentry, given a parent dentry and the name of the child. Typically, | ||
6 | for every open(), stat() etc., the dentry corresponding to the | ||
7 | pathname will be looked up by walking the tree starting with the first | ||
8 | component of the pathname and using that dentry along with the next | ||
9 | component to look up the next level and so on. Since it is a frequent | ||
10 | operation for workloads like multiuser environments and web servers, | ||
11 | it is important to optimize this path. | ||
12 | |||
13 | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in | ||
14 | every component during path look-up. Since 2.5.10 onwards, fast-walk | ||
15 | algorithm changed this by holding the dcache_lock at the beginning and | ||
16 | walking as many cached path component dentries as possible. This | ||
17 | significantly decreases the number of acquisition of | ||
18 | dcache_lock. However it also increases the lock hold time | ||
19 | significantly and affects performance in large SMP machines. Since | ||
20 | 2.5.62 kernel, dcache has been using a new locking model that uses RCU | ||
21 | to make dcache look-up lock-free. | ||
22 | |||
23 | The current dcache locking model is not very different from the | ||
24 | existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock | ||
25 | protected the hash chain, d_child, d_alias, d_lru lists as well as | ||
26 | d_inode and several other things like mount look-up. RCU-based changes | ||
27 | affect only the way the hash chain is protected. For everything else | ||
28 | the dcache_lock must be taken for both traversing as well as | ||
29 | updating. The hash chain updates too take the dcache_lock. The | ||
30 | significant change is the way d_lookup traverses the hash chain, it | ||
31 | doesn't acquire the dcache_lock for this and rely on RCU to ensure | ||
32 | that the dentry has not been *freed*. | ||
33 | |||
34 | |||
35 | Dcache locking details | ||
36 | ====================== | ||
37 | |||
38 | For many multi-user workloads, open() and stat() on files are very | ||
39 | frequently occurring operations. Both involve walking of path names to | ||
40 | find the dentry corresponding to the concerned file. In 2.4 kernel, | ||
41 | dcache_lock was held during look-up of each path component. Contention | ||
42 | and cache-line bouncing of this global lock caused significant | ||
43 | scalability problems. With the introduction of RCU in Linux kernel, | ||
44 | this was worked around by making the look-up of path components during | ||
45 | path walking lock-free. | ||
46 | |||
47 | |||
48 | Safe lock-free look-up of dcache hash table | ||
49 | =========================================== | ||
50 | |||
51 | Dcache is a complex data structure with the hash table entries also | ||
52 | linked together in other lists. In 2.4 kernel, dcache_lock protected | ||
53 | all the lists. We applied RCU only on hash chain walking. The rest of | ||
54 | the lists are still protected by dcache_lock. Some of the important | ||
55 | changes are : | ||
56 | |||
57 | 1. The deletion from hash chain is done using hlist_del_rcu() macro | ||
58 | which doesn't initialize next pointer of the deleted dentry and | ||
59 | this allows us to walk safely lock-free while a deletion is | ||
60 | happening. | ||
61 | |||
62 | 2. Insertion of a dentry into the hash table is done using | ||
63 | hlist_add_head_rcu() which take care of ordering the writes - the | ||
64 | writes to the dentry must be visible before the dentry is | ||
65 | inserted. This works in conjunction with hlist_for_each_rcu() while | ||
66 | walking the hash chain. The only requirement is that all | ||
67 | initialization to the dentry must be done before | ||
68 | hlist_add_head_rcu() since we don't have dcache_lock protection | ||
69 | while traversing the hash chain. This isn't different from the | ||
70 | existing code. | ||
71 | |||
72 | 3. The dentry looked up without holding dcache_lock by cannot be | ||
73 | returned for walking if it is unhashed. It then may have a NULL | ||
74 | d_inode or other bogosity since RCU doesn't protect the other | ||
75 | fields in the dentry. We therefore use a flag DCACHE_UNHASHED to | ||
76 | indicate unhashed dentries and use this in conjunction with a | ||
77 | per-dentry lock (d_lock). Once looked up without the dcache_lock, | ||
78 | we acquire the per-dentry lock (d_lock) and check if the dentry is | ||
79 | unhashed. If so, the look-up is failed. If not, the reference count | ||
80 | of the dentry is increased and the dentry is returned. | ||
81 | |||
82 | 4. Once a dentry is looked up, it must be ensured during the path walk | ||
83 | for that component it doesn't go away. In pre-2.5.10 code, this was | ||
84 | done holding a reference to the dentry. dcache_rcu does the same. | ||
85 | In some sense, dcache_rcu path walking looks like the pre-2.5.10 | ||
86 | version. | ||
87 | |||
88 | 5. All dentry hash chain updates must take the dcache_lock as well as | ||
89 | the per-dentry lock in that order. dput() does this to ensure that | ||
90 | a dentry that has just been looked up in another CPU doesn't get | ||
91 | deleted before dget() can be done on it. | ||
92 | |||
93 | 6. There are several ways to do reference counting of RCU protected | ||
94 | objects. One such example is in ipv4 route cache where deferred | ||
95 | freeing (using call_rcu()) is done as soon as the reference count | ||
96 | goes to zero. This cannot be done in the case of dentries because | ||
97 | tearing down of dentries require blocking (dentry_iput()) which | ||
98 | isn't supported from RCU callbacks. Instead, tearing down of | ||
99 | dentries happen synchronously in dput(), but actual freeing happens | ||
100 | later when RCU grace period is over. This allows safe lock-free | ||
101 | walking of the hash chains, but a matched dentry may have been | ||
102 | partially torn down. The checking of DCACHE_UNHASHED flag with | ||
103 | d_lock held detects such dentries and prevents them from being | ||
104 | returned from look-up. | ||
105 | |||
106 | |||
107 | Maintaining POSIX rename semantics | ||
108 | ================================== | ||
109 | |||
110 | Since look-up of dentries is lock-free, it can race against a | ||
111 | concurrent rename operation. For example, during rename of file A to | ||
112 | B, look-up of either A or B must succeed. So, if look-up of B happens | ||
113 | after A has been removed from the hash chain but not added to the new | ||
114 | hash chain, it may fail. Also, a comparison while the name is being | ||
115 | written concurrently by a rename may result in false positive matches | ||
116 | violating rename semantics. Issues related to race with rename are | ||
117 | handled as described below : | ||
118 | |||
119 | 1. Look-up can be done in two ways - d_lookup() which is safe from | ||
120 | simultaneous renames and __d_lookup() which is not. If | ||
121 | __d_lookup() fails, it must be followed up by a d_lookup() to | ||
122 | correctly determine whether a dentry is in the hash table or | ||
123 | not. d_lookup() protects look-ups using a sequence lock | ||
124 | (rename_lock). | ||
125 | |||
126 | 2. The name associated with a dentry (d_name) may be changed if a | ||
127 | rename is allowed to happen simultaneously. To avoid memcmp() in | ||
128 | __d_lookup() go out of bounds due to a rename and false positive | ||
129 | comparison, the name comparison is done while holding the | ||
130 | per-dentry lock. This prevents concurrent renames during this | ||
131 | operation. | ||
132 | |||
133 | 3. Hash table walking during look-up may move to a different bucket as | ||
134 | the current dentry is moved to a different bucket due to rename. | ||
135 | But we use hlists in dcache hash table and they are | ||
136 | null-terminated. So, even if a dentry moves to a different bucket, | ||
137 | hash chain walk will terminate. [with a list_head list, it may not | ||
138 | since termination is when the list_head in the original bucket is | ||
139 | reached]. Since we redo the d_parent check and compare name while | ||
140 | holding d_lock, lock-free look-up will not race against d_move(). | ||
141 | |||
142 | 4. There can be a theoretical race when a dentry keeps coming back to | ||
143 | original bucket due to double moves. Due to this look-up may | ||
144 | consider that it has never moved and can end up in a infinite loop. | ||
145 | But this is not any worse that theoretical livelocks we already | ||
146 | have in the kernel. | ||
147 | |||
148 | |||
149 | Important guidelines for filesystem developers related to dcache_rcu | ||
150 | ==================================================================== | ||
151 | |||
152 | 1. Existing dcache interfaces (pre-2.5.62) exported to filesystem | ||
153 | don't change. Only dcache internal implementation changes. However | ||
154 | filesystems *must not* delete from the dentry hash chains directly | ||
155 | using the list macros like allowed earlier. They must use dcache | ||
156 | APIs like d_drop() or __d_drop() depending on the situation. | ||
157 | |||
158 | 2. d_flags is now protected by a per-dentry lock (d_lock). All access | ||
159 | to d_flags must be protected by it. | ||
160 | |||
161 | 3. For a hashed dentry, checking of d_count needs to be protected by | ||
162 | d_lock. | ||
163 | |||
164 | |||
165 | Papers and other documentation on dcache locking | ||
166 | ================================================ | ||
167 | |||
168 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | ||
169 | |||
170 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | ||
171 | |||
172 | |||
173 | |||
diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README index 54366ecc241f..aabfba24bc2e 100644 --- a/Documentation/filesystems/devfs/README +++ b/Documentation/filesystems/devfs/README | |||
@@ -1812,11 +1812,6 @@ it may overflow the messages buffer, but try to get as much of it as | |||
1812 | you can | 1812 | you can |
1813 | 1813 | ||
1814 | 1814 | ||
1815 | if you get an Oops, run ksymoops to decode it so that the | ||
1816 | names of the offending functions are provided. A non-decoded Oops is | ||
1817 | pretty useless | ||
1818 | |||
1819 | |||
1820 | send a copy of your devfsd configuration file(s) | 1815 | send a copy of your devfsd configuration file(s) |
1821 | 1816 | ||
1822 | send the bug report to me first. | 1817 | send the bug report to me first. |
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index d16334ec48ba..a8edb376b041 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt | |||
@@ -17,8 +17,6 @@ set using tune2fs(8). Kernel-determined defaults are indicated by (*). | |||
17 | bsddf (*) Makes `df' act like BSD. | 17 | bsddf (*) Makes `df' act like BSD. |
18 | minixdf Makes `df' act like Minix. | 18 | minixdf Makes `df' act like Minix. |
19 | 19 | ||
20 | check Check block and inode bitmaps at mount time | ||
21 | (requires CONFIG_EXT2_CHECK). | ||
22 | check=none, nocheck (*) Don't do extra checking of bitmaps on mount | 20 | check=none, nocheck (*) Don't do extra checking of bitmaps on mount |
23 | (check=normal and check=strict options removed) | 21 | (check=normal and check=strict options removed) |
24 | 22 | ||
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt new file mode 100644 index 000000000000..b3404a032596 --- /dev/null +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt | |||
@@ -0,0 +1,195 @@ | |||
1 | ramfs, rootfs and initramfs | ||
2 | October 17, 2005 | ||
3 | Rob Landley <rob@landley.net> | ||
4 | ============================= | ||
5 | |||
6 | What is ramfs? | ||
7 | -------------- | ||
8 | |||
9 | Ramfs is a very simple filesystem that exports Linux's disk caching | ||
10 | mechanisms (the page cache and dentry cache) as a dynamically resizable | ||
11 | ram-based filesystem. | ||
12 | |||
13 | Normally all files are cached in memory by Linux. Pages of data read from | ||
14 | backing store (usually the block device the filesystem is mounted on) are kept | ||
15 | around in case it's needed again, but marked as clean (freeable) in case the | ||
16 | Virtual Memory system needs the memory for something else. Similarly, data | ||
17 | written to files is marked clean as soon as it has been written to backing | ||
18 | store, but kept around for caching purposes until the VM reallocates the | ||
19 | memory. A similar mechanism (the dentry cache) greatly speeds up access to | ||
20 | directories. | ||
21 | |||
22 | With ramfs, there is no backing store. Files written into ramfs allocate | ||
23 | dentries and page cache as usual, but there's nowhere to write them to. | ||
24 | This means the pages are never marked clean, so they can't be freed by the | ||
25 | VM when it's looking to recycle memory. | ||
26 | |||
27 | The amount of code required to implement ramfs is tiny, because all the | ||
28 | work is done by the existing Linux caching infrastructure. Basically, | ||
29 | you're mounting the disk cache as a filesystem. Because of this, ramfs is not | ||
30 | an optional component removable via menuconfig, since there would be negligible | ||
31 | space savings. | ||
32 | |||
33 | ramfs and ramdisk: | ||
34 | ------------------ | ||
35 | |||
36 | The older "ram disk" mechanism created a synthetic block device out of | ||
37 | an area of ram and used it as backing store for a filesystem. This block | ||
38 | device was of fixed size, so the filesystem mounted on it was of fixed | ||
39 | size. Using a ram disk also required unnecessarily copying memory from the | ||
40 | fake block device into the page cache (and copying changes back out), as well | ||
41 | as creating and destroying dentries. Plus it needed a filesystem driver | ||
42 | (such as ext2) to format and interpret this data. | ||
43 | |||
44 | Compared to ramfs, this wastes memory (and memory bus bandwidth), creates | ||
45 | unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks | ||
46 | to avoid this copying by playing with the page tables, but they're unpleasantly | ||
47 | complicated and turn out to be about as expensive as the copying anyway.) | ||
48 | More to the point, all the work ramfs is doing has to happen _anyway_, | ||
49 | since all file access goes through the page and dentry caches. The ram | ||
50 | disk is simply unnecessary, ramfs is internally much simpler. | ||
51 | |||
52 | Another reason ramdisks are semi-obsolete is that the introduction of | ||
53 | loopback devices offered a more flexible and convenient way to create | ||
54 | synthetic block devices, now from files instead of from chunks of memory. | ||
55 | See losetup (8) for details. | ||
56 | |||
57 | ramfs and tmpfs: | ||
58 | ---------------- | ||
59 | |||
60 | One downside of ramfs is you can keep writing data into it until you fill | ||
61 | up all memory, and the VM can't free it because the VM thinks that files | ||
62 | should get written to backing store (rather than swap space), but ramfs hasn't | ||
63 | got any backing store. Because of this, only root (or a trusted user) should | ||
64 | be allowed write access to a ramfs mount. | ||
65 | |||
66 | A ramfs derivative called tmpfs was created to add size limits, and the ability | ||
67 | to write the data to swap space. Normal users can be allowed write access to | ||
68 | tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. | ||
69 | |||
70 | What is rootfs? | ||
71 | --------------- | ||
72 | |||
73 | Rootfs is a special instance of ramfs, which is always present in 2.6 systems. | ||
74 | (It's used internally as the starting and stopping point for searches of the | ||
75 | kernel's doubly-linked list of mount points.) | ||
76 | |||
77 | Most systems just mount another filesystem over it and ignore it. The | ||
78 | amount of space an empty instance of ramfs takes up is tiny. | ||
79 | |||
80 | What is initramfs? | ||
81 | ------------------ | ||
82 | |||
83 | All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is | ||
84 | extracted into rootfs when the kernel boots up. After extracting, the kernel | ||
85 | checks to see if rootfs contains a file "init", and if so it executes it as PID | ||
86 | 1. If found, this init process is responsible for bringing the system the | ||
87 | rest of the way up, including locating and mounting the real root device (if | ||
88 | any). If rootfs does not contain an init program after the embedded cpio | ||
89 | archive is extracted into it, the kernel will fall through to the older code | ||
90 | to locate and mount a root partition, then exec some variant of /sbin/init | ||
91 | out of that. | ||
92 | |||
93 | All this differs from the old initrd in several ways: | ||
94 | |||
95 | - The old initrd was a separate file, while the initramfs archive is linked | ||
96 | into the linux kernel image. (The directory linux-*/usr is devoted to | ||
97 | generating this archive during the build.) | ||
98 | |||
99 | - The old initrd file was a gzipped filesystem image (in some file format, | ||
100 | such as ext2, that had to be built into the kernel), while the new | ||
101 | initramfs archive is a gzipped cpio archive (like tar only simpler, | ||
102 | see cpio(1) and Documentation/early-userspace/buffer-format.txt). | ||
103 | |||
104 | - The program run by the old initrd (which was called /initrd, not /init) did | ||
105 | some setup and then returned to the kernel, while the init program from | ||
106 | initramfs is not expected to return to the kernel. (If /init needs to hand | ||
107 | off control it can overmount / with a new root device and exec another init | ||
108 | program. See the switch_root utility, below.) | ||
109 | |||
110 | - When switching another root device, initrd would pivot_root and then | ||
111 | umount the ramdisk. But initramfs is rootfs: you can neither pivot_root | ||
112 | rootfs, nor unmount it. Instead delete everything out of rootfs to | ||
113 | free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs | ||
114 | with the new root (cd /newmount; mount --move . /; chroot .), attach | ||
115 | stdin/stdout/stderr to the new /dev/console, and exec the new init. | ||
116 | |||
117 | Since this is a remarkably persnickity process (and involves deleting | ||
118 | commands before you can run them), the klibc package introduced a helper | ||
119 | program (utils/run_init.c) to do all this for you. Most other packages | ||
120 | (such as busybox) have named this command "switch_root". | ||
121 | |||
122 | Populating initramfs: | ||
123 | --------------------- | ||
124 | |||
125 | The 2.6 kernel build process always creates a gzipped cpio format initramfs | ||
126 | archive and links it into the resulting kernel binary. By default, this | ||
127 | archive is empty (consuming 134 bytes on x86). The config option | ||
128 | CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices | ||
129 | in menuconfig, and living in usr/Kconfig) can be used to specify a source for | ||
130 | the initramfs archive, which will automatically be incorporated into the | ||
131 | resulting binary. This option can point to an existing gzipped cpio archive, a | ||
132 | directory containing files to be archived, or a text file specification such | ||
133 | as the following example: | ||
134 | |||
135 | dir /dev 755 0 0 | ||
136 | nod /dev/console 644 0 0 c 5 1 | ||
137 | nod /dev/loop0 644 0 0 b 7 0 | ||
138 | dir /bin 755 1000 1000 | ||
139 | slink /bin/sh busybox 777 0 0 | ||
140 | file /bin/busybox initramfs/busybox 755 0 0 | ||
141 | dir /proc 755 0 0 | ||
142 | dir /sys 755 0 0 | ||
143 | dir /mnt 755 0 0 | ||
144 | file /init initramfs/init.sh 755 0 0 | ||
145 | |||
146 | One advantage of the text file is that root access is not required to | ||
147 | set permissions or create device nodes in the new archive. (Note that those | ||
148 | two example "file" entries expect to find files named "init.sh" and "busybox" in | ||
149 | a directory called "initramfs", under the linux-2.6.* directory. See | ||
150 | Documentation/early-userspace/README for more details.) | ||
151 | |||
152 | If you don't already understand what shared libraries, devices, and paths | ||
153 | you need to get a minimal root filesystem up and running, here are some | ||
154 | references: | ||
155 | http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ | ||
156 | http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html | ||
157 | http://www.linuxfromscratch.org/lfs/view/stable/ | ||
158 | |||
159 | The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is | ||
160 | designed to be a tiny C library to statically link early userspace | ||
161 | code against, along with some related utilities. It is BSD licensed. | ||
162 | |||
163 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) | ||
164 | myself. These are LGPL and GPL, respectively. | ||
165 | |||
166 | In theory you could use glibc, but that's not well suited for small embedded | ||
167 | uses like this. (A "hello world" program statically linked against glibc is | ||
168 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do | ||
169 | name lookups, even when otherwise statically linked.) | ||
170 | |||
171 | Future directions: | ||
172 | ------------------ | ||
173 | |||
174 | Today (2.6.14), initramfs is always compiled in, but not always used. The | ||
175 | kernel falls back to legacy boot code that is reached only if initramfs does | ||
176 | not contain an /init program. The fallback is legacy code, there to ensure a | ||
177 | smooth transition and allowing early boot functionality to gradually move to | ||
178 | "early userspace" (I.E. initramfs). | ||
179 | |||
180 | The move to early userspace is necessary because finding and mounting the real | ||
181 | root device is complex. Root partitions can span multiple devices (raid or | ||
182 | separate journal). They can be out on the network (requiring dhcp, setting a | ||
183 | specific mac address, logging into a server, etc). They can live on removable | ||
184 | media, with dynamically allocated major/minor numbers and persistent naming | ||
185 | issues requiring a full udev implementation to sort out. They can be | ||
186 | compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, | ||
187 | and so on. | ||
188 | |||
189 | This kind of complexity (which inevitably includes policy) is rightly handled | ||
190 | in userspace. Both klibc and busybox/uClibc are working on simple initramfs | ||
191 | packages to drop into a kernel build, and when standard solutions are ready | ||
192 | and widely deployed, the kernel's legacy early boot code will become obsolete | ||
193 | and a candidate for the feature removal schedule. | ||
194 | |||
195 | But that's a while off yet. | ||
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f042c12e0ed2..ee4c0a8b8db7 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -3,7 +3,7 @@ | |||
3 | 3 | ||
4 | Original author: Richard Gooch <rgooch@atnf.csiro.au> | 4 | Original author: Richard Gooch <rgooch@atnf.csiro.au> |
5 | 5 | ||
6 | Last updated on August 25, 2005 | 6 | Last updated on October 28, 2005 |
7 | 7 | ||
8 | Copyright (C) 1999 Richard Gooch | 8 | Copyright (C) 1999 Richard Gooch |
9 | Copyright (C) 2005 Pekka Enberg | 9 | Copyright (C) 2005 Pekka Enberg |
@@ -11,62 +11,61 @@ | |||
11 | This file is released under the GPLv2. | 11 | This file is released under the GPLv2. |
12 | 12 | ||
13 | 13 | ||
14 | What is it? | 14 | Introduction |
15 | =========== | 15 | ============ |
16 | 16 | ||
17 | The Virtual File System (otherwise known as the Virtual Filesystem | 17 | The Virtual File System (also known as the Virtual Filesystem Switch) |
18 | Switch) is the software layer in the kernel that provides the | 18 | is the software layer in the kernel that provides the filesystem |
19 | filesystem interface to userspace programs. It also provides an | 19 | interface to userspace programs. It also provides an abstraction |
20 | abstraction within the kernel which allows different filesystem | 20 | within the kernel which allows different filesystem implementations to |
21 | implementations to coexist. | 21 | coexist. |
22 | 22 | ||
23 | VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so | ||
24 | on are called from a process context. Filesystem locking is described | ||
25 | in the document Documentation/filesystems/Locking. | ||
23 | 26 | ||
24 | A Quick Look At How It Works | ||
25 | ============================ | ||
26 | 27 | ||
27 | In this section I'll briefly describe how things work, before | 28 | Directory Entry Cache (dcache) |
28 | launching into the details. I'll start with describing what happens | 29 | ------------------------------ |
29 | when user programs open and manipulate files, and then look from the | ||
30 | other view which is how a filesystem is supported and subsequently | ||
31 | mounted. | ||
32 | |||
33 | |||
34 | Opening a File | ||
35 | -------------- | ||
36 | |||
37 | The VFS implements the open(2), stat(2), chmod(2) and similar system | ||
38 | calls. The pathname argument is used by the VFS to search through the | ||
39 | directory entry cache (dentry cache or "dcache"). This provides a very | ||
40 | fast look-up mechanism to translate a pathname (filename) into a | ||
41 | specific dentry. | ||
42 | |||
43 | An individual dentry usually has a pointer to an inode. Inodes are the | ||
44 | things that live on disc drives, and can be regular files (you know: | ||
45 | those things that you write data into), directories, FIFOs and other | ||
46 | beasts. Dentries live in RAM and are never saved to disc: they exist | ||
47 | only for performance. Inodes live on disc and are copied into memory | ||
48 | when required. Later any changes are written back to disc. The inode | ||
49 | that lives in RAM is a VFS inode, and it is this which the dentry | ||
50 | points to. A single inode can be pointed to by multiple dentries | ||
51 | (think about hardlinks). | ||
52 | |||
53 | The dcache is meant to be a view into your entire filespace. Unlike | ||
54 | Linus, most of us losers can't fit enough dentries into RAM to cover | ||
55 | all of our filespace, so the dcache has bits missing. In order to | ||
56 | resolve your pathname into a dentry, the VFS may have to resort to | ||
57 | creating dentries along the way, and then loading the inode. This is | ||
58 | done by looking up the inode. | ||
59 | |||
60 | To look up an inode (usually read from disc) requires that the VFS | ||
61 | calls the lookup() method of the parent directory inode. This method | ||
62 | is installed by the specific filesystem implementation that the inode | ||
63 | lives in. There will be more on this later. | ||
64 | 30 | ||
65 | Once the VFS has the required dentry (and hence the inode), we can do | 31 | The VFS implements the open(2), stat(2), chmod(2), and similar system |
66 | all those boring things like open(2) the file, or stat(2) it to peek | 32 | calls. The pathname argument that is passed to them is used by the VFS |
67 | at the inode data. The stat(2) operation is fairly simple: once the | 33 | to search through the directory entry cache (also known as the dentry |
68 | VFS has the dentry, it peeks at the inode data and passes some of it | 34 | cache or dcache). This provides a very fast look-up mechanism to |
69 | back to userspace. | 35 | translate a pathname (filename) into a specific dentry. Dentries live |
36 | in RAM and are never saved to disc: they exist only for performance. | ||
37 | |||
38 | The dentry cache is meant to be a view into your entire filespace. As | ||
39 | most computers cannot fit all dentries in the RAM at the same time, | ||
40 | some bits of the cache are missing. In order to resolve your pathname | ||
41 | into a dentry, the VFS may have to resort to creating dentries along | ||
42 | the way, and then loading the inode. This is done by looking up the | ||
43 | inode. | ||
44 | |||
45 | |||
46 | The Inode Object | ||
47 | ---------------- | ||
48 | |||
49 | An individual dentry usually has a pointer to an inode. Inodes are | ||
50 | filesystem objects such as regular files, directories, FIFOs and other | ||
51 | beasts. They live either on the disc (for block device filesystems) | ||
52 | or in the memory (for pseudo filesystems). Inodes that live on the | ||
53 | disc are copied into the memory when required and changes to the inode | ||
54 | are written back to disc. A single inode can be pointed to by multiple | ||
55 | dentries (hard links, for example, do this). | ||
56 | |||
57 | To look up an inode requires that the VFS calls the lookup() method of | ||
58 | the parent directory inode. This method is installed by the specific | ||
59 | filesystem implementation that the inode lives in. Once the VFS has | ||
60 | the required dentry (and hence the inode), we can do all those boring | ||
61 | things like open(2) the file, or stat(2) it to peek at the inode | ||
62 | data. The stat(2) operation is fairly simple: once the VFS has the | ||
63 | dentry, it peeks at the inode data and passes some of it back to | ||
64 | userspace. | ||
65 | |||
66 | |||
67 | The File Object | ||
68 | --------------- | ||
70 | 69 | ||
71 | Opening a file requires another operation: allocation of a file | 70 | Opening a file requires another operation: allocation of a file |
72 | structure (this is the kernel-side implementation of file | 71 | structure (this is the kernel-side implementation of file |
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with | |||
74 | a pointer to the dentry and a set of file operation member functions. | 73 | a pointer to the dentry and a set of file operation member functions. |
75 | These are taken from the inode data. The open() file method is then | 74 | These are taken from the inode data. The open() file method is then |
76 | called so the specific filesystem implementation can do it's work. You | 75 | called so the specific filesystem implementation can do it's work. You |
77 | can see that this is another switch performed by the VFS. | 76 | can see that this is another switch performed by the VFS. The file |
78 | 77 | structure is placed into the file descriptor table for the process. | |
79 | The file structure is placed into the file descriptor table for the | ||
80 | process. | ||
81 | 78 | ||
82 | Reading, writing and closing files (and other assorted VFS operations) | 79 | Reading, writing and closing files (and other assorted VFS operations) |
83 | is done by using the userspace file descriptor to grab the appropriate | 80 | is done by using the userspace file descriptor to grab the appropriate |
84 | file structure, and then calling the required file structure method | 81 | file structure, and then calling the required file structure method to |
85 | function to do whatever is required. | 82 | do whatever is required. For as long as the file is open, it keeps the |
86 | 83 | dentry in use, which in turn means that the VFS inode is still in use. | |
87 | For as long as the file is open, it keeps the dentry "open" (in use), | ||
88 | which in turn means that the VFS inode is still in use. | ||
89 | |||
90 | All VFS system calls (i.e. open(2), stat(2), read(2), write(2), | ||
91 | chmod(2) and so on) are called from a process context. You should | ||
92 | assume that these calls are made without any kernel locks being | ||
93 | held. This means that the processes may be executing the same piece of | ||
94 | filesystem or driver code at the same time, on different | ||
95 | processors. You should ensure that access to shared resources is | ||
96 | protected by appropriate locks. | ||
97 | 84 | ||
98 | 85 | ||
99 | Registering and Mounting a Filesystem | 86 | Registering and Mounting a Filesystem |
100 | ------------------------------------- | 87 | ===================================== |
101 | 88 | ||
102 | If you want to support a new kind of filesystem in the kernel, all you | 89 | To register and unregister a filesystem, use the following API |
103 | need to do is call register_filesystem(). You pass a structure | 90 | functions: |
104 | describing the filesystem implementation (struct file_system_type) | ||
105 | which is then added to an internal table of supported filesystems. You | ||
106 | can do: | ||
107 | 91 | ||
108 | % cat /proc/filesystems | 92 | #include <linux/fs.h> |
109 | 93 | ||
110 | to see what filesystems are currently available on your system. | 94 | extern int register_filesystem(struct file_system_type *); |
95 | extern int unregister_filesystem(struct file_system_type *); | ||
111 | 96 | ||
112 | When a request is made to mount a block device onto a directory in | 97 | The passed struct file_system_type describes your filesystem. When a |
113 | your filespace the VFS will call the appropriate method for the | 98 | request is made to mount a device onto a directory in your filespace, |
114 | specific filesystem. The dentry for the mount point will then be | 99 | the VFS will call the appropriate get_sb() method for the specific |
115 | updated to point to the root inode for the new filesystem. | 100 | filesystem. The dentry for the mount point will then be updated to |
101 | point to the root inode for the new filesystem. | ||
116 | 102 | ||
117 | It's now time to look at things in more detail. | 103 | You can see all filesystems that are registered to the kernel in the |
104 | file /proc/filesystems. | ||
118 | 105 | ||
119 | 106 | ||
120 | struct file_system_type | 107 | struct file_system_type |
121 | ======================= | 108 | ----------------------- |
122 | 109 | ||
123 | This describes the filesystem. As of kernel 2.6.13, the following | 110 | This describes the filesystem. As of kernel 2.6.13, the following |
124 | members are defined: | 111 | members are defined: |
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments: | |||
197 | int silent: whether or not to be silent on error | 184 | int silent: whether or not to be silent on error |
198 | 185 | ||
199 | 186 | ||
187 | The Superblock Object | ||
188 | ===================== | ||
189 | |||
190 | A superblock object represents a mounted filesystem. | ||
191 | |||
192 | |||
200 | struct super_operations | 193 | struct super_operations |
201 | ======================= | 194 | ----------------------- |
202 | 195 | ||
203 | This describes how the VFS can manipulate the superblock of your | 196 | This describes how the VFS can manipulate the superblock of your |
204 | filesystem. As of kernel 2.6.13, the following members are defined: | 197 | filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -286,9 +279,9 @@ or bottom half). | |||
286 | a superblock. The second parameter indicates whether the method | 279 | a superblock. The second parameter indicates whether the method |
287 | should wait until the write out has been completed. Optional. | 280 | should wait until the write out has been completed. Optional. |
288 | 281 | ||
289 | write_super_lockfs: called when VFS is locking a filesystem and forcing | 282 | write_super_lockfs: called when VFS is locking a filesystem and |
290 | it into a consistent state. This function is currently used by the | 283 | forcing it into a consistent state. This method is currently |
291 | Logical Volume Manager (LVM). | 284 | used by the Logical Volume Manager (LVM). |
292 | 285 | ||
293 | unlockfs: called when VFS is unlocking a filesystem and making it writable | 286 | unlockfs: called when VFS is unlocking a filesystem and making it writable |
294 | again. | 287 | again. |
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which | |||
317 | describes the methods that can be performed on individual inodes. | 310 | describes the methods that can be performed on individual inodes. |
318 | 311 | ||
319 | 312 | ||
313 | The Inode Object | ||
314 | ================ | ||
315 | |||
316 | An inode object represents an object within the filesystem. | ||
317 | |||
318 | |||
320 | struct inode_operations | 319 | struct inode_operations |
321 | ======================= | 320 | ----------------------- |
322 | 321 | ||
323 | This describes how the VFS can manipulate an inode in your | 322 | This describes how the VFS can manipulate an inode in your |
324 | filesystem. As of kernel 2.6.13, the following members are defined: | 323 | filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -394,51 +393,62 @@ otherwise noted. | |||
394 | will probably need to call d_instantiate() just as you would | 393 | will probably need to call d_instantiate() just as you would |
395 | in the create() method | 394 | in the create() method |
396 | 395 | ||
396 | rename: called by the rename(2) system call to rename the object to | ||
397 | have the parent and name given by the second inode and dentry. | ||
398 | |||
397 | readlink: called by the readlink(2) system call. Only required if | 399 | readlink: called by the readlink(2) system call. Only required if |
398 | you want to support reading symbolic links | 400 | you want to support reading symbolic links |
399 | 401 | ||
400 | follow_link: called by the VFS to follow a symbolic link to the | 402 | follow_link: called by the VFS to follow a symbolic link to the |
401 | inode it points to. Only required if you want to support | 403 | inode it points to. Only required if you want to support |
402 | symbolic links. This function returns a void pointer cookie | 404 | symbolic links. This method returns a void pointer cookie |
403 | that is passed to put_link(). | 405 | that is passed to put_link(). |
404 | 406 | ||
405 | put_link: called by the VFS to release resources allocated by | 407 | put_link: called by the VFS to release resources allocated by |
406 | follow_link(). The cookie returned by follow_link() is passed to | 408 | follow_link(). The cookie returned by follow_link() is passed |
407 | to this function as the last parameter. It is used by filesystems | 409 | to to this method as the last parameter. It is used by |
408 | such as NFS where page cache is not stable (i.e. page that was | 410 | filesystems such as NFS where page cache is not stable |
409 | installed when the symbolic link walk started might not be in the | 411 | (i.e. page that was installed when the symbolic link walk |
410 | page cache at the end of the walk). | 412 | started might not be in the page cache at the end of the |
411 | 413 | walk). | |
412 | truncate: called by the VFS to change the size of a file. The i_size | 414 | |
413 | field of the inode is set to the desired size by the VFS before | 415 | truncate: called by the VFS to change the size of a file. The |
414 | this function is called. This function is called by the truncate(2) | 416 | i_size field of the inode is set to the desired size by the |
415 | system call and related functionality. | 417 | VFS before this method is called. This method is called by |
418 | the truncate(2) system call and related functionality. | ||
416 | 419 | ||
417 | permission: called by the VFS to check for access rights on a POSIX-like | 420 | permission: called by the VFS to check for access rights on a POSIX-like |
418 | filesystem. | 421 | filesystem. |
419 | 422 | ||
420 | setattr: called by the VFS to set attributes for a file. This function is | 423 | setattr: called by the VFS to set attributes for a file. This method |
421 | called by chmod(2) and related system calls. | 424 | is called by chmod(2) and related system calls. |
422 | 425 | ||
423 | getattr: called by the VFS to get attributes of a file. This function is | 426 | getattr: called by the VFS to get attributes of a file. This method |
424 | called by stat(2) and related system calls. | 427 | is called by stat(2) and related system calls. |
425 | 428 | ||
426 | setxattr: called by the VFS to set an extended attribute for a file. | 429 | setxattr: called by the VFS to set an extended attribute for a file. |
427 | Extended attribute is a name:value pair associated with an inode. This | 430 | Extended attribute is a name:value pair associated with an |
428 | function is called by setxattr(2) system call. | 431 | inode. This method is called by setxattr(2) system call. |
432 | |||
433 | getxattr: called by the VFS to retrieve the value of an extended | ||
434 | attribute name. This method is called by getxattr(2) function | ||
435 | call. | ||
429 | 436 | ||
430 | getxattr: called by the VFS to retrieve the value of an extended attribute | 437 | listxattr: called by the VFS to list all extended attributes for a |
431 | name. This function is called by getxattr(2) function call. | 438 | given file. This method is called by listxattr(2) system call. |
432 | 439 | ||
433 | listxattr: called by the VFS to list all extended attributes for a given | 440 | removexattr: called by the VFS to remove an extended attribute from |
434 | file. This function is called by listxattr(2) system call. | 441 | a file. This method is called by removexattr(2) system call. |
435 | 442 | ||
436 | removexattr: called by the VFS to remove an extended attribute from a file. | 443 | |
437 | This function is called by removexattr(2) system call. | 444 | The Address Space Object |
445 | ======================== | ||
446 | |||
447 | The address space object is used to identify pages in the page cache. | ||
438 | 448 | ||
439 | 449 | ||
440 | struct address_space_operations | 450 | struct address_space_operations |
441 | =============================== | 451 | ------------------------------- |
442 | 452 | ||
443 | This describes how the VFS can manipulate mapping of a file to page cache in | 453 | This describes how the VFS can manipulate mapping of a file to page cache in |
444 | your filesystem. As of kernel 2.6.13, the following members are defined: | 454 | your filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -502,8 +512,14 @@ struct address_space_operations { | |||
502 | it. An example implementation can be found in fs/ext2/xip.c. | 512 | it. An example implementation can be found in fs/ext2/xip.c. |
503 | 513 | ||
504 | 514 | ||
515 | The File Object | ||
516 | =============== | ||
517 | |||
518 | A file object represents a file opened by a process. | ||
519 | |||
520 | |||
505 | struct file_operations | 521 | struct file_operations |
506 | ====================== | 522 | ---------------------- |
507 | 523 | ||
508 | This describes how the VFS can manipulate an open file. As of kernel | 524 | This describes how the VFS can manipulate an open file. As of kernel |
509 | 2.6.13, the following members are defined: | 525 | 2.6.13, the following members are defined: |
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a | |||
661 | directory. | 677 | directory. |
662 | 678 | ||
663 | 679 | ||
664 | Directory Entry Cache APIs | 680 | Directory Entry Cache API |
665 | -------------------------- | 681 | -------------------------- |
666 | 682 | ||
667 | There are a number of functions defined which permit a filesystem to | 683 | There are a number of functions defined which permit a filesystem to |
@@ -705,178 +721,24 @@ manipulate dentries: | |||
705 | and the dentry is returned. The caller must use d_put() | 721 | and the dentry is returned. The caller must use d_put() |
706 | to free the dentry when it finishes using it. | 722 | to free the dentry when it finishes using it. |
707 | 723 | ||
724 | For further information on dentry locking, please refer to the document | ||
725 | Documentation/filesystems/dentry-locking.txt. | ||
708 | 726 | ||
709 | RCU-based dcache locking model | ||
710 | ------------------------------ | ||
711 | 727 | ||
712 | On many workloads, the most common operation on dcache is | 728 | Resources |
713 | to look up a dentry, given a parent dentry and the name | 729 | ========= |
714 | of the child. Typically, for every open(), stat() etc., | 730 | |
715 | the dentry corresponding to the pathname will be looked | 731 | (Note some of these resources are not up-to-date with the latest kernel |
716 | up by walking the tree starting with the first component | 732 | version.) |
717 | of the pathname and using that dentry along with the next | 733 | |
718 | component to look up the next level and so on. Since it | 734 | Creating Linux virtual filesystems. 2002 |
719 | is a frequent operation for workloads like multiuser | 735 | <http://lwn.net/Articles/13325/> |
720 | environments and web servers, it is important to optimize | 736 | |
721 | this path. | 737 | The Linux Virtual File-system Layer by Neil Brown. 1999 |
722 | 738 | <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> | |
723 | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus | 739 | |
724 | in every component during path look-up. Since 2.5.10 onwards, | 740 | A tour of the Linux VFS by Michael K. Johnson. 1996 |
725 | fast-walk algorithm changed this by holding the dcache_lock | 741 | <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> |
726 | at the beginning and walking as many cached path component | ||
727 | dentries as possible. This significantly decreases the number | ||
728 | of acquisition of dcache_lock. However it also increases the | ||
729 | lock hold time significantly and affects performance in large | ||
730 | SMP machines. Since 2.5.62 kernel, dcache has been using | ||
731 | a new locking model that uses RCU to make dcache look-up | ||
732 | lock-free. | ||
733 | |||
734 | The current dcache locking model is not very different from the existing | ||
735 | dcache locking model. Prior to 2.5.62 kernel, dcache_lock | ||
736 | protected the hash chain, d_child, d_alias, d_lru lists as well | ||
737 | as d_inode and several other things like mount look-up. RCU-based | ||
738 | changes affect only the way the hash chain is protected. For everything | ||
739 | else the dcache_lock must be taken for both traversing as well as | ||
740 | updating. The hash chain updates too take the dcache_lock. | ||
741 | The significant change is the way d_lookup traverses the hash chain, | ||
742 | it doesn't acquire the dcache_lock for this and rely on RCU to | ||
743 | ensure that the dentry has not been *freed*. | ||
744 | |||
745 | |||
746 | Dcache locking details | ||
747 | ---------------------- | ||
748 | 742 | ||
749 | For many multi-user workloads, open() and stat() on files are | 743 | A small trail through the Linux kernel by Andries Brouwer. 2001 |
750 | very frequently occurring operations. Both involve walking | 744 | <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> |
751 | of path names to find the dentry corresponding to the | ||
752 | concerned file. In 2.4 kernel, dcache_lock was held | ||
753 | during look-up of each path component. Contention and | ||
754 | cache-line bouncing of this global lock caused significant | ||
755 | scalability problems. With the introduction of RCU | ||
756 | in Linux kernel, this was worked around by making | ||
757 | the look-up of path components during path walking lock-free. | ||
758 | |||
759 | |||
760 | Safe lock-free look-up of dcache hash table | ||
761 | =========================================== | ||
762 | |||
763 | Dcache is a complex data structure with the hash table entries | ||
764 | also linked together in other lists. In 2.4 kernel, dcache_lock | ||
765 | protected all the lists. We applied RCU only on hash chain | ||
766 | walking. The rest of the lists are still protected by dcache_lock. | ||
767 | Some of the important changes are : | ||
768 | |||
769 | 1. The deletion from hash chain is done using hlist_del_rcu() macro which | ||
770 | doesn't initialize next pointer of the deleted dentry and this | ||
771 | allows us to walk safely lock-free while a deletion is happening. | ||
772 | |||
773 | 2. Insertion of a dentry into the hash table is done using | ||
774 | hlist_add_head_rcu() which take care of ordering the writes - | ||
775 | the writes to the dentry must be visible before the dentry | ||
776 | is inserted. This works in conjunction with hlist_for_each_rcu() | ||
777 | while walking the hash chain. The only requirement is that | ||
778 | all initialization to the dentry must be done before hlist_add_head_rcu() | ||
779 | since we don't have dcache_lock protection while traversing | ||
780 | the hash chain. This isn't different from the existing code. | ||
781 | |||
782 | 3. The dentry looked up without holding dcache_lock by cannot be | ||
783 | returned for walking if it is unhashed. It then may have a NULL | ||
784 | d_inode or other bogosity since RCU doesn't protect the other | ||
785 | fields in the dentry. We therefore use a flag DCACHE_UNHASHED to | ||
786 | indicate unhashed dentries and use this in conjunction with a | ||
787 | per-dentry lock (d_lock). Once looked up without the dcache_lock, | ||
788 | we acquire the per-dentry lock (d_lock) and check if the | ||
789 | dentry is unhashed. If so, the look-up is failed. If not, the | ||
790 | reference count of the dentry is increased and the dentry is returned. | ||
791 | |||
792 | 4. Once a dentry is looked up, it must be ensured during the path | ||
793 | walk for that component it doesn't go away. In pre-2.5.10 code, | ||
794 | this was done holding a reference to the dentry. dcache_rcu does | ||
795 | the same. In some sense, dcache_rcu path walking looks like | ||
796 | the pre-2.5.10 version. | ||
797 | |||
798 | 5. All dentry hash chain updates must take the dcache_lock as well as | ||
799 | the per-dentry lock in that order. dput() does this to ensure | ||
800 | that a dentry that has just been looked up in another CPU | ||
801 | doesn't get deleted before dget() can be done on it. | ||
802 | |||
803 | 6. There are several ways to do reference counting of RCU protected | ||
804 | objects. One such example is in ipv4 route cache where | ||
805 | deferred freeing (using call_rcu()) is done as soon as | ||
806 | the reference count goes to zero. This cannot be done in | ||
807 | the case of dentries because tearing down of dentries | ||
808 | require blocking (dentry_iput()) which isn't supported from | ||
809 | RCU callbacks. Instead, tearing down of dentries happen | ||
810 | synchronously in dput(), but actual freeing happens later | ||
811 | when RCU grace period is over. This allows safe lock-free | ||
812 | walking of the hash chains, but a matched dentry may have | ||
813 | been partially torn down. The checking of DCACHE_UNHASHED | ||
814 | flag with d_lock held detects such dentries and prevents | ||
815 | them from being returned from look-up. | ||
816 | |||
817 | |||
818 | Maintaining POSIX rename semantics | ||
819 | ================================== | ||
820 | |||
821 | Since look-up of dentries is lock-free, it can race against | ||
822 | a concurrent rename operation. For example, during rename | ||
823 | of file A to B, look-up of either A or B must succeed. | ||
824 | So, if look-up of B happens after A has been removed from the | ||
825 | hash chain but not added to the new hash chain, it may fail. | ||
826 | Also, a comparison while the name is being written concurrently | ||
827 | by a rename may result in false positive matches violating | ||
828 | rename semantics. Issues related to race with rename are | ||
829 | handled as described below : | ||
830 | |||
831 | 1. Look-up can be done in two ways - d_lookup() which is safe | ||
832 | from simultaneous renames and __d_lookup() which is not. | ||
833 | If __d_lookup() fails, it must be followed up by a d_lookup() | ||
834 | to correctly determine whether a dentry is in the hash table | ||
835 | or not. d_lookup() protects look-ups using a sequence | ||
836 | lock (rename_lock). | ||
837 | |||
838 | 2. The name associated with a dentry (d_name) may be changed if | ||
839 | a rename is allowed to happen simultaneously. To avoid memcmp() | ||
840 | in __d_lookup() go out of bounds due to a rename and false | ||
841 | positive comparison, the name comparison is done while holding the | ||
842 | per-dentry lock. This prevents concurrent renames during this | ||
843 | operation. | ||
844 | |||
845 | 3. Hash table walking during look-up may move to a different bucket as | ||
846 | the current dentry is moved to a different bucket due to rename. | ||
847 | But we use hlists in dcache hash table and they are null-terminated. | ||
848 | So, even if a dentry moves to a different bucket, hash chain | ||
849 | walk will terminate. [with a list_head list, it may not since | ||
850 | termination is when the list_head in the original bucket is reached]. | ||
851 | Since we redo the d_parent check and compare name while holding | ||
852 | d_lock, lock-free look-up will not race against d_move(). | ||
853 | |||
854 | 4. There can be a theoretical race when a dentry keeps coming back | ||
855 | to original bucket due to double moves. Due to this look-up may | ||
856 | consider that it has never moved and can end up in a infinite loop. | ||
857 | But this is not any worse that theoretical livelocks we already | ||
858 | have in the kernel. | ||
859 | |||
860 | |||
861 | Important guidelines for filesystem developers related to dcache_rcu | ||
862 | ==================================================================== | ||
863 | |||
864 | 1. Existing dcache interfaces (pre-2.5.62) exported to filesystem | ||
865 | don't change. Only dcache internal implementation changes. However | ||
866 | filesystems *must not* delete from the dentry hash chains directly | ||
867 | using the list macros like allowed earlier. They must use dcache | ||
868 | APIs like d_drop() or __d_drop() depending on the situation. | ||
869 | |||
870 | 2. d_flags is now protected by a per-dentry lock (d_lock). All | ||
871 | access to d_flags must be protected by it. | ||
872 | |||
873 | 3. For a hashed dentry, checking of d_count needs to be protected | ||
874 | by d_lock. | ||
875 | |||
876 | |||
877 | Papers and other documentation on dcache locking | ||
878 | ================================================ | ||
879 | |||
880 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | ||
881 | |||
882 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | ||
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index c7d5d0c7067d..74aeb142ae5f 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt | |||
@@ -19,15 +19,43 @@ Mount Options | |||
19 | 19 | ||
20 | When mounting an XFS filesystem, the following options are accepted. | 20 | When mounting an XFS filesystem, the following options are accepted. |
21 | 21 | ||
22 | biosize=size | 22 | allocsize=size |
23 | Sets the preferred buffered I/O size (default size is 64K). | 23 | Sets the buffered I/O end-of-file preallocation size when |
24 | "size" must be expressed as the logarithm (base2) of the | 24 | doing delayed allocation writeout (default size is 64KiB). |
25 | desired I/O size. | 25 | Valid values for this option are page size (typically 4KiB) |
26 | Valid values for this option are 14 through 16, inclusive | 26 | through to 1GiB, inclusive, in power-of-2 increments. |
27 | (i.e. 16K, 32K, and 64K bytes). On machines with a 4K | 27 | |
28 | pagesize, 13 (8K bytes) is also a valid size. | 28 | attr2/noattr2 |
29 | The preferred buffered I/O size can also be altered on an | 29 | The options enable/disable (default is disabled for backward |
30 | individual file basis using the ioctl(2) system call. | 30 | compatibility on-disk) an "opportunistic" improvement to be |
31 | made in the way inline extended attributes are stored on-disk. | ||
32 | When the new form is used for the first time (by setting or | ||
33 | removing extended attributes) the on-disk superblock feature | ||
34 | bit field will be updated to reflect this format being in use. | ||
35 | |||
36 | barrier | ||
37 | Enables the use of block layer write barriers for writes into | ||
38 | the journal and unwritten extent conversion. This allows for | ||
39 | drive level write caching to be enabled, for devices that | ||
40 | support write barriers. | ||
41 | |||
42 | dmapi | ||
43 | Enable the DMAPI (Data Management API) event callouts. | ||
44 | Use with the "mtpt" option. | ||
45 | |||
46 | grpid/bsdgroups and nogrpid/sysvgroups | ||
47 | These options define what group ID a newly created file gets. | ||
48 | When grpid is set, it takes the group ID of the directory in | ||
49 | which it is created; otherwise (the default) it takes the fsgid | ||
50 | of the current process, unless the directory has the setgid bit | ||
51 | set, in which case it takes the gid from the parent directory, | ||
52 | and also gets the setgid bit set if it is a directory itself. | ||
53 | |||
54 | ihashsize=value | ||
55 | Sets the number of hash buckets available for hashing the | ||
56 | in-memory inodes of the specified mount point. If a value | ||
57 | of zero is used, the value selected by the default algorithm | ||
58 | will be displayed in /proc/mounts. | ||
31 | 59 | ||
32 | ikeep/noikeep | 60 | ikeep/noikeep |
33 | When inode clusters are emptied of inodes, keep them around | 61 | When inode clusters are emptied of inodes, keep them around |
@@ -35,12 +63,31 @@ When mounting an XFS filesystem, the following options are accepted. | |||
35 | and is still the default for now. Using the noikeep option, | 63 | and is still the default for now. Using the noikeep option, |
36 | inode clusters are returned to the free space pool. | 64 | inode clusters are returned to the free space pool. |
37 | 65 | ||
66 | inode64 | ||
67 | Indicates that XFS is allowed to create inodes at any location | ||
68 | in the filesystem, including those which will result in inode | ||
69 | numbers occupying more than 32 bits of significance. This is | ||
70 | provided for backwards compatibility, but causes problems for | ||
71 | backup applications that cannot handle large inode numbers. | ||
72 | |||
73 | largeio/nolargeio | ||
74 | If "nolargeio" is specified, the optimal I/O reported in | ||
75 | st_blksize by stat(2) will be as small as possible to allow user | ||
76 | applications to avoid inefficient read/modify/write I/O. | ||
77 | If "largeio" specified, a filesystem that has a "swidth" specified | ||
78 | will return the "swidth" value (in bytes) in st_blksize. If the | ||
79 | filesystem does not have a "swidth" specified but does specify | ||
80 | an "allocsize" then "allocsize" (in bytes) will be returned | ||
81 | instead. | ||
82 | If neither of these two options are specified, then filesystem | ||
83 | will behave as if "nolargeio" was specified. | ||
84 | |||
38 | logbufs=value | 85 | logbufs=value |
39 | Set the number of in-memory log buffers. Valid numbers range | 86 | Set the number of in-memory log buffers. Valid numbers range |
40 | from 2-8 inclusive. | 87 | from 2-8 inclusive. |
41 | The default value is 8 buffers for filesystems with a | 88 | The default value is 8 buffers for filesystems with a |
42 | blocksize of 64K, 4 buffers for filesystems with a blocksize | 89 | blocksize of 64KiB, 4 buffers for filesystems with a blocksize |
43 | of 32K, 3 buffers for filesystems with a blocksize of 16K | 90 | of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB |
44 | and 2 buffers for all other configurations. Increasing the | 91 | and 2 buffers for all other configurations. Increasing the |
45 | number of buffers may increase performance on some workloads | 92 | number of buffers may increase performance on some workloads |
46 | at the cost of the memory used for the additional log buffers | 93 | at the cost of the memory used for the additional log buffers |
@@ -49,10 +96,10 @@ When mounting an XFS filesystem, the following options are accepted. | |||
49 | logbsize=value | 96 | logbsize=value |
50 | Set the size of each in-memory log buffer. | 97 | Set the size of each in-memory log buffer. |
51 | Size may be specified in bytes, or in kilobytes with a "k" suffix. | 98 | Size may be specified in bytes, or in kilobytes with a "k" suffix. |
52 | Valid sizes for version 1 and version 2 logs are 16384 (16k) and | 99 | Valid sizes for version 1 and version 2 logs are 16384 (16k) and |
53 | 32768 (32k). Valid sizes for version 2 logs also include | 100 | 32768 (32k). Valid sizes for version 2 logs also include |
54 | 65536 (64k), 131072 (128k) and 262144 (256k). | 101 | 65536 (64k), 131072 (128k) and 262144 (256k). |
55 | The default value for machines with more than 32MB of memory | 102 | The default value for machines with more than 32MiB of memory |
56 | is 32768, machines with less memory use 16384 by default. | 103 | is 32768, machines with less memory use 16384 by default. |
57 | 104 | ||
58 | logdev=device and rtdev=device | 105 | logdev=device and rtdev=device |
@@ -62,6 +109,11 @@ When mounting an XFS filesystem, the following options are accepted. | |||
62 | optional, and the log section can be separate from the data | 109 | optional, and the log section can be separate from the data |
63 | section or contained within it. | 110 | section or contained within it. |
64 | 111 | ||
112 | mtpt=mountpoint | ||
113 | Use with the "dmapi" option. The value specified here will be | ||
114 | included in the DMAPI mount event, and should be the path of | ||
115 | the actual mountpoint that is used. | ||
116 | |||
65 | noalign | 117 | noalign |
66 | Data allocations will not be aligned at stripe unit boundaries. | 118 | Data allocations will not be aligned at stripe unit boundaries. |
67 | 119 | ||
@@ -91,13 +143,17 @@ When mounting an XFS filesystem, the following options are accepted. | |||
91 | O_SYNC writes can be lost if the system crashes. | 143 | O_SYNC writes can be lost if the system crashes. |
92 | If timestamp updates are critical, use the osyncisosync option. | 144 | If timestamp updates are critical, use the osyncisosync option. |
93 | 145 | ||
94 | quota/usrquota/uqnoenforce | 146 | uquota/usrquota/uqnoenforce/quota |
95 | User disk quota accounting enabled, and limits (optionally) | 147 | User disk quota accounting enabled, and limits (optionally) |
96 | enforced. | 148 | enforced. Refer to xfs_quota(8) for further details. |
97 | 149 | ||
98 | grpquota/gqnoenforce | 150 | gquota/grpquota/gqnoenforce |
99 | Group disk quota accounting enabled and limits (optionally) | 151 | Group disk quota accounting enabled and limits (optionally) |
100 | enforced. | 152 | enforced. Refer to xfs_quota(8) for further details. |
153 | |||
154 | pquota/prjquota/pqnoenforce | ||
155 | Project disk quota accounting enabled and limits (optionally) | ||
156 | enforced. Refer to xfs_quota(8) for further details. | ||
101 | 157 | ||
102 | sunit=value and swidth=value | 158 | sunit=value and swidth=value |
103 | Used to specify the stripe unit and width for a RAID device or | 159 | Used to specify the stripe unit and width for a RAID device or |
@@ -113,15 +169,21 @@ When mounting an XFS filesystem, the following options are accepted. | |||
113 | The "swidth" option is required if the "sunit" option has been | 169 | The "swidth" option is required if the "sunit" option has been |
114 | specified, and must be a multiple of the "sunit" value. | 170 | specified, and must be a multiple of the "sunit" value. |
115 | 171 | ||
172 | swalloc | ||
173 | Data allocations will be rounded up to stripe width boundaries | ||
174 | when the current end of file is being extended and the file | ||
175 | size is larger than the stripe width size. | ||
176 | |||
177 | |||
116 | sysctls | 178 | sysctls |
117 | ======= | 179 | ======= |
118 | 180 | ||
119 | The following sysctls are available for the XFS filesystem: | 181 | The following sysctls are available for the XFS filesystem: |
120 | 182 | ||
121 | fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) | 183 | fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) |
122 | Setting this to "1" clears accumulated XFS statistics | 184 | Setting this to "1" clears accumulated XFS statistics |
123 | in /proc/fs/xfs/stat. It then immediately resets to "0". | 185 | in /proc/fs/xfs/stat. It then immediately resets to "0". |
124 | 186 | ||
125 | fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) | 187 | fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) |
126 | The interval at which the xfssyncd thread flushes metadata | 188 | The interval at which the xfssyncd thread flushes metadata |
127 | out to disk. This thread will flush log activity out, and | 189 | out to disk. This thread will flush log activity out, and |
@@ -143,9 +205,9 @@ The following sysctls are available for the XFS filesystem: | |||
143 | XFS_ERRLEVEL_HIGH: 5 | 205 | XFS_ERRLEVEL_HIGH: 5 |
144 | 206 | ||
145 | fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127) | 207 | fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127) |
146 | Causes certain error conditions to call BUG(). Value is a bitmask; | 208 | Causes certain error conditions to call BUG(). Value is a bitmask; |
147 | AND together the tags which represent errors which should cause panics: | 209 | AND together the tags which represent errors which should cause panics: |
148 | 210 | ||
149 | XFS_NO_PTAG 0 | 211 | XFS_NO_PTAG 0 |
150 | XFS_PTAG_IFLUSH 0x00000001 | 212 | XFS_PTAG_IFLUSH 0x00000001 |
151 | XFS_PTAG_LOGRES 0x00000002 | 213 | XFS_PTAG_LOGRES 0x00000002 |
@@ -155,7 +217,7 @@ The following sysctls are available for the XFS filesystem: | |||
155 | XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 | 217 | XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 |
156 | XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 | 218 | XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 |
157 | 219 | ||
158 | This option is intended for debugging only. | 220 | This option is intended for debugging only. |
159 | 221 | ||
160 | fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) | 222 | fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) |
161 | Controls whether symlinks are created with mode 0777 (default) | 223 | Controls whether symlinks are created with mode 0777 (default) |
@@ -164,25 +226,37 @@ The following sysctls are available for the XFS filesystem: | |||
164 | fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) | 226 | fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) |
165 | Controls files created in SGID directories. | 227 | Controls files created in SGID directories. |
166 | If the group ID of the new file does not match the effective group | 228 | If the group ID of the new file does not match the effective group |
167 | ID or one of the supplementary group IDs of the parent dir, the | 229 | ID or one of the supplementary group IDs of the parent dir, the |
168 | ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl | 230 | ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl |
169 | is set. | 231 | is set. |
170 | 232 | ||
171 | fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) | 233 | fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) |
172 | Controls whether unprivileged users can use chown to "give away" | 234 | Controls whether unprivileged users can use chown to "give away" |
173 | a file to another user. | 235 | a file to another user. |
174 | 236 | ||
175 | fs.xfs.inherit_sync (Min: 0 Default: 1 Max 1) | 237 | fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) |
176 | Setting this to "1" will cause the "sync" flag set | 238 | Setting this to "1" will cause the "sync" flag set |
177 | by the chattr(1) command on a directory to be | 239 | by the xfs_io(8) chattr command on a directory to be |
178 | inherited by files in that directory. | 240 | inherited by files in that directory. |
179 | 241 | ||
180 | fs.xfs.inherit_nodump (Min: 0 Default: 1 Max 1) | 242 | fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) |
181 | Setting this to "1" will cause the "nodump" flag set | 243 | Setting this to "1" will cause the "nodump" flag set |
182 | by the chattr(1) command on a directory to be | 244 | by the xfs_io(8) chattr command on a directory to be |
183 | inherited by files in that directory. | 245 | inherited by files in that directory. |
184 | 246 | ||
185 | fs.xfs.inherit_noatime (Min: 0 Default: 1 Max 1) | 247 | fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) |
186 | Setting this to "1" will cause the "noatime" flag set | 248 | Setting this to "1" will cause the "noatime" flag set |
187 | by the chattr(1) command on a directory to be | 249 | by the xfs_io(8) chattr command on a directory to be |
188 | inherited by files in that directory. | 250 | inherited by files in that directory. |
251 | |||
252 | fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1) | ||
253 | Setting this to "1" will cause the "nosymlinks" flag set | ||
254 | by the xfs_io(8) chattr command on a directory to be | ||
255 | inherited by files in that directory. | ||
256 | |||
257 | fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) | ||
258 | In "inode32" allocation mode, this option determines how many | ||
259 | files the allocator attempts to allocate in the same allocation | ||
260 | group before moving to the next allocation group. The intent | ||
261 | is to control the rate at which the allocator moves between | ||
262 | allocation groups when allocating extents for new files. | ||