Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/dentry-locking.txt          173
-rw-r--r--  Documentation/filesystems/devfs/README                  5
-rw-r--r--  Documentation/filesystems/ext2.txt                      2
-rw-r--r--  Documentation/filesystems/ramfs-rootfs-initramfs.txt  195
-rw-r--r--  Documentation/filesystems/vfs.txt                      434
-rw-r--r--  Documentation/filesystems/xfs.txt                      142
6 files changed, 624 insertions(+), 327 deletions(-)
diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt
new file mode 100644
index 000000000000..4c0c575a4012
--- /dev/null
+++ b/Documentation/filesystems/dentry-locking.txt
@@ -0,0 +1,173 @@
RCU-based dcache locking model
==============================

On many workloads, the most common operation on dcache is to look up a
dentry, given a parent dentry and the name of the child. Typically,
for every open(), stat() etc., the dentry corresponding to the
pathname will be looked up by walking the tree starting with the first
component of the pathname and using that dentry along with the next
component to look up the next level, and so on. Since this is a
frequent operation for workloads like multiuser environments and web
servers, it is important to optimize this path.

Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in
every component during path look-up. From 2.5.10 onwards, the
fast-walk algorithm changed this by holding the dcache_lock at the
beginning and walking as many cached path component dentries as
possible. This significantly decreased the number of dcache_lock
acquisitions. However, it also significantly increased the lock hold
time and hurt performance on large SMP machines. Since the 2.5.62
kernel, dcache has been using a new locking model that uses RCU to
make dcache look-up lock-free.

The current dcache locking model is not very different from the
previous one. Prior to the 2.5.62 kernel, dcache_lock protected the
hash chain, the d_child, d_alias and d_lru lists, as well as d_inode
and several other things like mount look-up. The RCU-based changes
affect only the way the hash chain is protected. For everything else,
the dcache_lock must be taken for both traversing and updating. Hash
chain updates also take the dcache_lock. The significant change is the
way d_lookup traverses the hash chain: it doesn't acquire the
dcache_lock for this and relies on RCU to ensure that the dentry has
not been *freed*.


Dcache locking details
======================

For many multi-user workloads, open() and stat() on files are very
frequently occurring operations. Both involve walking of path names to
find the dentry corresponding to the concerned file. In the 2.4
kernel, dcache_lock was held during look-up of each path component.
Contention and cache-line bouncing of this global lock caused
significant scalability problems. With the introduction of RCU into
the Linux kernel, this was worked around by making the look-up of path
components during path walking lock-free.


Safe lock-free look-up of dcache hash table
===========================================

Dcache is a complex data structure with the hash table entries also
linked together in other lists. In the 2.4 kernel, dcache_lock
protected all the lists. We applied RCU only to hash chain walking;
the rest of the lists are still protected by dcache_lock. Some of the
important changes are:

1. The deletion from a hash chain is done using the hlist_del_rcu()
   macro, which doesn't initialize the next pointer of the deleted
   dentry; this allows us to walk safely lock-free while a deletion is
   happening.

2. Insertion of a dentry into the hash table is done using
   hlist_add_head_rcu(), which takes care of ordering the writes - the
   writes to the dentry must be visible before the dentry is
   inserted. This works in conjunction with hlist_for_each_rcu() while
   walking the hash chain. The only requirement is that all
   initialization of the dentry must be done before
   hlist_add_head_rcu(), since we don't have dcache_lock protection
   while traversing the hash chain. This isn't different from the
   existing code.

3. A dentry looked up without holding dcache_lock cannot be returned
   for walking if it is unhashed. It then may have a NULL d_inode or
   other bogosity, since RCU doesn't protect the other fields in the
   dentry. We therefore use a flag, DCACHE_UNHASHED, to indicate
   unhashed dentries and use this in conjunction with a per-dentry
   lock (d_lock). Once looked up without the dcache_lock, we acquire
   the per-dentry lock (d_lock) and check if the dentry is unhashed.
   If so, the look-up fails. If not, the reference count of the
   dentry is increased and the dentry is returned.

4. Once a dentry is looked up, it must be ensured during the path walk
   for that component that it doesn't go away. In pre-2.5.10 code,
   this was done by holding a reference to the dentry. dcache_rcu does
   the same. In some sense, dcache_rcu path walking looks like the
   pre-2.5.10 version.

5. All dentry hash chain updates must take the dcache_lock as well as
   the per-dentry lock, in that order. dput() does this to ensure that
   a dentry that has just been looked up on another CPU doesn't get
   deleted before dget() can be done on it.

6. There are several ways to do reference counting of RCU protected
   objects. One such example is the ipv4 route cache, where deferred
   freeing (using call_rcu()) is done as soon as the reference count
   goes to zero. This cannot be done in the case of dentries, because
   tearing down a dentry requires blocking (dentry_iput()), which
   isn't supported from RCU callbacks. Instead, tearing down of
   dentries happens synchronously in dput(), but the actual freeing
   happens later, when the RCU grace period is over. This allows safe
   lock-free walking of the hash chains, but a matched dentry may have
   been partially torn down. The checking of the DCACHE_UNHASHED flag
   with d_lock held detects such dentries and prevents them from
   being returned from look-up.


Maintaining POSIX rename semantics
==================================

Since look-up of dentries is lock-free, it can race against a
concurrent rename operation. For example, during rename of file A to
B, look-up of either A or B must succeed. So, if look-up of B happens
after A has been removed from the hash chain but not yet added to the
new hash chain, it may fail. Also, a comparison while the name is
being written concurrently by a rename may result in false positive
matches, violating rename semantics. Issues related to races with
rename are handled as described below:

1. Look-up can be done in two ways - d_lookup(), which is safe from
   simultaneous renames, and __d_lookup(), which is not. If
   __d_lookup() fails, it must be followed up by a d_lookup() to
   correctly determine whether a dentry is in the hash table or
   not. d_lookup() protects look-ups using a sequence lock
   (rename_lock).

2. The name associated with a dentry (d_name) may be changed if a
   rename is allowed to happen simultaneously. To prevent the memcmp()
   in __d_lookup() from going out of bounds due to a rename, and to
   avoid a false positive comparison, the name comparison is done
   while holding the per-dentry lock. This prevents concurrent renames
   during this operation.

3. Hash table walking during look-up may move to a different bucket
   when the current dentry is moved to a different bucket due to a
   rename. But we use hlists in the dcache hash table, and they are
   null-terminated, so even if a dentry moves to a different bucket,
   the hash chain walk will terminate. (With a list_head list it might
   not, since termination occurs when the list_head in the original
   bucket is reached.) Since we redo the d_parent check and compare
   the name while holding d_lock, lock-free look-up will not race
   against d_move().

4. There can be a theoretical race when a dentry keeps coming back to
   the original bucket due to double moves. Because of this, the
   look-up may conclude that the dentry has never moved and can end up
   in an infinite loop. But this is not any worse than the theoretical
   livelocks we already have in the kernel.


Important guidelines for filesystem developers related to dcache_rcu
====================================================================

1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
   don't change. Only the dcache internal implementation changes.
   However, filesystems *must not* delete from the dentry hash chains
   directly using the list macros, as was allowed earlier. They must
   use the dcache APIs d_drop() or __d_drop(), depending on the
   situation.

2. d_flags is now protected by a per-dentry lock (d_lock). All access
   to d_flags must be protected by it.

3. For a hashed dentry, checking of d_count needs to be protected by
   d_lock.


Papers and other documentation on dcache locking
================================================

1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).

2. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README
index 54366ecc241f..aabfba24bc2e 100644
--- a/Documentation/filesystems/devfs/README
+++ b/Documentation/filesystems/devfs/README
@@ -1812,11 +1812,6 @@ it may overflow the messages buffer, but try to get as much of it as
 you can
 
 
-if you get an Oops, run ksymoops to decode it so that the
-names of the offending functions are provided. A non-decoded Oops is
-pretty useless
-
-
 send a copy of your devfsd configuration file(s)
 
 send the bug report to me first.
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index d16334ec48ba..a8edb376b041 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -17,8 +17,6 @@ set using tune2fs(8). Kernel-determined defaults are indicated by (*).
 bsddf			(*)	Makes `df' act like BSD.
 minixdf			Makes `df' act like Minix.
 
-check				Check block and inode bitmaps at mount time
-				(requires CONFIG_EXT2_CHECK).
 check=none, nocheck	(*)	Don't do extra checking of bitmaps on mount
 				(check=normal and check=strict options removed)
 
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
new file mode 100644
index 000000000000..b3404a032596
--- /dev/null
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -0,0 +1,195 @@
ramfs, rootfs and initramfs
October 17, 2005
Rob Landley <rob@landley.net>
=============================

What is ramfs?
--------------

Ramfs is a very simple filesystem that exports Linux's disk caching
mechanisms (the page cache and dentry cache) as a dynamically
resizable ram-based filesystem.

Normally all files are cached in memory by Linux. Pages of data read
from backing store (usually the block device the filesystem is mounted
on) are kept around in case they're needed again, but marked as clean
(freeable) in case the Virtual Memory system needs the memory for
something else. Similarly, data written to files is marked clean as
soon as it has been written to backing store, but kept around for
caching purposes until the VM reallocates the memory. A similar
mechanism (the dentry cache) greatly speeds up access to directories.

With ramfs, there is no backing store. Files written into ramfs
allocate dentries and page cache as usual, but there's nowhere to
write them to. This means the pages are never marked clean, so they
can't be freed by the VM when it's looking to recycle memory.

The amount of code required to implement ramfs is tiny, because all
the work is done by the existing Linux caching infrastructure.
Basically, you're mounting the disk cache as a filesystem. Because of
this, ramfs is not an optional component removable via menuconfig,
since there would be negligible space savings.

ramfs and ramdisk:
------------------

The older "ram disk" mechanism created a synthetic block device out of
an area of ram and used it as backing store for a filesystem. This
block device was of fixed size, so the filesystem mounted on it was of
fixed size. Using a ram disk also required unnecessarily copying
memory from the fake block device into the page cache (and copying
changes back out), as well as creating and destroying dentries. Plus
it needed a filesystem driver (such as ext2) to format and interpret
this data.

Compared to ramfs, this wastes memory (and memory bus bandwidth),
creates unnecessary work for the CPU, and pollutes the CPU caches.
(There are tricks to avoid this copying by playing with the page
tables, but they're unpleasantly complicated and turn out to be about
as expensive as the copying anyway.) More to the point, all the work
ramfs is doing has to happen _anyway_, since all file access goes
through the page and dentry caches. The ram disk is simply
unnecessary; ramfs is internally much simpler.

Another reason ramdisks are semi-obsolete is that the introduction of
loopback devices offered a more flexible and convenient way to create
synthetic block devices, now from files instead of from chunks of
memory. See losetup(8) for details.

ramfs and tmpfs:
----------------

One downside of ramfs is you can keep writing data into it until you
fill up all memory, and the VM can't free it, because the VM thinks
that files should get written to backing store (rather than swap
space), but ramfs hasn't got any backing store. Because of this, only
root (or a trusted user) should be allowed write access to a ramfs
mount.

A ramfs derivative called tmpfs was created to add size limits, and
the ability to write the data to swap space. Normal users can be
allowed write access to tmpfs mounts. See
Documentation/filesystems/tmpfs.txt for more information.

What is rootfs?
---------------

Rootfs is a special instance of ramfs, which is always present in 2.6
systems. (It's used internally as the starting and stopping point for
searches of the kernel's doubly-linked list of mount points.)

Most systems just mount another filesystem over it and ignore it. The
amount of space an empty instance of ramfs takes up is tiny.

What is initramfs?
------------------

All 2.6 Linux kernels contain a gzipped "cpio" format archive, which
is extracted into rootfs when the kernel boots up. After extracting,
the kernel checks to see if rootfs contains a file "init", and if so
executes it as PID 1. This init process is then responsible for
bringing the system the rest of the way up, including locating and
mounting the real root device (if any). If rootfs does not contain an
init program after the embedded cpio archive is extracted into it, the
kernel will fall through to the older code to locate and mount a root
partition, then exec some variant of /sbin/init out of that.

All this differs from the old initrd in several ways:

 - The old initrd was a separate file, while the initramfs archive is
   linked into the linux kernel image. (The directory linux-*/usr is
   devoted to generating this archive during the build.)

 - The old initrd file was a gzipped filesystem image (in some file
   format, such as ext2, that had to be built into the kernel), while
   the new initramfs archive is a gzipped cpio archive (like tar only
   simpler, see cpio(1) and
   Documentation/early-userspace/buffer-format.txt).

 - The program run by the old initrd (which was called /initrd, not
   /init) did some setup and then returned to the kernel, while the
   init program from initramfs is not expected to return to the
   kernel. (If /init needs to hand off control it can overmount / with
   a new root device and exec another init program. See the
   switch_root utility, below.)

 - When switching to another root device, initrd would pivot_root and
   then umount the ramdisk. But initramfs is rootfs: you can neither
   pivot_root rootfs, nor unmount it. Instead delete everything out of
   rootfs to free up the space (find -xdev / -exec rm '{}' ';'),
   overmount rootfs with the new root (cd /newmount; mount --move . /;
   chroot .), attach stdin/stdout/stderr to the new /dev/console, and
   exec the new init.

   Since this is a remarkably persnickety process (and involves
   deleting commands before you can run them), the klibc package
   introduced a helper program (utils/run_init.c) to do all this for
   you. Most other packages (such as busybox) have named this command
   "switch_root".

Populating initramfs:
---------------------

The 2.6 kernel build process always creates a gzipped cpio format
initramfs archive and links it into the resulting kernel binary. By
default, this archive is empty (consuming 134 bytes on x86). The
config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
devices->block devices in menuconfig, and living in usr/Kconfig) can
be used to specify a source for the initramfs archive, which will
automatically be incorporated into the resulting binary. This option
can point to an existing gzipped cpio archive, a directory containing
files to be archived, or a text file specification such as the
following example:

  dir /dev 755 0 0
  nod /dev/console 644 0 0 c 5 1
  nod /dev/loop0 644 0 0 b 7 0
  dir /bin 755 1000 1000
  slink /bin/sh busybox 777 0 0
  file /bin/busybox initramfs/busybox 755 0 0
  dir /proc 755 0 0
  dir /sys 755 0 0
  dir /mnt 755 0 0
  file /init initramfs/init.sh 755 0 0

One advantage of the text file is that root access is not required to
set permissions or create device nodes in the new archive. (Note that
those two example "file" entries expect to find files named "init.sh"
and "busybox" in a directory called "initramfs", under the linux-2.6.*
directory. See Documentation/early-userspace/README for more
details.)

If you don't already understand what shared libraries, devices, and
paths you need to get a minimal root filesystem up and running, here
are some references:
http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
http://www.linuxfromscratch.org/lfs/view/stable/

The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
code against, along with some related utilities. It is BSD licensed.

I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
myself. These are LGPL and GPL, respectively.

In theory you could use glibc, but that's not well suited for small
embedded uses like this. (A "hello world" program statically linked
against glibc is over 400k. With uClibc it's 7k. Also note that glibc
dlopens libnss to do name lookups, even when otherwise statically
linked.)

Future directions:
------------------

Today (2.6.14), initramfs is always compiled in, but not always used.
The kernel falls back to legacy boot code that is reached only if
initramfs does not contain an /init program. The fallback is legacy
code, there to ensure a smooth transition and to allow early boot
functionality to gradually move to "early userspace" (i.e. initramfs).

The move to early userspace is necessary because finding and mounting
the real root device is complex. Root partitions can span multiple
devices (raid or separate journal). They can be out on the network
(requiring dhcp, setting a specific mac address, logging into a
server, etc). They can live on removable media, with dynamically
allocated major/minor numbers and persistent naming issues requiring a
full udev implementation to sort out. They can be compressed,
encrypted, copy-on-write, loopback mounted, strangely partitioned, and
so on.

This kind of complexity (which inevitably includes policy) is rightly
handled in userspace. Both klibc and busybox/uClibc are working on
simple initramfs packages to drop into a kernel build, and when
standard solutions are ready and widely deployed, the kernel's legacy
early boot code will become obsolete and a candidate for the feature
removal schedule.

But that's a while off yet.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f042c12e0ed2..ee4c0a8b8db7 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -3,7 +3,7 @@
 
 Original author: Richard Gooch <rgooch@atnf.csiro.au>
 
-Last updated on August 25, 2005
+Last updated on October 28, 2005
 
 Copyright (C) 1999 Richard Gooch
 Copyright (C) 2005 Pekka Enberg
@@ -11,62 +11,61 @@
 This file is released under the GPLv2.
 
 
-What is it?
-===========
+Introduction
+============
 
-The Virtual File System (otherwise known as the Virtual Filesystem
-Switch) is the software layer in the kernel that provides the
-filesystem interface to userspace programs. It also provides an
-abstraction within the kernel which allows different filesystem
-implementations to coexist.
+The Virtual File System (also known as the Virtual Filesystem Switch)
+is the software layer in the kernel that provides the filesystem
+interface to userspace programs. It also provides an abstraction
+within the kernel which allows different filesystem implementations to
+coexist.
 
+VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
+on are called from a process context. Filesystem locking is described
+in the document Documentation/filesystems/Locking.
 
-A Quick Look At How It Works
-============================
-
-In this section I'll briefly describe how things work, before
-launching into the details. I'll start with describing what happens
-when user programs open and manipulate files, and then look from the
-other view which is how a filesystem is supported and subsequently
-mounted.
-
-
-Opening a File
---------------
-
-The VFS implements the open(2), stat(2), chmod(2) and similar system
-calls. The pathname argument is used by the VFS to search through the
-directory entry cache (dentry cache or "dcache"). This provides a very
-fast look-up mechanism to translate a pathname (filename) into a
-specific dentry.
-
-An individual dentry usually has a pointer to an inode. Inodes are the
-things that live on disc drives, and can be regular files (you know:
-those things that you write data into), directories, FIFOs and other
-beasts. Dentries live in RAM and are never saved to disc: they exist
-only for performance. Inodes live on disc and are copied into memory
-when required. Later any changes are written back to disc. The inode
-that lives in RAM is a VFS inode, and it is this which the dentry
-points to. A single inode can be pointed to by multiple dentries
-(think about hardlinks).
-
-The dcache is meant to be a view into your entire filespace. Unlike
-Linus, most of us losers can't fit enough dentries into RAM to cover
-all of our filespace, so the dcache has bits missing. In order to
-resolve your pathname into a dentry, the VFS may have to resort to
-creating dentries along the way, and then loading the inode. This is
-done by looking up the inode.
-
-To look up an inode (usually read from disc) requires that the VFS
-calls the lookup() method of the parent directory inode. This method
-is installed by the specific filesystem implementation that the inode
-lives in. There will be more on this later.
-
-Once the VFS has the required dentry (and hence the inode), we can do
-all those boring things like open(2) the file, or stat(2) it to peek
-at the inode data. The stat(2) operation is fairly simple: once the
-VFS has the dentry, it peeks at the inode data and passes some of it
-back to userspace.
+
+Directory Entry Cache (dcache)
+------------------------------
+
+The VFS implements the open(2), stat(2), chmod(2), and similar system
+calls. The pathname argument that is passed to them is used by the VFS
+to search through the directory entry cache (also known as the dentry
+cache or dcache). This provides a very fast look-up mechanism to
+translate a pathname (filename) into a specific dentry. Dentries live
+in RAM and are never saved to disc: they exist only for performance.
+
+The dentry cache is meant to be a view into your entire filespace. As
+most computers cannot fit all dentries in the RAM at the same time,
+some bits of the cache are missing. In order to resolve your pathname
+into a dentry, the VFS may have to resort to creating dentries along
+the way, and then loading the inode. This is done by looking up the
+inode.
+
+
+The Inode Object
+----------------
+
+An individual dentry usually has a pointer to an inode. Inodes are
+filesystem objects such as regular files, directories, FIFOs and other
+beasts. They live either on the disc (for block device filesystems)
+or in the memory (for pseudo filesystems). Inodes that live on the
+disc are copied into the memory when required and changes to the inode
+are written back to disc. A single inode can be pointed to by multiple
+dentries (hard links, for example, do this).
+
+To look up an inode requires that the VFS calls the lookup() method of
+the parent directory inode. This method is installed by the specific
+filesystem implementation that the inode lives in. Once the VFS has
+the required dentry (and hence the inode), we can do all those boring
+things like open(2) the file, or stat(2) it to peek at the inode
+data. The stat(2) operation is fairly simple: once the VFS has the
+dentry, it peeks at the inode data and passes some of it back to
+userspace.
+
+
+The File Object
+---------------
 
 Opening a file requires another operation: allocation of a file
 structure (this is the kernel-side implementation of file
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with
 a pointer to the dentry and a set of file operation member functions.
 These are taken from the inode data. The open() file method is then
 called so the specific filesystem implementation can do it's work. You
-can see that this is another switch performed by the VFS.
-
-The file structure is placed into the file descriptor table for the
-process.
+can see that this is another switch performed by the VFS. The file
+structure is placed into the file descriptor table for the process.
 
 Reading, writing and closing files (and other assorted VFS operations)
 is done by using the userspace file descriptor to grab the appropriate
-file structure, and then calling the required file structure method
-function to do whatever is required.
-
-For as long as the file is open, it keeps the dentry "open" (in use),
-which in turn means that the VFS inode is still in use.
-
-All VFS system calls (i.e. open(2), stat(2), read(2), write(2),
-chmod(2) and so on) are called from a process context. You should
-assume that these calls are made without any kernel locks being
-held. This means that the processes may be executing the same piece of
-filesystem or driver code at the same time, on different
-processors. You should ensure that access to shared resources is
-protected by appropriate locks.
+file structure, and then calling the required file structure method to
+do whatever is required. For as long as the file is open, it keeps the
+dentry in use, which in turn means that the VFS inode is still in use.
 
 
 Registering and Mounting a Filesystem
--------------------------------------
+=====================================
 
-If you want to support a new kind of filesystem in the kernel, all you
-need to do is call register_filesystem(). You pass a structure
-describing the filesystem implementation (struct file_system_type)
-which is then added to an internal table of supported filesystems. You
-can do:
+To register and unregister a filesystem, use the following API
+functions:
 
-% cat /proc/filesystems
+	#include <linux/fs.h>
 
-to see what filesystems are currently available on your system.
+	extern int register_filesystem(struct file_system_type *);
+	extern int unregister_filesystem(struct file_system_type *);
 
-When a request is made to mount a block device onto a directory in
-your filespace the VFS will call the appropriate method for the
-specific filesystem. The dentry for the mount point will then be
-updated to point to the root inode for the new filesystem.
+The passed struct file_system_type describes your filesystem. When a
+request is made to mount a device onto a directory in your filespace,
+the VFS will call the appropriate get_sb() method for the specific
+filesystem. The dentry for the mount point will then be updated to
+point to the root inode for the new filesystem.
 
-It's now time to look at things in more detail.
+You can see all filesystems that are registered to the kernel in the
+file /proc/filesystems.
 
 
 struct file_system_type
-=======================
+-----------------------
 
 This describes the filesystem. As of kernel 2.6.13, the following
 members are defined:
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments:
 	int silent: whether or not to be silent on error
 
 
+The Superblock Object
+=====================
+
+A superblock object represents a mounted filesystem.
+
+
 struct super_operations
-=======================
+-----------------------
 
 This describes how the VFS can manipulate the superblock of your
 filesystem. As of kernel 2.6.13, the following members are defined:
@@ -286,9 +279,9 @@ or bottom half).
 	a superblock. The second parameter indicates whether the method
 	should wait until the write out has been completed. Optional.
 
-  write_super_lockfs: called when VFS is locking a filesystem and forcing
-	it into a consistent state. This function is currently used by the
-	Logical Volume Manager (LVM).
+  write_super_lockfs: called when VFS is locking a filesystem and
+	forcing it into a consistent state. This method is currently
+	used by the Logical Volume Manager (LVM).
 
   unlockfs: called when VFS is unlocking a filesystem and making it writable
 	again.
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which
describes the methods that can be performed on individual inodes.


The Inode Object
================

An inode object represents an object within the filesystem.


struct inode_operations
-----------------------

This describes how the VFS can manipulate an inode in your
filesystem. As of kernel 2.6.13, the following members are defined:
@@ -394,51 +393,62 @@ otherwise noted.
        will probably need to call d_instantiate() just as you would
        in the create() method

  rename: called by the rename(2) system call to rename the object to
        have the parent and name given by the second inode and dentry.

  readlink: called by the readlink(2) system call. Only required if
        you want to support reading symbolic links

  follow_link: called by the VFS to follow a symbolic link to the
        inode it points to. Only required if you want to support
        symbolic links. This method returns a void pointer cookie
        that is passed to put_link().

  put_link: called by the VFS to release resources allocated by
        follow_link(). The cookie returned by follow_link() is passed
        to this method as the last parameter. It is used by
        filesystems such as NFS where the page cache is not stable
        (i.e. the page that was installed when the symbolic link walk
        started might not be in the page cache at the end of the
        walk).

  truncate: called by the VFS to change the size of a file. The
        i_size field of the inode is set to the desired size by the
        VFS before this method is called. This method is called by
        the truncate(2) system call and related functionality.

  permission: called by the VFS to check for access rights on a
        POSIX-like filesystem.

  setattr: called by the VFS to set attributes for a file. This method
        is called by chmod(2) and related system calls.

  getattr: called by the VFS to get attributes of a file. This method
        is called by stat(2) and related system calls.

  setxattr: called by the VFS to set an extended attribute for a file.
        An extended attribute is a name:value pair associated with an
        inode. This method is called by the setxattr(2) system call.

  getxattr: called by the VFS to retrieve the value of an extended
        attribute name. This method is called by the getxattr(2)
        system call.

  listxattr: called by the VFS to list all extended attributes for a
        given file. This method is called by the listxattr(2) system
        call.

  removexattr: called by the VFS to remove an extended attribute from
        a file. This method is called by the removexattr(2) system
        call.


The Address Space Object
========================

The address space object is used to identify pages in the page cache.


struct address_space_operations
-------------------------------

This describes how the VFS can manipulate the mapping of a file to the
page cache in your filesystem. As of kernel 2.6.13, the following
members are defined:
@@ -502,8 +512,14 @@ struct address_space_operations {
        it. An example implementation can be found in fs/ext2/xip.c.


The File Object
===============

A file object represents a file opened by a process.


struct file_operations
----------------------

This describes how the VFS can manipulate an open file. As of kernel
2.6.13, the following members are defined:
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a
directory.


Directory Entry Cache API
-------------------------

There are a number of functions defined which permit a filesystem to
@@ -705,178 +721,24 @@ manipulate dentries:
        and the dentry is returned. The caller must use dput()
        to free the dentry when it finishes using it.

For further information on dentry locking, please refer to the document
Documentation/filesystems/dentry-locking.txt.


Resources
=========

(Note that some of these resources are not up-to-date with the latest
kernel version.)

Creating Linux virtual filesystems. 2002
    <http://lwn.net/Articles/13325/>

The Linux Virtual File-system Layer by Neil Brown. 1999
    <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>

A tour of the Linux VFS by Michael K. Johnson. 1996
    <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>

A small trail through the Linux kernel by Andries Brouwer. 2001
    <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>


RCU-based dcache locking model
==============================

On many workloads, the most common operation on dcache is to look up a
dentry, given a parent dentry and the name of the child. Typically, for
every open(), stat() etc., the dentry corresponding to the pathname
will be looked up by walking the tree, starting with the first
component of the pathname and using that dentry along with the next
component to look up the next level, and so on. Since this is a
frequent operation for workloads like multiuser environments and web
servers, it is important to optimize this path.

Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in every
component during path look-up. From 2.5.10 onwards, the fast-walk
algorithm changed this by holding the dcache_lock at the beginning and
walking as many cached path component dentries as possible. This
significantly decreased the number of dcache_lock acquisitions.
However, it also increased the lock hold time significantly and hurt
performance on large SMP machines. Since the 2.5.62 kernel, dcache has
been using a new locking model that uses RCU to make dcache look-up
lock-free.

The current dcache locking model is not very different from the
previous one. Prior to the 2.5.62 kernel, dcache_lock protected the
hash chain, the d_child, d_alias and d_lru lists, as well as d_inode
and several other things like mount look-up. The RCU-based changes
affect only the way the hash chain is protected; for everything else
the dcache_lock must still be taken, for both traversing and updating,
and hash chain updates also take the dcache_lock. The significant
change is the way d_lookup traverses the hash chain: it does not
acquire the dcache_lock for this and relies on RCU to ensure that the
dentry has not been *freed*.


Dcache locking details
----------------------

For many multi-user workloads, open() and stat() on files are very
frequently occurring operations. Both involve walking path names to
find the dentry corresponding to the file concerned. In the 2.4
kernel, dcache_lock was held during look-up of each path component.
Contention and cache-line bouncing of this global lock caused
significant scalability problems. With the introduction of RCU in the
Linux kernel, this was worked around by making the look-up of path
components during path walking lock-free.


Safe lock-free look-up of dcache hash table
===========================================

Dcache is a complex data structure with the hash table entries also
linked together in other lists. In the 2.4 kernel, dcache_lock
protected all the lists. We applied RCU only to hash chain walking;
the rest of the lists are still protected by dcache_lock. Some of the
important changes are:
1. The deletion from a hash chain is done using the hlist_del_rcu()
   macro, which doesn't initialize the next pointer of the deleted
   dentry; this allows us to walk safely lock-free while a deletion is
   happening.

2. Insertion of a dentry into the hash table is done using
   hlist_add_head_rcu(), which takes care of ordering the writes - the
   writes to the dentry must be visible before the dentry is inserted.
   This works in conjunction with hlist_for_each_rcu() while walking
   the hash chain. The only requirement is that all initialization of
   the dentry must be done before hlist_add_head_rcu(), since we don't
   have dcache_lock protection while traversing the hash chain. This
   isn't different from the existing code.

3. A dentry looked up without holding dcache_lock cannot be returned
   for walking if it is unhashed. It then may have a NULL d_inode or
   other bogosity, since RCU doesn't protect the other fields in the
   dentry. We therefore use the flag DCACHE_UNHASHED to indicate
   unhashed dentries and use this in conjunction with a per-dentry
   lock (d_lock). Once looked up without the dcache_lock, we acquire
   the per-dentry lock (d_lock) and check if the dentry is unhashed.
   If so, the look-up fails. If not, the reference count of the dentry
   is increased and the dentry is returned.

4. Once a dentry is looked up, it must be ensured that it doesn't go
   away during the path walk for that component. In pre-2.5.10 code,
   this was done by holding a reference to the dentry. dcache_rcu does
   the same. In some sense, dcache_rcu path walking looks like the
   pre-2.5.10 version.

5. All dentry hash chain updates must take the dcache_lock as well as
   the per-dentry lock, in that order. dput() does this to ensure that
   a dentry that has just been looked up on another CPU doesn't get
   deleted before dget() can be done on it.

6. There are several ways to do reference counting of RCU protected
   objects. One such example is in the ipv4 route cache, where
   deferred freeing (using call_rcu()) is done as soon as the
   reference count goes to zero. This cannot be done in the case of
   dentries, because tearing down a dentry requires blocking
   (dentry_iput()), which isn't supported from RCU callbacks. Instead,
   tearing down of dentries happens synchronously in dput(), but the
   actual freeing happens later, when the RCU grace period is over.
   This allows safe lock-free walking of the hash chains, but a
   matched dentry may have been partially torn down. The checking of
   the DCACHE_UNHASHED flag with d_lock held detects such dentries and
   prevents them from being returned from look-up.

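The check described in steps 3 and 6 - take the per-dentry lock, reject
a dentry that was unhashed while we were walking, otherwise pin it -
can be sketched in userspace C. This is a simplified model only: a
pthread mutex stands in for the kernel's d_lock spinlock, and the
struct and flag value here are illustrative, not the kernel's actual
definitions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DCACHE_UNHASHED 0x0010	/* illustrative flag value */

struct dentry {
	unsigned int d_flags;	/* protected by d_lock */
	int d_count;		/* reference count */
	pthread_mutex_t d_lock;	/* stand-in for the per-dentry spinlock */
};

/*
 * After a lock-free hash-chain walk finds a candidate dentry:
 * take d_lock, fail the look-up if the dentry was unhashed in the
 * meantime (it may be partially torn down), otherwise take a
 * reference and return success.
 */
static bool try_take_ref(struct dentry *d)
{
	bool ok = false;

	pthread_mutex_lock(&d->d_lock);
	if (!(d->d_flags & DCACHE_UNHASHED)) {
		d->d_count++;	/* dentry is still live: pin it */
		ok = true;
	}
	pthread_mutex_unlock(&d->d_lock);
	return ok;
}
```

A hashed dentry is pinned and returned; once DCACHE_UNHASHED is set the
same look-up fails, mirroring how dput()'s synchronous teardown is kept
safe from concurrent lock-free walkers.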
Maintaining POSIX rename semantics
==================================

Since look-up of dentries is lock-free, it can race against a
concurrent rename operation. For example, during a rename of file A to
B, a look-up of either A or B must succeed. So, if a look-up of B
happens after A has been removed from the hash chain but not yet added
to the new hash chain, it may fail. Also, a comparison while the name
is being written concurrently by a rename may result in false positive
matches, violating rename semantics. Issues related to races with
rename are handled as described below:

1. Look-up can be done in two ways - d_lookup(), which is safe from
   simultaneous renames, and __d_lookup(), which is not. If
   __d_lookup() fails, it must be followed up by a d_lookup() to
   correctly determine whether a dentry is in the hash table or not.
   d_lookup() protects look-ups using a sequence lock (rename_lock).

2. The name associated with a dentry (d_name) may be changed if a
   rename is allowed to happen simultaneously. To avoid the memcmp()
   in __d_lookup() going out of bounds due to a rename, and false
   positive comparisons, the name comparison is done while holding the
   per-dentry lock. This prevents concurrent renames during this
   operation.

3. Hash table walking during look-up may move to a different bucket
   when the current dentry is moved to a different bucket due to a
   rename. But we use hlists in the dcache hash table, and they are
   null-terminated. So, even if a dentry moves to a different bucket,
   the hash chain walk will terminate. [With a list_head list, it may
   not, since termination is when the list_head in the original bucket
   is reached.] Since we redo the d_parent check and compare the name
   while holding d_lock, lock-free look-up will not race against
   d_move().

4. There can be a theoretical race when a dentry keeps coming back to
   the original bucket due to double moves. Due to this, a look-up may
   consider that it has never moved and can end up in an infinite
   loop. But this is no worse than the theoretical livelocks we
   already have in the kernel.

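The rename_lock mentioned in item 1 follows the classic sequence-lock
pattern: readers sample a counter, do the look-up, and retry if the
counter changed (a rename ran in between). The sketch below is a
simplified, single-threaded userspace model of that pattern - it omits
the memory barriers and spinlock the kernel's real seqlock needs, and
the function names merely echo the kernel's.

```c
#include <assert.h>

/* Minimal seqcount model: the counter is odd while a writer is active. */
struct seqcount {
	volatile unsigned int sequence;
};

/* Reader entry: spin until no writer is active, return the snapshot. */
static unsigned int read_seqbegin(const struct seqcount *s)
{
	unsigned int seq;

	do {
		seq = s->sequence;
	} while (seq & 1);	/* odd: a rename is in progress */
	return seq;
}

/* Reader exit: nonzero means a writer ran and the look-up must retry. */
static int read_seqretry(const struct seqcount *s, unsigned int start)
{
	return s->sequence != start;
}

/* Writer side: bump the counter to odd on entry, even on exit. */
static void write_seqbegin(struct seqcount *s) { s->sequence++; }
static void write_seqend(struct seqcount *s)   { s->sequence++; }
```

A d_lookup()-style caller would wrap its hash-chain walk in a
read_seqbegin()/read_seqretry() loop and redo the walk whenever a
concurrent rename bumped the sequence.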
Important guidelines for filesystem developers related to dcache_rcu
====================================================================

1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
   don't change. Only the dcache internal implementation changes.
   However, filesystems *must not* delete from the dentry hash chains
   directly using the list macros as was allowed earlier. They must
   use dcache APIs like d_drop() or __d_drop(), depending on the
   situation.

2. d_flags is now protected by a per-dentry lock (d_lock). All access
   to d_flags must be protected by it.

3. For a hashed dentry, checking of d_count needs to be protected by
   d_lock.


Papers and other documentation on dcache locking
================================================

1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).

2. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index c7d5d0c7067d..74aeb142ae5f 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -19,15 +19,43 @@ Mount Options

When mounting an XFS filesystem, the following options are accepted.

  allocsize=size
        Sets the buffered I/O end-of-file preallocation size when
        doing delayed allocation writeout (default size is 64KiB).
        Valid values for this option are page size (typically 4KiB)
        through to 1GiB, inclusive, in power-of-2 increments.

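The constraint on allocsize - a power of two between the page size and
1GiB - can be captured as a small validation helper. This is a
userspace sketch of the documented rule, assuming a 4KiB page size; it
is not the kernel's actual option parser.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * allocsize must be a power of two between the page size (assumed
 * 4KiB here) and 1GiB, per the mount option description above.
 */
static bool allocsize_valid(unsigned long size)
{
	const unsigned long min = 4096UL;		/* page size */
	const unsigned long max = 1024UL * 1024 * 1024;	/* 1GiB */

	if (size < min || size > max)
		return false;
	/* power-of-2 check: exactly one bit set */
	return (size & (size - 1)) == 0;
}
```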
  attr2/noattr2
        These options enable/disable (default is disabled, for
        backwards compatibility on-disk) an "opportunistic"
        improvement in the way inline extended attributes are stored
        on-disk. When the new form is used for the first time (by
        setting or removing extended attributes), the on-disk
        superblock feature bit field will be updated to reflect this
        format being in use.

  barrier
        Enables the use of block layer write barriers for writes into
        the journal and unwritten extent conversion. This allows for
        drive level write caching to be enabled, for devices that
        support write barriers.

  dmapi
        Enable the DMAPI (Data Management API) event callouts.
        Use with the "mtpt" option.

  grpid/bsdgroups and nogrpid/sysvgroups
        These options define what group ID a newly created file gets.
        When grpid is set, it takes the group ID of the directory in
        which it is created; otherwise (the default) it takes the fsgid
        of the current process, unless the directory has the setgid bit
        set, in which case it takes the gid from the parent directory,
        and also gets the setgid bit set if it is a directory itself.
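The group-ID selection described above reduces to a small decision
function. This is an illustrative userspace sketch of the documented
behaviour; the function and parameter names are made up for this
example and the setgid-bit propagation on new directories is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Pick the group ID for a newly created file:
 * - with grpid (bsdgroups), always inherit the directory's gid;
 * - without it (sysvgroups, the default), inherit only when the
 *   directory has the setgid bit set, else use the process fsgid.
 */
static gid_t new_file_gid(bool grpid, gid_t dir_gid, bool dir_setgid,
			  gid_t fsgid)
{
	if (grpid || dir_setgid)
		return dir_gid;	/* inherit from the parent directory */
	return fsgid;		/* default: creating process's fsgid */
}
```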

  ihashsize=value
        Sets the number of hash buckets available for hashing the
        in-memory inodes of the specified mount point. If a value
        of zero is used, the value selected by the default algorithm
        will be displayed in /proc/mounts.

  ikeep/noikeep
        When inode clusters are emptied of inodes, keep them around
@@ -35,12 +63,31 @@ When mounting an XFS filesystem, the following options are accepted.
        and is still the default for now. Using the noikeep option,
        inode clusters are returned to the free space pool.

  inode64
        Indicates that XFS is allowed to create inodes at any location
        in the filesystem, including those which will result in inode
        numbers occupying more than 32 bits of significance. This is
        provided for backwards compatibility, but causes problems for
        backup applications that cannot handle large inode numbers.

  largeio/nolargeio
        If "nolargeio" is specified, the optimal I/O size reported in
        st_blksize by stat(2) will be as small as possible, to allow
        user applications to avoid inefficient read/modify/write I/O.
        If "largeio" is specified, a filesystem that has a "swidth"
        specified will return the "swidth" value (in bytes) in
        st_blksize. If the filesystem does not have a "swidth"
        specified but does specify an "allocsize", then "allocsize"
        (in bytes) will be returned instead.
        If neither of these two options is specified, the filesystem
        will behave as if "nolargeio" was specified.

  logbufs=value
        Set the number of in-memory log buffers. Valid numbers range
        from 2-8 inclusive.
        The default value is 8 buffers for filesystems with a
        blocksize of 64KiB, 4 buffers for filesystems with a blocksize
        of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
        and 2 buffers for all other configurations. Increasing the
        number of buffers may increase performance on some workloads
        at the cost of the memory used for the additional log buffers
@@ -49,10 +96,10 @@ When mounting an XFS filesystem, the following options are accepted.
  logbsize=value
        Set the size of each in-memory log buffer.
        Size may be specified in bytes, or in kilobytes with a "k"
        suffix. Valid sizes for version 1 and version 2 logs are
        16384 (16k) and 32768 (32k). Valid sizes for version 2 logs
        also include 65536 (64k), 131072 (128k) and 262144 (256k).
        The default value for machines with more than 32MiB of memory
        is 32768; machines with less memory use 16384 by default.

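The default values quoted for logbufs and logbsize can be summarized as
two small lookup functions. These are sketches of the documented
defaults only - the function names are invented here, and the real
defaults are computed inside the XFS mount code.

```c
#include <assert.h>

/* Documented logbufs default, keyed on filesystem block size in bytes. */
static int default_logbufs(unsigned int blocksize)
{
	switch (blocksize) {
	case 64 * 1024: return 8;
	case 32 * 1024: return 4;
	case 16 * 1024: return 3;
	default:        return 2;	/* all other configurations */
	}
}

/* Documented logbsize default: 32768 with more than 32MiB of memory,
 * otherwise 16384. */
static unsigned int default_logbsize(unsigned long mem_bytes)
{
	return mem_bytes > 32UL * 1024 * 1024 ? 32768 : 16384;
}
```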
  logdev=device and rtdev=device
@@ -62,6 +109,11 @@ When mounting an XFS filesystem, the following options are accepted.
        optional, and the log section can be separate from the data
        section or contained within it.

  mtpt=mountpoint
        Use with the "dmapi" option. The value specified here will be
        included in the DMAPI mount event, and should be the path of
        the actual mountpoint that is used.

  noalign
        Data allocations will not be aligned at stripe unit boundaries.

@@ -91,13 +143,17 @@ When mounting an XFS filesystem, the following options are accepted.
        O_SYNC writes can be lost if the system crashes.
        If timestamp updates are critical, use the osyncisosync option.

  uquota/usrquota/uqnoenforce/quota
        User disk quota accounting enabled, and limits (optionally)
        enforced. Refer to xfs_quota(8) for further details.

  gquota/grpquota/gqnoenforce
        Group disk quota accounting enabled and limits (optionally)
        enforced. Refer to xfs_quota(8) for further details.

  pquota/prjquota/pqnoenforce
        Project disk quota accounting enabled and limits (optionally)
        enforced. Refer to xfs_quota(8) for further details.

  sunit=value and swidth=value
        Used to specify the stripe unit and width for a RAID device or
@@ -113,15 +169,21 @@ When mounting an XFS filesystem, the following options are accepted.
        The "swidth" option is required if the "sunit" option has been
        specified, and must be a multiple of the "sunit" value.

  swalloc
        Data allocations will be rounded up to stripe width boundaries
        when the current end of file is being extended and the file
        size is larger than the stripe width size.


sysctls
=======

The following sysctls are available for the XFS filesystem:

  fs.xfs.stats_clear            (Min: 0  Default: 0  Max: 1)
        Setting this to "1" clears accumulated XFS statistics
        in /proc/fs/xfs/stat. It then immediately resets to "0".

  fs.xfs.xfssyncd_centisecs     (Min: 100  Default: 3000  Max: 720000)
        The interval at which the xfssyncd thread flushes metadata
        out to disk. This thread will flush log activity out, and
@@ -143,9 +205,9 @@ The following sysctls are available for the XFS filesystem:
                XFS_ERRLEVEL_HIGH:      5

  fs.xfs.panic_mask             (Min: 0  Default: 0  Max: 127)
        Causes certain error conditions to call BUG(). Value is a
        bitmask; OR together the tags which represent errors which
        should cause panics:

                XFS_NO_PTAG                     0
                XFS_PTAG_IFLUSH                 0x00000001
                XFS_PTAG_LOGRES                 0x00000002
@@ -155,7 +217,7 @@ The following sysctls are available for the XFS filesystem:
                XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
                XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040

        This option is intended for debugging only.
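Because panic_mask is a bitmask, combining tags is a bitwise OR of the
values from the table above. A minimal sketch (the helper function name
is invented for illustration; the tag values are those listed in the
document):

```c
#include <assert.h>

/* Panic tag values from the table above. */
#define XFS_NO_PTAG		0
#define XFS_PTAG_IFLUSH		0x00000001
#define XFS_PTAG_LOGRES		0x00000002

/* Build a mask that panics on inode flush and log reservation
 * errors by OR-ing the corresponding tags together. */
static unsigned int panic_mask_example(void)
{
	return XFS_PTAG_IFLUSH | XFS_PTAG_LOGRES;
}
```

The resulting value (3 here) is what would be written to the
fs.xfs.panic_mask sysctl.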

  fs.xfs.irix_symlink_mode      (Min: 0  Default: 0  Max: 1)
        Controls whether symlinks are created with mode 0777 (default)
@@ -164,25 +226,37 @@ The following sysctls are available for the XFS filesystem:
  fs.xfs.irix_sgid_inherit      (Min: 0  Default: 0  Max: 1)
        Controls files created in SGID directories.
        If the group ID of the new file does not match the effective
        group ID or one of the supplementary group IDs of the parent
        dir, the ISGID bit is cleared if the irix_sgid_inherit
        compatibility sysctl is set.

  fs.xfs.restrict_chown         (Min: 0  Default: 1  Max: 1)
        Controls whether unprivileged users can use chown to "give
        away" a file to another user.

  fs.xfs.inherit_sync           (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "sync" flag set
        by the xfs_io(8) chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.inherit_nodump         (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "nodump" flag set
        by the xfs_io(8) chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.inherit_noatime        (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "noatime" flag set
        by the xfs_io(8) chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.inherit_nosymlinks     (Min: 0  Default: 1  Max: 1)
        Setting this to "1" will cause the "nosymlinks" flag set
        by the xfs_io(8) chattr command on a directory to be
        inherited by files in that directory.

  fs.xfs.rotorstep              (Min: 1  Default: 1  Max: 256)
        In "inode32" allocation mode, this option determines how many
        files the allocator attempts to allocate in the same allocation
        group before moving to the next allocation group. The intent
        is to control the rate at which the allocator moves between
        allocation groups when allocating extents for new files.