aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems/vfs.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/vfs.txt')
-rw-r--r--Documentation/filesystems/vfs.txt434
1 files changed, 148 insertions, 286 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f042c12e0ed2..ee4c0a8b8db7 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -3,7 +3,7 @@
3 3
4 Original author: Richard Gooch <rgooch@atnf.csiro.au> 4 Original author: Richard Gooch <rgooch@atnf.csiro.au>
5 5
6 Last updated on August 25, 2005 6 Last updated on October 28, 2005
7 7
8 Copyright (C) 1999 Richard Gooch 8 Copyright (C) 1999 Richard Gooch
9 Copyright (C) 2005 Pekka Enberg 9 Copyright (C) 2005 Pekka Enberg
@@ -11,62 +11,61 @@
11 This file is released under the GPLv2. 11 This file is released under the GPLv2.
12 12
13 13
14What is it? 14Introduction
15=========== 15============
16 16
17The Virtual File System (otherwise known as the Virtual Filesystem 17The Virtual File System (also known as the Virtual Filesystem Switch)
18Switch) is the software layer in the kernel that provides the 18is the software layer in the kernel that provides the filesystem
19filesystem interface to userspace programs. It also provides an 19interface to userspace programs. It also provides an abstraction
20abstraction within the kernel which allows different filesystem 20within the kernel which allows different filesystem implementations to
21implementations to coexist. 21coexist.
22 22
23VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
24on are called from a process context. Filesystem locking is described
25in the document Documentation/filesystems/Locking.
23 26
24A Quick Look At How It Works
25============================
26 27
27In this section I'll briefly describe how things work, before 28Directory Entry Cache (dcache)
28launching into the details. I'll start with describing what happens 29------------------------------
29when user programs open and manipulate files, and then look from the
30other view which is how a filesystem is supported and subsequently
31mounted.
32
33
34Opening a File
35--------------
36
37The VFS implements the open(2), stat(2), chmod(2) and similar system
38calls. The pathname argument is used by the VFS to search through the
39directory entry cache (dentry cache or "dcache"). This provides a very
40fast look-up mechanism to translate a pathname (filename) into a
41specific dentry.
42
43An individual dentry usually has a pointer to an inode. Inodes are the
44things that live on disc drives, and can be regular files (you know:
45those things that you write data into), directories, FIFOs and other
46beasts. Dentries live in RAM and are never saved to disc: they exist
47only for performance. Inodes live on disc and are copied into memory
48when required. Later any changes are written back to disc. The inode
49that lives in RAM is a VFS inode, and it is this which the dentry
50points to. A single inode can be pointed to by multiple dentries
51(think about hardlinks).
52
53The dcache is meant to be a view into your entire filespace. Unlike
54Linus, most of us losers can't fit enough dentries into RAM to cover
55all of our filespace, so the dcache has bits missing. In order to
56resolve your pathname into a dentry, the VFS may have to resort to
57creating dentries along the way, and then loading the inode. This is
58done by looking up the inode.
59
60To look up an inode (usually read from disc) requires that the VFS
61calls the lookup() method of the parent directory inode. This method
62is installed by the specific filesystem implementation that the inode
63lives in. There will be more on this later.
64 30
65Once the VFS has the required dentry (and hence the inode), we can do 31The VFS implements the open(2), stat(2), chmod(2), and similar system
66all those boring things like open(2) the file, or stat(2) it to peek 32calls. The pathname argument that is passed to them is used by the VFS
67at the inode data. The stat(2) operation is fairly simple: once the 33to search through the directory entry cache (also known as the dentry
68VFS has the dentry, it peeks at the inode data and passes some of it 34cache or dcache). This provides a very fast look-up mechanism to
69back to userspace. 35translate a pathname (filename) into a specific dentry. Dentries live
36in RAM and are never saved to disc: they exist only for performance.
37
38The dentry cache is meant to be a view into your entire filespace. As
39most computers cannot fit all dentries in the RAM at the same time,
40some bits of the cache are missing. In order to resolve your pathname
41into a dentry, the VFS may have to resort to creating dentries along
42the way, and then loading the inode. This is done by looking up the
43inode.
44
45
46The Inode Object
47----------------
48
49An individual dentry usually has a pointer to an inode. Inodes are
50filesystem objects such as regular files, directories, FIFOs and other
51beasts. They live either on the disc (for block device filesystems)
52or in the memory (for pseudo filesystems). Inodes that live on the
53disc are copied into the memory when required and changes to the inode
54are written back to disc. A single inode can be pointed to by multiple
55dentries (hard links, for example, do this).
56
57To look up an inode requires that the VFS calls the lookup() method of
58the parent directory inode. This method is installed by the specific
59filesystem implementation that the inode lives in. Once the VFS has
60the required dentry (and hence the inode), we can do all those boring
61things like open(2) the file, or stat(2) it to peek at the inode
62data. The stat(2) operation is fairly simple: once the VFS has the
63dentry, it peeks at the inode data and passes some of it back to
64userspace.
65
66
67The File Object
68---------------
70 69
71Opening a file requires another operation: allocation of a file 70Opening a file requires another operation: allocation of a file
72structure (this is the kernel-side implementation of file 71structure (this is the kernel-side implementation of file
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with
74a pointer to the dentry and a set of file operation member functions. 73a pointer to the dentry and a set of file operation member functions.
75These are taken from the inode data. The open() file method is then 74These are taken from the inode data. The open() file method is then
76called so the specific filesystem implementation can do it's work. You 75called so the specific filesystem implementation can do it's work. You
77can see that this is another switch performed by the VFS. 76can see that this is another switch performed by the VFS. The file
78 77structure is placed into the file descriptor table for the process.
79The file structure is placed into the file descriptor table for the
80process.
81 78
82Reading, writing and closing files (and other assorted VFS operations) 79Reading, writing and closing files (and other assorted VFS operations)
83is done by using the userspace file descriptor to grab the appropriate 80is done by using the userspace file descriptor to grab the appropriate
84file structure, and then calling the required file structure method 81file structure, and then calling the required file structure method to
85function to do whatever is required. 82do whatever is required. For as long as the file is open, it keeps the
86 83dentry in use, which in turn means that the VFS inode is still in use.
87For as long as the file is open, it keeps the dentry "open" (in use),
88which in turn means that the VFS inode is still in use.
89
90All VFS system calls (i.e. open(2), stat(2), read(2), write(2),
91chmod(2) and so on) are called from a process context. You should
92assume that these calls are made without any kernel locks being
93held. This means that the processes may be executing the same piece of
94filesystem or driver code at the same time, on different
95processors. You should ensure that access to shared resources is
96protected by appropriate locks.
97 84
98 85
99Registering and Mounting a Filesystem 86Registering and Mounting a Filesystem
100------------------------------------- 87=====================================
101 88
102If you want to support a new kind of filesystem in the kernel, all you 89To register and unregister a filesystem, use the following API
103need to do is call register_filesystem(). You pass a structure 90functions:
104describing the filesystem implementation (struct file_system_type)
105which is then added to an internal table of supported filesystems. You
106can do:
107 91
108% cat /proc/filesystems 92 #include <linux/fs.h>
109 93
110to see what filesystems are currently available on your system. 94 extern int register_filesystem(struct file_system_type *);
95 extern int unregister_filesystem(struct file_system_type *);
111 96
112When a request is made to mount a block device onto a directory in 97The passed struct file_system_type describes your filesystem. When a
113your filespace the VFS will call the appropriate method for the 98request is made to mount a device onto a directory in your filespace,
114specific filesystem. The dentry for the mount point will then be 99the VFS will call the appropriate get_sb() method for the specific
115updated to point to the root inode for the new filesystem. 100filesystem. The dentry for the mount point will then be updated to
101point to the root inode for the new filesystem.
116 102
117It's now time to look at things in more detail. 103You can see all filesystems that are registered to the kernel in the
104file /proc/filesystems.
118 105
119 106
120struct file_system_type 107struct file_system_type
121======================= 108-----------------------
122 109
123This describes the filesystem. As of kernel 2.6.13, the following 110This describes the filesystem. As of kernel 2.6.13, the following
124members are defined: 111members are defined:
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments:
197 int silent: whether or not to be silent on error 184 int silent: whether or not to be silent on error
198 185
199 186
187The Superblock Object
188=====================
189
190A superblock object represents a mounted filesystem.
191
192
200struct super_operations 193struct super_operations
201======================= 194-----------------------
202 195
203This describes how the VFS can manipulate the superblock of your 196This describes how the VFS can manipulate the superblock of your
204filesystem. As of kernel 2.6.13, the following members are defined: 197filesystem. As of kernel 2.6.13, the following members are defined:
@@ -286,9 +279,9 @@ or bottom half).
286 a superblock. The second parameter indicates whether the method 279 a superblock. The second parameter indicates whether the method
287 should wait until the write out has been completed. Optional. 280 should wait until the write out has been completed. Optional.
288 281
289 write_super_lockfs: called when VFS is locking a filesystem and forcing 282 write_super_lockfs: called when VFS is locking a filesystem and
290 it into a consistent state. This function is currently used by the 283 forcing it into a consistent state. This method is currently
291 Logical Volume Manager (LVM). 284 used by the Logical Volume Manager (LVM).
292 285
293 unlockfs: called when VFS is unlocking a filesystem and making it writable 286 unlockfs: called when VFS is unlocking a filesystem and making it writable
294 again. 287 again.
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which
317describes the methods that can be performed on individual inodes. 310describes the methods that can be performed on individual inodes.
318 311
319 312
313The Inode Object
314================
315
316An inode object represents an object within the filesystem.
317
318
320struct inode_operations 319struct inode_operations
321======================= 320-----------------------
322 321
323This describes how the VFS can manipulate an inode in your 322This describes how the VFS can manipulate an inode in your
324filesystem. As of kernel 2.6.13, the following members are defined: 323filesystem. As of kernel 2.6.13, the following members are defined:
@@ -394,51 +393,62 @@ otherwise noted.
394 will probably need to call d_instantiate() just as you would 393 will probably need to call d_instantiate() just as you would
395 in the create() method 394 in the create() method
396 395
396 rename: called by the rename(2) system call to rename the object to
397 have the parent and name given by the second inode and dentry.
398
397 readlink: called by the readlink(2) system call. Only required if 399 readlink: called by the readlink(2) system call. Only required if
398 you want to support reading symbolic links 400 you want to support reading symbolic links
399 401
400 follow_link: called by the VFS to follow a symbolic link to the 402 follow_link: called by the VFS to follow a symbolic link to the
401 inode it points to. Only required if you want to support 403 inode it points to. Only required if you want to support
402 symbolic links. This function returns a void pointer cookie 404 symbolic links. This method returns a void pointer cookie
403 that is passed to put_link(). 405 that is passed to put_link().
404 406
405 put_link: called by the VFS to release resources allocated by 407 put_link: called by the VFS to release resources allocated by
406 follow_link(). The cookie returned by follow_link() is passed to 408 follow_link(). The cookie returned by follow_link() is passed
407 to this function as the last parameter. It is used by filesystems 409 to to this method as the last parameter. It is used by
408 such as NFS where page cache is not stable (i.e. page that was 410 filesystems such as NFS where page cache is not stable
409 installed when the symbolic link walk started might not be in the 411 (i.e. page that was installed when the symbolic link walk
410 page cache at the end of the walk). 412 started might not be in the page cache at the end of the
411 413 walk).
412 truncate: called by the VFS to change the size of a file. The i_size 414
413 field of the inode is set to the desired size by the VFS before 415 truncate: called by the VFS to change the size of a file. The
414 this function is called. This function is called by the truncate(2) 416 i_size field of the inode is set to the desired size by the
415 system call and related functionality. 417 VFS before this method is called. This method is called by
418 the truncate(2) system call and related functionality.
416 419
417 permission: called by the VFS to check for access rights on a POSIX-like 420 permission: called by the VFS to check for access rights on a POSIX-like
418 filesystem. 421 filesystem.
419 422
420 setattr: called by the VFS to set attributes for a file. This function is 423 setattr: called by the VFS to set attributes for a file. This method
421 called by chmod(2) and related system calls. 424 is called by chmod(2) and related system calls.
422 425
423 getattr: called by the VFS to get attributes of a file. This function is 426 getattr: called by the VFS to get attributes of a file. This method
424 called by stat(2) and related system calls. 427 is called by stat(2) and related system calls.
425 428
426 setxattr: called by the VFS to set an extended attribute for a file. 429 setxattr: called by the VFS to set an extended attribute for a file.
427 Extended attribute is a name:value pair associated with an inode. This 430 Extended attribute is a name:value pair associated with an
428 function is called by setxattr(2) system call. 431 inode. This method is called by setxattr(2) system call.
432
433 getxattr: called by the VFS to retrieve the value of an extended
434 attribute name. This method is called by getxattr(2) function
435 call.
429 436
430 getxattr: called by the VFS to retrieve the value of an extended attribute 437 listxattr: called by the VFS to list all extended attributes for a
431 name. This function is called by getxattr(2) function call. 438 given file. This method is called by listxattr(2) system call.
432 439
433 listxattr: called by the VFS to list all extended attributes for a given 440 removexattr: called by the VFS to remove an extended attribute from
434 file. This function is called by listxattr(2) system call. 441 a file. This method is called by removexattr(2) system call.
435 442
436 removexattr: called by the VFS to remove an extended attribute from a file. 443
437 This function is called by removexattr(2) system call. 444The Address Space Object
445========================
446
447The address space object is used to identify pages in the page cache.
438 448
439 449
440struct address_space_operations 450struct address_space_operations
441=============================== 451-------------------------------
442 452
443This describes how the VFS can manipulate mapping of a file to page cache in 453This describes how the VFS can manipulate mapping of a file to page cache in
444your filesystem. As of kernel 2.6.13, the following members are defined: 454your filesystem. As of kernel 2.6.13, the following members are defined:
@@ -502,8 +512,14 @@ struct address_space_operations {
502 it. An example implementation can be found in fs/ext2/xip.c. 512 it. An example implementation can be found in fs/ext2/xip.c.
503 513
504 514
515The File Object
516===============
517
518A file object represents a file opened by a process.
519
520
505struct file_operations 521struct file_operations
506====================== 522----------------------
507 523
508This describes how the VFS can manipulate an open file. As of kernel 524This describes how the VFS can manipulate an open file. As of kernel
5092.6.13, the following members are defined: 5252.6.13, the following members are defined:
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a
661directory. 677directory.
662 678
663 679
664Directory Entry Cache APIs 680Directory Entry Cache API
665-------------------------- 681--------------------------
666 682
667There are a number of functions defined which permit a filesystem to 683There are a number of functions defined which permit a filesystem to
@@ -705,178 +721,24 @@ manipulate dentries:
705 and the dentry is returned. The caller must use d_put() 721 and the dentry is returned. The caller must use d_put()
706 to free the dentry when it finishes using it. 722 to free the dentry when it finishes using it.
707 723
724For further information on dentry locking, please refer to the document
725Documentation/filesystems/dentry-locking.txt.
708 726
709RCU-based dcache locking model
710------------------------------
711 727
712On many workloads, the most common operation on dcache is 728Resources
713to look up a dentry, given a parent dentry and the name 729=========
714of the child. Typically, for every open(), stat() etc., 730
715the dentry corresponding to the pathname will be looked 731(Note some of these resources are not up-to-date with the latest kernel
716up by walking the tree starting with the first component 732 version.)
717of the pathname and using that dentry along with the next 733
718component to look up the next level and so on. Since it 734Creating Linux virtual filesystems. 2002
719is a frequent operation for workloads like multiuser 735 <http://lwn.net/Articles/13325/>
720environments and web servers, it is important to optimize 736
721this path. 737The Linux Virtual File-system Layer by Neil Brown. 1999
722 738 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
723Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus 739
724in every component during path look-up. Since 2.5.10 onwards, 740A tour of the Linux VFS by Michael K. Johnson. 1996
725fast-walk algorithm changed this by holding the dcache_lock 741 <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
726at the beginning and walking as many cached path component
727dentries as possible. This significantly decreases the number
728of acquisition of dcache_lock. However it also increases the
729lock hold time significantly and affects performance in large
730SMP machines. Since 2.5.62 kernel, dcache has been using
731a new locking model that uses RCU to make dcache look-up
732lock-free.
733
734The current dcache locking model is not very different from the existing
735dcache locking model. Prior to 2.5.62 kernel, dcache_lock
736protected the hash chain, d_child, d_alias, d_lru lists as well
737as d_inode and several other things like mount look-up. RCU-based
738changes affect only the way the hash chain is protected. For everything
739else the dcache_lock must be taken for both traversing as well as
740updating. The hash chain updates too take the dcache_lock.
741The significant change is the way d_lookup traverses the hash chain,
742it doesn't acquire the dcache_lock for this and rely on RCU to
743ensure that the dentry has not been *freed*.
744
745
746Dcache locking details
747----------------------
748 742
749For many multi-user workloads, open() and stat() on files are 743A small trail through the Linux kernel by Andries Brouwer. 2001
750very frequently occurring operations. Both involve walking 744 <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
751of path names to find the dentry corresponding to the
752concerned file. In 2.4 kernel, dcache_lock was held
753during look-up of each path component. Contention and
754cache-line bouncing of this global lock caused significant
755scalability problems. With the introduction of RCU
756in Linux kernel, this was worked around by making
757the look-up of path components during path walking lock-free.
758
759
760Safe lock-free look-up of dcache hash table
761===========================================
762
763Dcache is a complex data structure with the hash table entries
764also linked together in other lists. In 2.4 kernel, dcache_lock
765protected all the lists. We applied RCU only on hash chain
766walking. The rest of the lists are still protected by dcache_lock.
767Some of the important changes are :
768
7691. The deletion from hash chain is done using hlist_del_rcu() macro which
770 doesn't initialize next pointer of the deleted dentry and this
771 allows us to walk safely lock-free while a deletion is happening.
772
7732. Insertion of a dentry into the hash table is done using
774 hlist_add_head_rcu() which take care of ordering the writes -
775 the writes to the dentry must be visible before the dentry
776 is inserted. This works in conjunction with hlist_for_each_rcu()
777 while walking the hash chain. The only requirement is that
778 all initialization to the dentry must be done before hlist_add_head_rcu()
779 since we don't have dcache_lock protection while traversing
780 the hash chain. This isn't different from the existing code.
781
7823. The dentry looked up without holding dcache_lock by cannot be
783 returned for walking if it is unhashed. It then may have a NULL
784 d_inode or other bogosity since RCU doesn't protect the other
785 fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
786 indicate unhashed dentries and use this in conjunction with a
787 per-dentry lock (d_lock). Once looked up without the dcache_lock,
788 we acquire the per-dentry lock (d_lock) and check if the
789 dentry is unhashed. If so, the look-up is failed. If not, the
790 reference count of the dentry is increased and the dentry is returned.
791
7924. Once a dentry is looked up, it must be ensured during the path
793 walk for that component it doesn't go away. In pre-2.5.10 code,
794 this was done holding a reference to the dentry. dcache_rcu does
795 the same. In some sense, dcache_rcu path walking looks like
796 the pre-2.5.10 version.
797
7985. All dentry hash chain updates must take the dcache_lock as well as
799 the per-dentry lock in that order. dput() does this to ensure
800 that a dentry that has just been looked up in another CPU
801 doesn't get deleted before dget() can be done on it.
802
8036. There are several ways to do reference counting of RCU protected
804 objects. One such example is in ipv4 route cache where
805 deferred freeing (using call_rcu()) is done as soon as
806 the reference count goes to zero. This cannot be done in
807 the case of dentries because tearing down of dentries
808 require blocking (dentry_iput()) which isn't supported from
809 RCU callbacks. Instead, tearing down of dentries happen
810 synchronously in dput(), but actual freeing happens later
811 when RCU grace period is over. This allows safe lock-free
812 walking of the hash chains, but a matched dentry may have
813 been partially torn down. The checking of DCACHE_UNHASHED
814 flag with d_lock held detects such dentries and prevents
815 them from being returned from look-up.
816
817
818Maintaining POSIX rename semantics
819==================================
820
821Since look-up of dentries is lock-free, it can race against
822a concurrent rename operation. For example, during rename
823of file A to B, look-up of either A or B must succeed.
824So, if look-up of B happens after A has been removed from the
825hash chain but not added to the new hash chain, it may fail.
826Also, a comparison while the name is being written concurrently
827by a rename may result in false positive matches violating
828rename semantics. Issues related to race with rename are
829handled as described below :
830
8311. Look-up can be done in two ways - d_lookup() which is safe
832 from simultaneous renames and __d_lookup() which is not.
833 If __d_lookup() fails, it must be followed up by a d_lookup()
834 to correctly determine whether a dentry is in the hash table
835 or not. d_lookup() protects look-ups using a sequence
836 lock (rename_lock).
837
8382. The name associated with a dentry (d_name) may be changed if
839 a rename is allowed to happen simultaneously. To avoid memcmp()
840 in __d_lookup() go out of bounds due to a rename and false
841 positive comparison, the name comparison is done while holding the
842 per-dentry lock. This prevents concurrent renames during this
843 operation.
844
8453. Hash table walking during look-up may move to a different bucket as
846 the current dentry is moved to a different bucket due to rename.
847 But we use hlists in dcache hash table and they are null-terminated.
848 So, even if a dentry moves to a different bucket, hash chain
849 walk will terminate. [with a list_head list, it may not since
850 termination is when the list_head in the original bucket is reached].
851 Since we redo the d_parent check and compare name while holding
852 d_lock, lock-free look-up will not race against d_move().
853
8544. There can be a theoretical race when a dentry keeps coming back
855 to original bucket due to double moves. Due to this look-up may
856 consider that it has never moved and can end up in a infinite loop.
857 But this is not any worse that theoretical livelocks we already
858 have in the kernel.
859
860
861Important guidelines for filesystem developers related to dcache_rcu
862====================================================================
863
8641. Existing dcache interfaces (pre-2.5.62) exported to filesystem
865 don't change. Only dcache internal implementation changes. However
866 filesystems *must not* delete from the dentry hash chains directly
867 using the list macros like allowed earlier. They must use dcache
868 APIs like d_drop() or __d_drop() depending on the situation.
869
8702. d_flags is now protected by a per-dentry lock (d_lock). All
871 access to d_flags must be protected by it.
872
8733. For a hashed dentry, checking of d_count needs to be protected
874 by d_lock.
875
876
877Papers and other documentation on dcache locking
878================================================
879
8801. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
881
8822. http://lse.sourceforge.net/locking/dcache/dcache.html