aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
-rw-r--r--Documentation/filesystems/vfs.txt263
1 files changed, 149 insertions, 114 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f042c12e0ed2..821090946a1a 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -3,7 +3,7 @@
3 3
4 Original author: Richard Gooch <rgooch@atnf.csiro.au> 4 Original author: Richard Gooch <rgooch@atnf.csiro.au>
5 5
6 Last updated on August 25, 2005 6 Last updated on October 28, 2005
7 7
8 Copyright (C) 1999 Richard Gooch 8 Copyright (C) 1999 Richard Gooch
9 Copyright (C) 2005 Pekka Enberg 9 Copyright (C) 2005 Pekka Enberg
@@ -11,62 +11,61 @@
11 This file is released under the GPLv2. 11 This file is released under the GPLv2.
12 12
13 13
14What is it? 14Introduction
15=========== 15============
16 16
17The Virtual File System (otherwise known as the Virtual Filesystem 17The Virtual File System (also known as the Virtual Filesystem Switch)
18Switch) is the software layer in the kernel that provides the 18is the software layer in the kernel that provides the filesystem
19filesystem interface to userspace programs. It also provides an 19interface to userspace programs. It also provides an abstraction
20abstraction within the kernel which allows different filesystem 20within the kernel which allows different filesystem implementations to
21implementations to coexist. 21coexist.
22 22
23VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
24on are called from a process context. Filesystem locking is described
25in the document Documentation/filesystems/Locking.
23 26
24A Quick Look At How It Works
25============================
26 27
27In this section I'll briefly describe how things work, before 28Directory Entry Cache (dcache)
28launching into the details. I'll start with describing what happens 29------------------------------
29when user programs open and manipulate files, and then look from the
30other view which is how a filesystem is supported and subsequently
31mounted.
32
33
34Opening a File
35--------------
36
37The VFS implements the open(2), stat(2), chmod(2) and similar system
38calls. The pathname argument is used by the VFS to search through the
39directory entry cache (dentry cache or "dcache"). This provides a very
40fast look-up mechanism to translate a pathname (filename) into a
41specific dentry.
42
43An individual dentry usually has a pointer to an inode. Inodes are the
44things that live on disc drives, and can be regular files (you know:
45those things that you write data into), directories, FIFOs and other
46beasts. Dentries live in RAM and are never saved to disc: they exist
47only for performance. Inodes live on disc and are copied into memory
48when required. Later any changes are written back to disc. The inode
49that lives in RAM is a VFS inode, and it is this which the dentry
50points to. A single inode can be pointed to by multiple dentries
51(think about hardlinks).
52
53The dcache is meant to be a view into your entire filespace. Unlike
54Linus, most of us losers can't fit enough dentries into RAM to cover
55all of our filespace, so the dcache has bits missing. In order to
56resolve your pathname into a dentry, the VFS may have to resort to
57creating dentries along the way, and then loading the inode. This is
58done by looking up the inode.
59
60To look up an inode (usually read from disc) requires that the VFS
61calls the lookup() method of the parent directory inode. This method
62is installed by the specific filesystem implementation that the inode
63lives in. There will be more on this later.
64 30
65Once the VFS has the required dentry (and hence the inode), we can do 31The VFS implements the open(2), stat(2), chmod(2), and similar system
66all those boring things like open(2) the file, or stat(2) it to peek 32calls. The pathname argument that is passed to them is used by the VFS
67at the inode data. The stat(2) operation is fairly simple: once the 33to search through the directory entry cache (also known as the dentry
68VFS has the dentry, it peeks at the inode data and passes some of it 34cache or dcache). This provides a very fast look-up mechanism to
69back to userspace. 35translate a pathname (filename) into a specific dentry. Dentries live
36in RAM and are never saved to disc: they exist only for performance.
37
38The dentry cache is meant to be a view into your entire filespace. As
39most computers cannot fit all dentries in the RAM at the same time,
40some bits of the cache are missing. In order to resolve your pathname
41into a dentry, the VFS may have to resort to creating dentries along
42the way, and then loading the inode. This is done by looking up the
43inode.
44
45
46The Inode Object
47----------------
48
49An individual dentry usually has a pointer to an inode. Inodes are
50filesystem objects such as regular files, directories, FIFOs and other
51beasts. They live either on the disc (for block device filesystems)
52or in the memory (for pseudo filesystems). Inodes that live on the
53disc are copied into the memory when required and changes to the inode
54are written back to disc. A single inode can be pointed to by multiple
55dentries (hard links, for example, do this).
56
57To look up an inode requires that the VFS calls the lookup() method of
58the parent directory inode. This method is installed by the specific
59filesystem implementation that the inode lives in. Once the VFS has
60the required dentry (and hence the inode), we can do all those boring
61things like open(2) the file, or stat(2) it to peek at the inode
62data. The stat(2) operation is fairly simple: once the VFS has the
63dentry, it peeks at the inode data and passes some of it back to
64userspace.
65
66
67The File Object
68---------------
70 69
71Opening a file requires another operation: allocation of a file 70Opening a file requires another operation: allocation of a file
72structure (this is the kernel-side implementation of file 71structure (this is the kernel-side implementation of file
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with
74a pointer to the dentry and a set of file operation member functions. 73a pointer to the dentry and a set of file operation member functions.
75These are taken from the inode data. The open() file method is then 74These are taken from the inode data. The open() file method is then
76called so the specific filesystem implementation can do it's work. You 75called so the specific filesystem implementation can do it's work. You
77can see that this is another switch performed by the VFS. 76can see that this is another switch performed by the VFS. The file
78 77structure is placed into the file descriptor table for the process.
79The file structure is placed into the file descriptor table for the
80process.
81 78
82Reading, writing and closing files (and other assorted VFS operations) 79Reading, writing and closing files (and other assorted VFS operations)
83is done by using the userspace file descriptor to grab the appropriate 80is done by using the userspace file descriptor to grab the appropriate
84file structure, and then calling the required file structure method 81file structure, and then calling the required file structure method to
85function to do whatever is required. 82do whatever is required. For as long as the file is open, it keeps the
86 83dentry in use, which in turn means that the VFS inode is still in use.
87For as long as the file is open, it keeps the dentry "open" (in use),
88which in turn means that the VFS inode is still in use.
89
90All VFS system calls (i.e. open(2), stat(2), read(2), write(2),
91chmod(2) and so on) are called from a process context. You should
92assume that these calls are made without any kernel locks being
93held. This means that the processes may be executing the same piece of
94filesystem or driver code at the same time, on different
95processors. You should ensure that access to shared resources is
96protected by appropriate locks.
97 84
98 85
99Registering and Mounting a Filesystem 86Registering and Mounting a Filesystem
100------------------------------------- 87=====================================
101 88
102If you want to support a new kind of filesystem in the kernel, all you 89To register and unregister a filesystem, use the following API
103need to do is call register_filesystem(). You pass a structure 90functions:
104describing the filesystem implementation (struct file_system_type)
105which is then added to an internal table of supported filesystems. You
106can do:
107 91
108% cat /proc/filesystems 92 #include <linux/fs.h>
109 93
110to see what filesystems are currently available on your system. 94 extern int register_filesystem(struct file_system_type *);
95 extern int unregister_filesystem(struct file_system_type *);
111 96
112When a request is made to mount a block device onto a directory in 97The passed struct file_system_type describes your filesystem. When a
113your filespace the VFS will call the appropriate method for the 98request is made to mount a device onto a directory in your filespace,
114specific filesystem. The dentry for the mount point will then be 99the VFS will call the appropriate get_sb() method for the specific
115updated to point to the root inode for the new filesystem. 100filesystem. The dentry for the mount point will then be updated to
101point to the root inode for the new filesystem.
116 102
117It's now time to look at things in more detail. 103You can see all filesystems that are registered to the kernel in the
104file /proc/filesystems.
118 105
119 106
120struct file_system_type 107struct file_system_type
121======================= 108-----------------------
122 109
123This describes the filesystem. As of kernel 2.6.13, the following 110This describes the filesystem. As of kernel 2.6.13, the following
124members are defined: 111members are defined:
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments:
197 int silent: whether or not to be silent on error 184 int silent: whether or not to be silent on error
198 185
199 186
187The Superblock Object
188=====================
189
190A superblock object represents a mounted filesystem.
191
192
200struct super_operations 193struct super_operations
201======================= 194-----------------------
202 195
203This describes how the VFS can manipulate the superblock of your 196This describes how the VFS can manipulate the superblock of your
204filesystem. As of kernel 2.6.13, the following members are defined: 197filesystem. As of kernel 2.6.13, the following members are defined:
@@ -286,9 +279,9 @@ or bottom half).
286 a superblock. The second parameter indicates whether the method 279 a superblock. The second parameter indicates whether the method
287 should wait until the write out has been completed. Optional. 280 should wait until the write out has been completed. Optional.
288 281
289 write_super_lockfs: called when VFS is locking a filesystem and forcing 282 write_super_lockfs: called when VFS is locking a filesystem and
290 it into a consistent state. This function is currently used by the 283 forcing it into a consistent state. This method is currently
291 Logical Volume Manager (LVM). 284 used by the Logical Volume Manager (LVM).
292 285
293 unlockfs: called when VFS is unlocking a filesystem and making it writable 286 unlockfs: called when VFS is unlocking a filesystem and making it writable
294 again. 287 again.
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which
317describes the methods that can be performed on individual inodes. 310describes the methods that can be performed on individual inodes.
318 311
319 312
313The Inode Object
314================
315
316An inode object represents an object within the filesystem.
317
318
320struct inode_operations 319struct inode_operations
321======================= 320-----------------------
322 321
323This describes how the VFS can manipulate an inode in your 322This describes how the VFS can manipulate an inode in your
324filesystem. As of kernel 2.6.13, the following members are defined: 323filesystem. As of kernel 2.6.13, the following members are defined:
@@ -394,51 +393,62 @@ otherwise noted.
394 will probably need to call d_instantiate() just as you would 393 will probably need to call d_instantiate() just as you would
395 in the create() method 394 in the create() method
396 395
396 rename: called by the rename(2) system call to rename the object to
397 have the parent and name given by the second inode and dentry.
398
397 readlink: called by the readlink(2) system call. Only required if 399 readlink: called by the readlink(2) system call. Only required if
398 you want to support reading symbolic links 400 you want to support reading symbolic links
399 401
400 follow_link: called by the VFS to follow a symbolic link to the 402 follow_link: called by the VFS to follow a symbolic link to the
401 inode it points to. Only required if you want to support 403 inode it points to. Only required if you want to support
402 symbolic links. This function returns a void pointer cookie 404 symbolic links. This method returns a void pointer cookie
403 that is passed to put_link(). 405 that is passed to put_link().
404 406
405 put_link: called by the VFS to release resources allocated by 407 put_link: called by the VFS to release resources allocated by
406 follow_link(). The cookie returned by follow_link() is passed to 408 follow_link(). The cookie returned by follow_link() is passed
407 to this function as the last parameter. It is used by filesystems 409 to to this method as the last parameter. It is used by
408 such as NFS where page cache is not stable (i.e. page that was 410 filesystems such as NFS where page cache is not stable
409 installed when the symbolic link walk started might not be in the 411 (i.e. page that was installed when the symbolic link walk
410 page cache at the end of the walk). 412 started might not be in the page cache at the end of the
411 413 walk).
412 truncate: called by the VFS to change the size of a file. The i_size 414
413 field of the inode is set to the desired size by the VFS before 415 truncate: called by the VFS to change the size of a file. The
414 this function is called. This function is called by the truncate(2) 416 i_size field of the inode is set to the desired size by the
415 system call and related functionality. 417 VFS before this method is called. This method is called by
418 the truncate(2) system call and related functionality.
416 419
417 permission: called by the VFS to check for access rights on a POSIX-like 420 permission: called by the VFS to check for access rights on a POSIX-like
418 filesystem. 421 filesystem.
419 422
420 setattr: called by the VFS to set attributes for a file. This function is 423 setattr: called by the VFS to set attributes for a file. This method
421 called by chmod(2) and related system calls. 424 is called by chmod(2) and related system calls.
422 425
423 getattr: called by the VFS to get attributes of a file. This function is 426 getattr: called by the VFS to get attributes of a file. This method
424 called by stat(2) and related system calls. 427 is called by stat(2) and related system calls.
425 428
426 setxattr: called by the VFS to set an extended attribute for a file. 429 setxattr: called by the VFS to set an extended attribute for a file.
427 Extended attribute is a name:value pair associated with an inode. This 430 Extended attribute is a name:value pair associated with an
428 function is called by setxattr(2) system call. 431 inode. This method is called by setxattr(2) system call.
432
433 getxattr: called by the VFS to retrieve the value of an extended
434 attribute name. This method is called by getxattr(2) function
435 call.
436
437 listxattr: called by the VFS to list all extended attributes for a
438 given file. This method is called by listxattr(2) system call.
429 439
430 getxattr: called by the VFS to retrieve the value of an extended attribute 440 removexattr: called by the VFS to remove an extended attribute from
431 name. This function is called by getxattr(2) function call. 441 a file. This method is called by removexattr(2) system call.
432 442
433 listxattr: called by the VFS to list all extended attributes for a given
434 file. This function is called by listxattr(2) system call.
435 443
436 removexattr: called by the VFS to remove an extended attribute from a file. 444The Address Space Object
437 This function is called by removexattr(2) system call. 445========================
446
447The address space object is used to identify pages in the page cache.
438 448
439 449
440struct address_space_operations 450struct address_space_operations
441=============================== 451-------------------------------
442 452
443This describes how the VFS can manipulate mapping of a file to page cache in 453This describes how the VFS can manipulate mapping of a file to page cache in
444your filesystem. As of kernel 2.6.13, the following members are defined: 454your filesystem. As of kernel 2.6.13, the following members are defined:
@@ -502,8 +512,14 @@ struct address_space_operations {
502 it. An example implementation can be found in fs/ext2/xip.c. 512 it. An example implementation can be found in fs/ext2/xip.c.
503 513
504 514
515The File Object
516===============
517
518A file object represents a file opened by a process.
519
520
505struct file_operations 521struct file_operations
506====================== 522----------------------
507 523
508This describes how the VFS can manipulate an open file. As of kernel 524This describes how the VFS can manipulate an open file. As of kernel
5092.6.13, the following members are defined: 5252.6.13, the following members are defined:
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a
661directory. 677directory.
662 678
663 679
664Directory Entry Cache APIs 680Directory Entry Cache API
665-------------------------- 681--------------------------
666 682
667There are a number of functions defined which permit a filesystem to 683There are a number of functions defined which permit a filesystem to
@@ -880,3 +896,22 @@ Papers and other documentation on dcache locking
8801. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). 8961. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
881 897
8822. http://lse.sourceforge.net/locking/dcache/dcache.html 8982. http://lse.sourceforge.net/locking/dcache/dcache.html
899
900
901Resources
902=========
903
904(Note some of these resources are not up-to-date with the latest kernel
905 version.)
906
907Creating Linux virtual filesystems. 2002
908 <http://lwn.net/Articles/13325/>
909
910The Linux Virtual File-system Layer by Neil Brown. 1999
911 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
912
913A tour of the Linux VFS by Michael K. Johnson. 1996
914 <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
915
916A small trail through the Linux kernel by Andries Brouwer. 2001
917 <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>