diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/vfs.txt | 263 |
1 files changed, 149 insertions, 114 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f042c12e0ed2..821090946a1a 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -3,7 +3,7 @@ | |||
3 | 3 | ||
4 | Original author: Richard Gooch <rgooch@atnf.csiro.au> | 4 | Original author: Richard Gooch <rgooch@atnf.csiro.au> |
5 | 5 | ||
6 | Last updated on August 25, 2005 | 6 | Last updated on October 28, 2005 |
7 | 7 | ||
8 | Copyright (C) 1999 Richard Gooch | 8 | Copyright (C) 1999 Richard Gooch |
9 | Copyright (C) 2005 Pekka Enberg | 9 | Copyright (C) 2005 Pekka Enberg |
@@ -11,62 +11,61 @@ | |||
11 | This file is released under the GPLv2. | 11 | This file is released under the GPLv2. |
12 | 12 | ||
13 | 13 | ||
14 | What is it? | 14 | Introduction |
15 | =========== | 15 | ============ |
16 | 16 | ||
17 | The Virtual File System (otherwise known as the Virtual Filesystem | 17 | The Virtual File System (also known as the Virtual Filesystem Switch) |
18 | Switch) is the software layer in the kernel that provides the | 18 | is the software layer in the kernel that provides the filesystem |
19 | filesystem interface to userspace programs. It also provides an | 19 | interface to userspace programs. It also provides an abstraction |
20 | abstraction within the kernel which allows different filesystem | 20 | within the kernel which allows different filesystem implementations to |
21 | implementations to coexist. | 21 | coexist. |
22 | 22 | ||
23 | VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so | ||
24 | on are called from a process context. Filesystem locking is described | ||
25 | in the document Documentation/filesystems/Locking. | ||
23 | 26 | ||
24 | A Quick Look At How It Works | ||
25 | ============================ | ||
26 | 27 | ||
27 | In this section I'll briefly describe how things work, before | 28 | Directory Entry Cache (dcache) |
28 | launching into the details. I'll start with describing what happens | 29 | ------------------------------ |
29 | when user programs open and manipulate files, and then look from the | ||
30 | other view which is how a filesystem is supported and subsequently | ||
31 | mounted. | ||
32 | |||
33 | |||
34 | Opening a File | ||
35 | -------------- | ||
36 | |||
37 | The VFS implements the open(2), stat(2), chmod(2) and similar system | ||
38 | calls. The pathname argument is used by the VFS to search through the | ||
39 | directory entry cache (dentry cache or "dcache"). This provides a very | ||
40 | fast look-up mechanism to translate a pathname (filename) into a | ||
41 | specific dentry. | ||
42 | |||
43 | An individual dentry usually has a pointer to an inode. Inodes are the | ||
44 | things that live on disc drives, and can be regular files (you know: | ||
45 | those things that you write data into), directories, FIFOs and other | ||
46 | beasts. Dentries live in RAM and are never saved to disc: they exist | ||
47 | only for performance. Inodes live on disc and are copied into memory | ||
48 | when required. Later any changes are written back to disc. The inode | ||
49 | that lives in RAM is a VFS inode, and it is this which the dentry | ||
50 | points to. A single inode can be pointed to by multiple dentries | ||
51 | (think about hardlinks). | ||
52 | |||
53 | The dcache is meant to be a view into your entire filespace. Unlike | ||
54 | Linus, most of us losers can't fit enough dentries into RAM to cover | ||
55 | all of our filespace, so the dcache has bits missing. In order to | ||
56 | resolve your pathname into a dentry, the VFS may have to resort to | ||
57 | creating dentries along the way, and then loading the inode. This is | ||
58 | done by looking up the inode. | ||
59 | |||
60 | To look up an inode (usually read from disc) requires that the VFS | ||
61 | calls the lookup() method of the parent directory inode. This method | ||
62 | is installed by the specific filesystem implementation that the inode | ||
63 | lives in. There will be more on this later. | ||
64 | 30 | ||
65 | Once the VFS has the required dentry (and hence the inode), we can do | 31 | The VFS implements the open(2), stat(2), chmod(2), and similar system |
66 | all those boring things like open(2) the file, or stat(2) it to peek | 32 | calls. The pathname argument that is passed to them is used by the VFS |
67 | at the inode data. The stat(2) operation is fairly simple: once the | 33 | to search through the directory entry cache (also known as the dentry |
68 | VFS has the dentry, it peeks at the inode data and passes some of it | 34 | cache or dcache). This provides a very fast look-up mechanism to |
69 | back to userspace. | 35 | translate a pathname (filename) into a specific dentry. Dentries live |
36 | in RAM and are never saved to disc: they exist only for performance. | ||
37 | |||
38 | The dentry cache is meant to be a view into your entire filespace. As | ||
39 | most computers cannot fit all dentries in the RAM at the same time, | ||
40 | some bits of the cache are missing. In order to resolve your pathname | ||
41 | into a dentry, the VFS may have to resort to creating dentries along | ||
42 | the way, and then loading the inode. This is done by looking up the | ||
43 | inode. | ||
44 | |||
45 | |||
46 | The Inode Object | ||
47 | ---------------- | ||
48 | |||
49 | An individual dentry usually has a pointer to an inode. Inodes are | ||
50 | filesystem objects such as regular files, directories, FIFOs and other | ||
51 | beasts. They live either on the disc (for block device filesystems) | ||
52 | or in the memory (for pseudo filesystems). Inodes that live on the | ||
53 | disc are copied into the memory when required and changes to the inode | ||
54 | are written back to disc. A single inode can be pointed to by multiple | ||
55 | dentries (hard links, for example, do this). | ||
56 | |||
57 | To look up an inode requires that the VFS calls the lookup() method of | ||
58 | the parent directory inode. This method is installed by the specific | ||
59 | filesystem implementation that the inode lives in. Once the VFS has | ||
60 | the required dentry (and hence the inode), we can do all those boring | ||
61 | things like open(2) the file, or stat(2) it to peek at the inode | ||
62 | data. The stat(2) operation is fairly simple: once the VFS has the | ||
63 | dentry, it peeks at the inode data and passes some of it back to | ||
64 | userspace. | ||
65 | |||
66 | |||
67 | The File Object | ||
68 | --------------- | ||
70 | 69 | ||
71 | Opening a file requires another operation: allocation of a file | 70 | Opening a file requires another operation: allocation of a file |
72 | structure (this is the kernel-side implementation of file | 71 | structure (this is the kernel-side implementation of file |
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with | |||
74 | a pointer to the dentry and a set of file operation member functions. | 73 | a pointer to the dentry and a set of file operation member functions. |
75 | These are taken from the inode data. The open() file method is then | 74 | These are taken from the inode data. The open() file method is then |
76 | called so the specific filesystem implementation can do it's work. You | 75 | called so the specific filesystem implementation can do it's work. You |
77 | can see that this is another switch performed by the VFS. | 76 | can see that this is another switch performed by the VFS. The file |
78 | 77 | structure is placed into the file descriptor table for the process. | |
79 | The file structure is placed into the file descriptor table for the | ||
80 | process. | ||
81 | 78 | ||
82 | Reading, writing and closing files (and other assorted VFS operations) | 79 | Reading, writing and closing files (and other assorted VFS operations) |
83 | is done by using the userspace file descriptor to grab the appropriate | 80 | is done by using the userspace file descriptor to grab the appropriate |
84 | file structure, and then calling the required file structure method | 81 | file structure, and then calling the required file structure method to |
85 | function to do whatever is required. | 82 | do whatever is required. For as long as the file is open, it keeps the |
86 | 83 | dentry in use, which in turn means that the VFS inode is still in use. | |
87 | For as long as the file is open, it keeps the dentry "open" (in use), | ||
88 | which in turn means that the VFS inode is still in use. | ||
89 | |||
90 | All VFS system calls (i.e. open(2), stat(2), read(2), write(2), | ||
91 | chmod(2) and so on) are called from a process context. You should | ||
92 | assume that these calls are made without any kernel locks being | ||
93 | held. This means that the processes may be executing the same piece of | ||
94 | filesystem or driver code at the same time, on different | ||
95 | processors. You should ensure that access to shared resources is | ||
96 | protected by appropriate locks. | ||
97 | 84 | ||
98 | 85 | ||
99 | Registering and Mounting a Filesystem | 86 | Registering and Mounting a Filesystem |
100 | ------------------------------------- | 87 | ===================================== |
101 | 88 | ||
102 | If you want to support a new kind of filesystem in the kernel, all you | 89 | To register and unregister a filesystem, use the following API |
103 | need to do is call register_filesystem(). You pass a structure | 90 | functions: |
104 | describing the filesystem implementation (struct file_system_type) | ||
105 | which is then added to an internal table of supported filesystems. You | ||
106 | can do: | ||
107 | 91 | ||
108 | % cat /proc/filesystems | 92 | #include <linux/fs.h> |
109 | 93 | ||
110 | to see what filesystems are currently available on your system. | 94 | extern int register_filesystem(struct file_system_type *); |
95 | extern int unregister_filesystem(struct file_system_type *); | ||
111 | 96 | ||
112 | When a request is made to mount a block device onto a directory in | 97 | The passed struct file_system_type describes your filesystem. When a |
113 | your filespace the VFS will call the appropriate method for the | 98 | request is made to mount a device onto a directory in your filespace, |
114 | specific filesystem. The dentry for the mount point will then be | 99 | the VFS will call the appropriate get_sb() method for the specific |
115 | updated to point to the root inode for the new filesystem. | 100 | filesystem. The dentry for the mount point will then be updated to |
101 | point to the root inode for the new filesystem. | ||
116 | 102 | ||
117 | It's now time to look at things in more detail. | 103 | You can see all filesystems that are registered to the kernel in the |
104 | file /proc/filesystems. | ||
118 | 105 | ||
119 | 106 | ||
120 | struct file_system_type | 107 | struct file_system_type |
121 | ======================= | 108 | ----------------------- |
122 | 109 | ||
123 | This describes the filesystem. As of kernel 2.6.13, the following | 110 | This describes the filesystem. As of kernel 2.6.13, the following |
124 | members are defined: | 111 | members are defined: |
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments: | |||
197 | int silent: whether or not to be silent on error | 184 | int silent: whether or not to be silent on error |
198 | 185 | ||
199 | 186 | ||
187 | The Superblock Object | ||
188 | ===================== | ||
189 | |||
190 | A superblock object represents a mounted filesystem. | ||
191 | |||
192 | |||
200 | struct super_operations | 193 | struct super_operations |
201 | ======================= | 194 | ----------------------- |
202 | 195 | ||
203 | This describes how the VFS can manipulate the superblock of your | 196 | This describes how the VFS can manipulate the superblock of your |
204 | filesystem. As of kernel 2.6.13, the following members are defined: | 197 | filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -286,9 +279,9 @@ or bottom half). | |||
286 | a superblock. The second parameter indicates whether the method | 279 | a superblock. The second parameter indicates whether the method |
287 | should wait until the write out has been completed. Optional. | 280 | should wait until the write out has been completed. Optional. |
288 | 281 | ||
289 | write_super_lockfs: called when VFS is locking a filesystem and forcing | 282 | write_super_lockfs: called when VFS is locking a filesystem and |
290 | it into a consistent state. This function is currently used by the | 283 | forcing it into a consistent state. This method is currently |
291 | Logical Volume Manager (LVM). | 284 | used by the Logical Volume Manager (LVM). |
292 | 285 | ||
293 | unlockfs: called when VFS is unlocking a filesystem and making it writable | 286 | unlockfs: called when VFS is unlocking a filesystem and making it writable |
294 | again. | 287 | again. |
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which | |||
317 | describes the methods that can be performed on individual inodes. | 310 | describes the methods that can be performed on individual inodes. |
318 | 311 | ||
319 | 312 | ||
313 | The Inode Object | ||
314 | ================ | ||
315 | |||
316 | An inode object represents an object within the filesystem. | ||
317 | |||
318 | |||
320 | struct inode_operations | 319 | struct inode_operations |
321 | ======================= | 320 | ----------------------- |
322 | 321 | ||
323 | This describes how the VFS can manipulate an inode in your | 322 | This describes how the VFS can manipulate an inode in your |
324 | filesystem. As of kernel 2.6.13, the following members are defined: | 323 | filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -394,51 +393,62 @@ otherwise noted. | |||
394 | will probably need to call d_instantiate() just as you would | 393 | will probably need to call d_instantiate() just as you would |
395 | in the create() method | 394 | in the create() method |
396 | 395 | ||
396 | rename: called by the rename(2) system call to rename the object to | ||
397 | have the parent and name given by the second inode and dentry. | ||
398 | |||
397 | readlink: called by the readlink(2) system call. Only required if | 399 | readlink: called by the readlink(2) system call. Only required if |
398 | you want to support reading symbolic links | 400 | you want to support reading symbolic links |
399 | 401 | ||
400 | follow_link: called by the VFS to follow a symbolic link to the | 402 | follow_link: called by the VFS to follow a symbolic link to the |
401 | inode it points to. Only required if you want to support | 403 | inode it points to. Only required if you want to support |
402 | symbolic links. This function returns a void pointer cookie | 404 | symbolic links. This method returns a void pointer cookie |
403 | that is passed to put_link(). | 405 | that is passed to put_link(). |
404 | 406 | ||
405 | put_link: called by the VFS to release resources allocated by | 407 | put_link: called by the VFS to release resources allocated by |
406 | follow_link(). The cookie returned by follow_link() is passed to | 408 | follow_link(). The cookie returned by follow_link() is passed |
407 | to this function as the last parameter. It is used by filesystems | 409 | to to this method as the last parameter. It is used by |
408 | such as NFS where page cache is not stable (i.e. page that was | 410 | filesystems such as NFS where page cache is not stable |
409 | installed when the symbolic link walk started might not be in the | 411 | (i.e. page that was installed when the symbolic link walk |
410 | page cache at the end of the walk). | 412 | started might not be in the page cache at the end of the |
411 | 413 | walk). | |
412 | truncate: called by the VFS to change the size of a file. The i_size | 414 | |
413 | field of the inode is set to the desired size by the VFS before | 415 | truncate: called by the VFS to change the size of a file. The |
414 | this function is called. This function is called by the truncate(2) | 416 | i_size field of the inode is set to the desired size by the |
415 | system call and related functionality. | 417 | VFS before this method is called. This method is called by |
418 | the truncate(2) system call and related functionality. | ||
416 | 419 | ||
417 | permission: called by the VFS to check for access rights on a POSIX-like | 420 | permission: called by the VFS to check for access rights on a POSIX-like |
418 | filesystem. | 421 | filesystem. |
419 | 422 | ||
420 | setattr: called by the VFS to set attributes for a file. This function is | 423 | setattr: called by the VFS to set attributes for a file. This method |
421 | called by chmod(2) and related system calls. | 424 | is called by chmod(2) and related system calls. |
422 | 425 | ||
423 | getattr: called by the VFS to get attributes of a file. This function is | 426 | getattr: called by the VFS to get attributes of a file. This method |
424 | called by stat(2) and related system calls. | 427 | is called by stat(2) and related system calls. |
425 | 428 | ||
426 | setxattr: called by the VFS to set an extended attribute for a file. | 429 | setxattr: called by the VFS to set an extended attribute for a file. |
427 | Extended attribute is a name:value pair associated with an inode. This | 430 | Extended attribute is a name:value pair associated with an |
428 | function is called by setxattr(2) system call. | 431 | inode. This method is called by setxattr(2) system call. |
432 | |||
433 | getxattr: called by the VFS to retrieve the value of an extended | ||
434 | attribute name. This method is called by getxattr(2) function | ||
435 | call. | ||
436 | |||
437 | listxattr: called by the VFS to list all extended attributes for a | ||
438 | given file. This method is called by listxattr(2) system call. | ||
429 | 439 | ||
430 | getxattr: called by the VFS to retrieve the value of an extended attribute | 440 | removexattr: called by the VFS to remove an extended attribute from |
431 | name. This function is called by getxattr(2) function call. | 441 | a file. This method is called by removexattr(2) system call. |
432 | 442 | ||
433 | listxattr: called by the VFS to list all extended attributes for a given | ||
434 | file. This function is called by listxattr(2) system call. | ||
435 | 443 | ||
436 | removexattr: called by the VFS to remove an extended attribute from a file. | 444 | The Address Space Object |
437 | This function is called by removexattr(2) system call. | 445 | ======================== |
446 | |||
447 | The address space object is used to identify pages in the page cache. | ||
438 | 448 | ||
439 | 449 | ||
440 | struct address_space_operations | 450 | struct address_space_operations |
441 | =============================== | 451 | ------------------------------- |
442 | 452 | ||
443 | This describes how the VFS can manipulate mapping of a file to page cache in | 453 | This describes how the VFS can manipulate mapping of a file to page cache in |
444 | your filesystem. As of kernel 2.6.13, the following members are defined: | 454 | your filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -502,8 +512,14 @@ struct address_space_operations { | |||
502 | it. An example implementation can be found in fs/ext2/xip.c. | 512 | it. An example implementation can be found in fs/ext2/xip.c. |
503 | 513 | ||
504 | 514 | ||
515 | The File Object | ||
516 | =============== | ||
517 | |||
518 | A file object represents a file opened by a process. | ||
519 | |||
520 | |||
505 | struct file_operations | 521 | struct file_operations |
506 | ====================== | 522 | ---------------------- |
507 | 523 | ||
508 | This describes how the VFS can manipulate an open file. As of kernel | 524 | This describes how the VFS can manipulate an open file. As of kernel |
509 | 2.6.13, the following members are defined: | 525 | 2.6.13, the following members are defined: |
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a | |||
661 | directory. | 677 | directory. |
662 | 678 | ||
663 | 679 | ||
664 | Directory Entry Cache APIs | 680 | Directory Entry Cache API |
665 | -------------------------- | 681 | -------------------------- |
666 | 682 | ||
667 | There are a number of functions defined which permit a filesystem to | 683 | There are a number of functions defined which permit a filesystem to |
@@ -880,3 +896,22 @@ Papers and other documentation on dcache locking | |||
880 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | 896 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). |
881 | 897 | ||
882 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | 898 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html |
899 | |||
900 | |||
901 | Resources | ||
902 | ========= | ||
903 | |||
904 | (Note some of these resources are not up-to-date with the latest kernel | ||
905 | version.) | ||
906 | |||
907 | Creating Linux virtual filesystems. 2002 | ||
908 | <http://lwn.net/Articles/13325/> | ||
909 | |||
910 | The Linux Virtual File-system Layer by Neil Brown. 1999 | ||
911 | <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> | ||
912 | |||
913 | A tour of the Linux VFS by Michael K. Johnson. 1996 | ||
914 | <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> | ||
915 | |||
916 | A small trail through the Linux kernel by Andries Brouwer. 2001 | ||
917 | <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> | ||