diff options
Diffstat (limited to 'Documentation/filesystems/vfs.txt')
-rw-r--r-- | Documentation/filesystems/vfs.txt | 434 |
1 files changed, 148 insertions, 286 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f042c12e0ed2..ee4c0a8b8db7 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -3,7 +3,7 @@ | |||
3 | 3 | ||
4 | Original author: Richard Gooch <rgooch@atnf.csiro.au> | 4 | Original author: Richard Gooch <rgooch@atnf.csiro.au> |
5 | 5 | ||
6 | Last updated on August 25, 2005 | 6 | Last updated on October 28, 2005 |
7 | 7 | ||
8 | Copyright (C) 1999 Richard Gooch | 8 | Copyright (C) 1999 Richard Gooch |
9 | Copyright (C) 2005 Pekka Enberg | 9 | Copyright (C) 2005 Pekka Enberg |
@@ -11,62 +11,61 @@ | |||
11 | This file is released under the GPLv2. | 11 | This file is released under the GPLv2. |
12 | 12 | ||
13 | 13 | ||
14 | What is it? | 14 | Introduction |
15 | =========== | 15 | ============ |
16 | 16 | ||
17 | The Virtual File System (otherwise known as the Virtual Filesystem | 17 | The Virtual File System (also known as the Virtual Filesystem Switch) |
18 | Switch) is the software layer in the kernel that provides the | 18 | is the software layer in the kernel that provides the filesystem |
19 | filesystem interface to userspace programs. It also provides an | 19 | interface to userspace programs. It also provides an abstraction |
20 | abstraction within the kernel which allows different filesystem | 20 | within the kernel which allows different filesystem implementations to |
21 | implementations to coexist. | 21 | coexist. |
22 | 22 | ||
23 | VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so | ||
24 | on are called from a process context. Filesystem locking is described | ||
25 | in the document Documentation/filesystems/Locking. | ||
23 | 26 | ||
24 | A Quick Look At How It Works | ||
25 | ============================ | ||
26 | 27 | ||
27 | In this section I'll briefly describe how things work, before | 28 | Directory Entry Cache (dcache) |
28 | launching into the details. I'll start with describing what happens | 29 | ------------------------------ |
29 | when user programs open and manipulate files, and then look from the | ||
30 | other view which is how a filesystem is supported and subsequently | ||
31 | mounted. | ||
32 | |||
33 | |||
34 | Opening a File | ||
35 | -------------- | ||
36 | |||
37 | The VFS implements the open(2), stat(2), chmod(2) and similar system | ||
38 | calls. The pathname argument is used by the VFS to search through the | ||
39 | directory entry cache (dentry cache or "dcache"). This provides a very | ||
40 | fast look-up mechanism to translate a pathname (filename) into a | ||
41 | specific dentry. | ||
42 | |||
43 | An individual dentry usually has a pointer to an inode. Inodes are the | ||
44 | things that live on disc drives, and can be regular files (you know: | ||
45 | those things that you write data into), directories, FIFOs and other | ||
46 | beasts. Dentries live in RAM and are never saved to disc: they exist | ||
47 | only for performance. Inodes live on disc and are copied into memory | ||
48 | when required. Later any changes are written back to disc. The inode | ||
49 | that lives in RAM is a VFS inode, and it is this which the dentry | ||
50 | points to. A single inode can be pointed to by multiple dentries | ||
51 | (think about hardlinks). | ||
52 | |||
53 | The dcache is meant to be a view into your entire filespace. Unlike | ||
54 | Linus, most of us losers can't fit enough dentries into RAM to cover | ||
55 | all of our filespace, so the dcache has bits missing. In order to | ||
56 | resolve your pathname into a dentry, the VFS may have to resort to | ||
57 | creating dentries along the way, and then loading the inode. This is | ||
58 | done by looking up the inode. | ||
59 | |||
60 | To look up an inode (usually read from disc) requires that the VFS | ||
61 | calls the lookup() method of the parent directory inode. This method | ||
62 | is installed by the specific filesystem implementation that the inode | ||
63 | lives in. There will be more on this later. | ||
64 | 30 | ||
65 | Once the VFS has the required dentry (and hence the inode), we can do | 31 | The VFS implements the open(2), stat(2), chmod(2), and similar system |
66 | all those boring things like open(2) the file, or stat(2) it to peek | 32 | calls. The pathname argument that is passed to them is used by the VFS |
67 | at the inode data. The stat(2) operation is fairly simple: once the | 33 | to search through the directory entry cache (also known as the dentry |
68 | VFS has the dentry, it peeks at the inode data and passes some of it | 34 | cache or dcache). This provides a very fast look-up mechanism to |
69 | back to userspace. | 35 | translate a pathname (filename) into a specific dentry. Dentries live |
36 | in RAM and are never saved to disc: they exist only for performance. | ||
37 | |||
38 | The dentry cache is meant to be a view into your entire filespace. As | ||
39 | most computers cannot fit all dentries in the RAM at the same time, | ||
40 | some bits of the cache are missing. In order to resolve your pathname | ||
41 | into a dentry, the VFS may have to resort to creating dentries along | ||
42 | the way, and then loading the inode. This is done by looking up the | ||
43 | inode. | ||
44 | |||
45 | |||
46 | The Inode Object | ||
47 | ---------------- | ||
48 | |||
49 | An individual dentry usually has a pointer to an inode. Inodes are | ||
50 | filesystem objects such as regular files, directories, FIFOs and other | ||
51 | beasts. They live either on the disc (for block device filesystems) | ||
52 | or in the memory (for pseudo filesystems). Inodes that live on the | ||
53 | disc are copied into the memory when required and changes to the inode | ||
54 | are written back to disc. A single inode can be pointed to by multiple | ||
55 | dentries (hard links, for example, do this). | ||
56 | |||
57 | To look up an inode requires that the VFS calls the lookup() method of | ||
58 | the parent directory inode. This method is installed by the specific | ||
59 | filesystem implementation that the inode lives in. Once the VFS has | ||
60 | the required dentry (and hence the inode), we can do all those boring | ||
61 | things like open(2) the file, or stat(2) it to peek at the inode | ||
62 | data. The stat(2) operation is fairly simple: once the VFS has the | ||
63 | dentry, it peeks at the inode data and passes some of it back to | ||
64 | userspace. | ||
65 | |||
66 | |||
67 | The File Object | ||
68 | --------------- | ||
70 | 69 | ||
71 | Opening a file requires another operation: allocation of a file | 70 | Opening a file requires another operation: allocation of a file |
72 | structure (this is the kernel-side implementation of file | 71 | structure (this is the kernel-side implementation of file |
@@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with | |||
74 | a pointer to the dentry and a set of file operation member functions. | 73 | a pointer to the dentry and a set of file operation member functions. |
75 | These are taken from the inode data. The open() file method is then | 74 | These are taken from the inode data. The open() file method is then |
76 | called so the specific filesystem implementation can do it's work. You | 75 | called so the specific filesystem implementation can do it's work. You |
77 | can see that this is another switch performed by the VFS. | 76 | can see that this is another switch performed by the VFS. The file |
78 | 77 | structure is placed into the file descriptor table for the process. | |
79 | The file structure is placed into the file descriptor table for the | ||
80 | process. | ||
81 | 78 | ||
82 | Reading, writing and closing files (and other assorted VFS operations) | 79 | Reading, writing and closing files (and other assorted VFS operations) |
83 | is done by using the userspace file descriptor to grab the appropriate | 80 | is done by using the userspace file descriptor to grab the appropriate |
84 | file structure, and then calling the required file structure method | 81 | file structure, and then calling the required file structure method to |
85 | function to do whatever is required. | 82 | do whatever is required. For as long as the file is open, it keeps the |
86 | 83 | dentry in use, which in turn means that the VFS inode is still in use. | |
87 | For as long as the file is open, it keeps the dentry "open" (in use), | ||
88 | which in turn means that the VFS inode is still in use. | ||
89 | |||
90 | All VFS system calls (i.e. open(2), stat(2), read(2), write(2), | ||
91 | chmod(2) and so on) are called from a process context. You should | ||
92 | assume that these calls are made without any kernel locks being | ||
93 | held. This means that the processes may be executing the same piece of | ||
94 | filesystem or driver code at the same time, on different | ||
95 | processors. You should ensure that access to shared resources is | ||
96 | protected by appropriate locks. | ||
97 | 84 | ||
98 | 85 | ||
99 | Registering and Mounting a Filesystem | 86 | Registering and Mounting a Filesystem |
100 | ------------------------------------- | 87 | ===================================== |
101 | 88 | ||
102 | If you want to support a new kind of filesystem in the kernel, all you | 89 | To register and unregister a filesystem, use the following API |
103 | need to do is call register_filesystem(). You pass a structure | 90 | functions: |
104 | describing the filesystem implementation (struct file_system_type) | ||
105 | which is then added to an internal table of supported filesystems. You | ||
106 | can do: | ||
107 | 91 | ||
108 | % cat /proc/filesystems | 92 | #include <linux/fs.h> |
109 | 93 | ||
110 | to see what filesystems are currently available on your system. | 94 | extern int register_filesystem(struct file_system_type *); |
95 | extern int unregister_filesystem(struct file_system_type *); | ||
111 | 96 | ||
112 | When a request is made to mount a block device onto a directory in | 97 | The passed struct file_system_type describes your filesystem. When a |
113 | your filespace the VFS will call the appropriate method for the | 98 | request is made to mount a device onto a directory in your filespace, |
114 | specific filesystem. The dentry for the mount point will then be | 99 | the VFS will call the appropriate get_sb() method for the specific |
115 | updated to point to the root inode for the new filesystem. | 100 | filesystem. The dentry for the mount point will then be updated to |
101 | point to the root inode for the new filesystem. | ||
116 | 102 | ||
117 | It's now time to look at things in more detail. | 103 | You can see all filesystems that are registered to the kernel in the |
104 | file /proc/filesystems. | ||
118 | 105 | ||
119 | 106 | ||
120 | struct file_system_type | 107 | struct file_system_type |
121 | ======================= | 108 | ----------------------- |
122 | 109 | ||
123 | This describes the filesystem. As of kernel 2.6.13, the following | 110 | This describes the filesystem. As of kernel 2.6.13, the following |
124 | members are defined: | 111 | members are defined: |
@@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments: | |||
197 | int silent: whether or not to be silent on error | 184 | int silent: whether or not to be silent on error |
198 | 185 | ||
199 | 186 | ||
187 | The Superblock Object | ||
188 | ===================== | ||
189 | |||
190 | A superblock object represents a mounted filesystem. | ||
191 | |||
192 | |||
200 | struct super_operations | 193 | struct super_operations |
201 | ======================= | 194 | ----------------------- |
202 | 195 | ||
203 | This describes how the VFS can manipulate the superblock of your | 196 | This describes how the VFS can manipulate the superblock of your |
204 | filesystem. As of kernel 2.6.13, the following members are defined: | 197 | filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -286,9 +279,9 @@ or bottom half). | |||
286 | a superblock. The second parameter indicates whether the method | 279 | a superblock. The second parameter indicates whether the method |
287 | should wait until the write out has been completed. Optional. | 280 | should wait until the write out has been completed. Optional. |
288 | 281 | ||
289 | write_super_lockfs: called when VFS is locking a filesystem and forcing | 282 | write_super_lockfs: called when VFS is locking a filesystem and |
290 | it into a consistent state. This function is currently used by the | 283 | forcing it into a consistent state. This method is currently |
291 | Logical Volume Manager (LVM). | 284 | used by the Logical Volume Manager (LVM). |
292 | 285 | ||
293 | unlockfs: called when VFS is unlocking a filesystem and making it writable | 286 | unlockfs: called when VFS is unlocking a filesystem and making it writable |
294 | again. | 287 | again. |
@@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which | |||
317 | describes the methods that can be performed on individual inodes. | 310 | describes the methods that can be performed on individual inodes. |
318 | 311 | ||
319 | 312 | ||
313 | The Inode Object | ||
314 | ================ | ||
315 | |||
316 | An inode object represents an object within the filesystem. | ||
317 | |||
318 | |||
320 | struct inode_operations | 319 | struct inode_operations |
321 | ======================= | 320 | ----------------------- |
322 | 321 | ||
323 | This describes how the VFS can manipulate an inode in your | 322 | This describes how the VFS can manipulate an inode in your |
324 | filesystem. As of kernel 2.6.13, the following members are defined: | 323 | filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -394,51 +393,62 @@ otherwise noted. | |||
394 | will probably need to call d_instantiate() just as you would | 393 | will probably need to call d_instantiate() just as you would |
395 | in the create() method | 394 | in the create() method |
396 | 395 | ||
396 | rename: called by the rename(2) system call to rename the object to | ||
397 | have the parent and name given by the second inode and dentry. | ||
398 | |||
397 | readlink: called by the readlink(2) system call. Only required if | 399 | readlink: called by the readlink(2) system call. Only required if |
398 | you want to support reading symbolic links | 400 | you want to support reading symbolic links |
399 | 401 | ||
400 | follow_link: called by the VFS to follow a symbolic link to the | 402 | follow_link: called by the VFS to follow a symbolic link to the |
401 | inode it points to. Only required if you want to support | 403 | inode it points to. Only required if you want to support |
402 | symbolic links. This function returns a void pointer cookie | 404 | symbolic links. This method returns a void pointer cookie |
403 | that is passed to put_link(). | 405 | that is passed to put_link(). |
404 | 406 | ||
405 | put_link: called by the VFS to release resources allocated by | 407 | put_link: called by the VFS to release resources allocated by |
406 | follow_link(). The cookie returned by follow_link() is passed to | 408 | follow_link(). The cookie returned by follow_link() is passed |
407 | to this function as the last parameter. It is used by filesystems | 409 | to to this method as the last parameter. It is used by |
408 | such as NFS where page cache is not stable (i.e. page that was | 410 | filesystems such as NFS where page cache is not stable |
409 | installed when the symbolic link walk started might not be in the | 411 | (i.e. page that was installed when the symbolic link walk |
410 | page cache at the end of the walk). | 412 | started might not be in the page cache at the end of the |
411 | 413 | walk). | |
412 | truncate: called by the VFS to change the size of a file. The i_size | 414 | |
413 | field of the inode is set to the desired size by the VFS before | 415 | truncate: called by the VFS to change the size of a file. The |
414 | this function is called. This function is called by the truncate(2) | 416 | i_size field of the inode is set to the desired size by the |
415 | system call and related functionality. | 417 | VFS before this method is called. This method is called by |
418 | the truncate(2) system call and related functionality. | ||
416 | 419 | ||
417 | permission: called by the VFS to check for access rights on a POSIX-like | 420 | permission: called by the VFS to check for access rights on a POSIX-like |
418 | filesystem. | 421 | filesystem. |
419 | 422 | ||
420 | setattr: called by the VFS to set attributes for a file. This function is | 423 | setattr: called by the VFS to set attributes for a file. This method |
421 | called by chmod(2) and related system calls. | 424 | is called by chmod(2) and related system calls. |
422 | 425 | ||
423 | getattr: called by the VFS to get attributes of a file. This function is | 426 | getattr: called by the VFS to get attributes of a file. This method |
424 | called by stat(2) and related system calls. | 427 | is called by stat(2) and related system calls. |
425 | 428 | ||
426 | setxattr: called by the VFS to set an extended attribute for a file. | 429 | setxattr: called by the VFS to set an extended attribute for a file. |
427 | Extended attribute is a name:value pair associated with an inode. This | 430 | Extended attribute is a name:value pair associated with an |
428 | function is called by setxattr(2) system call. | 431 | inode. This method is called by setxattr(2) system call. |
432 | |||
433 | getxattr: called by the VFS to retrieve the value of an extended | ||
434 | attribute name. This method is called by getxattr(2) function | ||
435 | call. | ||
429 | 436 | ||
430 | getxattr: called by the VFS to retrieve the value of an extended attribute | 437 | listxattr: called by the VFS to list all extended attributes for a |
431 | name. This function is called by getxattr(2) function call. | 438 | given file. This method is called by listxattr(2) system call. |
432 | 439 | ||
433 | listxattr: called by the VFS to list all extended attributes for a given | 440 | removexattr: called by the VFS to remove an extended attribute from |
434 | file. This function is called by listxattr(2) system call. | 441 | a file. This method is called by removexattr(2) system call. |
435 | 442 | ||
436 | removexattr: called by the VFS to remove an extended attribute from a file. | 443 | |
437 | This function is called by removexattr(2) system call. | 444 | The Address Space Object |
445 | ======================== | ||
446 | |||
447 | The address space object is used to identify pages in the page cache. | ||
438 | 448 | ||
439 | 449 | ||
440 | struct address_space_operations | 450 | struct address_space_operations |
441 | =============================== | 451 | ------------------------------- |
442 | 452 | ||
443 | This describes how the VFS can manipulate mapping of a file to page cache in | 453 | This describes how the VFS can manipulate mapping of a file to page cache in |
444 | your filesystem. As of kernel 2.6.13, the following members are defined: | 454 | your filesystem. As of kernel 2.6.13, the following members are defined: |
@@ -502,8 +512,14 @@ struct address_space_operations { | |||
502 | it. An example implementation can be found in fs/ext2/xip.c. | 512 | it. An example implementation can be found in fs/ext2/xip.c. |
503 | 513 | ||
504 | 514 | ||
515 | The File Object | ||
516 | =============== | ||
517 | |||
518 | A file object represents a file opened by a process. | ||
519 | |||
520 | |||
505 | struct file_operations | 521 | struct file_operations |
506 | ====================== | 522 | ---------------------- |
507 | 523 | ||
508 | This describes how the VFS can manipulate an open file. As of kernel | 524 | This describes how the VFS can manipulate an open file. As of kernel |
509 | 2.6.13, the following members are defined: | 525 | 2.6.13, the following members are defined: |
@@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a | |||
661 | directory. | 677 | directory. |
662 | 678 | ||
663 | 679 | ||
664 | Directory Entry Cache APIs | 680 | Directory Entry Cache API |
665 | -------------------------- | 681 | -------------------------- |
666 | 682 | ||
667 | There are a number of functions defined which permit a filesystem to | 683 | There are a number of functions defined which permit a filesystem to |
@@ -705,178 +721,24 @@ manipulate dentries: | |||
705 | and the dentry is returned. The caller must use d_put() | 721 | and the dentry is returned. The caller must use d_put() |
706 | to free the dentry when it finishes using it. | 722 | to free the dentry when it finishes using it. |
707 | 723 | ||
724 | For further information on dentry locking, please refer to the document | ||
725 | Documentation/filesystems/dentry-locking.txt. | ||
708 | 726 | ||
709 | RCU-based dcache locking model | ||
710 | ------------------------------ | ||
711 | 727 | ||
712 | On many workloads, the most common operation on dcache is | 728 | Resources |
713 | to look up a dentry, given a parent dentry and the name | 729 | ========= |
714 | of the child. Typically, for every open(), stat() etc., | 730 | |
715 | the dentry corresponding to the pathname will be looked | 731 | (Note some of these resources are not up-to-date with the latest kernel |
716 | up by walking the tree starting with the first component | 732 | version.) |
717 | of the pathname and using that dentry along with the next | 733 | |
718 | component to look up the next level and so on. Since it | 734 | Creating Linux virtual filesystems. 2002 |
719 | is a frequent operation for workloads like multiuser | 735 | <http://lwn.net/Articles/13325/> |
720 | environments and web servers, it is important to optimize | 736 | |
721 | this path. | 737 | The Linux Virtual File-system Layer by Neil Brown. 1999 |
722 | 738 | <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> | |
723 | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus | 739 | |
724 | in every component during path look-up. Since 2.5.10 onwards, | 740 | A tour of the Linux VFS by Michael K. Johnson. 1996 |
725 | fast-walk algorithm changed this by holding the dcache_lock | 741 | <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> |
726 | at the beginning and walking as many cached path component | ||
727 | dentries as possible. This significantly decreases the number | ||
728 | of acquisition of dcache_lock. However it also increases the | ||
729 | lock hold time significantly and affects performance in large | ||
730 | SMP machines. Since 2.5.62 kernel, dcache has been using | ||
731 | a new locking model that uses RCU to make dcache look-up | ||
732 | lock-free. | ||
733 | |||
734 | The current dcache locking model is not very different from the existing | ||
735 | dcache locking model. Prior to 2.5.62 kernel, dcache_lock | ||
736 | protected the hash chain, d_child, d_alias, d_lru lists as well | ||
737 | as d_inode and several other things like mount look-up. RCU-based | ||
738 | changes affect only the way the hash chain is protected. For everything | ||
739 | else the dcache_lock must be taken for both traversing as well as | ||
740 | updating. The hash chain updates too take the dcache_lock. | ||
741 | The significant change is the way d_lookup traverses the hash chain, | ||
742 | it doesn't acquire the dcache_lock for this and rely on RCU to | ||
743 | ensure that the dentry has not been *freed*. | ||
744 | |||
745 | |||
746 | Dcache locking details | ||
747 | ---------------------- | ||
748 | 742 | ||
749 | For many multi-user workloads, open() and stat() on files are | 743 | A small trail through the Linux kernel by Andries Brouwer. 2001 |
750 | very frequently occurring operations. Both involve walking | 744 | <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> |
751 | of path names to find the dentry corresponding to the | ||
752 | concerned file. In 2.4 kernel, dcache_lock was held | ||
753 | during look-up of each path component. Contention and | ||
754 | cache-line bouncing of this global lock caused significant | ||
755 | scalability problems. With the introduction of RCU | ||
756 | in Linux kernel, this was worked around by making | ||
757 | the look-up of path components during path walking lock-free. | ||
758 | |||
759 | |||
760 | Safe lock-free look-up of dcache hash table | ||
761 | =========================================== | ||
762 | |||
763 | Dcache is a complex data structure with the hash table entries | ||
764 | also linked together in other lists. In 2.4 kernel, dcache_lock | ||
765 | protected all the lists. We applied RCU only on hash chain | ||
766 | walking. The rest of the lists are still protected by dcache_lock. | ||
767 | Some of the important changes are : | ||
768 | |||
769 | 1. The deletion from hash chain is done using hlist_del_rcu() macro which | ||
770 | doesn't initialize next pointer of the deleted dentry and this | ||
771 | allows us to walk safely lock-free while a deletion is happening. | ||
772 | |||
773 | 2. Insertion of a dentry into the hash table is done using | ||
774 | hlist_add_head_rcu() which take care of ordering the writes - | ||
775 | the writes to the dentry must be visible before the dentry | ||
776 | is inserted. This works in conjunction with hlist_for_each_rcu() | ||
777 | while walking the hash chain. The only requirement is that | ||
778 | all initialization to the dentry must be done before hlist_add_head_rcu() | ||
779 | since we don't have dcache_lock protection while traversing | ||
780 | the hash chain. This isn't different from the existing code. | ||
781 | |||
782 | 3. The dentry looked up without holding dcache_lock by cannot be | ||
783 | returned for walking if it is unhashed. It then may have a NULL | ||
784 | d_inode or other bogosity since RCU doesn't protect the other | ||
785 | fields in the dentry. We therefore use a flag DCACHE_UNHASHED to | ||
786 | indicate unhashed dentries and use this in conjunction with a | ||
787 | per-dentry lock (d_lock). Once looked up without the dcache_lock, | ||
788 | we acquire the per-dentry lock (d_lock) and check if the | ||
789 | dentry is unhashed. If so, the look-up is failed. If not, the | ||
790 | reference count of the dentry is increased and the dentry is returned. | ||
791 | |||
792 | 4. Once a dentry is looked up, it must be ensured during the path | ||
793 | walk for that component it doesn't go away. In pre-2.5.10 code, | ||
794 | this was done holding a reference to the dentry. dcache_rcu does | ||
795 | the same. In some sense, dcache_rcu path walking looks like | ||
796 | the pre-2.5.10 version. | ||
797 | |||
798 | 5. All dentry hash chain updates must take the dcache_lock as well as | ||
799 | the per-dentry lock in that order. dput() does this to ensure | ||
800 | that a dentry that has just been looked up in another CPU | ||
801 | doesn't get deleted before dget() can be done on it. | ||
802 | |||
803 | 6. There are several ways to do reference counting of RCU protected | ||
804 | objects. One such example is in ipv4 route cache where | ||
805 | deferred freeing (using call_rcu()) is done as soon as | ||
806 | the reference count goes to zero. This cannot be done in | ||
807 | the case of dentries because tearing down of dentries | ||
808 | require blocking (dentry_iput()) which isn't supported from | ||
809 | RCU callbacks. Instead, tearing down of dentries happen | ||
810 | synchronously in dput(), but actual freeing happens later | ||
811 | when RCU grace period is over. This allows safe lock-free | ||
812 | walking of the hash chains, but a matched dentry may have | ||
813 | been partially torn down. The checking of DCACHE_UNHASHED | ||
814 | flag with d_lock held detects such dentries and prevents | ||
815 | them from being returned from look-up. | ||
816 | |||
817 | |||
818 | Maintaining POSIX rename semantics | ||
819 | ================================== | ||
820 | |||
821 | Since look-up of dentries is lock-free, it can race against | ||
822 | a concurrent rename operation. For example, during rename | ||
823 | of file A to B, look-up of either A or B must succeed. | ||
824 | So, if look-up of B happens after A has been removed from the | ||
825 | hash chain but not added to the new hash chain, it may fail. | ||
826 | Also, a comparison while the name is being written concurrently | ||
827 | by a rename may result in false positive matches violating | ||
828 | rename semantics. Issues related to race with rename are | ||
829 | handled as described below : | ||
830 | |||
831 | 1. Look-up can be done in two ways - d_lookup() which is safe | ||
832 | from simultaneous renames and __d_lookup() which is not. | ||
833 | If __d_lookup() fails, it must be followed up by a d_lookup() | ||
834 | to correctly determine whether a dentry is in the hash table | ||
835 | or not. d_lookup() protects look-ups using a sequence | ||
836 | lock (rename_lock). | ||
837 | |||
838 | 2. The name associated with a dentry (d_name) may be changed if | ||
839 | a rename is allowed to happen simultaneously. To avoid memcmp() | ||
840 | in __d_lookup() go out of bounds due to a rename and false | ||
841 | positive comparison, the name comparison is done while holding the | ||
842 | per-dentry lock. This prevents concurrent renames during this | ||
843 | operation. | ||
844 | |||
845 | 3. Hash table walking during look-up may move to a different bucket as | ||
846 | the current dentry is moved to a different bucket due to rename. | ||
847 | But we use hlists in dcache hash table and they are null-terminated. | ||
848 | So, even if a dentry moves to a different bucket, hash chain | ||
849 | walk will terminate. [with a list_head list, it may not since | ||
850 | termination is when the list_head in the original bucket is reached]. | ||
851 | Since we redo the d_parent check and compare name while holding | ||
852 | d_lock, lock-free look-up will not race against d_move(). | ||
853 | |||
854 | 4. There can be a theoretical race when a dentry keeps coming back | ||
855 | to original bucket due to double moves. Due to this look-up may | ||
856 | consider that it has never moved and can end up in a infinite loop. | ||
857 | But this is not any worse that theoretical livelocks we already | ||
858 | have in the kernel. | ||
859 | |||
860 | |||
861 | Important guidelines for filesystem developers related to dcache_rcu | ||
862 | ==================================================================== | ||
863 | |||
864 | 1. Existing dcache interfaces (pre-2.5.62) exported to filesystem | ||
865 | don't change. Only dcache internal implementation changes. However | ||
866 | filesystems *must not* delete from the dentry hash chains directly | ||
867 | using the list macros like allowed earlier. They must use dcache | ||
868 | APIs like d_drop() or __d_drop() depending on the situation. | ||
869 | |||
870 | 2. d_flags is now protected by a per-dentry lock (d_lock). All | ||
871 | access to d_flags must be protected by it. | ||
872 | |||
873 | 3. For a hashed dentry, checking of d_count needs to be protected | ||
874 | by d_lock. | ||
875 | |||
876 | |||
877 | Papers and other documentation on dcache locking | ||
878 | ================================================ | ||
879 | |||
880 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | ||
881 | |||
882 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | ||