Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/files.txt 123
-rw-r--r-- Documentation/filesystems/fuse.txt 315
-rw-r--r-- Documentation/filesystems/proc.txt 42
-rw-r--r-- Documentation/filesystems/v9fs.txt 95
-rw-r--r-- Documentation/filesystems/vfs.txt 435
5 files changed, 888 insertions, 122 deletions
diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
new file mode 100644
index 000000000000..8c206f4e0250
--- /dev/null
+++ b/Documentation/filesystems/files.txt
@@ -0,0 +1,123 @@
1File management in the Linux kernel
2-----------------------------------
3
4This document describes how locking for files (struct file)
5and the file descriptor table (struct files_struct) works.
6
7Up until 2.6.12, the file descriptor table was protected
8with a lock (files->file_lock) and a reference count (files->count).
9->file_lock protected accesses to all the file-related fields
10of the table. ->count was used for sharing the file descriptor
11table between tasks cloned with the CLONE_FILES flag. Typically
12this would be the case for POSIX threads. As with the common
13refcounting model in the kernel, the last task doing
14a put_files_struct() frees the file descriptor (fd) table.
15The files (struct file) themselves are protected using
16a reference count (->f_count).
17
18In the new lock-free model of file descriptor management,
19the reference counting is similar, but the locking is
20based on RCU. The file descriptor table contains multiple
21elements - the fd sets (open_fds and close_on_exec), the
22array of file pointers, the sizes of the sets and the array,
23etc. In order for the updates to appear atomic to
24a lock-free reader, all the elements of the file descriptor
25table are in a separate structure - struct fdtable.
26files_struct contains a pointer to struct fdtable through
27which the actual fd table is accessed. Initially the
28fdtable is embedded in files_struct itself. On a subsequent
29expansion of fdtable, a new fdtable structure is allocated
30and files->fdt points to the new structure. The fdtable
31structure is freed with RCU, and lock-free readers see either
32the old fdtable or the new fdtable, making the update
33appear atomic. Here are the locking rules for
34the fdtable structure -
35
361. All references to the fdtable must be made through
37 the files_fdtable() macro:
38
39 struct fdtable *fdt;
40
41 rcu_read_lock();
42
43 fdt = files_fdtable(files);
44 ....
45 if (n <= fdt->max_fds)
46 ....
47 ...
48 rcu_read_unlock();
49
50 files_fdtable() uses the rcu_dereference() macro, which takes care of
51 the memory barrier requirements for lock-free dereference.
52 The fdtable pointer must be read within the read-side
53 critical section.
54
552. Reading of the fdtable as described above must be protected
56 by rcu_read_lock()/rcu_read_unlock().
57
583. For any update to the fd table, files->file_lock must
59 be held.
60
614. To look up the file structure given an fd, a reader
62 must use either the fcheck() or the fcheck_files() API. These
63 take care of barrier requirements due to lock-free lookup.
64 An example:
65
66 struct file *file;
67
68 rcu_read_lock();
69 file = fcheck(fd);
70 if (file) {
71 ...
72 }
73 ....
74 rcu_read_unlock();
75
765. Handling of the file structures is special. Since the look-up
77 of the fd (fget()/fget_light()) is lock-free, it is possible
78 that the look-up may race with the last put() operation on the
79 file structure. This is avoided using the rcuref APIs
80 on ->f_count:
81
82 rcu_read_lock();
83 file = fcheck_files(files, fd);
84 if (file) {
85 if (rcuref_inc_lf(&file->f_count))
86 *fput_needed = 1;
87 else
88 /* Didn't get the reference, someone's freed */
89 file = NULL;
90 }
91 rcu_read_unlock();
92 ....
93 return file;
94
95 rcuref_inc_lf() detects if the refcount is already zero or
96 goes to zero during the increment. If it does, we fail
97 fget()/fget_light().
98
996. Since both fdtable and file structures can be looked up
100 lock-free, they must be installed using rcu_assign_pointer()
101 API. If they are looked up lock-free, rcu_dereference()
102 must be used. However it is advisable to use files_fdtable()
103 and fcheck()/fcheck_files() which take care of these issues.
104
1057. While updating, the fdtable pointer must be looked up while
106 holding files->file_lock. If ->file_lock is dropped, then
107 another thread may expand the files, thereby creating a new
108 fdtable and making the earlier fdtable pointer stale.
109 For example:
110
111 spin_lock(&files->file_lock);
112 fd = locate_fd(files, file, start);
113 if (fd >= 0) {
114 /* locate_fd() may have expanded fdtable, load the ptr */
115 fdt = files_fdtable(files);
116 FD_SET(fd, fdt->open_fds);
117 FD_CLR(fd, fdt->close_on_exec);
118 spin_unlock(&files->file_lock);
119 .....
120
121 Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
122 the fdtable pointer (fdt) must be loaded after locate_fd().
123
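The "take a reference only if the count is not already zero" idea from
rule 5 can be modelled in userspace with C11 atomics. This is only a sketch
of the technique, not the kernel's rcuref implementation: struct demo_file,
demo_ref_inc_not_zero() and demo_fget() are hypothetical names, and the RCU
read-side critical section that keeps the object's memory valid during the
lookup is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct file; only the count matters. */
struct demo_file {
	atomic_int f_count;
};

/*
 * Take a reference only while the count is still non-zero.
 * Returns true on success, false if the last reference is already gone.
 */
static bool demo_ref_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure the CAS reloads 'old' with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}

/* Model of the lookup path: the file with a reference held, or NULL. */
static struct demo_file *demo_fget(struct demo_file *file)
{
	if (file && demo_ref_inc_not_zero(&file->f_count))
		return file;
	return NULL;
}

static void demo_fget_example(void)
{
	struct demo_file live = { .f_count = 1 };
	struct demo_file dying = { .f_count = 0 };

	assert(demo_fget(&live) == &live);        /* reference taken */
	assert(atomic_load(&live.f_count) == 2);
	assert(demo_fget(&dying) == NULL);        /* lookup fails */
	assert(atomic_load(&dying.f_count) == 0); /* count left untouched */
}
```

A plain atomic increment would not be enough here, because it could
resurrect an object whose count has already dropped to zero; the
compare-and-exchange loop refuses to do that.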
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
new file mode 100644
index 000000000000..6b5741e651a2
--- /dev/null
+++ b/Documentation/filesystems/fuse.txt
@@ -0,0 +1,315 @@
1Definitions
2~~~~~~~~~~~
3
4Userspace filesystem:
5
6 A filesystem in which data and metadata are provided by an ordinary
7 userspace process. The filesystem can be accessed normally through
8 the kernel interface.
9
10Filesystem daemon:
11
12 The process(es) providing the data and metadata of the filesystem.
13
14Non-privileged mount (or user mount):
15
16 A userspace filesystem mounted by a non-privileged (non-root) user.
17 The filesystem daemon is running with the privileges of the mounting
18 user. NOTE: this is not the same as mounts allowed with the "user"
19 option in /etc/fstab, which is not discussed here.
20
21Mount owner:
22
23 The user who does the mounting.
24
25User:
26
27 The user who is performing filesystem operations.
28
29What is FUSE?
30~~~~~~~~~~~~~
31
32FUSE is a userspace filesystem framework. It consists of a kernel
33module (fuse.ko), a userspace library (libfuse.*) and a mount utility
34(fusermount).
35
36One of the most important features of FUSE is allowing secure,
37non-privileged mounts. This opens up new possibilities for the use of
38filesystems. A good example is sshfs: a secure network filesystem
39using the sftp protocol.
40
41The userspace library and utilities are available from the FUSE
42homepage:
43
44 http://fuse.sourceforge.net/
45
46Mount options
47~~~~~~~~~~~~~
48
49'fd=N'
50
51 The file descriptor to use for communication between the userspace
52 filesystem and the kernel. The file descriptor must have been
53 obtained by opening the FUSE device ('/dev/fuse').
54
55'rootmode=M'
56
57 The file mode of the filesystem's root in octal representation.
58
59'user_id=N'
60
61 The numeric user id of the mount owner.
62
63'group_id=N'
64
65 The numeric group id of the mount owner.
66
67'default_permissions'
68
69 By default FUSE doesn't check file access permissions; the
70 filesystem is free to implement its access policy or leave it to
71 the underlying file access mechanism (e.g. in case of network
72 filesystems). This option enables permission checking, restricting
73 access based on file mode. This option is usually useful
74 together with the 'allow_other' mount option.
75
76'allow_other'
77
78 This option overrides the security measure restricting file access
79 to the user mounting the filesystem. This option is by default only
80 allowed to root, but this restriction can be removed with a
81 (userspace) configuration option.
82
83'max_read=N'
84
85 With this option the maximum size of read operations can be set.
86 The default is infinite. Note that the size of read requests is
87 limited anyway to 32 pages (which is 128kbyte on i386).
88
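As an illustration of the options above, the following sketch builds an
option string from 'fd', 'rootmode', 'user_id' and 'group_id'. The function
name demo_fuse_mount_opts() is hypothetical, and joining the options with
commas is an assumption based on the usual mount option syntax; the only
detail taken from the text is that rootmode is written in octal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a mount option string from the documented options.
 * rootmode is printed in octal, as the option description requires.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int demo_fuse_mount_opts(char *buf, size_t len, int fd,
				unsigned int rootmode,
				unsigned int uid, unsigned int gid)
{
	int n = snprintf(buf, len, "fd=%d,rootmode=%o,user_id=%u,group_id=%u",
			 fd, rootmode, uid, gid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

static void demo_fuse_mount_opts_example(void)
{
	char opts[128];

	/* 040000 is the directory file type bit (S_IFDIR) on Linux. */
	assert(demo_fuse_mount_opts(opts, sizeof(opts), 3, 040000,
				    1000, 1000) == 0);
	assert(strcmp(opts,
		      "fd=3,rootmode=40000,user_id=1000,group_id=1000") == 0);
}
```

In practice the fd value would come from opening '/dev/fuse', and the
string would be handed to the mount helper rather than assembled by hand.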
89How do non-privileged mounts work?
90~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
91
92Since the mount() system call is a privileged operation, a helper
93program (fusermount) is needed, which is installed setuid root.
94
95The implication of providing non-privileged mounts is that the mount
96owner must not be able to use this capability to compromise the
97system. Obvious requirements arising from this are:
98
99 A) mount owner should not be able to get elevated privileges with the
100 help of the mounted filesystem
101
102 B) mount owner should not get illegitimate access to information from
103 other users' and the super user's processes
104
105 C) mount owner should not be able to induce undesired behavior in
106 other users' or the super user's processes
107
108How are requirements fulfilled?
109~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
110
111 A) The mount owner could gain elevated privileges by either:
112
113 1) creating a filesystem containing a device file, then opening
114 this device
115
116 2) creating a filesystem containing a suid or sgid application,
117 then executing this application
118
119 The solution is not to allow opening device files and ignore
120 setuid and setgid bits when executing programs. To ensure this
121 fusermount always adds "nosuid" and "nodev" to the mount options
122 for non-privileged mounts.
123
124 B) If another user is accessing files or directories in the
125 filesystem, the filesystem daemon serving requests can record the
126 exact sequence and timing of operations performed. This
127 information is otherwise inaccessible to the mount owner, so this
128 counts as an information leak.
129
130 The solution to this problem will be presented in point 2) of C).
131
132 C) There are several ways in which the mount owner can induce
133 undesired behavior in other users' processes, such as:
134
135 1) mounting a filesystem over a file or directory which the mount
136 owner would otherwise not be able to modify (or could only
137 make limited modifications to).
138
139 This is solved in fusermount, by checking the access
140 permissions on the mountpoint and only allowing the mount if
141 the mount owner can do unlimited modification (has write
142 access to the mountpoint, and mountpoint is not a "sticky"
143 directory)
144
145 2) Even if 1) is solved the mount owner can change the behavior
146 of other users' processes.
147
148 i) It can slow down or indefinitely delay the execution of a
149 filesystem operation, creating a DoS against the user or the
150 whole system. For example, a suid application locking a
151 system file and then accessing a file on the mount owner's
152 filesystem could be stopped, thus causing the system
153 file to be locked forever.
154
155 ii) It can present files or directories of unlimited length, or
156 directory structures of unlimited depth, possibly causing a
157 system process to eat up diskspace, memory or other
158 resources, again causing DoS.
159
160 The solution to this, as well as to B), is not to allow processes
161 to access the filesystem if they could not otherwise be
162 monitored or manipulated by the mount owner. If the
163 mount owner can ptrace a process, it can do all of the above
164 without using a FUSE mount, so the same criteria as used in
165 ptrace can be used to check if a process is allowed to access
166 the filesystem or not.
167
168 Note that the ptrace check is not strictly necessary to
169 prevent C/2/i; it is enough to check if the mount owner has enough
170 privilege to send a signal to the process accessing the
171 filesystem, since SIGSTOP can be used to get a similar effect.
172
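The mountpoint rule from C) 1) (write access, and not a "sticky"
directory) can be written as a small predicate. This is a sketch of the
rule as stated in the text, not the actual fusermount code:
demo_mountpoint_ok() is a hypothetical name, and the caller is assumed to
have already determined whether the mount owner has write access.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for S_IFDIR and S_ISVTX */
#endif
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Allow the mount only if the mount owner can modify the mountpoint
 * without restriction: write access is required, and a sticky
 * directory is refused because it limits what the owner may change.
 */
static bool demo_mountpoint_ok(mode_t mode, bool owner_can_write)
{
	if (S_ISDIR(mode) && (mode & S_ISVTX))
		return false;
	return owner_can_write;
}

static void demo_mountpoint_ok_example(void)
{
	assert(demo_mountpoint_ok(S_IFDIR | 0755, true));   /* own directory */
	assert(!demo_mountpoint_ok(S_IFDIR | 01777, true)); /* sticky, world-writable */
	assert(!demo_mountpoint_ok(S_IFDIR | 0755, false)); /* no write access */
}
```

The mode would typically come from stat() on the mountpoint; how write
access is established for the invoking user is left out of this sketch.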
173I think these limitations are unacceptable?
174~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
175
176If a sysadmin trusts the users enough, or can ensure through other
177measures that system processes will never enter non-privileged
178mounts, they can relax the last limitation with a "user_allow_other"
179config option. If this config option is set, the mounting user can
180add the "allow_other" mount option which disables the check for other
181users' processes.
182
183Kernel - userspace interface
184~~~~~~~~~~~~~~~~~~~~~~~~~~~~
185
186The following diagram shows how a filesystem operation (in this
187example unlink) is performed in FUSE.
188
189NOTE: everything in this description is greatly simplified
190
191 | "rm /mnt/fuse/file" | FUSE filesystem daemon
192 | |
193 | | >sys_read()
194 | | >fuse_dev_read()
195 | | >request_wait()
196 | | [sleep on fc->waitq]
197 | |
198 | >sys_unlink() |
199 | >fuse_unlink() |
200 | [get request from |
201 | fc->unused_list] |
202 | >request_send() |
203 | [queue req on fc->pending] |
204 | [wake up fc->waitq] | [woken up]
205 | >request_wait_answer() |
206 | [sleep on req->waitq] |
207 | | <request_wait()
208 | | [remove req from fc->pending]
209 | | [copy req to read buffer]
210 | | [add req to fc->processing]
211 | | <fuse_dev_read()
212 | | <sys_read()
213 | |
214 | | [perform unlink]
215 | |
216 | | >sys_write()
217 | | >fuse_dev_write()
218 | | [look up req in fc->processing]
219 | | [remove from fc->processing]
220 | | [copy write buffer to req]
221 | [woken up] | [wake up req->waitq]
222 | | <fuse_dev_write()
223 | | <sys_write()
224 | <request_wait_answer() |
225 | <request_send() |
226 | [add request to |
227 | fc->unused_list] |
228 | <fuse_unlink() |
229 | <sys_unlink() |
230
231There are a couple of ways in which to deadlock a FUSE filesystem.
232Since we are talking about unprivileged userspace programs,
233something must be done about these.
234
235Scenario 1 - Simple deadlock
236-----------------------------
237
238 | "rm /mnt/fuse/file" | FUSE filesystem daemon
239 | |
240 | >sys_unlink("/mnt/fuse/file") |
241 | [acquire inode semaphore |
242 | for "file"] |
243 | >fuse_unlink() |
244 | [sleep on req->waitq] |
245 | | <sys_read()
246 | | >sys_unlink("/mnt/fuse/file")
247 | | [acquire inode semaphore
248 | | for "file"]
249 | | *DEADLOCK*
250
251The solution for this is to allow requests to be interrupted while
252they are in userspace:
253
254 | [interrupted by signal] |
255 | <fuse_unlink() |
256 | [release semaphore] | [semaphore acquired]
257 | <sys_unlink() |
258 | | >fuse_unlink()
259 | | [queue req on fc->pending]
260 | | [wake up fc->waitq]
261 | | [sleep on req->waitq]
262
263If the filesystem daemon was single threaded, this will stop here,
264since there's no other thread to dequeue and execute the request.
265In this case the solution is to kill the FUSE daemon as well. If
266there are multiple serving threads, you just have to kill them as
267long as any remain.
268
269Moral: a filesystem which deadlocks, can soon find itself dead.
270
271Scenario 2 - Tricky deadlock
272----------------------------
273
274This one needs a carefully crafted filesystem. It's a variation on
275the above, only the call back to the filesystem is not explicit,
276but is caused by a pagefault.
277
278 | Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
279 | |
280 | [fd = open("/mnt/fuse/file")] | [request served normally]
281 | [mmap fd to 'addr'] |
282 | [close fd] | [FLUSH triggers 'magic' flag]
283 | [read a byte from addr] |
284 | >do_page_fault() |
285 | [find or create page] |
286 | [lock page] |
287 | >fuse_readpage() |
288 | [queue READ request] |
289 | [sleep on req->waitq] |
290 | | [read request to buffer]
291 | | [create reply header before addr]
292 | | >sys_write(addr - headerlength)
293 | | >fuse_dev_write()
294 | | [look up req in fc->processing]
295 | | [remove from fc->processing]
296 | | [copy write buffer to req]
297 | | >do_page_fault()
298 | | [find or create page]
299 | | [lock page]
300 | | * DEADLOCK *
301
302The solution is again to let the request be interrupted (not
303elaborated further).
304
305An additional problem is that while the write buffer is being
306copied to the request, the request must not be interrupted. This
307is because the destination address of the copy may not be valid
308after the request is interrupted.
309
310This is solved by doing the copy atomically, and allowing
311interruption while the page(s) belonging to the write buffer are
312faulted with get_user_pages(). The 'req->locked' flag indicates
313when the copy is taking place, and interruption is delayed until
314this flag is unset.
315
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5024ba7a592c..d4773565ea2f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1241,16 +1241,38 @@ swap-intensive.
 overcommit_memory
 -----------------
 
-This file contains one value. The following algorithm is used to decide if
-there's enough memory: if the value of overcommit_memory is positive, then
-there's always enough memory. This is a useful feature, since programs often
-malloc() huge amounts of memory 'just in case', while they only use a small
-part of it. Leaving this value at 0 will lead to the failure of such a huge
-malloc(), when in fact the system has enough memory for the program to run.
-
-On the other hand, enabling this feature can cause you to run out of memory
-and thrash the system to death, so large and/or important servers will want to
-set this value to 0.
+Controls overcommit of system memory, possibly allowing processes
+to allocate (but not use) more memory than is actually available.
+
+
+0 - Heuristic overcommit handling. Obvious overcommits of
+    address space are refused. Used for a typical system. It
+    ensures a seriously wild allocation fails while allowing
+    overcommit to reduce swap usage. root is allowed to
+    allocate slightly more memory in this mode. This is the
+    default.
+
+1 - Always overcommit. Appropriate for some scientific
+    applications.
+
+2 - Don't overcommit. The total address space commit
+    for the system is not permitted to exceed swap plus a
+    configurable percentage (default is 50) of physical RAM.
+    Depending on the percentage you use, in most situations
+    this means a process will not be killed while attempting
+    to use already-allocated memory but will receive errors
+    on memory allocation as appropriate.
+
+overcommit_ratio
+----------------
+
+Percentage of physical memory size to include in overcommit calculations
+(see above.)
+
+Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
+
+    swapspace = total size of all swap areas
+    physmem = size of physical memory in system
 
 nr_hugepages and hugetlb_shm_group
 ----------------------------------
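The overcommit_ratio formula added in the proc.txt hunk above can be
checked with a small worked example. This is a sketch only: the function
name demo_commit_limit_kb() and the choice of kilobytes as the unit are
assumptions made for illustration.

```c
#include <assert.h>

/*
 * limit = swapspace + physmem * (overcommit_ratio / 100)
 *
 * The multiplication is done before the division so that integer
 * arithmetic does not truncate ratios below 100 to zero.
 */
static unsigned long long demo_commit_limit_kb(unsigned long long swap_kb,
					       unsigned long long phys_kb,
					       unsigned int overcommit_ratio)
{
	return swap_kb + phys_kb * overcommit_ratio / 100;
}

static void demo_commit_limit_example(void)
{
	/* 2 GiB of swap, 8 GiB of RAM, default ratio of 50: limit is 6 GiB. */
	assert(demo_commit_limit_kb(2097152ULL, 8388608ULL, 50)
	       == 6291456ULL);
}
```

With a ratio of 0 the limit collapses to the swap size alone, and with a
ratio of 100 it is swap plus all of physical memory.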
diff --git a/Documentation/filesystems/v9fs.txt b/Documentation/filesystems/v9fs.txt
new file mode 100644
index 000000000000..4e92feb6b507
--- /dev/null
+++ b/Documentation/filesystems/v9fs.txt
@@ -0,0 +1,95 @@
1 V9FS: 9P2000 for Linux
2 ======================
3
4ABOUT
5=====
6
7v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
8
9This software was originally developed by Ron Minnich <rminnich@lanl.gov>
10and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson
11<gwatson@lanl.gov> and most recently Eric Van Hensbergen
12<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>.
13
14USAGE
15=====
16
17For remote file server:
18
19 mount -t 9P 10.10.1.2 /mnt/9
20
21For Plan 9 From User Space applications (http://swtch.com/plan9)
22
23 mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER
24
25OPTIONS
26=======
27
28 proto=name select an alternative transport. Valid options are
29 currently:
30 unix - specifying a named pipe mount point
31 tcp - specifying a normal TCP/IP connection
32 fd - use passed-in file descriptors for the connection
33 (see rfdno and wfdno)
34
35 name=name user name to attempt mount as on the remote server. The
36 server may override or ignore this value. Certain user
37 names may require authentication.
38
39 aname=name aname specifies the file tree to access when the server is
40 offering several exported file systems.
41
42 debug=n specifies debug level. The debug level is a bitmask.
43 0x01 = display verbose error messages
44 0x02 = developer debug (DEBUG_CURRENT)
45 0x04 = display 9P trace
46 0x08 = display VFS trace
47 0x10 = display Marshalling debug
48 0x20 = display RPC debug
49 0x40 = display transport debug
50 0x80 = display allocation debug
51
52 rfdno=n the file descriptor for reading with proto=fd
53
54 wfdno=n the file descriptor for writing with proto=fd
55
56 maxdata=n the number of bytes to use for 9P packet payload (msize)
57
58 port=n port to connect to on the remote server
59
60 timeout=n request timeouts (in ms) (default 60000ms)
61
62 noextend force legacy mode (no 9P2000.u semantics)
63
64 uid attempt to mount as a particular uid
65
66 gid attempt to mount with a particular gid
67
68 afid security channel - used by Plan 9 authentication protocols
69
70 nodevmap do not map special files - represent them as normal files.
71 This can be used to share devices/named pipes/sockets between
72 hosts. This functionality will be expanded in later versions.
73
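Because the debug level is a bitmask, several trace categories can be
enabled at once by OR-ing their values. The sketch below uses the bit
values from the debug=n table above; the enumerator names and the helper
demo_9p_debug_enabled() are made up for illustration and are not taken
from the v9fs sources.

```c
#include <assert.h>

/* Bit values are from the debug=n table; the names are hypothetical. */
enum demo_9p_debug {
	DEMO_9P_DEBUG_ERROR   = 0x01,	/* verbose error messages */
	DEMO_9P_DEBUG_CURRENT = 0x02,	/* developer debug */
	DEMO_9P_DEBUG_9P      = 0x04,	/* 9P trace */
	DEMO_9P_DEBUG_VFS     = 0x08,	/* VFS trace */
	DEMO_9P_DEBUG_CONV    = 0x10,	/* marshalling debug */
	DEMO_9P_DEBUG_RPC     = 0x20,	/* RPC debug */
	DEMO_9P_DEBUG_TRANS   = 0x40,	/* transport debug */
	DEMO_9P_DEBUG_ALLOC   = 0x80	/* allocation debug */
};

/* Return non-zero if the given category is enabled in the mask. */
static int demo_9p_debug_enabled(unsigned int mask, enum demo_9p_debug flag)
{
	return (mask & (unsigned int)flag) != 0;
}

static void demo_9p_debug_example(void)
{
	/* Verbose errors plus 9P trace. */
	unsigned int mask = DEMO_9P_DEBUG_ERROR | DEMO_9P_DEBUG_9P;

	assert(mask == 0x05);
	assert(demo_9p_debug_enabled(mask, DEMO_9P_DEBUG_9P));
	assert(!demo_9p_debug_enabled(mask, DEMO_9P_DEBUG_VFS));
}
```

The resulting value is what would be passed as n in the debug=n mount
option.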
74RESOURCES
75=========
76
77The Linux version of the 9P server, along with some client-side utilities
78can be found at http://v9fs.sf.net (along with a CVS repository of the
79development branch of this module). There are user and developer mailing
80lists there, as well as a bug-tracker.
81
82For more information on the Plan 9 Operating System check out
83http://plan9.bell-labs.com/plan9
84
85For information on Plan 9 from User Space (Plan 9 applications and libraries
86ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
87
88
89STATUS
90======
91
92The 2.6 kernel support is working on PPC and x86.
93
94PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS.
95
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3f318dd44c77..f042c12e0ed2 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -1,35 +1,27 @@
-/* -*- auto-fill -*- */
 
-	Overview of the Virtual File System
+	Overview of the Linux Virtual File System
 
-	Richard Gooch <rgooch@atnf.csiro.au>
+	Original author: Richard Gooch <rgooch@atnf.csiro.au>
 
-	5-JUL-1999
+	Last updated on August 25, 2005
 
+  Copyright (C) 1999 Richard Gooch
+  Copyright (C) 2005 Pekka Enberg
 
-Conventions used in this document <section>
-=================================
+  This file is released under the GPLv2.
 
-Each section in this document will have the string "<section>" at the
-right-hand side of the section title. Each subsection will have
-"<subsection>" at the right-hand side. These strings are meant to make
-it easier to search through the document.
 
-NOTE that the master copy of this document is available online at:
-http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
-
-
-What is it? <section>
+What is it?
 ===========
 
 The Virtual File System (otherwise known as the Virtual Filesystem
 Switch) is the software layer in the kernel that provides the
 filesystem interface to userspace programs. It also provides an
 abstraction within the kernel which allows different filesystem
-implementations to co-exist.
+implementations to coexist.
 
 
-A Quick Look At How It Works <section>
+A Quick Look At How It Works
 ============================
 
 In this section I'll briefly describe how things work, before
@@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the
 other view which is how a filesystem is supported and subsequently
 mounted.
 
-Opening a File <subsection>
+
+Opening a File
 --------------
 
 The VFS implements the open(2), stat(2), chmod(2) and similar system
@@ -77,7 +70,7 @@ back to userspace.
 
 Opening a file requires another operation: allocation of a file
 structure (this is the kernel-side implementation of file
-descriptors). The freshly allocated file structure is initialised with
+descriptors). The freshly allocated file structure is initialized with
 a pointer to the dentry and a set of file operation member functions.
 These are taken from the inode data. The open() file method is then
 called so the specific filesystem implementation can do it's work. You
@@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different
 processors. You should ensure that access to shared resources is
 protected by appropriate locks.
 
-Registering and Mounting a Filesystem <subsection>
+
+Registering and Mounting a Filesystem
 -------------------------------------
 
 If you want to support a new kind of filesystem in the kernel, all you
@@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem.
 It's now time to look at things in more detail.
 
 
-struct file_system_type <section>
+struct file_system_type
 =======================
 
-This describes the filesystem. As of kernel 2.1.99, the following
+This describes the filesystem. As of kernel 2.6.13, the following
 members are defined:
 
 struct file_system_type {
 	const char *name;
 	int fs_flags;
-	struct super_block *(*read_super) (struct super_block *, void *, int);
-	struct file_system_type * next;
+	struct super_block *(*get_sb) (struct file_system_type *, int,
+				       const char *, void *);
+	void (*kill_sb) (struct super_block *);
+	struct module *owner;
+	struct file_system_type * next;
+	struct list_head fs_supers;
 };
 
   name: the name of the filesystem type, such as "ext2", "iso9660",
@@ -141,51 +139,97 @@ struct file_system_type {
 
   fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
 
-  read_super: the method to call when a new instance of this
+  get_sb: the method to call when a new instance of this
 	filesystem should be mounted
 
-  next: for internal VFS use: you should initialise this to NULL
+  kill_sb: the method to call when an instance of this filesystem
+	should be unmounted
+
+  owner: for internal VFS use: you should initialize this to THIS_MODULE in
+	most cases.
 
-The read_super() method has the following arguments:
+  next: for internal VFS use: you should initialize this to NULL
+
+The get_sb() method has the following arguments:
 
   struct super_block *sb: the superblock structure. This is partially
-	initialised by the VFS and the rest must be initialised by the
-	read_super() method
+	initialized by the VFS and the rest must be initialized by the
+	get_sb() method
+
+  int flags: mount flags
+
+  const char *dev_name: the device name we are mounting.
 
   void *data: arbitrary mount options, usually comes as an ASCII
 	string
 
   int silent: whether or not to be silent on error
 
-The read_super() method must determine if the block device specified
+The get_sb() method must determine if the block device specified
 in the superblock contains a filesystem of the type the method
 supports. On success the method returns the superblock pointer, on
 failure it returns NULL.
 
 The most interesting member of the superblock structure that the
-read_super() method fills in is the "s_op" field. This is a pointer to
+get_sb() method fills in is the "s_op" field. This is a pointer to
 a "struct super_operations" which describes the next level of the
 filesystem implementation.
 
+Usually, a filesystem uses one of the generic get_sb()
+implementations and provides a fill_super() method instead. The
+generic methods are:
+
+  get_sb_bdev: mount a filesystem residing on a block device
 
-struct super_operations <section>
+  get_sb_nodev: mount a filesystem that is not backed by a device
+
+  get_sb_single: mount a filesystem which shares the instance between
+	all mounts
+
+A fill_super() method implementation has the following arguments:
+
+  struct super_block *sb: the superblock structure. The method fill_super()
+	must initialize this properly.
+
+  void *data: arbitrary mount options, usually comes as an ASCII
+	string
+
+  int silent: whether or not to be silent on error
+
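The split between a generic get_sb() helper and a filesystem-supplied
fill_super() callback can be modelled in plain userspace C. Everything in
this sketch is hypothetical (struct demo_super_block, demo_get_sb_nodev(),
demo_fill_super()); none of it is the kernel API. Only the callback's
argument order (sb, data, silent) and the "NULL on failure" convention
come from the text, and the int return value of the callback is an
assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in; this is NOT struct super_block. */
struct demo_super_block {
	const char *s_id;	/* label set by the fill_super callback */
	int s_ready;		/* non-zero once setup has succeeded */
};

/* Same argument order the text gives for fill_super(). */
typedef int (*demo_fill_super_t)(struct demo_super_block *sb, void *data,
				 int silent);

/*
 * Model of a generic helper: it owns allocation and cleanup and
 * delegates the filesystem-specific setup to the callback.
 * Returns NULL on failure.
 */
static struct demo_super_block *demo_get_sb_nodev(void *data, int silent,
						  demo_fill_super_t fill_super)
{
	struct demo_super_block *sb;

	assert(fill_super != NULL);
	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	if (fill_super(sb, data, silent) != 0) {
		free(sb);
		return NULL;
	}
	return sb;
}

/* Example callback: rejects missing options, otherwise fills in sb. */
static int demo_fill_super(struct demo_super_block *sb, void *data, int silent)
{
	(void)silent;
	if (!data)
		return -1;
	sb->s_id = "demo";
	sb->s_ready = 1;
	return 0;
}
```

The point of the design is that the filesystem author writes only the
callback, while allocation and error unwinding live in one shared helper.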
+
+struct super_operations
 =======================
 
 This describes how the VFS can manipulate the superblock of your
-filesystem. As of kernel 2.1.99, the following members are defined:
+filesystem. As of kernel 2.6.13, the following members are defined:
 
 struct super_operations {
-	void (*read_inode) (struct inode *);
-	int (*write_inode) (struct inode *, int);
-	void (*put_inode) (struct inode *);
-	void (*drop_inode) (struct inode *);
-	void (*delete_inode) (struct inode *);
-	int (*notify_change) (struct dentry *, struct iattr *);
-	void (*put_super) (struct super_block *);
-	void (*write_super) (struct super_block *);
-	int (*statfs) (struct super_block *, struct statfs *, int);
-	int (*remount_fs) (struct super_block *, int *, char *);
-	void (*clear_inode) (struct inode *);
+	struct inode *(*alloc_inode)(struct super_block *sb);
+	void (*destroy_inode)(struct inode *);
+
+	void (*read_inode) (struct inode *);
+
+	void (*dirty_inode) (struct inode *);
+	int (*write_inode) (struct inode *, int);
+	void (*put_inode) (struct inode *);
+	void (*drop_inode) (struct inode *);
+	void (*delete_inode) (struct inode *);
+	void (*put_super) (struct super_block *);
+	void (*write_super) (struct super_block *);
+	int (*sync_fs)(struct super_block *sb, int wait);
+	void (*write_super_lockfs) (struct super_block *);
+	void (*unlockfs) (struct super_block *);
+	int (*statfs) (struct super_block *, struct kstatfs *);
+	int (*remount_fs) (struct super_block *, int *, char *);
+	void (*clear_inode) (struct inode *);
+	void (*umount_begin) (struct super_block *);
+
+	void (*sync_inodes) (struct super_block *sb,
+				struct writeback_control *wbc);
+	int (*show_options)(struct seq_file *, struct vfsmount *);
+
+	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 };
 
 All methods are called without any locks being held, unless otherwise
@@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are
193only called from a process context (i.e. not from an interrupt handler 237only called from a process context (i.e. not from an interrupt handler
194or bottom half). 238or bottom half).
195 239
240 alloc_inode: this method is called by alloc_inode() to allocate memory
241 for struct inode and initialize it.
242
243 destroy_inode: this method is called by destroy_inode() to release
244 resources allocated for struct inode.
245
196 read_inode: this method is called to read a specific inode from the 246 read_inode: this method is called to read a specific inode from the
197 mounted filesystem. The "i_ino" member in the "struct inode" 247 mounted filesystem. The i_ino member in the struct inode is
198 will be initialised by the VFS to indicate which inode to 248 initialized by the VFS to indicate which inode to read. Other
199 read. Other members are filled in by this method 249 members are filled in by this method.
250
251 You can set this to NULL and use iget5_locked() instead of iget()
252 to read inodes. This is necessary for filesystems for which the
253 inode number is not sufficient to identify an inode.
254
255 dirty_inode: this method is called by the VFS to mark an inode dirty.
200 256
201 write_inode: this method is called when the VFS needs to write an 257 write_inode: this method is called when the VFS needs to write an
202 inode to disc. The second parameter indicates whether the write 258 inode to disc. The second parameter indicates whether the write
203 should be synchronous or not, not all filesystems check this flag. 259 should be synchronous or not, not all filesystems check this flag.
204 260
205 put_inode: called when the VFS inode is removed from the inode 261 put_inode: called when the VFS inode is removed from the inode
206 cache. This method is optional 262 cache.
207 263
208 drop_inode: called when the last access to the inode is dropped, 264 drop_inode: called when the last access to the inode is dropped,
209 with the inode_lock spinlock held. 265 with the inode_lock spinlock held.
210 266
211 This method should be either NULL (normal unix filesystem 267 This method should be either NULL (normal UNIX filesystem
212 semantics) or "generic_delete_inode" (for filesystems that do not 268 semantics) or "generic_delete_inode" (for filesystems that do not
213 want to cache inodes - causing "delete_inode" to always be 269 want to cache inodes - causing "delete_inode" to always be
214 called regardless of the value of i_nlink) 270 called regardless of the value of i_nlink)
215 271
216 The "generic_delete_inode()" behaviour is equivalent to the 272 The "generic_delete_inode()" behavior is equivalent to the
217 old practice of using "force_delete" in the put_inode() case, 273 old practice of using "force_delete" in the put_inode() case,
218 but does not have the races that the "force_delete()" approach 274 but does not have the races that the "force_delete()" approach
219 had. 275 had.
220 276
221 delete_inode: called when the VFS wants to delete an inode 277 delete_inode: called when the VFS wants to delete an inode
222 278
223 notify_change: called when VFS inode attributes are changed. If this
224 is NULL the VFS falls back to the write_inode() method. This
225 is called with the kernel lock held
226
227 put_super: called when the VFS wishes to free the superblock 279 put_super: called when the VFS wishes to free the superblock
228 (i.e. unmount). This is called with the superblock lock held 280 (i.e. unmount). This is called with the superblock lock held
229 281
230 write_super: called when the VFS superblock needs to be written to 282 write_super: called when the VFS superblock needs to be written to
231 disc. This method is optional 283 disc. This method is optional
232 284
285 sync_fs: called when VFS is writing out all dirty data associated with
286 a superblock. The second parameter indicates whether the method
287 should wait until the write out has been completed. Optional.
288
289 write_super_lockfs: called when VFS is locking a filesystem and forcing
290 it into a consistent state. This function is currently used by the
291 Logical Volume Manager (LVM).
292
293 unlockfs: called when VFS is unlocking a filesystem and making it writable
294 again.
295
233 statfs: called when the VFS needs to get filesystem statistics. This 296 statfs: called when the VFS needs to get filesystem statistics. This
234 is called with the kernel lock held 297 is called with the kernel lock held
235 298
@@ -238,21 +301,31 @@ or bottom half).
238 301
239 clear_inode: called when the VFS clears the inode. Optional 302 clear_inode: called when the VFS clears the inode. Optional
240 303
304 umount_begin: called when the VFS is unmounting a filesystem.
305
306 sync_inodes: called when the VFS is writing out dirty data associated with
307 a superblock.
308
309 show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
310
311 quota_read: called by the VFS to read from filesystem quota file.
312
313 quota_write: called by the VFS to write to filesystem quota file.
314
241The read_inode() method is responsible for filling in the "i_op" 315The read_inode() method is responsible for filling in the "i_op"
242field. This is a pointer to a "struct inode_operations" which 316field. This is a pointer to a "struct inode_operations" which
243describes the methods that can be performed on individual inodes. 317describes the methods that can be performed on individual inodes.
244 318
245 319
246struct inode_operations <section> 320struct inode_operations
247======================= 321=======================
248 322
249This describes how the VFS can manipulate an inode in your 323This describes how the VFS can manipulate an inode in your
250filesystem. As of kernel 2.1.99, the following members are defined: 324filesystem. As of kernel 2.6.13, the following members are defined:
251 325
252struct inode_operations { 326struct inode_operations {
253 struct file_operations * default_file_ops; 327 int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
254 int (*create) (struct inode *,struct dentry *,int); 328 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
255 int (*lookup) (struct inode *,struct dentry *);
256 int (*link) (struct dentry *,struct inode *,struct dentry *); 329 int (*link) (struct dentry *,struct inode *,struct dentry *);
257 int (*unlink) (struct inode *,struct dentry *); 330 int (*unlink) (struct inode *,struct dentry *);
258 int (*symlink) (struct inode *,struct dentry *,const char *); 331 int (*symlink) (struct inode *,struct dentry *,const char *);
@@ -261,25 +334,22 @@ struct inode_operations {
261 int (*mknod) (struct inode *,struct dentry *,int,dev_t); 334 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
262 int (*rename) (struct inode *, struct dentry *, 335 int (*rename) (struct inode *, struct dentry *,
263 struct inode *, struct dentry *); 336 struct inode *, struct dentry *);
264 int (*readlink) (struct dentry *, char *,int); 337 int (*readlink) (struct dentry *, char __user *,int);
265 struct dentry * (*follow_link) (struct dentry *, struct dentry *); 338 void * (*follow_link) (struct dentry *, struct nameidata *);
266 int (*readpage) (struct file *, struct page *); 339 void (*put_link) (struct dentry *, struct nameidata *, void *);
267 int (*writepage) (struct page *page, struct writeback_control *wbc);
268 int (*bmap) (struct inode *,int);
269 void (*truncate) (struct inode *); 340 void (*truncate) (struct inode *);
270 int (*permission) (struct inode *, int); 341 int (*permission) (struct inode *, int, struct nameidata *);
271 int (*smap) (struct inode *,int); 342 int (*setattr) (struct dentry *, struct iattr *);
272 int (*updatepage) (struct file *, struct page *, const char *, 343 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
273 unsigned long, unsigned int, int); 344 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
274 int (*revalidate) (struct dentry *); 345 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
346 ssize_t (*listxattr) (struct dentry *, char *, size_t);
347 int (*removexattr) (struct dentry *, const char *);
275}; 348};
276 349
277Again, all methods are called without any locks being held, unless 350Again, all methods are called without any locks being held, unless
278otherwise noted. 351otherwise noted.
279 352
280 default_file_ops: this is a pointer to a "struct file_operations"
281 which describes how to open and then manipulate open files
282
283 create: called by the open(2) and creat(2) system calls. Only 353 create: called by the open(2) and creat(2) system calls. Only
284 required if you want to support regular files. The dentry you 354 required if you want to support regular files. The dentry you
285 get should not have an inode (i.e. it should be a negative 355 get should not have an inode (i.e. it should be a negative
@@ -328,31 +398,143 @@ otherwise noted.
328 you want to support reading symbolic links 398 you want to support reading symbolic links
329 399
330 follow_link: called by the VFS to follow a symbolic link to the 400 follow_link: called by the VFS to follow a symbolic link to the
331 inode it points to. Only required if you want to support 401 inode it points to. Only required if you want to support
332 symbolic links 402 symbolic links. This function returns a void pointer cookie
403 that is passed to put_link().
404
405 put_link: called by the VFS to release resources allocated by
406 follow_link(). The cookie returned by follow_link() is passed to
407 this function as the last parameter. It is used by filesystems
408 such as NFS where page cache is not stable (i.e. page that was
409 installed when the symbolic link walk started might not be in the
410 page cache at the end of the walk).
411
412 truncate: called by the VFS to change the size of a file. The i_size
413 field of the inode is set to the desired size by the VFS before
414 this function is called. This function is called by the truncate(2)
415 system call and related functionality.
416
417 permission: called by the VFS to check for access rights on a POSIX-like
418 filesystem.
419
420 setattr: called by the VFS to set attributes for a file. This function is
421 called by chmod(2) and related system calls.
422
423 getattr: called by the VFS to get attributes of a file. This function is
424 called by stat(2) and related system calls.
425
426 setxattr: called by the VFS to set an extended attribute for a file.
427 Extended attribute is a name:value pair associated with an inode. This
428 function is called by setxattr(2) system call.
429
430 getxattr: called by the VFS to retrieve the value of an extended attribute
431 name. This function is called by getxattr(2) system call.
432
433 listxattr: called by the VFS to list all extended attributes for a given
434 file. This function is called by listxattr(2) system call.
435
436 removexattr: called by the VFS to remove an extended attribute from a file.
437 This function is called by removexattr(2) system call.
438
439
440struct address_space_operations
441===============================
442
443This describes how the VFS can manipulate mapping of a file to page cache in
444your filesystem. As of kernel 2.6.13, the following members are defined:
445
446struct address_space_operations {
447 int (*writepage)(struct page *page, struct writeback_control *wbc);
448 int (*readpage)(struct file *, struct page *);
449 int (*sync_page)(struct page *);
450 int (*writepages)(struct address_space *, struct writeback_control *);
451 int (*set_page_dirty)(struct page *page);
452 int (*readpages)(struct file *filp, struct address_space *mapping,
453 struct list_head *pages, unsigned nr_pages);
454 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
455 int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
456 sector_t (*bmap)(struct address_space *, sector_t);
457 int (*invalidatepage) (struct page *, unsigned long);
458 int (*releasepage) (struct page *, int);
459 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
460 loff_t offset, unsigned long nr_segs);
461 struct page* (*get_xip_page)(struct address_space *, sector_t,
462 int);
463};
464
465 writepage: called by the VM to write a dirty page to backing store.
466
467 readpage: called by the VM to read a page from backing store.
468
469 sync_page: called by the VM to notify the backing store to perform all
470 queued I/O operations for a page. I/O operations for other pages
471 associated with this address_space object may also be performed.
472
473 writepages: called by the VM to write out pages associated with the
474 address_space object.
475
476 set_page_dirty: called by the VM to set a page dirty.
477
478 readpages: called by the VM to read pages associated with the address_space
479 object.
333 480
481 prepare_write: called by the generic write path in VM to set up a write
482 request for a page.
334 483
335struct file_operations <section> 484 commit_write: called by the generic write path in VM to write page to
485 its backing store.
486
487 bmap: called by the VFS to map a logical block offset within an object to a
488 physical block number. This method is used by the legacy FIBMAP
489 ioctl. Other uses are discouraged.
490
491 invalidatepage: called by the VM on truncate to disassociate a page from its
492 address_space mapping.
493
494 releasepage: called by the VFS to release filesystem specific metadata from
495 a page.
496
497 direct_IO: called by the VM for direct I/O writes and reads.
498
499 get_xip_page: called by the VM to translate a block number to a page.
500 The page is valid until the corresponding filesystem is unmounted.
501 Filesystems that want to use execute-in-place (XIP) need to implement
502 it. An example implementation can be found in fs/ext2/xip.c.
503
504
505struct file_operations
336====================== 506======================
337 507
338This describes how the VFS can manipulate an open file. As of kernel 508This describes how the VFS can manipulate an open file. As of kernel
3392.1.99, the following members are defined: 5092.6.13, the following members are defined:
340 510
341struct file_operations { 511struct file_operations {
342 loff_t (*llseek) (struct file *, loff_t, int); 512 loff_t (*llseek) (struct file *, loff_t, int);
343 ssize_t (*read) (struct file *, char *, size_t, loff_t *); 513 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
344 ssize_t (*write) (struct file *, const char *, size_t, loff_t *); 514 ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
515 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
516 ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
345 int (*readdir) (struct file *, void *, filldir_t); 517 int (*readdir) (struct file *, void *, filldir_t);
346 unsigned int (*poll) (struct file *, struct poll_table_struct *); 518 unsigned int (*poll) (struct file *, struct poll_table_struct *);
347 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); 519 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
520 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
521 long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
348 int (*mmap) (struct file *, struct vm_area_struct *); 522 int (*mmap) (struct file *, struct vm_area_struct *);
349 int (*open) (struct inode *, struct file *); 523 int (*open) (struct inode *, struct file *);
524 int (*flush) (struct file *);
350 int (*release) (struct inode *, struct file *); 525 int (*release) (struct inode *, struct file *);
351 int (*fsync) (struct file *, struct dentry *); 526 int (*fsync) (struct file *, struct dentry *, int datasync);
352 int (*fasync) (struct file *, int); 527 int (*aio_fsync) (struct kiocb *, int datasync);
353 int (*check_media_change) (kdev_t dev); 528 int (*fasync) (int, struct file *, int);
354 int (*revalidate) (kdev_t dev);
355 int (*lock) (struct file *, int, struct file_lock *); 529 int (*lock) (struct file *, int, struct file_lock *);
530 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
531 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
532 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
533 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
534 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
535 int (*check_flags)(int);
536 int (*dir_notify)(struct file *filp, unsigned long arg);
537 int (*flock) (struct file *, int, struct file_lock *);
356}; 538};
357 539
358Again, all methods are called without any locks being held, unless 540Again, all methods are called without any locks being held, unless
@@ -362,8 +544,12 @@ otherwise noted.
362 544
363 read: called by read(2) and related system calls 545 read: called by read(2) and related system calls
364 546
547 aio_read: called by io_submit(2) and other asynchronous I/O operations
548
365 write: called by write(2) and related system calls 549 write: called by write(2) and related system calls
366 550
551 aio_write: called by io_submit(2) and other asynchronous I/O operations
552
367 readdir: called when the VFS needs to read the directory contents 553 readdir: called when the VFS needs to read the directory contents
368 554
369 poll: called by the VFS when a process wants to check if there is 555 poll: called by the VFS when a process wants to check if there is
@@ -372,18 +558,25 @@ otherwise noted.
372 558
373 ioctl: called by the ioctl(2) system call 559 ioctl: called by the ioctl(2) system call
374 560
561 unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
562 require the BKL should use this method instead of the ioctl() above.
563
564 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
565 are used on 64 bit kernels.
566
375 mmap: called by the mmap(2) system call 567 mmap: called by the mmap(2) system call
376 568
377 open: called by the VFS when an inode should be opened. When the VFS 569 open: called by the VFS when an inode should be opened. When the VFS
378 opens a file, it creates a new "struct file" and initialises 570 opens a file, it creates a new "struct file". It then calls the
379 the "f_op" file operations member with the "default_file_ops" 571 open method for the newly allocated file structure. You might
380 field in the inode structure. It then calls the open method 572 think that the open method really belongs in
381 for the newly allocated file structure. You might think that 573 "struct inode_operations", and you may be right. I think it's
382 the open method really belongs in "struct inode_operations", 574 done the way it is because it makes filesystems simpler to
383 and you may be right. I think it's done the way it is because 575 implement. The open() method is a good place to initialize the
384 it makes filesystems simpler to implement. The open() method 576 "private_data" member in the file structure if you want to point
385 is a good place to initialise the "private_data" member in the 577 to a device structure
386 file structure if you want to point to a device structure 578
579 flush: called by the close(2) system call to flush a file
387 580
388 release: called when the last reference to an open file is closed 581 release: called when the last reference to an open file is closed
389 582
@@ -392,6 +585,23 @@ otherwise noted.
392 fasync: called by the fcntl(2) system call when asynchronous 585 fasync: called by the fcntl(2) system call when asynchronous
393 (non-blocking) mode is enabled for a file 586 (non-blocking) mode is enabled for a file
394 587
588 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
589 commands
590
591 readv: called by the readv(2) system call
592
593 writev: called by the writev(2) system call
594
595 sendfile: called by the sendfile(2) system call
596
597 get_unmapped_area: called by the mmap(2) system call
598
599 check_flags: called by the fcntl(2) system call for F_SETFL command
600
601 dir_notify: called by the fcntl(2) system call for F_NOTIFY command
602
603 flock: called by the flock(2) system call
604
395Note that the file operations are implemented by the specific 605Note that the file operations are implemented by the specific
396filesystem in which the inode resides. When opening a device node 606filesystem in which the inode resides. When opening a device node
397(character or block special) most filesystems will call special 607(character or block special) most filesystems will call special
@@ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file
400operations with those for the device driver, and then proceed to call 610operations with those for the device driver, and then proceed to call
401the new open() method for the file. This is how opening a device file 611the new open() method for the file. This is how opening a device file
402in the filesystem eventually ends up calling the device driver open() 612in the filesystem eventually ends up calling the device driver open()
403method. Note the devfs (the Device FileSystem) has a more direct path 613method.
404from device node to device driver (this is an unofficial kernel
405patch).
406 614
407 615
408Directory Entry Cache (dcache) <section> 616Directory Entry Cache (dcache)
409------------------------------ 617==============================
618
410 619
411struct dentry_operations 620struct dentry_operations
412======================== 621------------------------
413 622
414This describes how a filesystem can overload the standard dentry 623This describes how a filesystem can overload the standard dentry
415operations. Dentries and the dcache are the domain of the VFS and the 624operations. Dentries and the dcache are the domain of the VFS and the
416individual filesystem implementations. Device drivers have no business 625individual filesystem implementations. Device drivers have no business
417here. These methods may be set to NULL, as they are either optional or 626here. These methods may be set to NULL, as they are either optional or
418the VFS uses a default. As of kernel 2.1.99, the following members are 627the VFS uses a default. As of kernel 2.6.13, the following members are
419defined: 628defined:
420 629
421struct dentry_operations { 630struct dentry_operations {
422 int (*d_revalidate)(struct dentry *); 631 int (*d_revalidate)(struct dentry *, struct nameidata *);
423 int (*d_hash) (struct dentry *, struct qstr *); 632 int (*d_hash) (struct dentry *, struct qstr *);
424 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 633 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
425 void (*d_delete)(struct dentry *); 634 int (*d_delete)(struct dentry *);
426 void (*d_release)(struct dentry *); 635 void (*d_release)(struct dentry *);
427 void (*d_iput)(struct dentry *, struct inode *); 636 void (*d_iput)(struct dentry *, struct inode *);
428}; 637};
@@ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list
451of child dentries. Child dentries are basically like files in a 660of child dentries. Child dentries are basically like files in a
452directory. 661directory.
453 662
663
454Directory Entry Cache APIs 664Directory Entry Cache APIs
455-------------------------- 665--------------------------
456 666
@@ -471,7 +681,7 @@ manipulate dentries:
471 "d_delete" method is called 681 "d_delete" method is called
472 682
473 d_drop: this unhashes a dentry from its parents hash list. A 683 d_drop: this unhashes a dentry from its parents hash list. A
474 subsequent call to dput() will dellocate the dentry if its 684 subsequent call to dput() will deallocate the dentry if its
475 usage count drops to 0 685 usage count drops to 0
476 686
477 d_delete: delete a dentry. If there are no other open references to 687 d_delete: delete a dentry. If there are no other open references to
@@ -507,16 +717,16 @@ up by walking the tree starting with the first component
507of the pathname and using that dentry along with the next 717of the pathname and using that dentry along with the next
508component to look up the next level and so on. Since it 718component to look up the next level and so on. Since it
509is a frequent operation for workloads like multiuser 719is a frequent operation for workloads like multiuser
510environments and webservers, it is important to optimize 720environments and web servers, it is important to optimize
511this path. 721this path.
512 722
513Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus 723Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
514in every component during path look-up. Since 2.5.10 onwards, 724in every component during path look-up. Since 2.5.10 onwards,
515fastwalk algorithm changed this by holding the dcache_lock 725fast-walk algorithm changed this by holding the dcache_lock
516at the beginning and walking as many cached path component 726at the beginning and walking as many cached path component
517dentries as possible. This signficantly decreases the number 727dentries as possible. This significantly decreases the number
518of acquisition of dcache_lock. However it also increases the 728of acquisition of dcache_lock. However it also increases the
519lock hold time signficantly and affects performance in large 729lock hold time significantly and affects performance in large
520SMP machines. Since 2.5.62 kernel, dcache has been using 730SMP machines. Since 2.5.62 kernel, dcache has been using
521a new locking model that uses RCU to make dcache look-up 731a new locking model that uses RCU to make dcache look-up
522lock-free. 732lock-free.
@@ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well
527as d_inode and several other things like mount look-up. RCU-based 737as d_inode and several other things like mount look-up. RCU-based
528changes affect only the way the hash chain is protected. For everything 738changes affect only the way the hash chain is protected. For everything
529else the dcache_lock must be taken for both traversing as well as 739else the dcache_lock must be taken for both traversing as well as
530updating. The hash chain updations too take the dcache_lock. 740updating. The hash chain updates too take the dcache_lock.
531The significant change is the way d_lookup traverses the hash chain, 741The significant change is the way d_lookup traverses the hash chain,
532it doesn't acquire the dcache_lock for this and relies on RCU to 742it doesn't acquire the dcache_lock for this and relies on RCU to
533ensure that the dentry has not been *freed*. 743ensure that the dentry has not been *freed*.
@@ -535,14 +745,15 @@ ensure that the dentry has not been *freed*.
535 745
536Dcache locking details 746Dcache locking details
537---------------------- 747----------------------
748
538For many multi-user workloads, open() and stat() on files are 749For many multi-user workloads, open() and stat() on files are
539very frequently occurring operations. Both involve walking 750very frequently occurring operations. Both involve walking
540of path names to find the dentry corresponding to the 751of path names to find the dentry corresponding to the
541concerned file. In 2.4 kernel, dcache_lock was held 752concerned file. In 2.4 kernel, dcache_lock was held
542during look-up of each path component. Contention and 753during look-up of each path component. Contention and
543cacheline bouncing of this global lock caused significant 754cache-line bouncing of this global lock caused significant
544scalability problems. With the introduction of RCU 755scalability problems. With the introduction of RCU
545in linux kernel, this was worked around by making 756in Linux kernel, this was worked around by making
546the look-up of path components during path walking lock-free. 757the look-up of path components during path walking lock-free.
547 758
548 759
@@ -562,7 +773,7 @@ Some of the important changes are :
5622. Insertion of a dentry into the hash table is done using 7732. Insertion of a dentry into the hash table is done using
563 hlist_add_head_rcu() which takes care of ordering the writes - 774 hlist_add_head_rcu() which takes care of ordering the writes -
564 the writes to the dentry must be visible before the dentry 775 the writes to the dentry must be visible before the dentry
565 is inserted. This works in conjuction with hlist_for_each_rcu() 776 is inserted. This works in conjunction with hlist_for_each_rcu()
566 while walking the hash chain. The only requirement is that 777 while walking the hash chain. The only requirement is that
567 all initialization to the dentry must be done before hlist_add_head_rcu() 778 all initialization to the dentry must be done before hlist_add_head_rcu()
568 since we don't have dcache_lock protection while traversing 779 since we don't have dcache_lock protection while traversing
@@ -584,7 +795,7 @@ Some of the important changes are :
584 the same. In some sense, dcache_rcu path walking looks like 795 the same. In some sense, dcache_rcu path walking looks like
585 the pre-2.5.10 version. 796 the pre-2.5.10 version.
586 797
5875. All dentry hash chain updations must take the dcache_lock as well as 7985. All dentry hash chain updates must take the dcache_lock as well as
588 the per-dentry lock in that order. dput() does this to ensure 799 the per-dentry lock in that order. dput() does this to ensure
589 that a dentry that has just been looked up in another CPU 800 that a dentry that has just been looked up in another CPU
590 doesn't get deleted before dget() can be done on it. 801 doesn't get deleted before dget() can be done on it.
@@ -640,10 +851,10 @@ handled as described below :
640 Since we redo the d_parent check and compare name while holding 851 Since we redo the d_parent check and compare name while holding
641 d_lock, lock-free look-up will not race against d_move(). 852 d_lock, lock-free look-up will not race against d_move().
642 853
6434. There can be a theoritical race when a dentry keeps coming back 8544. There can be a theoretical race when a dentry keeps coming back
644 to original bucket due to double moves. Due to this look-up may 855 to original bucket due to double moves. Due to this look-up may
645 consider that it has never moved and can end up in an infinite loop. 856 consider that it has never moved and can end up in an infinite loop.
646 But this is not any worse than theoritical livelocks we already 857 But this is not any worse than theoretical livelocks we already
647 have in the kernel. 858 have in the kernel.
648 859
649 860