aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/files.txt123
-rw-r--r--Documentation/filesystems/fuse.txt315
-rw-r--r--Documentation/filesystems/ntfs.txt12
-rw-r--r--Documentation/filesystems/proc.txt41
-rw-r--r--Documentation/filesystems/relayfs.txt362
-rw-r--r--Documentation/filesystems/sysfs.txt28
-rw-r--r--Documentation/filesystems/v9fs.txt95
-rw-r--r--Documentation/filesystems/vfs.txt435
8 files changed, 1276 insertions, 135 deletions
diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
new file mode 100644
index 000000000000..8c206f4e0250
--- /dev/null
+++ b/Documentation/filesystems/files.txt
@@ -0,0 +1,123 @@
1File management in the Linux kernel
2-----------------------------------
3
4This document describes how locking for files (struct file)
5and file descriptor table (struct files) works.
6
7Up until 2.6.12, the file descriptor table has been protected
8with a lock (files->file_lock) and reference count (files->count).
9->file_lock protected accesses to all the file related fields
10of the table. ->count was used for sharing the file descriptor
11table between tasks cloned with CLONE_FILES flag. Typically
12this would be the case for posix threads. As with the common
13refcounting model in the kernel, the last task doing
14a put_files_struct() frees the file descriptor (fd) table.
15The files (struct file) themselves are protected using
16reference count (->f_count).
17
18In the new lock-free model of file descriptor management,
19the reference counting is similar, but the locking is
20based on RCU. The file descriptor table contains multiple
21elements - the fd sets (open_fds and close_on_exec, the
22array of file pointers, the sizes of the sets and the array
23etc.). In order for the updates to appear atomic to
24a lock-free reader, all the elements of the file descriptor
25table are in a separate structure - struct fdtable.
26files_struct contains a pointer to struct fdtable through
27which the actual fd table is accessed. Initially the
28fdtable is embedded in files_struct itself. On a subsequent
29expansion of fdtable, a new fdtable structure is allocated
30and files->fdtab points to the new structure. The fdtable
31structure is freed with RCU and lock-free readers either
32see the old fdtable or the new fdtable making the update
33appear atomic. Here are the locking rules for
34the fdtable structure -
35
361. All references to the fdtable must be done through
37 the files_fdtable() macro :
38
39 struct fdtable *fdt;
40
41 rcu_read_lock();
42
43 fdt = files_fdtable(files);
44 ....
45 if (n <= fdt->max_fds)
46 ....
47 ...
48 rcu_read_unlock();
49
50 files_fdtable() uses rcu_dereference() macro which takes care of
51 the memory barrier requirements for lock-free dereference.
52 The fdtable pointer must be read within the read-side
53 critical section.
54
552. Reading of the fdtable as described above must be protected
56 by rcu_read_lock()/rcu_read_unlock().
57
583. For any update to the the fd table, files->file_lock must
59 be held.
60
614. To look up the file structure given an fd, a reader
62 must use either fcheck() or fcheck_files() APIs. These
63 take care of barrier requirements due to lock-free lookup.
64 An example :
65
66 struct file *file;
67
68 rcu_read_lock();
69 file = fcheck(fd);
70 if (file) {
71 ...
72 }
73 ....
74 rcu_read_unlock();
75
765. Handling of the file structures is special. Since the look-up
77 of the fd (fget()/fget_light()) are lock-free, it is possible
78 that look-up may race with the last put() operation on the
79 file structure. This is avoided using the rcuref APIs
80 on ->f_count :
81
82 rcu_read_lock();
83 file = fcheck_files(files, fd);
84 if (file) {
85 if (rcuref_inc_lf(&file->f_count))
86 *fput_needed = 1;
87 else
88 /* Didn't get the reference, someone's freed */
89 file = NULL;
90 }
91 rcu_read_unlock();
92 ....
93 return file;
94
95 rcuref_inc_lf() detects if refcounts is already zero or
96 goes to zero during increment. If it does, we fail
97 fget()/fget_light().
98
996. Since both fdtable and file structures can be looked up
100 lock-free, they must be installed using rcu_assign_pointer()
101 API. If they are looked up lock-free, rcu_dereference()
102 must be used. However it is advisable to use files_fdtable()
103 and fcheck()/fcheck_files() which take care of these issues.
104
1057. While updating, the fdtable pointer must be looked up while
106 holding files->file_lock. If ->file_lock is dropped, then
107 another thread expand the files thereby creating a new
108 fdtable and making the earlier fdtable pointer stale.
109 For example :
110
111 spin_lock(&files->file_lock);
112 fd = locate_fd(files, file, start);
113 if (fd >= 0) {
114 /* locate_fd() may have expanded fdtable, load the ptr */
115 fdt = files_fdtable(files);
116 FD_SET(fd, fdt->open_fds);
117 FD_CLR(fd, fdt->close_on_exec);
118 spin_unlock(&files->file_lock);
119 .....
120
121 Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
122 the fdtable pointer (fdt) must be loaded after locate_fd().
123
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
new file mode 100644
index 000000000000..6b5741e651a2
--- /dev/null
+++ b/Documentation/filesystems/fuse.txt
@@ -0,0 +1,315 @@
1Definitions
2~~~~~~~~~~~
3
4Userspace filesystem:
5
6 A filesystem in which data and metadata are provided by an ordinary
7 userspace process. The filesystem can be accessed normally through
8 the kernel interface.
9
10Filesystem daemon:
11
12 The process(es) providing the data and metadata of the filesystem.
13
14Non-privileged mount (or user mount):
15
16 A userspace filesystem mounted by a non-privileged (non-root) user.
17 The filesystem daemon is running with the privileges of the mounting
18 user. NOTE: this is not the same as mounts allowed with the "user"
19 option in /etc/fstab, which is not discussed here.
20
21Mount owner:
22
23 The user who does the mounting.
24
25User:
26
27 The user who is performing filesystem operations.
28
29What is FUSE?
30~~~~~~~~~~~~~
31
32FUSE is a userspace filesystem framework. It consists of a kernel
33module (fuse.ko), a userspace library (libfuse.*) and a mount utility
34(fusermount).
35
36One of the most important features of FUSE is allowing secure,
37non-privileged mounts. This opens up new possibilities for the use of
38filesystems. A good example is sshfs: a secure network filesystem
39using the sftp protocol.
40
41The userspace library and utilities are available from the FUSE
42homepage:
43
44 http://fuse.sourceforge.net/
45
46Mount options
47~~~~~~~~~~~~~
48
49'fd=N'
50
51 The file descriptor to use for communication between the userspace
52 filesystem and the kernel. The file descriptor must have been
53 obtained by opening the FUSE device ('/dev/fuse').
54
55'rootmode=M'
56
57 The file mode of the filesystem's root in octal representation.
58
59'user_id=N'
60
61 The numeric user id of the mount owner.
62
63'group_id=N'
64
65 The numeric group id of the mount owner.
66
67'default_permissions'
68
69 By default FUSE doesn't check file access permissions, the
70 filesystem is free to implement it's access policy or leave it to
71 the underlying file access mechanism (e.g. in case of network
72 filesystems). This option enables permission checking, restricting
73 access based on file mode. This is option is usually useful
74 together with the 'allow_other' mount option.
75
76'allow_other'
77
78 This option overrides the security measure restricting file access
79 to the user mounting the filesystem. This option is by default only
80 allowed to root, but this restriction can be removed with a
81 (userspace) configuration option.
82
83'max_read=N'
84
85 With this option the maximum size of read operations can be set.
86 The default is infinite. Note that the size of read requests is
87 limited anyway to 32 pages (which is 128kbyte on i386).
88
89How do non-privileged mounts work?
90~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
91
92Since the mount() system call is a privileged operation, a helper
93program (fusermount) is needed, which is installed setuid root.
94
95The implication of providing non-privileged mounts is that the mount
96owner must not be able to use this capability to compromise the
97system. Obvious requirements arising from this are:
98
99 A) mount owner should not be able to get elevated privileges with the
100 help of the mounted filesystem
101
102 B) mount owner should not get illegitimate access to information from
103 other users' and the super user's processes
104
105 C) mount owner should not be able to induce undesired behavior in
106 other users' or the super user's processes
107
108How are requirements fulfilled?
109~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
110
111 A) The mount owner could gain elevated privileges by either:
112
113 1) creating a filesystem containing a device file, then opening
114 this device
115
116 2) creating a filesystem containing a suid or sgid application,
117 then executing this application
118
119 The solution is not to allow opening device files and ignore
120 setuid and setgid bits when executing programs. To ensure this
121 fusermount always adds "nosuid" and "nodev" to the mount options
122 for non-privileged mounts.
123
124 B) If another user is accessing files or directories in the
125 filesystem, the filesystem daemon serving requests can record the
126 exact sequence and timing of operations performed. This
127 information is otherwise inaccessible to the mount owner, so this
128 counts as an information leak.
129
130 The solution to this problem will be presented in point 2) of C).
131
132 C) There are several ways in which the mount owner can induce
133 undesired behavior in other users' processes, such as:
134
135 1) mounting a filesystem over a file or directory which the mount
136 owner could otherwise not be able to modify (or could only
137 make limited modifications).
138
139 This is solved in fusermount, by checking the access
140 permissions on the mountpoint and only allowing the mount if
141 the mount owner can do unlimited modification (has write
142 access to the mountpoint, and mountpoint is not a "sticky"
143 directory)
144
145 2) Even if 1) is solved the mount owner can change the behavior
146 of other users' processes.
147
148 i) It can slow down or indefinitely delay the execution of a
149 filesystem operation creating a DoS against the user or the
150 whole system. For example a suid application locking a
151 system file, and then accessing a file on the mount owner's
152 filesystem could be stopped, and thus causing the system
153 file to be locked forever.
154
155 ii) It can present files or directories of unlimited length, or
156 directory structures of unlimited depth, possibly causing a
157 system process to eat up diskspace, memory or other
158 resources, again causing DoS.
159
160 The solution to this as well as B) is not to allow processes
161 to access the filesystem, which could otherwise not be
162 monitored or manipulated by the mount owner. Since if the
163 mount owner can ptrace a process, it can do all of the above
164 without using a FUSE mount, the same criteria as used in
165 ptrace can be used to check if a process is allowed to access
166 the filesystem or not.
167
168 Note that the ptrace check is not strictly necessary to
169 prevent B/2/i, it is enough to check if mount owner has enough
170 privilege to send signal to the process accessing the
171 filesystem, since SIGSTOP can be used to get a similar effect.
172
173I think these limitations are unacceptable?
174~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
175
176If a sysadmin trusts the users enough, or can ensure through other
177measures, that system processes will never enter non-privileged
178mounts, it can relax the last limitation with a "user_allow_other"
179config option. If this config option is set, the mounting user can
180add the "allow_other" mount option which disables the check for other
181users' processes.
182
183Kernel - userspace interface
184~~~~~~~~~~~~~~~~~~~~~~~~~~~~
185
186The following diagram shows how a filesystem operation (in this
187example unlink) is performed in FUSE.
188
189NOTE: everything in this description is greatly simplified
190
191 | "rm /mnt/fuse/file" | FUSE filesystem daemon
192 | |
193 | | >sys_read()
194 | | >fuse_dev_read()
195 | | >request_wait()
196 | | [sleep on fc->waitq]
197 | |
198 | >sys_unlink() |
199 | >fuse_unlink() |
200 | [get request from |
201 | fc->unused_list] |
202 | >request_send() |
203 | [queue req on fc->pending] |
204 | [wake up fc->waitq] | [woken up]
205 | >request_wait_answer() |
206 | [sleep on req->waitq] |
207 | | <request_wait()
208 | | [remove req from fc->pending]
209 | | [copy req to read buffer]
210 | | [add req to fc->processing]
211 | | <fuse_dev_read()
212 | | <sys_read()
213 | |
214 | | [perform unlink]
215 | |
216 | | >sys_write()
217 | | >fuse_dev_write()
218 | | [look up req in fc->processing]
219 | | [remove from fc->processing]
220 | | [copy write buffer to req]
221 | [woken up] | [wake up req->waitq]
222 | | <fuse_dev_write()
223 | | <sys_write()
224 | <request_wait_answer() |
225 | <request_send() |
226 | [add request to |
227 | fc->unused_list] |
228 | <fuse_unlink() |
229 | <sys_unlink() |
230
231There are a couple of ways in which to deadlock a FUSE filesystem.
232Since we are talking about unprivileged userspace programs,
233something must be done about these.
234
235Scenario 1 - Simple deadlock
236-----------------------------
237
238 | "rm /mnt/fuse/file" | FUSE filesystem daemon
239 | |
240 | >sys_unlink("/mnt/fuse/file") |
241 | [acquire inode semaphore |
242 | for "file"] |
243 | >fuse_unlink() |
244 | [sleep on req->waitq] |
245 | | <sys_read()
246 | | >sys_unlink("/mnt/fuse/file")
247 | | [acquire inode semaphore
248 | | for "file"]
249 | | *DEADLOCK*
250
251The solution for this is to allow requests to be interrupted while
252they are in userspace:
253
254 | [interrupted by signal] |
255 | <fuse_unlink() |
256 | [release semaphore] | [semaphore acquired]
257 | <sys_unlink() |
258 | | >fuse_unlink()
259 | | [queue req on fc->pending]
260 | | [wake up fc->waitq]
261 | | [sleep on req->waitq]
262
263If the filesystem daemon was single threaded, this will stop here,
264since there's no other thread to dequeue and execute the request.
265In this case the solution is to kill the FUSE daemon as well. If
266there are multiple serving threads, you just have to kill them as
267long as any remain.
268
269Moral: a filesystem which deadlocks, can soon find itself dead.
270
271Scenario 2 - Tricky deadlock
272----------------------------
273
274This one needs a carefully crafted filesystem. It's a variation on
275the above, only the call back to the filesystem is not explicit,
276but is caused by a pagefault.
277
278 | Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
279 | |
280 | [fd = open("/mnt/fuse/file")] | [request served normally]
281 | [mmap fd to 'addr'] |
282 | [close fd] | [FLUSH triggers 'magic' flag]
283 | [read a byte from addr] |
284 | >do_page_fault() |
285 | [find or create page] |
286 | [lock page] |
287 | >fuse_readpage() |
288 | [queue READ request] |
289 | [sleep on req->waitq] |
290 | | [read request to buffer]
291 | | [create reply header before addr]
292 | | >sys_write(addr - headerlength)
293 | | >fuse_dev_write()
294 | | [look up req in fc->processing]
295 | | [remove from fc->processing]
296 | | [copy write buffer to req]
297 | | >do_page_fault()
298 | | [find or create page]
299 | | [lock page]
300 | | * DEADLOCK *
301
302Solution is again to let the the request be interrupted (not
303elaborated further).
304
305An additional problem is that while the write buffer is being
306copied to the request, the request must not be interrupted. This
307is because the destination address of the copy may not be valid
308after the request is interrupted.
309
310This is solved with doing the copy atomically, and allowing
311interruption while the page(s) belonging to the write buffer are
312faulted with get_user_pages(). The 'req->locked' flag indicates
313when the copy is taking place, and interruption is delayed until
314this flag is unset.
315
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt
index eef4aca0c753..a5fbc8e897fa 100644
--- a/Documentation/filesystems/ntfs.txt
+++ b/Documentation/filesystems/ntfs.txt
@@ -439,6 +439,18 @@ ChangeLog
439 439
440Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. 440Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
441 441
4422.1.24:
443 - Support journals ($LogFile) which have been modified by chkdsk. This
444 means users can boot into Windows after we marked the volume dirty.
445 The Windows boot will run chkdsk and then reboot. The user can then
446 immediately boot into Linux rather than having to do a full Windows
447 boot first before rebooting into Linux and we will recognize such a
448 journal and empty it as it is clean by definition.
449 - Support journals ($LogFile) with only one restart page as well as
450 journals with two different restart pages. We sanity check both and
451 either use the only sane one or the more recent one of the two in the
452 case that both are valid.
453 - Lots of bug fixes and enhancements across the board.
4422.1.23: 4542.1.23:
443 - Stamp the user space journal, aka transaction log, aka $UsnJrnl, if 455 - Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
444 it is present and active thus telling Windows and applications using 456 it is present and active thus telling Windows and applications using
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 6c98f2bd421e..d4773565ea2f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -133,6 +133,7 @@ Table 1-1: Process specific entries in /proc
133 statm Process memory status information 133 statm Process memory status information
134 status Process status in human readable form 134 status Process status in human readable form
135 wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan 135 wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
136 smaps Extension based on maps, presenting the rss size for each mapped file
136.............................................................................. 137..............................................................................
137 138
138For example, to get the status information of a process, all you have to do is 139For example, to get the status information of a process, all you have to do is
@@ -1240,16 +1241,38 @@ swap-intensive.
1240overcommit_memory 1241overcommit_memory
1241----------------- 1242-----------------
1242 1243
1243This file contains one value. The following algorithm is used to decide if 1244Controls overcommit of system memory, possibly allowing processes
1244there's enough memory: if the value of overcommit_memory is positive, then 1245to allocate (but not use) more memory than is actually available.
1245there's always enough memory. This is a useful feature, since programs often
1246malloc() huge amounts of memory 'just in case', while they only use a small
1247part of it. Leaving this value at 0 will lead to the failure of such a huge
1248malloc(), when in fact the system has enough memory for the program to run.
1249 1246
1250On the other hand, enabling this feature can cause you to run out of memory 1247
1251and thrash the system to death, so large and/or important servers will want to 12480 - Heuristic overcommit handling. Obvious overcommits of
1252set this value to 0. 1249 address space are refused. Used for a typical system. It
1250 ensures a seriously wild allocation fails while allowing
1251 overcommit to reduce swap usage. root is allowed to
1252 allocate slighly more memory in this mode. This is the
1253 default.
1254
12551 - Always overcommit. Appropriate for some scientific
1256 applications.
1257
12582 - Don't overcommit. The total address space commit
1259 for the system is not permitted to exceed swap plus a
1260 configurable percentage (default is 50) of physical RAM.
1261 Depending on the percentage you use, in most situations
1262 this means a process will not be killed while attempting
1263 to use already-allocated memory but will receive errors
1264 on memory allocation as appropriate.
1265
1266overcommit_ratio
1267----------------
1268
1269Percentage of physical memory size to include in overcommit calculations
1270(see above.)
1271
1272Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
1273
1274 swapspace = total size of all swap areas
1275 physmem = size of physical memory in system
1253 1276
1254nr_hugepages and hugetlb_shm_group 1277nr_hugepages and hugetlb_shm_group
1255---------------------------------- 1278----------------------------------
diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt
new file mode 100644
index 000000000000..d24e1b0d4f39
--- /dev/null
+++ b/Documentation/filesystems/relayfs.txt
@@ -0,0 +1,362 @@
1
2relayfs - a high-speed data relay filesystem
3============================================
4
5relayfs is a filesystem designed to provide an efficient mechanism for
6tools and facilities to relay large and potentially sustained streams
7of data from kernel space to user space.
8
9The main abstraction of relayfs is the 'channel'. A channel consists
10of a set of per-cpu kernel buffers each represented by a file in the
11relayfs filesystem. Kernel clients write into a channel using
12efficient write functions which automatically log to the current cpu's
13channel buffer. User space applications mmap() the per-cpu files and
14retrieve the data as it becomes available.
15
16The format of the data logged into the channel buffers is completely
17up to the relayfs client; relayfs does however provide hooks which
18allow clients to impose some stucture on the buffer data. Nor does
19relayfs implement any form of data filtering - this also is left to
20the client. The purpose is to keep relayfs as simple as possible.
21
22This document provides an overview of the relayfs API. The details of
23the function parameters are documented along with the functions in the
24filesystem code - please see that for details.
25
26Semantics
27=========
28
29Each relayfs channel has one buffer per CPU, each buffer has one or
30more sub-buffers. Messages are written to the first sub-buffer until
31it is too full to contain a new message, in which case it it is
32written to the next (if available). Messages are never split across
33sub-buffers. At this point, userspace can be notified so it empties
34the first sub-buffer, while the kernel continues writing to the next.
35
36When notified that a sub-buffer is full, the kernel knows how many
37bytes of it are padding i.e. unused. Userspace can use this knowledge
38to copy only valid data.
39
40After copying it, userspace can notify the kernel that a sub-buffer
41has been consumed.
42
43relayfs can operate in a mode where it will overwrite data not yet
44collected by userspace, and not wait for it to consume it.
45
46relayfs itself does not provide for communication of such data between
47userspace and kernel, allowing the kernel side to remain simple and not
48impose a single interface on userspace. It does provide a separate
49helper though, described below.
50
51klog, relay-app & librelay
52==========================
53
54relayfs itself is ready to use, but to make things easier, two
55additional systems are provided. klog is a simple wrapper to make
56writing formatted text or raw data to a channel simpler, regardless of
57whether a channel to write into exists or not, or whether relayfs is
58compiled into the kernel or is configured as a module. relay-app is
59the kernel counterpart of userspace librelay.c, combined these two
60files provide glue to easily stream data to disk, without having to
61bother with housekeeping. klog and relay-app can be used together,
62with klog providing high-level logging functions to the kernel and
63relay-app taking care of kernel-user control and disk-logging chores.
64
65It is possible to use relayfs without relay-app & librelay, but you'll
66have to implement communication between userspace and kernel, allowing
67both to convey the state of buffers (full, empty, amount of padding).
68
69klog, relay-app and librelay can be found in the relay-apps tarball on
70http://relayfs.sourceforge.net
71
72The relayfs user space API
73==========================
74
75relayfs implements basic file operations for user space access to
76relayfs channel buffer data. Here are the file operations that are
77available and some comments regarding their behavior:
78
79open() enables user to open an _existing_ buffer.
80
81mmap() results in channel buffer being mapped into the caller's
82 memory space. Note that you can't do a partial mmap - you must
83 map the entire file, which is NRBUF * SUBBUFSIZE.
84
85read() read the contents of a channel buffer. The bytes read are
86 'consumed' by the reader i.e. they won't be available again
87 to subsequent reads. If the channel is being used in
88 no-overwrite mode (the default), it can be read at any time
89 even if there's an active kernel writer. If the channel is
90 being used in overwrite mode and there are active channel
91 writers, results may be unpredictable - users should make
92 sure that all logging to the channel has ended before using
93 read() with overwrite mode.
94
95poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
96 notified when sub-buffer boundaries are crossed.
97
98close() decrements the channel buffer's refcount. When the refcount
99 reaches 0 i.e. when no process or kernel client has the buffer
100 open, the channel buffer is freed.
101
102
103In order for a user application to make use of relayfs files, the
104relayfs filesystem must be mounted. For example,
105
106 mount -t relayfs relayfs /mnt/relay
107
108NOTE: relayfs doesn't need to be mounted for kernel clients to create
109 or use channels - it only needs to be mounted when user space
110 applications need access to the buffer data.
111
112
113The relayfs kernel API
114======================
115
116Here's a summary of the API relayfs provides to in-kernel clients:
117
118
119 channel management functions:
120
121 relay_open(base_filename, parent, subbuf_size, n_subbufs,
122 callbacks)
123 relay_close(chan)
124 relay_flush(chan)
125 relay_reset(chan)
126 relayfs_create_dir(name, parent)
127 relayfs_remove_dir(dentry)
128
129 channel management typically called on instigation of userspace:
130
131 relay_subbufs_consumed(chan, cpu, subbufs_consumed)
132
133 write functions:
134
135 relay_write(chan, data, length)
136 __relay_write(chan, data, length)
137 relay_reserve(chan, length)
138
139 callbacks:
140
141 subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
142 buf_mapped(buf, filp)
143 buf_unmapped(buf, filp)
144
145 helper functions:
146
147 relay_buf_full(buf)
148 subbuf_start_reserve(buf, length)
149
150
151Creating a channel
152------------------
153
154relay_open() is used to create a channel, along with its per-cpu
155channel buffers. Each channel buffer will have an associated file
156created for it in the relayfs filesystem, which can be opened and
157mmapped from user space if desired. The files are named
158basename0...basenameN-1 where N is the number of online cpus, and by
159default will be created in the root of the filesystem. If you want a
160directory structure to contain your relayfs files, you can create it
161with relayfs_create_dir() and pass the parent directory to
162relay_open(). Clients are responsible for cleaning up any directory
163structure they create when the channel is closed - use
164relayfs_remove_dir() for that.
165
166The total size of each per-cpu buffer is calculated by multiplying the
167number of sub-buffers by the sub-buffer size passed into relay_open().
168The idea behind sub-buffers is that they're basically an extension of
169double-buffering to N buffers, and they also allow applications to
170easily implement random-access-on-buffer-boundary schemes, which can
171be important for some high-volume applications. The number and size
172of sub-buffers is completely dependent on the application and even for
173the same application, different conditions will warrant different
174values for these parameters at different times. Typically, the right
175values to use are best decided after some experimentation; in general,
176though, it's safe to assume that having only 1 sub-buffer is a bad
177idea - you're guaranteed to either overwrite data or lose events
178depending on the channel mode being used.
179
180Channel 'modes'
181---------------
182
183relayfs channels can be used in either of two modes - 'overwrite' or
184'no-overwrite'. The mode is entirely determined by the implementation
185of the subbuf_start() callback, as described below. In 'overwrite'
186mode, also known as 'flight recorder' mode, writes continuously cycle
187around the buffer and will never fail, but will unconditionally
188overwrite old data regardless of whether it's actually been consumed.
189In no-overwrite mode, writes will fail i.e. data will be lost, if the
190number of unconsumed sub-buffers equals the total number of
191sub-buffers in the channel. It should be clear that if there is no
192consumer or if the consumer can't consume sub-buffers fast enought,
193data will be lost in either case; the only difference is whether data
194is lost from the beginning or the end of a buffer.
195
196As explained above, a relayfs channel is made of up one or more
197per-cpu channel buffers, each implemented as a circular buffer
198subdivided into one or more sub-buffers. Messages are written into
199the current sub-buffer of the channel's current per-cpu buffer via the
200write functions described below. Whenever a message can't fit into
201the current sub-buffer, because there's no room left for it, the
202client is notified via the subbuf_start() callback that a switch to a
203new sub-buffer is about to occur. The client uses this callback to 1)
204initialize the next sub-buffer if appropriate 2) finalize the previous
205sub-buffer if appropriate and 3) return a boolean value indicating
206whether or not to actually go ahead with the sub-buffer switch.
207
208To implement 'no-overwrite' mode, the userspace client would provide
209an implementation of the subbuf_start() callback something like the
210following:
211
212static int subbuf_start(struct rchan_buf *buf,
213 void *subbuf,
214 void *prev_subbuf,
215 unsigned int prev_padding)
216{
217 if (prev_subbuf)
218 *((unsigned *)prev_subbuf) = prev_padding;
219
220 if (relay_buf_full(buf))
221 return 0;
222
223 subbuf_start_reserve(buf, sizeof(unsigned int));
224
225 return 1;
226}
227
228If the current buffer is full i.e. all sub-buffers remain unconsumed,
229the callback returns 0 to indicate that the buffer switch should not
230occur yet i.e. until the consumer has had a chance to read the current
231set of ready sub-buffers. For the relay_buf_full() function to make
232sense, the consumer is reponsible for notifying relayfs when
233sub-buffers have been consumed via relay_subbufs_consumed(). Any
234subsequent attempts to write into the buffer will again invoke the
235subbuf_start() callback with the same parameters; only when the
236consumer has consumed one or more of the ready sub-buffers will
237relay_buf_full() return 0, in which case the buffer switch can
238continue.
239
240The implementation of the subbuf_start() callback for 'overwrite' mode
241would be very similar:
242
243static int subbuf_start(struct rchan_buf *buf,
244 void *subbuf,
245 void *prev_subbuf,
246 unsigned int prev_padding)
247{
248 if (prev_subbuf)
249 *((unsigned *)prev_subbuf) = prev_padding;
250
251 subbuf_start_reserve(buf, sizeof(unsigned int));
252
253 return 1;
254}
255
256In this case, the relay_buf_full() check is meaningless and the
257callback always returns 1, causing the buffer switch to occur
258unconditionally. It's also meaningless for the client to use the
259relay_subbufs_consumed() function in this mode, as it's never
260consulted.
261
262The default subbuf_start() implementation, used if the client doesn't
263define any callbacks, or doesn't define the subbuf_start() callback,
264implements the simplest possible 'no-overwrite' mode i.e. it does
265nothing but return 0.
266
267Header information can be reserved at the beginning of each sub-buffer
268by calling the subbuf_start_reserve() helper function from within the
269subbuf_start() callback. This reserved area can be used to store
270whatever information the client wants. In the example above, room is
271reserved in each sub-buffer to store the padding count for that
272sub-buffer. This is filled in for the previous sub-buffer in the
273subbuf_start() implementation; the padding value for the previous
274sub-buffer is passed into the subbuf_start() callback along with a
275pointer to the previous sub-buffer, since the padding value isn't
276known until a sub-buffer is filled. The subbuf_start() callback is
277also called for the first sub-buffer when the channel is opened, to
278give the client a chance to reserve space in it. In this case the
279previous sub-buffer pointer passed into the callback will be NULL, so
280the client should check the value of the prev_subbuf pointer before
281writing into the previous sub-buffer.
282
283Writing to a channel
284--------------------
285
286kernel clients write data into the current cpu's channel buffer using
287relay_write() or __relay_write(). relay_write() is the main logging
288function - it uses local_irqsave() to protect the buffer and should be
289used if you might be logging from interrupt context. If you know
290you'll never be logging from interrupt context, you can use
291__relay_write(), which only disables preemption. These functions
292don't return a value, so you can't determine whether or not they
293failed - the assumption is that you wouldn't want to check a return
294value in the fast logging path anyway, and that they'll always succeed
295unless the buffer is full and no-overwrite mode is being used, in
296which case you can detect a failed write in the subbuf_start()
297callback by calling the relay_buf_full() helper function.
298
299relay_reserve() is used to reserve a slot in a channel buffer which
300can be written to later. This would typically be used in applications
301that need to write directly into a channel buffer without having to
302stage data in a temporary buffer beforehand. Because the actual write
303may not happen immediately after the slot is reserved, applications
304using relay_reserve() can keep a count of the number of bytes actually
305written, either in space reserved in the sub-buffers themselves or as
306a separate array. See the 'reserve' example in the relay-apps tarball
307at http://relayfs.sourceforge.net for an example of how this can be
308done. Because the write is under control of the client and is
309separated from the reserve, relay_reserve() doesn't protect the buffer
310at all - it's up to the client to provide the appropriate
311synchronization when using relay_reserve().
312
313Closing a channel
314-----------------
315
316The client calls relay_close() when it's finished using the channel.
317The channel and its associated buffers are destroyed when there are no
318longer any references to any of the channel buffers. relay_flush()
319forces a sub-buffer switch on all the channel buffers, and can be used
320to finalize and process the last sub-buffers before the channel is
321closed.
322
323Misc
324----
325
326Some applications may want to keep a channel around and re-use it
327rather than open and close a new channel for each use. relay_reset()
328can be used for this purpose - it resets a channel to its initial
329state without reallocating channel buffer memory or destroying
330existing mappings. It should however only be called when it's safe to
331do so i.e. when the channel isn't currently being written to.
332
333Finally, there are a couple of utility callbacks that can be used for
334different purposes. buf_mapped() is called whenever a channel buffer
335is mmapped from user space and buf_unmapped() is called when it's
336unmapped. The client can use this notification to trigger actions
337within the kernel application, such as enabling/disabling logging to
338the channel.
339
340
341Resources
342=========
343
344For news, example code, mailing list, etc. see the relayfs homepage:
345
346 http://relayfs.sourceforge.net
347
348
349Credits
350=======
351
352The ideas and specs for relayfs came about as a result of discussions
353on tracing involving the following:
354
355Michel Dagenais <michel.dagenais@polymtl.ca>
356Richard Moore <richardj_moore@uk.ibm.com>
357Bob Wisniewski <bob@watson.ibm.com>
358Karim Yaghmour <karim@opersys.com>
359Tom Zanussi <zanussi@us.ibm.com>
360
361Also thanks to Hubertus Franke for a lot of useful suggestions and bug
362reports.
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index dc276598a65a..c8bce82ddcac 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -90,7 +90,7 @@ void device_remove_file(struct device *, struct device_attribute *);
90 90
91It also defines this helper for defining device attributes: 91It also defines this helper for defining device attributes:
92 92
93#define DEVICE_ATTR(_name,_mode,_show,_store) \ 93#define DEVICE_ATTR(_name, _mode, _show, _store) \
94struct device_attribute dev_attr_##_name = { \ 94struct device_attribute dev_attr_##_name = { \
95 .attr = {.name = __stringify(_name) , .mode = _mode }, \ 95 .attr = {.name = __stringify(_name) , .mode = _mode }, \
96 .show = _show, \ 96 .show = _show, \
@@ -99,14 +99,14 @@ struct device_attribute dev_attr_##_name = { \
99 99
100For example, declaring 100For example, declaring
101 101
102static DEVICE_ATTR(foo,0644,show_foo,store_foo); 102static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
103 103
104is equivalent to doing: 104is equivalent to doing:
105 105
106static struct device_attribute dev_attr_foo = { 106static struct device_attribute dev_attr_foo = {
107 .attr = { 107 .attr = {
108 .name = "foo", 108 .name = "foo",
109 .mode = 0644, 109 .mode = S_IWUSR | S_IRUGO,
110 }, 110 },
111 .show = show_foo, 111 .show = show_foo,
112 .store = store_foo, 112 .store = store_foo,
@@ -121,8 +121,8 @@ set of sysfs operations for forwarding read and write calls to the
121show and store methods of the attribute owners. 121show and store methods of the attribute owners.
122 122
123struct sysfs_ops { 123struct sysfs_ops {
124 ssize_t (*show)(struct kobject *, struct attribute *,char *); 124 ssize_t (*show)(struct kobject *, struct attribute *, char *);
125 ssize_t (*store)(struct kobject *,struct attribute *,const char *); 125 ssize_t (*store)(struct kobject *, struct attribute *, const char *);
126}; 126};
127 127
128[ Subsystems should have already defined a struct kobj_type as a 128[ Subsystems should have already defined a struct kobj_type as a
@@ -137,7 +137,7 @@ calls the associated methods.
137 137
138To illustrate: 138To illustrate:
139 139
140#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr) 140#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
141#define to_dev(d) container_of(d, struct device, kobj) 141#define to_dev(d) container_of(d, struct device, kobj)
142 142
143static ssize_t 143static ssize_t
@@ -148,7 +148,7 @@ dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
148 ssize_t ret = 0; 148 ssize_t ret = 0;
149 149
150 if (dev_attr->show) 150 if (dev_attr->show)
151 ret = dev_attr->show(dev,buf); 151 ret = dev_attr->show(dev, buf);
152 return ret; 152 return ret;
153} 153}
154 154
@@ -216,16 +216,16 @@ A very simple (and naive) implementation of a device attribute is:
216 216
217static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf) 217static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf)
218{ 218{
219 return sprintf(buf,"%s\n",dev->name); 219 return snprintf(buf, PAGE_SIZE, "%s\n", dev->name);
220} 220}
221 221
222static ssize_t store_name(struct device * dev, const char * buf) 222static ssize_t store_name(struct device * dev, const char * buf)
223{ 223{
224 sscanf(buf,"%20s",dev->name); 224 sscanf(buf, "%20s", dev->name);
225 return strlen(buf); 225 return strnlen(buf, PAGE_SIZE);
226} 226}
227 227
228static DEVICE_ATTR(name,S_IRUGO,show_name,store_name); 228static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
229 229
230 230
231(Note that the real implementation doesn't allow userspace to set the 231(Note that the real implementation doesn't allow userspace to set the
@@ -290,7 +290,7 @@ struct device_attribute {
290 290
291Declaring: 291Declaring:
292 292
293DEVICE_ATTR(_name,_str,_mode,_show,_store); 293DEVICE_ATTR(_name, _str, _mode, _show, _store);
294 294
295Creation/Removal: 295Creation/Removal:
296 296
@@ -310,7 +310,7 @@ struct bus_attribute {
310 310
311Declaring: 311Declaring:
312 312
313BUS_ATTR(_name,_mode,_show,_store) 313BUS_ATTR(_name, _mode, _show, _store)
314 314
315Creation/Removal: 315Creation/Removal:
316 316
@@ -331,7 +331,7 @@ struct driver_attribute {
331 331
332Declaring: 332Declaring:
333 333
334DRIVER_ATTR(_name,_mode,_show,_store) 334DRIVER_ATTR(_name, _mode, _show, _store)
335 335
336Creation/Removal: 336Creation/Removal:
337 337
diff --git a/Documentation/filesystems/v9fs.txt b/Documentation/filesystems/v9fs.txt
new file mode 100644
index 000000000000..4e92feb6b507
--- /dev/null
+++ b/Documentation/filesystems/v9fs.txt
@@ -0,0 +1,95 @@
1 V9FS: 9P2000 for Linux
2 ======================
3
4ABOUT
5=====
6
7v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
8
9This software was originally developed by Ron Minnich <rminnich@lanl.gov>
10and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson
11<gwatson@lanl.gov> and most recently Eric Van Hensbergen
12<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>.
13
14USAGE
15=====
16
17For remote file server:
18
19 mount -t 9P 10.10.1.2 /mnt/9
20
21For Plan 9 From User Space applications (http://swtch.com/plan9)
22
23 mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER
24
25OPTIONS
26=======
27
28 proto=name select an alternative transport. Valid options are
29 currently:
30 unix - specifying a named pipe mount point
31 tcp - specifying a normal TCP/IP connection
32 fd - used passed file descriptors for connection
33 (see rfdno and wfdno)
34
35 name=name user name to attempt mount as on the remote server. The
36 server may override or ignore this value. Certain user
37 names may require authentication.
38
39 aname=name aname specifies the file tree to access when the server is
40 offering several exported file systems.
41
42 debug=n specifies debug level. The debug level is a bitmask.
43 0x01 = display verbose error messages
44 0x02 = developer debug (DEBUG_CURRENT)
45 0x04 = display 9P trace
46 0x08 = display VFS trace
47 0x10 = display Marshalling debug
48 0x20 = display RPC debug
49 0x40 = display transport debug
50 0x80 = display allocation debug
51
52 rfdno=n the file descriptor for reading with proto=fd
53
54 wfdno=n the file descriptor for writing with proto=fd
55
56 maxdata=n the number of bytes to use for 9P packet payload (msize)
57
58 port=n port to connect to on the remote server
59
60 timeout=n request timeouts (in ms) (default 60000ms)
61
62 noextend force legacy mode (no 9P2000.u semantics)
63
64 uid attempt to mount as a particular uid
65
66 gid attempt to mount with a particular gid
67
68 afid security channel - used by Plan 9 authentication protocols
69
70 nodevmap do not map special files - represent them as normal files.
71 This can be used to share devices/named pipes/sockets between
72 hosts. This functionality will be expanded in later versions.
73
74RESOURCES
75=========
76
77The Linux version of the 9P server, along with some client-side utilities
78can be found at http://v9fs.sf.net (along with a CVS repository of the
79development branch of this module). There are user and developer mailing
80lists here, as well as a bug-tracker.
81
82For more information on the Plan 9 Operating System check out
83http://plan9.bell-labs.com/plan9
84
85For information on Plan 9 from User Space (Plan 9 applications and libraries
86ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
87
88
89STATUS
90======
91
92The 2.6 kernel support is working on PPC and x86.
93
94PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS.
95
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3f318dd44c77..f042c12e0ed2 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -1,35 +1,27 @@
1/* -*- auto-fill -*- */
2 1
3 Overview of the Virtual File System 2 Overview of the Linux Virtual File System
4 3
5 Richard Gooch <rgooch@atnf.csiro.au> 4 Original author: Richard Gooch <rgooch@atnf.csiro.au>
6 5
7 5-JUL-1999 6 Last updated on August 25, 2005
8 7
8 Copyright (C) 1999 Richard Gooch
9 Copyright (C) 2005 Pekka Enberg
9 10
10Conventions used in this document <section> 11 This file is released under the GPLv2.
11=================================
12 12
13Each section in this document will have the string "<section>" at the
14right-hand side of the section title. Each subsection will have
15"<subsection>" at the right-hand side. These strings are meant to make
16it easier to search through the document.
17 13
18NOTE that the master copy of this document is available online at: 14What is it?
19http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
20
21
22What is it? <section>
23=========== 15===========
24 16
25The Virtual File System (otherwise known as the Virtual Filesystem 17The Virtual File System (otherwise known as the Virtual Filesystem
26Switch) is the software layer in the kernel that provides the 18Switch) is the software layer in the kernel that provides the
27filesystem interface to userspace programs. It also provides an 19filesystem interface to userspace programs. It also provides an
28abstraction within the kernel which allows different filesystem 20abstraction within the kernel which allows different filesystem
29implementations to co-exist. 21implementations to coexist.
30 22
31 23
32A Quick Look At How It Works <section> 24A Quick Look At How It Works
33============================ 25============================
34 26
35In this section I'll briefly describe how things work, before 27In this section I'll briefly describe how things work, before
@@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the
38other view which is how a filesystem is supported and subsequently 30other view which is how a filesystem is supported and subsequently
39mounted. 31mounted.
40 32
41Opening a File <subsection> 33
34Opening a File
42-------------- 35--------------
43 36
44The VFS implements the open(2), stat(2), chmod(2) and similar system 37The VFS implements the open(2), stat(2), chmod(2) and similar system
@@ -77,7 +70,7 @@ back to userspace.
77 70
78Opening a file requires another operation: allocation of a file 71Opening a file requires another operation: allocation of a file
79structure (this is the kernel-side implementation of file 72structure (this is the kernel-side implementation of file
80descriptors). The freshly allocated file structure is initialised with 73descriptors). The freshly allocated file structure is initialized with
81a pointer to the dentry and a set of file operation member functions. 74a pointer to the dentry and a set of file operation member functions.
82These are taken from the inode data. The open() file method is then 75These are taken from the inode data. The open() file method is then
83called so the specific filesystem implementation can do it's work. You 76called so the specific filesystem implementation can do it's work. You
@@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different
102processors. You should ensure that access to shared resources is 95processors. You should ensure that access to shared resources is
103protected by appropriate locks. 96protected by appropriate locks.
104 97
105Registering and Mounting a Filesystem <subsection> 98
99Registering and Mounting a Filesystem
106------------------------------------- 100-------------------------------------
107 101
108If you want to support a new kind of filesystem in the kernel, all you 102If you want to support a new kind of filesystem in the kernel, all you
@@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem.
123It's now time to look at things in more detail. 117It's now time to look at things in more detail.
124 118
125 119
126struct file_system_type <section> 120struct file_system_type
127======================= 121=======================
128 122
129This describes the filesystem. As of kernel 2.1.99, the following 123This describes the filesystem. As of kernel 2.6.13, the following
130members are defined: 124members are defined:
131 125
132struct file_system_type { 126struct file_system_type {
133 const char *name; 127 const char *name;
134 int fs_flags; 128 int fs_flags;
135 struct super_block *(*read_super) (struct super_block *, void *, int); 129 struct super_block *(*get_sb) (struct file_system_type *, int,
136 struct file_system_type * next; 130 const char *, void *);
131 void (*kill_sb) (struct super_block *);
132 struct module *owner;
133 struct file_system_type * next;
134 struct list_head fs_supers;
137}; 135};
138 136
139 name: the name of the filesystem type, such as "ext2", "iso9660", 137 name: the name of the filesystem type, such as "ext2", "iso9660",
@@ -141,51 +139,97 @@ struct file_system_type {
141 139
142 fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) 140 fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
143 141
144 read_super: the method to call when a new instance of this 142 get_sb: the method to call when a new instance of this
145 filesystem should be mounted 143 filesystem should be mounted
146 144
147 next: for internal VFS use: you should initialise this to NULL 145 kill_sb: the method to call when an instance of this filesystem
146 should be unmounted
147
148 owner: for internal VFS use: you should initialize this to THIS_MODULE in
149 most cases.
148 150
149The read_super() method has the following arguments: 151 next: for internal VFS use: you should initialize this to NULL
152
153The get_sb() method has the following arguments:
150 154
151 struct super_block *sb: the superblock structure. This is partially 155 struct super_block *sb: the superblock structure. This is partially
152 initialised by the VFS and the rest must be initialised by the 156 initialized by the VFS and the rest must be initialized by the
153 read_super() method 157 get_sb() method
158
159 int flags: mount flags
160
161 const char *dev_name: the device name we are mounting.
154 162
155 void *data: arbitrary mount options, usually comes as an ASCII 163 void *data: arbitrary mount options, usually comes as an ASCII
156 string 164 string
157 165
158 int silent: whether or not to be silent on error 166 int silent: whether or not to be silent on error
159 167
160The read_super() method must determine if the block device specified 168The get_sb() method must determine if the block device specified
161in the superblock contains a filesystem of the type the method 169in the superblock contains a filesystem of the type the method
162supports. On success the method returns the superblock pointer, on 170supports. On success the method returns the superblock pointer, on
163failure it returns NULL. 171failure it returns NULL.
164 172
165The most interesting member of the superblock structure that the 173The most interesting member of the superblock structure that the
166read_super() method fills in is the "s_op" field. This is a pointer to 174get_sb() method fills in is the "s_op" field. This is a pointer to
167a "struct super_operations" which describes the next level of the 175a "struct super_operations" which describes the next level of the
168filesystem implementation. 176filesystem implementation.
169 177
178Usually, a filesystem uses generic one of the generic get_sb()
179implementations and provides a fill_super() method instead. The
180generic methods are:
181
182 get_sb_bdev: mount a filesystem residing on a block device
170 183
171struct super_operations <section> 184 get_sb_nodev: mount a filesystem that is not backed by a device
185
186 get_sb_single: mount a filesystem which shares the instance between
187 all mounts
188
189A fill_super() method implementation has the following arguments:
190
191 struct super_block *sb: the superblock structure. The method fill_super()
192 must initialize this properly.
193
194 void *data: arbitrary mount options, usually comes as an ASCII
195 string
196
197 int silent: whether or not to be silent on error
198
199
200struct super_operations
172======================= 201=======================
173 202
174This describes how the VFS can manipulate the superblock of your 203This describes how the VFS can manipulate the superblock of your
175filesystem. As of kernel 2.1.99, the following members are defined: 204filesystem. As of kernel 2.6.13, the following members are defined:
176 205
177struct super_operations { 206struct super_operations {
178 void (*read_inode) (struct inode *); 207 struct inode *(*alloc_inode)(struct super_block *sb);
179 int (*write_inode) (struct inode *, int); 208 void (*destroy_inode)(struct inode *);
180 void (*put_inode) (struct inode *); 209
181 void (*drop_inode) (struct inode *); 210 void (*read_inode) (struct inode *);
182 void (*delete_inode) (struct inode *); 211
183 int (*notify_change) (struct dentry *, struct iattr *); 212 void (*dirty_inode) (struct inode *);
184 void (*put_super) (struct super_block *); 213 int (*write_inode) (struct inode *, int);
185 void (*write_super) (struct super_block *); 214 void (*put_inode) (struct inode *);
186 int (*statfs) (struct super_block *, struct statfs *, int); 215 void (*drop_inode) (struct inode *);
187 int (*remount_fs) (struct super_block *, int *, char *); 216 void (*delete_inode) (struct inode *);
188 void (*clear_inode) (struct inode *); 217 void (*put_super) (struct super_block *);
218 void (*write_super) (struct super_block *);
219 int (*sync_fs)(struct super_block *sb, int wait);
220 void (*write_super_lockfs) (struct super_block *);
221 void (*unlockfs) (struct super_block *);
222 int (*statfs) (struct super_block *, struct kstatfs *);
223 int (*remount_fs) (struct super_block *, int *, char *);
224 void (*clear_inode) (struct inode *);
225 void (*umount_begin) (struct super_block *);
226
227 void (*sync_inodes) (struct super_block *sb,
228 struct writeback_control *wbc);
229 int (*show_options)(struct seq_file *, struct vfsmount *);
230
231 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
232 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
189}; 233};
190 234
191All methods are called without any locks being held, unless otherwise 235All methods are called without any locks being held, unless otherwise
@@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are
193only called from a process context (i.e. not from an interrupt handler 237only called from a process context (i.e. not from an interrupt handler
194or bottom half). 238or bottom half).
195 239
240 alloc_inode: this method is called by inode_alloc() to allocate memory
241 for struct inode and initialize it.
242
243 destroy_inode: this method is called by destroy_inode() to release
244 resources allocated for struct inode.
245
196 read_inode: this method is called to read a specific inode from the 246 read_inode: this method is called to read a specific inode from the
197 mounted filesystem. The "i_ino" member in the "struct inode" 247 mounted filesystem. The i_ino member in the struct inode is
198 will be initialised by the VFS to indicate which inode to 248 initialized by the VFS to indicate which inode to read. Other
199 read. Other members are filled in by this method 249 members are filled in by this method.
250
251 You can set this to NULL and use iget5_locked() instead of iget()
252 to read inodes. This is necessary for filesystems for which the
253 inode number is not sufficient to identify an inode.
254
255 dirty_inode: this method is called by the VFS to mark an inode dirty.
200 256
201 write_inode: this method is called when the VFS needs to write an 257 write_inode: this method is called when the VFS needs to write an
202 inode to disc. The second parameter indicates whether the write 258 inode to disc. The second parameter indicates whether the write
203 should be synchronous or not, not all filesystems check this flag. 259 should be synchronous or not, not all filesystems check this flag.
204 260
205 put_inode: called when the VFS inode is removed from the inode 261 put_inode: called when the VFS inode is removed from the inode
206 cache. This method is optional 262 cache.
207 263
208 drop_inode: called when the last access to the inode is dropped, 264 drop_inode: called when the last access to the inode is dropped,
209 with the inode_lock spinlock held. 265 with the inode_lock spinlock held.
210 266
211 This method should be either NULL (normal unix filesystem 267 This method should be either NULL (normal UNIX filesystem
212 semantics) or "generic_delete_inode" (for filesystems that do not 268 semantics) or "generic_delete_inode" (for filesystems that do not
213 want to cache inodes - causing "delete_inode" to always be 269 want to cache inodes - causing "delete_inode" to always be
214 called regardless of the value of i_nlink) 270 called regardless of the value of i_nlink)
215 271
216 The "generic_delete_inode()" behaviour is equivalent to the 272 The "generic_delete_inode()" behavior is equivalent to the
217 old practice of using "force_delete" in the put_inode() case, 273 old practice of using "force_delete" in the put_inode() case,
218 but does not have the races that the "force_delete()" approach 274 but does not have the races that the "force_delete()" approach
219 had. 275 had.
220 276
221 delete_inode: called when the VFS wants to delete an inode 277 delete_inode: called when the VFS wants to delete an inode
222 278
223 notify_change: called when VFS inode attributes are changed. If this
224 is NULL the VFS falls back to the write_inode() method. This
225 is called with the kernel lock held
226
227 put_super: called when the VFS wishes to free the superblock 279 put_super: called when the VFS wishes to free the superblock
228 (i.e. unmount). This is called with the superblock lock held 280 (i.e. unmount). This is called with the superblock lock held
229 281
230 write_super: called when the VFS superblock needs to be written to 282 write_super: called when the VFS superblock needs to be written to
231 disc. This method is optional 283 disc. This method is optional
232 284
285 sync_fs: called when VFS is writing out all dirty data associated with
286 a superblock. The second parameter indicates whether the method
287 should wait until the write out has been completed. Optional.
288
289 write_super_lockfs: called when VFS is locking a filesystem and forcing
290 it into a consistent state. This function is currently used by the
291 Logical Volume Manager (LVM).
292
293 unlockfs: called when VFS is unlocking a filesystem and making it writable
294 again.
295
233 statfs: called when the VFS needs to get filesystem statistics. This 296 statfs: called when the VFS needs to get filesystem statistics. This
234 is called with the kernel lock held 297 is called with the kernel lock held
235 298
@@ -238,21 +301,31 @@ or bottom half).
238 301
239 clear_inode: called then the VFS clears the inode. Optional 302 clear_inode: called then the VFS clears the inode. Optional
240 303
304 umount_begin: called when the VFS is unmounting a filesystem.
305
306 sync_inodes: called when the VFS is writing out dirty data associated with
307 a superblock.
308
309 show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
310
311 quota_read: called by the VFS to read from filesystem quota file.
312
313 quota_write: called by the VFS to write to filesystem quota file.
314
241The read_inode() method is responsible for filling in the "i_op" 315The read_inode() method is responsible for filling in the "i_op"
242field. This is a pointer to a "struct inode_operations" which 316field. This is a pointer to a "struct inode_operations" which
243describes the methods that can be performed on individual inodes. 317describes the methods that can be performed on individual inodes.
244 318
245 319
246struct inode_operations <section> 320struct inode_operations
247======================= 321=======================
248 322
249This describes how the VFS can manipulate an inode in your 323This describes how the VFS can manipulate an inode in your
250filesystem. As of kernel 2.1.99, the following members are defined: 324filesystem. As of kernel 2.6.13, the following members are defined:
251 325
252struct inode_operations { 326struct inode_operations {
253 struct file_operations * default_file_ops; 327 int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
254 int (*create) (struct inode *,struct dentry *,int); 328 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
255 int (*lookup) (struct inode *,struct dentry *);
256 int (*link) (struct dentry *,struct inode *,struct dentry *); 329 int (*link) (struct dentry *,struct inode *,struct dentry *);
257 int (*unlink) (struct inode *,struct dentry *); 330 int (*unlink) (struct inode *,struct dentry *);
258 int (*symlink) (struct inode *,struct dentry *,const char *); 331 int (*symlink) (struct inode *,struct dentry *,const char *);
@@ -261,25 +334,22 @@ struct inode_operations {
261 int (*mknod) (struct inode *,struct dentry *,int,dev_t); 334 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
262 int (*rename) (struct inode *, struct dentry *, 335 int (*rename) (struct inode *, struct dentry *,
263 struct inode *, struct dentry *); 336 struct inode *, struct dentry *);
264 int (*readlink) (struct dentry *, char *,int); 337 int (*readlink) (struct dentry *, char __user *,int);
265 struct dentry * (*follow_link) (struct dentry *, struct dentry *); 338 void * (*follow_link) (struct dentry *, struct nameidata *);
266 int (*readpage) (struct file *, struct page *); 339 void (*put_link) (struct dentry *, struct nameidata *, void *);
267 int (*writepage) (struct page *page, struct writeback_control *wbc);
268 int (*bmap) (struct inode *,int);
269 void (*truncate) (struct inode *); 340 void (*truncate) (struct inode *);
270 int (*permission) (struct inode *, int); 341 int (*permission) (struct inode *, int, struct nameidata *);
271 int (*smap) (struct inode *,int); 342 int (*setattr) (struct dentry *, struct iattr *);
272 int (*updatepage) (struct file *, struct page *, const char *, 343 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
273 unsigned long, unsigned int, int); 344 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
274 int (*revalidate) (struct dentry *); 345 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
346 ssize_t (*listxattr) (struct dentry *, char *, size_t);
347 int (*removexattr) (struct dentry *, const char *);
275}; 348};
276 349
277Again, all methods are called without any locks being held, unless 350Again, all methods are called without any locks being held, unless
278otherwise noted. 351otherwise noted.
279 352
280 default_file_ops: this is a pointer to a "struct file_operations"
281 which describes how to open and then manipulate open files
282
283 create: called by the open(2) and creat(2) system calls. Only 353 create: called by the open(2) and creat(2) system calls. Only
284 required if you want to support regular files. The dentry you 354 required if you want to support regular files. The dentry you
285 get should not have an inode (i.e. it should be a negative 355 get should not have an inode (i.e. it should be a negative
@@ -328,31 +398,143 @@ otherwise noted.
328 you want to support reading symbolic links 398 you want to support reading symbolic links
329 399
330 follow_link: called by the VFS to follow a symbolic link to the 400 follow_link: called by the VFS to follow a symbolic link to the
331 inode it points to. Only required if you want to support 401 inode it points to. Only required if you want to support
332 symbolic links 402 symbolic links. This function returns a void pointer cookie
403 that is passed to put_link().
404
405 put_link: called by the VFS to release resources allocated by
406 follow_link(). The cookie returned by follow_link() is passed to
407 to this function as the last parameter. It is used by filesystems
408 such as NFS where page cache is not stable (i.e. page that was
409 installed when the symbolic link walk started might not be in the
410 page cache at the end of the walk).
411
412 truncate: called by the VFS to change the size of a file. The i_size
413 field of the inode is set to the desired size by the VFS before
414 this function is called. This function is called by the truncate(2)
415 system call and related functionality.
416
417 permission: called by the VFS to check for access rights on a POSIX-like
418 filesystem.
419
420 setattr: called by the VFS to set attributes for a file. This function is
421 called by chmod(2) and related system calls.
422
423 getattr: called by the VFS to get attributes of a file. This function is
424 called by stat(2) and related system calls.
425
426 setxattr: called by the VFS to set an extended attribute for a file.
427 Extended attribute is a name:value pair associated with an inode. This
428 function is called by setxattr(2) system call.
429
430 getxattr: called by the VFS to retrieve the value of an extended attribute
431 name. This function is called by getxattr(2) function call.
432
433 listxattr: called by the VFS to list all extended attributes for a given
434 file. This function is called by listxattr(2) system call.
435
436 removexattr: called by the VFS to remove an extended attribute from a file.
437 This function is called by removexattr(2) system call.
438
439
440struct address_space_operations
441===============================
442
443This describes how the VFS can manipulate mapping of a file to page cache in
444your filesystem. As of kernel 2.6.13, the following members are defined:
445
446struct address_space_operations {
447 int (*writepage)(struct page *page, struct writeback_control *wbc);
448 int (*readpage)(struct file *, struct page *);
449 int (*sync_page)(struct page *);
450 int (*writepages)(struct address_space *, struct writeback_control *);
451 int (*set_page_dirty)(struct page *page);
452 int (*readpages)(struct file *filp, struct address_space *mapping,
453 struct list_head *pages, unsigned nr_pages);
454 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
455 int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
456 sector_t (*bmap)(struct address_space *, sector_t);
457 int (*invalidatepage) (struct page *, unsigned long);
458 int (*releasepage) (struct page *, int);
459 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
460 loff_t offset, unsigned long nr_segs);
461 struct page* (*get_xip_page)(struct address_space *, sector_t,
462 int);
463};
464
465 writepage: called by the VM write a dirty page to backing store.
466
467 readpage: called by the VM to read a page from backing store.
468
469 sync_page: called by the VM to notify the backing store to perform all
470 queued I/O operations for a page. I/O operations for other pages
471 associated with this address_space object may also be performed.
472
473 writepages: called by the VM to write out pages associated with the
474 address_space object.
475
476 set_page_dirty: called by the VM to set a page dirty.
477
478 readpages: called by the VM to read pages associated with the address_space
479 object.
333 480
481 prepare_write: called by the generic write path in VM to set up a write
482 request for a page.
334 483
335struct file_operations <section> 484 commit_write: called by the generic write path in VM to write page to
485 its backing store.
486
487 bmap: called by the VFS to map a logical block offset within object to
488 physical block number. This method is use by for the legacy FIBMAP
489 ioctl. Other uses are discouraged.
490
491 invalidatepage: called by the VM on truncate to disassociate a page from its
492 address_space mapping.
493
494 releasepage: called by the VFS to release filesystem specific metadata from
495 a page.
496
497 direct_IO: called by the VM for direct I/O writes and reads.
498
499 get_xip_page: called by the VM to translate a block number to a page.
500 The page is valid until the corresponding filesystem is unmounted.
501 Filesystems that want to use execute-in-place (XIP) need to implement
502 it. An example implementation can be found in fs/ext2/xip.c.
503
504
505struct file_operations
336====================== 506======================
337 507
338This describes how the VFS can manipulate an open file. As of kernel 508This describes how the VFS can manipulate an open file. As of kernel
3392.1.99, the following members are defined: 5092.6.13, the following members are defined:
340 510
341struct file_operations { 511struct file_operations {
342 loff_t (*llseek) (struct file *, loff_t, int); 512 loff_t (*llseek) (struct file *, loff_t, int);
343 ssize_t (*read) (struct file *, char *, size_t, loff_t *); 513 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
344 ssize_t (*write) (struct file *, const char *, size_t, loff_t *); 514 ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
515 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
516 ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
345 int (*readdir) (struct file *, void *, filldir_t); 517 int (*readdir) (struct file *, void *, filldir_t);
346 unsigned int (*poll) (struct file *, struct poll_table_struct *); 518 unsigned int (*poll) (struct file *, struct poll_table_struct *);
347 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); 519 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
520 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
521 long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
348 int (*mmap) (struct file *, struct vm_area_struct *); 522 int (*mmap) (struct file *, struct vm_area_struct *);
349 int (*open) (struct inode *, struct file *); 523 int (*open) (struct inode *, struct file *);
524 int (*flush) (struct file *);
350 int (*release) (struct inode *, struct file *); 525 int (*release) (struct inode *, struct file *);
351 int (*fsync) (struct file *, struct dentry *); 526 int (*fsync) (struct file *, struct dentry *, int datasync);
352 int (*fasync) (struct file *, int); 527 int (*aio_fsync) (struct kiocb *, int datasync);
353 int (*check_media_change) (kdev_t dev); 528 int (*fasync) (int, struct file *, int);
354 int (*revalidate) (kdev_t dev);
355 int (*lock) (struct file *, int, struct file_lock *); 529 int (*lock) (struct file *, int, struct file_lock *);
530 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
531 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
532 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
533 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
534 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
535 int (*check_flags)(int);
536 int (*dir_notify)(struct file *filp, unsigned long arg);
537 int (*flock) (struct file *, int, struct file_lock *);
356}; 538};
357 539
358Again, all methods are called without any locks being held, unless 540Again, all methods are called without any locks being held, unless
@@ -362,8 +544,12 @@ otherwise noted.
362 544
363 read: called by read(2) and related system calls 545 read: called by read(2) and related system calls
364 546
547 aio_read: called by io_submit(2) and other asynchronous I/O operations
548
365 write: called by write(2) and related system calls 549 write: called by write(2) and related system calls
366 550
551 aio_write: called by io_submit(2) and other asynchronous I/O operations
552
367 readdir: called when the VFS needs to read the directory contents 553 readdir: called when the VFS needs to read the directory contents
368 554
369 poll: called by the VFS when a process wants to check if there is 555 poll: called by the VFS when a process wants to check if there is
@@ -372,18 +558,25 @@ otherwise noted.
372 558
373 ioctl: called by the ioctl(2) system call 559 ioctl: called by the ioctl(2) system call
374 560
561 unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
562 require the BKL should use this method instead of the ioctl() above.
563
564 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
565 are used on 64 bit kernels.
566
375 mmap: called by the mmap(2) system call 567 mmap: called by the mmap(2) system call
376 568
377 open: called by the VFS when an inode should be opened. When the VFS 569 open: called by the VFS when an inode should be opened. When the VFS
378 opens a file, it creates a new "struct file" and initialises 570 opens a file, it creates a new "struct file". It then calls the
379 the "f_op" file operations member with the "default_file_ops" 571 open method for the newly allocated file structure. You might
380 field in the inode structure. It then calls the open method 572 think that the open method really belongs in
381 for the newly allocated file structure. You might think that 573 "struct inode_operations", and you may be right. I think it's
382 the open method really belongs in "struct inode_operations", 574 done the way it is because it makes filesystems simpler to
383 and you may be right. I think it's done the way it is because 575 implement. The open() method is a good place to initialize the
384 it makes filesystems simpler to implement. The open() method 576 "private_data" member in the file structure if you want to point
385 is a good place to initialise the "private_data" member in the 577 to a device structure
386 file structure if you want to point to a device structure 578
579 flush: called by the close(2) system call to flush a file
387 580
388 release: called when the last reference to an open file is closed 581 release: called when the last reference to an open file is closed
389 582
@@ -392,6 +585,23 @@ otherwise noted.
392 fasync: called by the fcntl(2) system call when asynchronous 585 fasync: called by the fcntl(2) system call when asynchronous
393 (non-blocking) mode is enabled for a file 586 (non-blocking) mode is enabled for a file
394 587
588 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
589 commands
590
591 readv: called by the readv(2) system call
592
593 writev: called by the writev(2) system call
594
595 sendfile: called by the sendfile(2) system call
596
597 get_unmapped_area: called by the mmap(2) system call
598
599 check_flags: called by the fcntl(2) system call for F_SETFL command
600
601 dir_notify: called by the fcntl(2) system call for F_NOTIFY command
602
603 flock: called by the flock(2) system call
604
395Note that the file operations are implemented by the specific 605Note that the file operations are implemented by the specific
396filesystem in which the inode resides. When opening a device node 606filesystem in which the inode resides. When opening a device node
397(character or block special) most filesystems will call special 607(character or block special) most filesystems will call special
@@ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file
400operations with those for the device driver, and then proceed to call 610operations with those for the device driver, and then proceed to call
401the new open() method for the file. This is how opening a device file 611the new open() method for the file. This is how opening a device file
402in the filesystem eventually ends up calling the device driver open() 612in the filesystem eventually ends up calling the device driver open()
403method. Note the devfs (the Device FileSystem) has a more direct path 613method.
404from device node to device driver (this is an unofficial kernel
405patch).
406 614
407 615
408Directory Entry Cache (dcache) <section> 616Directory Entry Cache (dcache)
409------------------------------ 617==============================
618
410 619
411struct dentry_operations 620struct dentry_operations
412======================== 621------------------------
413 622
414This describes how a filesystem can overload the standard dentry 623This describes how a filesystem can overload the standard dentry
415operations. Dentries and the dcache are the domain of the VFS and the 624operations. Dentries and the dcache are the domain of the VFS and the
416individual filesystem implementations. Device drivers have no business 625individual filesystem implementations. Device drivers have no business
417here. These methods may be set to NULL, as they are either optional or 626here. These methods may be set to NULL, as they are either optional or
418the VFS uses a default. As of kernel 2.1.99, the following members are 627the VFS uses a default. As of kernel 2.6.13, the following members are
419defined: 628defined:
420 629
421struct dentry_operations { 630struct dentry_operations {
422 int (*d_revalidate)(struct dentry *); 631 int (*d_revalidate)(struct dentry *, struct nameidata *);
423 int (*d_hash) (struct dentry *, struct qstr *); 632 int (*d_hash) (struct dentry *, struct qstr *);
424 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 633 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
425 void (*d_delete)(struct dentry *); 634 int (*d_delete)(struct dentry *);
426 void (*d_release)(struct dentry *); 635 void (*d_release)(struct dentry *);
427 void (*d_iput)(struct dentry *, struct inode *); 636 void (*d_iput)(struct dentry *, struct inode *);
428}; 637};
@@ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list
451of child dentries. Child dentries are basically like files in a 660of child dentries. Child dentries are basically like files in a
452directory. 661directory.
453 662
663
454Directory Entry Cache APIs 664Directory Entry Cache APIs
455-------------------------- 665--------------------------
456 666
@@ -471,7 +681,7 @@ manipulate dentries:
471 "d_delete" method is called 681 "d_delete" method is called
472 682
473 d_drop: this unhashes a dentry from its parents hash list. A 683 d_drop: this unhashes a dentry from its parents hash list. A
474 subsequent call to dput() will dellocate the dentry if its 684 subsequent call to dput() will deallocate the dentry if its
475 usage count drops to 0 685 usage count drops to 0
476 686
477 d_delete: delete a dentry. If there are no other open references to 687 d_delete: delete a dentry. If there are no other open references to
@@ -507,16 +717,16 @@ up by walking the tree starting with the first component
507of the pathname and using that dentry along with the next 717of the pathname and using that dentry along with the next
508component to look up the next level and so on. Since it 718component to look up the next level and so on. Since it
509is a frequent operation for workloads like multiuser 719is a frequent operation for workloads like multiuser
510environments and webservers, it is important to optimize 720environments and web servers, it is important to optimize
511this path. 721this path.
512 722
513Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus 723Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
514in every component during path look-up. Since 2.5.10 onwards, 724in every component during path look-up. Since 2.5.10 onwards,
515fastwalk algorithm changed this by holding the dcache_lock 725fast-walk algorithm changed this by holding the dcache_lock
516at the beginning and walking as many cached path component 726at the beginning and walking as many cached path component
517dentries as possible. This signficantly decreases the number 727dentries as possible. This significantly decreases the number
518of acquisition of dcache_lock. However it also increases the 728of acquisition of dcache_lock. However it also increases the
519lock hold time signficantly and affects performance in large 729lock hold time significantly and affects performance in large
520SMP machines. Since 2.5.62 kernel, dcache has been using 730SMP machines. Since 2.5.62 kernel, dcache has been using
521a new locking model that uses RCU to make dcache look-up 731a new locking model that uses RCU to make dcache look-up
522lock-free. 732lock-free.
@@ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well
527as d_inode and several other things like mount look-up. RCU-based 737as d_inode and several other things like mount look-up. RCU-based
528changes affect only the way the hash chain is protected. For everything 738changes affect only the way the hash chain is protected. For everything
529else the dcache_lock must be taken for both traversing as well as 739else the dcache_lock must be taken for both traversing as well as
530updating. The hash chain updations too take the dcache_lock. 740updating. The hash chain updates too take the dcache_lock.
531The significant change is the way d_lookup traverses the hash chain, 741The significant change is the way d_lookup traverses the hash chain,
532it doesn't acquire the dcache_lock for this and rely on RCU to 742it doesn't acquire the dcache_lock for this and rely on RCU to
533ensure that the dentry has not been *freed*. 743ensure that the dentry has not been *freed*.
@@ -535,14 +745,15 @@ ensure that the dentry has not been *freed*.
535 745
536Dcache locking details 746Dcache locking details
537---------------------- 747----------------------
748
538For many multi-user workloads, open() and stat() on files are 749For many multi-user workloads, open() and stat() on files are
539very frequently occurring operations. Both involve walking 750very frequently occurring operations. Both involve walking
540of path names to find the dentry corresponding to the 751of path names to find the dentry corresponding to the
541concerned file. In 2.4 kernel, dcache_lock was held 752concerned file. In 2.4 kernel, dcache_lock was held
542during look-up of each path component. Contention and 753during look-up of each path component. Contention and
543cacheline bouncing of this global lock caused significant 754cache-line bouncing of this global lock caused significant
544scalability problems. With the introduction of RCU 755scalability problems. With the introduction of RCU
545in linux kernel, this was worked around by making 756in Linux kernel, this was worked around by making
546the look-up of path components during path walking lock-free. 757the look-up of path components during path walking lock-free.
547 758
548 759
@@ -562,7 +773,7 @@ Some of the important changes are :
5622. Insertion of a dentry into the hash table is done using 7732. Insertion of a dentry into the hash table is done using
563 hlist_add_head_rcu() which take care of ordering the writes - 774 hlist_add_head_rcu() which take care of ordering the writes -
564 the writes to the dentry must be visible before the dentry 775 the writes to the dentry must be visible before the dentry
565 is inserted. This works in conjuction with hlist_for_each_rcu() 776 is inserted. This works in conjunction with hlist_for_each_rcu()
566 while walking the hash chain. The only requirement is that 777 while walking the hash chain. The only requirement is that
567 all initialization to the dentry must be done before hlist_add_head_rcu() 778 all initialization to the dentry must be done before hlist_add_head_rcu()
568 since we don't have dcache_lock protection while traversing 779 since we don't have dcache_lock protection while traversing
@@ -584,7 +795,7 @@ Some of the important changes are :
584 the same. In some sense, dcache_rcu path walking looks like 795 the same. In some sense, dcache_rcu path walking looks like
585 the pre-2.5.10 version. 796 the pre-2.5.10 version.
586 797
5875. All dentry hash chain updations must take the dcache_lock as well as 7985. All dentry hash chain updates must take the dcache_lock as well as
588 the per-dentry lock in that order. dput() does this to ensure 799 the per-dentry lock in that order. dput() does this to ensure
589 that a dentry that has just been looked up in another CPU 800 that a dentry that has just been looked up in another CPU
590 doesn't get deleted before dget() can be done on it. 801 doesn't get deleted before dget() can be done on it.
@@ -640,10 +851,10 @@ handled as described below :
640 Since we redo the d_parent check and compare name while holding 851 Since we redo the d_parent check and compare name while holding
641 d_lock, lock-free look-up will not race against d_move(). 852 d_lock, lock-free look-up will not race against d_move().
642 853
6434. There can be a theoritical race when a dentry keeps coming back 8544. There can be a theoretical race when a dentry keeps coming back
644 to original bucket due to double moves. Due to this look-up may 855 to original bucket due to double moves. Due to this look-up may
645 consider that it has never moved and can end up in a infinite loop. 856 consider that it has never moved and can end up in a infinite loop.
646 But this is not any worse that theoritical livelocks we already 857 But this is not any worse that theoretical livelocks we already
647 have in the kernel. 858 have in the kernel.
648 859
649 860