aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems/vfs.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/vfs.txt')
-rw-r--r--Documentation/filesystems/vfs.txt435
1 files changed, 323 insertions, 112 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3f318dd44c77..f042c12e0ed2 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -1,35 +1,27 @@
1/* -*- auto-fill -*- */
2 1
3 Overview of the Virtual File System 2 Overview of the Linux Virtual File System
4 3
5 Richard Gooch <rgooch@atnf.csiro.au> 4 Original author: Richard Gooch <rgooch@atnf.csiro.au>
6 5
7 5-JUL-1999 6 Last updated on August 25, 2005
8 7
8 Copyright (C) 1999 Richard Gooch
9 Copyright (C) 2005 Pekka Enberg
9 10
10Conventions used in this document <section> 11 This file is released under the GPLv2.
11=================================
12 12
13Each section in this document will have the string "<section>" at the
14right-hand side of the section title. Each subsection will have
15"<subsection>" at the right-hand side. These strings are meant to make
16it easier to search through the document.
17 13
18NOTE that the master copy of this document is available online at: 14What is it?
19http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
20
21
22What is it? <section>
23=========== 15===========
24 16
25The Virtual File System (otherwise known as the Virtual Filesystem 17The Virtual File System (otherwise known as the Virtual Filesystem
26Switch) is the software layer in the kernel that provides the 18Switch) is the software layer in the kernel that provides the
27filesystem interface to userspace programs. It also provides an 19filesystem interface to userspace programs. It also provides an
28abstraction within the kernel which allows different filesystem 20abstraction within the kernel which allows different filesystem
29implementations to co-exist. 21implementations to coexist.
30 22
31 23
32A Quick Look At How It Works <section> 24A Quick Look At How It Works
33============================ 25============================
34 26
35In this section I'll briefly describe how things work, before 27In this section I'll briefly describe how things work, before
@@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the
38other view which is how a filesystem is supported and subsequently 30other view which is how a filesystem is supported and subsequently
39mounted. 31mounted.
40 32
41Opening a File <subsection> 33
34Opening a File
42-------------- 35--------------
43 36
44The VFS implements the open(2), stat(2), chmod(2) and similar system 37The VFS implements the open(2), stat(2), chmod(2) and similar system
@@ -77,7 +70,7 @@ back to userspace.
77 70
78Opening a file requires another operation: allocation of a file 71Opening a file requires another operation: allocation of a file
79structure (this is the kernel-side implementation of file 72structure (this is the kernel-side implementation of file
80descriptors). The freshly allocated file structure is initialised with 73descriptors). The freshly allocated file structure is initialized with
81a pointer to the dentry and a set of file operation member functions. 74a pointer to the dentry and a set of file operation member functions.
82These are taken from the inode data. The open() file method is then 75These are taken from the inode data. The open() file method is then
83called so the specific filesystem implementation can do it's work. You 76called so the specific filesystem implementation can do it's work. You
@@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different
102processors. You should ensure that access to shared resources is 95processors. You should ensure that access to shared resources is
103protected by appropriate locks. 96protected by appropriate locks.
104 97
105Registering and Mounting a Filesystem <subsection> 98
99Registering and Mounting a Filesystem
106------------------------------------- 100-------------------------------------
107 101
108If you want to support a new kind of filesystem in the kernel, all you 102If you want to support a new kind of filesystem in the kernel, all you
@@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem.
123It's now time to look at things in more detail. 117It's now time to look at things in more detail.
124 118
125 119
126struct file_system_type <section> 120struct file_system_type
127======================= 121=======================
128 122
129This describes the filesystem. As of kernel 2.1.99, the following 123This describes the filesystem. As of kernel 2.6.13, the following
130members are defined: 124members are defined:
131 125
132struct file_system_type { 126struct file_system_type {
133 const char *name; 127 const char *name;
134 int fs_flags; 128 int fs_flags;
135 struct super_block *(*read_super) (struct super_block *, void *, int); 129 struct super_block *(*get_sb) (struct file_system_type *, int,
136 struct file_system_type * next; 130 const char *, void *);
131 void (*kill_sb) (struct super_block *);
132 struct module *owner;
133 struct file_system_type * next;
134 struct list_head fs_supers;
137}; 135};
138 136
139 name: the name of the filesystem type, such as "ext2", "iso9660", 137 name: the name of the filesystem type, such as "ext2", "iso9660",
@@ -141,51 +139,97 @@ struct file_system_type {
141 139
142 fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) 140 fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
143 141
144 read_super: the method to call when a new instance of this 142 get_sb: the method to call when a new instance of this
145 filesystem should be mounted 143 filesystem should be mounted
146 144
147 next: for internal VFS use: you should initialise this to NULL 145 kill_sb: the method to call when an instance of this filesystem
146 should be unmounted
147
148 owner: for internal VFS use: you should initialize this to THIS_MODULE in
149 most cases.
148 150
149The read_super() method has the following arguments: 151 next: for internal VFS use: you should initialize this to NULL
152
153The get_sb() method has the following arguments:
150 154
151 struct super_block *sb: the superblock structure. This is partially 155 struct super_block *sb: the superblock structure. This is partially
152 initialised by the VFS and the rest must be initialised by the 156 initialized by the VFS and the rest must be initialized by the
153 read_super() method 157 get_sb() method
158
159 int flags: mount flags
160
161 const char *dev_name: the device name we are mounting.
154 162
155 void *data: arbitrary mount options, usually comes as an ASCII 163 void *data: arbitrary mount options, usually comes as an ASCII
156 string 164 string
157 165
158 int silent: whether or not to be silent on error 166 int silent: whether or not to be silent on error
159 167
160The read_super() method must determine if the block device specified 168The get_sb() method must determine if the block device specified
161in the superblock contains a filesystem of the type the method 169in the superblock contains a filesystem of the type the method
162supports. On success the method returns the superblock pointer, on 170supports. On success the method returns the superblock pointer, on
163failure it returns NULL. 171failure it returns NULL.
164 172
165The most interesting member of the superblock structure that the 173The most interesting member of the superblock structure that the
166read_super() method fills in is the "s_op" field. This is a pointer to 174get_sb() method fills in is the "s_op" field. This is a pointer to
167a "struct super_operations" which describes the next level of the 175a "struct super_operations" which describes the next level of the
168filesystem implementation. 176filesystem implementation.
169 177
178Usually, a filesystem uses generic one of the generic get_sb()
179implementations and provides a fill_super() method instead. The
180generic methods are:
181
182 get_sb_bdev: mount a filesystem residing on a block device
170 183
171struct super_operations <section> 184 get_sb_nodev: mount a filesystem that is not backed by a device
185
186 get_sb_single: mount a filesystem which shares the instance between
187 all mounts
188
189A fill_super() method implementation has the following arguments:
190
191 struct super_block *sb: the superblock structure. The method fill_super()
192 must initialize this properly.
193
194 void *data: arbitrary mount options, usually comes as an ASCII
195 string
196
197 int silent: whether or not to be silent on error
198
199
200struct super_operations
172======================= 201=======================
173 202
174This describes how the VFS can manipulate the superblock of your 203This describes how the VFS can manipulate the superblock of your
175filesystem. As of kernel 2.1.99, the following members are defined: 204filesystem. As of kernel 2.6.13, the following members are defined:
176 205
177struct super_operations { 206struct super_operations {
178 void (*read_inode) (struct inode *); 207 struct inode *(*alloc_inode)(struct super_block *sb);
179 int (*write_inode) (struct inode *, int); 208 void (*destroy_inode)(struct inode *);
180 void (*put_inode) (struct inode *); 209
181 void (*drop_inode) (struct inode *); 210 void (*read_inode) (struct inode *);
182 void (*delete_inode) (struct inode *); 211
183 int (*notify_change) (struct dentry *, struct iattr *); 212 void (*dirty_inode) (struct inode *);
184 void (*put_super) (struct super_block *); 213 int (*write_inode) (struct inode *, int);
185 void (*write_super) (struct super_block *); 214 void (*put_inode) (struct inode *);
186 int (*statfs) (struct super_block *, struct statfs *, int); 215 void (*drop_inode) (struct inode *);
187 int (*remount_fs) (struct super_block *, int *, char *); 216 void (*delete_inode) (struct inode *);
188 void (*clear_inode) (struct inode *); 217 void (*put_super) (struct super_block *);
218 void (*write_super) (struct super_block *);
219 int (*sync_fs)(struct super_block *sb, int wait);
220 void (*write_super_lockfs) (struct super_block *);
221 void (*unlockfs) (struct super_block *);
222 int (*statfs) (struct super_block *, struct kstatfs *);
223 int (*remount_fs) (struct super_block *, int *, char *);
224 void (*clear_inode) (struct inode *);
225 void (*umount_begin) (struct super_block *);
226
227 void (*sync_inodes) (struct super_block *sb,
228 struct writeback_control *wbc);
229 int (*show_options)(struct seq_file *, struct vfsmount *);
230
231 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
232 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
189}; 233};
190 234
191All methods are called without any locks being held, unless otherwise 235All methods are called without any locks being held, unless otherwise
@@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are
193only called from a process context (i.e. not from an interrupt handler 237only called from a process context (i.e. not from an interrupt handler
194or bottom half). 238or bottom half).
195 239
240 alloc_inode: this method is called by inode_alloc() to allocate memory
241 for struct inode and initialize it.
242
243 destroy_inode: this method is called by destroy_inode() to release
244 resources allocated for struct inode.
245
196 read_inode: this method is called to read a specific inode from the 246 read_inode: this method is called to read a specific inode from the
197 mounted filesystem. The "i_ino" member in the "struct inode" 247 mounted filesystem. The i_ino member in the struct inode is
198 will be initialised by the VFS to indicate which inode to 248 initialized by the VFS to indicate which inode to read. Other
199 read. Other members are filled in by this method 249 members are filled in by this method.
250
251 You can set this to NULL and use iget5_locked() instead of iget()
252 to read inodes. This is necessary for filesystems for which the
253 inode number is not sufficient to identify an inode.
254
255 dirty_inode: this method is called by the VFS to mark an inode dirty.
200 256
201 write_inode: this method is called when the VFS needs to write an 257 write_inode: this method is called when the VFS needs to write an
202 inode to disc. The second parameter indicates whether the write 258 inode to disc. The second parameter indicates whether the write
203 should be synchronous or not, not all filesystems check this flag. 259 should be synchronous or not, not all filesystems check this flag.
204 260
205 put_inode: called when the VFS inode is removed from the inode 261 put_inode: called when the VFS inode is removed from the inode
206 cache. This method is optional 262 cache.
207 263
208 drop_inode: called when the last access to the inode is dropped, 264 drop_inode: called when the last access to the inode is dropped,
209 with the inode_lock spinlock held. 265 with the inode_lock spinlock held.
210 266
211 This method should be either NULL (normal unix filesystem 267 This method should be either NULL (normal UNIX filesystem
212 semantics) or "generic_delete_inode" (for filesystems that do not 268 semantics) or "generic_delete_inode" (for filesystems that do not
213 want to cache inodes - causing "delete_inode" to always be 269 want to cache inodes - causing "delete_inode" to always be
214 called regardless of the value of i_nlink) 270 called regardless of the value of i_nlink)
215 271
216 The "generic_delete_inode()" behaviour is equivalent to the 272 The "generic_delete_inode()" behavior is equivalent to the
217 old practice of using "force_delete" in the put_inode() case, 273 old practice of using "force_delete" in the put_inode() case,
218 but does not have the races that the "force_delete()" approach 274 but does not have the races that the "force_delete()" approach
219 had. 275 had.
220 276
221 delete_inode: called when the VFS wants to delete an inode 277 delete_inode: called when the VFS wants to delete an inode
222 278
223 notify_change: called when VFS inode attributes are changed. If this
224 is NULL the VFS falls back to the write_inode() method. This
225 is called with the kernel lock held
226
227 put_super: called when the VFS wishes to free the superblock 279 put_super: called when the VFS wishes to free the superblock
228 (i.e. unmount). This is called with the superblock lock held 280 (i.e. unmount). This is called with the superblock lock held
229 281
230 write_super: called when the VFS superblock needs to be written to 282 write_super: called when the VFS superblock needs to be written to
231 disc. This method is optional 283 disc. This method is optional
232 284
285 sync_fs: called when VFS is writing out all dirty data associated with
286 a superblock. The second parameter indicates whether the method
287 should wait until the write out has been completed. Optional.
288
289 write_super_lockfs: called when VFS is locking a filesystem and forcing
290 it into a consistent state. This function is currently used by the
291 Logical Volume Manager (LVM).
292
293 unlockfs: called when VFS is unlocking a filesystem and making it writable
294 again.
295
233 statfs: called when the VFS needs to get filesystem statistics. This 296 statfs: called when the VFS needs to get filesystem statistics. This
234 is called with the kernel lock held 297 is called with the kernel lock held
235 298
@@ -238,21 +301,31 @@ or bottom half).
238 301
239 clear_inode: called then the VFS clears the inode. Optional 302 clear_inode: called then the VFS clears the inode. Optional
240 303
304 umount_begin: called when the VFS is unmounting a filesystem.
305
306 sync_inodes: called when the VFS is writing out dirty data associated with
307 a superblock.
308
309 show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
310
311 quota_read: called by the VFS to read from filesystem quota file.
312
313 quota_write: called by the VFS to write to filesystem quota file.
314
241The read_inode() method is responsible for filling in the "i_op" 315The read_inode() method is responsible for filling in the "i_op"
242field. This is a pointer to a "struct inode_operations" which 316field. This is a pointer to a "struct inode_operations" which
243describes the methods that can be performed on individual inodes. 317describes the methods that can be performed on individual inodes.
244 318
245 319
246struct inode_operations <section> 320struct inode_operations
247======================= 321=======================
248 322
249This describes how the VFS can manipulate an inode in your 323This describes how the VFS can manipulate an inode in your
250filesystem. As of kernel 2.1.99, the following members are defined: 324filesystem. As of kernel 2.6.13, the following members are defined:
251 325
252struct inode_operations { 326struct inode_operations {
253 struct file_operations * default_file_ops; 327 int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
254 int (*create) (struct inode *,struct dentry *,int); 328 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
255 int (*lookup) (struct inode *,struct dentry *);
256 int (*link) (struct dentry *,struct inode *,struct dentry *); 329 int (*link) (struct dentry *,struct inode *,struct dentry *);
257 int (*unlink) (struct inode *,struct dentry *); 330 int (*unlink) (struct inode *,struct dentry *);
258 int (*symlink) (struct inode *,struct dentry *,const char *); 331 int (*symlink) (struct inode *,struct dentry *,const char *);
@@ -261,25 +334,22 @@ struct inode_operations {
261 int (*mknod) (struct inode *,struct dentry *,int,dev_t); 334 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
262 int (*rename) (struct inode *, struct dentry *, 335 int (*rename) (struct inode *, struct dentry *,
263 struct inode *, struct dentry *); 336 struct inode *, struct dentry *);
264 int (*readlink) (struct dentry *, char *,int); 337 int (*readlink) (struct dentry *, char __user *,int);
265 struct dentry * (*follow_link) (struct dentry *, struct dentry *); 338 void * (*follow_link) (struct dentry *, struct nameidata *);
266 int (*readpage) (struct file *, struct page *); 339 void (*put_link) (struct dentry *, struct nameidata *, void *);
267 int (*writepage) (struct page *page, struct writeback_control *wbc);
268 int (*bmap) (struct inode *,int);
269 void (*truncate) (struct inode *); 340 void (*truncate) (struct inode *);
270 int (*permission) (struct inode *, int); 341 int (*permission) (struct inode *, int, struct nameidata *);
271 int (*smap) (struct inode *,int); 342 int (*setattr) (struct dentry *, struct iattr *);
272 int (*updatepage) (struct file *, struct page *, const char *, 343 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
273 unsigned long, unsigned int, int); 344 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
274 int (*revalidate) (struct dentry *); 345 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
346 ssize_t (*listxattr) (struct dentry *, char *, size_t);
347 int (*removexattr) (struct dentry *, const char *);
275}; 348};
276 349
277Again, all methods are called without any locks being held, unless 350Again, all methods are called without any locks being held, unless
278otherwise noted. 351otherwise noted.
279 352
280 default_file_ops: this is a pointer to a "struct file_operations"
281 which describes how to open and then manipulate open files
282
283 create: called by the open(2) and creat(2) system calls. Only 353 create: called by the open(2) and creat(2) system calls. Only
284 required if you want to support regular files. The dentry you 354 required if you want to support regular files. The dentry you
285 get should not have an inode (i.e. it should be a negative 355 get should not have an inode (i.e. it should be a negative
@@ -328,31 +398,143 @@ otherwise noted.
328 you want to support reading symbolic links 398 you want to support reading symbolic links
329 399
330 follow_link: called by the VFS to follow a symbolic link to the 400 follow_link: called by the VFS to follow a symbolic link to the
331 inode it points to. Only required if you want to support 401 inode it points to. Only required if you want to support
332 symbolic links 402 symbolic links. This function returns a void pointer cookie
403 that is passed to put_link().
404
405 put_link: called by the VFS to release resources allocated by
406 follow_link(). The cookie returned by follow_link() is passed to
407 to this function as the last parameter. It is used by filesystems
408 such as NFS where page cache is not stable (i.e. page that was
409 installed when the symbolic link walk started might not be in the
410 page cache at the end of the walk).
411
412 truncate: called by the VFS to change the size of a file. The i_size
413 field of the inode is set to the desired size by the VFS before
414 this function is called. This function is called by the truncate(2)
415 system call and related functionality.
416
417 permission: called by the VFS to check for access rights on a POSIX-like
418 filesystem.
419
420 setattr: called by the VFS to set attributes for a file. This function is
421 called by chmod(2) and related system calls.
422
423 getattr: called by the VFS to get attributes of a file. This function is
424 called by stat(2) and related system calls.
425
426 setxattr: called by the VFS to set an extended attribute for a file.
427 Extended attribute is a name:value pair associated with an inode. This
428 function is called by setxattr(2) system call.
429
430 getxattr: called by the VFS to retrieve the value of an extended attribute
431 name. This function is called by getxattr(2) function call.
432
433 listxattr: called by the VFS to list all extended attributes for a given
434 file. This function is called by listxattr(2) system call.
435
436 removexattr: called by the VFS to remove an extended attribute from a file.
437 This function is called by removexattr(2) system call.
438
439
440struct address_space_operations
441===============================
442
443This describes how the VFS can manipulate mapping of a file to page cache in
444your filesystem. As of kernel 2.6.13, the following members are defined:
445
446struct address_space_operations {
447 int (*writepage)(struct page *page, struct writeback_control *wbc);
448 int (*readpage)(struct file *, struct page *);
449 int (*sync_page)(struct page *);
450 int (*writepages)(struct address_space *, struct writeback_control *);
451 int (*set_page_dirty)(struct page *page);
452 int (*readpages)(struct file *filp, struct address_space *mapping,
453 struct list_head *pages, unsigned nr_pages);
454 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
455 int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
456 sector_t (*bmap)(struct address_space *, sector_t);
457 int (*invalidatepage) (struct page *, unsigned long);
458 int (*releasepage) (struct page *, int);
459 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
460 loff_t offset, unsigned long nr_segs);
461 struct page* (*get_xip_page)(struct address_space *, sector_t,
462 int);
463};
464
465 writepage: called by the VM write a dirty page to backing store.
466
467 readpage: called by the VM to read a page from backing store.
468
469 sync_page: called by the VM to notify the backing store to perform all
470 queued I/O operations for a page. I/O operations for other pages
471 associated with this address_space object may also be performed.
472
473 writepages: called by the VM to write out pages associated with the
474 address_space object.
475
476 set_page_dirty: called by the VM to set a page dirty.
477
478 readpages: called by the VM to read pages associated with the address_space
479 object.
333 480
481 prepare_write: called by the generic write path in VM to set up a write
482 request for a page.
334 483
335struct file_operations <section> 484 commit_write: called by the generic write path in VM to write page to
485 its backing store.
486
487 bmap: called by the VFS to map a logical block offset within object to
488 physical block number. This method is use by for the legacy FIBMAP
489 ioctl. Other uses are discouraged.
490
491 invalidatepage: called by the VM on truncate to disassociate a page from its
492 address_space mapping.
493
494 releasepage: called by the VFS to release filesystem specific metadata from
495 a page.
496
497 direct_IO: called by the VM for direct I/O writes and reads.
498
499 get_xip_page: called by the VM to translate a block number to a page.
500 The page is valid until the corresponding filesystem is unmounted.
501 Filesystems that want to use execute-in-place (XIP) need to implement
502 it. An example implementation can be found in fs/ext2/xip.c.
503
504
505struct file_operations
336====================== 506======================
337 507
338This describes how the VFS can manipulate an open file. As of kernel 508This describes how the VFS can manipulate an open file. As of kernel
3392.1.99, the following members are defined: 5092.6.13, the following members are defined:
340 510
341struct file_operations { 511struct file_operations {
342 loff_t (*llseek) (struct file *, loff_t, int); 512 loff_t (*llseek) (struct file *, loff_t, int);
343 ssize_t (*read) (struct file *, char *, size_t, loff_t *); 513 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
344 ssize_t (*write) (struct file *, const char *, size_t, loff_t *); 514 ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
515 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
516 ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
345 int (*readdir) (struct file *, void *, filldir_t); 517 int (*readdir) (struct file *, void *, filldir_t);
346 unsigned int (*poll) (struct file *, struct poll_table_struct *); 518 unsigned int (*poll) (struct file *, struct poll_table_struct *);
347 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); 519 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
520 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
521 long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
348 int (*mmap) (struct file *, struct vm_area_struct *); 522 int (*mmap) (struct file *, struct vm_area_struct *);
349 int (*open) (struct inode *, struct file *); 523 int (*open) (struct inode *, struct file *);
524 int (*flush) (struct file *);
350 int (*release) (struct inode *, struct file *); 525 int (*release) (struct inode *, struct file *);
351 int (*fsync) (struct file *, struct dentry *); 526 int (*fsync) (struct file *, struct dentry *, int datasync);
352 int (*fasync) (struct file *, int); 527 int (*aio_fsync) (struct kiocb *, int datasync);
353 int (*check_media_change) (kdev_t dev); 528 int (*fasync) (int, struct file *, int);
354 int (*revalidate) (kdev_t dev);
355 int (*lock) (struct file *, int, struct file_lock *); 529 int (*lock) (struct file *, int, struct file_lock *);
530 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
531 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
532 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
533 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
534 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
535 int (*check_flags)(int);
536 int (*dir_notify)(struct file *filp, unsigned long arg);
537 int (*flock) (struct file *, int, struct file_lock *);
356}; 538};
357 539
358Again, all methods are called without any locks being held, unless 540Again, all methods are called without any locks being held, unless
@@ -362,8 +544,12 @@ otherwise noted.
362 544
363 read: called by read(2) and related system calls 545 read: called by read(2) and related system calls
364 546
547 aio_read: called by io_submit(2) and other asynchronous I/O operations
548
365 write: called by write(2) and related system calls 549 write: called by write(2) and related system calls
366 550
551 aio_write: called by io_submit(2) and other asynchronous I/O operations
552
367 readdir: called when the VFS needs to read the directory contents 553 readdir: called when the VFS needs to read the directory contents
368 554
369 poll: called by the VFS when a process wants to check if there is 555 poll: called by the VFS when a process wants to check if there is
@@ -372,18 +558,25 @@ otherwise noted.
372 558
373 ioctl: called by the ioctl(2) system call 559 ioctl: called by the ioctl(2) system call
374 560
561 unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
562 require the BKL should use this method instead of the ioctl() above.
563
564 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
565 are used on 64 bit kernels.
566
375 mmap: called by the mmap(2) system call 567 mmap: called by the mmap(2) system call
376 568
377 open: called by the VFS when an inode should be opened. When the VFS 569 open: called by the VFS when an inode should be opened. When the VFS
378 opens a file, it creates a new "struct file" and initialises 570 opens a file, it creates a new "struct file". It then calls the
379 the "f_op" file operations member with the "default_file_ops" 571 open method for the newly allocated file structure. You might
380 field in the inode structure. It then calls the open method 572 think that the open method really belongs in
381 for the newly allocated file structure. You might think that 573 "struct inode_operations", and you may be right. I think it's
382 the open method really belongs in "struct inode_operations", 574 done the way it is because it makes filesystems simpler to
383 and you may be right. I think it's done the way it is because 575 implement. The open() method is a good place to initialize the
384 it makes filesystems simpler to implement. The open() method 576 "private_data" member in the file structure if you want to point
385 is a good place to initialise the "private_data" member in the 577 to a device structure
386 file structure if you want to point to a device structure 578
579 flush: called by the close(2) system call to flush a file
387 580
388 release: called when the last reference to an open file is closed 581 release: called when the last reference to an open file is closed
389 582
@@ -392,6 +585,23 @@ otherwise noted.
392 fasync: called by the fcntl(2) system call when asynchronous 585 fasync: called by the fcntl(2) system call when asynchronous
393 (non-blocking) mode is enabled for a file 586 (non-blocking) mode is enabled for a file
394 587
588 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
589 commands
590
591 readv: called by the readv(2) system call
592
593 writev: called by the writev(2) system call
594
595 sendfile: called by the sendfile(2) system call
596
597 get_unmapped_area: called by the mmap(2) system call
598
599 check_flags: called by the fcntl(2) system call for F_SETFL command
600
601 dir_notify: called by the fcntl(2) system call for F_NOTIFY command
602
603 flock: called by the flock(2) system call
604
395Note that the file operations are implemented by the specific 605Note that the file operations are implemented by the specific
396filesystem in which the inode resides. When opening a device node 606filesystem in which the inode resides. When opening a device node
397(character or block special) most filesystems will call special 607(character or block special) most filesystems will call special
@@ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file
400operations with those for the device driver, and then proceed to call 610operations with those for the device driver, and then proceed to call
401the new open() method for the file. This is how opening a device file 611the new open() method for the file. This is how opening a device file
402in the filesystem eventually ends up calling the device driver open() 612in the filesystem eventually ends up calling the device driver open()
403method. Note the devfs (the Device FileSystem) has a more direct path 613method.
404from device node to device driver (this is an unofficial kernel
405patch).
406 614
407 615
408Directory Entry Cache (dcache) <section> 616Directory Entry Cache (dcache)
409------------------------------ 617==============================
618
410 619
411struct dentry_operations 620struct dentry_operations
412======================== 621------------------------
413 622
414This describes how a filesystem can overload the standard dentry 623This describes how a filesystem can overload the standard dentry
415operations. Dentries and the dcache are the domain of the VFS and the 624operations. Dentries and the dcache are the domain of the VFS and the
416individual filesystem implementations. Device drivers have no business 625individual filesystem implementations. Device drivers have no business
417here. These methods may be set to NULL, as they are either optional or 626here. These methods may be set to NULL, as they are either optional or
418the VFS uses a default. As of kernel 2.1.99, the following members are 627the VFS uses a default. As of kernel 2.6.13, the following members are
419defined: 628defined:
420 629
421struct dentry_operations { 630struct dentry_operations {
422 int (*d_revalidate)(struct dentry *); 631 int (*d_revalidate)(struct dentry *, struct nameidata *);
423 int (*d_hash) (struct dentry *, struct qstr *); 632 int (*d_hash) (struct dentry *, struct qstr *);
424 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 633 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
425 void (*d_delete)(struct dentry *); 634 int (*d_delete)(struct dentry *);
426 void (*d_release)(struct dentry *); 635 void (*d_release)(struct dentry *);
427 void (*d_iput)(struct dentry *, struct inode *); 636 void (*d_iput)(struct dentry *, struct inode *);
428}; 637};
@@ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list
451of child dentries. Child dentries are basically like files in a 660of child dentries. Child dentries are basically like files in a
452directory. 661directory.
453 662
663
454Directory Entry Cache APIs 664Directory Entry Cache APIs
455-------------------------- 665--------------------------
456 666
@@ -471,7 +681,7 @@ manipulate dentries:
471 "d_delete" method is called 681 "d_delete" method is called
472 682
473 d_drop: this unhashes a dentry from its parents hash list. A 683 d_drop: this unhashes a dentry from its parents hash list. A
474 subsequent call to dput() will dellocate the dentry if its 684 subsequent call to dput() will deallocate the dentry if its
475 usage count drops to 0 685 usage count drops to 0
476 686
477 d_delete: delete a dentry. If there are no other open references to 687 d_delete: delete a dentry. If there are no other open references to
@@ -507,16 +717,16 @@ up by walking the tree starting with the first component
507of the pathname and using that dentry along with the next 717of the pathname and using that dentry along with the next
508component to look up the next level and so on. Since it 718component to look up the next level and so on. Since it
509is a frequent operation for workloads like multiuser 719is a frequent operation for workloads like multiuser
510environments and webservers, it is important to optimize 720environments and web servers, it is important to optimize
511this path. 721this path.
512 722
513Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus 723Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
514in every component during path look-up. Since 2.5.10 onwards, 724in every component during path look-up. Since 2.5.10 onwards,
515fastwalk algorithm changed this by holding the dcache_lock 725fast-walk algorithm changed this by holding the dcache_lock
516at the beginning and walking as many cached path component 726at the beginning and walking as many cached path component
517dentries as possible. This signficantly decreases the number 727dentries as possible. This significantly decreases the number
518of acquisition of dcache_lock. However it also increases the 728of acquisition of dcache_lock. However it also increases the
519lock hold time signficantly and affects performance in large 729lock hold time significantly and affects performance in large
520SMP machines. Since 2.5.62 kernel, dcache has been using 730SMP machines. Since 2.5.62 kernel, dcache has been using
521a new locking model that uses RCU to make dcache look-up 731a new locking model that uses RCU to make dcache look-up
522lock-free. 732lock-free.
@@ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well
527as d_inode and several other things like mount look-up. RCU-based 737as d_inode and several other things like mount look-up. RCU-based
528changes affect only the way the hash chain is protected. For everything 738changes affect only the way the hash chain is protected. For everything
529else the dcache_lock must be taken for both traversing as well as 739else the dcache_lock must be taken for both traversing as well as
530updating. The hash chain updations too take the dcache_lock. 740updating. The hash chain updates too take the dcache_lock.
531The significant change is the way d_lookup traverses the hash chain, 741The significant change is the way d_lookup traverses the hash chain,
532it doesn't acquire the dcache_lock for this and rely on RCU to 742it doesn't acquire the dcache_lock for this and rely on RCU to
533ensure that the dentry has not been *freed*. 743ensure that the dentry has not been *freed*.
@@ -535,14 +745,15 @@ ensure that the dentry has not been *freed*.
535 745
536Dcache locking details 746Dcache locking details
537---------------------- 747----------------------
748
538For many multi-user workloads, open() and stat() on files are 749For many multi-user workloads, open() and stat() on files are
539very frequently occurring operations. Both involve walking 750very frequently occurring operations. Both involve walking
540of path names to find the dentry corresponding to the 751of path names to find the dentry corresponding to the
541concerned file. In 2.4 kernel, dcache_lock was held 752concerned file. In 2.4 kernel, dcache_lock was held
542during look-up of each path component. Contention and 753during look-up of each path component. Contention and
543cacheline bouncing of this global lock caused significant 754cache-line bouncing of this global lock caused significant
544scalability problems. With the introduction of RCU 755scalability problems. With the introduction of RCU
545in linux kernel, this was worked around by making 756in Linux kernel, this was worked around by making
546the look-up of path components during path walking lock-free. 757the look-up of path components during path walking lock-free.
547 758
548 759
@@ -562,7 +773,7 @@ Some of the important changes are :
5622. Insertion of a dentry into the hash table is done using 7732. Insertion of a dentry into the hash table is done using
563 hlist_add_head_rcu() which take care of ordering the writes - 774 hlist_add_head_rcu() which take care of ordering the writes -
564 the writes to the dentry must be visible before the dentry 775 the writes to the dentry must be visible before the dentry
565 is inserted. This works in conjuction with hlist_for_each_rcu() 776 is inserted. This works in conjunction with hlist_for_each_rcu()
566 while walking the hash chain. The only requirement is that 777 while walking the hash chain. The only requirement is that
567 all initialization to the dentry must be done before hlist_add_head_rcu() 778 all initialization to the dentry must be done before hlist_add_head_rcu()
568 since we don't have dcache_lock protection while traversing 779 since we don't have dcache_lock protection while traversing
@@ -584,7 +795,7 @@ Some of the important changes are :
584 the same. In some sense, dcache_rcu path walking looks like 795 the same. In some sense, dcache_rcu path walking looks like
585 the pre-2.5.10 version. 796 the pre-2.5.10 version.
586 797
5875. All dentry hash chain updations must take the dcache_lock as well as 7985. All dentry hash chain updates must take the dcache_lock as well as
588 the per-dentry lock in that order. dput() does this to ensure 799 the per-dentry lock in that order. dput() does this to ensure
589 that a dentry that has just been looked up in another CPU 800 that a dentry that has just been looked up in another CPU
590 doesn't get deleted before dget() can be done on it. 801 doesn't get deleted before dget() can be done on it.
@@ -640,10 +851,10 @@ handled as described below :
640 Since we redo the d_parent check and compare name while holding 851 Since we redo the d_parent check and compare name while holding
641 d_lock, lock-free look-up will not race against d_move(). 852 d_lock, lock-free look-up will not race against d_move().
642 853
6434. There can be a theoritical race when a dentry keeps coming back 8544. There can be a theoretical race when a dentry keeps coming back
644 to original bucket due to double moves. Due to this look-up may 855 to original bucket due to double moves. Due to this look-up may
645 consider that it has never moved and can end up in a infinite loop. 856 consider that it has never moved and can end up in a infinite loop.
646 But this is not any worse that theoritical livelocks we already 857 But this is not any worse that theoretical livelocks we already
647 have in the kernel. 858 have in the kernel.
648 859
649 860