path: root/Documentation/filesystems
author Glenn Elliott <gelliott@cs.unc.edu> 2012-03-04 19:47:13 -0500
committer Glenn Elliott <gelliott@cs.unc.edu> 2012-03-04 19:47:13 -0500
commit c71c03bda1e86c9d5198c5d83f712e695c4f2a1e
tree ecb166cb3e2b7e2adb3b5e292245fefd23381ac8 /Documentation/filesystems
parent ea53c912f8a86a8567697115b6a0d8152beee5c8
parent 6a00f206debf8a5c8899055726ad127dbeeed098
Merge branch 'mpi-master' into wip-k-fmlp
Conflicts: litmus/sched_cedf.c
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/00-INDEX | 2
-rw-r--r-- Documentation/filesystems/9p.txt | 33
-rw-r--r-- Documentation/filesystems/Locking | 270
-rw-r--r-- Documentation/filesystems/adfs.txt | 18
-rw-r--r-- Documentation/filesystems/autofs4-mount-control.txt | 2
-rw-r--r-- Documentation/filesystems/caching/netfs-api.txt | 34
-rw-r--r-- Documentation/filesystems/configfs/configfs.txt | 2
-rw-r--r-- Documentation/filesystems/configfs/configfs_example_explicit.c | 8
-rw-r--r-- Documentation/filesystems/configfs/configfs_example_macros.c | 6
-rw-r--r-- Documentation/filesystems/dentry-locking.txt | 174
-rw-r--r-- Documentation/filesystems/exofs.txt | 10
-rw-r--r-- Documentation/filesystems/ext4.txt | 229
-rw-r--r-- Documentation/filesystems/gfs2-uevents.txt | 2
-rw-r--r-- Documentation/filesystems/gfs2.txt | 2
-rw-r--r-- Documentation/filesystems/nfs/00-INDEX | 4
-rw-r--r-- Documentation/filesystems/nfs/idmapper.txt | 67
-rw-r--r-- Documentation/filesystems/nfs/nfsroot.txt | 22
-rw-r--r-- Documentation/filesystems/nfs/pnfs.txt | 55
-rw-r--r-- Documentation/filesystems/nilfs2.txt | 1
-rw-r--r-- Documentation/filesystems/ntfs.txt | 7
-rw-r--r-- Documentation/filesystems/ocfs2.txt | 17
-rw-r--r-- Documentation/filesystems/path-lookup.txt | 382
-rw-r--r-- Documentation/filesystems/pohmelfs/network_protocol.txt | 2
-rw-r--r-- Documentation/filesystems/porting | 101
-rw-r--r-- Documentation/filesystems/proc.txt | 73
-rw-r--r-- Documentation/filesystems/romfs.txt | 3
-rw-r--r-- Documentation/filesystems/sharedsubtree.txt | 4
-rw-r--r-- Documentation/filesystems/smbfs.txt | 8
-rw-r--r-- Documentation/filesystems/squashfs.txt | 30
-rw-r--r-- Documentation/filesystems/sysfs.txt | 18
-rw-r--r-- Documentation/filesystems/ubifs.txt | 30
-rw-r--r-- Documentation/filesystems/vfs.txt | 189
-rw-r--r-- Documentation/filesystems/xfs-delayed-logging-design.txt | 26
-rw-r--r-- Documentation/filesystems/xfs.txt | 6
34 files changed, 1331 insertions, 506 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 4303614b5add..8c624a18f67d 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -96,8 +96,6 @@ seq_file.txt
96 - how to use the seq_file API 96 - how to use the seq_file API
97sharedsubtree.txt 97sharedsubtree.txt
98 - a description of shared subtrees for namespaces. 98 - a description of shared subtrees for namespaces.
99smbfs.txt
100 - info on using filesystems with the SMB protocol (Win 3.11 and NT).
101spufs.txt 99spufs.txt
102 - info and mount options for the SPU filesystem used on Cell. 100 - info and mount options for the SPU filesystem used on Cell.
103sysfs-pci.txt 101sysfs-pci.txt
diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt
index f9765e8cf086..13de64c7f0ab 100644
--- a/Documentation/filesystems/9p.txt
+++ b/Documentation/filesystems/9p.txt
@@ -25,6 +25,8 @@ Other applications are described in the following papers:
25 http://xcpu.org/papers/cellfs-talk.pdf 25 http://xcpu.org/papers/cellfs-talk.pdf
26 * PROSE I/O: Using 9p to enable Application Partitions 26 * PROSE I/O: Using 9p to enable Application Partitions
27 http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf 27 http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
28 * VirtFS: A Virtualization Aware File System pass-through
29 http://goo.gl/3WPDg
28 30
29USAGE 31USAGE
30===== 32=====
@@ -111,7 +113,7 @@ OPTIONS
111 This can be used to share devices/named pipes/sockets between 113 This can be used to share devices/named pipes/sockets between
112 hosts. This functionality will be expanded in later versions. 114 hosts. This functionality will be expanded in later versions.
113 115
114 access there are three access modes. 116 access there are four access modes.
115 user = if a user tries to access a file on v9fs 117 user = if a user tries to access a file on v9fs
116 filesystem for the first time, v9fs sends an 118 filesystem for the first time, v9fs sends an
117 attach command (Tattach) for that user. 119 attach command (Tattach) for that user.
@@ -120,6 +122,8 @@ OPTIONS
120 the files on the mounted filesystem 122 the files on the mounted filesystem
121 any = v9fs does single attach and performs all 123 any = v9fs does single attach and performs all
122 operations as one user 124 operations as one user
125 client = ACL based access check on the 9p client
126 side for access validation
123 127
124 cachetag cache tag to use the specified persistent cache. 128 cachetag cache tag to use the specified persistent cache.
125 cache tags for existing cache sessions can be listed at 129 cache tags for existing cache sessions can be listed at
@@ -128,31 +132,20 @@ OPTIONS
128RESOURCES 132RESOURCES
129========= 133=========
130 134
131Our current recommendation is to use Inferno (http://www.vitanuova.com/nferno/index.html) 135Protocol specifications are maintained on github:
132as the 9p server. You can start a 9p server under Inferno by issuing the 136http://ericvh.github.com/9p-rfc/
133following command:
134 ; styxlisten -A tcp!*!564 export '#U*'
135 137
136The -A specifies an unauthenticated export. The 564 is the port # (you may 1389p client and server implementations are listed on
137have to choose a higher port number if running as a normal user). The '#U*' 139http://9p.cat-v.org/implementations
138specifies exporting the root of the Linux name space. You may specify a
139subset of the namespace by extending the path: '#U*'/tmp would just export
140/tmp. For more information, see the Inferno manual pages covering styxlisten
141and export.
142 140
143A Linux version of the 9p server is now maintained under the npfs project 141A 9p2000.L server is being developed by LLNL and can be found
144on sourceforge (http://sourceforge.net/projects/npfs). The currently 142at http://code.google.com/p/diod/
145maintained version is the single-threaded version of the server (named spfs)
146available from the same SVN repository.
147 143
148There are user and developer mailing lists available through the v9fs project 144There are user and developer mailing lists available through the v9fs project
149on sourceforge (http://sourceforge.net/projects/v9fs). 145on sourceforge (http://sourceforge.net/projects/v9fs).
150 146
151A stand-alone version of the module (which should build for any 2.6 kernel) 147News and other information is maintained on a Wiki.
152is available via (http://github.com/ericvh/9p-sac/tree/master) 148(http://sf.net/apps/mediawiki/v9fs/index.php).
153
154News and other information is maintained on SWiK (http://swik.net/v9fs)
155and the Wiki (http://sf.net/apps/mediawiki/v9fs/index.php).
156 149
157Bug reports may be issued through the kernel.org bugzilla 150Bug reports may be issued through the kernel.org bugzilla
158(http://bugzilla.kernel.org) 151(http://bugzilla.kernel.org)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2db4283efa8d..57d827d6071d 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -9,24 +9,30 @@ be able to use diff(1).
9 9
10--------------------------- dentry_operations -------------------------- 10--------------------------- dentry_operations --------------------------
11prototypes: 11prototypes:
12 int (*d_revalidate)(struct dentry *, int); 12 int (*d_revalidate)(struct dentry *, struct nameidata *);
13 int (*d_hash) (struct dentry *, struct qstr *); 13 int (*d_hash)(const struct dentry *, const struct inode *,
14 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 14 struct qstr *);
15 int (*d_compare)(const struct dentry *, const struct inode *,
16 const struct dentry *, const struct inode *,
17 unsigned int, const char *, const struct qstr *);
15 int (*d_delete)(struct dentry *); 18 int (*d_delete)(struct dentry *);
16 void (*d_release)(struct dentry *); 19 void (*d_release)(struct dentry *);
17 void (*d_iput)(struct dentry *, struct inode *); 20 void (*d_iput)(struct dentry *, struct inode *);
18 char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen); 21 char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
22 struct vfsmount *(*d_automount)(struct path *path);
23 int (*d_manage)(struct dentry *, bool);
19 24
20locking rules: 25locking rules:
21 none have BKL 26 rename_lock ->d_lock may block rcu-walk
22 dcache_lock rename_lock ->d_lock may block 27d_revalidate: no no yes (ref-walk) maybe
23d_revalidate: no no no yes 28d_hash no no no maybe
24d_hash no no no yes 29d_compare: yes no no maybe
25d_compare: no yes no no 30d_delete: no yes no no
26d_delete: yes no yes no 31d_release: no no yes no
27d_release: no no no yes 32d_iput: no no yes no
28d_iput: no no no yes
29d_dname: no no no no 33d_dname: no no no no
34d_automount: no no yes no
35d_manage: no no yes (ref-walk) maybe
30 36
31--------------------------- inode_operations --------------------------- 37--------------------------- inode_operations ---------------------------
32prototypes: 38prototypes:
@@ -42,18 +48,22 @@ ata *);
42 int (*rename) (struct inode *, struct dentry *, 48 int (*rename) (struct inode *, struct dentry *,
43 struct inode *, struct dentry *); 49 struct inode *, struct dentry *);
44 int (*readlink) (struct dentry *, char __user *,int); 50 int (*readlink) (struct dentry *, char __user *,int);
45 int (*follow_link) (struct dentry *, struct nameidata *); 51 void * (*follow_link) (struct dentry *, struct nameidata *);
52 void (*put_link) (struct dentry *, struct nameidata *, void *);
46 void (*truncate) (struct inode *); 53 void (*truncate) (struct inode *);
47 int (*permission) (struct inode *, int, struct nameidata *); 54 int (*permission) (struct inode *, int, unsigned int);
55 int (*check_acl)(struct inode *, int, unsigned int);
48 int (*setattr) (struct dentry *, struct iattr *); 56 int (*setattr) (struct dentry *, struct iattr *);
49 int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); 57 int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
50 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); 58 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
51 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); 59 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
52 ssize_t (*listxattr) (struct dentry *, char *, size_t); 60 ssize_t (*listxattr) (struct dentry *, char *, size_t);
53 int (*removexattr) (struct dentry *, const char *); 61 int (*removexattr) (struct dentry *, const char *);
62 void (*truncate_range)(struct inode *, loff_t, loff_t);
63 int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
54 64
55locking rules: 65locking rules:
56 all may block, none have BKL 66 all may block
57 i_mutex(inode) 67 i_mutex(inode)
58lookup: yes 68lookup: yes
59create: yes 69create: yes
@@ -66,19 +76,23 @@ rmdir: yes (both) (see below)
66rename: yes (all) (see below) 76rename: yes (all) (see below)
67readlink: no 77readlink: no
68follow_link: no 78follow_link: no
79put_link: no
69truncate: yes (see below) 80truncate: yes (see below)
70setattr: yes 81setattr: yes
71permission: no 82permission: no (may not block if called in rcu-walk mode)
83check_acl: no
72getattr: no 84getattr: no
73setxattr: yes 85setxattr: yes
74getxattr: no 86getxattr: no
75listxattr: no 87listxattr: no
76removexattr: yes 88removexattr: yes
89truncate_range: yes
90fiemap: no
77 Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on 91 Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
78victim. 92victim.
79 cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. 93 cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
80 ->truncate() is never called directly - it's a callback, not a 94 ->truncate() is never called directly - it's a callback, not a
81method. It's called by vmtruncate() - library function normally used by 95method. It's called by vmtruncate() - deprecated library function used by
82->setattr(). Locking information above applies to that call (i.e. is 96->setattr(). Locking information above applies to that call (i.e. is
83inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been 97inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
84passed). 98passed).
@@ -90,8 +104,8 @@ of the locking scheme for directory operations.
90prototypes: 104prototypes:
91 struct inode *(*alloc_inode)(struct super_block *sb); 105 struct inode *(*alloc_inode)(struct super_block *sb);
92 void (*destroy_inode)(struct inode *); 106 void (*destroy_inode)(struct inode *);
93 void (*dirty_inode) (struct inode *); 107 void (*dirty_inode) (struct inode *, int flags);
94 int (*write_inode) (struct inode *, int); 108 int (*write_inode) (struct inode *, struct writeback_control *wbc);
95 int (*drop_inode) (struct inode *); 109 int (*drop_inode) (struct inode *);
96 void (*evict_inode) (struct inode *); 110 void (*evict_inode) (struct inode *);
97 void (*put_super) (struct super_block *); 111 void (*put_super) (struct super_block *);
@@ -105,16 +119,16 @@ prototypes:
105 int (*show_options)(struct seq_file *, struct vfsmount *); 119 int (*show_options)(struct seq_file *, struct vfsmount *);
106 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); 120 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
107 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); 121 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
122 int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
108 123
109locking rules: 124locking rules:
110 All may block [not true, see below] 125 All may block [not true, see below]
111 None have BKL
112 s_umount 126 s_umount
113alloc_inode: 127alloc_inode:
114destroy_inode: 128destroy_inode:
115dirty_inode: (must not sleep) 129dirty_inode:
116write_inode: 130write_inode:
117drop_inode: !!!inode_lock!!! 131drop_inode: !!!inode->i_lock!!!
118evict_inode: 132evict_inode:
119put_super: write 133put_super: write
120write_super: read 134write_super: read
@@ -127,6 +141,7 @@ umount_begin: no
127show_options: no (namespace_sem) 141show_options: no (namespace_sem)
128quota_read: no (see below) 142quota_read: no (see below)
129quota_write: no (see below) 143quota_write: no (see below)
144bdev_try_to_free_page: no (see below)
130 145
131->statfs() has s_umount (shared) when called by ustat(2) (native or 146->statfs() has s_umount (shared) when called by ustat(2) (native or
132compat), but that's an accident of bad API; s_umount is used to pin 147compat), but that's an accident of bad API; s_umount is used to pin
@@ -139,19 +154,23 @@ be the only ones operating on the quota file by the quota code (via
139dqio_sem) (unless an admin really wants to screw up something and 154dqio_sem) (unless an admin really wants to screw up something and
140writes to quota files with quotas on). For other details about locking 155writes to quota files with quotas on). For other details about locking
141see also dquot_operations section. 156see also dquot_operations section.
157->bdev_try_to_free_page is called from the ->releasepage handler of
158the block device inode. See there for more details.
142 159
143--------------------------- file_system_type --------------------------- 160--------------------------- file_system_type ---------------------------
144prototypes: 161prototypes:
145 int (*get_sb) (struct file_system_type *, int, 162 int (*get_sb) (struct file_system_type *, int,
146 const char *, void *, struct vfsmount *); 163 const char *, void *, struct vfsmount *);
164 struct dentry *(*mount) (struct file_system_type *, int,
165 const char *, void *);
147 void (*kill_sb) (struct super_block *); 166 void (*kill_sb) (struct super_block *);
148locking rules: 167locking rules:
149 may block BKL 168 may block
150get_sb yes no 169mount yes
151kill_sb yes no 170kill_sb yes
152 171
153->get_sb() returns error or 0 with locked superblock attached to the vfsmount 172->mount() returns ERR_PTR or the root dentry; its superblock should be locked
154(exclusive on ->s_umount). 173on return.
155->kill_sb() takes a write-locked superblock, does all shutdown work on it, 174->kill_sb() takes a write-locked superblock, does all shutdown work on it,
156unlocks and drops the reference. 175unlocks and drops the reference.
157 176
@@ -173,28 +192,38 @@ prototypes:
173 sector_t (*bmap)(struct address_space *, sector_t); 192 sector_t (*bmap)(struct address_space *, sector_t);
174 int (*invalidatepage) (struct page *, unsigned long); 193 int (*invalidatepage) (struct page *, unsigned long);
175 int (*releasepage) (struct page *, int); 194 int (*releasepage) (struct page *, int);
195 void (*freepage)(struct page *);
176 int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, 196 int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
177 loff_t offset, unsigned long nr_segs); 197 loff_t offset, unsigned long nr_segs);
178 int (*launder_page) (struct page *); 198 int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
199 unsigned long *);
200 int (*migratepage)(struct address_space *, struct page *, struct page *);
201 int (*launder_page)(struct page *);
202 int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
203 int (*error_remove_page)(struct address_space *, struct page *);
179 204
180locking rules: 205locking rules:
181 All except set_page_dirty may block 206 All except set_page_dirty and freepage may block
182 207
183 BKL PageLocked(page) i_mutex 208 PageLocked(page) i_mutex
184writepage: no yes, unlocks (see below) 209writepage: yes, unlocks (see below)
185readpage: no yes, unlocks 210readpage: yes, unlocks
186sync_page: no maybe 211sync_page: maybe
187writepages: no 212writepages:
188set_page_dirty no no 213set_page_dirty no
189readpages: no 214readpages:
190write_begin: no locks the page yes 215write_begin: locks the page yes
191write_end: no yes, unlocks yes 216write_end: yes, unlocks yes
192perform_write: no n/a yes 217bmap:
193bmap: no 218invalidatepage: yes
194invalidatepage: no yes 219releasepage: yes
195releasepage: no yes 220freepage: yes
196direct_IO: no 221direct_IO:
197launder_page: no yes 222get_xip_mem: maybe
223migratepage: yes (both)
224launder_page: yes
225is_partially_uptodate: yes
226error_remove_page: yes
198 227
199 ->write_begin(), ->write_end(), ->sync_page() and ->readpage() 228 ->write_begin(), ->write_end(), ->sync_page() and ->readpage()
200may be called from the request handler (/dev/loop). 229may be called from the request handler (/dev/loop).
@@ -274,9 +303,8 @@ under spinlock (it cannot block) and is sometimes called with the page
274not locked. 303not locked.
275 304
276 ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some 305 ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
277filesystems and by the swapper. The latter will eventually go away. All 306filesystems and by the swapper. The latter will eventually go away. Please,
278instances do not actually need the BKL. Please, keep it that way and don't 307keep it that way and don't breed new callers.
279breed new callers.
280 308
281 ->invalidatepage() is called when the filesystem must attempt to drop 309 ->invalidatepage() is called when the filesystem must attempt to drop
282some or all of the buffers from the page when it is being truncated. It 310some or all of the buffers from the page when it is being truncated. It
@@ -288,55 +316,44 @@ buffers from the page in preparation for freeing it. It returns zero to
288indicate that the buffers are (or may be) freeable. If ->releasepage is zero, 316indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
289the kernel assumes that the fs has no private interest in the buffers. 317the kernel assumes that the fs has no private interest in the buffers.
290 318
319 ->freepage() is called when the kernel is done dropping the page
320from the page cache.
321
291 ->launder_page() may be called prior to releasing a page if 322 ->launder_page() may be called prior to releasing a page if
292it is still found to be dirty. It returns zero if the page was successfully 323it is still found to be dirty. It returns zero if the page was successfully
293cleaned, or an error value if not. Note that in order to prevent the page 324cleaned, or an error value if not. Note that in order to prevent the page
294getting mapped back in and redirtied, it needs to be kept locked 325getting mapped back in and redirtied, it needs to be kept locked
295across the entire operation. 326across the entire operation.
296 327
297 Note: currently almost all instances of address_space methods are
298using BKL for internal serialization and that's one of the worst sources
299of contention. Normally they are calling library functions (in fs/buffer.c)
300and pass foo_get_block() as a callback (on local block-based filesystems,
301indeed). BKL is not needed for library stuff and is usually taken by
302foo_get_block(). It's an overkill, since block bitmaps can be protected by
303internal fs locking and real critical areas are much smaller than the areas
304filesystems protect now.
305
306----------------------- file_lock_operations ------------------------------ 328----------------------- file_lock_operations ------------------------------
307prototypes: 329prototypes:
308 void (*fl_insert)(struct file_lock *); /* lock insertion callback */
309 void (*fl_remove)(struct file_lock *); /* lock removal callback */
310 void (*fl_copy_lock)(struct file_lock *, struct file_lock *); 330 void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
311 void (*fl_release_private)(struct file_lock *); 331 void (*fl_release_private)(struct file_lock *);
312 332
313 333
314locking rules: 334locking rules:
315 BKL may block 335 file_lock_lock may block
316fl_insert: yes no 336fl_copy_lock: yes no
317fl_remove: yes no 337fl_release_private: maybe no
318fl_copy_lock: yes no
319fl_release_private: yes yes
320 338
321----------------------- lock_manager_operations --------------------------- 339----------------------- lock_manager_operations ---------------------------
322prototypes: 340prototypes:
323 int (*fl_compare_owner)(struct file_lock *, struct file_lock *); 341 int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
324 void (*fl_notify)(struct file_lock *); /* unblock callback */ 342 void (*fl_notify)(struct file_lock *); /* unblock callback */
325 void (*fl_copy_lock)(struct file_lock *, struct file_lock *); 343 int (*fl_grant)(struct file_lock *, struct file_lock *, int);
326 void (*fl_release_private)(struct file_lock *); 344 void (*fl_release_private)(struct file_lock *);
327 void (*fl_break)(struct file_lock *); /* break_lease callback */ 345 void (*fl_break)(struct file_lock *); /* break_lease callback */
346 int (*fl_change)(struct file_lock **, int);
328 347
329locking rules: 348locking rules:
330 BKL may block 349 file_lock_lock may block
331fl_compare_owner: yes no 350fl_compare_owner: yes no
332fl_notify: yes no 351fl_notify: yes no
333fl_copy_lock: yes no 352fl_grant: no no
334fl_release_private: yes yes 353fl_release_private: maybe no
335fl_break: yes no 354fl_break: yes no
336 355fl_change yes no
337 Currently only NFSD and NLM provide instances of this class. None of the 356
338them block. If you have out-of-tree instances - please, show up. Locking
339in that area will change.
340--------------------------- buffer_head ----------------------------------- 357--------------------------- buffer_head -----------------------------------
341prototypes: 358prototypes:
342 void (*b_end_io)(struct buffer_head *bh, int uptodate); 359 void (*b_end_io)(struct buffer_head *bh, int uptodate);
@@ -349,21 +366,36 @@ call this method upon the IO completion.
349 366
350--------------------------- block_device_operations ----------------------- 367--------------------------- block_device_operations -----------------------
351prototypes: 368prototypes:
352 int (*open) (struct inode *, struct file *); 369 int (*open) (struct block_device *, fmode_t);
353 int (*release) (struct inode *, struct file *); 370 int (*release) (struct gendisk *, fmode_t);
354 int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); 371 int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
372 int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
373 int (*direct_access) (struct block_device *, sector_t, void **, unsigned long *);
355 int (*media_changed) (struct gendisk *); 374 int (*media_changed) (struct gendisk *);
375 void (*unlock_native_capacity) (struct gendisk *);
356 int (*revalidate_disk) (struct gendisk *); 376 int (*revalidate_disk) (struct gendisk *);
377 int (*getgeo)(struct block_device *, struct hd_geometry *);
378 void (*swap_slot_free_notify) (struct block_device *, unsigned long);
357 379
358locking rules: 380locking rules:
359 BKL bd_sem 381 bd_mutex
360open: yes yes 382open: yes
361release: yes yes 383release: yes
362ioctl: yes no 384ioctl: no
363media_changed: no no 385compat_ioctl: no
364revalidate_disk: no no 386direct_access: no
387media_changed: no
388unlock_native_capacity: no
389revalidate_disk: no
390getgeo: no
391swap_slot_free_notify: no (see below)
392
393media_changed, unlock_native_capacity and revalidate_disk are called only from
394check_disk_change().
395
396swap_slot_free_notify is called with swap_lock and sometimes the page lock
397held.
365 398
366The last two are called only from check_disk_change().
367 399
368--------------------------- file_operations ------------------------------- 400--------------------------- file_operations -------------------------------
369prototypes: 401prototypes:
@@ -395,34 +427,22 @@ prototypes:
395 unsigned long (*get_unmapped_area)(struct file *, unsigned long, 427 unsigned long (*get_unmapped_area)(struct file *, unsigned long,
396 unsigned long, unsigned long, unsigned long); 428 unsigned long, unsigned long, unsigned long);
397 int (*check_flags)(int); 429 int (*check_flags)(int);
430 int (*flock) (struct file *, int, struct file_lock *);
431 ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
432 size_t, unsigned int);
433 ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
434 size_t, unsigned int);
435 int (*setlease)(struct file *, long, struct file_lock **);
436 long (*fallocate)(struct file *, int, loff_t, loff_t);
398}; 437};
399 438
400locking rules: 439locking rules:
401 All may block. 440 All may block except for ->setlease.
402 BKL 441 No VFS locks held on entry except for ->fsync and ->setlease.
403llseek: no (see below) 442
404read: no 443->fsync() has i_mutex on inode.
405aio_read: no 444
406write: no 445->setlease has the file_list_lock held and must not sleep.
407aio_write: no
408readdir: no
409poll: no
410unlocked_ioctl: no
411compat_ioctl: no
412mmap: no
413open: no
414flush: no
415release: no
416fsync: no (see below)
417aio_fsync: no
418fasync: no
419lock: yes
420readv: no
421writev: no
422sendfile: no
423sendpage: no
424get_unmapped_area: no
425check_flags: no
426 446
427->llseek() locking has moved from llseek to the individual llseek 447->llseek() locking has moved from llseek to the individual llseek
428implementations. If your fs is not using generic_file_llseek, you 448implementations. If your fs is not using generic_file_llseek, you
@@ -432,17 +452,10 @@ mutex or just to use i_size_read() instead.
432Note: this does not protect the file->f_pos against concurrent modifications 452Note: this does not protect the file->f_pos against concurrent modifications
433since this is something the userspace has to take care about. 453since this is something the userspace has to take care about.
434 454
435Note: ext2_release() was *the* source of contention on fs-intensive 455->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
436loads and dropping BKL on ->release() helps to get rid of that (we still 456Most instances call fasync_helper(), which does that maintenance, so it's
437grab BKL for cases when we close a file that had been opened r/w, but that 457not normally something one needs to worry about. Return values > 0 will be
438can and should be done using the internal locking with smaller critical areas). 458mapped to zero in the VFS layer.
439Current worst offender is ext2_get_block()...
440
441->fasync() is called without BKL protection, and is responsible for
442maintaining the FASYNC bit in filp->f_flags. Most instances call
443fasync_helper(), which does that maintenance, so it's not normally
444something one needs to worry about. Return values > 0 will be mapped to
445zero in the VFS layer.
446 459
447->readdir() and ->ioctl() on directories must be changed. Ideally we would 460->readdir() and ->ioctl() on directories must be changed. Ideally we would
448move ->readdir() to inode_operations and use a separate method for directory 461move ->readdir() to inode_operations and use a separate method for directory
@@ -453,8 +466,6 @@ components. And there are other reasons why the current interface is a mess...
453->read on directories probably must go away - we should just enforce -EISDIR 466->read on directories probably must go away - we should just enforce -EISDIR
454in sys_read() and friends. 467in sys_read() and friends.
455 468
456->fsync() has i_mutex on inode.
457
458--------------------------- dquot_operations ------------------------------- 469--------------------------- dquot_operations -------------------------------
459prototypes: 470prototypes:
460 int (*write_dquot) (struct dquot *); 471 int (*write_dquot) (struct dquot *);
@@ -489,12 +500,12 @@ prototypes:
489 int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); 500 int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
490 501
491locking rules: 502locking rules:
492 BKL mmap_sem PageLocked(page) 503 mmap_sem PageLocked(page)
493open: no yes 504open: yes
494close: no yes 505close: yes
495fault: no yes can return with page locked 506fault: yes can return with page locked
496page_mkwrite: no yes can return with page locked 507page_mkwrite: yes can return with page locked
497access: no yes 508access: yes
498 509
499 ->fault() is called when a previously not present pte is about 510 ->fault() is called when a previously not present pte is about
500to be faulted in. The filesystem must find and return the page associated 511to be faulted in. The filesystem must find and return the page associated
@@ -521,6 +532,3 @@ VM_IO | VM_PFNMAP VMAs.
521 532
522(if you break something or notice that it is broken and do not fix it yourself 533(if you break something or notice that it is broken and do not fix it yourself
523- at least put it here) 534- at least put it here)
524
525ipc/shm.c::shm_delete() - may need BKL.
526->read() and ->write() in many drivers are (probably) missing BKL.
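
Note on the file_system_type hunk in the Locking diff above: the new text
says ->mount() "returns ERR_PTR or the root dentry". The sketch below is a
userspace illustration of that return convention only, not kernel code.
ERR_PTR(), PTR_ERR(), IS_ERR() and MAX_ERRNO are re-implemented here for the
sake of a self-contained example (in the kernel they come from
include/linux/err.h). The names struct dentry (a one-field stand-in),
example_root and example_mount() are hypothetical and exist only for this
illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace re-implementation of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	/* Encode a negative errno value in the pointer itself. */
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	/* Recover the errno value encoded by ERR_PTR(). */
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for the kernel's struct dentry. */
struct dentry {
	const char *name;
};

static struct dentry example_root = { "/" };

/*
 * Hypothetical stand-in for a ->mount() method: on failure it returns
 * an ERR_PTR-encoded errno, on success the root dentry.
 */
static struct dentry *example_mount(int fail)
{
	if (fail)
		return ERR_PTR(-ENOMEM);
	return &example_root;
}
```

A caller would test the result with IS_ERR() before dereferencing it and use
PTR_ERR() to recover the error code. The superblock locking that the hunk
also requires on return is not modelled here.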
diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt
index 9e8811f92b84..5949766353f7 100644
--- a/Documentation/filesystems/adfs.txt
+++ b/Documentation/filesystems/adfs.txt
@@ -9,6 +9,9 @@ Mount options for ADFS
9 will be nnn. Default 0700. 9 will be nnn. Default 0700.
10 othmask=nnn The permission mask for ADFS 'other' permissions 10 othmask=nnn The permission mask for ADFS 'other' permissions
11 will be nnn. Default 0077. 11 will be nnn. Default 0077.
12 ftsuffix=n When ftsuffix=0, no file type suffix will be applied.
13 When ftsuffix=1, a hexadecimal suffix corresponding to
14 the RISC OS file type will be added. Default 0.
12 15
13Mapping of ADFS permissions to Linux permissions 16Mapping of ADFS permissions to Linux permissions
14------------------------------------------------ 17------------------------------------------------
@@ -55,3 +58,18 @@ Mapping of ADFS permissions to Linux permissions
55 58
56 You can therefore tailor the permission translation to whatever you 59 You can therefore tailor the permission translation to whatever you
57 desire the permissions should be under Linux. 60 desire the permissions should be under Linux.
61
62RISC OS file type suffix
63------------------------
64
65 RISC OS file types are stored in bits 19..8 of the file load address.
66
67 To enable non-RISC OS systems to be used to store files without losing
68 file type information, a file naming convention was devised (initially
69 for use with NFS) such that a hexadecimal suffix of the form ,xyz
70 denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This
71 naming convention is now also used by RISC OS emulators such as RPCEmu.
72
73 Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file
74 type suffixes to be appended to file names read from a directory. If the
75 ftsuffix option is zero or omitted, no file type suffixes will be added.
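
The suffix rule described above can be sketched in a few lines of userspace C. This is only an illustration, not the ADFS driver's code: the helper names adfs_filetype() and adfs_name_with_suffix() are hypothetical, and the sketch does not check whether a given load address actually carries a file type, which a real implementation would presumably need to do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Extract the 12-bit RISC OS file type held in bits 19..8 of the load address. */
static unsigned int adfs_filetype(uint32_t loadaddr)
{
	return (loadaddr >> 8) & 0xfff;
}

/*
 * Write "<name>,xyz" into out, where xyz is the file type in lower-case hex.
 * Returns the length of the resulting name, or -1 if it does not fit.
 */
static int adfs_name_with_suffix(const char *name, uint32_t loadaddr,
				 char *out, size_t outlen)
{
	int n;

	assert(name && out);
	n = snprintf(out, outlen, "%s,%03x", name, adfs_filetype(loadaddr));

	return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}
```

With a load address whose bits 19..8 hold 0xffb, the name BasicFile becomes BasicFile,ffb, matching the example in the text.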
diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs4-mount-control.txt
index 51986bf08a4d..4c95935cbcf4 100644
--- a/Documentation/filesystems/autofs4-mount-control.txt
+++ b/Documentation/filesystems/autofs4-mount-control.txt
@@ -309,7 +309,7 @@ ioctlfd field set to the descriptor obtained from the open call.
309AUTOFS_DEV_IOCTL_TIMEOUT_CMD 309AUTOFS_DEV_IOCTL_TIMEOUT_CMD
310---------------------------- 310----------------------------
311 311
312Set the expire timeout for mounts withing an autofs mount point. 312Set the expire timeout for mounts within an autofs mount point.
313 313
314The call requires an initialized struct autofs_dev_ioctl with the 314The call requires an initialized struct autofs_dev_ioctl with the
315ioctlfd field set to the descriptor obtained from the open call. 315ioctlfd field set to the descriptor obtained from the open call.
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index 1902c57b72ef..7cc6bf2871eb 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -95,7 +95,7 @@ restraints as possible on how an index is structured and where it is placed in
95the tree. The netfs can even mix indices and data files at the same level, but 95the tree. The netfs can even mix indices and data files at the same level, but
96it's not recommended. 96it's not recommended.
97 97
98Each index entry consists of a key of indeterminate length plus some auxilliary 98Each index entry consists of a key of indeterminate length plus some auxiliary
99data, also of indeterminate length. 99data, also of indeterminate length.
100 100
101There are some limits on indices: 101There are some limits on indices:
@@ -203,23 +203,23 @@ This has the following fields:
203 203
204 If the function is absent, a file size of 0 is assumed. 204 If the function is absent, a file size of 0 is assumed.
205 205
206 (6) A function to retrieve auxilliary data from the netfs [optional]. 206 (6) A function to retrieve auxiliary data from the netfs [optional].
207 207
208 This function will be called with the netfs data that was passed to the 208 This function will be called with the netfs data that was passed to the
209 cookie acquisition function and the maximum length of auxilliary data that 209 cookie acquisition function and the maximum length of auxiliary data that
210 it may provide. It should write the auxilliary data into the given buffer 210 it may provide. It should write the auxiliary data into the given buffer
211 and return the quantity it wrote. 211 and return the quantity it wrote.
212 212
213 If this function is absent, the auxilliary data length will be set to 0. 213 If this function is absent, the auxiliary data length will be set to 0.
214 214
215 The length of the auxilliary data buffer may be dependent on the key 215 The length of the auxiliary data buffer may be dependent on the key
216 length. A netfs mustn't rely on being able to provide more than 400 bytes 216 length. A netfs mustn't rely on being able to provide more than 400 bytes
217 for both. 217 for both.
218 218
219 (7) A function to check the auxilliary data [optional]. 219 (7) A function to check the auxiliary data [optional].
220 220
221 This function will be called to check that a match found in the cache for 221 This function will be called to check that a match found in the cache for
222 this object is valid. For instance with AFS it could check the auxilliary 222 this object is valid. For instance with AFS it could check the auxiliary
223 data against the data version number returned by the server to determine 223 data against the data version number returned by the server to determine
224 whether the index entry in a cache is still valid. 224 whether the index entry in a cache is still valid.
225 225
@@ -232,7 +232,7 @@ This has the following fields:
232 (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update 232 (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
233 (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted 233 (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
234 234
235 This function can also be used to extract data from the auxilliary data in 235 This function can also be used to extract data from the auxiliary data in
236 the cache and copy it into the netfs's structures. 236 the cache and copy it into the netfs's structures.
237 237
238 (8) A pair of functions to manage contexts for the completion callback 238 (8) A pair of functions to manage contexts for the completion callback
@@ -673,6 +673,22 @@ storage request to complete, or it may attempt to cancel the storage request -
673in which case the page will not be stored in the cache this time. 673in which case the page will not be stored in the cache this time.
674 674
675 675
676BULK INODE PAGE UNCACHE
677-----------------------
678
679A convenience routine is provided to perform an uncache on all the pages
680attached to an inode. This assumes that the pages on the inode correspond on a
6811:1 basis with the pages in the cache.
682
683 void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
684 struct inode *inode);
685
686This takes the netfs cookie that the pages were cached with and the inode that
687the pages are attached to. This function will wait for pages to finish being
688written to the cache and for the cache to finish with the page generally. No
689error is returned.
690
691
676========================== 692==========================
677INDEX AND DATA FILE UPDATE 693INDEX AND DATA FILE UPDATE
678========================== 694==========================
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt
index fabcb0e00f25..dd57bb6bb390 100644
--- a/Documentation/filesystems/configfs/configfs.txt
+++ b/Documentation/filesystems/configfs/configfs.txt
@@ -409,7 +409,7 @@ As a consequence of this, default_groups cannot be removed directly via
409rmdir(2). They also are not considered when rmdir(2) on the parent 409rmdir(2). They also are not considered when rmdir(2) on the parent
410group is checking for children. 410group is checking for children.
411 411
412[Dependant Subsystems] 412[Dependent Subsystems]
413 413
414Sometimes other drivers depend on particular configfs items. For 414Sometimes other drivers depend on particular configfs items. For
415example, ocfs2 mounts depend on a heartbeat region item. If that 415example, ocfs2 mounts depend on a heartbeat region item. If that
diff --git a/Documentation/filesystems/configfs/configfs_example_explicit.c b/Documentation/filesystems/configfs/configfs_example_explicit.c
index d428cc9f07f3..1420233dfa55 100644
--- a/Documentation/filesystems/configfs/configfs_example_explicit.c
+++ b/Documentation/filesystems/configfs/configfs_example_explicit.c
@@ -89,7 +89,7 @@ static ssize_t childless_storeme_write(struct childless *childless,
89 char *p = (char *) page; 89 char *p = (char *) page;
90 90
91 tmp = simple_strtoul(p, &p, 10); 91 tmp = simple_strtoul(p, &p, 10);
92 if (!p || (*p && (*p != '\n'))) 92 if ((*p != '\0') && (*p != '\n'))
93 return -EINVAL; 93 return -EINVAL;
94 94
95 if (tmp > INT_MAX) 95 if (tmp > INT_MAX)
@@ -464,9 +464,8 @@ static int __init configfs_example_init(void)
464 return 0; 464 return 0;
465 465
466out_unregister: 466out_unregister:
467 for (; i >= 0; i--) { 467 for (i--; i >= 0; i--)
468 configfs_unregister_subsystem(example_subsys[i]); 468 configfs_unregister_subsystem(example_subsys[i]);
469 }
470 469
471 return ret; 470 return ret;
472} 471}
@@ -475,9 +474,8 @@ static void __exit configfs_example_exit(void)
475{ 474{
476 int i; 475 int i;
477 476
478 for (i = 0; example_subsys[i]; i++) { 477 for (i = 0; example_subsys[i]; i++)
479 configfs_unregister_subsystem(example_subsys[i]); 478 configfs_unregister_subsystem(example_subsys[i]);
480 }
481} 479}
482 480
483module_init(configfs_example_init); 481module_init(configfs_example_init);
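
The out_unregister change above starts the unwind at i - 1 rather than i, so the entry whose registration failed is not itself unregistered. A minimal userspace sketch of that pattern follows; fake_register(), fake_unregister(), NSUBSYS and fail_at are hypothetical stand-ins for illustration and are not part of configfs.

```c
#include <assert.h>

#define NSUBSYS 4

static int registered[NSUBSYS];	/* 1 while entry i is registered */
static int fail_at = -1;	/* index whose registration fails, or -1 */

/* Hypothetical stand-in for a registration call that may fail. */
static int fake_register(int i)
{
	if (i == fail_at)
		return -1;
	registered[i] = 1;
	return 0;
}

/* Hypothetical stand-in for the matching unregistration call. */
static void fake_unregister(int i)
{
	assert(registered[i]);	/* unregistering an unregistered entry is a bug */
	registered[i] = 0;
}

static int register_all(void)
{
	int i, ret = 0;

	for (i = 0; i < NSUBSYS; i++) {
		ret = fake_register(i);
		if (ret)
			goto out_unregister;
	}
	return 0;

out_unregister:
	/* i is the entry that failed, so step back once before unwinding */
	for (i--; i >= 0; i--)
		fake_unregister(i);
	return ret;
}
```

With the earlier form of the loop, for (; i >= 0; i--), the assertion in fake_unregister() would fire for the failed entry.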
diff --git a/Documentation/filesystems/configfs/configfs_example_macros.c b/Documentation/filesystems/configfs/configfs_example_macros.c
index d8e30a0378aa..327dfbc640a9 100644
--- a/Documentation/filesystems/configfs/configfs_example_macros.c
+++ b/Documentation/filesystems/configfs/configfs_example_macros.c
@@ -427,9 +427,8 @@ static int __init configfs_example_init(void)
427 return 0; 427 return 0;
428 428
429out_unregister: 429out_unregister:
430 for (; i >= 0; i--) { 430 for (i--; i >= 0; i--)
431 configfs_unregister_subsystem(example_subsys[i]); 431 configfs_unregister_subsystem(example_subsys[i]);
432 }
433 432
434 return ret; 433 return ret;
435} 434}
@@ -438,9 +437,8 @@ static void __exit configfs_example_exit(void)
438{ 437{
439 int i; 438 int i;
440 439
441 for (i = 0; example_subsys[i]; i++) { 440 for (i = 0; example_subsys[i]; i++)
442 configfs_unregister_subsystem(example_subsys[i]); 441 configfs_unregister_subsystem(example_subsys[i]);
443 }
444} 442}
445 443
446module_init(configfs_example_init); 444module_init(configfs_example_init);
diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt
deleted file mode 100644
index 79334ed5daa7..000000000000
--- a/Documentation/filesystems/dentry-locking.txt
+++ /dev/null
@@ -1,174 +0,0 @@
1RCU-based dcache locking model
2==============================
3
4On many workloads, the most common operation on dcache is to look up a
5dentry, given a parent dentry and the name of the child. Typically,
6for every open(), stat() etc., the dentry corresponding to the
7pathname will be looked up by walking the tree starting with the first
8component of the pathname and using that dentry along with the next
9component to look up the next level and so on. Since it is a frequent
10operation for workloads like multiuser environments and web servers,
11it is important to optimize this path.
12
13Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in
14every component during path look-up. Since 2.5.10 onwards, fast-walk
15algorithm changed this by holding the dcache_lock at the beginning and
16walking as many cached path component dentries as possible. This
17significantly decreases the number of acquisition of
18dcache_lock. However it also increases the lock hold time
19significantly and affects performance in large SMP machines. Since
202.5.62 kernel, dcache has been using a new locking model that uses RCU
21to make dcache look-up lock-free.
22
23The current dcache locking model is not very different from the
24existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock
25protected the hash chain, d_child, d_alias, d_lru lists as well as
26d_inode and several other things like mount look-up. RCU-based changes
27affect only the way the hash chain is protected. For everything else
28the dcache_lock must be taken for both traversing as well as
29updating. The hash chain updates too take the dcache_lock. The
30significant change is the way d_lookup traverses the hash chain, it
31doesn't acquire the dcache_lock for this and rely on RCU to ensure
32that the dentry has not been *freed*.
33
34
35Dcache locking details
36======================
37
38For many multi-user workloads, open() and stat() on files are very
39frequently occurring operations. Both involve walking of path names to
40find the dentry corresponding to the concerned file. In 2.4 kernel,
41dcache_lock was held during look-up of each path component. Contention
42and cache-line bouncing of this global lock caused significant
43scalability problems. With the introduction of RCU in Linux kernel,
44this was worked around by making the look-up of path components during
45path walking lock-free.
46
47
48Safe lock-free look-up of dcache hash table
49===========================================
50
51Dcache is a complex data structure with the hash table entries also
52linked together in other lists. In 2.4 kernel, dcache_lock protected
53all the lists. We applied RCU only on hash chain walking. The rest of
54the lists are still protected by dcache_lock. Some of the important
55changes are :
56
571. The deletion from hash chain is done using hlist_del_rcu() macro
58 which doesn't initialize next pointer of the deleted dentry and
59 this allows us to walk safely lock-free while a deletion is
60 happening.
61
622. Insertion of a dentry into the hash table is done using
63 hlist_add_head_rcu() which take care of ordering the writes - the
64 writes to the dentry must be visible before the dentry is
65 inserted. This works in conjunction with hlist_for_each_rcu(),
66 which has since been replaced by hlist_for_each_entry_rcu(), while
67 walking the hash chain. The only requirement is that all
68 initialization to the dentry must be done before
69 hlist_add_head_rcu() since we don't have dcache_lock protection
70 while traversing the hash chain. This isn't different from the
71 existing code.
72
733. The dentry looked up without holding dcache_lock cannot be
74 returned for walking if it is unhashed. It then may have a NULL
75 d_inode or other bogosity since RCU doesn't protect the other
76 fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
77 indicate unhashed dentries and use this in conjunction with a
78 per-dentry lock (d_lock). Once looked up without the dcache_lock,
79 we acquire the per-dentry lock (d_lock) and check if the dentry is
80 unhashed. If so, the look-up is failed. If not, the reference count
81 of the dentry is increased and the dentry is returned.
82
834. Once a dentry is looked up, it must be ensured during the path walk
84 for that component it doesn't go away. In pre-2.5.10 code, this was
85 done holding a reference to the dentry. dcache_rcu does the same.
86 In some sense, dcache_rcu path walking looks like the pre-2.5.10
87 version.
88
895. All dentry hash chain updates must take the dcache_lock as well as
90 the per-dentry lock in that order. dput() does this to ensure that
91 a dentry that has just been looked up in another CPU doesn't get
92 deleted before dget() can be done on it.
93
946. There are several ways to do reference counting of RCU protected
95 objects. One such example is in ipv4 route cache where deferred
96 freeing (using call_rcu()) is done as soon as the reference count
97 goes to zero. This cannot be done in the case of dentries because
98 tearing down of dentries require blocking (dentry_iput()) which
99 isn't supported from RCU callbacks. Instead, tearing down of
100 dentries happen synchronously in dput(), but actual freeing happens
101 later when RCU grace period is over. This allows safe lock-free
102 walking of the hash chains, but a matched dentry may have been
103 partially torn down. The checking of DCACHE_UNHASHED flag with
104 d_lock held detects such dentries and prevents them from being
105 returned from look-up.
106
107
108Maintaining POSIX rename semantics
109==================================
110
111Since look-up of dentries is lock-free, it can race against a
112concurrent rename operation. For example, during rename of file A to
113B, look-up of either A or B must succeed. So, if look-up of B happens
114after A has been removed from the hash chain but not added to the new
115hash chain, it may fail. Also, a comparison while the name is being
116written concurrently by a rename may result in false positive matches
117violating rename semantics. Issues related to race with rename are
118handled as described below :
119
1201. Look-up can be done in two ways - d_lookup() which is safe from
121 simultaneous renames and __d_lookup() which is not. If
122 __d_lookup() fails, it must be followed up by a d_lookup() to
123 correctly determine whether a dentry is in the hash table or
124 not. d_lookup() protects look-ups using a sequence lock
125 (rename_lock).
126
1272. The name associated with a dentry (d_name) may be changed if a
128 rename is allowed to happen simultaneously. To avoid memcmp() in
129 __d_lookup() go out of bounds due to a rename and false positive
130 comparison, the name comparison is done while holding the
131 per-dentry lock. This prevents concurrent renames during this
132 operation.
133
1343. Hash table walking during look-up may move to a different bucket as
135 the current dentry is moved to a different bucket due to rename.
136 But we use hlists in dcache hash table and they are
137 null-terminated. So, even if a dentry moves to a different bucket,
138 hash chain walk will terminate. [with a list_head list, it may not
139 since termination is when the list_head in the original bucket is
140 reached]. Since we redo the d_parent check and compare name while
141 holding d_lock, lock-free look-up will not race against d_move().
142
1434. There can be a theoretical race when a dentry keeps coming back to
144 original bucket due to double moves. Due to this look-up may
145 consider that it has never moved and can end up in an infinite loop.
146 But this is not any worse than theoretical livelocks we already
147 have in the kernel.
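
The retry scheme in point 1 above (a lock-free look-up that is repeated only when it misses and a rename may have run concurrently) can be sketched in userspace with C11 atomics. Everything here is an assumption made for illustration: struct seq stands in for a sequence lock such as rename_lock, lookup_with_retry() for the d_lookup() wrapper, and demo_lookup() for a fast look-up that misses once. The memory ordering is simplified and the sketch is not the kernel's seqlock implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a sequence lock; an odd count means "writer active". */
struct seq {
	atomic_uint count;
};

static unsigned int seq_read_begin(struct seq *s)
{
	unsigned int c;

	/* Spin until no writer is in progress, then remember the count. */
	while ((c = atomic_load_explicit(&s->count, memory_order_acquire)) & 1)
		;
	return c;
}

/* Returns non-zero if a writer ran since seq_read_begin() returned start. */
static int seq_read_retry(struct seq *s, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->count, memory_order_relaxed) != start;
}

static void seq_write_begin(struct seq *s) { atomic_fetch_add(&s->count, 1); }
static void seq_write_end(struct seq *s)   { atomic_fetch_add(&s->count, 1); }

typedef int (*lookup_fn)(void *arg);	/* returns 1 on a hit, 0 on a miss */

/* A hit is returned at once; a miss is trusted only if no rename intervened. */
static int lookup_with_retry(struct seq *s, lookup_fn fast, void *arg)
{
	unsigned int start;
	int found;

	assert(s && fast);
	do {
		start = seq_read_begin(s);
		found = fast(arg);
		if (found)
			break;
	} while (seq_read_retry(s, start));
	return found;
}

struct demo {
	struct seq *lock;
	int calls;
};

/* Demo look-up: the first call misses while a "rename" runs, later calls hit. */
static int demo_lookup(void *arg)
{
	struct demo *d = arg;

	if (d->calls++ == 0) {
		seq_write_begin(d->lock);
		seq_write_end(d->lock);
		return 0;
	}
	return 1;
}
```

In the demo the first miss coincides with a completed write section, so the sequence count has changed and the look-up is repeated instead of being reported as a failure.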
148
149
150Important guidelines for filesystem developers related to dcache_rcu
151====================================================================
152
1531. Existing dcache interfaces (pre-2.5.62) exported to filesystem
154 don't change. Only dcache internal implementation changes. However
155 filesystems *must not* delete from the dentry hash chains directly
156 using the list macros like allowed earlier. They must use dcache
157 APIs like d_drop() or __d_drop() depending on the situation.
158
1592. d_flags is now protected by a per-dentry lock (d_lock). All access
160 to d_flags must be protected by it.
161
1623. For a hashed dentry, checking of d_count needs to be protected by
163 d_lock.
164
165
166Papers and other documentation on dcache locking
167================================================
168
1691. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
170
1712. http://lse.sourceforge.net/locking/dcache/dcache.html
172
173
174
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt
index abd2a9b5b787..23583a136975 100644
--- a/Documentation/filesystems/exofs.txt
+++ b/Documentation/filesystems/exofs.txt
@@ -104,7 +104,15 @@ Where:
104 exofs specific options: Options are separated by commas (,) 104 exofs specific options: Options are separated by commas (,)
105 pid=<integer> - The partition number to mount/create as 105 pid=<integer> - The partition number to mount/create as
106 container of the filesystem. 106 container of the filesystem.
107 This option is mandatory. 107 This option is mandatory. integer can be
108 hexadecimal when prefixed with 0x.
109 osdname=<id> - Mount by a device's osdname.
110 osdname is usually a 36 character uuid of the
111 form "d2683732-c906-4ee1-9dbd-c10c27bb40df".
112 It is one of the device's uuid specified in the
113 mkfs.exofs format command.
114 If this option is specified then the /dev/osdX
115 above can be empty and is ignored.
108 to=<integer> - Timeout in ticks for a single command. 116 to=<integer> - Timeout in ticks for a single command.
109 default is (60 * HZ) [for debugging only] 117 default is (60 * HZ) [for debugging only]
110 118
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index e1def1786e50..3ae9bc94352a 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -97,7 +97,7 @@ Note: More extensive information for getting started with ext4 can be
97* Inode allocation using large virtual block groups via flex_bg 97* Inode allocation using large virtual block groups via flex_bg
98* delayed allocation 98* delayed allocation
99* large block (up to pagesize) support 99* large block (up to pagesize) support
100* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force 100* efficient new ordered mode in JBD2 and ext4(avoid using buffer head to force
101 the ordering) 101 the ordering)
102 102
103[1] Filesystems with a block size of 1k may see a limit imposed by the 103[1] Filesystems with a block size of 1k may see a limit imposed by the
@@ -106,7 +106,7 @@ directory hash tree having a maximum depth of two.
1062.2 Candidate features for future inclusion 1062.2 Candidate features for future inclusion
107 107
108* Online defrag (patches available but not well tested) 108* Online defrag (patches available but not well tested)
109* reduced mke2fs time via lazy itable initialization in conjuction with 109* reduced mke2fs time via lazy itable initialization in conjunction with
110 the uninit_bg feature (capability to do this is available in e2fsprogs 110 the uninit_bg feature (capability to do this is available in e2fsprogs
111 but a kernel thread to do lazy zeroing of unused inode table blocks 111 but a kernel thread to do lazy zeroing of unused inode table blocks
112 after filesystem is first mounted is required for safety) 112 after filesystem is first mounted is required for safety)
@@ -226,10 +226,6 @@ acl Enables POSIX Access Control Lists support.
226noacl This option disables POSIX Access Control List 226noacl This option disables POSIX Access Control List
227 support. 227 support.
228 228
229reservation
230
231noreservation
232
233bsddf (*) Make 'df' act like BSD. 229bsddf (*) Make 'df' act like BSD.
234minixdf Make 'df' act like Minix. 230minixdf Make 'df' act like Minix.
235 231
@@ -353,12 +349,61 @@ noauto_da_alloc replacing existing files via patterns such as
353 system crashes before the delayed allocation 349 system crashes before the delayed allocation
354 blocks are forced to disk. 350 blocks are forced to disk.
355 351
356discard Controls whether ext4 should issue discard/TRIM 352noinit_itable Do not initialize any uninitialized inode table
353 blocks in the background. This feature may be
354 used by installation CD's so that the install
355 process can complete as quickly as possible; the
356 inode table initialization process would then be
357 deferred until the next time the file system
358 is unmounted.
359
360init_itable=n The lazy itable init code will wait n times the
361 number of milliseconds it took to zero out the
362 previous block group's inode table. This
363 minimizes the impact on system performance
364 while the file system's inode table is being initialized.
365
366discard Controls whether ext4 should issue discard/TRIM
357nodiscard(*) commands to the underlying block device when 367nodiscard(*) commands to the underlying block device when
358 blocks are freed. This is useful for SSD devices 368 blocks are freed. This is useful for SSD devices
359 and sparse/thinly-provisioned LUNs, but it is off 369 and sparse/thinly-provisioned LUNs, but it is off
360 by default until sufficient testing has been done. 370 by default until sufficient testing has been done.
361 371
372nouid32 Disables 32-bit UIDs and GIDs. This is for
373 interoperability with older kernels which only
374 store and expect 16-bit values.
375
376resize Allows the filesystem to be resized to the end of the
377 last existing block group; further resizing has to be
378 done with resize2fs, either online or offline. It can
379 be used only in conjunction with remount.
380
381block_validity These options enable or disable the in-kernel
382noblock_validity facility for tracking filesystem metadata blocks
383 within internal data structures. This allows multi-
384 block allocator and other routines to quickly locate
385 extents which might overlap with filesystem metadata
386 blocks. This option is intended for debugging
387 purposes and since it negatively affects the
388 performance, it is off by default.
389
390dioread_lock Controls whether or not ext4 should use the DIO read
391dioread_nolock locking. If the dioread_nolock option is specified
392 ext4 will allocate uninitialized extent before buffer
393 write and convert the extent to initialized after IO
394 completes. This approach allows ext4 code to avoid
395 using inode mutex, which improves scalability on high
396 speed storages. However this does not work with nobh
397 option and the mount will fail. Nor does it work with
398 data journaling and dioread_nolock option will be
399 ignored with kernel warning. Note that dioread_nolock
400 code path is only used for extent-based files.
401 Because of the restrictions this option comprises
402 it is off by default (i.e. dioread_lock).
403
404i_version Enable 64-bit inode version support. This option is
405 off by default.
406
362Data Mode 407Data Mode
363========= 408=========
364There are 3 different data modes: 409There are 3 different data modes:
@@ -386,6 +431,176 @@ needs to be read from and written to disk at the same time where it
386outperforms all others modes. Currently ext4 does not have delayed 431outperforms all others modes. Currently ext4 does not have delayed
387allocation support if this data journalling mode is selected. 432allocation support if this data journalling mode is selected.
388 433
434/proc entries
435=============
436
437Information about mounted ext4 file systems can be found in
438/proc/fs/ext4. Each mounted filesystem will have a directory in
439/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
440/proc/fs/ext4/dm-0). The files in each per-device directory are shown
441in the table below.
442
443Files in /proc/fs/ext4/<devname>
444..............................................................................
445 File Content
446 mb_groups details of multiblock allocator buddy cache of free blocks
447..............................................................................
448
449/sys entries
450============
451
452Information about mounted ext4 file systems can be found in
453/sys/fs/ext4. Each mounted filesystem will have a directory in
454/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
455/sys/fs/ext4/dm-0). The files in each per-device directory are shown
456in the table below.
457
458Files in /sys/fs/ext4/<devname>
459(see also Documentation/ABI/testing/sysfs-fs-ext4)
460..............................................................................
461 File Content
462
463 delayed_allocation_blocks This file is read-only and shows the number of
464 blocks that are dirty in the page cache, but
465 which do not have their location in the
466 filesystem allocated yet.
467
468 inode_goal Tuning parameter which (if non-zero) controls
469 the goal inode used by the inode allocator in
470 preference to all other allocation heuristics.
471 This is intended for debugging use only, and
472 should be 0 on production systems.
473
474 inode_readahead_blks Tuning parameter which controls the maximum
475 number of inode table blocks that ext4's inode
476 table readahead algorithm will pre-read into
477 the buffer cache
478
479 lifetime_write_kbytes This file is read-only and shows the number of
480 kilobytes of data that have been written to this
481 filesystem since it was created.
482
483 max_writeback_mb_bump The maximum number of megabytes the writeback
484 code will try to write out before moving on to
485 another inode.
486
487 mb_group_prealloc The multiblock allocator will round up allocation
488 requests to a multiple of this tuning parameter if
489 the stripe size is not set in the ext4 superblock
490
491 mb_max_to_scan The maximum number of extents the multiblock
492 allocator will search to find the best extent
493
494 mb_min_to_scan The minimum number of extents the multiblock
495 allocator will search to find the best extent
496
497 mb_order2_req Tuning parameter which controls the minimum size
498 for requests (as a power of 2) where the buddy
499 cache is used
500
501 mb_stats Controls whether the multiblock allocator should
502 collect statistics, which are shown during the
503 unmount. 1 means to collect statistics, 0 means
504 not to collect statistics
505
506 mb_stream_req Files which have fewer blocks than this tunable
507 parameter will have their blocks allocated out
508 of a block group specific preallocation pool, so
509 that small files are packed closely together.
510 Each large file will have its blocks allocated
511 out of its own unique preallocation pool.
512
513 session_write_kbytes This file is read-only and shows the number of
514 kilobytes of data that have been written to this
515 filesystem since it was mounted.
516..............................................................................
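
As a rough illustration of how the per-device files above might be consumed, the following userspace sketch builds the attribute path and reads a single numeric value from it. The helper names ext4_attr_path() and read_attr_ull() are hypothetical, and the sketch assumes the attribute holds one unsigned decimal number, as the read-only counters in the table are described.

```c
#include <assert.h>
#include <stdio.h>

/* Build "/sys/fs/ext4/<devname>/<attr>" into buf; returns -1 if it does not fit. */
static int ext4_attr_path(char *buf, size_t len,
			  const char *devname, const char *attr)
{
	int n;

	assert(buf && devname && attr);
	n = snprintf(buf, len, "/sys/fs/ext4/%s/%s", devname, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one unsigned decimal value from an attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_attr_ull(const char *path, unsigned long long *val)
{
	FILE *f;
	int ok;

	assert(path && val);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

A caller would pass a device name such as dm-0 and an attribute such as session_write_kbytes to ext4_attr_path(), then hand the resulting path to read_attr_ull(); both steps can fail and should be checked.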
517
518Ioctls
519======
520
521There is some Ext4 specific functionality which can be accessed by applications
522through the system call interfaces. The list of all Ext4 specific ioctls is
523shown in the table below.
524
525Table of Ext4 specific ioctls
526..............................................................................
527 Ioctl Description
528 EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
529 The ioctl argument is an integer bitfield, with
530 bit values described in ext4.h. This ioctl is an
531 alias for FS_IOC_GETFLAGS.
532
533 EXT4_IOC_SETFLAGS Set additional attributes associated with inode.
534 The ioctl argument is an integer bitfield, with
535 bit values described in ext4.h. This ioctl is an
536 alias for FS_IOC_SETFLAGS.
537
538 EXT4_IOC_GETVERSION
539 EXT4_IOC_GETVERSION_OLD
540 Get the inode i_generation number stored for
541 each inode. The i_generation number is normally
542 changed only when new inode is created and it is
543 particularly useful for network filesystems. The
544 '_OLD' version of this ioctl is an alias for
545 FS_IOC_GETVERSION.
546
547 EXT4_IOC_SETVERSION
548 EXT4_IOC_SETVERSION_OLD
549 Set the inode i_generation number stored for
550 each inode. The '_OLD' version of this ioctl
551 is an alias for FS_IOC_SETVERSION.
552
553 EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize
554 mount option. It allows to resize filesystem
555 to the end of the last existing block group,
556 further resize has to be done with resize2fs,
557 either online, or offline. The argument points
558 to the unsigned long number representing the
559 filesystem new block count.
560
561 EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one
562 this ioctl is pointing to) to the donor_fd (the
563 one specified in move_extent structure passed
564 as an argument to this ioctl). Then, exchange
565 inode metadata between orig_fd and donor_fd.
566 This is especially useful for online
567 defragmentation, because the allocator has the
568 opportunity to allocate moved blocks better,
569 ideally into one contiguous extent.
570
571 EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or
572 new group descriptor block. The new group
573 descriptor is described by ext4_new_group_input
574 structure, which is passed as an argument to
575 this ioctl. This is especially useful in
576 conjunction with EXT4_IOC_GROUP_EXTEND,
577 which allows online resize of the filesystem
578 to the end of the last existing block group.
579 These two ioctls combined are used in userspace
580 online resize tools (e.g. resize2fs).
581
582 EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself.
583 It converts (migrates) ext3 indirect block mapped
584 inode to ext4 extent mapped inode by walking
585 through indirect block mapping of the original
586 inode and converting contiguous block ranges
587 into ext4 extents of the temporary inode. Then,
588 inodes are swapped. This ioctl might help when
589 migrating from ext3 to ext4; however, the
590 suggested approach is to create a fresh ext4 filesystem
591 and copy data from the backup. Note that the
592 filesystem has to support extents for this ioctl
593 to work.
594
595 EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be
596 allocated to preserve application-expected ext3
597 behaviour. Note that this will also start
598 triggering a write of the data blocks, but this
599 behaviour may change in the future as it is
600 not necessary and has been done this way only
601 for sake of simplicity.
602..............................................................................
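As a rough illustration of the first table entry, the sketch below reads the inode flag bitfield through FS_IOC_GETFLAGS (which the table names as the alias of EXT4_IOC_GETFLAGS) and decodes three bits. The helper names read_inode_flags() and describe_flags() are hypothetical, and the one-letter tags are an assumption chosen to resemble lsattr output; only the ioctl number and the FS_*_FL constants come from <linux/fs.h>.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IMMUTABLE_FL, FS_APPEND_FL, FS_EXTENT_FL */

/* Hypothetical helper: fetch the inode flag bitfield of `path`.
 * Returns 0 on success and -1 if the file cannot be opened or the
 * filesystem does not support the ioctl. */
int read_inode_flags(const char *path, int *flags)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ret = ioctl(fd, FS_IOC_GETFLAGS, flags);
    close(fd);
    return ret < 0 ? -1 : 0;
}

/* Hypothetical helper: render a few flag bits as letters in `buf`
 * ('i' immutable, 'a' append-only, 'e' extent mapped). The output is
 * truncated if `len` is too small; `buf` is returned for convenience. */
char *describe_flags(int flags, char *buf, size_t len)
{
    size_t n = 0;

    if (len == 0)
        return buf;
    if ((flags & FS_IMMUTABLE_FL) && n + 1 < len)
        buf[n++] = 'i';
    if ((flags & FS_APPEND_FL) && n + 1 < len)
        buf[n++] = 'a';
    if ((flags & FS_EXTENT_FL) && n + 1 < len)
        buf[n++] = 'e';
    buf[n] = '\0';
    return buf;
}
```

A caller would typically invoke read_inode_flags() on a file that lives on an ext4 mount and pass the result to describe_flags(); on filesystems without flag support the ioctl is expected to fail, so the return value should always be checked.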
603
389References 604References
390========== 605==========
391 606
diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.txt
index fd966dc9979a..d81889669293 100644
--- a/Documentation/filesystems/gfs2-uevents.txt
+++ b/Documentation/filesystems/gfs2-uevents.txt
@@ -62,7 +62,7 @@ be fixed.
62 62
63The REMOVE uevent is generated at the end of an unsuccessful mount 63The REMOVE uevent is generated at the end of an unsuccessful mount
64or at the end of a umount of the filesystem. All REMOVE uevents will 64or at the end of a umount of the filesystem. All REMOVE uevents will
65have been preceeded by at least an ADD uevent for the same fileystem, 65have been preceded by at least an ADD uevent for the same fileystem,
66and unlike the other uevents is generated automatically by the kernel's 66and unlike the other uevents is generated automatically by the kernel's
67kobject subsystem. 67kobject subsystem.
68 68
diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt
index 0b59c0200912..4cda926628aa 100644
--- a/Documentation/filesystems/gfs2.txt
+++ b/Documentation/filesystems/gfs2.txt
@@ -11,7 +11,7 @@ their I/O so file system consistency is maintained. One of the nifty
11features of GFS is perfect consistency -- changes made to the file system 11features of GFS is perfect consistency -- changes made to the file system
12on one machine show up immediately on all other machines in the cluster. 12on one machine show up immediately on all other machines in the cluster.
13 13
14GFS uses interchangable inter-node locking mechanisms, the currently 14GFS uses interchangeable inter-node locking mechanisms, the currently
15supported mechanisms are: 15supported mechanisms are:
16 16
17 lock_nolock -- allows gfs to be used as a local file system 17 lock_nolock -- allows gfs to be used as a local file system
diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
index 2f68cd688769..a57e12411d2a 100644
--- a/Documentation/filesystems/nfs/00-INDEX
+++ b/Documentation/filesystems/nfs/00-INDEX
@@ -12,5 +12,9 @@ nfs-rdma.txt
12 - how to install and setup the Linux NFS/RDMA client and server software 12 - how to install and setup the Linux NFS/RDMA client and server software
13nfsroot.txt 13nfsroot.txt
14 - short guide on setting up a diskless box with NFS root filesystem. 14 - short guide on setting up a diskless box with NFS root filesystem.
15pnfs.txt
16 - short explanation of some of the internals of the pnfs client code
15rpc-cache.txt 17rpc-cache.txt
16 - introduction to the caching mechanisms in the sunrpc layer. 18 - introduction to the caching mechanisms in the sunrpc layer.
19idmapper.txt
20 - information for configuring request-keys to be used by idmapper
diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt
new file mode 100644
index 000000000000..9c8fd6148656
--- /dev/null
+++ b/Documentation/filesystems/nfs/idmapper.txt
@@ -0,0 +1,67 @@
1
2=========
3ID Mapper
4=========
5Id mapper is used by NFS to translate user and group ids into names, and to
6translate user and group names into ids. Part of this translation involves
7performing an upcall to userspace to request the information. Id mapper will
8use request-key to perform this upcall and cache the result. The program
9/usr/sbin/nfs.idmap should be called by request-key, and will perform the
10translation and initialize a key with the resulting information.
11
12 NFS_USE_NEW_IDMAPPER must be selected when configuring the kernel to use this
13 feature.
14
15===========
16Configuring
17===========
18The file /etc/request-key.conf will need to be modified so /sbin/request-key can
19direct the upcall. The following line should be added:
20
21#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
22#====== ======= =============== =============== ===============================
23create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
24
25This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap.
26The last parameter, 600, defines how many seconds into the future the key will
27expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout
28is not specified, nfs.idmap will default to 600 seconds.
29
30id mapper uses four key descriptions:
31 uid: Find the UID for the given user
32 gid: Find the GID for the given group
33 user: Find the user name for the given UID
34 group: Find the group name for the given GID
35
36You can handle any of these individually, rather than using the generic upcall
37program. If you would like to use your own program for a uid lookup then you
38would edit your request-key.conf so it looks similar to this:
39
40#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
41#====== ======= =============== =============== ===============================
42create id_resolver uid:* * /some/other/program %k %d 600
43create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
44
45Notice that the new line was added above the line for the generic program.
46request-key will find the first matching line and corresponding program. In
47this case, /some/other/program will handle all uid lookups and
48/usr/sbin/nfs.idmap will handle gid, user, and group lookups.
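The first-match behaviour described above can be modelled with a small table scan. This is only a sketch under stated assumptions: the struct rk_rule type and pick_program() function are hypothetical, the rule table simply mirrors the two example lines, and fnmatch() stands in for whatever pattern matching request-key really performs.

```c
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one request-key.conf line. */
struct rk_rule {
    const char *op;
    const char *type;
    const char *description;   /* glob pattern, e.g. "uid:*" */
    const char *program;
};

/* Same order as the example configuration: specific rule first. */
static const struct rk_rule rules[] = {
    { "create", "id_resolver", "uid:*", "/some/other/program" },
    { "create", "id_resolver", "*",     "/usr/sbin/nfs.idmap" },
};

/* Return the program of the first rule that matches, or NULL. */
const char *pick_program(const char *op, const char *type, const char *desc)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (strcmp(rules[i].op, op) == 0 &&
            fnmatch(rules[i].type, type, 0) == 0 &&
            fnmatch(rules[i].description, desc, 0) == 0)
            return rules[i].program;
    }
    return NULL;
}
```

Swapping the two table entries would make the generic "*" rule win for every description, which is why the specific line has to be placed above the generic one.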
49
50See <file:Documentation/security/keys-request-keys.txt> for more information
51about the request-key function.
52
53
54=========
55nfs.idmap
56=========
57nfs.idmap is designed to be called by request-key, and should not be run "by
58hand". This program takes two arguments, a serialized key and a key
59description. The serialized key is first converted into a key_serial_t, and
60then passed as an argument to keyctl_instantiate (both are part of keyutils.h).
61
62The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap
63determines the correct function to call by looking at the first part of the
64description string. For example, a uid lookup description will appear as
65"uid:user@domain".
66
67nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise.
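To make the description-prefix dispatch concrete, here is a minimal sketch of how a program might classify a key description such as "uid:user@domain". The enum and the classify_description() function are hypothetical names for illustration; the actual nfs.idmap source may be organised quite differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup kinds, one per key description listed above. */
enum idmap_kind {
    IDMAP_UID,
    IDMAP_GID,
    IDMAP_USER,
    IDMAP_GROUP,
    IDMAP_UNKNOWN
};

/* Classify `desc` by its prefix. If `value` is non-NULL it is set to
 * the text after the colon (or to NULL when the prefix is unknown). */
enum idmap_kind classify_description(const char *desc, const char **value)
{
    static const struct {
        const char *prefix;
        enum idmap_kind kind;
    } table[] = {
        { "uid:",   IDMAP_UID   },
        { "gid:",   IDMAP_GID   },
        { "user:",  IDMAP_USER  },
        { "group:", IDMAP_GROUP },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        size_t n = strlen(table[i].prefix);
        if (strncmp(desc, table[i].prefix, n) == 0) {
            if (value)
                *value = desc + n;
            return table[i].kind;
        }
    }
    if (value)
        *value = NULL;
    return IDMAP_UNKNOWN;
}
```

The returned kind would then select which lookup function to call, and the remainder of the string would be passed to it as the name or id to translate.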
diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt
index f2430a7974e1..90c71c6f0d00 100644
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -159,6 +159,28 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
159 Default: any 159 Default: any
160 160
161 161
162nfsrootdebug
163
164 This parameter enables debugging messages to appear in the kernel
165 log at boot time so that administrators can verify that the correct
166 NFS mount options, server address, and root path are passed to the
167 NFS client.
168
169
170rdinit=<executable file>
171
172 To specify which file contains the program that starts system
173 initialization, administrators can use this command line parameter.
174 The default value of this parameter is "/init". If the specified
175 file exists and the kernel can execute it, root filesystem related
176 kernel command line parameters, including `nfsroot=', are ignored.
177
178 A description of the process of mounting the root file system can be
179 found in:
180
181 Documentation/early-userspace/README
182
183
162 184
163 185
1643.) Boot Loader 1863.) Boot Loader
diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
new file mode 100644
index 000000000000..983e14abe7e9
--- /dev/null
+++ b/Documentation/filesystems/nfs/pnfs.txt
@@ -0,0 +1,55 @@
1Reference counting in pnfs:
2==========================
3
4The are several inter-related caches. We have layouts which can
5reference multiple devices, each of which can reference multiple data servers.
6Each data server can be referenced by multiple devices. Each device
7can be referenced by multiple layouts. To keep all of this straight,
8we need to reference count.
9
10
11struct pnfs_layout_hdr
12----------------------
13The on-the-wire command LAYOUTGET corresponds to struct
14pnfs_layout_segment, usually referred to by the variable name lseg.
15Each nfs_inode may hold a pointer to a cache of these layout
16segments in nfsi->layout, of type struct pnfs_layout_hdr.
17
18We reference the header for the inode pointing to it, across each
19outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
20LAYOUTCOMMIT), and for each lseg held within.
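The three sources of header references named above can be pictured with a single-threaded counting model. This is a sketch only: the toy_* names are hypothetical, plain ints replace the kernel's atomic counters, and no locking or freeing is shown.

```c
#include <assert.h>

/* Toy stand-in for struct pnfs_layout_hdr (illustration only). */
struct toy_layout_hdr {
    int refcount;   /* inode pointer + outstanding RPCs + held lsegs */
    int nr_lsegs;   /* layout segments currently cached in the header */
};

static void toy_hdr_get(struct toy_layout_hdr *h)
{
    h->refcount++;
}

/* Returns 1 when the last reference went away, i.e. the header could
 * now be freed, and 0 otherwise. */
static int toy_hdr_put(struct toy_layout_hdr *h)
{
    assert(h->refcount > 0);
    return --h->refcount == 0;
}

/* Reference held for the inode that points at the header. */
void toy_inode_attach(struct toy_layout_hdr *h) { toy_hdr_get(h); }
int  toy_inode_detach(struct toy_layout_hdr *h) { return toy_hdr_put(h); }

/* Reference held across an outstanding RPC (LAYOUTGET and friends). */
void toy_rpc_begin(struct toy_layout_hdr *h) { toy_hdr_get(h); }
int  toy_rpc_end(struct toy_layout_hdr *h)   { return toy_hdr_put(h); }

/* Reference held for each lseg kept within the header. */
void toy_lseg_add(struct toy_layout_hdr *h)
{
    h->nr_lsegs++;
    toy_hdr_get(h);
}

int toy_lseg_remove(struct toy_layout_hdr *h)
{
    assert(h->nr_lsegs > 0);
    h->nr_lsegs--;
    return toy_hdr_put(h);
}
```

With one inode pointer, one RPC in flight and one cached lseg the count is three, and the header only becomes freeable once all three holders have dropped their reference.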
21
22Each header is also (when non-empty) put on a list associated with
23struct nfs_client (cl_layouts). Being put on this list does not bump
24the reference count, as the layout is kept around by the lseg that
25keeps it in the list.
26
27deviceid_cache
28--------------
29lsegs reference device ids, which are resolved per nfs_client and
30layout driver type. The device ids are held in a RCU cache (struct
31nfs4_deviceid_cache). The cache itself is referenced across each
32mount. The entries (struct nfs4_deviceid) themselves are held across
33the lifetime of each lseg referencing them.
34
35RCU is used because the deviceid is basically a write once, read many
36data structure. The hlist size of 32 buckets needs better
37justification, but seems reasonable given that we can have multiple
38deviceid's per filesystem, and multiple filesystems per nfs_client.
39
40The hash code is copied from the nfsd code base. A discussion of
41hashing and variations of this algorithm can be found at:
42http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
43
44data server cache
45-----------------
46file driver devices refer to data servers, which are kept in a module
47level cache. Its reference is held over the lifetime of the deviceid
48pointing to it.
49
50lseg
51----
52lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
53bit which holds it in the pnfs_layout_hdr's list. When the final lseg
54is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
55bit is set, preventing any new lsegs from being added.
diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt
index d5c0cef38a71..873a2ab2e9f8 100644
--- a/Documentation/filesystems/nilfs2.txt
+++ b/Documentation/filesystems/nilfs2.txt
@@ -40,7 +40,6 @@ Features which NILFS2 does not support yet:
40 - POSIX ACLs 40 - POSIX ACLs
41 - quotas 41 - quotas
42 - fsck 42 - fsck
43 - resize
44 - defragmentation 43 - defragmentation
45 44
46Mount options 45Mount options
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt
index ac2a261c5f7d..791af8dac065 100644
--- a/Documentation/filesystems/ntfs.txt
+++ b/Documentation/filesystems/ntfs.txt
@@ -350,7 +350,7 @@ Note the "Should sync?" parameter "nosync" means that the two mirrors are
350already in sync which will be the case on a clean shutdown of Windows. If the 350already in sync which will be the case on a clean shutdown of Windows. If the
351mirrors are not clean, you can specify the "sync" option instead of "nosync" 351mirrors are not clean, you can specify the "sync" option instead of "nosync"
352and the Device-Mapper driver will then copy the entirety of the "Source Device" 352and the Device-Mapper driver will then copy the entirety of the "Source Device"
353to the "Target Device" or if you specified multipled target devices to all of 353to the "Target Device" or if you specified multiple target devices to all of
354them. 354them.
355 355
356Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), 356Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
@@ -457,6 +457,11 @@ ChangeLog
457 457
458Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. 458Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
459 459
4602.1.30:
461 - Fix writev() (it kept writing the first segment over and over again
462 instead of moving onto subsequent segments).
463 - Fix crash in ntfs_mft_record_alloc() when mapping the new extent mft
464 record failed.
4602.1.29: 4652.1.29:
461 - Fix a deadlock when mounting read-write. 466 - Fix a deadlock when mounting read-write.
4622.1.28: 4672.1.28:
diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
index 1f7ae144f6d8..7618a287aa41 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -46,9 +46,15 @@ errors=panic Panic and halt the machine if an error occurs.
46intr (*) Allow signals to interrupt cluster operations. 46intr (*) Allow signals to interrupt cluster operations.
47nointr Do not allow signals to interrupt cluster 47nointr Do not allow signals to interrupt cluster
48 operations. 48 operations.
49noatime Do not update access time.
50relatime(*) Update atime if the previous atime is older than
51 mtime or ctime
52strictatime Always update atime, but the minimum update interval
53 is specified by atime_quantum.
49atime_quantum=60(*) OCFS2 will not update atime unless this number 54atime_quantum=60(*) OCFS2 will not update atime unless this number
50 of seconds has passed since the last update. 55 of seconds has passed since the last update.
51 Set to zero to always update atime. 56 Set to zero to always update atime. This option need
57 work with strictatime.
52data=ordered (*) All data are forced directly out to the main file 58data=ordered (*) All data are forced directly out to the main file
53 system prior to its metadata being committed to the 59 system prior to its metadata being committed to the
54 journal. 60 journal.
@@ -80,10 +86,17 @@ user_xattr (*) Enables Extended User Attributes.
80nouser_xattr Disables Extended User Attributes. 86nouser_xattr Disables Extended User Attributes.
81acl Enables POSIX Access Control Lists support. 87acl Enables POSIX Access Control Lists support.
82noacl (*) Disables POSIX Access Control Lists support. 88noacl (*) Disables POSIX Access Control Lists support.
83resv_level=2 (*) Set how agressive allocation reservations will be. 89resv_level=2 (*) Set how aggressive allocation reservations will be.
84 Valid values are between 0 (reservations off) to 8 90 Valid values are between 0 (reservations off) to 8
85 (maximum space for reservations). 91 (maximum space for reservations).
86dir_resv_level= (*) By default, directory reservations will scale with file 92dir_resv_level= (*) By default, directory reservations will scale with file
87 reservations - users should rarely need to change this 93 reservations - users should rarely need to change this
88 value. If allocation reservations are turned off, this 94 value. If allocation reservations are turned off, this
89 option will have no effect. 95 option will have no effect.
96coherency=full (*) Disallow concurrent O_DIRECT writes, cluster inode
97 lock will be taken to force other nodes drop cache,
98 therefore full cluster coherency is guaranteed even
99 for O_DIRECT writes.
100coherency=buffered Allow concurrent O_DIRECT writes without EX lock among
101 nodes, which gains high performance at risk of getting
102 stale data on other nodes.
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
new file mode 100644
index 000000000000..3571667c7105
--- /dev/null
+++ b/Documentation/filesystems/path-lookup.txt
@@ -0,0 +1,382 @@
1Path walking and name lookup locking
2====================================
3
4Path resolution is the process of finding a dentry corresponding to a path name string, by
5performing a path walk. Typically, for every open(), stat() etc., the path name
6will be resolved. Paths are resolved by walking the namespace tree, starting
7with the first component of the pathname (eg. root or cwd) with a known dentry,
8then finding the child of that dentry, which is named the next component in the
9path string. Then repeating the lookup from the child dentry and finding its
10child with the next element, and so on.
11
12Since it is a frequent operation for workloads like multiuser environments and
13web servers, it is important to optimize this code.
14
15Path walking synchronisation history:
16Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and
17thus in every component during path look-up. From 2.5.10 onwards, the fast-walk
18algorithm changed this by holding the dcache_lock at the beginning and walking
19as many cached path component dentries as possible. This significantly
20decreases the number of acquisitions of dcache_lock. However it also increases
21the lock hold time significantly and affects performance in large SMP machines.
22Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to
23make dcache look-up lock-free.
24
25All the above algorithms required taking a lock and reference count on the
26dentry that was looked up, so that it may be used as the basis for walking the
27next path element. This is inefficient and unscalable. It is inefficient
28because the locks and atomic operations required for every dentry element
29slow things down. It is not scalable because many parallel applications that
30are path-walk intensive tend to do path lookups starting from a common dentry
31(usually, the root "/" or current working directory). So contention on these
32common path elements causes lock and cacheline queueing.
33
34Since 2.6.38, RCU is used to make a significant part of the entire path walk
35(including dcache look-up) completely "store-free" (so, no locks, atomics, or
36even stores into cachelines of common dentries). This is known as "rcu-walk"
37path walking.
38
39Path walking overview
40=====================
41
42A name string specifies a start (root directory, cwd, fd-relative) and a
43sequence of elements (directory entry names), which together refer to a path in
44the namespace. A path is represented as a (dentry, vfsmount) tuple. The name
45elements are sub-strings, separated by '/'.
46
47Name lookups will want to find a particular path that a name string refers to
48(usually the final element, or parent of final element). This is done by taking
49the path given by the name's starting point (which we know in advance -- eg.
50current->fs->cwd or current->fs->root) as the first parent of the lookup. Then
51iteratively for each subsequent name element, look up the child of the current
52parent with the given name and if it is not the desired entry, make it the
53parent for the next lookup.
54
55A parent, of course, must be a directory, and we must have appropriate
56permissions on the parent inode to be able to walk into it.
57
58Turning the child into a parent for the next lookup requires more checks and
59procedures. Symlinks essentially substitute the symlink name for the target
60name in the name string, and require some recursive path walking. Mount points
61must be followed into (thus changing the vfsmount that subsequent path elements
62refer to), switching from the mount point path to the root of the particular
63mounted vfsmount. These behaviours are variously modified depending on the
64exact path walking flags.
65
66Path walking then must, broadly, do several particular things:
67- find the start point of the walk;
68- perform permissions and validity checks on inodes;
69- perform dcache hash name lookups on (parent, name element) tuples;
70- traverse mount points;
71- traverse symlinks;
72- lookup and create missing parts of the path on demand.
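The iterative parent/child walk described in this overview can be sketched on a toy in-memory tree. Everything here is hypothetical (struct toy_dentry, toy_walk(), the fixed fan-out of four children); permissions, mount points, symlinks and ".." are deliberately left out, so this only shows the component-by-component loop.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 4

/* Toy directory entry: a name plus a NULL-terminated child array. */
struct toy_dentry {
    const char *name;
    struct toy_dentry *children[TOY_MAX_CHILDREN];
};

/* Find the child of `parent` whose name equals name[0..len). */
static struct toy_dentry *toy_lookup_child(struct toy_dentry *parent,
                                           const char *name, size_t len)
{
    for (size_t i = 0; i < TOY_MAX_CHILDREN && parent->children[i]; i++) {
        const char *c = parent->children[i]->name;
        if (strlen(c) == len && memcmp(c, name, len) == 0)
            return parent->children[i];
    }
    return NULL;
}

/* Resolve `path` starting from `root` (absolute) or `cwd` (relative).
 * Returns the final dentry, or NULL if any component is missing. */
struct toy_dentry *toy_walk(struct toy_dentry *root, struct toy_dentry *cwd,
                            const char *path)
{
    struct toy_dentry *cur = (*path == '/') ? root : cwd;

    while (*path) {
        while (*path == '/')            /* skip separators */
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;                      /* trailing slash or empty tail */
        if (!(len == 1 && path[0] == '.')) {
            /* The child found here becomes the parent of the next step. */
            cur = toy_lookup_child(cur, path, len);
            if (!cur)
                return NULL;
        }
        path += len;
    }
    return cur;
}
```

Building the tree /home/npiggin/test.c and calling toy_walk() with either "/home/npiggin/test.c" or, with cwd set to npiggin, "./test.c" reaches the same entry, matching the observation that both names refer to the same path.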
73
74Safe store-free look-up of dcache hash table
75============================================
76
77Dcache name lookup
78------------------
79In order to lookup a dcache (parent, name) tuple, we take a hash on the tuple
80and use that to select a bucket in the dcache-hash table. The list of entries
81in that bucket is then walked, and we do a full comparison of each entry
82against our (parent, name) tuple.
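A minimal user-space model of this hash-then-compare step is sketched below. The hash function, the bucket count and the dc_* names are assumptions made for illustration; the kernel uses its own hash and RCU-protected hlists, neither of which is reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DC_BUCKETS 16

/* Toy hashed entry keyed by the (parent, name) tuple. */
struct dc_entry {
    const void *parent;
    const char *name;
    struct dc_entry *next;      /* bucket chain */
};

static struct dc_entry *dc_table[DC_BUCKETS];

/* Illustrative hash over the tuple; not the kernel's algorithm. */
static unsigned dc_hash(const void *parent, const char *name)
{
    uint32_t h = (uint32_t)((uintptr_t)parent >> 4);

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % DC_BUCKETS;
}

/* Insert `e` at the head of its bucket. */
void dc_insert(struct dc_entry *e)
{
    unsigned b = dc_hash(e->parent, e->name);

    e->next = dc_table[b];
    dc_table[b] = e;
}

/* Walk the selected bucket and do a full comparison of each entry
 * against the (parent, name) tuple, since different tuples can share
 * a bucket. */
struct dc_entry *dc_lookup(const void *parent, const char *name)
{
    for (struct dc_entry *e = dc_table[dc_hash(parent, name)]; e; e = e->next)
        if (e->parent == parent && strcmp(e->name, name) == 0)
            return e;
    return NULL;
}
```

Including the parent in both the hash and the comparison is what lets two directories each hold a child called "home" without the lookups confusing them.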
83
84The hash lists are RCU protected, so list walking is not serialised with
85concurrent updates (insertion, deletion from the hash). This is a standard RCU
86list application with the exception of renames, which will be covered below.
87
88Parent and name members of a dentry, as well as its membership in the dcache
89hash, and its inode are protected by the per-dentry d_lock spinlock. A
90reference is taken on the dentry (while the fields are verified under d_lock),
91and this stabilises its d_inode pointer and actual inode. This gives a stable
92point to perform the next step of our path walk against.
93
94These members are also protected by d_seq seqlock, although this offers
95read-only protection and no durability of results, so care must be taken when
96using d_seq for synchronisation (see seqcount based lookups, below).
97
98Renames
99-------
100Back to the rename case. In usual RCU protected lists, the only operations that
101will happen to an object is insertion, and then eventually removal from the
102list. The object will not be reused until an RCU grace period is complete.
103This ensures the RCU list traversal primitives can run over the object without
104problems (see RCU documentation for how this works).
105
106However when a dentry is renamed, its hash value can change, requiring it to be
107moved to a new hash list. Allocating and inserting a new alias would be
108expensive and also problematic for directory dentries. Latency would be far too
109high to wait for a grace period after removing the dentry and before inserting
110it in the new hash bucket. So what is done is to insert the dentry into the
111new list immediately.
112
113However, when the dentry's list pointers are updated to point to objects in the
114new list before waiting for a grace period, this can result in a concurrent RCU
115lookup of the old list veering off into the new (incorrect) list and missing
116the remaining dentries on the list.
117
118There is no fundamental problem with walking down the wrong list, because the
119dentry comparisons will never match. However it is fatal to miss a matching
120dentry. So a seqlock is used to detect when a rename has occurred, and so the
121lookup can be retried.
122
123 1 2 3
124 +---+ +---+ +---+
125hlist-->| N-+->| N-+->| N-+->
126head <--+-P |<-+-P |<-+-P |
127 +---+ +---+ +---+
128
129Rename of dentry 2 may require it to be deleted from the above list, and inserted
130into a new list. Deleting 2 gives the following list.
131
132 1 3
133 +---+ +---+ (don't worry, the longer pointers do not
134hlist-->| N-+-------->| N-+-> impose a measurable performance overhead
135head <--+-P |<--------+-P | on modern CPUs)
136 +---+ +---+
137 ^ 2 ^
138 | +---+ |
139 | | N-+----+
140 +----+-P |
141 +---+
142
143This is a standard RCU-list deletion, which leaves the deleted object's
144pointers intact, so a concurrent list walker that is currently looking at
145object 2 will correctly continue to object 3 when it is time to traverse the
146next object.
147
148However, when inserting object 2 onto a new list, we end up with this:
149
150 1 3
151 +---+ +---+
152hlist-->| N-+-------->| N-+->
153head <--+-P |<--------+-P |
154 +---+ +---+
155 2
156 +---+
157 | N-+---->
158 <----+-P |
159 +---+
160
161Because we didn't wait for a grace period, there may be a concurrent lookup
162still at 2. Now when it follows 2's 'next' pointer, it will walk off into
163another list without ever having checked object 3.
164
165A related, but distinctly different, issue is that of rename atomicity versus
166lookup operations. If a file is renamed from 'A' to 'B', a lookup must only
167find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
168of 'B' must succeed (note the reverse is not true).
169
170Between deleting the dentry from the old hash list, and inserting it on the new
171hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
172rename seqlock is also used to cover this race in much the same way, by
173retrying a negative lookup result if a rename was in progress.
174
175Seqcount based lookups
176----------------------
177In refcount based dcache lookups, d_lock is used to serialise access to
178the dentry, stabilising it while comparing its name and parent and then
179taking a reference count (the reference count then gives a stable place to
180start the next part of the path walk from).
181
182As explained above, we would like to do path walking without taking locks or
183reference counts on intermediate dentries along the path. To do this, a per
184dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry
185looks like (its name, parent, and inode). That snapshot is then used to start
186the next part of the path walk. When loading the coherent snapshot under d_seq,
187care must be taken to load the members up-front, and use those pointers rather
188than reloading from the dentry later on (otherwise we'd have interesting things
189like d_inode going NULL underneath us, if the name was unlinked).
190
191Also important is to avoid performing any destructive operations (pretty much:
192no non-atomic stores to shared data), and to recheck the seqcount when we are
193"done" with the operation. Retry or abort if the seqcount does not match.
194Avoiding destructive or changing operations means we can easily unwind from
195failure.
196
197What this means is that a caller, provided they are holding RCU lock to
198protect the dentry object from disappearing, can perform a seqcount based
199lookup which does not increment the refcount on the dentry or write to
200it in any way. This returned dentry can be used for subsequent operations,
201provided that d_seq is rechecked after that operation is complete.
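The snapshot-and-recheck pattern can be sketched in user space with C11 atomics. This is not the kernel's seqcount API: struct snap_dentry and the snap_* functions are hypothetical, the fields are plain integers, and a single writer is assumed (in the kernel, writers are serialised by d_lock).

```c
#include <stdatomic.h>

/* Toy dentry: a sequence counter plus the fields it protects. */
struct snap_dentry {
    atomic_uint seq;        /* even: stable, odd: update in progress */
    atomic_int parent_id;
    atomic_int inode_nr;
};

/* Reader: note the sequence before loading any member. */
unsigned snap_begin(struct snap_dentry *d)
{
    return atomic_load_explicit(&d->seq, memory_order_acquire);
}

/* Reader: nonzero if the snapshot taken since `start` may be torn. */
int snap_retry(struct snap_dentry *d, unsigned start)
{
    atomic_thread_fence(memory_order_acquire);
    return (start & 1u) ||
           atomic_load_explicit(&d->seq, memory_order_relaxed) != start;
}

/* Writer (assumed to be the only one): bump to odd, update, bump to even. */
void snap_write(struct snap_dentry *d, int parent_id, int inode_nr)
{
    unsigned s = atomic_load_explicit(&d->seq, memory_order_relaxed);

    atomic_store_explicit(&d->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&d->parent_id, parent_id, memory_order_relaxed);
    atomic_store_explicit(&d->inode_nr, inode_nr, memory_order_relaxed);
    atomic_store_explicit(&d->seq, s + 2, memory_order_release);
}

/* Load both members up-front into locals, then verify the sequence.
 * Returns 1 if the copies form a coherent snapshot, 0 if the caller
 * should retry or fall back. The reader never stores to the dentry. */
int snap_read(struct snap_dentry *d, int *parent_id, int *inode_nr)
{
    unsigned start = snap_begin(d);

    *parent_id = atomic_load_explicit(&d->parent_id, memory_order_relaxed);
    *inode_nr = atomic_load_explicit(&d->inode_nr, memory_order_relaxed);
    return !snap_retry(d, start);
}
```

The reader works purely from the local copies after snap_read() succeeds, which is the user-space analogue of loading the members up-front rather than re-reading them from the dentry later.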
202
203Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be
204queried for permissions.
205
206With these two parts of the puzzle, we can do path lookups without taking
207locks or refcounts on dentry elements.
208
209RCU-walk path walking design
210============================
211
212Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk
213is the traditional[*] way of performing dcache lookups using d_lock to
214serialise concurrent modifications to the dentry and take a reference count on
215it. ref-walk is simple and obvious, and may sleep, take locks, etc while path
216walking is operating on each dentry. rcu-walk uses seqcount based dentry
217lookups, and can perform lookup of intermediate elements without any stores to
218shared data in the dentry or inode. rcu-walk can not be applied to all cases,
219eg. if the filesystem must sleep or perform non trivial operations, rcu-walk
220must be switched to ref-walk mode.
221
222[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full
223 path walk.
224
225Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining
226path string, rcu-walk uses a d_seq protected snapshot. When looking up a
227child of this parent snapshot, we open d_seq critical section on the child
228before closing d_seq critical section on the parent. This gives an interlocking
229ladder of snapshots to walk down.
230
231
232 proc 101
233 /----------------\
234 / comm: "vi" \
235 / fs.root: dentry0 \
236 \ fs.cwd: dentry2 /
237 \ /
238 \----------------/
239
240So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will
241start from current->fs->root, which is a pinned dentry. Alternatively,
242"./test.c" would start from cwd; both names refer to the same path in
243the context of proc101.
244
245 dentry 0
246 +---------------------+ rcu-walk begins here, we note d_seq, check the
247 | name: "/" | inode's permission, and then look up the next
248 | inode: 10 | path element which is "home"...
249 | children:"home", ...|
250 +---------------------+
251 |
252 dentry 1 V
253 +---------------------+ ... which brings us here. We find dentry1 via
254 | name: "home" | hash lookup, then note d_seq and compare name
255 | inode: 678 | string and parent pointer. When we have a match,
256 | children:"npiggin" | we now recheck the d_seq of dentry0. Then we
257 +---------------------+ check inode and look up the next element.
258 |
259 dentry2 V
260 +---------------------+ Note: if dentry0 is now modified, lookup is
261 | name: "npiggin" | not necessarily invalid, so we need only keep a
262 | inode: 543 | parent for d_seq verification, and grandparents
263 | children:"a.c", ... | can be forgotten.
264 +---------------------+
265 |
266 dentry3 V
267 +---------------------+ At this point we have our destination dentry.
268 | name: "a.c" | We now take its d_lock, verify d_seq of this
269 | inode: 14221 | dentry. If that checks out, we can increment
270 | children:NULL | its refcount because we're holding d_lock.
271 +---------------------+
272
273Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock,
274re-checking its d_seq, and then incrementing its refcount is called
275"dropping rcu" or dropping from rcu-walk into ref-walk mode.
276
277It is, in some sense, a bit of a house of cards. If the seqcount check of the
278parent snapshot fails, the house comes down, because we had closed the d_seq
279section on the grandparent, so we have nothing left to stand on. In that case,
280the path walk must be fully restarted (which we do in ref-walk mode, to avoid
281live locks). It is costly to have a full restart, but fortunately they are
282quite rare.
283
284When we reach a point where sleeping is required, or a filesystem callout
285requires ref-walk, then instead of restarting the walk, we attempt to drop rcu
286at the last known good dentry we have. Avoiding a full restart in ref-walk in
287these cases is fundamental for performance and scalability because blocking
288operations such as creates and unlinks are not uncommon.
289
290The detailed design for rcu-walk is like this:
291* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
292* Take the RCU lock for the entire path walk, starting with the acquiring
293 of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
294 not required for dentry persistence.
295* synchronize_rcu is called when unregistering a filesystem, so we can
296 access d_ops and i_ops during rcu-walk.
297* Similarly take the vfsmount lock for the entire path walk. So now mnt
298 refcounts are not required for persistence. Also we are free to perform mount
299 lookups, and to assume dentry mount points and mount roots are stable up and
300 down the path.
301* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
302 so we can load this tuple atomically, and also check whether any of its
303 members have changed.
304* Dentry lookups (based on parent, candidate string tuple) recheck the parent
305 sequence after the child is found in case anything changed in the parent
306 during the path walk.
307* inode is also RCU protected so we can load d_inode and use the inode for
308 limited things.
309* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
310* i_op can be loaded.
311* When the destination dentry is reached, drop rcu there (ie. take d_lock,
312 verify d_seq, increment refcount).
313* If seqlock verification fails anywhere along the path, do a full restart
314 of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of
315 a better errno) to signal an rcu-walk failure.
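
The seqcount steps in the list above (snapshot the tuple, use it, then re-check
the sequence before trusting or pinning it) can be sketched in userspace C. This
is only an illustration, not the kernel's implementation: struct toy_dentry,
toy_init(), toy_snapshot(), toy_still_valid() and toy_rename() are hypothetical
names, C11 atomics stand in for the kernel's seqcount primitives, and the
-ECHILD/restart handling is reduced to a boolean result.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dentry fields guarded by d_seq. */
struct toy_dentry {
	atomic_uint seq;	/* even: stable, odd: writer in progress */
	atomic_ulong parent;	/* illustrative parent id */
	atomic_ulong inode;	/* illustrative inode number */
};

/* What an rcu-walk style reader keeps after a successful snapshot. */
struct toy_snap {
	unsigned seq;
	unsigned long parent;
	unsigned long inode;
};

static void toy_init(struct toy_dentry *d, unsigned long parent,
		     unsigned long inode)
{
	atomic_init(&d->seq, 0u);
	atomic_init(&d->parent, parent);
	atomic_init(&d->inode, inode);
}

/* Writer side: make seq odd, update the fields, make seq even again. */
static void toy_rename(struct toy_dentry *d, unsigned long parent,
		       unsigned long inode)
{
	atomic_fetch_add(&d->seq, 1u);
	atomic_store(&d->parent, parent);
	atomic_store(&d->inode, inode);
	atomic_fetch_add(&d->seq, 1u);
}

/* Reader side: load the tuple, fail if a writer was or became active. */
static bool toy_snapshot(struct toy_dentry *d, struct toy_snap *out)
{
	unsigned start = atomic_load(&d->seq);

	if (start & 1u)
		return false;
	out->parent = atomic_load(&d->parent);
	out->inode = atomic_load(&d->inode);
	out->seq = start;
	return atomic_load(&d->seq) == start;
}

/* Re-check done later, e.g. just before taking a reference. */
static bool toy_still_valid(struct toy_dentry *d, const struct toy_snap *s)
{
	return atomic_load(&d->seq) == s->seq;
}
```

A reader that gets false from either helper has no valid snapshot left to stand
on and, as described above, would fall back to a restart in ref-walk mode.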
316
317The cases where rcu-walk cannot continue are:
318* NULL dentry (ie. any uncached path element)
319* Following links
320
321It may be possible eventually to make following links rcu-walk aware.
322
323Uncached path elements will always require dropping to ref-walk mode, at the
324very least because i_mutex needs to be grabbed, and objects allocated.
325
326Final note:
327"store-free" path walking is not strictly store free. We take vfsmount lock
328and refcounts (both of which can be made per-cpu), and we also store to the
329stack (which is essentially CPU-local), and we also have to take locks and
330a refcount on the final dentry.
331
332The point is that shared data, where practically possible, is not locked
333or stored into. The result is massive improvements in performance and
334scalability of path resolution.
335
336
337Interesting statistics
338======================
339
340The following table gives rcu lookup statistics for a few simple workloads
341(2s12c24t Westmere, debian non-graphical system). Ungraceful drops ("restart")
342are attempts to drop rcu that fail due to d_seq failure and require the entire
343path lookup to be redone. The other columns count successful rcu-drops that were
344required before the final element: nodentry for a missing dentry, revalidate for
345a filesystem revalidate routine requiring rcu drop, permission for a permission
346check requiring drop, and link for symlink traversal requiring drop.
347
348          rcu-lookups  restart        nodentry           link  revalidate  permission
349bootup          47121        0            4624           1010       10283        7852
350dbench       25386793        0  6778659(26.7%)             55         549        1156
351kbuild        2696672       10     64442(2.3%)   108764(4.0%)           1        1590
352git diff        39605        0              28              2           0         106
353vfstest      24185492     4945    708725(2.9%)  1076136(4.4%)           0        2651
354
355What this shows is that failed rcu-walk lookups, ie. ones that are restarted
356entirely with ref-walk, are quite rare. Even the "vfstest" case which
357specifically has concurrent renames/mkdir/rmdir/creat/unlink/etc to exercise
358such races is not showing a huge amount of restarts.
359
360Dropping from rcu-walk to ref-walk means that we have encountered a dentry where
361the reference count needs to be taken for some reason. This is either because
362we have reached the target of the path walk, or because we have encountered a
363condition that can't be resolved in rcu-walk mode. Ideally, we drop rcu-walk
364only when we have reached the target dentry, so the other statistics show where
365this does not happen.
366
367Note that a graceful drop from rcu-walk mode due to something such as the
368dentry not existing (which can be common) is not necessarily a failure of
369rcu-walk scheme, because some elements of the path may have been walked in
370rcu-walk mode. The further we get from common path elements (such as cwd or
371root), the less contended the dentry is likely to be. The closer we are to
372common path elements, the more likely they will exist in dentry cache.
373
374
375Papers and other documentation on dcache locking
376================================================
377
3781. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
379
3802. http://lse.sourceforge.net/locking/dcache/dcache.html
381
382
diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt
index 40ea6c295afb..65e03dd44823 100644
--- a/Documentation/filesystems/pohmelfs/network_protocol.txt
+++ b/Documentation/filesystems/pohmelfs/network_protocol.txt
@@ -20,7 +20,7 @@ Commands can be embedded into transaction command (which in turn has own command
20so one can extend protocol as needed without breaking backward compatibility as long 20so one can extend protocol as needed without breaking backward compatibility as long
21as old commands are supported. All string lengths include tail 0 byte. 21as old commands are supported. All string lengths include tail 0 byte.
22 22
23All commans are transfered over the network in big-endian. CPU endianess is used at the end peers. 23All commands are transferred over the network in big-endian. CPU endianess is used at the end peers.
24 24
25@cmd - command number, which specifies command to be processed. Following 25@cmd - command number, which specifies command to be processed. Following
26 commands are used currently: 26 commands are used currently:
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c89538680..6e29954851a2 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -216,7 +216,6 @@ had ->revalidate()) add calls in ->follow_link()/->readlink().
216->d_parent changes are not protected by BKL anymore. Read access is safe 216->d_parent changes are not protected by BKL anymore. Read access is safe
217if at least one of the following is true: 217if at least one of the following is true:
218 * filesystem has no cross-directory rename() 218 * filesystem has no cross-directory rename()
219 * dcache_lock is held
220 * we know that parent had been locked (e.g. we are looking at 219 * we know that parent had been locked (e.g. we are looking at
221->d_parent of ->lookup() argument). 220->d_parent of ->lookup() argument).
222 * we are called from ->rename(). 221 * we are called from ->rename().
@@ -299,11 +298,14 @@ be used instead. It gets called whenever the inode is evicted, whether it has
299remaining links or not. Caller does *not* evict the pagecache or inode-associated 298remaining links or not. Caller does *not* evict the pagecache or inode-associated
300metadata buffers; getting rid of those is responsibility of method, as it had 299metadata buffers; getting rid of those is responsibility of method, as it had
301been for ->delete_inode(). 300been for ->delete_inode().
302 ->drop_inode() returns int now; it's called on final iput() with inode_lock 301
303held and it returns true if filesystems wants the inode to be dropped. As before, 302 ->drop_inode() returns int now; it's called on final iput() with
304generic_drop_inode() is still the default and it's been updated appropriately. 303inode->i_lock held and it returns true if filesystems wants the inode to be
305generic_delete_inode() is also alive and it consists simply of return 1. Note that 304dropped. As before, generic_drop_inode() is still the default and it's been
306all actual eviction work is done by caller after ->drop_inode() returns. 305updated appropriately. generic_delete_inode() is also alive and it consists
306simply of return 1. Note that all actual eviction work is done by caller after
307->drop_inode() returns.
308
307 clear_inode() is gone; use end_writeback() instead. As before, it must 309 clear_inode() is gone; use end_writeback() instead. As before, it must
308be called exactly once on each call of ->evict_inode() (as it used to be for 310be called exactly once on each call of ->evict_inode() (as it used to be for
309each call of ->delete_inode()). Unlike before, if you are using inode-associated 311each call of ->delete_inode()). Unlike before, if you are using inode-associated
@@ -318,3 +320,90 @@ if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput(
318may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly 320may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
319free the on-disk inode, you may end up doing that while ->write_inode() is writing 321free the on-disk inode, you may end up doing that while ->write_inode() is writing
320to it. 322to it.
323
324---
325[mandatory]
326
327 .d_delete() now only advises the dcache as to whether or not to cache
328unreferenced dentries, and is now only called when the dentry refcount goes to
3290. Even on 0 refcount transition, it must be able to tolerate being called 0,
3301, or more times (eg. constant, idempotent).
331
332---
333[mandatory]
334
335 .d_compare() calling convention and locking rules are significantly
336changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
337look at examples of other filesystems) for guidance.
338
339---
340[mandatory]
341
342 .d_hash() calling convention and locking rules are significantly
343changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
344look at examples of other filesystems) for guidance.
345
346---
347[mandatory]
348 dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c
349for details of what locks to replace dcache_lock with in order to protect
350particular things. Most of the time, a filesystem only needs ->d_lock, which
351protects *all* the dcache state of a given dentry.
352
353--
354[mandatory]
355
356 Filesystems must RCU-free their inodes, if they can have been accessed
357via rcu-walk path walk (basically, if the file can have had a path name in the
358vfs namespace).
359
360 i_dentry and i_rcu share storage in a union, and the vfs expects
361i_dentry to be reinitialized before it is freed, so an:
362
363 INIT_LIST_HEAD(&inode->i_dentry);
364
365must be done in the RCU callback.
366
367--
368[recommended]
369 vfs now tries to do path walking in "rcu-walk mode", which avoids
370atomic operations and scalability hazards on dentries and inodes (see
371Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes
372(above) are examples of the changes required to support this. For more complex
373filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so
374no changes are required to the filesystem. However, this is costly and loses
375the benefits of rcu-walk mode. We will begin to add filesystem callbacks that
376are rcu-walk aware, shown below. Filesystems should take advantage of this
377where possible.
378
379--
380[mandatory]
381 d_revalidate is a callback that is made on every path element (if
382the filesystem provides it), which used to require dropping out of rcu-walk mode. It
383may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be
384returned if the filesystem cannot handle rcu-walk. See
385Documentation/filesystems/vfs.txt for more details.
386
387 permission and check_acl are inode permission checks that are called
388on many or all directory inodes on the way down a path walk (to check for
389exec permission). These must now be rcu-walk aware (flags & IPERM_FLAG_RCU).
390See Documentation/filesystems/vfs.txt for more details.
391
392--
393[mandatory]
394 In ->fallocate() you must check the mode option passed in. If your
395filesystem does not support hole punching (deallocating space in the middle of a
396file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode.
397Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set,
398so the i_size should not change when hole punching, even when punching the end of
399a file off.
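
The mode check described above could look roughly like the sketch below. It is
written as plain userspace C so it can be compiled on its own:
toy_fallocate_check_mode() and its fs_supports_punch argument are hypothetical
and are not part of any filesystem API. The flag values are assumed to match
the Linux uapi header linux/falloc.h.

```c
#include <errno.h>

/* Assumed to match <linux/falloc.h>; defined here to stay self-contained. */
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE	0x01
#endif
#ifndef FALLOC_FL_PUNCH_HOLE
#define FALLOC_FL_PUNCH_HOLE	0x02
#endif

/*
 * Hypothetical helper a ->fallocate() implementation might call first.
 * Returns 0 if the mode is acceptable, -EOPNOTSUPP otherwise.
 */
static int toy_fallocate_check_mode(int mode, int fs_supports_punch)
{
	if (mode & FALLOC_FL_PUNCH_HOLE) {
		/* Filesystems without hole punching must refuse the request. */
		if (!fs_supports_punch)
			return -EOPNOTSUPP;
		/* Punching is only valid together with KEEP_SIZE. */
		if (!(mode & FALLOC_FL_KEEP_SIZE))
			return -EOPNOTSUPP;
	}
	return 0;
}
```

A filesystem that never punches holes would pass 0 for fs_supports_punch and so
reject every request carrying FALLOC_FL_PUNCH_HOLE.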
400
401--
402[mandatory]
403
404--
405[mandatory]
406 ->get_sb() is gone. Switch to use of ->mount(). Typically it's just
407a matter of switching from calling get_sb_... to mount_... and changing the
408function type. If you were doing it manually, just switch from setting ->mnt_root
409to some pointer to returning that pointer. On errors return ERR_PTR(...).
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a6aca8740883..db3b1aba32a3 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -136,6 +136,7 @@ Table 1-1: Process specific entries in /proc
136 statm Process memory status information 136 statm Process memory status information
137 status Process status in human readable form 137 status Process status in human readable form
138 wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan 138 wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
139 pagemap Page table
139 stack Report full stack trace, enable via CONFIG_STACKTRACE 140 stack Report full stack trace, enable via CONFIG_STACKTRACE
140 smaps a extension based on maps, showing the memory consumption of 141 smaps a extension based on maps, showing the memory consumption of
141 each mapping 142 each mapping
@@ -370,17 +371,25 @@ Shared_Dirty: 0 kB
370Private_Clean: 0 kB 371Private_Clean: 0 kB
371Private_Dirty: 0 kB 372Private_Dirty: 0 kB
372Referenced: 892 kB 373Referenced: 892 kB
374Anonymous: 0 kB
373Swap: 0 kB 375Swap: 0 kB
374KernelPageSize: 4 kB 376KernelPageSize: 4 kB
375MMUPageSize: 4 kB 377MMUPageSize: 4 kB
376 378Locked: 374 kB
377The first of these lines shows the same information as is displayed for the 379
378mapping in /proc/PID/maps. The remaining lines show the size of the mapping, 380The first of these lines shows the same information as is displayed for the
379the amount of the mapping that is currently resident in RAM, the "proportional 381mapping in /proc/PID/maps. The remaining lines show the size of the mapping
380set size” (divide each shared page by the number of processes sharing it), the 382(size), the amount of the mapping that is currently resident in RAM (RSS), the
381number of clean and dirty shared pages in the mapping, and the number of clean 383process' proportional share of this mapping (PSS), the number of clean and
382and dirty private pages in the mapping. The "Referenced" indicates the amount 384dirty private pages in the mapping. Note that even a page which is part of a
383of memory currently marked as referenced or accessed. 385MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used
386by only one process, is accounted as private and not as shared. "Referenced"
387indicates the amount of memory currently marked as referenced or accessed.
388"Anonymous" shows the amount of memory that does not belong to any file. Even
389a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
390and a page is modified, the file page is replaced by a private anonymous copy.
391"Swap" shows how much would-be-anonymous memory is also used, but out on
392swap.
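
As a rough worked example of the PSS arithmetic (each resident page is divided
by the number of processes mapping it), the sketch below sums the per-page
shares. toy_pss_bytes() and its arguments are hypothetical, and the kernel's
own accounting may round differently.

```c
#include <stddef.h>

/*
 * mapcount[i] is the number of processes mapping page i (0 = not resident).
 * Returns this process' proportional share of the mapping in bytes.
 */
static unsigned long toy_pss_bytes(const unsigned *mapcount, size_t npages,
				   unsigned long page_size)
{
	unsigned long pss = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		if (mapcount[i] == 0)
			continue;	/* not resident, contributes nothing */
		pss += page_size / mapcount[i];
	}
	return pss;
}
```

For four resident 4 kB pages mapped by 1, 1, 2 and 4 processes respectively,
this gives 4 + 4 + 2 + 1 = 11 kB, while the RSS of the same mapping is 16 kB.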
384 393
385This file is only present if the CONFIG_MMU kernel configuration option is 394This file is only present if the CONFIG_MMU kernel configuration option is
386enabled. 395enabled.
@@ -397,6 +406,9 @@ To clear the bits for the file mapped pages associated with the process
397 > echo 3 > /proc/PID/clear_refs 406 > echo 3 > /proc/PID/clear_refs
398Any other value written to /proc/PID/clear_refs will have no effect. 407Any other value written to /proc/PID/clear_refs will have no effect.
399 408
409The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
410using /proc/kpageflags and number of times a page is mapped using
411/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
400 412
4011.2 Kernel data 4131.2 Kernel data
402--------------- 414---------------
@@ -531,7 +543,7 @@ just those considered 'most important'. The new vectors are:
531 their statistics are used by kernel developers and interested users to 543 their statistics are used by kernel developers and interested users to
532 determine the occurrence of interrupts of the given type. 544 determine the occurrence of interrupts of the given type.
533 545
534The above IRQ vectors are displayed only when relevent. For example, 546The above IRQ vectors are displayed only when relevant. For example,
535the threshold vector does not exist on x86_64 platforms. Others are 547the threshold vector does not exist on x86_64 platforms. Others are
536suppressed when the system is a uniprocessor. As of this writing, only 548suppressed when the system is a uniprocessor. As of this writing, only
537i386 and x86_64 platforms support the new IRQ vector displays. 549i386 and x86_64 platforms support the new IRQ vector displays.
@@ -562,6 +574,12 @@ The contents of each smp_affinity file is the same by default:
562 > cat /proc/irq/0/smp_affinity 574 > cat /proc/irq/0/smp_affinity
563 ffffffff 575 ffffffff
564 576
577There is an alternate interface, smp_affinity_list which allows specifying
578a cpu range instead of a bitmask:
579
580 > cat /proc/irq/0/smp_affinity_list
581 1024-1031
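
To illustrate how a list such as "1024-1031" relates to a bitmask, the sketch
below parses the list form into a byte-array bitmap in which bit N stands for
CPU N. toy_parse_cpu_list() is a hypothetical userspace helper and makes no
claim about how the kernel parses this file.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "a-b,c,..." into bits[] (nbytes long). Returns 0 on success and
 * -1 on malformed input or a CPU number that does not fit in the bitmap.
 */
static int toy_parse_cpu_list(const char *s, unsigned char *bits, size_t nbytes)
{
	memset(bits, 0, nbytes);
	for (;;) {
		char *end;
		unsigned long lo, hi, cpu;

		if (!isdigit((unsigned char)*s))
			return -1;
		lo = hi = strtoul(s, &end, 10);
		if (*end == '-') {
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (hi < lo || hi >= nbytes * 8)
			return -1;
		for (cpu = lo; cpu <= hi; cpu++)
			bits[cpu / 8] |= (unsigned char)(1u << (cpu % 8));
		if (*end == ',') {
			s = end + 1;	/* another entry follows */
			continue;
		}
		return (*end == '\0' || *end == '\n') ? 0 : -1;
	}
}
```

With a 512-byte bitmap, "1024-1031" sets exactly the eight bits of byte 128,
which is the same set of CPUs a bitmask would have to spell out bit by bit.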
582
565The default_smp_affinity mask applies to all non-active IRQs, which are the 583The default_smp_affinity mask applies to all non-active IRQs, which are the
566IRQs which have not yet been allocated/activated, and hence which lack a 584IRQs which have not yet been allocated/activated, and hence which lack a
567/proc/irq/[0-9]* directory. 585/proc/irq/[0-9]* directory.
@@ -571,12 +589,13 @@ reports itself as being attached. This hardware locality information does not
571include information about any possible driver locality preference. 589include information about any possible driver locality preference.
572 590
573prof_cpu_mask specifies which CPUs are to be profiled by the system wide 591prof_cpu_mask specifies which CPUs are to be profiled by the system wide
574profiler. Default value is ffffffff (all cpus). 592profiler. Default value is ffffffff (all cpus if there are only 32 of them).
575 593
576The way IRQs are routed is handled by the IO-APIC, and it's Round Robin 594The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
577between all the CPUs which are allowed to handle it. As usual the kernel has 595between all the CPUs which are allowed to handle it. As usual the kernel has
578more info than you and does a better job than you, so the defaults are the 596more info than you and does a better job than you, so the defaults are the
579best choice for almost everyone. 597best choice for almost everyone. [Note this applies only to those IO-APIC's
598that support "Round Robin" interrupt distribution.]
580 599
581There are three more important subdirectories in /proc: net, scsi, and sys. 600There are three more important subdirectories in /proc: net, scsi, and sys.
582The general rule is that the contents, or even the existence of these 601The general rule is that the contents, or even the existence of these
@@ -659,6 +678,8 @@ varies by architecture and compile options. The following is from a
659 678
660> cat /proc/meminfo 679> cat /proc/meminfo
661 680
681The "Locked" indicates whether the mapping is locked in memory or not.
682
662 683
663MemTotal: 16344972 kB 684MemTotal: 16344972 kB
664MemFree: 13634064 kB 685MemFree: 13634064 kB
@@ -1170,6 +1191,30 @@ Table 1-12: Files in /proc/fs/ext4/<devname>
1170 mb_groups details of multiblock allocator buddy cache of free blocks 1191 mb_groups details of multiblock allocator buddy cache of free blocks
1171.............................................................................. 1192..............................................................................
1172 1193
11942.0 /proc/consoles
1195------------------
1196Shows registered system console lines.
1197
1198To see which character device lines are currently used for the system console
1199/dev/console, you may simply look into the file /proc/consoles:
1200
1201 > cat /proc/consoles
1202 tty0 -WU (ECp) 4:7
1203 ttyS0 -W- (Ep) 4:64
1204
1205The columns are:
1206
1207 device          name of the device
1208 operations      R = can do read operations
1209                 W = can do write operations
1210                 U = can do unblank
1211 flags           E = it is enabled
1212                 C = it is preferred console
1213                 B = it is primary boot console
1214                 p = it is used for printk buffer
1215                 b = it is not a TTY but a Braille device
1216                 a = it is safe to use when cpu is offline
1217 major:minor     major and minor number of the device separated by a colon
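
A userspace tool could split one line of this file into the columns listed
above along the following lines. This is a minimal sketch that assumes each
line has the layout shown in the example output; struct toy_console,
toy_parse_console() and toy_console_has_flag() are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

struct toy_console {
	char name[32];		/* device, e.g. "tty0" */
	char ops[8];		/* operations, e.g. "-WU" */
	char flags[16];		/* text between the parentheses, e.g. "ECp" */
	unsigned major, minor;
};

/* Returns 0 if all five fields were found, -1 otherwise. */
static int toy_parse_console(const char *line, struct toy_console *c)
{
	int n = sscanf(line, " %31s %7s (%15[^)]) %u:%u",
		       c->name, c->ops, c->flags, &c->major, &c->minor);

	return n == 5 ? 0 : -1;
}

/* Non-zero if the given flag letter (E, C, B, p, b, a) is present. */
static int toy_console_has_flag(const struct toy_console *c, char flag)
{
	return flag != '\0' && strchr(c->flags, flag) != NULL;
}
```

For the first example line, toy_console_has_flag(&c, 'C') reports that tty0 is
the preferred console, while the ttyS0 line carries only E and p.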
1173 1218
1174------------------------------------------------------------------------------ 1219------------------------------------------------------------------------------
1175Summary 1220Summary
@@ -1285,11 +1330,15 @@ scaled linearly with /proc/<pid>/oom_score_adj.
1285Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the 1330Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
1286other with its scaled value. 1331other with its scaled value.
1287 1332
1333The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
1334value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
1335requires CAP_SYS_RESOURCE.
1336
1288NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see 1337NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
1289Documentation/feature-removal-schedule.txt. 1338Documentation/feature-removal-schedule.txt.
1290 1339
1291Caveat: when a parent task is selected, the oom killer will sacrifice any first 1340Caveat: when a parent task is selected, the oom killer will sacrifice any first
1292generation children with seperate address spaces instead, if possible. This 1341generation children with separate address spaces instead, if possible. This
1293avoids servers and important system daemons from being killed and loses the 1342avoids servers and important system daemons from being killed and loses the
1294minimal amount of work. 1343minimal amount of work.
1295 1344
diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt
index 2d2a7b2a16b9..e2b07cc9120a 100644
--- a/Documentation/filesystems/romfs.txt
+++ b/Documentation/filesystems/romfs.txt
@@ -17,8 +17,7 @@ comparison, an actual rescue disk used up 3202 blocks with ext2, while
17with romfs, it needed 3079 blocks. 17with romfs, it needed 3079 blocks.
18 18
19To create such a file system, you'll need a user program named 19To create such a file system, you'll need a user program named
20genromfs. It is available via anonymous ftp on sunsite.unc.edu and 20genromfs. It is available on http://romfs.sourceforge.net/
21its mirrors, in the /pub/Linux/system/recovery/ directory.
22 21
23As the name suggests, romfs could be also used (space-efficiently) on 22As the name suggests, romfs could be also used (space-efficiently) on
24various read-only media, like (E)EPROM disks if someone will have the 23various read-only media, like (E)EPROM disks if someone will have the
diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt
index fc0e39af43c3..4ede421c9687 100644
--- a/Documentation/filesystems/sharedsubtree.txt
+++ b/Documentation/filesystems/sharedsubtree.txt
@@ -62,10 +62,10 @@ replicas continue to be exactly same.
62 # mount /dev/sd0 /tmp/a 62 # mount /dev/sd0 /tmp/a
63 63
64 #ls /tmp/a 64 #ls /tmp/a
65 t1 t2 t2 65 t1 t2 t3
66 66
67 #ls /mnt/a 67 #ls /mnt/a
68 t1 t2 t2 68 t1 t2 t3
69 69
70 Note that the mount has propagated to the mount at /mnt as well. 70 Note that the mount has propagated to the mount at /mnt as well.
71 71
diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt
deleted file mode 100644
index 194fb0decd2c..000000000000
--- a/Documentation/filesystems/smbfs.txt
+++ /dev/null
@@ -1,8 +0,0 @@
1Smbfs is a filesystem that implements the SMB protocol, which is the
2protocol used by Windows for Workgroups, Windows 95 and Windows NT.
3Smbfs was inspired by Samba, the program written by Andrew Tridgell
4that turns any Unix host into a file server for DOS or Windows clients.
5
6Smbfs is a SMB client, but uses parts of samba for its operation. For
7more info on samba, including documentation, please go to
8http://www.samba.org/ and then on to your nearest mirror.
diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt
index 66699afd66ca..d4d41465a0b1 100644
--- a/Documentation/filesystems/squashfs.txt
+++ b/Documentation/filesystems/squashfs.txt
@@ -59,12 +59,15 @@ obtained from this site also.
593. SQUASHFS FILESYSTEM DESIGN 593. SQUASHFS FILESYSTEM DESIGN
60----------------------------- 60-----------------------------
61 61
62A squashfs filesystem consists of a maximum of eight parts, packed together on a byte 62A squashfs filesystem consists of a maximum of nine parts, packed together on a
63alignment: 63byte alignment:
64 64
65 --------------- 65 ---------------
66 | superblock | 66 | superblock |
67 |---------------| 67 |---------------|
68 | compression |
69 | options |
70 |---------------|
68 | datablocks | 71 | datablocks |
69 | & fragments | 72 | & fragments |
70 |---------------| 73 |---------------|
@@ -91,7 +94,14 @@ the source directory, and checked for duplicates. Once all file data has been
91written the completed inode, directory, fragment, export and uid/gid lookup 94written the completed inode, directory, fragment, export and uid/gid lookup
92tables are written. 95tables are written.
93 96
943.1 Inodes 973.1 Compression options
98-----------------------
99
100Compressors can optionally support compression specific options (e.g.
101dictionary size). If non-default compression options have been used, then
102these are stored here.
103
1043.2 Inodes
95---------- 105----------
96 106
97Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each 107Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each
@@ -114,7 +124,7 @@ directory inode are defined: inodes optimised for frequently occurring
114regular files and directories, and extended types where extra 124regular files and directories, and extended types where extra
115information has to be stored. 125information has to be stored.
116 126
1173.2 Directories 1273.3 Directories
118--------------- 128---------------
119 129
120Like inodes, directories are packed into compressed metadata blocks, stored 130Like inodes, directories are packed into compressed metadata blocks, stored
@@ -144,7 +154,7 @@ decompressed to do a lookup irrespective of the length of the directory.
144This scheme has the advantage that it doesn't require extra memory overhead 154This scheme has the advantage that it doesn't require extra memory overhead
145and doesn't require much extra storage on disk. 155and doesn't require much extra storage on disk.
146 156
1473.3 File data 1573.4 File data
148------------- 158-------------
149 159
150Regular files consist of a sequence of contiguous compressed blocks, and/or a 160Regular files consist of a sequence of contiguous compressed blocks, and/or a
@@ -163,7 +173,7 @@ Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
163The index cache is designed to be memory efficient, and by default uses 173The index cache is designed to be memory efficient, and by default uses
16416 KiB. 17416 KiB.
165 175
1663.4 Fragment lookup table 1763.5 Fragment lookup table
167------------------------- 177-------------------------
168 178
169Regular files can contain a fragment index which is mapped to a fragment 179Regular files can contain a fragment index which is mapped to a fragment
@@ -173,7 +183,7 @@ A second index table is used to locate these. This second index table for
173speed of access (and because it is small) is read at mount time and cached 183speed of access (and because it is small) is read at mount time and cached
174in memory. 184in memory.
175 185
1763.5 Uid/gid lookup table 1863.6 Uid/gid lookup table
177------------------------ 187------------------------
178 188
179For space efficiency regular files store uid and gid indexes, which are 189For space efficiency regular files store uid and gid indexes, which are
@@ -182,7 +192,7 @@ stored compressed into metadata blocks. A second index table is used to
182locate these. This second index table for speed of access (and because it 192locate these. This second index table for speed of access (and because it
183is small) is read at mount time and cached in memory. 193is small) is read at mount time and cached in memory.
184 194
1853.6 Export table 1953.7 Export table
186---------------- 196----------------
187 197
188To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems 198To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
@@ -196,7 +206,7 @@ This table is stored compressed into metadata blocks. A second index table is
196used to locate these. This second index table for speed of access (and because 206used to locate these. This second index table for speed of access (and because
197it is small) is read at mount time and cached in memory. 207it is small) is read at mount time and cached in memory.
198 208
1993.7 Xattr table 2093.8 Xattr table
200--------------- 210---------------
201 211
202The xattr table contains extended attributes for each inode. The xattrs 212The xattr table contains extended attributes for each inode. The xattrs
@@ -209,7 +219,7 @@ or if it is stored out of line (in which case the value field stores a
209reference to where the actual value is stored). This allows large values 219reference to where the actual value is stored). This allows large values
210to be stored out of line improving scanning and lookup performance and it 220to be stored out of line improving scanning and lookup performance and it
211also allows values to be de-duplicated, the value being stored once, and 221also allows values to be de-duplicated, the value being stored once, and
212all other occurences holding an out of line reference to that value. 222all other occurrences holding an out of line reference to that value.
213 223
214The xattr lists are packed into compressed 8K metadata blocks. 224The xattr lists are packed into compressed 8K metadata blocks.
215To reduce overhead in inodes, rather than storing the on-disk 225To reduce overhead in inodes, rather than storing the on-disk
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index 5d1335faec2d..597f728e7b4e 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -39,10 +39,12 @@ userspace. Top-level directories in sysfs represent the common
39ancestors of object hierarchies; i.e. the subsystems the objects 39ancestors of object hierarchies; i.e. the subsystems the objects
40belong to. 40belong to.
41 41
42Sysfs internally stores the kobject that owns the directory in the 42Sysfs internally stores a pointer to the kobject that implements a
43->d_fsdata pointer of the directory's dentry. This allows sysfs to do 43directory in the sysfs_dirent object associated with the directory. In
44reference counting directly on the kobject when the file is opened and 44the past this kobject pointer has been used by sysfs to do reference
45closed. 45counting directly on the kobject whenever the file is opened or closed.
46With the current sysfs implementation the kobject reference count is
47only modified directly by the function sysfs_schedule_callback().
46 48
47 49
48Attributes 50Attributes
@@ -60,7 +62,7 @@ values of the same type.
60 62
61Mixing types, expressing multiple lines of data, and doing fancy 63Mixing types, expressing multiple lines of data, and doing fancy
62formatting of data is heavily frowned upon. Doing these things may get 64formatting of data is heavily frowned upon. Doing these things may get
63you publically humiliated and your code rewritten without notice. 65you publicly humiliated and your code rewritten without notice.
64 66
65 67
66An attribute definition is simply: 68An attribute definition is simply:
@@ -208,9 +210,9 @@ Other notes:
208 is 4096. 210 is 4096.
209 211
210- show() methods should return the number of bytes printed into the 212- show() methods should return the number of bytes printed into the
211 buffer. This is the return value of snprintf(). 213 buffer. This is the return value of scnprintf().
212 214
213- show() should always use snprintf(). 215- show() should always use scnprintf().
214 216
215- store() should return the number of bytes used from the buffer. If the 217- store() should return the number of bytes used from the buffer. If the
216 entire buffer has been used, just return the count argument. 218 entire buffer has been used, just return the count argument.
@@ -229,7 +231,7 @@ A very simple (and naive) implementation of a device attribute is:
229static ssize_t show_name(struct device *dev, struct device_attribute *attr, 231static ssize_t show_name(struct device *dev, struct device_attribute *attr,
230 char *buf) 232 char *buf)
231{ 233{
232 return snprintf(buf, PAGE_SIZE, "%s\n", dev->name); 234 return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
233} 235}
234 236
235static ssize_t store_name(struct device *dev, struct device_attribute *attr, 237static ssize_t store_name(struct device *dev, struct device_attribute *attr,
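
Note on the snprintf() -> scnprintf() hunks above: show() has to return the number of bytes actually placed in the buffer, and the two functions differ exactly when output is truncated. The following is a minimal userspace sketch of that difference. my_scnprintf() is a hypothetical stand-in that imitates the documented return value of the kernel helper; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the kernel's scnprintf():
 * returns the number of characters actually stored in buf,
 * not counting the trailing NUL.
 */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;               /* formatting error: report nothing */
	if ((size_t)n >= size)
		return (int)(size - 1); /* output was truncated */
	return n;
}

/* Self-check contrasting the two return values on truncation. */
static void demo_truncation(void)
{
	char buf[8];

	/* snprintf() reports the length the full string would have had. */
	assert(snprintf(buf, sizeof(buf), "%s\n", "long-device-name") == 17);
	/* The stand-in reports what actually fits in the buffer. */
	assert(my_scnprintf(buf, sizeof(buf), "%s\n", "long-device-name") == 7);
	assert(strlen(buf) == 7);
}
```

With the snprintf() return value, a show() method could claim more bytes than the buffer holds; the scnprintf()-style value can never exceed size - 1.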
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt
index 12fedb7834c6..8e4fab639d9c 100644
--- a/Documentation/filesystems/ubifs.txt
+++ b/Documentation/filesystems/ubifs.txt
@@ -82,12 +82,12 @@ Mount options
82bulk_read read more in one go to take advantage of flash 82bulk_read read more in one go to take advantage of flash
83 media that read faster sequentially 83 media that read faster sequentially
84no_bulk_read (*) do not bulk-read 84no_bulk_read (*) do not bulk-read
85no_chk_data_crc skip checking of CRCs on data nodes in order to 85no_chk_data_crc (*) skip checking of CRCs on data nodes in order to
86 improve read performance. Use this option only 86 improve read performance. Use this option only
87 if the flash media is highly reliable. The effect 87 if the flash media is highly reliable. The effect
88 of this option is that corruption of the contents 88 of this option is that corruption of the contents
89 of a file can go unnoticed. 89 of a file can go unnoticed.
90chk_data_crc (*) do not skip checking CRCs on data nodes 90chk_data_crc do not skip checking CRCs on data nodes
91compr=none override default compressor and set it to "none" 91compr=none override default compressor and set it to "none"
92compr=lzo override default compressor and set it to "lzo" 92compr=lzo override default compressor and set it to "lzo"
93compr=zlib override default compressor and set it to "zlib" 93compr=zlib override default compressor and set it to "zlib"
@@ -115,28 +115,8 @@ ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs
115Module Parameters for Debugging 115Module Parameters for Debugging
116=============================== 116===============================
117 117
118When UBIFS has been compiled with debugging enabled, there are 3 module 118When UBIFS has been compiled with debugging enabled, there are 2 module
119parameters that are available to control aspects of testing and debugging. 119parameters that are available to control aspects of testing and debugging.
120The parameters are unsigned integers where each bit controls an option.
121The parameters are:
122
123debug_msgs Selects which debug messages to display, as follows:
124
125 Message Type Flag value
126
127 General messages 1
128 Journal messages 2
129 Mount messages 4
130 Commit messages 8
131 LEB search messages 16
132 Budgeting messages 32
133 Garbage collection messages 64
134 Tree Node Cache (TNC) messages 128
135 LEB properties (lprops) messages 256
136 Input/output messages 512
137 Log messages 1024
138 Scan messages 2048
139 Recovery messages 4096
140 120
141debug_chks Selects extra checks that UBIFS can do while running: 121debug_chks Selects extra checks that UBIFS can do while running:
142 122
@@ -154,11 +134,9 @@ debug_tsts Selects a mode of testing, as follows:
154 134
155 Test mode Flag value 135 Test mode Flag value
156 136
157 Force in-the-gaps method 2
158 Failure mode for recovery testing 4 137 Failure mode for recovery testing 4
159 138
160For example, set debug_msgs to 5 to display General messages and Mount 139For example, set debug_chks to 3 to enable general and TNC checks.
161messages.
162 140
163 141
164References 142References
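
Note on the debug_chks example above: the parameter is a bitmask, so the value 3 is the OR of two single-bit flags. A small sketch of that arithmetic follows. The macro names are hypothetical, and the values 1 and 2 are assumptions inferred from the example sentence, since the flag table itself lies outside this hunk.

```c
#include <assert.h>

/*
 * Hypothetical names. The values are inferred from the example
 * "set debug_chks to 3 to enable general and TNC checks".
 */
#define CHK_GENERAL 1u /* assumed flag value for general checks */
#define CHK_TNC     2u /* assumed flag value for TNC checks */

/* Return non-zero if the given check is enabled in a debug_chks value. */
static int chk_enabled(unsigned int debug_chks, unsigned int flag)
{
	return (debug_chks & flag) != 0;
}

/* Self-check: OR-ing the two flags yields the value from the example. */
static void demo_debug_chks(void)
{
	unsigned int debug_chks = CHK_GENERAL | CHK_TNC;

	assert(debug_chks == 3u);
	assert(chk_enabled(debug_chks, CHK_GENERAL));
	assert(chk_enabled(debug_chks, CHK_TNC));
}
```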
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5efc06d8..88b9f5519af9 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -95,10 +95,11 @@ functions:
95 extern int unregister_filesystem(struct file_system_type *); 95 extern int unregister_filesystem(struct file_system_type *);
96 96
97The passed struct file_system_type describes your filesystem. When a 97The passed struct file_system_type describes your filesystem. When a
98request is made to mount a device onto a directory in your filespace, 98request is made to mount a filesystem onto a directory in your namespace,
99the VFS will call the appropriate get_sb() method for the specific 99the VFS will call the appropriate mount() method for the specific
100filesystem. The dentry for the mount point will then be updated to 100filesystem. A new vfsmount referring to the tree returned by ->mount()
101point to the root inode for the new filesystem. 101will be attached to the mountpoint, so that when pathname resolution
102reaches the mountpoint it will jump into the root of that vfsmount.
102 103
103You can see all filesystems that are registered to the kernel in the 104You can see all filesystems that are registered to the kernel in the
104file /proc/filesystems. 105file /proc/filesystems.
@@ -107,14 +108,14 @@ file /proc/filesystems.
107struct file_system_type 108struct file_system_type
108----------------------- 109-----------------------
109 110
110This describes the filesystem. As of kernel 2.6.22, the following 111This describes the filesystem. As of kernel 2.6.39, the following
111members are defined: 112members are defined:
112 113
113struct file_system_type { 114struct file_system_type {
114 const char *name; 115 const char *name;
115 int fs_flags; 116 int fs_flags;
116 int (*get_sb) (struct file_system_type *, int, 117 struct dentry *(*mount) (struct file_system_type *, int,
117 const char *, void *, struct vfsmount *); 118 const char *, void *);
118 void (*kill_sb) (struct super_block *); 119 void (*kill_sb) (struct super_block *);
119 struct module *owner; 120 struct module *owner;
120 struct file_system_type * next; 121 struct file_system_type * next;
@@ -128,11 +129,11 @@ struct file_system_type {
128 129
129 fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) 130 fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
130 131
131 get_sb: the method to call when a new instance of this 132 mount: the method to call when a new instance of this
132 filesystem should be mounted 133 filesystem should be mounted
133 134
134 kill_sb: the method to call when an instance of this filesystem 135 kill_sb: the method to call when an instance of this filesystem
135 should be unmounted 136 should be shut down
136 137
137 owner: for internal VFS use: you should initialize this to THIS_MODULE in 138 owner: for internal VFS use: you should initialize this to THIS_MODULE in
138 most cases. 139 most cases.
@@ -141,7 +142,7 @@ struct file_system_type {
141 142
142 s_lock_key, s_umount_key: lockdep-specific 143 s_lock_key, s_umount_key: lockdep-specific
143 144
144The get_sb() method has the following arguments: 145The mount() method has the following arguments:
145 146
146 struct file_system_type *fs_type: describes the filesystem, partly initialized 147 struct file_system_type *fs_type: describes the filesystem, partly initialized
147 by the specific filesystem code 148 by the specific filesystem code
@@ -153,32 +154,39 @@ The get_sb() method has the following arguments:
153 void *data: arbitrary mount options, usually comes as an ASCII 154 void *data: arbitrary mount options, usually comes as an ASCII
154 string (see "Mount Options" section) 155 string (see "Mount Options" section)
155 156
156 struct vfsmount *mnt: a vfs-internal representation of a mount point 157The mount() method must return the root dentry of the tree requested by
158caller. An active reference to its superblock must be grabbed and the
159superblock must be locked. On failure it should return ERR_PTR(error).
157 160
158The get_sb() method must determine if the block device specified 161The arguments match those of mount(2) and their interpretation
159in the dev_name and fs_type contains a filesystem of the type the method 162depends on filesystem type. E.g. for block filesystems, dev_name is
160supports. If it succeeds in opening the named block device, it initializes a 163interpreted as block device name, that device is opened and if it
161struct super_block descriptor for the filesystem contained by the block device. 164contains a suitable filesystem image the method creates and initializes
162On failure it returns an error. 165struct super_block accordingly, returning its root dentry to caller.
166
167->mount() may choose to return a subtree of existing filesystem - it
168doesn't have to create a new one. The main result from the caller's
169point of view is a reference to dentry at the root of (sub)tree to
170be attached; creation of new superblock is a common side effect.
163 171
164The most interesting member of the superblock structure that the 172The most interesting member of the superblock structure that the
165get_sb() method fills in is the "s_op" field. This is a pointer to 173mount() method fills in is the "s_op" field. This is a pointer to
166a "struct super_operations" which describes the next level of the 174a "struct super_operations" which describes the next level of the
167filesystem implementation. 175filesystem implementation.
168 176
169Usually, a filesystem uses one of the generic get_sb() implementations 177Usually, a filesystem uses one of the generic mount() implementations
170and provides a fill_super() method instead. The generic methods are: 178and provides a fill_super() callback instead. The generic variants are:
171 179
172 get_sb_bdev: mount a filesystem residing on a block device 180 mount_bdev: mount a filesystem residing on a block device
173 181
174 get_sb_nodev: mount a filesystem that is not backed by a device 182 mount_nodev: mount a filesystem that is not backed by a device
175 183
176 get_sb_single: mount a filesystem which shares the instance between 184 mount_single: mount a filesystem which shares the instance between
177 all mounts 185 all mounts
178 186
179A fill_super() method implementation has the following arguments: 187A fill_super() callback implementation has the following arguments:
180 188
181 struct super_block *sb: the superblock structure. The method fill_super() 189 struct super_block *sb: the superblock structure. The callback
182 must initialize this properly. 190 must initialize this properly.
183 191
184 void *data: arbitrary mount options, usually comes as an ASCII 192 void *data: arbitrary mount options, usually comes as an ASCII
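
Note on the new mount() contract in this hunk: on failure the method returns ERR_PTR(error) instead of an int, so one pointer-sized return value carries either the root dentry or a negative errno. The sketch below imitates that encoding in userspace. err_ptr(), is_err(), ptr_err(), struct toy_dentry and toy_mount() are hypothetical stand-ins written for illustration; they only mirror the general idea of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095 /* error values occupy the top 4095 addresses */

/* Userspace imitations of ERR_PTR(), IS_ERR() and PTR_ERR(). */
static void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

struct toy_dentry { const char *name; }; /* hypothetical stand-in */

/* Toy "mount" method: returns a root dentry or an encoded error. */
static struct toy_dentry *toy_mount(const char *dev_name)
{
	static struct toy_dentry root = { "/" };

	if (dev_name == NULL)
		return err_ptr(-EINVAL);
	return &root;
}

/* Self-check: the caller distinguishes both outcomes from one pointer. */
static void demo_err_ptr(void)
{
	struct toy_dentry *d = toy_mount(NULL);

	assert(is_err(d) && ptr_err(d) == -EINVAL);
	assert(!is_err(toy_mount("some-device")));
}
```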
@@ -203,7 +211,7 @@ struct super_operations {
203 struct inode *(*alloc_inode)(struct super_block *sb); 211 struct inode *(*alloc_inode)(struct super_block *sb);
204 void (*destroy_inode)(struct inode *); 212 void (*destroy_inode)(struct inode *);
205 213
206 void (*dirty_inode) (struct inode *); 214 void (*dirty_inode) (struct inode *, int flags);
207 int (*write_inode) (struct inode *, int); 215 int (*write_inode) (struct inode *, int);
208 void (*drop_inode) (struct inode *); 216 void (*drop_inode) (struct inode *);
209 void (*delete_inode) (struct inode *); 217 void (*delete_inode) (struct inode *);
@@ -246,7 +254,7 @@ or bottom half).
246 should be synchronous or not, not all filesystems check this flag. 254 should be synchronous or not, not all filesystems check this flag.
247 255
248 drop_inode: called when the last access to the inode is dropped, 256 drop_inode: called when the last access to the inode is dropped,
249 with the inode_lock spinlock held. 257 with the inode->i_lock spinlock held.
250 258
251 This method should be either NULL (normal UNIX filesystem 259 This method should be either NULL (normal UNIX filesystem
252 semantics) or "generic_delete_inode" (for filesystems that do not 260 semantics) or "generic_delete_inode" (for filesystems that do not
@@ -325,7 +333,8 @@ struct inode_operations {
325 void * (*follow_link) (struct dentry *, struct nameidata *); 333 void * (*follow_link) (struct dentry *, struct nameidata *);
326 void (*put_link) (struct dentry *, struct nameidata *, void *); 334 void (*put_link) (struct dentry *, struct nameidata *, void *);
327 void (*truncate) (struct inode *); 335 void (*truncate) (struct inode *);
328 int (*permission) (struct inode *, int, struct nameidata *); 336 int (*permission) (struct inode *, int, unsigned int);
337 int (*check_acl)(struct inode *, int, unsigned int);
329 int (*setattr) (struct dentry *, struct iattr *); 338 int (*setattr) (struct dentry *, struct iattr *);
330 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); 339 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
331 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); 340 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
@@ -414,6 +423,13 @@ otherwise noted.
414 permission: called by the VFS to check for access rights on a POSIX-like 423 permission: called by the VFS to check for access rights on a POSIX-like
415 filesystem. 424 filesystem.
416 425
426 May be called in rcu-walk mode (flags & IPERM_FLAG_RCU). If in rcu-walk
427 mode, the filesystem must check the permission without blocking or
428 storing to the inode.
429
430 If a situation is encountered that rcu-walk cannot handle, return
431 -ECHILD and it will be called again in ref-walk mode.
432
417 setattr: called by the VFS to set attributes for a file. This method 433 setattr: called by the VFS to set attributes for a file. This method
418 is called by chmod(2) and related system calls. 434 is called by chmod(2) and related system calls.
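
Note on the rcu-walk paragraphs added to permission in this hunk: the method may first be called in rcu-walk mode, and it returns -ECHILD when it cannot decide without blocking, after which it is called again in ref-walk mode. A toy userspace model of that retry protocol is sketched below. TOY_FLAG_RCU, toy_permission() and toy_check() are hypothetical names; this is not the VFS code, only the control flow the text describes.

```c
#include <assert.h>
#include <errno.h>

#define TOY_FLAG_RCU 0x1u /* hypothetical stand-in for IPERM_FLAG_RCU */

static int toy_calls; /* counts invocations of toy_permission() */

/*
 * Toy permission method. In rcu-walk mode it must not block, so it
 * bails out with -ECHILD whenever the check would need to block.
 */
static int toy_permission(int needs_blocking, unsigned int flags)
{
	toy_calls++;
	if ((flags & TOY_FLAG_RCU) && needs_blocking)
		return -ECHILD;
	return 0;
}

/* Toy caller: try rcu-walk first, fall back to ref-walk on -ECHILD. */
static int toy_check(int needs_blocking)
{
	int err = toy_permission(needs_blocking, TOY_FLAG_RCU);

	if (err == -ECHILD)
		err = toy_permission(needs_blocking, 0);
	return err;
}

/* Self-check: the blocking case costs exactly one extra call. */
static void demo_rcu_retry(void)
{
	toy_calls = 0;
	assert(toy_check(0) == 0 && toy_calls == 1);
	assert(toy_check(1) == 0 && toy_calls == 3);
}
```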
419 435
@@ -534,6 +550,7 @@ struct address_space_operations {
534 sector_t (*bmap)(struct address_space *, sector_t); 550 sector_t (*bmap)(struct address_space *, sector_t);
535 int (*invalidatepage) (struct page *, unsigned long); 551 int (*invalidatepage) (struct page *, unsigned long);
536 int (*releasepage) (struct page *, int); 552 int (*releasepage) (struct page *, int);
553 void (*freepage)(struct page *);
537 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, 554 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
538 loff_t offset, unsigned long nr_segs); 555 loff_t offset, unsigned long nr_segs);
539 struct page* (*get_xip_page)(struct address_space *, sector_t, 556 struct page* (*get_xip_page)(struct address_space *, sector_t,
@@ -660,11 +677,10 @@ struct address_space_operations {
660 releasepage: releasepage is called on PagePrivate pages to indicate 677 releasepage: releasepage is called on PagePrivate pages to indicate
661 that the page should be freed if possible. ->releasepage 678 that the page should be freed if possible. ->releasepage
662 should remove any private data from the page and clear the 679 should remove any private data from the page and clear the
663 PagePrivate flag. It may also remove the page from the 680 PagePrivate flag. If releasepage() fails for some reason, it must
664 address_space. If this fails for some reason, it may indicate 681 indicate failure with a 0 return value.
665 failure with a 0 return value. 682 releasepage() is used in two distinct though related cases. The
666 This is used in two distinct though related cases. The first 683 first is when the VM finds a clean page with no active users and
667 is when the VM finds a clean page with no active users and
668 wants to make it a free page. If ->releasepage succeeds, the 684 wants to make it a free page. If ->releasepage succeeds, the
669 page will be removed from the address_space and become free. 685 page will be removed from the address_space and become free.
670 686
@@ -679,6 +695,12 @@ struct address_space_operations {
679 need to ensure this. Possibly it can clear the PageUptodate 695 need to ensure this. Possibly it can clear the PageUptodate
680 bit if it cannot free private data yet. 696 bit if it cannot free private data yet.
681 697
698 freepage: freepage is called once the page is no longer visible in
699 the page cache in order to allow the cleanup of any private
700 data. Since it may be called by the memory reclaimer, it
701 should not assume that the original address_space mapping still
702 exists, and it should not block.
703
682 direct_IO: called by the generic read/write routines to perform 704 direct_IO: called by the generic read/write routines to perform
683 direct_IO - that is IO requests which bypass the page cache 705 direct_IO - that is IO requests which bypass the page cache
684 and transfer data directly between the storage and the 706 and transfer data directly between the storage and the
@@ -841,12 +863,17 @@ defined:
841 863
842struct dentry_operations { 864struct dentry_operations {
843 int (*d_revalidate)(struct dentry *, struct nameidata *); 865 int (*d_revalidate)(struct dentry *, struct nameidata *);
844 int (*d_hash) (struct dentry *, struct qstr *); 866 int (*d_hash)(const struct dentry *, const struct inode *,
845 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 867 struct qstr *);
846 int (*d_delete)(struct dentry *); 868 int (*d_compare)(const struct dentry *, const struct inode *,
869 const struct dentry *, const struct inode *,
870 unsigned int, const char *, const struct qstr *);
871 int (*d_delete)(const struct dentry *);
847 void (*d_release)(struct dentry *); 872 void (*d_release)(struct dentry *);
848 void (*d_iput)(struct dentry *, struct inode *); 873 void (*d_iput)(struct dentry *, struct inode *);
849 char *(*d_dname)(struct dentry *, char *, int); 874 char *(*d_dname)(struct dentry *, char *, int);
875 struct vfsmount *(*d_automount)(struct path *);
876 int (*d_manage)(struct dentry *, bool);
850}; 877};
851 878
852 d_revalidate: called when the VFS needs to revalidate a dentry. This 879 d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -854,13 +881,45 @@ struct dentry_operations {
854 dcache. Most filesystems leave this as NULL, because all their 881 dcache. Most filesystems leave this as NULL, because all their
855 dentries in the dcache are valid 882 dentries in the dcache are valid
856 883
857 d_hash: called when the VFS adds a dentry to the hash table 884 d_revalidate may be called in rcu-walk mode (nd->flags & LOOKUP_RCU).
885 If in rcu-walk mode, the filesystem must revalidate the dentry without
 886 blocking or storing to the dentry. d_parent and d_inode should not be
 887 used without care (because they can go NULL); instead nd->inode should
888 be used.
889
890 If a situation is encountered that rcu-walk cannot handle, return
891 -ECHILD and it will be called again in ref-walk mode.
892
893 d_hash: called when the VFS adds a dentry to the hash table. The first
894 dentry passed to d_hash is the parent directory that the name is
895 to be hashed into. The inode is the dentry's inode.
858 896
859 d_compare: called when a dentry should be compared with another 897 Same locking and synchronisation rules as d_compare regarding
898 what is safe to dereference etc.
860 899
861 d_delete: called when the last reference to a dentry is 900 d_compare: called to compare a dentry name with a given name. The first
862 deleted. This means no-one is using the dentry, however it is 901 dentry is the parent of the dentry to be compared, the second is
863 still valid and in the dcache 902 the parent's inode, then the dentry and inode (may be NULL) of the
903 child dentry. len and name string are properties of the dentry to be
904 compared. qstr is the name to compare it with.
905
906 Must be constant and idempotent, and should not take locks if
 907 possible, and should not store into the dentry or inodes.
908 Should not dereference pointers outside the dentry or inodes without
909 lots of care (eg. d_parent, d_inode, d_name should not be used).
910
911 However, our vfsmount is pinned, and RCU held, so the dentries and
912 inodes won't disappear, neither will our sb or filesystem module.
913 ->i_sb and ->d_sb may be used.
914
915 It is a tricky calling convention because it needs to be called under
916 "rcu-walk", ie. without any locks or references on things.
917
918 d_delete: called when the last reference to a dentry is dropped and the
919 dcache is deciding whether or not to cache it. Return 1 to delete
920 immediately, or 0 to cache the dentry. Default is NULL which means to
921 always cache a reachable dentry. d_delete must be constant and
922 idempotent.
864 923
865 d_release: called when a dentry is really deallocated 924 d_release: called when a dentry is really deallocated
866 925
@@ -881,6 +940,43 @@ struct dentry_operations {
881 at the end of the buffer, and returns a pointer to the first char. 940 at the end of the buffer, and returns a pointer to the first char.
882 dynamic_dname() helper function is provided to take care of this. 941 dynamic_dname() helper function is provided to take care of this.
883 942
943 d_automount: called when an automount dentry is to be traversed (optional).
944 This should create a new VFS mount record and return the record to the
945 caller. The caller is supplied with a path parameter giving the
946 automount directory to describe the automount target and the parent
947 VFS mount record to provide inheritable mount parameters. NULL should
948 be returned if someone else managed to make the automount first. If
949 the vfsmount creation failed, then an error code should be returned.
950 If -EISDIR is returned, then the directory will be treated as an
951 ordinary directory and returned to pathwalk to continue walking.
952
953 If a vfsmount is returned, the caller will attempt to mount it on the
954 mountpoint and will remove the vfsmount from its expiration list in
955 the case of failure. The vfsmount should be returned with 2 refs on
956 it to prevent automatic expiration - the caller will clean up the
957 additional ref.
958
959 This function is only used if DCACHE_NEED_AUTOMOUNT is set on the
960 dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the
961 inode being added.
962
963 d_manage: called to allow the filesystem to manage the transition from a
964 dentry (optional). This allows autofs, for example, to hold up clients
965 waiting to explore behind a 'mountpoint' whilst letting the daemon go
966 past and construct the subtree there. 0 should be returned to let the
967 calling process continue. -EISDIR can be returned to tell pathwalk to
968 use this directory as an ordinary directory and to ignore anything
969 mounted on it and not to check the automount flag. Any other error
970 code will abort pathwalk completely.
971
972 If the 'rcu_walk' parameter is true, then the caller is doing a
973 pathwalk in RCU-walk mode. Sleeping is not permitted in this mode,
 974 and the caller can be asked to leave it and call again by returning
975 -ECHILD.
976
977 This function is only used if DCACHE_MANAGE_TRANSIT is set on the
978 dentry being transited from.
979
884Example : 980Example :
885 981
886static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen) 982static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen)
@@ -904,14 +1000,11 @@ manipulate dentries:
904 the usage count) 1000 the usage count)
905 1001
906 dput: close a handle for a dentry (decrements the usage count). If 1002 dput: close a handle for a dentry (decrements the usage count). If
907 the usage count drops to 0, the "d_delete" method is called 1003 the usage count drops to 0, and the dentry is still in its
908 and the dentry is placed on the unused list if the dentry is 1004 parent's hash, the "d_delete" method is called to check whether
909 still in its parents hash list. Putting the dentry on the 1005 it should be cached. If it should not be cached, or if the dentry
910 unused list just means that if the system needs some RAM, it 1006 is not hashed, it is deleted. Otherwise cached dentries are put
911 goes through the unused list of dentries and deallocates them. 1007 into an LRU list to be reclaimed on memory shortage.
912 If the dentry has already been unhashed and the usage count
913 drops to 0, in this case the dentry is deallocated after the
914 "d_delete" method is called
915 1008
916 d_drop: this unhashes a dentry from its parent's hash list. A 1009 d_drop: this unhashes a dentry from its parent's hash list. A
917 subsequent call to dput() will deallocate the dentry if its 1010 subsequent call to dput() will deallocate the dentry if its
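
Note on the rewritten dput description above: when the usage count reaches 0, the outcome depends on whether the dentry is still hashed and on what d_delete returns. A toy userspace model of that decision is sketched below. struct toy_dentry, toy_dput() and the enum values are hypothetical and only mirror the wording of the new text, not the real dcache code.

```c
#include <assert.h>
#include <stddef.h>

enum toy_fate { TOY_STILL_IN_USE, TOY_CACHED_ON_LRU, TOY_DELETED };

struct toy_dentry {
	int count;                                  /* usage count */
	int hashed;                                 /* still in parent's hash? */
	int (*d_delete)(const struct toy_dentry *); /* optional, may be NULL */
};

/* Drop one reference and report what happens to the dentry. */
static enum toy_fate toy_dput(struct toy_dentry *d)
{
	if (--d->count > 0)
		return TOY_STILL_IN_USE;
	if (!d->hashed)
		return TOY_DELETED;       /* unhashed dentries are not cached */
	if (d->d_delete && d->d_delete(d))
		return TOY_DELETED;       /* filesystem asked not to cache it */
	return TOY_CACHED_ON_LRU;         /* reclaimed later on memory shortage */
}

/* Example d_delete that always asks for immediate deletion. */
static int always_delete(const struct toy_dentry *d)
{
	(void)d;
	return 1;
}

/* Self-check covering the default (NULL d_delete) caching path. */
static void demo_dput(void)
{
	struct toy_dentry d = { 2, 1, NULL };

	assert(toy_dput(&d) == TOY_STILL_IN_USE);
	assert(toy_dput(&d) == TOY_CACHED_ON_LRU);
}
```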
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt
index 96d0df28bed3..2ce36439c09f 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.txt
+++ b/Documentation/filesystems/xfs-delayed-logging-design.txt
@@ -42,7 +42,7 @@ the aggregation of all the previous changes currently held only in the log.
42This relogging technique also allows objects to be moved forward in the log so 42This relogging technique also allows objects to be moved forward in the log so
43that an object being relogged does not prevent the tail of the log from ever 43that an object being relogged does not prevent the tail of the log from ever
44moving forward. This can be seen in the table above by the changing 44moving forward. This can be seen in the table above by the changing
45(increasing) LSN of each subsquent transaction - the LSN is effectively a 45(increasing) LSN of each subsequent transaction - the LSN is effectively a
46direct encoding of the location in the log of the transaction. 46direct encoding of the location in the log of the transaction.
47 47
48This relogging is also used to implement long-running, multiple-commit 48This relogging is also used to implement long-running, multiple-commit
@@ -338,7 +338,7 @@ the same time another transaction modifies the item and inserts the log item
338into the new CIL, then checkpoint transaction commit code cannot use log items 338into the new CIL, then checkpoint transaction commit code cannot use log items
339to store the list of log vectors that need to be written into the transaction. 339to store the list of log vectors that need to be written into the transaction.
340Hence log vectors need to be able to be chained together to allow them to be 340Hence log vectors need to be able to be chained together to allow them to be
341detatched from the log items. That is, when the CIL is flushed the memory 341detached from the log items. That is, when the CIL is flushed the memory
342buffer and log vector attached to each log item needs to be attached to the 342buffer and log vector attached to each log item needs to be attached to the
343checkpoint context so that the log item can be released. In diagrammatic form, 343checkpoint context so that the log item can be released. In diagrammatic form,
344the CIL would look like this before the flush: 344the CIL would look like this before the flush:
@@ -577,7 +577,7 @@ only becomes unpinned when all the transactions complete and there are no
577pending transactions. Thus the pinning and unpinning of a log item is symmetric 577pending transactions. Thus the pinning and unpinning of a log item is symmetric
578as there is a 1:1 relationship with transaction commit and log item completion. 578as there is a 1:1 relationship with transaction commit and log item completion.
579 579
580For delayed logging, however, we have an assymetric transaction commit to 580For delayed logging, however, we have an asymmetric transaction commit to
581completion relationship. Every time an object is relogged in the CIL it goes 581completion relationship. Every time an object is relogged in the CIL it goes
582through the commit process without a corresponding completion being registered. 582through the commit process without a corresponding completion being registered.
583That is, we now have a many-to-one relationship between transaction commit and 583That is, we now have a many-to-one relationship between transaction commit and
@@ -780,7 +780,7 @@ With delayed logging, there are new steps inserted into the life cycle:
780From this, it can be seen that the only life cycle differences between the two 780From this, it can be seen that the only life cycle differences between the two
781logging methods are in the middle of the life cycle - they still have the same 781logging methods are in the middle of the life cycle - they still have the same
782beginning and end and execution constraints. The only differences are in the 782beginning and end and execution constraints. The only differences are in the
783commiting of the log items to the log itself and the completion processing. 783committing of the log items to the log itself and the completion processing.
784Hence delayed logging should not introduce any constraints on log item 784Hence delayed logging should not introduce any constraints on log item
785behaviour, allocation or freeing that don't already exist. 785behaviour, allocation or freeing that don't already exist.
786 786
@@ -791,21 +791,3 @@ mount option. Fundamentally, there is no reason why the log manager would not
791be able to swap methods automatically and transparently depending on load 791be able to swap methods automatically and transparently depending on load
792characteristics, but this should not be necessary if delayed logging works as 792characteristics, but this should not be necessary if delayed logging works as
793designed. 793designed.
794
795Roadmap:
796
7972.6.37 Remove experimental tag from mount option
798 => should be roughly 6 months after initial merge
799 => enough time to:
800 => gain confidence and fix problems reported by early
801 adopters (a.k.a. guinea pigs)
802 => address worst performance regressions and undesired
803 behaviours
804 => start tuning/optimising code for parallelism
805 => start tuning/optimising algorithms consuming
806 excessive CPU time
807
8082.6.39 Switch default mount option to use delayed logging
809 => should be roughly 12 months after initial merge
810 => enough time to shake out remaining problems before next round of
811 enterprise distro kernel rebases
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 7bff3e4f35df..3fc0c31a6f5d 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -39,6 +39,12 @@ When mounting an XFS filesystem, the following options are accepted.
39 drive level write caching to be enabled, for devices that 39 drive level write caching to be enabled, for devices that
40 support write barriers. 40 support write barriers.
41 41
42 discard
43 Issue command to let the block device reclaim space freed by the
44 filesystem. This is useful for SSD devices, thinly provisioned
45 LUNs and virtual machine images, but may have a performance
46 impact. This option is incompatible with the nodelaylog option.
47
42 dmapi 48 dmapi
43 Enable the DMAPI (Data Management API) event callouts. 49 Enable the DMAPI (Data Management API) event callouts.
44 Use with the "mtpt" option. 50 Use with the "mtpt" option.