author Dmitry Torokhov <dmitry.torokhov@gmail.com> 2013-05-01 11:47:44 -0400
committer Dmitry Torokhov <dmitry.torokhov@gmail.com> 2013-05-01 11:47:44 -0400
commit bf61c8840efe60fd8f91446860b63338fb424158
tree 7a71832407a4f0d6346db773343f4c3ae2257b19 /Documentation/filesystems
parent 5846115b30f3a881e542c8bfde59a699c1c13740
parent 0c6a61657da78098472fd0eb71cc01f2387fa1bb
Merge branch 'next' into for-linus
Prepare first set of updates for 3.10 merge window.
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/00-INDEX | 4
-rw-r--r-- Documentation/filesystems/Locking | 8
-rw-r--r-- Documentation/filesystems/caching/backend-api.txt | 38
-rw-r--r-- Documentation/filesystems/caching/netfs-api.txt | 46
-rw-r--r-- Documentation/filesystems/caching/object.txt | 23
-rw-r--r-- Documentation/filesystems/caching/operations.txt | 2
-rw-r--r-- Documentation/filesystems/efivarfs.txt | 16
-rw-r--r-- Documentation/filesystems/ext4.txt | 9
-rw-r--r-- Documentation/filesystems/f2fs.txt | 421
-rw-r--r-- Documentation/filesystems/nfs/nfs41-server.txt | 20
-rw-r--r-- Documentation/filesystems/porting | 6
-rw-r--r-- Documentation/filesystems/proc.txt | 146
-rw-r--r-- Documentation/filesystems/vfat.txt | 9
-rw-r--r-- Documentation/filesystems/vfs.txt | 35
-rw-r--r-- Documentation/filesystems/xfs.txt | 13
15 files changed, 727 insertions, 69 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8c624a18f67d..8042050eb265 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -38,6 +38,8 @@ dnotify_test.c
38 - example program for dnotify 38 - example program for dnotify
39ecryptfs.txt 39ecryptfs.txt
40 - docs on eCryptfs: stacked cryptographic filesystem for Linux. 40 - docs on eCryptfs: stacked cryptographic filesystem for Linux.
41efivarfs.txt
42 - info for the efivarfs filesystem.
41exofs.txt 43exofs.txt
42 - info, usage, mount options, design about EXOFS. 44 - info, usage, mount options, design about EXOFS.
43ext2.txt 45ext2.txt
@@ -48,6 +50,8 @@ ext4.txt
48 - info, mount options and specifications for the Ext4 filesystem. 50 - info, mount options and specifications for the Ext4 filesystem.
49files.txt 51files.txt
50 - info on file management in the Linux kernel. 52 - info on file management in the Linux kernel.
53f2fs.txt
54 - info and mount options for the F2FS filesystem.
51fuse.txt 55fuse.txt
52 - info on the Filesystem in User SpacE including mount options. 56 - info on the Filesystem in User SpacE including mount options.
53gfs2.txt 57gfs2.txt
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index e540a24e5d06..0706d32a61e6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -10,6 +10,7 @@ be able to use diff(1).
10--------------------------- dentry_operations -------------------------- 10--------------------------- dentry_operations --------------------------
11prototypes: 11prototypes:
12 int (*d_revalidate)(struct dentry *, unsigned int); 12 int (*d_revalidate)(struct dentry *, unsigned int);
13 int (*d_weak_revalidate)(struct dentry *, unsigned int);
13 int (*d_hash)(const struct dentry *, const struct inode *, 14 int (*d_hash)(const struct dentry *, const struct inode *,
14 struct qstr *); 15 struct qstr *);
15 int (*d_compare)(const struct dentry *, const struct inode *, 16 int (*d_compare)(const struct dentry *, const struct inode *,
@@ -25,6 +26,7 @@ prototypes:
25locking rules: 26locking rules:
26 rename_lock ->d_lock may block rcu-walk 27 rename_lock ->d_lock may block rcu-walk
27d_revalidate: no no yes (ref-walk) maybe 28d_revalidate: no no yes (ref-walk) maybe
29d_weak_revalidate:no no yes no
28d_hash no no no maybe 30d_hash no no no maybe
29d_compare: yes no no maybe 31d_compare: yes no no maybe
30d_delete: no yes no no 32d_delete: no yes no no
@@ -80,7 +82,6 @@ rename: yes (all) (see below)
80readlink: no 82readlink: no
81follow_link: no 83follow_link: no
82put_link: no 84put_link: no
83truncate: yes (see below)
84setattr: yes 85setattr: yes
85permission: no (may not block if called in rcu-walk mode) 86permission: no (may not block if called in rcu-walk mode)
86get_acl: no 87get_acl: no
@@ -96,11 +97,6 @@ atomic_open: yes
96 Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on 97 Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
97victim. 98victim.
98 cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. 99 cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
99 ->truncate() is never called directly - it's a callback, not a
100method. It's called by vmtruncate() - deprecated library function used by
101->setattr(). Locking information above applies to that call (i.e. is
102inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
103passed).
104 100
105See Documentation/filesystems/directory-locking for more detailed discussion 101See Documentation/filesystems/directory-locking for more detailed discussion
106of the locking scheme for directory operations. 102of the locking scheme for directory operations.
diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
index 382d52cdaf2d..d78bab9622c6 100644
--- a/Documentation/filesystems/caching/backend-api.txt
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -308,6 +308,18 @@ performed on the denizens of the cache. These are held in a structure of type:
308 obtained by calling object->cookie->def->get_aux()/get_attr(). 308 obtained by calling object->cookie->def->get_aux()/get_attr().
309 309
310 310
311 (*) Invalidate data object [mandatory]:
312
313 int (*invalidate_object)(struct fscache_operation *op)
314
315 This is called to invalidate a data object (as pointed to by op->object).
316 All the data stored for this object should be discarded and an
317 attr_changed operation should be performed. The caller will follow up
318 with an object update operation.
319
320 fscache_op_complete() must be called on op before returning.
321
322
311 (*) Discard object [mandatory]: 323 (*) Discard object [mandatory]:
312 324
313 void (*drop_object)(struct fscache_object *object) 325 void (*drop_object)(struct fscache_object *object)
@@ -419,7 +431,10 @@ performed on the denizens of the cache. These are held in a structure of type:
419 431
420 If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS 432 If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
421 returned if possible or fscache_end_io() called with a suitable error 433 returned if possible or fscache_end_io() called with a suitable error
422 code.. 434 code.
435
436 fscache_put_retrieval() should be called after a page or pages are dealt
437 with. This will complete the operation when all pages are dealt with.
423 438
424 439
425 (*) Request pages be read from cache [mandatory]: 440 (*) Request pages be read from cache [mandatory]:
@@ -526,6 +541,27 @@ FS-Cache provides some utilities that a cache backend may make use of:
526 error value should be 0 if successful and an error otherwise. 541 error value should be 0 if successful and an error otherwise.
527 542
528 543
544 (*) Record that one or more pages being retrieved or allocated have been dealt
545 with:
546
547 void fscache_retrieval_complete(struct fscache_retrieval *op,
548 int n_pages);
549
550 This is called to record the fact that one or more pages have been dealt
551 with and are no longer the concern of this operation. When the number of
552 pages remaining in the operation reaches 0, the operation will be
553 completed.
554
555
556 (*) Record operation completion:
557
558 void fscache_op_complete(struct fscache_operation *op);
559
560 This is called to record the completion of an operation. This deducts
561 this operation from the parent object's run state, potentially permitting
562 one or more pending operations to start running.
563
564
529 (*) Set highest store limit: 565 (*) Set highest store limit:
530 566
531 void fscache_set_store_limit(struct fscache_object *object, 567 void fscache_set_store_limit(struct fscache_object *object,
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index 7cc6bf2871eb..97e6c0ecc5ef 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -35,8 +35,9 @@ This document contains the following sections:
35 (12) Index and data file update 35 (12) Index and data file update
36 (13) Miscellaneous cookie operations 36 (13) Miscellaneous cookie operations
37 (14) Cookie unregistration 37 (14) Cookie unregistration
38 (15) Index and data file invalidation 38 (15) Index invalidation
39 (16) FS-Cache specific page flags. 39 (16) Data file invalidation
40 (17) FS-Cache specific page flags.
40 41
41 42
42============================= 43=============================
@@ -767,13 +768,42 @@ the cookies for "child" indices, objects and pages have been relinquished
767first. 768first.
768 769
769 770
770================================ 771==================
771INDEX AND DATA FILE INVALIDATION 772INDEX INVALIDATION
772================================ 773==================
774
775There is no direct way to invalidate an index subtree. To do this, the caller
776should relinquish and retire the cookie they have, and then acquire a new one.
777
778
779======================
780DATA FILE INVALIDATION
781======================
782
783Sometimes it will be necessary to invalidate an object that contains data.
784Typically this will be necessary when the server tells the netfs of a foreign
785change - at which point the netfs has to throw away all the state it had for an
786inode and reload from the server.
787
788To indicate that a cache object should be invalidated, the following function
789can be called:
790
791 void fscache_invalidate(struct fscache_cookie *cookie);
792
793This can be called with spinlocks held as it defers the work to a thread pool.
794All extant storage, retrieval and attribute change ops at this point are
795cancelled and discarded. Some future operations will be rejected until the
796cache has had a chance to insert a barrier in the operations queue. After
797that, operations will be queued again behind the invalidation operation.
798
799The invalidation operation will perform an attribute change operation and an
800auxiliary data update operation as it is very likely these will have changed.
801
802Using the following function, the netfs can wait for the invalidation operation
803to have reached a point at which it can start submitting ordinary operations
804once again:
773 805
774There is no direct way to invalidate an index subtree or a data file. To do 806 void fscache_wait_on_invalidate(struct fscache_cookie *cookie);
775this, the caller should relinquish and retire the cookie they have, and then
776acquire a new one.
777 807
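The call order described in the new DATA FILE INVALIDATION text can be sketched as follows. This is only an illustration: the two fscache functions are real kernel interfaces, but here they are replaced by userspace stand-ins with made-up bodies so the sketch compiles on its own, and the helper example_netfs_foreign_change() and the fields of the stand-in struct are hypothetical.

```c
/*
 * Userspace stand-ins (assumption): in a real netfs these two functions
 * are provided by the kernel's FS-Cache API. The struct fields and the
 * function bodies below exist only to make the call order observable.
 */
struct fscache_cookie {
	int invalidations;	/* how many times invalidation was requested */
	int barrier_reached;	/* set once ordinary ops may be submitted */
};

static void fscache_invalidate(struct fscache_cookie *cookie)
{
	cookie->invalidations++;
	cookie->barrier_reached = 0;
}

static void fscache_wait_on_invalidate(struct fscache_cookie *cookie)
{
	cookie->barrier_reached = 1;
}

/* Hypothetical netfs handler for a server-reported foreign change. */
static void example_netfs_foreign_change(struct fscache_cookie *cookie)
{
	/* Request invalidation; per the text this defers work to a thread pool. */
	fscache_invalidate(cookie);

	/* ... the netfs would discard its own cached inode state here ... */

	/* Wait until ordinary operations can be submitted again. */
	fscache_wait_on_invalidate(cookie);
}
```

Waiting is optional in the text above; a netfs that does not need to resubmit operations immediately could skip the second call.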
778 808
779=========================== 809===========================
diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.txt
index 58313348da87..100ff41127e4 100644
--- a/Documentation/filesystems/caching/object.txt
+++ b/Documentation/filesystems/caching/object.txt
@@ -216,7 +216,14 @@ servicing netfs requests:
216 The normal running state. In this state, requests the netfs makes will be 216 The normal running state. In this state, requests the netfs makes will be
217 passed on to the cache. 217 passed on to the cache.
218 218
219 (6) State FSCACHE_OBJECT_UPDATING. 219 (6) State FSCACHE_OBJECT_INVALIDATING.
220
221 The object is undergoing invalidation. When the state comes here, it
222 discards all pending read, write and attribute change operations as it is
223 going to clear out the cache entirely and reinitialise it. It will then
224 continue to the FSCACHE_OBJECT_UPDATING state.
225
226 (7) State FSCACHE_OBJECT_UPDATING.
220 227
221 The state machine comes here to update the object in the cache from the 228 The state machine comes here to update the object in the cache from the
222 netfs's records. This involves updating the auxiliary data that is used 229 netfs's records. This involves updating the auxiliary data that is used
@@ -225,13 +232,13 @@ servicing netfs requests:
225And there are terminal states in which an object cleans itself up, deallocates 232And there are terminal states in which an object cleans itself up, deallocates
226memory and potentially deletes stuff from disk: 233memory and potentially deletes stuff from disk:
227 234
228 (7) State FSCACHE_OBJECT_LC_DYING. 235 (8) State FSCACHE_OBJECT_LC_DYING.
229 236
230 The object comes here if it is dying because of a lookup or creation 237 The object comes here if it is dying because of a lookup or creation
231 error. This would be due to a disk error or system error of some sort. 238 error. This would be due to a disk error or system error of some sort.
232 Temporary data is cleaned up, and the parent is released. 239 Temporary data is cleaned up, and the parent is released.
233 240
234 (8) State FSCACHE_OBJECT_DYING. 241 (9) State FSCACHE_OBJECT_DYING.
235 242
236 The object comes here if it is dying due to an error, because its parent 243 The object comes here if it is dying due to an error, because its parent
237 cookie has been relinquished by the netfs or because the cache is being 244 cookie has been relinquished by the netfs or because the cache is being
@@ -241,27 +248,27 @@ memory and potentially deletes stuff from disk:
241 can destroy themselves. This object waits for all its children to go away 248 can destroy themselves. This object waits for all its children to go away
242 before advancing to the next state. 249 before advancing to the next state.
243 250
244 (9) State FSCACHE_OBJECT_ABORT_INIT. 251(10) State FSCACHE_OBJECT_ABORT_INIT.
245 252
246 The object comes to this state if it was waiting on its parent in 253 The object comes to this state if it was waiting on its parent in
247 FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself 254 FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
248 so that the parent may proceed from the FSCACHE_OBJECT_DYING state. 255 so that the parent may proceed from the FSCACHE_OBJECT_DYING state.
249 256
250(10) State FSCACHE_OBJECT_RELEASING. 257(11) State FSCACHE_OBJECT_RELEASING.
251(11) State FSCACHE_OBJECT_RECYCLING. 258(12) State FSCACHE_OBJECT_RECYCLING.
252 259
253 The object comes to one of these two states when dying once it is rid of 260 The object comes to one of these two states when dying once it is rid of
254 all its children, if it is dying because the netfs relinquished its 261 all its children, if it is dying because the netfs relinquished its
255 cookie. In the first state, the cached data is expected to persist, and 262 cookie. In the first state, the cached data is expected to persist, and
256 in the second it will be deleted. 263 in the second it will be deleted.
257 264
258(12) State FSCACHE_OBJECT_WITHDRAWING. 265(13) State FSCACHE_OBJECT_WITHDRAWING.
259 266
260 The object transits to this state if the cache decides it wants to 267 The object transits to this state if the cache decides it wants to
261 withdraw the object from service, perhaps to make space, but also due to 268 withdraw the object from service, perhaps to make space, but also due to
262 error or just because the whole cache is being withdrawn. 269 error or just because the whole cache is being withdrawn.
263 270
264(13) State FSCACHE_OBJECT_DEAD. 271(14) State FSCACHE_OBJECT_DEAD.
265 272
266 The object transits to this state when the in-memory object record is 273 The object transits to this state when the in-memory object record is
267 ready to be deleted. The object processor shouldn't ever see an object in 274 ready to be deleted. The object processor shouldn't ever see an object in
diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt
index b6b070c57cbf..bee2a5f93d60 100644
--- a/Documentation/filesystems/caching/operations.txt
+++ b/Documentation/filesystems/caching/operations.txt
@@ -174,7 +174,7 @@ Operations are used through the following procedure:
174 necessary (the object might have died whilst the thread was waiting). 174 necessary (the object might have died whilst the thread was waiting).
175 175
176 When it has finished doing its processing, it should call 176 When it has finished doing its processing, it should call
177 fscache_put_operation() on it. 177 fscache_op_complete() and fscache_put_operation() on it.
178 178
179 (4) The operation holds an effective lock upon the object, preventing other 179 (4) The operation holds an effective lock upon the object, preventing other
180 exclusive ops conflicting until it is released. The operation can be 180 exclusive ops conflicting until it is released. The operation can be
diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.txt
new file mode 100644
index 000000000000..c477af086e65
--- /dev/null
+++ b/Documentation/filesystems/efivarfs.txt
@@ -0,0 +1,16 @@
1
2efivarfs - a (U)EFI variable filesystem
3
4The efivarfs filesystem was created to address the shortcomings of
5using entries in sysfs to maintain EFI variables. The old sysfs EFI
6variables code only supported variables of up to 1024 bytes. This
7limitation existed in version 0.99 of the EFI specification, but was
8removed before any full releases. Since variables can now be larger
9than a single page, sysfs isn't the best interface for this.
10
11Variables can be created, deleted and modified with the efivarfs
12filesystem.
13
14efivarfs is typically mounted like this,
15
16 mount -t efivarfs none /sys/firmware/efi/efivars
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 104322bf378c..34ea4f1fa6ea 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -200,12 +200,9 @@ inode_readahead_blks=n This tuning parameter controls the maximum
200 table readahead algorithm will pre-read into 200 table readahead algorithm will pre-read into
201 the buffer cache. The default value is 32 blocks. 201 the buffer cache. The default value is 32 blocks.
202 202
203nouser_xattr Disables Extended User Attributes. If you have extended 203nouser_xattr Disables Extended User Attributes. See the
204 attribute support enabled in the kernel configuration 204 attr(5) manual page and http://acl.bestbits.at/
205 (CONFIG_EXT4_FS_XATTR), extended attribute support 205 for more information about extended attributes.
206 is enabled by default on mount. See the attr(5) manual
207 page and http://acl.bestbits.at/ for more information
208 about extended attributes.
209 206
210noacl This option disables POSIX Access Control List 207noacl This option disables POSIX Access Control List
211 support. If ACL support is enabled in the kernel 208 support. If ACL support is enabled in the kernel
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
new file mode 100644
index 000000000000..dcf338e62b71
--- /dev/null
+++ b/Documentation/filesystems/f2fs.txt
@@ -0,0 +1,421 @@
1================================================================================
2WHAT IS Flash-Friendly File System (F2FS)?
3================================================================================
4
5NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
6been equipped on a variety of systems ranging from mobile to server systems. Since
7they are known to have different characteristics from the conventional rotating
8disks, a file system, an upper layer to the storage device, should adapt to the
9changes from scratch at the design level.
10
11F2FS is a file system exploiting NAND flash memory-based storage devices, which
12is based on Log-structured File System (LFS). The design has been focused on
13addressing the fundamental issues in LFS: the snowball effect of the wandering
14tree and high cleaning overhead.
15
16Since a NAND flash memory-based storage device shows different characteristics
17according to its internal geometry or flash memory management scheme, namely FTL,
18F2FS and its tools support various parameters not only for configuring on-disk
19layout, but also for selecting allocation and cleaning algorithms.
20
21The file system formatting tool, "mkfs.f2fs", is available from the following
22git tree:
23>> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
24
25For reporting bugs and sending patches, please use the following mailing list:
26>> linux-f2fs-devel@lists.sourceforge.net
27
28================================================================================
29BACKGROUND AND DESIGN ISSUES
30================================================================================
31
32Log-structured File System (LFS)
33--------------------------------
34"A log-structured file system writes all modifications to disk sequentially in
35a log-like structure, thereby speeding up both file writing and crash recovery.
36The log is the only structure on disk; it contains indexing information so that
37files can be read back from the log efficiently. In order to maintain large free
38areas on disk for fast writing, we divide the log into segments and use a
39segment cleaner to compress the live information from heavily fragmented
40segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
41implementation of a log-structured file system", ACM Trans. Computer Systems
4210, 1, 26–52.
43
44Wandering Tree Problem
45----------------------
46In LFS, when file data is updated and written to the end of the log, its direct
47pointer block is updated due to the changed location. Then the indirect pointer
48block is also updated due to the direct pointer block update. In this manner,
49the upper index structures such as inode, inode map, and checkpoint block are
50also updated recursively. This is called the wandering tree problem [1],
51and to improve performance, the file system should eliminate or relax the update
52propagation as much as possible.
53
54[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/
55
56Cleaning Overhead
57-----------------
58Since LFS is based on out-of-place writes, it produces many obsolete blocks
59scattered across the whole storage. In order to provide new empty log space, it
60needs to reclaim these obsolete blocks transparently to users. This job is called
61the cleaning process.
62
63The process consists of four operations as follows.
641. A victim segment is selected through referencing segment usage table.
652. It loads parent index structures of all the data in the victim identified by
66 segment summary blocks.
673. It checks the cross-reference between the data and its parent index structure.
684. It moves valid data selectively.
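As a rough illustration of step 1, a greedy victim selection can be sketched as below. The struct and function names are hypothetical and this is not the F2FS implementation; it only shows the general idea of picking the segment with the fewest valid blocks, so that the least data has to be moved.

```c
#include <stddef.h>

/* Hypothetical, simplified entry of a segment usage table. */
struct seg_usage {
	unsigned int valid_blocks;	/* blocks still holding live data */
	int in_use;			/* nonzero if the segment is an active log */
};

/*
 * Greedy policy: choose the candidate segment with the fewest valid
 * blocks. Returns the segment index, or -1 if no segment qualifies.
 */
static long pick_victim_greedy(const struct seg_usage *tbl, size_t nsegs)
{
	long victim = -1;

	for (size_t i = 0; i < nsegs; i++) {
		if (tbl[i].in_use)
			continue;	/* never clean a segment being written */
		if (victim < 0 || tbl[i].valid_blocks < tbl[victim].valid_blocks)
			victim = (long)i;
	}
	return victim;
}
```

A cost-benefit policy would additionally weigh the age of the data in each segment; that variant is not shown here.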
69
70This cleaning job may cause unexpectedly long delays, so the most important goal
71is to hide the latencies from users. It should also reduce the amount of valid
72data to be moved, and move that data quickly.
73
74================================================================================
75KEY FEATURES
76================================================================================
77
78Flash Awareness
79---------------
80- Enlarge the random write area for better performance, but provide the high
81 spatial locality
82- Align FS data structures to the operational units in FTL as best efforts
83
84Wandering Tree Problem
85----------------------
86- Use a term, “node”, that represents inodes as well as various pointer blocks
87- Introduce Node Address Table (NAT) containing the locations of all the “node”
88 blocks; this will cut off the update propagation.
89
90Cleaning Overhead
91-----------------
92- Support a background cleaning process
93- Support greedy and cost-benefit algorithms for victim selection policies
94- Support multi-head logs for static/dynamic hot and cold data separation
95- Introduce adaptive logging for efficient block allocation
96
97================================================================================
98MOUNT OPTIONS
99================================================================================
100
101background_gc_off Turn off cleaning operations, namely garbage collection,
102 triggered in background when I/O subsystem is idle.
103disable_roll_forward Disable the roll-forward recovery routine
104discard Issue discard/TRIM commands when a segment is cleaned.
105no_heap Disable heap-style segment allocation which finds free
106 segments for data from the beginning of main area, while
107 for node from the end of main area.
108nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
109 by default if CONFIG_F2FS_FS_XATTR is selected.
110noacl Disable POSIX Access Control List. Note: acl is enabled
111 by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
112active_logs=%u Support configuring the number of active logs. In the
113 current design, f2fs supports only 2, 4, and 6 logs.
114 Default number is 6.
115disable_ext_identify Disable the extension list configured by mkfs, so f2fs
116 is not aware of cold files such as media files.
117
118================================================================================
119DEBUGFS ENTRIES
120================================================================================
121
122/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
123f2fs. Each file shows the whole f2fs information.
124
125/sys/kernel/debug/f2fs/status includes:
126 - major file system information managed by f2fs currently
127 - average SIT information about whole segments
128 - current memory footprint consumed by f2fs.
129
130================================================================================
131USAGE
132================================================================================
133
1341. Download userland tools and compile them.
135
1362. Skip this step if f2fs was compiled statically into the kernel.
137 Otherwise, insert the f2fs.ko module.
138 # insmod f2fs.ko
139
1403. Create a directory to use as the mount point
141 # mkdir /mnt/f2fs
142
1434. Format the block device, and then mount as f2fs
144 # mkfs.f2fs -l label /dev/block_device
145 # mount -t f2fs /dev/block_device /mnt/f2fs
146
147Format options
148--------------
149-l [label] : Give a volume label, up to 256 unicode name.
150-a [0 or 1] : Split start location of each area for heap-based allocation.
151 1 is set by default, which performs this.
152-o [int] : Set overprovision ratio in percent over volume size.
153 5 is set by default.
154-s [int] : Set the number of segments per section.
155 1 is set by default.
156-z [int] : Set the number of sections per zone.
157 1 is set by default.
158-e [str] : Set basic extension list. e.g. "mp3,gif,mov"
159
160================================================================================
161DESIGN
162================================================================================
163
164On-disk Layout
165--------------
166
167F2FS divides the whole volume into a number of segments, each of which is fixed
168to 2MB in size. A section is composed of consecutive segments, and a zone
169consists of a set of sections. By default, section and zone sizes are set to one
170segment size identically, but users can easily modify the sizes by mkfs.
171
172F2FS splits the entire volume into six areas, and all the areas except superblock
173consist of multiple segments as described below.
174
175 align with the zone size <-|
176 |-> align with the segment size
177 _________________________________________________________________________
178 | | | Segment | Node | Segment | |
179 | Superblock | Checkpoint | Info. | Address | Summary | Main |
180 | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | |
181 |____________|_____2______|______N______|______N______|______N_____|__N___|
182 . .
183 . .
184 . .
185 ._________________________________________.
186 |_Segment_|_..._|_Segment_|_..._|_Segment_|
187 . .
188 ._________._________
189 |_section_|__...__|_
190 . .
191 .________.
192 |__zone__|
193
194- Superblock (SB)
195 : It is located at the beginning of the partition, and there exist two copies
196 to avoid file system crash. It contains basic partition information and some
197 default parameters of f2fs.
198
199- Checkpoint (CP)
200 : It contains file system information, bitmaps for valid NAT/SIT sets, orphan
201 inode lists, and summary entries of current active segments.
202
203- Segment Information Table (SIT)
204 : It contains segment information such as valid block count and bitmap for the
205 validity of all the blocks.
206
207- Node Address Table (NAT)
208 : It is composed of a block address table for all the node blocks stored in
209 Main area.
210
211- Segment Summary Area (SSA)
212 : It contains summary entries which contain the owner information of all the
213 data and node blocks stored in Main area.
214
215- Main Area
216 : It contains file and directory data including their indices.
217
218In order to avoid misalignment between file system and flash-based storage, F2FS
219aligns the start block address of CP with the segment size. Also, it aligns the
220start block address of Main area with the zone size by reserving some segments
221in SSA area.
222
223Refer to the following survey for additional technical details.
224https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey
225
226File System Metadata Structure
227------------------------------
228
229F2FS adopts the checkpointing scheme to maintain file system consistency. At
230mount time, F2FS first tries to find the last valid checkpoint data by scanning
231CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
232One of them always indicates the last valid data, which is called the shadow copy
233mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism.
234
235For file system consistency, each CP indicates which NAT and SIT copies are
236valid, as shown below.
237
238 +--------+----------+---------+
239 | CP | SIT | NAT |
240 +--------+----------+---------+
241 . . . .
242 . . . .
243 . . . .
244 +-------+-------+--------+--------+--------+--------+
245 | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 |
246 +-------+-------+--------+--------+--------+--------+
247 | ^ ^
248 | | |
249 `----------------------------------------'
250
251Index Structure
252---------------
253
254The key data structure to manage the data locations is a "node". Similar to
255traditional file structures, F2FS has three types of node: inode, direct node,
256indirect node. F2FS assigns 4KB to an inode block which contains 923 data block
257indices, two direct node pointers, two indirect node pointers, and one double
258indirect node pointer as described below. One direct node block contains 1018
259data blocks, and one indirect node block also contains 1018 node blocks. Thus,
260one inode block (i.e., a file) covers:
261
262 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
263
264 Inode block (4KB)
265 |- data (923)
266 |- direct node (2)
267 | `- data (1018)
268 |- indirect node (2)
269 | `- direct node (1018)
270 | `- data (1018)
271 `- double indirect node (1)
272 `- indirect node (1018)
273 `- direct node (1018)
274 `- data (1018)
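The coverage figure can be checked with a few lines of C. The function names are hypothetical; the constants (923 in-inode indices, 1018 entries per node block, 4KB blocks) are the ones given in the text. The result is 1,057,053,439 blocks, or 4,329,690,886,144 bytes, which is about 3.94 when expressed in units of 2^40 bytes.

```c
#include <stdint.h>

/* Data blocks addressable from a single inode, following the tree above. */
static uint64_t max_file_blocks(void)
{
	const uint64_t in_inode = 923;	/* data indices held in the inode block */
	const uint64_t per_node = 1018;	/* entries in a direct or indirect node */

	return in_inode
	     + 2 * per_node				/* two direct nodes */
	     + 2 * per_node * per_node			/* two indirect nodes */
	     + per_node * per_node * per_node;		/* one double indirect node */
}

/* Maximum file size in bytes with 4KB data blocks. */
static uint64_t max_file_bytes(void)
{
	return max_file_blocks() * 4096;
}
```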
275
276Note that all the node blocks are mapped by NAT, which means the location of
277each node is translated by the NAT table. With regard to the wandering
278tree problem, F2FS is able to cut off the propagation of node updates caused by
279leaf data writes.
280
281Directory Structure
282-------------------
283
284A directory entry occupies 11 bytes, which consists of the following attributes.
285
286- hash hash value of the file name
287- ino inode number
288- len the length of file name
289- type file type such as directory, symlink, etc
290
291A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
292used to represent whether each dentry is valid or not. A dentry block occupies
2934KB with the following composition.
294
295 Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
296 dentries(11 * 214 bytes) + file name (8 * 214 bytes)
297
298 [Bucket]
299 +--------------------------------+
300 |dentry block 1 | dentry block 2 |
301 +--------------------------------+
302 . .
303 . .
304 . [Dentry Block Structure: 4KB] .
305 +--------+----------+----------+------------+
306 | bitmap | reserved | dentries | file names |
307 +--------+----------+----------+------------+
308 [Dentry Block: 4KB] . .
309 . .
310 . .
311 +------+------+-----+------+
312 | hash | ino | len | type |
313 +------+------+-----+------+
314 [Dentry Structure: 11 bytes]
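The dentry block layout described above adds up to exactly one 4KB block, which
the following sketch verifies. The per-field widths of a dentry (4-byte hash,
4-byte ino, 2-byte len, 1-byte type) are an assumption made for this example;
the text only states the 11-byte total. The helper slots_needed() is
hypothetical and illustrates how a file name could span several consecutive
8-byte name slots.

```python
import struct

# Figures from the text above; the names are illustrative only.
NR_DENTRY_IN_BLOCK = 214
DENTRY_SIZE = 11
NAME_SLOT_LEN = 8
BITMAP_BYTES = 27
RESERVED_BYTES = 3

# Assumed field widths: hash (u32), ino (u32), len (u16), type (u8),
# little-endian and packed without padding.
DENTRY_FORMAT = "<IIHB"


def dentry_block_size():
    """Total size of one dentry block in bytes."""
    return (BITMAP_BYTES + RESERVED_BYTES
            + DENTRY_SIZE * NR_DENTRY_IN_BLOCK
            + NAME_SLOT_LEN * NR_DENTRY_IN_BLOCK)


def slots_needed(name_len):
    """Consecutive 8-byte name slots needed to hold a name of name_len bytes."""
    return (name_len + NAME_SLOT_LEN - 1) // NAME_SLOT_LEN


print(dentry_block_size(), struct.calcsize(DENTRY_FORMAT), slots_needed(20))
```

Note that the 27-byte bitmap provides 216 bits, enough for one validity bit per
dentry slot.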
315
316F2FS implements multi-level hash tables for directory structure. Each level has
317a hash table with dedicated number of hash buckets as shown below. Note that
318"A(2B)" means a bucket includes 2 data blocks.
319
320----------------------
321A : bucket
322B : block
323N : MAX_DIR_HASH_DEPTH
324----------------------
325
326level #0 | A(2B)
327 |
328level #1 | A(2B) - A(2B)
329 |
330level #2 | A(2B) - A(2B) - A(2B) - A(2B)
331 . | . . . .
332level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
333 . | . . . .
334level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
335
336The numbers of blocks and buckets are determined by:
337
338 ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
339 # of blocks in level #n = |
340 `- 4, Otherwise
341
342 ,- 2^n, if n < MAX_DIR_HASH_DEPTH / 2,
343 # of buckets in level #n = |
344 `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), Otherwise
345
346When F2FS finds a file name in a directory, at first a hash value of the file
347name is calculated. Then, F2FS scans the hash table in level #0 to find the
348dentry consisting of the file name and its inode number. If not found, F2FS
349scans the next hash table in level #1. In this way, F2FS scans the hash tables
350of each level incrementally from 1 to N. In each level, F2FS needs to scan only
351one bucket determined by the following equation, which shows O(log(# of files))
352complexity.
353
354 bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
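The per-level rules and the bucket equation above can be written down directly.
This is a sketch of the formulas as stated in the text, not the kernel code:
MAX_DIR_HASH_DEPTH is passed in as a parameter because its value is not given
here, "/" is assumed to be integer division, and the function names are made up
for the example.

```python
def blocks_in_bucket(level, max_depth):
    """Blocks per bucket at a level: 2 in the lower half, 4 otherwise."""
    return 2 if level < max_depth // 2 else 4


def buckets_in_level(level, max_depth):
    """Buckets at a level: 2^n in the lower half, then a fixed 2^(N/2 - 1)."""
    half = max_depth // 2
    if level < half:
        return 2 ** level
    return 2 ** (half - 1)


def bucket_to_scan(hash_value, level, max_depth):
    """Index of the single bucket that must be scanned at the given level."""
    return hash_value % buckets_in_level(level, max_depth)


# Walk the levels for one hash value, using a small hypothetical depth of 8.
for level in range(8):
    print(level,
          buckets_in_level(level, 8),
          blocks_in_bucket(level, 8),
          bucket_to_scan(4661, level, 8))
```

Because only one bucket of a bounded number of blocks is scanned per level, the
lookup cost grows with the number of levels rather than with the number of
entries.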
355
356In the case of file creation, F2FS finds empty consecutive slots that cover the
357file name. F2FS searches for the empty slots in the hash tables of all levels from
3581 to N in the same way as the lookup operation.
359
360The following figure shows an example of two cases holding children.
361 --------------> Dir <--------------
362 | |
363 child child
364
365 child - child [hole] - child
366
367 child - child - child [hole] - [hole] - child
368
369 Case 1: Case 2:
370 Number of children = 6, Number of children = 3,
371 File size = 7 File size = 7
372
373Default Block Allocation
374------------------------
375
376At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
377and Hot/Warm/Cold data.
378
379- Hot node contains direct node blocks of directories.
380- Warm node contains direct node blocks except hot node blocks.
381- Cold node contains indirect node blocks.
382- Hot data contains dentry blocks.
383- Warm data contains data blocks except hot and cold data blocks.
384- Cold data contains multimedia data or migrated data blocks.
385
386LFS has two schemes for free space management: threaded log and
387copy-and-compaction. The copy-and-compaction scheme, known as cleaning, is well suited
388for devices showing very good sequential write performance, since free segments
389are served all the time for writing new data. However, it suffers from cleaning
390overhead under high utilization. Contrarily, the threaded log scheme suffers
391from random writes, but no cleaning process is needed. F2FS adopts a hybrid
392scheme where the copy-and-compaction scheme is adopted by default, but the
393policy is dynamically changed to the threaded log scheme according to the file
394system status.
395
396In order to align F2FS with underlying flash-based storage, F2FS allocates a
397segment in a unit of section. F2FS expects that the section size would be the
398same as the unit size of garbage collection in FTL. Furthermore, with respect
399to the mapping granularity in FTL, F2FS allocates each section of the active
400logs from different zones as much as possible, since FTL can write the data in
401the active logs into one allocation unit according to its mapping granularity.
402
403Cleaning process
404----------------
405
406F2FS does cleaning both on demand and in the background. On-demand cleaning is
407triggered when there are not enough free segments to serve VFS calls. The
408background cleaner is run by a kernel thread and triggers cleaning when the
409system is idle.
410
411F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
412In the greedy algorithm, F2FS selects a victim segment having the smallest number
413of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment
414according to the segment age and the number of valid blocks in order to address
415the log block thrashing problem of the greedy algorithm. F2FS adopts the greedy
416algorithm for the on-demand cleaner, while the background cleaner adopts the
417cost-benefit algorithm.
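To make the difference between the two policies concrete, here is a small
sketch. The greedy policy follows the description above directly. For
cost-benefit, the text does not give a formula, so the sketch substitutes the
classic log-structured file system score (1 - u) * age / (1 + u), where u is
the fraction of valid blocks; this is an assumption and not necessarily what
F2FS computes. The segment records and the 512 blocks per segment are likewise
made up for the example.

```python
def greedy_victim(segments):
    """Pick the segment with the fewest valid blocks."""
    return min(segments, key=lambda seg: seg["valid"])


def cost_benefit_victim(segments, blocks_per_seg):
    """Pick the segment with the best benefit/cost score.

    Assumption: classic LFS score (1 - u) * age / (1 + u), with u the
    utilization of the segment. Older, emptier segments score higher.
    """
    def score(seg):
        u = seg["valid"] / blocks_per_seg
        return (1 - u) * seg["age"] / (1 + u)
    return max(segments, key=score)


# Hypothetical segments: "age" is time since the last modification.
segments = [
    {"name": "A", "valid": 100, "age": 10},    # emptiest, but recently written
    {"name": "B", "valid": 150, "age": 1000},  # fairly empty and old
    {"name": "C", "valid": 480, "age": 5000},  # very old, but almost full
]

print(greedy_victim(segments)["name"])
print(cost_benefit_victim(segments, 512)["name"])
```

In this example greedy takes the recently written segment A, whose remaining
blocks may soon be invalidated anyway, while cost-benefit prefers the older
segment B.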
418
419In order to identify whether the data in the victim segment are valid or not,
420F2FS manages a bitmap. Each bit represents the validity of a block, and the
421bitmap is composed of a bit stream covering all blocks in the Main area.
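A validity bitmap of this kind can be sketched as one bit per block. The class
below is a minimal illustration with hypothetical names; it says nothing about
how F2FS lays the bitmap out on disk.

```python
class ValidityBitmap:
    """One bit per block; a set bit means the block holds valid data."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)

    def set_valid(self, blkno, valid=True):
        byte, bit = divmod(blkno, 8)
        if valid:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_valid(self, blkno):
        byte, bit = divmod(blkno, 8)
        return bool(self.bits[byte] & (1 << bit))

    def valid_in_segment(self, segno, blocks_per_seg):
        """Count the valid blocks of one segment, e.g. for victim selection."""
        start = segno * blocks_per_seg
        return sum(self.is_valid(b) for b in range(start, start + blocks_per_seg))


bitmap = ValidityBitmap(1024)
for blk in (0, 5, 511, 512):
    bitmap.set_valid(blk)
bitmap.set_valid(5, False)   # block 5 was overwritten elsewhere
print(bitmap.valid_in_segment(0, 512), bitmap.valid_in_segment(1, 512))
```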
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt
index 092fad92a3f0..01c2db769791 100644
--- a/Documentation/filesystems/nfs/nfs41-server.txt
+++ b/Documentation/filesystems/nfs/nfs41-server.txt
@@ -39,21 +39,10 @@ interoperability problems with future clients. Known issues:
39 from a linux client are possible, but we aren't really 39 from a linux client are possible, but we aren't really
40 conformant with the spec (for example, we don't use kerberos 40 conformant with the spec (for example, we don't use kerberos
41 on the backchannel correctly). 41 on the backchannel correctly).
42 - Incomplete backchannel support: incomplete backchannel gss
43 support and no support for BACKCHANNEL_CTL mean that
44 callbacks (hence delegations and layouts) may not be
45 available and clients confused by the incomplete
46 implementation may fail.
47 - We do not support SSV, which provides security for shared 42 - We do not support SSV, which provides security for shared
48 client-server state (thus preventing unauthorized tampering 43 client-server state (thus preventing unauthorized tampering
49 with locks and opens, for example). It is mandatory for 44 with locks and opens, for example). It is mandatory for
50 servers to support this, though no clients use it yet. 45 servers to support this, though no clients use it yet.
51 - Mandatory operations which we do not support, such as
52 DESTROY_CLIENTID, are not currently used by clients, but will be
53 (and the spec recommends their uses in common cases), and
54 clients should not be expected to know how to recover from the
55 case where they are not supported. This will eventually cause
56 interoperability failures.
57 46
58In addition, some limitations are inherited from the current NFSv4 47In addition, some limitations are inherited from the current NFSv4
59implementation: 48implementation:
@@ -89,7 +78,7 @@ Operations
89 | | MNI | or OPT) | | 78 | | MNI | or OPT) | |
90 +----------------------+------------+--------------+----------------+ 79 +----------------------+------------+--------------+----------------+
91 | ACCESS | REQ | | Section 18.1 | 80 | ACCESS | REQ | | Section 18.1 |
92NS | BACKCHANNEL_CTL | REQ | | Section 18.33 | 81I | BACKCHANNEL_CTL | REQ | | Section 18.33 |
93I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | 82I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
94 | CLOSE | REQ | | Section 18.2 | 83 | CLOSE | REQ | | Section 18.2 |
95 | COMMIT | REQ | | Section 18.3 | 84 | COMMIT | REQ | | Section 18.3 |
@@ -99,7 +88,7 @@ NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
99 | DELEGRETURN | OPT | FDELG, | Section 18.6 | 88 | DELEGRETURN | OPT | FDELG, | Section 18.6 |
100 | | | DDELG, pNFS | | 89 | | | DDELG, pNFS | |
101 | | | (REQ) | | 90 | | | (REQ) | |
102NS | DESTROY_CLIENTID | REQ | | Section 18.50 | 91I | DESTROY_CLIENTID | REQ | | Section 18.50 |
103I | DESTROY_SESSION | REQ | | Section 18.37 | 92I | DESTROY_SESSION | REQ | | Section 18.37 |
104I | EXCHANGE_ID | REQ | | Section 18.35 | 93I | EXCHANGE_ID | REQ | | Section 18.35 |
105I | FREE_STATEID | REQ | | Section 18.38 | 94I | FREE_STATEID | REQ | | Section 18.38 |
@@ -192,7 +181,6 @@ EXCHANGE_ID:
192 181
193CREATE_SESSION: 182CREATE_SESSION:
194* backchannel attributes are ignored 183* backchannel attributes are ignored
195* backchannel security parameters are ignored
196 184
197SEQUENCE: 185SEQUENCE:
198* no support for dynamic slot table renegotiation (optional) 186* no support for dynamic slot table renegotiation (optional)
@@ -202,7 +190,7 @@ Nonstandard compound limitations:
202 ca_maxrequestsize request and a ca_maxresponsesize reply, so we may 190 ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
203 fail to live up to the promise we made in CREATE_SESSION fore channel 191 fail to live up to the promise we made in CREATE_SESSION fore channel
204 negotiation. 192 negotiation.
205* No more than one IO operation (read, write, readdir) allowed per 193* No more than one read-like operation allowed per compound; encoding
206 compound. 194 replies that cross page boundaries (except for read data) not handled.
207 195
208See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. 196See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 0742feebc6e2..4db22f6491e0 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -281,7 +281,7 @@ ext2_write_failed and callers for an example.
281 281
282[mandatory] 282[mandatory]
283 283
284 ->truncate is going away. The whole truncate sequence needs to be 284 ->truncate is gone. The whole truncate sequence needs to be
285implemented in ->setattr, which is now mandatory for filesystems 285implemented in ->setattr, which is now mandatory for filesystems
286implementing on-disk size changes. Start with a copy of the old inode_setattr 286implementing on-disk size changes. Start with a copy of the old inode_setattr
 287and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence to 287and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence to
@@ -441,3 +441,7 @@ d_make_root() drops the reference to inode if dentry allocation fails.
441two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that 441two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that
 442local filesystems can ignore that argument - they are guaranteed that the 442local filesystems can ignore that argument - they are guaranteed that the
443object doesn't exist. It's remote/distributed ones that might care... 443object doesn't exist. It's remote/distributed ones that might care...
444--
445[mandatory]
446 FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate()
447in your dentry operations instead.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a1793d670cd0..fd8d0d594fc7 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,7 @@ Table of Contents
33 2 Modifying System Parameters 33 2 Modifying System Parameters
34 34
35 3 Per-Process Parameters 35 3 Per-Process Parameters
36 3.1 /proc/<pid>/oom_score_adj - Adjust the oom-killer 36 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
37 score 37 score
38 3.2 /proc/<pid>/oom_score - Display current oom-killer score 38 3.2 /proc/<pid>/oom_score - Display current oom-killer score
39 3.3 /proc/<pid>/io - Display the IO accounting fields 39 3.3 /proc/<pid>/io - Display the IO accounting fields
@@ -41,6 +41,7 @@ Table of Contents
41 3.5 /proc/<pid>/mountinfo - Information about mounts 41 3.5 /proc/<pid>/mountinfo - Information about mounts
42 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm 42 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
43 3.7 /proc/<pid>/task/<tid>/children - Information about task children 43 3.7 /proc/<pid>/task/<tid>/children - Information about task children
44 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
44 45
45 4 Configuring procfs 46 4 Configuring procfs
46 4.1 Mount options 47 4.1 Mount options
@@ -142,7 +143,7 @@ Table 1-1: Process specific entries in /proc
142 pagemap Page table 143 pagemap Page table
143 stack Report full stack trace, enable via CONFIG_STACKTRACE 144 stack Report full stack trace, enable via CONFIG_STACKTRACE
144 smaps a extension based on maps, showing the memory consumption of 145 smaps a extension based on maps, showing the memory consumption of
145 each mapping 146 each mapping and flags associated with it
146.............................................................................. 147..............................................................................
147 148
148For example, to get the status information of a process, all you have to do is 149For example, to get the status information of a process, all you have to do is
@@ -181,6 +182,7 @@ read the file /proc/PID/status:
181 CapPrm: 0000000000000000 182 CapPrm: 0000000000000000
182 CapEff: 0000000000000000 183 CapEff: 0000000000000000
183 CapBnd: ffffffffffffffff 184 CapBnd: ffffffffffffffff
185 Seccomp: 0
184 voluntary_ctxt_switches: 0 186 voluntary_ctxt_switches: 0
185 nonvoluntary_ctxt_switches: 1 187 nonvoluntary_ctxt_switches: 1
186 188
@@ -237,6 +239,7 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7)
237 CapPrm bitmap of permitted capabilities 239 CapPrm bitmap of permitted capabilities
238 CapEff bitmap of effective capabilities 240 CapEff bitmap of effective capabilities
239 CapBnd bitmap of capabilities bounding set 241 CapBnd bitmap of capabilities bounding set
242 Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
240 Cpus_allowed mask of CPUs on which this process may run 243 Cpus_allowed mask of CPUs on which this process may run
241 Cpus_allowed_list Same as previous, but in "list format" 244 Cpus_allowed_list Same as previous, but in "list format"
242 Mems_allowed mask of memory nodes allowed to this process 245 Mems_allowed mask of memory nodes allowed to this process
@@ -415,8 +418,9 @@ Swap: 0 kB
415KernelPageSize: 4 kB 418KernelPageSize: 4 kB
416MMUPageSize: 4 kB 419MMUPageSize: 4 kB
417Locked: 374 kB 420Locked: 374 kB
421VmFlags: rd ex mr mw me de
418 422
419The first of these lines shows the same information as is displayed for the 423the first of these lines shows the same information as is displayed for the
420mapping in /proc/PID/maps. The remaining lines show the size of the mapping 424mapping in /proc/PID/maps. The remaining lines show the size of the mapping
421(size), the amount of the mapping that is currently resident in RAM (RSS), the 425(size), the amount of the mapping that is currently resident in RAM (RSS), the
422process' proportional share of this mapping (PSS), the number of clean and 426process' proportional share of this mapping (PSS), the number of clean and
@@ -430,6 +434,41 @@ and a page is modified, the file page is replaced by a private anonymous copy.
430"Swap" shows how much would-be-anonymous memory is also used, but out on 434"Swap" shows how much would-be-anonymous memory is also used, but out on
431swap. 435swap.
432 436
 437The "VmFlags" field deserves a separate description. This member represents the
 438kernel flags associated with the particular virtual memory area, encoded as
 439two-letter codes. The codes are the following:
440 rd - readable
441 wr - writeable
442 ex - executable
443 sh - shared
444 mr - may read
445 mw - may write
446 me - may execute
447 ms - may share
 448 gd - stack segment grows down
449 pf - pure PFN range
450 dw - disabled write to the mapped file
451 lo - pages are locked in memory
452 io - memory mapped I/O area
 453 sr - sequential read advice provided
 454 rr - random read advice provided
455 dc - do not copy area on fork
456 de - do not expand area on remapping
457 ac - area is accountable
458 nr - swap space is not reserved for the area
459 ht - area uses huge tlb pages
460 nl - non-linear mapping
461 ar - architecture specific flag
462 dd - do not include area into core dump
463 mm - mixed map area
 464 hg - huge page advice flag
 465 nh - no-huge page advice flag
 466 mg - mergeable advice flag
467
 468Note that there is no guarantee that every flag and associated mnemonic will
 469be present in all future kernel releases. Things change: flags may vanish
 470or, conversely, new ones may be added.
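As an illustration of how the two-letter codes might be consumed, the sketch
below decodes a "VmFlags:" line into descriptions. The table is copied from the
list above; the function name is hypothetical. Since mnemonics may come and go
between releases, unknown codes are reported rather than treated as errors.

```python
VMFLAG_NAMES = {
    "rd": "readable", "wr": "writeable", "ex": "executable", "sh": "shared",
    "mr": "may read", "mw": "may write", "me": "may execute", "ms": "may share",
    "gd": "stack segment grows down", "pf": "pure PFN range",
    "dw": "disabled write to the mapped file",
    "lo": "pages are locked in memory", "io": "memory mapped I/O area",
    "sr": "sequential read advice provided",
    "rr": "random read advice provided",
    "dc": "do not copy area on fork",
    "de": "do not expand area on remapping",
    "ac": "area is accountable",
    "nr": "swap space is not reserved for the area",
    "ht": "area uses huge tlb pages", "nl": "non-linear mapping",
    "ar": "architecture specific flag",
    "dd": "do not include area into core dump", "mm": "mixed map area",
    "hg": "huge page advice flag", "nh": "no-huge page advice flag",
    "mg": "mergeable advice flag",
}


def decode_vmflags(line):
    """Translate a 'VmFlags: rd ex ...' line into a list of descriptions."""
    label, _, codes = line.partition(":")
    if label.strip() != "VmFlags":
        raise ValueError("not a VmFlags line: %r" % line)
    return [VMFLAG_NAMES.get(code, "unknown flag '%s'" % code)
            for code in codes.split()]


for description in decode_vmflags("VmFlags: rd ex mr mw me de"):
    print(description)
```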
471
433This file is only present if the CONFIG_MMU kernel configuration option is 472This file is only present if the CONFIG_MMU kernel configuration option is
434enabled. 473enabled.
435 474
@@ -1320,10 +1359,10 @@ of the kernel.
1320CHAPTER 3: PER-PROCESS PARAMETERS 1359CHAPTER 3: PER-PROCESS PARAMETERS
1321------------------------------------------------------------------------------ 1360------------------------------------------------------------------------------
1322 1361
13233.1 /proc/<pid>/oom_score_adj- Adjust the oom-killer score 13623.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
1324-------------------------------------------------------------------------------- 1363--------------------------------------------------------------------------------
1325 1364
 1326This file can be used to adjust the badness heuristic used to select which 1365These files can be used to adjust the badness heuristic used to select which
1327process gets killed in out of memory conditions. 1366process gets killed in out of memory conditions.
1328 1367
1329The badness heuristic assigns a value to each candidate task ranging from 0 1368The badness heuristic assigns a value to each candidate task ranging from 0
@@ -1361,6 +1400,12 @@ same system, cpuset, mempolicy, or memory controller resources to use at least
1361equivalent to discounting 50% of the task's allowed memory from being considered 1400equivalent to discounting 50% of the task's allowed memory from being considered
1362as scoring against the task. 1401as scoring against the task.
1363 1402
1403For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
1404be used to tune the badness score. Its acceptable values range from -16
1405(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
1406(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
1407scaled linearly with /proc/<pid>/oom_score_adj.
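The linear scaling between the two files can be sketched as follows. The text
only says that the scaling is linear; the exact rule used here (multiply by
1000/17, truncate toward zero, and map +15 to the maximum) and the
oom_score_adj maximum of 1000 are assumptions made for the example.

```python
OOM_DISABLE = -17
OOM_ADJUST_MIN = -16
OOM_ADJUST_MAX = 15
OOM_SCORE_ADJ_MAX = 1000   # assumed upper bound of oom_score_adj


def oom_adj_to_score_adj(oom_adj):
    """Map a legacy oom_adj value onto the oom_score_adj scale (assumed rule)."""
    if not OOM_DISABLE <= oom_adj <= OOM_ADJUST_MAX:
        raise ValueError("oom_adj out of range: %d" % oom_adj)
    if oom_adj == OOM_ADJUST_MAX:
        # Assumption: the largest legacy value maps to the largest new value.
        return OOM_SCORE_ADJ_MAX
    # int() truncates toward zero, like integer division in C.
    return int(oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE)


for value in (OOM_DISABLE, OOM_ADJUST_MIN, 0, 8, OOM_ADJUST_MAX):
    print(value, oom_adj_to_score_adj(value))
```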
1408
1364The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last 1409The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
1365value set by a CAP_SYS_RESOURCE process. To reduce the value any lower 1410value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
1366requires CAP_SYS_RESOURCE. 1411requires CAP_SYS_RESOURCE.
@@ -1375,7 +1420,9 @@ minimal amount of work.
1375------------------------------------------------------------- 1420-------------------------------------------------------------
1376 1421
1377This file can be used to check the current score used by the oom-killer is for 1422This file can be used to check the current score used by the oom-killer is for
1378any given <pid>. 1423any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
1424process should be killed in an out-of-memory situation.
1425
1379 1426
13803.3 /proc/<pid>/io - Display the IO accounting fields 14273.3 /proc/<pid>/io - Display the IO accounting fields
1381------------------------------------------------------- 1428-------------------------------------------------------
@@ -1587,6 +1634,93 @@ pids, so one need to either stop or freeze processes being inspected
1587if precise results are needed. 1634if precise results are needed.
1588 1635
1589 1636
 16373.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
1638---------------------------------------------------------------
1639This file provides information associated with an opened file. The regular
1640files have at least two fields -- 'pos' and 'flags'. The 'pos' represents
1641the current offset of the opened file in decimal form [see lseek(2) for
1642details] and 'flags' denotes the octal O_xxx mask the file has been
1643created with [see open(2) for details].
1644
1645A typical output is
1646
1647 pos: 0
1648 flags: 0100002
1649
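A reader of this file needs to remember that 'pos' is decimal while 'flags' is
octal. The sketch below parses the two common fields from text in the format
shown above; the function name is hypothetical, and any additional
object-specific lines are simply ignored.

```python
import os


def parse_fdinfo(text):
    """Extract 'pos' (decimal) and 'flags' (octal) from fdinfo contents."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "pos":
            result["pos"] = int(value.strip(), 10)
        elif key == "flags":
            result["flags"] = int(value.strip(), 8)
    return result


sample = "pos:\t0\nflags:\t0100002\n"
info = parse_fdinfo(sample)
access_mode = info["flags"] & os.O_ACCMODE
print(info["pos"], oct(info["flags"]), access_mode == os.O_RDWR)
```

On a running system the same function could be applied to the contents of, for
example, /proc/self/fdinfo/0.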
 1650Files such as eventfd, fsnotify, signalfd and epoll provide, in addition to the
 1651regular pos/flags pair, information particular to the objects they represent.
1652
1653 Eventfd files
1654 ~~~~~~~~~~~~~
1655 pos: 0
1656 flags: 04002
1657 eventfd-count: 5a
1658
1659 where 'eventfd-count' is hex value of a counter.
1660
1661 Signalfd files
1662 ~~~~~~~~~~~~~~
1663 pos: 0
1664 flags: 04002
1665 sigmask: 0000000000000200
1666
1667 where 'sigmask' is hex value of the signal mask associated
1668 with a file.
1669
1670 Epoll files
1671 ~~~~~~~~~~~
1672 pos: 0
1673 flags: 02
1674 tfd: 5 events: 1d data: ffffffffffffffff
1675
1676 where 'tfd' is a target file descriptor number in decimal form,
1677 'events' is events mask being watched and the 'data' is data
1678 associated with a target [see epoll(7) for more details].
1679
1680 Fsnotify files
1681 ~~~~~~~~~~~~~~
1682 For inotify files the format is the following
1683
1684 pos: 0
1685 flags: 02000000
1686 inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
1687
1688 where 'wd' is a watch descriptor in decimal form, ie a target file
1689 descriptor number, 'ino' and 'sdev' are inode and device where the
1690 target file resides and the 'mask' is the mask of events, all in hex
1691 form [see inotify(7) for more details].
1692
1693 If the kernel was built with exportfs support, the path to the target
1694 file is encoded as a file handle. The file handle is provided by three
1695 fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex
1696 format.
1697
1698 If the kernel is built without exportfs support the file handle won't be
1699 printed out.
1700
1701 If there is no inotify mark attached yet the 'inotify' line will be omitted.
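 The 'inotify' line is a sequence of key:value tokens, so it can be split
 mechanically. The following sketch (with a hypothetical function name) treats
 'wd' as decimal and the remaining numeric fields as hex, as described above,
 and keeps 'f_handle' as a hex string because it is an opaque byte sequence.

```python
def parse_inotify_line(line):
    """Parse one 'inotify ...' line from fdinfo into a dictionary."""
    tokens = line.split()
    if not tokens or tokens[0] != "inotify":
        raise ValueError("not an inotify line: %r" % line)
    fields = {}
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        if key == "wd":
            fields[key] = int(value, 10)   # watch descriptor is decimal
        elif key == "f_handle":
            fields[key] = value            # opaque handle, kept as hex text
        else:
            fields[key] = int(value, 16)   # everything else is hex
    return fields


line = ("inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 "
        "fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d")
mark = parse_inotify_line(line)
print(mark["wd"], hex(mark["ino"]), hex(mark["mask"]), mark["f_handle"])
```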
1702
1703 For fanotify files the format is
1704
1705 pos: 0
1706 flags: 02
1707 fanotify flags:10 event-flags:0
1708 fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
1709 fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
1710
1711 where fanotify 'flags' and 'event-flags' are values used in fanotify_init
1712 call, 'mnt_id' is the mount point identifier, 'mflags' is the value of
1713 flags associated with mark which are tracked separately from events
1714 mask. 'ino', 'sdev' are target inode and device, 'mask' is the events
1715 mask and 'ignored_mask' is the mask of events which are to be ignored.
1716 All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
1717 does provide information about flags and mask used in fanotify_mark
1718 call [see fsnotify manpage for details].
1719
1720 While the first three lines are mandatory and always printed, the rest is
 1721 optional and may be omitted if no marks have been created yet.
1722
1723
1590------------------------------------------------------------------------------ 1724------------------------------------------------------------------------------
1591Configuring procfs 1725Configuring procfs
1592------------------------------------------------------------------------------ 1726------------------------------------------------------------------------------
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt
index de1e6c4dccff..d230dd9c99b0 100644
--- a/Documentation/filesystems/vfat.txt
+++ b/Documentation/filesystems/vfat.txt
@@ -111,6 +111,15 @@ tz=UTC -- Interpret timestamps as UTC rather than local time.
111 useful when mounting devices (like digital cameras) 111 useful when mounting devices (like digital cameras)
112 that are set to UTC in order to avoid the pitfalls of 112 that are set to UTC in order to avoid the pitfalls of
113 local time. 113 local time.
114time_offset=minutes
115 -- Set offset for conversion of timestamps from local time
116 used by FAT to UTC. I.e. <minutes> minutes will be subtracted
117 from each timestamp to convert it to UTC used internally by
118 Linux. This is useful when time zone set in sys_tz is
119 not the time zone used by the filesystem. Note that this
120 option still does not provide correct time stamps in all
121 cases in presence of DST - time stamps in a different DST
122 setting will be off by one hour.
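The conversion performed by time_offset amounts to a fixed subtraction, which
the sketch below illustrates. The function name and the sample values are made
up for the example; it also shows why a fixed offset cannot be right on both
sides of a DST change.

```python
from datetime import datetime, timedelta


def fat_local_to_utc(local_timestamp, time_offset_minutes):
    """Subtract the mount-time offset from a FAT local timestamp."""
    return local_timestamp - timedelta(minutes=time_offset_minutes)


# A file stamped 11:47 local time on a volume mounted with time_offset=120.
stamp = datetime(2013, 5, 1, 11, 47)
print(fat_local_to_utc(stamp, 120))

# The same fixed offset applied to a winter timestamp: if the camera clock
# was then running one hour behind summer time, the result is off by an hour.
winter = datetime(2013, 1, 15, 11, 47)
print(fat_local_to_utc(winter, 120))
```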
114 123
115showexec -- If set, the execute permission bits of the file will be 124showexec -- If set, the execute permission bits of the file will be
116 allowed only if the extension part of the name is .EXE, 125 allowed only if the extension part of the name is .EXE,
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 2ee133e030c3..bc4b06b3160a 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -350,7 +350,6 @@ struct inode_operations {
350 int (*readlink) (struct dentry *, char __user *,int); 350 int (*readlink) (struct dentry *, char __user *,int);
351 void * (*follow_link) (struct dentry *, struct nameidata *); 351 void * (*follow_link) (struct dentry *, struct nameidata *);
352 void (*put_link) (struct dentry *, struct nameidata *, void *); 352 void (*put_link) (struct dentry *, struct nameidata *, void *);
353 void (*truncate) (struct inode *);
354 int (*permission) (struct inode *, int); 353 int (*permission) (struct inode *, int);
355 int (*get_acl)(struct inode *, int); 354 int (*get_acl)(struct inode *, int);
356 int (*setattr) (struct dentry *, struct iattr *); 355 int (*setattr) (struct dentry *, struct iattr *);
@@ -431,16 +430,6 @@ otherwise noted.
431 started might not be in the page cache at the end of the 430 started might not be in the page cache at the end of the
432 walk). 431 walk).
433 432
434 truncate: Deprecated. This will not be called if ->setsize is defined.
435 Called by the VFS to change the size of a file. The
436 i_size field of the inode is set to the desired size by the
437 VFS before this method is called. This method is called by
438 the truncate(2) system call and related functionality.
439
440 Note: ->truncate and vmtruncate are deprecated. Do not add new
441 instances/calls of these. Filesystems should be converted to do their
442 truncate sequence via ->setattr().
443
444 permission: called by the VFS to check for access rights on a POSIX-like 433 permission: called by the VFS to check for access rights on a POSIX-like
445 filesystem. 434 filesystem.
446 435
@@ -911,6 +900,7 @@ defined:
911 900
912struct dentry_operations { 901struct dentry_operations {
913 int (*d_revalidate)(struct dentry *, unsigned int); 902 int (*d_revalidate)(struct dentry *, unsigned int);
903 int (*d_weak_revalidate)(struct dentry *, unsigned int);
914 int (*d_hash)(const struct dentry *, const struct inode *, 904 int (*d_hash)(const struct dentry *, const struct inode *,
915 struct qstr *); 905 struct qstr *);
916 int (*d_compare)(const struct dentry *, const struct inode *, 906 int (*d_compare)(const struct dentry *, const struct inode *,
@@ -926,8 +916,13 @@ struct dentry_operations {
926 916
927 d_revalidate: called when the VFS needs to revalidate a dentry. This 917 d_revalidate: called when the VFS needs to revalidate a dentry. This
928 is called whenever a name look-up finds a dentry in the 918 is called whenever a name look-up finds a dentry in the
929 dcache. Most filesystems leave this as NULL, because all their 919 dcache. Most local filesystems leave this as NULL, because all their
930 dentries in the dcache are valid 920 dentries in the dcache are valid. Network filesystems are different
921 since things can change on the server without the client necessarily
922 being aware of it.
923
924 This function should return a positive value if the dentry is still
925 valid, and zero or a negative error code if it isn't.
931 926
932 d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). 927 d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU).
933 If in rcu-walk mode, the filesystem must revalidate the dentry without 928 If in rcu-walk mode, the filesystem must revalidate the dentry without
@@ -938,6 +933,20 @@ struct dentry_operations {
938 If a situation is encountered that rcu-walk cannot handle, return 933 If a situation is encountered that rcu-walk cannot handle, return
939 -ECHILD and it will be called again in ref-walk mode. 934 -ECHILD and it will be called again in ref-walk mode.
940 935
936 d_weak_revalidate: called when the VFS needs to revalidate a "jumped" dentry.
 937 This is called when a path-walk ends at a dentry that was not acquired by
938 doing a lookup in the parent directory. This includes "/", "." and "..",
939 as well as procfs-style symlinks and mountpoint traversal.
940
941 In this case, we are less concerned with whether the dentry is still
942 fully correct, but rather that the inode is still valid. As with
943 d_revalidate, most local filesystems will set this to NULL since their
944 dcache entries are always valid.
945
946 This function has the same return code semantics as d_revalidate.
947
948 d_weak_revalidate is only called after leaving rcu-walk mode.
949
941 d_hash: called when the VFS adds a dentry to the hash table. The first 950 d_hash: called when the VFS adds a dentry to the hash table. The first
942 dentry passed to d_hash is the parent directory that the name is 951 dentry passed to d_hash is the parent directory that the name is
943 to be hashed into. The inode is the dentry's inode. 952 to be hashed into. The inode is the dentry's inode.
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 3fc0c31a6f5d..3e4b3dd1e046 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -43,7 +43,7 @@ When mounting an XFS filesystem, the following options are accepted.
43 Issue command to let the block device reclaim space freed by the 43 Issue command to let the block device reclaim space freed by the
44 filesystem. This is useful for SSD devices, thinly provisioned 44 filesystem. This is useful for SSD devices, thinly provisioned
45 LUNs and virtual machine images, but may have a performance 45 LUNs and virtual machine images, but may have a performance
46 impact. This option is incompatible with the nodelaylog option. 46 impact.
47 47
48 dmapi 48 dmapi
49 Enable the DMAPI (Data Management API) event callouts. 49 Enable the DMAPI (Data Management API) event callouts.
@@ -72,8 +72,15 @@ When mounting an XFS filesystem, the following options are accepted.
72 Indicates that XFS is allowed to create inodes at any location 72 Indicates that XFS is allowed to create inodes at any location
73 in the filesystem, including those which will result in inode 73 in the filesystem, including those which will result in inode
74 numbers occupying more than 32 bits of significance. This is 74 numbers occupying more than 32 bits of significance. This is
75 provided for backwards compatibility, but causes problems for 75 the default allocation option. Applications which do not handle
76 backup applications that cannot handle large inode numbers. 76 inode numbers bigger than 32 bits, should use inode32 option.
77
78 inode32
 79 Indicates that XFS is limited to creating inodes at locations which
80 will not result in inode numbers with more than 32 bits of
81 significance. This is provided for backwards compatibility, since
 82 64-bit inode numbers might cause problems for some applications
83 that cannot handle large inode numbers.
77 84
78 largeio/nolargeio 85 largeio/nolargeio
79 If "nolargeio" is specified, the optimal I/O reported in 86 If "nolargeio" is specified, the optimal I/O reported in