diff options
author | Thomas Gleixner <tglx@linutronix.de> | 2013-05-05 02:24:42 -0400 |
---|---|---|
committer | Thomas Gleixner <tglx@linutronix.de> | 2013-05-05 02:27:03 -0400 |
commit | f99e44a7f3352d7131c7526207f153f13ec5acd4 (patch) | |
tree | 0f448b21128c478053ee7f7765b865954c4eebe8 /Documentation/filesystems | |
parent | fd29f424d458118f02e89596505c68a63dcb3007 (diff) | |
parent | ce857229e0c3adc211944a13a5579ef84fd7b4af (diff) |
Merge branch 'linus' into core/urgent
Update with Linus tree so fixes for the same can be applied.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/ext4.txt | 21 | ||||
-rw-r--r-- | Documentation/filesystems/vfat.txt | 26 | ||||
-rw-r--r-- | Documentation/filesystems/xfs-self-describing-metadata.txt | 350 |
3 files changed, 392 insertions, 5 deletions
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 34ea4f1fa6ea..f7cbf574a875 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt | |||
@@ -494,6 +494,17 @@ Files in /sys/fs/ext4/<devname> | |||
494 | session_write_kbytes This file is read-only and shows the number of | 494 | session_write_kbytes This file is read-only and shows the number of |
495 | kilobytes of data that have been written to this | 495 | kilobytes of data that have been written to this |
496 | filesystem since it was mounted. | 496 | filesystem since it was mounted. |
497 | |||
498 | reserved_clusters This is RW file and contains number of reserved | ||
499 | clusters in the file system which will be used | ||
500 | in the specific situations to avoid costly | ||
501 | zeroout, unexpected ENOSPC, or possible data | ||
502 | loss. The default is 2% or 4096 clusters, | ||
503 | whichever is smaller and this can be changed | ||
504 | however it can never exceed number of clusters | ||
505 | in the file system. If there is not enough space | ||
506 | for the reserved space when mounting the file | ||
507 | mount will _not_ fail. | ||
497 | .............................................................................. | 508 | .............................................................................. |
498 | 509 | ||
499 | Ioctls | 510 | Ioctls |
@@ -587,6 +598,16 @@ Table of Ext4 specific ioctls | |||
587 | bitmaps and inode table, the userspace tool thus | 598 | bitmaps and inode table, the userspace tool thus |
588 | just passes the new number of blocks. | 599 | just passes the new number of blocks. |
589 | 600 | ||
601 | EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes | ||
602 | (like i_blocks, i_size, i_flags, ...) from | ||
603 | the specified inode with inode | ||
604 | EXT4_BOOT_LOADER_INO (#5). This is typically | ||
605 | used to store a boot loader in a secure part of | ||
606 | the filesystem, where it can't be changed by a | ||
607 | normal user by accident. | ||
608 | The data blocks of the previous boot loader | ||
609 | will be associated with the given inode. | ||
610 | |||
590 | .............................................................................. | 611 | .............................................................................. |
591 | 612 | ||
592 | References | 613 | References |
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index d230dd9c99b0..4a93e98b290a 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt | |||
@@ -150,12 +150,28 @@ discard -- If set, issues discard/TRIM commands to the block | |||
150 | device when blocks are freed. This is useful for SSD devices | 150 | device when blocks are freed. This is useful for SSD devices |
151 | and sparse/thinly-provisoned LUNs. | 151 | and sparse/thinly-provisoned LUNs. |
152 | 152 | ||
153 | nfs -- This option maintains an index (cache) of directory | 153 | nfs=stale_rw|nostale_ro |
154 | inodes by i_logstart which is used by the nfs-related code to | 154 | Enable this only if you want to export the FAT filesystem |
155 | improve look-ups. | 155 | over NFS. |
156 | |||
157 | stale_rw: This option maintains an index (cache) of directory | ||
158 | inodes by i_logstart which is used by the nfs-related code to | ||
159 | improve look-ups. Full file operations (read/write) over NFS is | ||
160 | supported but with cache eviction at NFS server, this could | ||
161 | result in ESTALE issues. | ||
162 | |||
163 | nostale_ro: This option bases the inode number and filehandle | ||
164 | on the on-disk location of a file in the MS-DOS directory entry. | ||
165 | This ensures that ESTALE will not be returned after a file is | ||
166 | evicted from the inode cache. However, it means that operations | ||
167 | such as rename, create and unlink could cause filehandles that | ||
168 | previously pointed at one file to point at a different file, | ||
169 | potentially causing data corruption. For this reason, this | ||
170 | option also mounts the filesystem readonly. | ||
171 | |||
172 | To maintain backward compatibility, '-o nfs' is also accepted, | ||
173 | defaulting to stale_rw | ||
156 | 174 | ||
157 | Enable this only if you want to export the FAT filesystem | ||
158 | over NFS | ||
159 | 175 | ||
160 | <bool>: 0,1,yes,no,true,false | 176 | <bool>: 0,1,yes,no,true,false |
161 | 177 | ||
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt new file mode 100644 index 000000000000..05aa455163e3 --- /dev/null +++ b/Documentation/filesystems/xfs-self-describing-metadata.txt | |||
@@ -0,0 +1,350 @@ | |||
1 | XFS Self Describing Metadata | ||
2 | ---------------------------- | ||
3 | |||
4 | Introduction | ||
5 | ------------ | ||
6 | |||
7 | The largest scalability problem facing XFS is not one of algorithmic | ||
8 | scalability, but of verification of the filesystem structure. Scalabilty of the | ||
9 | structures and indexes on disk and the algorithms for iterating them are | ||
10 | adequate for supporting PB scale filesystems with billions of inodes, however it | ||
11 | is this very scalability that causes the verification problem. | ||
12 | |||
13 | Almost all metadata on XFS is dynamically allocated. The only fixed location | ||
14 | metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all | ||
15 | other metadata structures need to be discovered by walking the filesystem | ||
16 | structure in different ways. While this is already done by userspace tools for | ||
17 | validating and repairing the structure, there are limits to what they can | ||
18 | verify, and this in turn limits the supportable size of an XFS filesystem. | ||
19 | |||
20 | For example, it is entirely possible to manually use xfs_db and a bit of | ||
21 | scripting to analyse the structure of a 100TB filesystem when trying to | ||
22 | determine the root cause of a corruption problem, but it is still mainly a | ||
23 | manual task of verifying that things like single bit errors or misplaced writes | ||
24 | weren't the ultimate cause of a corruption event. It may take a few hours to a | ||
25 | few days to perform such forensic analysis, so for at this scale root cause | ||
26 | analysis is entirely possible. | ||
27 | |||
28 | However, if we scale the filesystem up to 1PB, we now have 10x as much metadata | ||
29 | to analyse and so that analysis blows out towards weeks/months of forensic work. | ||
30 | Most of the analysis work is slow and tedious, so as the amount of analysis goes | ||
31 | up, the more likely that the cause will be lost in the noise. Hence the primary | ||
32 | concern for supporting PB scale filesystems is minimising the time and effort | ||
33 | required for basic forensic analysis of the filesystem structure. | ||
34 | |||
35 | |||
36 | Self Describing Metadata | ||
37 | ------------------------ | ||
38 | |||
39 | One of the problems with the current metadata format is that apart from the | ||
40 | magic number in the metadata block, we have no other way of identifying what it | ||
41 | is supposed to be. We can't even identify if it is the right place. Put simply, | ||
42 | you can't look at a single metadata block in isolation and say "yes, it is | ||
43 | supposed to be there and the contents are valid". | ||
44 | |||
45 | Hence most of the time spent on forensic analysis is spent doing basic | ||
46 | verification of metadata values, looking for values that are in range (and hence | ||
47 | not detected by automated verification checks) but are not correct. Finding and | ||
48 | understanding how things like cross linked block lists (e.g. sibling | ||
49 | pointers in a btree end up with loops in them) are the key to understanding what | ||
50 | went wrong, but it is impossible to tell what order the blocks were linked into | ||
51 | each other or written to disk after the fact. | ||
52 | |||
53 | Hence we need to record more information into the metadata to allow us to | ||
54 | quickly determine if the metadata is intact and can be ignored for the purpose | ||
55 | of analysis. We can't protect against every possible type of error, but we can | ||
56 | ensure that common types of errors are easily detectable. Hence the concept of | ||
57 | self describing metadata. | ||
58 | |||
59 | The first, fundamental requirement of self describing metadata is that the | ||
60 | metadata object contains some form of unique identifier in a well known | ||
61 | location. This allows us to identify the expected contents of the block and | ||
62 | hence parse and verify the metadata object. IF we can't independently identify | ||
63 | the type of metadata in the object, then the metadata doesn't describe itself | ||
64 | very well at all! | ||
65 | |||
66 | Luckily, almost all XFS metadata has magic numbers embedded already - only the | ||
67 | AGFL, remote symlinks and remote attribute blocks do not contain identifying | ||
68 | magic numbers. Hence we can change the on-disk format of all these objects to | ||
69 | add more identifying information and detect this simply by changing the magic | ||
70 | numbers in the metadata objects. That is, if it has the current magic number, | ||
71 | the metadata isn't self identifying. If it contains a new magic number, it is | ||
72 | self identifying and we can do much more expansive automated verification of the | ||
73 | metadata object at runtime, during forensic analysis or repair. | ||
74 | |||
75 | As a primary concern, self describing metadata needs some form of overall | ||
76 | integrity checking. We cannot trust the metadata if we cannot verify that it has | ||
77 | not been changed as a result of external influences. Hence we need some form of | ||
78 | integrity check, and this is done by adding CRC32c validation to the metadata | ||
79 | block. If we can verify the block contains the metadata it was intended to | ||
80 | contain, a large amount of the manual verification work can be skipped. | ||
81 | |||
82 | CRC32c was selected as metadata cannot be more than 64k in length in XFS and | ||
83 | hence a 32 bit CRC is more than sufficient to detect multi-bit errors in | ||
84 | metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is | ||
85 | fast. So while CRC32c is not the strongest of possible integrity checks that | ||
86 | could be used, it is more than sufficient for our needs and has relatively | ||
87 | little overhead. Adding support for larger integrity fields and/or algorithms | ||
88 | does really provide any extra value over CRC32c, but it does add a lot of | ||
89 | complexity and so there is no provision for changing the integrity checking | ||
90 | mechanism. | ||
91 | |||
92 | Self describing metadata needs to contain enough information so that the | ||
93 | metadata block can be verified as being in the correct place without needing to | ||
94 | look at any other metadata. This means it needs to contain location information. | ||
95 | Just adding a block number to the metadata is not sufficient to protect against | ||
96 | mis-directed writes - a write might be misdirected to the wrong LUN and so be | ||
97 | written to the "correct block" of the wrong filesystem. Hence location | ||
98 | information must contain a filesystem identifier as well as a block number. | ||
99 | |||
100 | Another key information point in forensic analysis is knowing who the metadata | ||
101 | block belongs to. We already know the type, the location, that it is valid | ||
102 | and/or corrupted, and how long ago that it was last modified. Knowing the owner | ||
103 | of the block is important as it allows us to find other related metadata to | ||
104 | determine the scope of the corruption. For example, if we have a extent btree | ||
105 | object, we don't know what inode it belongs to and hence have to walk the entire | ||
106 | filesystem to find the owner of the block. Worse, the corruption could mean that | ||
107 | no owner can be found (i.e. it's an orphan block), and so without an owner field | ||
108 | in the metadata we have no idea of the scope of the corruption. If we have an | ||
109 | owner field in the metadata object, we can immediately do top down validation to | ||
110 | determine the scope of the problem. | ||
111 | |||
112 | Different types of metadata have different owner identifiers. For example, | ||
113 | directory, attribute and extent tree blocks are all owned by an inode, whilst | ||
114 | freespace btree blocks are owned by an allocation group. Hence the size and | ||
115 | contents of the owner field are determined by the type of metadata object we are | ||
116 | looking at. The owner information can also identify misplaced writes (e.g. | ||
117 | freespace btree block written to the wrong AG). | ||
118 | |||
119 | Self describing metadata also needs to contain some indication of when it was | ||
120 | written to the filesystem. One of the key information points when doing forensic | ||
121 | analysis is how recently the block was modified. Correlation of set of corrupted | ||
122 | metadata blocks based on modification times is important as it can indicate | ||
123 | whether the corruptions are related, whether there's been multiple corruption | ||
124 | events that lead to the eventual failure, and even whether there are corruptions | ||
125 | present that the run-time verification is not detecting. | ||
126 | |||
127 | For example, we can determine whether a metadata object is supposed to be free | ||
128 | space or still allocated if it is still referenced by its owner by looking at | ||
129 | when the free space btree block that contains the block was last written | ||
130 | compared to when the metadata object itself was last written. If the free space | ||
131 | block is more recent than the object and the object's owner, then there is a | ||
132 | very good chance that the block should have been removed from the owner. | ||
133 | |||
134 | To provide this "written timestamp", each metadata block gets the Log Sequence | ||
135 | Number (LSN) of the most recent transaction it was modified on written into it. | ||
136 | This number will always increase over the life of the filesystem, and the only | ||
137 | thing that resets it is running xfs_repair on the filesystem. Further, by use of | ||
138 | the LSN we can tell if the corrupted metadata all belonged to the same log | ||
139 | checkpoint and hence have some idea of how much modification occurred between | ||
140 | the first and last instance of corrupt metadata on disk and, further, how much | ||
141 | modification occurred between the corruption being written and when it was | ||
142 | detected. | ||
143 | |||
144 | Runtime Validation | ||
145 | ------------------ | ||
146 | |||
147 | Validation of self-describing metadata takes place at runtime in two places: | ||
148 | |||
149 | - immediately after a successful read from disk | ||
150 | - immediately prior to write IO submission | ||
151 | |||
152 | The verification is completely stateless - it is done independently of the | ||
153 | modification process, and seeks only to check that the metadata is what it says | ||
154 | it is and that the metadata fields are within bounds and internally consistent. | ||
155 | As such, we cannot catch all types of corruption that can occur within a block | ||
156 | as there may be certain limitations that operational state enforces of the | ||
157 | metadata, or there may be corruption of interblock relationships (e.g. corrupted | ||
158 | sibling pointer lists). Hence we still need stateful checking in the main code | ||
159 | body, but in general most of the per-field validation is handled by the | ||
160 | verifiers. | ||
161 | |||
162 | For read verification, the caller needs to specify the expected type of metadata | ||
163 | that it should see, and the IO completion process verifies that the metadata | ||
164 | object matches what was expected. If the verification process fails, then it | ||
165 | marks the object being read as EFSCORRUPTED. The caller needs to catch this | ||
166 | error (same as for IO errors), and if it needs to take special action due to a | ||
167 | verification error it can do so by catching the EFSCORRUPTED error value. If we | ||
168 | need more discrimination of error type at higher levels, we can define new | ||
169 | error numbers for different errors as necessary. | ||
170 | |||
171 | The first step in read verification is checking the magic number and determining | ||
172 | whether CRC validating is necessary. If it is, the CRC32c is calculated and | ||
173 | compared against the value stored in the object itself. Once this is validated, | ||
174 | further checks are made against the location information, followed by extensive | ||
175 | object specific metadata validation. If any of these checks fail, then the | ||
176 | buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. | ||
177 | |||
178 | Write verification is the opposite of the read verification - first the object | ||
179 | is extensively verified and if it is OK we then update the LSN from the last | ||
180 | modification made to the object, After this, we calculate the CRC and insert it | ||
181 | into the object. Once this is done the write IO is allowed to continue. If any | ||
182 | error occurs during this process, the buffer is again marked with a EFSCORRUPTED | ||
183 | error for the higher layers to catch. | ||
184 | |||
185 | Structures | ||
186 | ---------- | ||
187 | |||
188 | A typical on-disk structure needs to contain the following information: | ||
189 | |||
190 | struct xfs_ondisk_hdr { | ||
191 | __be32 magic; /* magic number */ | ||
192 | __be32 crc; /* CRC, not logged */ | ||
193 | uuid_t uuid; /* filesystem identifier */ | ||
194 | __be64 owner; /* parent object */ | ||
195 | __be64 blkno; /* location on disk */ | ||
196 | __be64 lsn; /* last modification in log, not logged */ | ||
197 | }; | ||
198 | |||
199 | Depending on the metadata, this information may be part of a header structure | ||
200 | separate to the metadata contents, or may be distributed through an existing | ||
201 | structure. The latter occurs with metadata that already contains some of this | ||
202 | information, such as the superblock and AG headers. | ||
203 | |||
204 | Other metadata may have different formats for the information, but the same | ||
205 | level of information is generally provided. For example: | ||
206 | |||
207 | - short btree blocks have a 32 bit owner (ag number) and a 32 bit block | ||
208 | number for location. The two of these combined provide the same | ||
209 | information as @owner and @blkno in eh above structure, but using 8 | ||
210 | bytes less space on disk. | ||
211 | |||
212 | - directory/attribute node blocks have a 16 bit magic number, and the | ||
213 | header that contains the magic number has other information in it as | ||
214 | well. hence the additional metadata headers change the overall format | ||
215 | of the metadata. | ||
216 | |||
217 | A typical buffer read verifier is structured as follows: | ||
218 | |||
219 | #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) | ||
220 | |||
221 | static void | ||
222 | xfs_foo_read_verify( | ||
223 | struct xfs_buf *bp) | ||
224 | { | ||
225 | struct xfs_mount *mp = bp->b_target->bt_mount; | ||
226 | |||
227 | if ((xfs_sb_version_hascrc(&mp->m_sb) && | ||
228 | !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), | ||
229 | XFS_FOO_CRC_OFF)) || | ||
230 | !xfs_foo_verify(bp)) { | ||
231 | XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); | ||
232 | xfs_buf_ioerror(bp, EFSCORRUPTED); | ||
233 | } | ||
234 | } | ||
235 | |||
236 | The code ensures that the CRC is only checked if the filesystem has CRCs enabled | ||
237 | by checking the superblock of the feature bit, and then if the CRC verifies OK | ||
238 | (or is not needed) it verifies the actual contents of the block. | ||
239 | |||
240 | The verifier function will take a couple of different forms, depending on | ||
241 | whether the magic number can be used to determine the format of the block. In | ||
242 | the case it can't, the code is structured as follows: | ||
243 | |||
244 | static bool | ||
245 | xfs_foo_verify( | ||
246 | struct xfs_buf *bp) | ||
247 | { | ||
248 | struct xfs_mount *mp = bp->b_target->bt_mount; | ||
249 | struct xfs_ondisk_hdr *hdr = bp->b_addr; | ||
250 | |||
251 | if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) | ||
252 | return false; | ||
253 | |||
254 | if (!xfs_sb_version_hascrc(&mp->m_sb)) { | ||
255 | if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) | ||
256 | return false; | ||
257 | if (bp->b_bn != be64_to_cpu(hdr->blkno)) | ||
258 | return false; | ||
259 | if (hdr->owner == 0) | ||
260 | return false; | ||
261 | } | ||
262 | |||
263 | /* object specific verification checks here */ | ||
264 | |||
265 | return true; | ||
266 | } | ||
267 | |||
268 | If there are different magic numbers for the different formats, the verifier | ||
269 | will look like: | ||
270 | |||
271 | static bool | ||
272 | xfs_foo_verify( | ||
273 | struct xfs_buf *bp) | ||
274 | { | ||
275 | struct xfs_mount *mp = bp->b_target->bt_mount; | ||
276 | struct xfs_ondisk_hdr *hdr = bp->b_addr; | ||
277 | |||
278 | if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { | ||
279 | if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) | ||
280 | return false; | ||
281 | if (bp->b_bn != be64_to_cpu(hdr->blkno)) | ||
282 | return false; | ||
283 | if (hdr->owner == 0) | ||
284 | return false; | ||
285 | } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) | ||
286 | return false; | ||
287 | |||
288 | /* object specific verification checks here */ | ||
289 | |||
290 | return true; | ||
291 | } | ||
292 | |||
293 | Write verifiers are very similar to the read verifiers, they just do things in | ||
294 | the opposite order to the read verifiers. A typical write verifier: | ||
295 | |||
296 | static void | ||
297 | xfs_foo_write_verify( | ||
298 | struct xfs_buf *bp) | ||
299 | { | ||
300 | struct xfs_mount *mp = bp->b_target->bt_mount; | ||
301 | struct xfs_buf_log_item *bip = bp->b_fspriv; | ||
302 | |||
303 | if (!xfs_foo_verify(bp)) { | ||
304 | XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); | ||
305 | xfs_buf_ioerror(bp, EFSCORRUPTED); | ||
306 | return; | ||
307 | } | ||
308 | |||
309 | if (!xfs_sb_version_hascrc(&mp->m_sb)) | ||
310 | return; | ||
311 | |||
312 | |||
313 | if (bip) { | ||
314 | struct xfs_ondisk_hdr *hdr = bp->b_addr; | ||
315 | hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); | ||
316 | } | ||
317 | xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); | ||
318 | } | ||
319 | |||
320 | This will verify the internal structure of the metadata before we go any | ||
321 | further, detecting corruptions that have occurred as the metadata has been | ||
322 | modified in memory. If the metadata verifies OK, and CRCs are enabled, we then | ||
323 | update the LSN field (when it was last modified) and calculate the CRC on the | ||
324 | metadata. Once this is done, we can issue the IO. | ||
325 | |||
326 | Inodes and Dquots | ||
327 | ----------------- | ||
328 | |||
329 | Inodes and dquots are special snowflakes. They have per-object CRC and | ||
330 | self-identifiers, but they are packed so that there are multiple objects per | ||
331 | buffer. Hence we do not use per-buffer verifiers to do the work of per-object | ||
332 | verification and CRC calculations. The per-buffer verifiers simply perform basic | ||
333 | identification of the buffer - that they contain inodes or dquots, and that | ||
334 | there are magic numbers in all the expected spots. All further CRC and | ||
335 | verification checks are done when each inode is read from or written back to the | ||
336 | buffer. | ||
337 | |||
338 | The structure of the verifiers and the identifiers checks is very similar to the | ||
339 | buffer code described above. The only difference is where they are called. For | ||
340 | example, inode read verification is done in xfs_iread() when the inode is first | ||
341 | read out of the buffer and the struct xfs_inode is instantiated. The inode is | ||
342 | already extensively verified during writeback in xfs_iflush_int, so the only | ||
343 | addition here is to add the LSN and CRC to the inode as it is copied back into | ||
344 | the buffer. | ||
345 | |||
346 | XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of | ||
347 | the unlinked list modifications check or update CRCs, neither during unlink nor | ||
348 | log recovery. So, it's gone unnoticed until now. This won't matter immediately - | ||
349 | repair will probably complain about it - but it needs to be fixed. | ||
350 | |||