Merge tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Ben Myers: "For 3.10-rc1 we have a number of bug fixes and cleanups and a currently experimental feature from David Chinner, CRCs protection for metadata. CRCs are enabled by using mkfs.xfs to create a filesystem with the feature bits set. - numerous fixes for speculative preallocation - don't verify buffers on IO errors - rename of random32 to prandom32 - refactoring/rearrangement in xfs_bmap.c - removal of unused m_inode_shrink in struct xfs_mount - fix error handling of xfs_bufs and readahead - quota driven preallocation throttling - fix WARN_ON in xfs_vm_releasepage - add ratelimited printk for different alert levels - fix spurious forced shutdowns due to freed Extent Free Intents - remove some obsolete XLOG_CIL_HARD_SPACE_LIMIT() macros - remove some obsoleted comments - (experimental) CRC support for metadata" * tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfs: (46 commits) xfs: fix da node magic number mismatches xfs: Remote attr validation fixes and optimisations xfs: Teach dquot recovery about CONFIG_XFS_QUOTA xfs: add metadata CRC documentation xfs: implement extended feature masks xfs: add CRC checks to the superblock xfs: buffer type overruns blf_flags field xfs: add buffer types to directory and attribute buffers xfs: add CRC protection to remote attributes xfs: split remote attribute code out xfs: add CRCs to attr leaf blocks xfs: add CRCs to dir2/da node blocks xfs: shortform directory offsets change for dir3 format xfs: add CRC checking to dir2 leaf blocks xfs: add CRC checking to dir2 data blocks xfs: add CRC checking to dir2 free blocks xfs: add CRC checks to block format directory blocks xfs: add CRC checks to remote symlinks xfs: split out symlink code into it's own file. xfs: add version 3 inode format with CRCs ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2013-05-02 17:49:33 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2013-05-02 17:49:33 -0400
commit: c8d8566952fda026966784a62f324c8352f77430 (patch)
tree: 89f25a47ce7662768bc6780bb7daeb85070784f1 /Documentation/filesystems
parent: a9586d9be812be4a0046ad4d312b013e587607cb (diff)
parent: cab09a81fbefcb21db5213a84461d421946f6eb8 (diff)
1 files changed, 350 insertions, 0 deletions
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt
new file mode 100644
index 000000000000..05aa455163e3
--- /dev/null
+++ b/Documentation/filesystems/xfs-self-describing-metadata.txt
@@ -0,0 +1,350 @@
+XFS Self Describing Metadata
+----------------------------
+Introduction
+------------
+The largest scalability problem facing XFS is not one of algorithmic
+scalability, but of verification of the filesystem structure. Scalabilty of the
+structures and indexes on disk and the algorithms for iterating them are
+adequate for supporting PB scale filesystems with billions of inodes, however it
+is this very scalability that causes the verification problem.
+Almost all metadata on XFS is dynamically allocated. The only fixed location
+metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
+other metadata structures need to be discovered by walking the filesystem
+structure in different ways. While this is already done by userspace tools for
+validating and repairing the structure, there are limits to what they can
+verify, and this in turn limits the supportable size of an XFS filesystem.
+For example, it is entirely possible to manually use xfs_db and a bit of
+scripting to analyse the structure of a 100TB filesystem when trying to
+determine the root cause of a corruption problem, but it is still mainly a
+manual task of verifying that things like single bit errors or misplaced writes
+weren't the ultimate cause of a corruption event. It may take a few hours to a
+few days to perform such forensic analysis, so for at this scale root cause
+analysis is entirely possible.
+However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
+to analyse and so that analysis blows out towards weeks/months of forensic work.
+Most of the analysis work is slow and tedious, so as the amount of analysis goes
+up, the more likely that the cause will be lost in the noise.  Hence the primary
+concern for supporting PB scale filesystems is minimising the time and effort
+required for basic forensic analysis of the filesystem structure.
+Self Describing Metadata
+------------------------
+One of the problems with the current metadata format is that apart from the
+magic number in the metadata block, we have no other way of identifying what it
+is supposed to be. We can't even identify if it is the right place. Put simply,
+you can't look at a single metadata block in isolation and say "yes, it is
+supposed to be there and the contents are valid".
+Hence most of the time spent on forensic analysis is spent doing basic
+verification of metadata values, looking for values that are in range (and hence
+not detected by automated verification checks) but are not correct. Finding and
+understanding how things like cross linked block lists (e.g. sibling
+pointers in a btree end up with loops in them) are the key to understanding what
+went wrong, but it is impossible to tell what order the blocks were linked into
+each other or written to disk after the fact.
+Hence we need to record more information into the metadata to allow us to
+quickly determine if the metadata is intact and can be ignored for the purpose
+of analysis. We can't protect against every possible type of error, but we can
+ensure that common types of errors are easily detectable.  Hence the concept of
+self describing metadata.
+The first, fundamental requirement of self describing metadata is that the
+metadata object contains some form of unique identifier in a well known
+location. This allows us to identify the expected contents of the block and
+hence parse and verify the metadata object. IF we can't independently identify
+the type of metadata in the object, then the metadata doesn't describe itself
+very well at all!
+Luckily, almost all XFS metadata has magic numbers embedded already - only the
+AGFL, remote symlinks and remote attribute blocks do not contain identifying
+magic numbers. Hence we can change the on-disk format of all these objects to
+add more identifying information and detect this simply by changing the magic
+numbers in the metadata objects. That is, if it has the current magic number,
+the metadata isn't self identifying. If it contains a new magic number, it is
+self identifying and we can do much more expansive automated verification of the
+metadata object at runtime, during forensic analysis or repair.
+As a primary concern, self describing metadata needs some form of overall
+integrity checking. We cannot trust the metadata if we cannot verify that it has
+not been changed as a result of external influences. Hence we need some form of
+integrity check, and this is done by adding CRC32c validation to the metadata
+block. If we can verify the block contains the metadata it was intended to
+contain, a large amount of the manual verification work can be skipped.
+CRC32c was selected as metadata cannot be more than 64k in length in XFS and
+hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
+metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
+fast. So while CRC32c is not the strongest of possible integrity checks that
+could be used, it is more than sufficient for our needs and has relatively
+little overhead. Adding support for larger integrity fields and/or algorithms
+does really provide any extra value over CRC32c, but it does add a lot of
+complexity and so there is no provision for changing the integrity checking
+mechanism.
+Self describing metadata needs to contain enough information so that the
+metadata block can be verified as being in the correct place without needing to
+look at any other metadata. This means it needs to contain location information.
+Just adding a block number to the metadata is not sufficient to protect against
+mis-directed writes - a write might be misdirected to the wrong LUN and so be
+written to the "correct block" of the wrong filesystem. Hence location
+information must contain a filesystem identifier as well as a block number.
+Another key information point in forensic analysis is knowing who the metadata
+block belongs to. We already know the type, the location, that it is valid
+and/or corrupted, and how long ago that it was last modified. Knowing the owner
+of the block is important as it allows us to find other related metadata to
+determine the scope of the corruption. For example, if we have a extent btree
+object, we don't know what inode it belongs to and hence have to walk the entire
+filesystem to find the owner of the block. Worse, the corruption could mean that
+no owner can be found (i.e. it's an orphan block), and so without an owner field
+in the metadata we have no idea of the scope of the corruption. If we have an
+owner field in the metadata object, we can immediately do top down validation to
+determine the scope of the problem.
+Different types of metadata have different owner identifiers. For example,
+directory, attribute and extent tree blocks are all owned by an inode, whilst
+freespace btree blocks are owned by an allocation group. Hence the size and
+contents of the owner field are determined by the type of metadata object we are
+looking at.  The owner information can also identify misplaced writes (e.g.
+freespace btree block written to the wrong AG).
+Self describing metadata also needs to contain some indication of when it was
+written to the filesystem. One of the key information points when doing forensic
+analysis is how recently the block was modified. Correlation of set of corrupted
+metadata blocks based on modification times is important as it can indicate
+whether the corruptions are related, whether there's been multiple corruption
+events that lead to the eventual failure, and even whether there are corruptions
+present that the run-time verification is not detecting.
+For example, we can determine whether a metadata object is supposed to be free
+space or still allocated if it is still referenced by its owner by looking at
+when the free space btree block that contains the block was last written
+compared to when the metadata object itself was last written.  If the free space
+block is more recent than the object and the object's owner, then there is a
+very good chance that the block should have been removed from the owner.
+To provide this "written timestamp", each metadata block gets the Log Sequence
+Number (LSN) of the most recent transaction it was modified on written into it.
+This number will always increase over the life of the filesystem, and the only
+thing that resets it is running xfs_repair on the filesystem. Further, by use of
+the LSN we can tell if the corrupted metadata all belonged to the same log
+checkpoint and hence have some idea of how much modification occurred between
+the first and last instance of corrupt metadata on disk and, further, how much
+modification occurred between the corruption being written and when it was
+detected.
+Runtime Validation
+------------------
+Validation of self-describing metadata takes place at runtime in two places:
+        - immediately after a successful read from disk
+        - immediately prior to write IO submission
+The verification is completely stateless - it is done independently of the
+modification process, and seeks only to check that the metadata is what it says
+it is and that the metadata fields are within bounds and internally consistent.
+As such, we cannot catch all types of corruption that can occur within a block
+as there may be certain limitations that operational state enforces of the
+metadata, or there may be corruption of interblock relationships (e.g. corrupted
+sibling pointer lists). Hence we still need stateful checking in the main code
+body, but in general most of the per-field validation is handled by the
+verifiers.
+For read verification, the caller needs to specify the expected type of metadata
+that it should see, and the IO completion process verifies that the metadata
+object matches what was expected. If the verification process fails, then it
+marks the object being read as EFSCORRUPTED. The caller needs to catch this
+error (same as for IO errors), and if it needs to take special action due to a
+verification error it can do so by catching the EFSCORRUPTED error value. If we
+need more discrimination of error type at higher levels, we can define new
+error numbers for different errors as necessary.
+The first step in read verification is checking the magic number and determining
+whether CRC validating is necessary. If it is, the CRC32c is calculated and
+compared against the value stored in the object itself. Once this is validated,
+further checks are made against the location information, followed by extensive
+object specific metadata validation. If any of these checks fail, then the
+buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
+Write verification is the opposite of the read verification - first the object
+is extensively verified and if it is OK we then update the LSN from the last
+modification made to the object, After this, we calculate the CRC and insert it
+into the object. Once this is done the write IO is allowed to continue. If any
+error occurs during this process, the buffer is again marked with a EFSCORRUPTED
+error for the higher layers to catch.
+Structures
+----------
+A typical on-disk structure needs to contain the following information:
+struct xfs_ondisk_hdr {
+        __be32  magic;          /* magic number */
+        __be32  crc;            /* CRC, not logged */
+        uuid_t  uuid;           /* filesystem identifier */
+        __be64  owner;          /* parent object */
+        __be64  blkno;          /* location on disk */
+        __be64  lsn;            /* last modification in log, not logged */
+};
+Depending on the metadata, this information may be part of a header structure
+separate to the metadata contents, or may be distributed through an existing
+structure. The latter occurs with metadata that already contains some of this
+information, such as the superblock and AG headers.
+Other metadata may have different formats for the information, but the same
+level of information is generally provided. For example:
+        - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
+          number for location. The two of these combined provide the same
+          information as @owner and @blkno in eh above structure, but using 8
+          bytes less space on disk.
+        - directory/attribute node blocks have a 16 bit magic number, and the
+          header that contains the magic number has other information in it as
+          well. hence the additional metadata headers change the overall format
+          of the metadata.
+A typical buffer read verifier is structured as follows:
+#define XFS_FOO_CRC_OFF         offsetof(struct xfs_ondisk_hdr, crc)
+static void
+xfs_foo_read_verify(
+        struct xfs_buf  *bp)
+{
+       struct xfs_mount *mp = bp->b_target->bt_mount;
+        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
+             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+                                        XFS_FOO_CRC_OFF)) ||
+            !xfs_foo_verify(bp)) {
+                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+                xfs_buf_ioerror(bp, EFSCORRUPTED);
+        }
+}
+The code ensures that the CRC is only checked if the filesystem has CRCs enabled
+by checking the superblock of the feature bit, and then if the CRC verifies OK
+(or is not needed) it verifies the actual contents of the block.
+The verifier function will take a couple of different forms, depending on
+whether the magic number can be used to determine the format of the block. In
+the case it can't, the code is structured as follows:
+static bool
+xfs_foo_verify(
+        struct xfs_buf          *bp)
+{
+        struct xfs_mount        *mp = bp->b_target->bt_mount;
+        struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+                return false;
+        if (!xfs_sb_version_hascrc(&mp->m_sb)) {
+                if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                        return false;
+                if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                        return false;
+                if (hdr->owner == 0)
+                        return false;
+        }
+        /* object specific verification checks here */
+        return true;
+}
+If there are different magic numbers for the different formats, the verifier
+will look like:
+static bool
+xfs_foo_verify(
+        struct xfs_buf          *bp)
+{
+        struct xfs_mount        *mp = bp->b_target->bt_mount;
+        struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
+                if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                        return false;
+                if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                        return false;
+                if (hdr->owner == 0)
+                        return false;
+        } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+                return false;
+        /* object specific verification checks here */
+        return true;
+}
+Write verifiers are very similar to the read verifiers, they just do things in
+the opposite order to the read verifiers. A typical write verifier:
+static void
+xfs_foo_write_verify(
+        struct xfs_buf  *bp)
+{
+        struct xfs_mount        *mp = bp->b_target->bt_mount;
+        struct xfs_buf_log_item *bip = bp->b_fspriv;
+        if (!xfs_foo_verify(bp)) {
+                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+                xfs_buf_ioerror(bp, EFSCORRUPTED);
+                return;
+        }
+        if (!xfs_sb_version_hascrc(&mp->m_sb))
+                return;
+        if (bip) {
+                struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+                hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
+        }
+        xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
+}
+This will verify the internal structure of the metadata before we go any
+further, detecting corruptions that have occurred as the metadata has been
+modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
+update the LSN field (when it was last modified) and calculate the CRC on the
+metadata. Once this is done, we can issue the IO.
+Inodes and Dquots
+-----------------
+Inodes and dquots are special snowflakes. They have per-object CRC and
+self-identifiers, but they are packed so that there are multiple objects per
+buffer. Hence we do not use per-buffer verifiers to do the work of per-object
+verification and CRC calculations. The per-buffer verifiers simply perform basic
+identification of the buffer - that they contain inodes or dquots, and that
+there are magic numbers in all the expected spots. All further CRC and
+verification checks are done when each inode is read from or written back to the
+buffer.
+The structure of the verifiers and the identifiers checks is very similar to the
+buffer code described above. The only difference is where they are called. For
+example, inode read verification is done in xfs_iread() when the inode is first
+read out of the buffer and the struct xfs_inode is instantiated. The inode is
+already extensively verified during writeback in xfs_iflush_int, so the only
+addition here is to add the LSN and CRC to the inode as it is copied back into
+the buffer.
+XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
+the unlinked list modifications check or update CRCs, neither during unlink nor
+log recovery. So, it's gone unnoticed until now. This won't matter immediately -
+repair will probably complain about it - but it needs to be fixed.
author	Linus Torvalds <torvalds@linux-foundation.org>	2013-05-02 17:49:33 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2013-05-02 17:49:33 -0400
commit	c8d8566952fda026966784a62f324c8352f77430 (patch)
tree	89f25a47ce7662768bc6780bb7daeb85070784f1 /Documentation/filesystems
parent	a9586d9be812be4a0046ad4d312b013e587607cb (diff)
parent	cab09a81fbefcb21db5213a84461d421946f6eb8 (diff)

diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt new file mode 100644 index 000000000000..05aa455163e3 --- /dev/null +++ b/Documentation/filesystems/xfs-self-describing-metadata.txt
@@ -0,0 +1,350 @@
	1	XFS Self Describing Metadata
	2	----------------------------
	3
	4	Introduction
	5	------------
	6
	7	The largest scalability problem facing XFS is not one of algorithmic
	8	scalability, but of verification of the filesystem structure. Scalabilty of the
	9	structures and indexes on disk and the algorithms for iterating them are
	10	adequate for supporting PB scale filesystems with billions of inodes, however it
	11	is this very scalability that causes the verification problem.
	12
	13	Almost all metadata on XFS is dynamically allocated. The only fixed location
	14	metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
	15	other metadata structures need to be discovered by walking the filesystem
	16	structure in different ways. While this is already done by userspace tools for
	17	validating and repairing the structure, there are limits to what they can
	18	verify, and this in turn limits the supportable size of an XFS filesystem.
	19
	20	For example, it is entirely possible to manually use xfs_db and a bit of
	21	scripting to analyse the structure of a 100TB filesystem when trying to
	22	determine the root cause of a corruption problem, but it is still mainly a
	23	manual task of verifying that things like single bit errors or misplaced writes
	24	weren't the ultimate cause of a corruption event. It may take a few hours to a
	25	few days to perform such forensic analysis, so for at this scale root cause
	26	analysis is entirely possible.
	27
	28	However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
	29	to analyse and so that analysis blows out towards weeks/months of forensic work.
	30	Most of the analysis work is slow and tedious, so as the amount of analysis goes
	31	up, the more likely that the cause will be lost in the noise. Hence the primary
	32	concern for supporting PB scale filesystems is minimising the time and effort
	33	required for basic forensic analysis of the filesystem structure.
	34
	35
	36	Self Describing Metadata
	37	------------------------
	38
	39	One of the problems with the current metadata format is that apart from the
	40	magic number in the metadata block, we have no other way of identifying what it
	41	is supposed to be. We can't even identify if it is the right place. Put simply,
	42	you can't look at a single metadata block in isolation and say "yes, it is
	43	supposed to be there and the contents are valid".
	44
	45	Hence most of the time spent on forensic analysis is spent doing basic
	46	verification of metadata values, looking for values that are in range (and hence
	47	not detected by automated verification checks) but are not correct. Finding and
	48	understanding how things like cross linked block lists (e.g. sibling
	49	pointers in a btree end up with loops in them) are the key to understanding what
	50	went wrong, but it is impossible to tell what order the blocks were linked into
	51	each other or written to disk after the fact.
	52
	53	Hence we need to record more information into the metadata to allow us to
	54	quickly determine if the metadata is intact and can be ignored for the purpose
	55	of analysis. We can't protect against every possible type of error, but we can
	56	ensure that common types of errors are easily detectable. Hence the concept of
	57	self describing metadata.
	58
	59	The first, fundamental requirement of self describing metadata is that the
	60	metadata object contains some form of unique identifier in a well known
	61	location. This allows us to identify the expected contents of the block and
	62	hence parse and verify the metadata object. IF we can't independently identify
	63	the type of metadata in the object, then the metadata doesn't describe itself
	64	very well at all!
	65
	66	Luckily, almost all XFS metadata has magic numbers embedded already - only the
	67	AGFL, remote symlinks and remote attribute blocks do not contain identifying
	68	magic numbers. Hence we can change the on-disk format of all these objects to
	69	add more identifying information and detect this simply by changing the magic
	70	numbers in the metadata objects. That is, if it has the current magic number,
	71	the metadata isn't self identifying. If it contains a new magic number, it is
	72	self identifying and we can do much more expansive automated verification of the
	73	metadata object at runtime, during forensic analysis or repair.
	74
	75	As a primary concern, self describing metadata needs some form of overall
	76	integrity checking. We cannot trust the metadata if we cannot verify that it has
	77	not been changed as a result of external influences. Hence we need some form of
	78	integrity check, and this is done by adding CRC32c validation to the metadata
	79	block. If we can verify the block contains the metadata it was intended to
	80	contain, a large amount of the manual verification work can be skipped.
	81
	82	CRC32c was selected as metadata cannot be more than 64k in length in XFS and
	83	hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
	84	metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
	85	fast. So while CRC32c is not the strongest of possible integrity checks that
	86	could be used, it is more than sufficient for our needs and has relatively
	87	little overhead. Adding support for larger integrity fields and/or algorithms
	88	does really provide any extra value over CRC32c, but it does add a lot of
	89	complexity and so there is no provision for changing the integrity checking
	90	mechanism.
	91
	92	Self describing metadata needs to contain enough information so that the
	93	metadata block can be verified as being in the correct place without needing to
	94	look at any other metadata. This means it needs to contain location information.
	95	Just adding a block number to the metadata is not sufficient to protect against
	96	mis-directed writes - a write might be misdirected to the wrong LUN and so be
	97	written to the "correct block" of the wrong filesystem. Hence location
	98	information must contain a filesystem identifier as well as a block number.
	99
	100	Another key information point in forensic analysis is knowing who the metadata
	101	block belongs to. We already know the type, the location, that it is valid
	102	and/or corrupted, and how long ago that it was last modified. Knowing the owner
	103	of the block is important as it allows us to find other related metadata to
	104	determine the scope of the corruption. For example, if we have a extent btree
	105	object, we don't know what inode it belongs to and hence have to walk the entire
	106	filesystem to find the owner of the block. Worse, the corruption could mean that
	107	no owner can be found (i.e. it's an orphan block), and so without an owner field
	108	in the metadata we have no idea of the scope of the corruption. If we have an
	109	owner field in the metadata object, we can immediately do top down validation to
	110	determine the scope of the problem.
	111
	112	Different types of metadata have different owner identifiers. For example,
	113	directory, attribute and extent tree blocks are all owned by an inode, whilst
	114	freespace btree blocks are owned by an allocation group. Hence the size and
	115	contents of the owner field are determined by the type of metadata object we are
	116	looking at. The owner information can also identify misplaced writes (e.g.
	117	freespace btree block written to the wrong AG).
	118
	119	Self describing metadata also needs to contain some indication of when it was
	120	written to the filesystem. One of the key information points when doing forensic
	121	analysis is how recently the block was modified. Correlation of set of corrupted
	122	metadata blocks based on modification times is important as it can indicate
	123	whether the corruptions are related, whether there's been multiple corruption
	124	events that lead to the eventual failure, and even whether there are corruptions
	125	present that the run-time verification is not detecting.
	126
	127	For example, we can determine whether a metadata object is supposed to be free
	128	space or still allocated if it is still referenced by its owner by looking at
	129	when the free space btree block that contains the block was last written
	130	compared to when the metadata object itself was last written. If the free space
	131	block is more recent than the object and the object's owner, then there is a
	132	very good chance that the block should have been removed from the owner.
	133
	134	To provide this "written timestamp", each metadata block gets the Log Sequence
	135	Number (LSN) of the most recent transaction it was modified on written into it.
	136	This number will always increase over the life of the filesystem, and the only
	137	thing that resets it is running xfs_repair on the filesystem. Further, by use of
	138	the LSN we can tell if the corrupted metadata all belonged to the same log
	139	checkpoint and hence have some idea of how much modification occurred between
	140	the first and last instance of corrupt metadata on disk and, further, how much
	141	modification occurred between the corruption being written and when it was
	142	detected.
	143
	144	Runtime Validation
	145	------------------
	146
	147	Validation of self-describing metadata takes place at runtime in two places:
	148
	149	- immediately after a successful read from disk
	150	- immediately prior to write IO submission
	151
	152	The verification is completely stateless - it is done independently of the
	153	modification process, and seeks only to check that the metadata is what it says
	154	it is and that the metadata fields are within bounds and internally consistent.
	155	As such, we cannot catch all types of corruption that can occur within a block
	156	as there may be certain limitations that operational state enforces of the
	157	metadata, or there may be corruption of interblock relationships (e.g. corrupted
	158	sibling pointer lists). Hence we still need stateful checking in the main code
	159	body, but in general most of the per-field validation is handled by the
	160	verifiers.
	161
	162	For read verification, the caller needs to specify the expected type of metadata
	163	that it should see, and the IO completion process verifies that the metadata
	164	object matches what was expected. If the verification process fails, then it
	165	marks the object being read as EFSCORRUPTED. The caller needs to catch this
	166	error (same as for IO errors), and if it needs to take special action due to a
	167	verification error it can do so by catching the EFSCORRUPTED error value. If we
	168	need more discrimination of error type at higher levels, we can define new
	169	error numbers for different errors as necessary.
	170
	171	The first step in read verification is checking the magic number and determining
	172	whether CRC validating is necessary. If it is, the CRC32c is calculated and
	173	compared against the value stored in the object itself. Once this is validated,
	174	further checks are made against the location information, followed by extensive
	175	object specific metadata validation. If any of these checks fail, then the
	176	buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
	177
	178	Write verification is the opposite of the read verification - first the object
	179	is extensively verified and if it is OK we then update the LSN from the last
	180	modification made to the object, After this, we calculate the CRC and insert it
	181	into the object. Once this is done the write IO is allowed to continue. If any
	182	error occurs during this process, the buffer is again marked with a EFSCORRUPTED
	183	error for the higher layers to catch.
	184
	185	Structures
	186	----------
	187
	188	A typical on-disk structure needs to contain the following information:
	189
	190	struct xfs_ondisk_hdr {
	191	__be32 magic; /* magic number */
	192	__be32 crc; /* CRC, not logged */
	193	uuid_t uuid; /* filesystem identifier */
	194	__be64 owner; /* parent object */
	195	__be64 blkno; /* location on disk */
	196	__be64 lsn; /* last modification in log, not logged */
	197	};
	198
	199	Depending on the metadata, this information may be part of a header structure
	200	separate to the metadata contents, or may be distributed through an existing
	201	structure. The latter occurs with metadata that already contains some of this
	202	information, such as the superblock and AG headers.
	203
	204	Other metadata may have different formats for the information, but the same
	205	level of information is generally provided. For example:
	206
	207	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
	208	number for location. The two of these combined provide the same
	209	information as @owner and @blkno in eh above structure, but using 8
	210	bytes less space on disk.
	211
	212	- directory/attribute node blocks have a 16 bit magic number, and the
	213	header that contains the magic number has other information in it as
	214	well. hence the additional metadata headers change the overall format
	215	of the metadata.
	216
	217	A typical buffer read verifier is structured as follows:
	218
	219	#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
	220
	221	static void
	222	xfs_foo_read_verify(
	223	struct xfs_buf *bp)
	224	{
	225	struct xfs_mount *mp = bp->b_target->bt_mount;
	226
	227	if ((xfs_sb_version_hascrc(&mp->m_sb) &&
	228	!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
	229	XFS_FOO_CRC_OFF)) \|\|
	230	!xfs_foo_verify(bp)) {
	231	XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
	232	xfs_buf_ioerror(bp, EFSCORRUPTED);
	233	}
	234	}
	235
	236	The code ensures that the CRC is only checked if the filesystem has CRCs enabled
	237	by checking the superblock of the feature bit, and then if the CRC verifies OK
	238	(or is not needed) it verifies the actual contents of the block.
	239
	240	The verifier function will take a couple of different forms, depending on
	241	whether the magic number can be used to determine the format of the block. In
	242	the case it can't, the code is structured as follows:
	243
	244	static bool
	245	xfs_foo_verify(
	246	struct xfs_buf *bp)
	247	{
	248	struct xfs_mount *mp = bp->b_target->bt_mount;
	249	struct xfs_ondisk_hdr *hdr = bp->b_addr;
	250
	251	if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
	252	return false;
	253
	254	if (!xfs_sb_version_hascrc(&mp->m_sb)) {
	255	if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
	256	return false;
	257	if (bp->b_bn != be64_to_cpu(hdr->blkno))
	258	return false;
	259	if (hdr->owner == 0)
	260	return false;
	261	}
	262
	263	/* object specific verification checks here */
	264
	265	return true;
	266	}
	267
	268	If there are different magic numbers for the different formats, the verifier
	269	will look like:
	270
	271	static bool
	272	xfs_foo_verify(
	273	struct xfs_buf *bp)
	274	{
	275	struct xfs_mount *mp = bp->b_target->bt_mount;
	276	struct xfs_ondisk_hdr *hdr = bp->b_addr;
	277
	278	if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
	279	if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
	280	return false;
	281	if (bp->b_bn != be64_to_cpu(hdr->blkno))
	282	return false;
	283	if (hdr->owner == 0)
	284	return false;
	285	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
	286	return false;
	287
	288	/* object specific verification checks here */
	289
	290	return true;
	291	}
	292
	293	Write verifiers are very similar to the read verifiers, they just do things in
	294	the opposite order to the read verifiers. A typical write verifier:
	295
	296	static void
	297	xfs_foo_write_verify(
	298	struct xfs_buf *bp)
	299	{
	300	struct xfs_mount *mp = bp->b_target->bt_mount;
	301	struct xfs_buf_log_item *bip = bp->b_fspriv;
	302
	303	if (!xfs_foo_verify(bp)) {
	304	XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
	305	xfs_buf_ioerror(bp, EFSCORRUPTED);
	306	return;
	307	}
	308
	309	if (!xfs_sb_version_hascrc(&mp->m_sb))
	310	return;
	311
	312
	313	if (bip) {
	314	struct xfs_ondisk_hdr *hdr = bp->b_addr;
	315	hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	316	}
	317	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
	318	}
	319
	320	This will verify the internal structure of the metadata before we go any
	321	further, detecting corruptions that have occurred as the metadata has been
	322	modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
	323	update the LSN field (when it was last modified) and calculate the CRC on the
	324	metadata. Once this is done, we can issue the IO.
	325
	326	Inodes and Dquots
	327	-----------------
	328
	329	Inodes and dquots are special snowflakes. They have per-object CRC and
	330	self-identifiers, but they are packed so that there are multiple objects per
	331	buffer. Hence we do not use per-buffer verifiers to do the work of per-object
	332	verification and CRC calculations. The per-buffer verifiers simply perform basic
	333	identification of the buffer - that they contain inodes or dquots, and that
	334	there are magic numbers in all the expected spots. All further CRC and
	335	verification checks are done when each inode is read from or written back to the
	336	buffer.
	337
	338	The structure of the verifiers and the identifiers checks is very similar to the
	339	buffer code described above. The only difference is where they are called. For
	340	example, inode read verification is done in xfs_iread() when the inode is first
	341	read out of the buffer and the struct xfs_inode is instantiated. The inode is
	342	already extensively verified during writeback in xfs_iflush_int, so the only
	343	addition here is to add the LSN and CRC to the inode as it is copied back into
	344	the buffer.
	345
	346	XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
	347	the unlinked list modifications check or update CRCs, neither during unlink nor
	348	log recovery. So, it's gone unnoticed until now. This won't matter immediately -
	349	repair will probably complain about it - but it needs to be fixed.
	350