Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md

Pull MD updates from Shaohua Li:

 - Add Partial Parity Log (ppl) feature found in Intel IMSM raid array
   by Artur Paszkiewicz. This feature is another way to close the RAID5
   write hole. The Linux implementation is also available for normal
   RAID5 arrays if a specific superblock bit is set.

 - A number of md-cluster fixes and enabling md-cluster array resize
   from Guoqing Jiang

 - A bunch of patches from Ming Lei and Neil Brown to rewrite MD bio
   handling related code. Now MD doesn't directly access bio bvec or
   bi_phys_segments, and uses the modern bio API for bio split.

 - Improve RAID5 IO pattern to improve performance for hard disk based
   RAID5/6 from me.

 - Several patches from Song Liu to speed up raid5-cache recovery and
   allow disabling the raid5 cache feature at runtime.

 - Fix a performance regression in raid1 resync from Xiao Ni.

 - Other cleanup and fixes from various people.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (84 commits)
  md/raid10: skip spare disk as 'first' disk
  md/raid1: Use a new variable to count flighting sync requests
  md: clear WantReplacement once disk is removed
  md/raid1/10: remove unused queue
  md: handle read-only member devices better.
  md/raid10: wait up frozen array in handle_write_completed
  uapi: fix linux/raid/md_p.h userspace compilation error
  md-cluster: Fix a memleak in an error handling path
  md: support disabling of create-on-open semantics.
  md: allow creation of mdNNN arrays via md_mod/parameters/new_array
  raid5-ppl: use a single mempool for ppl_io_unit and header_page
  md/raid0: fix up bio splitting.
  md/linear: improve bio splitting.
  md/raid5: make chunk_aligned_read() split bios more cleanly.
  md/raid10: simplify handle_read_error()
  md/raid10: simplify the splitting of requests.
  md/raid1: factor out flush_bio_list()
  md/raid1: simplify handle_read_error().
  Revert "block: introduce bio_copy_data_partial"
  md/raid1: simplify alloc_behind_master_bio()
  ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2017-05-03 13:05:38 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2017-05-03 13:05:38 -0400
commit: e5021876c91dc3894b2174cca8fa797f8e29e7b9
tree: cf6cc6591a8421e0f75cfcfbc10312421bd8e9f1 /Documentation/md
parent: 46f0537b1ecf672052007c97f102a7e6bf0791e4
parent: e265eb3a30543a237b2ebc4e0422ac82e55b07e4
2 files changed, 45 insertions, 1 deletions
diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
index d22103994aef..82ee51604e9a 100644
--- a/Documentation/md/md-cluster.txt
+++ b/Documentation/md/md-cluster.txt
@@ -321,4 +321,4 @@ The algorithm is:
 There are somethings which are not supported by cluster MD yet.
- update size and change array_sectors.
+- change array_sectors.
diff --git a/Documentation/md/raid5-ppl.txt b/Documentation/md/raid5-ppl.txt
new file mode 100644
index 000000000000..127072b09363
--- /dev/null
+++ b/Documentation/md/raid5-ppl.txt
@@ -0,0 +1,44 @@
+Partial Parity Log
+
+Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
+addressed by PPL is that after a dirty shutdown, parity of a particular stripe
+may become inconsistent with data on other member disks. If the array is also
+in degraded state, there is no way to recalculate parity, because one of the
+disks is missing. This can lead to silent data corruption when rebuilding the
+array or using it as degraded - data calculated from parity for array blocks
+that have not been touched by a write request during the unclean shutdown can
+be incorrect. Such condition is known as the RAID5 Write Hole. Because of
+this, md by default does not allow starting a dirty degraded array.
+
+Partial parity for a write operation is the XOR of stripe data chunks not
+modified by this write. It is just enough data needed for recovering from the
+write hole. XORing partial parity with the modified chunks produces parity for
+the stripe, consistent with its state before the write operation, regardless of
+which chunk writes have completed. If one of the not modified data disks of
+this stripe is missing, this updated parity can be used to recover its
+contents. PPL recovery is also performed when starting an array after an
+unclean shutdown and all disks are available, eliminating the need to resync
+the array. Because of this, using write-intent bitmap and PPL together is not
+supported.
+
+When handling a write request PPL writes partial parity before new data and
+parity are dispatched to disks. PPL is a distributed log - it is stored on
+array member drives in the metadata area, on the parity drive of a particular
+stripe.  It does not require a dedicated journaling drive. Write performance is
+reduced by up to 30%-40% but it scales with the number of drives in the array
+and the journaling drive does not become a bottleneck or a single point of
+failure.
+
+Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
+not a true journal. It does not protect from losing in-flight data, only from
+silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
+performed for this stripe (parity is not updated). So it is possible to have
+arbitrary data in the written part of a stripe if that disk is lost. In such
+case the behavior is the same as in plain raid5.
+
+PPL is available for md version-1 metadata and external (specifically IMSM)
+metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
+
+Currently, volatile write-back cache should be disabled on all member drives
+when using PPL. Otherwise it cannot guarantee consistency in case of power
+failure.
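The partial parity arithmetic described in raid5-ppl.txt above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the helper xor_chunks, the 4-byte chunks, the four-data-disk stripe, and the chosen crash scenario are all hypothetical and exist purely for illustration. It shows a write that modifies two chunks, a crash after only one of them reached the disk, and the subsequent loss of a disk holding an unmodified chunk.

```python
from functools import reduce


def xor_chunks(*chunks: bytes) -> bytes:
    """XOR equally sized byte strings together (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Stripe state before the write: four data chunks plus their parity.
old = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity_old = xor_chunks(*old)

# The write modifies chunks 0 and 1; chunks 2 and 3 are untouched.
modified = {0, 1}
new0 = b"aaaa"

# Partial parity: XOR of the chunks NOT modified by this write.
# In PPL this is logged before new data and parity are dispatched.
partial_parity = xor_chunks(*(old[i] for i in range(len(old)) if i not in modified))

# Assumed crash scenario: chunk 0 reached the disk, chunk 1 and the
# parity update did not.
on_disk = [new0, old[1], old[2], old[3]]
parity_on_disk = parity_old

# Now the disk holding (unmodified) chunk 3 is lost.
survivors = on_disk[:3]

# Without PPL: rebuilding chunk 3 from stale parity silently yields wrong data.
naive = xor_chunks(parity_on_disk, *survivors)

# With PPL: XOR the logged partial parity with the modified chunks as they
# currently are on disk to get a parity consistent with the stripe, then rebuild.
recovered_parity = xor_chunks(partial_parity, on_disk[0], on_disk[1])
with_ppl = xor_chunks(recovered_parity, *survivors)

print("naive rebuild correct:", naive == old[3])
print("PPL rebuild correct:  ", with_ppl == old[3])
```

The recovered parity does not depend on which of the modified chunk writes completed, because those chunks cancel out when the missing unmodified chunk is rebuilt. If the lost disk had instead held one of the modified chunks, this model would return whatever mix of old and new data the recovered parity implies, which matches the document's statement that PPL does not protect in-flight data.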