author		Shaohua Li <shli@fb.com>	2017-01-30 18:47:49 -0500
committer	Shaohua Li <shli@fb.com>	2017-02-13 12:17:54 -0500
commit		5a6265f9cd98b82d89778b806bc50b3d368c8273 (patch)
tree		6a859821d584db8a2ec5edbd90badaa896302304
parent		1601c5907c508637f7816a427ff23b14e54eb11d (diff)
MD: add doc for raid5-cache
I'm starting documentation of the raid5-cache feature. Please note this is
a kernel doc rather than an mdadm manual, so I don't add details about how
to use the feature on the mdadm side.

Cc: NeilBrown <neilb@suse.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-rw-r--r--	Documentation/admin-guide/md.rst	5
-rw-r--r--	Documentation/md/raid5-cache.txt	109
2 files changed, 114 insertions(+), 0 deletions(-)
diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index e449fb5f277c..1e61bf50595c 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -725,3 +725,8 @@ These currently include:
     to 1. Setting this to 0 disables bypass accounting and
     requires preread stripes to wait until all full-width stripe-
     writes are complete. Valid values are 0 to stripe_cache_size.

  journal_mode (currently raid5 only)
     The cache mode for raid5. raid5 could include an extra disk for
     caching. The mode can be "write-through" or "write-back". The
     default is "write-through".
diff --git a/Documentation/md/raid5-cache.txt b/Documentation/md/raid5-cache.txt
new file mode 100644
index 000000000000..2b210f295786
--- /dev/null
+++ b/Documentation/md/raid5-cache.txt
@@ -0,0 +1,109 @@
RAID5 cache

RAID 4/5/6 can include an extra disk as a data cache besides the normal RAID
disks. The cache disk doesn't change the role of the RAID disks; it caches
data destined for them. The cache can be in write-through (supported since
4.4) or write-back mode (supported since 4.10). mdadm (since 3.4) has a new
option '--write-journal' to create an array with a cache. Please refer to
the mdadm manual for details. By default (when the RAID array starts), the
cache is in write-through mode. A user can switch it to write-back mode by:

echo "write-back" > /sys/block/md0/md/journal_mode

And switch it back to write-through mode by:

echo "write-through" > /sys/block/md0/md/journal_mode

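As an illustration (the device names here are hypothetical), an array with
a cache disk can be created with mdadm like:

mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sdb /dev/sdc /dev/sdd --write-journal /dev/nvme0n1

The current mode can be read back from the same sysfs attribute (the active
mode is typically shown in brackets):

cat /sys/block/md0/md/journal_mode
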
In both modes, all writes to the array hit the cache disk first. This means
the cache disk must be fast and have good write endurance.

-------------------------------------
write-through mode:

This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
unclean shutdown can leave the data in some stripes in an inconsistent
state, e.g., data and parity don't match. The reason is that a stripe write
involves several RAID disks, and it's possible that the writes haven't hit
all RAID disks yet when the unclean shutdown occurs. We call an array
degraded if it has inconsistent data. MD tries to resync the array to bring
it back to a normal state. But before the resync completes, any system
crash will expose the chance of real data corruption in the RAID array.
This problem is called the 'write hole'.

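As a concrete illustration: suppose a stripe write must update data block D
on one disk and parity P on another, and the system crashes after D reaches
its disk but before P does. The stripe's parity no longer matches its data,
and if a disk in that stripe later fails, reconstruction from the stale
parity will silently produce wrong data.
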
The write-through cache caches all data on the cache disk first. After the
data is safe on the cache disk, the data is flushed onto the RAID disks.
This two-step write guarantees that MD can recover correct data after an
unclean shutdown, even if the array is degraded. Thus the cache can close
the 'write hole'.

In write-through mode, MD reports IO completion to the upper layer (usually
filesystems) after the data is safe on the RAID disks, so a cache disk
failure doesn't cause data loss. Of course, a cache disk failure means the
array is exposed to the 'write hole' again.

In write-through mode, the cache disk isn't required to be big. A few
hundred megabytes are enough.

--------------------------------------
write-back mode:

Write-back mode fixes the 'write hole' issue too, since all write data is
cached on the cache disk. But the main goal of the write-back cache is to
speed up writes. If a write crosses all RAID disks of a stripe, we call it
a full-stripe write. For non-full-stripe writes, MD must read old data
before the new parity can be calculated. These synchronous reads hurt write
throughput. Writes which are sequential but not dispatched at the same time
suffer from this overhead too. The write-back cache aggregates the data and
flushes it to the RAID disks only after the data becomes a full-stripe
write. This completely avoids the overhead, so it's very helpful for some
workloads. A typical example is a workload that does sequential writes
followed by fsync.

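For example, assuming an illustrative geometry of a 4-disk RAID5 with a
512KB chunk size, a full-stripe write covers 3 data disks x 512KB = 1536KB
of contiguous data; any smaller write forces MD to read old data or parity
before the new parity can be computed.
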
In write-back mode, MD reports IO completion to the upper layer (usually
filesystems) right after the data hits the cache disk. The data is flushed
to the RAID disks later, after specific conditions are met. So a cache disk
failure will cause data loss.

In write-back mode, MD also caches data in memory. The memory cache holds
the same data stored on the cache disk, so a power loss doesn't cause data
loss. The memory cache size has a performance impact on the array, and a
bigger size is recommended. A user can configure the size by:

echo "2048" > /sys/block/md0/md/stripe_cache_size

A cache disk that is too small will make write aggregation less efficient
in this mode, depending on the workload. It's recommended to use a cache
disk of at least several gigabytes in write-back mode.

--------------------------------------
The implementation:

The write-through and write-back caches use the same disk format. The cache
disk is organized as a simple write log. The log consists of 'meta data'
and 'data' pairs. The meta data describes the data. It also includes a
checksum and a sequence ID for recovery identification. Data can be IO data
or parity data. The data is checksummed too. The checksum is stored in the
meta data ahead of the data it covers. The checksum is an optimization,
because MD can write the meta data and data freely without worrying about
their order. The MD superblock has a field pointing to the valid meta data
at the log head.

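As a rough sketch of this layout (illustrative only: the field names and
widths below are invented, and the authoritative definitions live in
drivers/md/raid5-cache.c), a meta block could look like:

/* Illustrative only: not the kernel's actual on-disk structure. */
#include <stdint.h>

struct example_meta_block {
	uint32_t magic;      /* identifies a meta block during recovery */
	uint32_t checksum;   /* checksum of this meta block             */
	uint64_t seq;        /* sequence ID, monotonically increasing   */
	uint64_t position;   /* location of this block in the log       */
	/* ...followed by payload descriptors, each describing a chunk
	 * of IO data or parity data, its target location on the RAID
	 * disks, and the checksum of that data chunk. */
};
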
The log implementation is pretty straightforward. The difficult part is the
order in which MD writes data to the cache disk and the RAID disks.
Specifically, in write-through mode, MD calculates parity for the IO data,
writes both the IO data and parity to the log, writes the data and parity
to the RAID disks after the data and parity have settled down in the log,
and finally finishes the IO. Reads just read from the RAID disks as usual.

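The write-through sequence can be sketched as below. The helper names are
hypothetical, purely to make the ordering explicit; they don't correspond
to real kernel functions.

struct stripe;                             /* opaque in this sketch    */
void compute_parity(struct stripe *s);
void log_write(struct stripe *s);          /* IO data + parity to log  */
void log_wait_settled(struct stripe *s);   /* safe on the cache disk   */
void raid_disks_write(struct stripe *s);   /* data + parity to disks   */
void report_io_completion(struct stripe *s);

void write_through_stripe(struct stripe *s)
{
	compute_parity(s);        /* 1. calculate parity              */
	log_write(s);             /* 2. write data and parity to log  */
	log_wait_settled(s);      /* 3. wait until settled in the log */
	raid_disks_write(s);      /* 4. only then write to RAID disks */
	report_io_completion(s);  /* 5. finally finish the IO         */
}
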
In write-back mode, MD writes the IO data to the log and reports IO
completion. The data is also fully cached in memory at that time, which
means reads must query the memory cache. When certain conditions are met,
MD flushes the data to the RAID disks. MD calculates parity for the data
and writes the parity into the log. After this is finished, MD writes both
data and parity to the RAID disks, and then MD can release the memory
cache. The flush conditions could be: the stripe becomes a full-stripe
write, free cache disk space is low, or free in-kernel memory cache space
is low.

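The flush policy can be summarized with a sketch like the following (again
hypothetical helpers, not kernel code):

struct stripe;
int is_full_stripe_write(struct stripe *s); /* covers all data disks     */
int log_space_low(void);                    /* cache disk almost full    */
int memory_cache_low(void);                 /* stripe cache memory tight */

int should_flush(struct stripe *s)
{
	return is_full_stripe_write(s) || log_space_low() ||
	       memory_cache_low();
}
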
After an unclean shutdown, MD does recovery. MD reads all meta data and
data from the log. The sequence IDs and checksums help detect corrupted
meta data and data. If MD finds a stripe with data and valid parities (1
parity for raid4/5 and 2 for raid6), MD writes the data and parities to the
RAID disks. If the parities are incomplete, they are discarded. If part of
the data is corrupted, it is discarded too. MD then loads the valid data
and writes it to the RAID disks in the normal way.
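
Recovery can be pictured as a scan of the log starting from the head
recorded in the MD superblock. The sketch below uses invented helper names;
it only illustrates the control flow described above.

#include <stdint.h>

struct meta_block;
struct meta_block *read_meta(uint64_t pos);          /* NULL if unreadable */
int meta_valid(struct meta_block *m, uint64_t seq);  /* checksum + seq ID  */
int stripe_complete(struct meta_block *m);   /* data plus valid parities   */
void write_to_raid_disks(struct meta_block *m);
void replay_data_normally(struct meta_block *m);     /* normal write path  */
uint64_t next_meta_pos(struct meta_block *m);

void recover_log(uint64_t pos, uint64_t seq)
{
	struct meta_block *m;

	/* Walk the log until the first missing or invalid meta block. */
	while ((m = read_meta(pos)) && meta_valid(m, seq)) {
		if (stripe_complete(m))
			write_to_raid_disks(m);  /* replay data + parities */
		else
			replay_data_normally(m); /* discard bad parity or
						    data, rewrite the valid
						    data the normal way    */
		pos = next_meta_pos(m);
		seq++;
	}
}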