author		Shaohua Li <shli@fb.com>	2017-01-30 18:47:49 -0500
committer	Shaohua Li <shli@fb.com>	2017-02-13 12:17:54 -0500
commit		5a6265f9cd98b82d89778b806bc50b3d368c8273 (patch)
tree		6a859821d584db8a2ec5edbd90badaa896302304
parent		1601c5907c508637f7816a427ff23b14e54eb11d (diff)
MD: add doc for raid5-cache
I'm starting documentation of the raid5-cache feature. Please note this is
a kernel doc rather than an mdadm manual, so I don't add details about how
to use the feature on the mdadm side.

Cc: NeilBrown <neilb@suse.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-rw-r--r--	Documentation/admin-guide/md.rst	5
-rw-r--r--	Documentation/md/raid5-cache.txt	109
2 files changed, 114 insertions(+), 0 deletions(-)
diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index e449fb5f277c..1e61bf50595c 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -725,3 +725,8 @@ These currently include:
     to 1. Setting this to 0 disables bypass accounting and
     requires preread stripes to wait until all full-width stripe-
     writes are complete. Valid values are 0 to stripe_cache_size.

  journal_mode (currently raid5 only)
     The cache mode for raid5. raid5 could include an extra disk for
     caching. The mode can be "write-through" or "write-back". The
     default is "write-through".
diff --git a/Documentation/md/raid5-cache.txt b/Documentation/md/raid5-cache.txt
new file mode 100644
index 000000000000..2b210f295786
--- /dev/null
+++ b/Documentation/md/raid5-cache.txt
@@ -0,0 +1,109 @@
RAID5 cache

RAID 4/5/6 can include an extra disk as a data cache besides the normal RAID
disks. The cache disk doesn't change the role of the RAID disks; it caches
data destined for them. The cache can be in write-through (supported since
4.4) or write-back mode (supported since 4.10). mdadm (since 3.4) has a new
option '--write-journal' to create an array with a cache. Please refer to
the mdadm manual for details. By default (when the RAID array starts), the
cache is in write-through mode. A user can switch it to write-back mode by:

echo "write-back" > /sys/block/md0/md/journal_mode

And switch it back to write-through mode by:

echo "write-through" > /sys/block/md0/md/journal_mode

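As an illustration (the device names here are hypothetical), an array with
a cache disk can be created with mdadm like:

mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sdb /dev/sdc /dev/sdd --write-journal /dev/nvme0n1

The current mode can be read back from the same sysfs attribute (the active
mode is typically shown in brackets):

cat /sys/block/md0/md/journal_mode
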
In both modes, all writes to the array hit the cache disk first. This means
the cache disk must be fast and have good write endurance.

-------------------------------------
write-through mode:

This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
unclean shutdown can leave the data in some stripes in an inconsistent
state, e.g., data and parity don't match. The reason is that a stripe write
involves several RAID disks, and it's possible that the writes haven't hit
all RAID disks yet when the unclean shutdown occurs. We call an array
degraded if it has inconsistent data. MD tries to resync the array to bring
it back to a normal state. But before the resync completes, any system
crash will expose the chance of real data corruption in the RAID array.
This problem is called the 'write hole'.

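As a concrete illustration: suppose a stripe write must update data block D
on one disk and parity P on another, and the system crashes after D reaches
its disk but before P does. The stripe's parity no longer matches its data,
and if a disk in that stripe later fails, reconstruction from the stale
parity will silently produce wrong data.
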
The write-through cache caches all data on the cache disk first. After the
data is safe on the cache disk, the data is flushed onto the RAID disks.
This two-step write guarantees that MD can recover correct data after an
unclean shutdown, even if the array is degraded. Thus the cache can close
the 'write hole'.

In write-through mode, MD reports IO completion to the upper layer (usually
filesystems) after the data is safe on the RAID disks, so a cache disk
failure doesn't cause data loss. Of course, a cache disk failure means the
array is exposed to the 'write hole' again.

In write-through mode, the cache disk isn't required to be big. A few
hundred megabytes are enough.

--------------------------------------
write-back mode:

Write-back mode fixes the 'write hole' issue too, since all write data is
cached on the cache disk. But the main goal of the write-back cache is to
speed up writes. If a write crosses all RAID disks of a stripe, we call it
a full-stripe write. For non-full-stripe writes, MD must read old data
before the new parity can be calculated. These synchronous reads hurt write
throughput. Writes which are sequential but not dispatched at the same time
suffer from this overhead too. The write-back cache aggregates the data and
flushes it to the RAID disks only after the data becomes a full-stripe
write. This completely avoids the overhead, so it's very helpful for some
workloads. A typical example is a workload that does sequential writes
followed by fsync.

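For example, assuming an illustrative geometry of a 4-disk RAID5 with a
512KB chunk size, a full-stripe write covers 3 data disks x 512KB = 1536KB
of contiguous data; any smaller write forces MD to read old data or parity
before the new parity can be computed.
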
In write-back mode, MD reports IO completion to the upper layer (usually
filesystems) right after the data hits the cache disk. The data is flushed
to the RAID disks later, after specific conditions are met. So a cache disk
failure will cause data loss.

In write-back mode, MD also caches data in memory. The memory cache holds
the same data stored on the cache disk, so a power loss doesn't cause data
loss. The memory cache size has a performance impact on the array, and a
bigger size is recommended. A user can configure the size by:

echo "2048" > /sys/block/md0/md/stripe_cache_size

A cache disk that is too small will make write aggregation less efficient
in this mode, depending on the workload. It's recommended to use a cache
disk of at least several gigabytes in write-back mode.

--------------------------------------
The implementation:

The write-through and write-back caches use the same disk format. The cache
disk is organized as a simple write log. The log consists of 'meta data'
and 'data' pairs. The meta data describes the data. It also includes a
checksum and a sequence ID for recovery identification. Data can be IO data
or parity data. The data is checksummed too. The checksum is stored in the
meta data ahead of the data it covers. The checksum is an optimization,
because MD can write the meta data and data freely without worrying about
their order. The MD superblock has a field pointing to the valid meta data
at the log head.

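As a rough sketch of this layout (illustrative only: the field names and
widths below are invented, and the authoritative definitions live in
drivers/md/raid5-cache.c), a meta block could look like:

/* Illustrative only: not the kernel's actual on-disk structure. */
#include <stdint.h>

struct example_meta_block {
	uint32_t magic;      /* identifies a meta block during recovery */
	uint32_t checksum;   /* checksum of this meta block             */
	uint64_t seq;        /* sequence ID, monotonically increasing   */
	uint64_t position;   /* location of this block in the log       */
	/* ...followed by payload descriptors, each describing a chunk
	 * of IO data or parity data, its target location on the RAID
	 * disks, and the checksum of that data chunk. */
};
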
The log implementation is pretty straightforward. The difficult part is the
order in which MD writes data to the cache disk and the RAID disks.
Specifically, in write-through mode, MD calculates parity for the IO data,
writes both the IO data and parity to the log, writes the data and parity
to the RAID disks after the data and parity have settled down in the log,
and finally finishes the IO. Reads just read from the RAID disks as usual.

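The write-through sequence can be sketched as below. The helper names are
hypothetical, purely to make the ordering explicit; they don't correspond
to real kernel functions.

struct stripe;                             /* opaque in this sketch    */
void compute_parity(struct stripe *s);
void log_write(struct stripe *s);          /* IO data + parity to log  */
void log_wait_settled(struct stripe *s);   /* safe on the cache disk   */
void raid_disks_write(struct stripe *s);   /* data + parity to disks   */
void report_io_completion(struct stripe *s);

void write_through_stripe(struct stripe *s)
{
	compute_parity(s);        /* 1. calculate parity              */
	log_write(s);             /* 2. write data and parity to log  */
	log_wait_settled(s);      /* 3. wait until settled in the log */
	raid_disks_write(s);      /* 4. only then write to RAID disks */
	report_io_completion(s);  /* 5. finally finish the IO         */
}
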
In write-back mode, MD writes the IO data to the log and reports IO
completion. The data is also fully cached in memory at that time, which
means reads must query the memory cache. When certain conditions are met,
MD flushes the data to the RAID disks. MD calculates parity for the data
and writes the parity into the log. After this is finished, MD writes both
data and parity to the RAID disks, and then MD can release the memory
cache. The flush conditions could be: the stripe becomes a full-stripe
write, free cache disk space is low, or free in-kernel memory cache space
is low.

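The flush policy can be summarized with a sketch like the following (again
hypothetical helpers, not kernel code):

struct stripe;
int is_full_stripe_write(struct stripe *s); /* covers all data disks     */
int log_space_low(void);                    /* cache disk almost full    */
int memory_cache_low(void);                 /* stripe cache memory tight */

int should_flush(struct stripe *s)
{
	return is_full_stripe_write(s) || log_space_low() ||
	       memory_cache_low();
}
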
After an unclean shutdown, MD does recovery. MD reads all meta data and
data from the log. The sequence IDs and checksums help detect corrupted
meta data and data. If MD finds a stripe with data and valid parities (1
parity for raid4/5 and 2 for raid6), MD writes the data and parities to the
RAID disks. If the parities are incomplete, they are discarded. If part of
the data is corrupted, it is discarded too. MD then loads the valid data
and writes it to the RAID disks in the normal way.
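
Recovery can be pictured as a scan of the log starting from the head
recorded in the MD superblock. The sketch below uses invented helper names;
it only illustrates the control flow described above.

#include <stdint.h>

struct meta_block;
struct meta_block *read_meta(uint64_t pos);          /* NULL if unreadable */
int meta_valid(struct meta_block *m, uint64_t seq);  /* checksum + seq ID  */
int stripe_complete(struct meta_block *m);   /* data plus valid parities   */
void write_to_raid_disks(struct meta_block *m);
void replay_data_normally(struct meta_block *m);     /* normal write path  */
uint64_t next_meta_pos(struct meta_block *m);

void recover_log(uint64_t pos, uint64_t seq)
{
	struct meta_block *m;

	/* Walk the log until the first missing or invalid meta block. */
	while ((m = read_meta(pos)) && meta_valid(m, seq)) {
		if (stripe_complete(m))
			write_to_raid_disks(m);  /* replay data + parities */
		else
			replay_data_normally(m); /* discard bad parity or
						    data, rewrite the valid
						    data the normal way    */
		pos = next_meta_pos(m);
		seq++;
	}
}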