=======
dm-raid
=======

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.


Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters::

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:

  ============= ===============================================================
  raid0         RAID0 striping (no resilience)
  raid1         RAID1 mirroring
  raid4         RAID4 with dedicated last parity disk
  raid5_n       RAID5 with dedicated last parity disk supporting takeover
                Same as raid4

                - Transitory layout
  raid5_la      RAID5 left asymmetric

                - rotating parity 0 with data continuation
  raid5_ra      RAID5 right asymmetric

                - rotating parity N with data continuation
  raid5_ls      RAID5 left symmetric

                - rotating parity 0 with data restart
  raid5_rs      RAID5 right symmetric

                - rotating parity N with data restart
  raid6_zr      RAID6 zero restart

                - rotating parity zero (left-to-right) with data restart
  raid6_nr      RAID6 N restart

                - rotating parity N (right-to-left) with data restart
  raid6_nc      RAID6 N continue

                - rotating parity N (right-to-left) with data continuation
  raid6_n_6     RAID6 with dedicated parity disks

                - parity and Q-syndrome on the last 2 disks;
                  layout for takeover from/to raid4/raid5_n
  raid6_la_6    Same as "raid5_la" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_la from/to raid6
  raid6_ra_6    Same as "raid5_ra" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_ra from/to raid6
  raid6_ls_6    Same as "raid5_ls" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_ls from/to raid6
  raid6_rs_6    Same as "raid5_rs" plus dedicated last Q-syndrome disk

                - layout for takeover from raid5_rs from/to raid6
  raid10        Various RAID10-inspired algorithms chosen by additional
                params (see raid10_format and raid10_copies below)

                - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
                - RAID1E: Integrated Adjacent Stripe Mirroring
                - RAID1E: Integrated Offset Stripe Mirroring
                - and other similar RAID10 variants
  ============= ===============================================================

  Reference: Chapter 4 of
  http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of

  Mandatory parameters:
    <chunk_size>:
            Chunk size in sectors.  This parameter is often known as
            "stripe size".  It is the only mandatory parameter and
            is placed first.

  followed by optional parameters (in any order):
    [sync|nosync]
            Force or prevent RAID initialization.

    [rebuild <idx>]
            Rebuild drive number 'idx' (first drive is 0).
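
            E.g., as a sketch reusing the device numbers from the
            example tables below, a table line forcing reconstruction
            of the third leg (idx 2) of a raid4 set::

                0 1960893648 raid \
                        raid4 3 2048 rebuild 2 \
                        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
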
    [daemon_sleep <ms>]
            Interval between runs of the bitmap daemon that clears
            bits.  A longer interval means less bitmap I/O but
            resyncing after a failure is likely to take longer.

    [min_recovery_rate <kB/sec/disk>]
            Throttle RAID initialization
    [max_recovery_rate <kB/sec/disk>]
            Throttle RAID initialization
    [write_mostly <idx>]
            Mark drive index 'idx' write-mostly.
    [max_write_behind <sectors>]
            See '--write-behind=' (man mdadm)
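
            As a sketch, a two-leg raid1 set whose second leg is a
            slow (e.g. remote) device; the device numbers are
            illustrative, and the chunk size is given as 0 because
            raid1 does not stripe::

                0 209715200 raid \
                        raid1 5 0 write_mostly 1 max_write_behind 256 \
                        2 254:10 254:11 254:12 254:13
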
    [stripe_cache <sectors>]
            Stripe cache size (RAID 4/5/6 only)
    [region_size <sectors>]
            The region_size multiplied by the number of regions is the
            logical size of the array.  The bitmap records the device
            synchronisation state for each region.

    [raid10_copies <# copies>], [raid10_format <near|far|offset>]
            These two options are used to alter the default layout of
            a RAID10 configuration.  The number of copies can be
            specified, but the default is 2.  There are also three
            variations to how the copies are laid down - the default
            is "near".  Near copies are what most people think of with
            respect to mirroring.  If these options are left
            unspecified, or 'raid10_copies 2' and/or 'raid10_format
            near' are given, then the layouts for 2, 3 and 4 devices
            are:

            ======== ========== ==============
            2 drives 3 drives   4 drives
            ======== ========== ==============
            A1  A1   A1  A1  A2 A1  A1  A2  A2
            A2  A2   A2  A3  A3 A3  A3  A4  A4
            A3  A3   A4  A4  A5 A5  A5  A6  A6
            A4  A4   A5  A6  A6 A7  A7  A8  A8
            ..  ..   ..  ..  .. ..  ..  ..  ..
            ======== ========== ==============

            The 2-device layout is equivalent to 2-way RAID1.  The
            4-device layout is what a traditional RAID10 would look
            like.  The 3-device layout is what might be called a
            'RAID1E - Integrated Adjacent Stripe Mirroring'.

            If 'raid10_copies 2' and 'raid10_format far', then the
            layouts for 2, 3 and 4 devices are:

            ======== ============ ===================
            2 drives 3 drives     4 drives
            ======== ============ ===================
            A1  A2   A1   A2   A3 A1   A2   A3   A4
            A3  A4   A4   A5   A6 A5   A6   A7   A8
            A5  A6   A7   A8   A9 A9   A10  A11  A12
            ..  ..   ..   ..   .. ..   ..   ..   ..
            A2  A1   A3   A1   A2 A2   A1   A4   A3
            A4  A3   A6   A4   A5 A6   A5   A8   A7
            A6  A5   A9   A7   A8 A10  A9   A12  A11
            ..  ..   ..   ..   .. ..   ..   ..   ..
            ======== ============ ===================

            If 'raid10_copies 2' and 'raid10_format offset', then the
            layouts for 2, 3 and 4 devices are:

            ======== ========== ================
            2 drives 3 drives   4 drives
            ======== ========== ================
            A1  A2   A1  A2  A3 A1  A2  A3  A4
            A2  A1   A3  A1  A2 A2  A1  A4  A3
            A3  A4   A4  A5  A6 A5  A6  A7  A8
            A4  A3   A6  A4  A5 A6  A5  A8  A7
            A5  A6   A7  A8  A9 A9  A10 A11 A12
            A6  A5   A9  A7  A8 A10 A9  A12 A11
            ..  ..   ..  ..  .. ..  ..  ..  ..
            ======== ========== ================

            Here we see layouts closely akin to 'RAID1E - Integrated
            Offset Stripe Mirroring'.
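
            As a sketch, a 4-device raid10 table using two copies in
            the "offset" layout (size and device numbers are made
            up)::

                0 3906994176 raid \
                        raid10 5 2048 raid10_copies 2 raid10_format offset \
                        4 - 8:16 - 8:32 - 8:48 - 8:64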

    [delta_disks <N>]
            The delta_disks option value (-251 < N < +251) triggers
            device removal (negative value) or device addition
            (positive value) on any reshape-supporting raid level,
            i.e. 4/5/6 and 10.  RAID levels 4/5/6 allow for addition
            of devices (metadata and data device tuples); raid10_near
            and raid10_offset only allow for device addition.
            raid10_far does not support any reshaping at all.
            A minimum number of devices has to be kept to maintain
            resilience: 3 devices for raid4/5 and 4 devices for raid6.
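
            As a sketch, growing the raid4 set from the example tables
            below by one metadata/data device pair (the new pair is
            appended; a real reshape sequence also needs out-of-place
            space, see data_offset below)::

                0 1960893648 raid \
                        raid4 3 2048 delta_disks 1 \
                        6 8:17 8:18 8:33 8:34 8:49 8:50 \
                          8:65 8:66 8:81 8:82 8:97 8:98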

    [data_offset <sectors>]
            This option value defines the offset into each data device
            where the data starts.  This is used to provide
            out-of-place reshaping space to avoid writing over data
            while changing the layout of stripes, so that an
            interruption/crash can happen at any time without the risk
            of losing data.
            E.g. when adding devices to an existing raid set during
            forward reshaping, the out-of-place space will be
            allocated at the beginning of each raid device.  The
            kernel raid4/5/6/10 MD personalities supporting such
            device addition will read the data from the existing first
            stripes (those spanning the smaller number of devices)
            starting at data_offset, fill up a new stripe spanning the
            larger number of devices, calculate the redundancy blocks
            (parity/Q-syndrome) and write that new stripe to offset 0.
            The same is applied to all N-1 other new stripes.  This
            out-of-place scheme is used to change the RAID type (i.e.
            the allocation algorithm) as well, e.g. changing from
            raid5_ls to raid5_n.
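
            E.g., a sketch of a takeover of the raid4 set from the
            example tables below to raid5_n (an identical layout),
            loaded as a new table while the device is suspended::

                0 1960893648 raid \
                        raid5_n 1 2048 \
                        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82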

    [journal_dev <dev>]
            This option adds a journal device to raid4/5/6 raid sets
            and uses it to close the 'write hole' caused by the
            non-atomic updates to the component devices, which can
            cause data loss during recovery.  The journal device is
            used in writethrough mode, thus causing writes to be
            throttled versus non-journaled raid4/5/6 sets.
            Takeover/reshape is not possible with a raid4/5/6 journal
            device; it has to be deconfigured before requesting these.

    [journal_mode <mode>]
            This option sets the caching mode on journaled raid4/5/6
            raid sets (see 'journal_dev <dev>' above) to 'writethrough'
            or 'writeback'.  If 'writeback' is selected, the journal
            device has to be resilient and must not suffer from the
            'write hole' problem itself (e.g. use raid1 or raid10) to
            avoid a single point of failure.
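
            As a sketch, a raid5_ls variant of the example-table set
            with a write-back journal on an illustrative device 254:8
            (which should itself be resilient)::

                0 1960893648 raid \
                        raid5_ls 5 2048 journal_dev 254:8 journal_mode writeback \
                        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82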

<#raid_devs>: The number of devices composing the array.
    Each device consists of two entries.  The first is the device
    containing the metadata (if any); the second is the one containing
    the data.  A maximum of 64 metadata/data device entries is
    supported up to target version 1.8.0; version 1.9.0 supports up to
    253, a limit enforced by the MD kernel runtime.

    If a drive has failed or is missing at creation time, a '-' can be
    given for both the metadata and data drives for a given position.


Example Tables
--------------

::

  # RAID4 - 4 data drives, 1 parity (no metadata devices)
  # No metadata devices specified to hold superblock/bitmap info
  # Chunk size of 1MiB
  # (Lines separated for easy reading)

  0 1960893648 raid \
          raid4 1 2048 \
          5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

  # RAID4 - 4 data drives, 1 parity (with metadata devices)
  # Chunk size of 1MiB, force RAID initialization,
  # min recovery rate at 20 kiB/sec/disk

  0 1960893648 raid \
          raid4 4 2048 sync min_recovery_rate 20 \
          5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
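
Such a table can be activated and inspected with dmsetup; as a sketch
(the mapped device name "my_raid4" is illustrative)::

  dmsetup create my_raid4 --table \
          "0 1960893648 raid raid4 1 2048 \
           5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81"
  dmsetup status my_raid4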


Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above, with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the
table.  Arguments that can be repeated are ordered by value.

'dmsetup status' yields information on the state and health of the
array.  The output is as follows (normally a single line, but expanded
here for clarity)::

  1: <s> <l> raid \
  2:      <raid_type> <#devices> <health_chars> \
  3:      <sync_ratio> <sync_action> <mismatch_cnt> <data_offset> <journal_char>

Line 1 is the standard output produced by device-mapper.

Lines 2 & 3 are produced by the raid target and are best explained by
example::

        0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0

Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 of the way through its
initial recovery.  Here is a fuller description of the individual
fields:

  =============== =========================================================
  <raid_type>     Same as the <raid_type> used to create the array.
  <health_chars>  One char for each device, indicating:

                  - 'A' = alive and in-sync
                  - 'a' = alive but not in-sync
                  - 'D' = dead/failed.
  <sync_ratio>    The ratio indicating how much of the array has
                  undergone the process described by 'sync_action'.
                  If the 'sync_action' is "check" or "repair", then the
                  process of "resync" or "recover" can be considered
                  complete.
  <sync_action>   One of the following possible states:

                  idle
                          - No synchronization action is being
                            performed.
                  frozen
                          - The current action has been halted.
                  resync
                          - Array is undergoing its initial
                            synchronization or is resynchronizing after
                            an unclean shutdown (possibly aided by a
                            bitmap).
                  recover
                          - A device in the array is being rebuilt or
                            replaced.
                  check
                          - A user-initiated full check of the array is
                            being performed.  All blocks are read and
                            checked for consistency.  The number of
                            discrepancies found is recorded in
                            <mismatch_cnt>.  No changes are made to the
                            array by this action.
                  repair
                          - The same as "check", but discrepancies are
                            corrected.
                  reshape
                          - The array is undergoing a reshape.
  <mismatch_cnt>  The number of discrepancies found between mirror
                  copies in RAID1/10 or wrong parity values found in
                  RAID4/5/6.  This value is valid only after a "check"
                  of the array is performed.  A healthy array has a
                  'mismatch_cnt' of 0.
  <data_offset>   The current data offset to the start of the user data
                  on each component device of a raid set (see the
                  respective raid parameter to support out-of-place
                  reshaping).
  <journal_char>  - 'A' - active write-through journal device.
                  - 'a' - active write-back journal device.
                  - 'D' - dead journal device.
                  - '-' - no journal device.
  =============== =========================================================


Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message'
interface.  ('man dmsetup' for more information on the message
interface.)  These actions include:

  ========= ================================================
  "idle"    Halt the current sync action.
  "frozen"  Freeze the current sync action.
  "resync"  Initiate/continue a resync.
  "recover" Initiate/continue a recovery process.
  "check"   Initiate a check (i.e. a "scrub") of the array.
  "repair"  Initiate a repair of the array.
  ========= ================================================
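
E.g., to initiate a scrub of a mapped raid device named 'my_raid' (the
name is illustrative); the progress and resulting <mismatch_cnt> can
then be read back via 'dmsetup status'::

  dmsetup message my_raid 0 check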


Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read.  These devices set the 'discard_zeroes_data'
attribute.  Other devices will return random data.  Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read!  Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for
performance reasons) relies on 'discard_zeroes_data' being reliable, it
is important that the devices be consistent.  Blocks may be discarded
in the middle of a RAID 4/5/6 stripe and, if subsequent read results
are not consistent, the parity blocks may be calculated differently at
any time, making the parity blocks useless for redundancy.  It is
important to understand how your hardware behaves with discards if you
are going to enable discards with RAID 4/5/6.

Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', RAID 4/5/6 discard support
is disabled by default -- this ensures data integrity at the expense of
losing some performance.

Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.

For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:

    'devices_handle_discard_safely'
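
For example, assuming dm-raid is built as a module, the parameter can
be set at load time or toggled at runtime through sysfs::

  modprobe dm-raid devices_handle_discard_safely=Y
  echo Y > /sys/module/dm_raid/parameters/devices_handle_discard_safely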


Version History
---------------

::

 1.0.0   Initial version.  Support for RAID 4/5/6
 1.1.0   Added support for RAID 1
 1.2.0   Handle creation of arrays that contain failed devices.
 1.3.0   Added support for RAID 10
 1.3.1   Allow device replacement/rebuild for RAID 10
 1.3.2   Fix/improve redundancy checking for RAID10
 1.4.0   Non-functional change.  Removes arg from mapping function.
 1.4.1   RAID10 fix redundancy validation checks (commit 55ebbb5).
 1.4.2   Add RAID10 "far" and "offset" algorithm support.
 1.5.0   Add message interface to allow manipulation of the sync_action.
         New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
 1.5.1   Add ability to restore transiently failed devices on resume.
 1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
 1.6.0   Add discard support (and devices_handle_discard_safely module param).
 1.7.0   Add support for MD RAID0 mappings.
 1.8.0   Explicitly check for compatible flags in the superblock metadata
         and reject to start the raid set if any are set by a newer
         target version, thus avoiding data corruption on a raid set
         with a reshape in progress.
 1.9.0   Add support for RAID level takeover/reshape/region size
         and set size reduction.
 1.9.1   Fix activation of existing RAID 4/10 mapped devices.
 1.9.2   Don't emit '- -' on the status table line in case the constructor
         fails reading a superblock.  Correctly emit 'maj:min1 maj:min2' and
         'D' on the status line.  If '- -' is passed into the constructor,
         emit '- -' on the table line and '-' as the status line health
         character.
 1.10.0  Add support for raid4/5/6 journal device.
 1.10.1  Fix data corruption on reshape request.
 1.11.0  Fix table line argument order
         (wrong raid10_copies/raid10_format sequence).
 1.11.1  Add raid4/5/6 journal write-back support via journal_mode option.
 1.12.1  Fix for MD deadlock between mddev_suspend() and md_write_start().
 1.13.0  Fix dev_health status at end of "recover" (was 'a', now 'A').
 1.13.1  Fix deadlock caused by early md_stop_writes().  Also fix size and
         state races.
 1.13.2  Fix raid redundancy validation and avoid keeping raid set frozen.
 1.14.0  Fix reshape race on small devices.  Fix stripe adding reshape
         deadlock/potential data corruption.  Update superblock when
         specific devices are requested via rebuild.  Fix RAID leg
         rebuild errors.