Diffstat (limited to 'Documentation/device-mapper/thin-provisioning.rst')
 -rw-r--r--  Documentation/device-mapper/thin-provisioning.rst | 427
 1 file changed, 427 insertions, 0 deletions

diff --git a/Documentation/device-mapper/thin-provisioning.rst b/Documentation/device-mapper/thin-provisioning.rst
new file mode 100644
index 000000000000..bafebf79da4b
--- /dev/null
+++ b/Documentation/device-mapper/thin-provisioning.rst
@@ -0,0 +1,427 @@
=================
Thin provisioning
=================

Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume. This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...). The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth). This new
implementation uses a single data structure to avoid this degradation
with depth. Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are considered safe for production use, but different use
cases will have different performance characteristics, for example due
to fragmentation of the data volume.

If you find this software is not performing as expected, please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata have been fully
developed and are available as 'thin_check' and 'thin_repair'. The name
of the package that provides these utilities varies by distribution (on
a Red Hat distribution it is named 'device-mapper-persistent-data').
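
For example, the metadata of an inactive pool can be checked with
something like this (a minimal sketch; $metadata_dev stands in for your
metadata volume)::

    thin_check $metadata_dev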

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly. End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace, which control the creation
  of new virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device and a data
device. If you do not have an existing metadata device you can make
one by zeroing the first 4k to indicate empty metadata::

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots). If you have
less sharing than average you'll need a larger-than-average metadata
device.

As a guide, we suggest you calculate the number of bytes to use for the
metadata device as 48 * $data_dev_size / $data_block_size, but round it
up to 2MB if the answer is smaller. If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.
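
As a rough sketch of that calculation in shell (sizes in 512-byte
sectors; $data_dev and the chosen block size are placeholders)::

    data_dev_size=$(blockdev --getsz $data_dev)   # data device size in sectors
    data_block_size=1024                          # e.g. 512KB blocks
    meta_bytes=$(( 48 * data_dev_size / data_block_size ))
    min_bytes=$(( 2 * 1024 * 1024 ))              # never go below 2MB
    [ $meta_bytes -lt $min_bytes ] && meta_bytes=$min_bytes
    echo "suggested metadata device size: $meta_bytes bytes"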

The largest size supported is 16GB: if the device is larger, a warning
will be issued and the excess space will not be used.

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space. (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
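
For instance, growing the pool after extending the underlying data
device might look like this (a minimal sketch; only the table length
changes, and the variables are placeholders for the values used when
the pool was created)::

    new_size=$(blockdev --getsz $data_dev)    # new data device size in sectors
    dmsetup suspend pool
    dmsetup reload pool --table \
        "0 $new_size thin-pool $metadata_dev $data_dev $data_block_size $low_water_mark"
    dmsetup resume pool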

Using an existing pool device
-----------------------------

::

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
                 $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB). $data_block_size cannot be changed after the
thin-pool is created. People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB). People doing lots of
snapshotting may want a smaller value such as 128 (64KB). If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.
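
Putting this together, a pool table can be built from the data device's
actual size rather than a hard-coded length (a sketch; the block size
and low water mark values are only examples)::

    data_block_size=1024                          # 512KB blocks
    low_water_mark=10240                          # event when ~5GB of free space remains
    data_dev_size=$(blockdev --getsz $data_dev)   # length of the mapping in sectors
    dmsetup create pool --table \
        "0 $data_dev_size thin-pool $metadata_dev $data_dev $data_block_size $low_water_mark"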

$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
will be triggered, which a userspace daemon should catch, allowing it
to extend the pool device. Only one such event will be sent.

No special event is triggered if a just-resumed device's free space is
below the low water mark. However, resuming a device always triggers an
event; a userspace daemon should verify that free space exceeds the low
water mark when handling this event.

A low water mark for the metadata device is maintained in the kernel and
will trigger a dm event if free space on the metadata device drops below
it.
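
A very small sketch of such a daemon's main loop, using only standard
dmsetup commands (the pool name and the decision about when to extend
are left to you)::

    while true; do
        event_nr=$(dmsetup info -c --noheadings -o events pool)
        dmsetup wait pool $event_nr    # block until the next dm event
        dmsetup status pool            # parse used/total data and metadata
                                       # blocks here and extend if needed
    done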

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the thin-provisioning target behaves like a physical disk that has
a volatile write cache. If power is lost you may lose some recent
writes. The metadata should always be consistent in spite of any crash.

If data space is exhausted the pool will either error or queue IO
according to the configuration (see: error_if_no_space). If metadata
space is exhausted or a metadata operation fails, the pool will error IO
until the pool is taken offline and repair is performed to 1) fix any
potential inconsistencies and 2) clear the flag that imposes repair.
Once the pool's metadata device is repaired it may be resized, which
will allow the pool to return to normal operation. Note that if a pool
is flagged as needing repair, the pool's data and metadata devices
cannot be resized until repair is performed. It should also be noted
that when the pool's metadata space is exhausted the current metadata
transaction is aborted. Given that the pool will cache IO whose
completion may have already been acknowledged to upper IO layers
(e.g. filesystem), it is strongly suggested that consistency checks
(e.g. fsck) be performed on those layers when repair of the pool is
required.
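
A rough sketch of the offline repair flow described above, assuming a
spare metadata volume is available to receive the repaired metadata
(device names are placeholders)::

    # deactivate every thin device and then the pool itself
    dmsetup remove thin
    dmsetup remove pool

    # inspect, then write repaired metadata to the spare device
    thin_check $metadata_dev
    thin_repair -i $metadata_dev -o $spare_metadata_dev

    # verify the result before re-creating the pool on top of it
    thin_check $spare_metadata_dev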

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

  To create a new thinly-provisioned volume you must send a message to
  an active pool device, /dev/mapper/pool in this example::

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

  Here '0' is an identifier for the volume, a 24-bit number. It's up
  to the caller to allocate and manage these identifiers. If the
  identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

  Thinly-provisioned volumes are activated using the 'thin' target::

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

  The last parameter is the identifier for the thinp device.

Internal snapshots
------------------

i) Creating an internal snapshot.

  Snapshots are created with another message to the pool.

  N.B. If the origin device that you wish to snapshot is active, you
  must suspend it before creating the snapshot to avoid corruption.
  This is NOT enforced at the moment, so please be careful!

  ::

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

  Here '1' is the identifier for the volume, a 24-bit number. '0' is
  the identifier for the origin device.

ii) Using an internal snapshot.

  Once created, the user doesn't have to worry about any connection
  between the origin and the snapshot. Indeed the snapshot is no
  different from any other thinly-provisioned device and can be
  snapshotted itself via the same method. It's perfectly legal to
  have only one of them active, and there's no ordering requirement on
  activating or removing them both. (This differs from conventional
  device-mapper snapshots.)

  Activate it exactly the same way as any other thinly-provisioned volume::

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

External snapshots
------------------

You can use an external **read only** device as an origin for a
thinly-provisioned volume. Any read to an unprovisioned area of the
thin device will be passed through to the origin. Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.

i) Creating a snapshot of an external device

  This is the same as creating a thin device.
  You don't mention the origin at this stage.

  ::

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

  Append an extra parameter to the thin target specifying the origin::

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

  N.B. All descendants (internal snapshots) of this snapshot require the
  same extra origin parameter.

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

::

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

  ::

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

  Optional feature arguments:

    skip_block_zeroing:
      Skip the zeroing of newly-provisioned blocks.

    ignore_discard:
      Disable discard support.

    no_discard_passdown:
      Don't pass discards down to the underlying data device, but
      just remove the mapping.

    read_only:
      Don't allow any changes to be made to the pool metadata. This
      mode is only available after the thin-pool has been created and
      first used in full read/write mode. It cannot be specified on
      initial thin-pool creation.

    error_if_no_space:
      Error IOs, instead of queueing, if no space.

  Data block size must be between 64KB (128 sectors) and 1GB
  (2097152 sectors) inclusive.
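
  For illustration, a pool could be created with two of the feature
  arguments above like this (the sizes are only examples)::

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev 1024 8192 \
                 2 skip_block_zeroing error_if_no_space"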

ii) Status

  ::

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>
    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
    needs_check|- metadata_low_watermark

  transaction id:
    A 64-bit number used by userspace to help synchronise with metadata
    from volume managers.

  used data blocks / total data blocks
    If the number of free blocks drops below the pool's low water mark
    a dm event will be sent to userspace. This event is edge-triggered
    and it will occur only once after each resume, so volume manager
    writers should register for the event and then check the target's
    status.

  held metadata root:
    The location, in blocks, of the metadata root that has been
    'held' for userspace read access. '-' indicates there is no
    held root.

  discard_passdown|no_discard_passdown
    Whether or not discards are actually being passed down to the
    underlying device. Even if this is enabled when loading the table,
    it can be disabled if the underlying device doesn't support it.

  ro|rw|out_of_data_space
    If the pool encounters certain types of device failures it will
    drop into a read-only metadata mode in which no changes to
    the pool metadata (like allocating new blocks) are permitted.

    In serious cases where even a read-only mode is deemed unsafe
    no further I/O will be permitted and the status will just
    contain the string 'Fail'. The userspace recovery tools
    should then be used.

  error_if_no_space|queue_if_no_space
    If the pool runs out of data or metadata space, the pool will
    either queue or error the IO destined to the data device. The
    default is to queue the IO until more space is added or the
    'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
    module parameter can be used to change this timeout -- it
    defaults to 60 seconds but may be disabled using a value of 0.

  needs_check
    A metadata operation has failed, resulting in the needs_check
    flag being set in the metadata's superblock. The metadata
    device must be deactivated and checked/repaired before the
    thin-pool can be made fully operational again. '-' indicates
    needs_check is not set.

  metadata_low_watermark:
    Value of the metadata low watermark in blocks. The kernel sets
    this value internally, but userspace needs to know it to determine
    whether an event was caused by crossing this threshold.
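
  Purely as an illustration of the field order (the numbers are made
  up), the output of 'dmsetup status pool' for a healthy pool might
  look like::

    0 20971520 thin-pool 0 185/4161600 10240/20480 - rw discard_passdown queue_if_no_space - 1024

  The leading '0 20971520 thin-pool' is the standard start/length/target
  prefix that dmsetup prints for every target; the fields after it
  follow the format above.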

iii) Messages

  create_thin <dev id>
    Create a new thinly-provisioned device.
    <dev id> is an arbitrary unique 24-bit identifier chosen by
    the caller.

  create_snap <dev id> <origin id>
    Create a new snapshot of another thinly-provisioned device.
    <dev id> is an arbitrary unique 24-bit identifier chosen by
    the caller.
    <origin id> is the identifier of the thinly-provisioned device
    of which the new device will be a snapshot.

  delete <dev id>
    Deletes a thin device. Irreversible.

  set_transaction_id <current id> <new id>
    Userland volume managers, such as LVM, need a way to synchronise
    their external metadata with the internal metadata of the pool
    target. The thin-pool target offers to store an arbitrary 64-bit
    transaction id and return it on the target's status line. To avoid
    races you must provide what you think the current transaction id
    is when you change it with this compare-and-swap message.
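
    For example, moving the transaction id from 0 to 1 (this should
    fail if the current id is not 0)::

      dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"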

  reserve_metadata_snap
    Reserve a copy of the data mapping btree for use by userland.
    This allows userland to inspect the mappings as they were when
    this message was executed. Use the pool's status command to
    get the root block associated with the metadata snapshot.

  release_metadata_snap
    Release a previously reserved copy of the data mapping btree.
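
    For instance, a reserve/dump/release cycle might look like this (a
    sketch; thin_dump is part of the userspace tool package mentioned
    above, and $metadata_dev is the pool's metadata device)::

      dmsetup message /dev/mapper/pool 0 reserve_metadata_snap
      thin_dump --metadata-snap $metadata_dev > /tmp/mappings.xml
      dmsetup message /dev/mapper/pool 0 release_metadata_snap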

'thin' target
-------------

i) Constructor

  ::

    thin <pool dev> <dev id> [<external origin dev>]

  pool dev:
    the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

  dev id:
    the internal device identifier of the device to be
    activated.

  external origin dev:
    an optional block device outside the pool to be treated as a
    read-only snapshot origin: reads to unprovisioned areas of the
    thin target will be mapped to this device.

The pool doesn't store any size against the thin devices. If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end. If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.
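
For example, to grow an active thin volume from 1GB to 2GB of virtual
size, only the table length needs to change (a sketch)::

    dmsetup suspend thin
    dmsetup reload thin --table "0 4194304 thin /dev/mapper/pool 0"
    dmsetup resume thin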

ii) Status

  <nr mapped sectors> <highest mapped sector>
    If the pool has encountered device errors and failed, the status
    will just contain the string 'Fail'. The userspace recovery
    tools should then be used.

    In the case where <nr mapped sectors> is 0, there is no highest
    mapped sector and the value of <highest mapped sector> is
    unspecified.