Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume. This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...). The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth). This new
implementation uses a single data structure to avoid this degradation
with depth. Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are very much still in the EXPERIMENTAL state. Please
do not yet rely on them in production. But do experiment and offer us
feedback. Different use cases will have different performance
characteristics, for example due to fragmentation of the data volume.

If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata are under
development.

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly. End users are advised to use a higher-level volume manager
such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of new
  virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device, and a
data device. If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots). If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller. If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.
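
As a rough worked example, the following shell sketch applies the
sizing rule above; $data_dev is a placeholder and a 64KB (128-sector)
block size is assumed:

    data_dev_size=$(blockdev --getsz $data_dev)   # data device size in 512-byte sectors
    data_block_size=128                           # 64KB data blocks
    meta_bytes=$(( 48 * data_dev_size / data_block_size ))
    # Round up to the 2MB minimum suggested above.
    [ $meta_bytes -lt 2097152 ] && meta_bytes=2097152
    echo $meta_bytes      # e.g. 7864320 (~7.5MB) for a 10GB (20971520-sector) data device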

The largest size supported is 16GB: If the device is larger,
a warning will be issued and the excess space will not be used.

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space. (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
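
For example, a sketch of growing the pool after its data device has
been extended to 20GB (41943040 sectors); the metadata device and the
data block size must stay the same:

    dmsetup suspend pool
    dmsetup load pool --table "0 41943040 thin-pool $metadata_dev $data_dev \
        $data_block_size $low_water_mark"
    dmsetup resume pool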

Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
        $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time expressed in units of 512-byte sectors. People
primarily interested in thin provisioning may want to use a value such
as 1024 (512KB). People doing lots of snapshotting may want a smaller value
such as 128 (64KB). If you are not zeroing newly-allocated data,
a larger $data_block_size in the region of 256000 (128MB) is suggested.
$data_block_size must be the same for the lifetime of the
metadata device.

$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch allowing it to
extend the pool device. Only one such event will be sent.
Resuming a device with a new table itself triggers an event so the
userspace daemon can use this to detect a situation where a new table
already exceeds the threshold.
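
A minimal sketch of such a daemon loop, assuming the pool device is
named 'pool'; the actual resize step is left as a comment:

    while :; do
        event=$(dmsetup info -c --noheadings -o events pool | tr -d ' ')
        dmsetup wait pool "$event"    # block until the next dm event arrives
        dmsetup status pool           # inspect <used>/<total> data blocks
        # ... extend the data device and reload the pool table here ...
    done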

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

To create a new thinly-provisioned volume you must send a message to an
active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

Here '0' is an identifier for the volume, a 24-bit number. It's up
to the caller to allocate and manage these identifiers. If the
identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

The last parameter is the identifier for the thinp device.
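
The new device can then be used like any other block device; a minimal
example, assuming you want an ext4 filesystem on it and an existing
/mnt mount point:

    mkfs.ext4 /dev/mapper/thin
    mount /dev/mapper/thin /mnt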

Internal snapshots
------------------

i) Creating an internal snapshot.

Snapshots are created with another message to the pool.

N.B. If the origin device that you wish to snapshot is active, you
must suspend it before creating the snapshot to avoid corruption.
This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

Here '1' is the identifier for the volume, a 24-bit number. '0' is the
identifier for the origin device.

ii) Using an internal snapshot.

Once created, the user doesn't have to worry about any connection
between the origin and the snapshot. Indeed the snapshot is no
different from any other thinly-provisioned device and can be
snapshotted itself via the same method. It's perfectly legal to
have only one of them active, and there's no ordering requirement on
activating or removing them both. (This differs from conventional
device-mapper snapshots.)

Activate it exactly the same way as any other thinly-provisioned volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume. Any read to an unprovisioned area of the
thin device will be passed through to the origin. Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.

i) Creating a snapshot of an external device

This is the same as creating a thin device.
You don't mention the origin at this stage.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

Append an extra parameter to the thin target specifying the origin:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

N.B. All descendants (internal snapshots) of this snapshot require the
same extra origin parameter.

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:

      skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.

      ignore_discard: Disable discard support.

      no_discard_passdown: Don't pass discards down to the underlying
                           data device, but just remove the mapping.

      read_only: Don't allow any changes to be made to the pool
                 metadata.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
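
    For example (the sizes and device names are illustrative), a pool
    over a 10GB data device with 64KB blocks, a low water mark of 32768
    blocks and block zeroing disabled could be created with:

        dmsetup create pool \
            --table "0 20971520 thin-pool $metadata_dev $data_dev \
                128 32768 1 skip_block_zeroing"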


ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>
    [no_]discard_passdown ro|rw
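
    An illustrative status line, following the field order above (all
    values are made up):

        dmsetup status pool
        0 20971520 thin-pool 2 17/4096 432/163840 - discard_passdown rw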

    transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
        from volume managers.

    used data blocks / total data blocks:
        If the number of free blocks drops below the pool's low water mark a
        dm event will be sent to userspace. This event is edge-triggered and
        it will occur only once after each resume so volume manager writers
        should register for the event and then check the target's status.

    held metadata root:
        The location, in sectors, of the metadata root that has been
        'held' for userspace read access. '-' indicates there is no
        held root. This feature is not yet implemented so '-' is
        always returned.

    discard_passdown|no_discard_passdown:
        Whether or not discards are actually being passed down to the
        underlying device. Even if discard passdown is enabled when the
        table is loaded, it can get disabled if the underlying device
        doesn't support it.

    ro|rw:
        If the pool encounters certain types of device failures it will
        drop into a read-only metadata mode in which no changes to
        the pool metadata (like allocating new blocks) are permitted.

        In serious cases where even a read-only mode is deemed unsafe
        no further I/O will be permitted and the status will just
        contain the string 'Fail'. The userspace recovery tools
        should then be used.

iii) Messages

    create_thin <dev id>

        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>

        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>

        Deletes a thin device. Irreversible.
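
        For example, to delete the thin device with identifier 1:

            dmsetup message /dev/mapper/pool 0 "delete 1"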

    set_transaction_id <current id> <new id>

        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata of the
        pool target. The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line. To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.
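
        For example, to move the transaction id from 0 to 1:

            dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"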

    reserve_metadata_snap

        Reserve a copy of the data mapping btree for use by userland.
        This allows userland to inspect the mappings as they were when
        this message was executed. Use the pool's status command to
        get the root block associated with the metadata snapshot.

    release_metadata_snap

        Release a previously reserved copy of the data mapping btree.
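
        A sketch of the reserve/inspect/release cycle from userspace:

            dmsetup message /dev/mapper/pool 0 "reserve_metadata_snap"
            dmsetup status /dev/mapper/pool   # the held root appears in the status output
            dmsetup message /dev/mapper/pool 0 "release_metadata_snap"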

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id> [<external origin dev>]

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be
        activated.

    external origin dev:
        an optional block device outside the pool to be treated as a
        read-only snapshot origin: reads to unprovisioned areas of the
        thin target will be mapped to this device.

The pool doesn't store any size against the thin devices. If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end. If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.
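
For example, a sketch of growing the 1GB thin device from the cookbook
to 2GB (4194304 sectors) simply by loading a larger table:

    dmsetup suspend thin
    dmsetup load thin --table "0 4194304 thin /dev/mapper/pool 0"
    dmsetup resume thin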

If you wish to reduce the size of your thin device and potentially
regain some space then send the 'trim' message to the pool.

ii) Status

    <nr mapped sectors> <highest mapped sector>
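
    An illustrative example (the values are made up) for a thin device
    with 2048 sectors mapped:

        dmsetup status thin
        0 2097152 thin 2048 2047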

If the pool has encountered device errors and failed, the status
will just contain the string 'Fail'. The userspace recovery
tools should then be used.