Diffstat (limited to 'Documentation/device-mapper/thin-provisioning.txt')
-rw-r--r-- | Documentation/device-mapper/thin-provisioning.txt | 285 |
1 files changed, 285 insertions, 0 deletions
diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
new file mode 100644
index 00000000000..801d9d1cf82
--- /dev/null
+++ b/Documentation/device-mapper/thin-provisioning.txt
@@ -0,0 +1,285 @@ | |||
1 | Introduction | ||
2 | ============ | ||
3 | |||
4 | This document describes a collection of device-mapper targets that | ||
5 | between them implement thin-provisioning and snapshots. | ||
6 | |||
7 | The main highlight of this implementation, compared to the previous | ||
8 | implementation of snapshots, is that it allows many virtual devices to | ||
9 | be stored on the same data volume. This simplifies administration and | ||
10 | allows the sharing of data between volumes, thus reducing disk usage. | ||
11 | |||
12 | Another significant feature is support for an arbitrary depth of | ||
13 | recursive snapshots (snapshots of snapshots of snapshots ...). The | ||
14 | previous implementation of snapshots did this by chaining together | ||
15 | lookup tables, and so performance was O(depth). This new | ||
16 | implementation uses a single data structure to avoid this degradation | ||
17 | with depth. Fragmentation may still be an issue, however, in some | ||
18 | scenarios. | ||
19 | |||
20 | Metadata is stored on a separate device from data, giving the | ||
21 | administrator some freedom, for example to: | ||
22 | |||
23 | - Improve metadata resilience by storing metadata on a mirrored volume | ||
24 | but data on a non-mirrored one. | ||
25 | |||
26 | - Improve performance by storing the metadata on SSD. | ||
27 | |||
28 | Status | ||
29 | ====== | ||
30 | |||
31 | These targets are very much still in the EXPERIMENTAL state. Please | ||
32 | do not yet rely on them in production. But do experiment and offer us | ||
33 | feedback. Different use cases will have different performance | ||
34 | characteristics, for example due to fragmentation of the data volume. | ||
35 | |||
36 | If you find this software is not performing as expected please mail | ||
37 | dm-devel@redhat.com with details and we'll try our best to improve | ||
38 | things for you. | ||
39 | |||
40 | Userspace tools for checking and repairing the metadata are under | ||
41 | development. | ||
42 | |||
43 | Cookbook | ||
44 | ======== | ||
45 | |||
46 | This section describes some quick recipes for using thin provisioning. | ||
47 | They use the dmsetup program to control the device-mapper driver | ||
48 | directly. End users will be advised to use a higher-level volume | ||
49 | manager such as LVM2 once support has been added. | ||
50 | |||
51 | Pool device | ||
52 | ----------- | ||
53 | |||
54 | The pool device ties together the metadata volume and the data volume. | ||
55 | It maps I/O linearly to the data volume and updates the metadata via | ||
56 | two mechanisms: | ||
57 | |||
58 | - Function calls from the thin targets | ||
59 | |||
60 | - Device-mapper 'messages' from userspace which control the creation of new | ||
61 | virtual devices amongst other things. | ||
62 | |||
63 | Setting up a fresh pool device | ||
64 | ------------------------------ | ||
65 | |||
66 | Setting up a pool device requires a valid metadata device, and a | ||
67 | data device. If you do not have an existing metadata device you can | ||
68 | make one by zeroing the first 4k to indicate empty metadata. | ||
69 | |||
70 | dd if=/dev/zero of=$metadata_dev bs=4096 count=1 | ||
71 | |||
72 | The amount of metadata you need will vary according to how many blocks | ||
73 | are shared between thin devices (i.e. through snapshots). If you have | ||
74 | less sharing than average you'll need a larger-than-average metadata device. | ||
75 | |||
76 | As a guide, we suggest you calculate the number of bytes to use in the | ||
77 | metadata device as 48 * $data_dev_size / $data_block_size but round it up | ||
78 | to 2MB if the answer is smaller. The largest size supported is 16GB. | ||
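The sizing guideline above can be turned into a small shell helper. This is a minimal sketch, not part of the targets or of dmsetup: the name metadata_bytes is hypothetical, both arguments are assumed to be in 512-byte sectors (only their ratio matters), and 2MB and 16GB are read as binary units. Capping the result at 16GB is just one way of handling the upper limit.

```shell
# Suggested metadata device size in bytes, following the guideline
# 48 * $data_dev_size / $data_block_size with a 2MB floor.
# Hypothetical helper; both arguments are assumed to be in sectors.
metadata_bytes() {
    data_dev_size=$1
    data_block_size=$2
    min_bytes=$((2 * 1024 * 1024))            # 2MB floor
    max_bytes=$((16 * 1024 * 1024 * 1024))    # 16GB largest supported size

    bytes=$((48 * data_dev_size / data_block_size))
    if [ "$bytes" -lt "$min_bytes" ]; then
        bytes=$min_bytes
    elif [ "$bytes" -gt "$max_bytes" ]; then
        # Assumption: warn and cap rather than fail.
        echo "warning: estimate exceeds the 16GB maximum" >&2
        bytes=$max_bytes
    fi
    echo "$bytes"
}
```

For a 20971520-sector data device with 128-sector blocks, `metadata_bytes 20971520 128` prints 7864320; with 1024-sector blocks the raw estimate falls below the floor and 2097152 is printed instead.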
79 | |||
80 | If you're creating large numbers of snapshots which are recording large | ||
81 | amounts of change, you may find you need to increase this. | ||
82 | |||
83 | Reloading a pool table | ||
84 | ---------------------- | ||
85 | |||
86 | You may reload a pool's table; indeed, this is how the pool is resized | ||
87 | if it runs out of space. (N.B. While specifying a different metadata | ||
88 | device when reloading is not forbidden at the moment, things will go | ||
89 | wrong if it does not route I/O to exactly the same on-disk location as | ||
90 | previously.) | ||
91 | |||
92 | Using an existing pool device | ||
93 | ----------------------------- | ||
94 | |||
95 | dmsetup create pool \ | ||
96 | --table "0 20971520 thin-pool $metadata_dev $data_dev \ | ||
97 | $data_block_size $low_water_mark" | ||
98 | |||
99 | $data_block_size gives the smallest unit of disk space that can be | ||
100 | allocated at a time expressed in units of 512-byte sectors. People | ||
101 | primarily interested in thin provisioning may want to use a value such | ||
102 | as 1024 (512KB). People doing lots of snapshotting may want a smaller value | ||
103 | such as 128 (64KB). If you are not zeroing newly-allocated data, | ||
104 | a larger $data_block_size in the region of 262144 (128MB) is suggested. | ||
105 | $data_block_size must be the same for the lifetime of the | ||
106 | metadata device. | ||
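Because $data_block_size is given in 512-byte sectors, converting to and from KB is a factor of two. The helpers below are a hypothetical convenience for checking the figures quoted above; the function names are not part of any tool.

```shell
# Convert between 512-byte sectors and KB (1KB = 1024 bytes = 2 sectors).
# Hypothetical helpers for working out $data_block_size.
sectors_to_kb() {
    echo $(($1 / 2))
}

kb_to_sectors() {
    echo $(($1 * 2))
}
```

`sectors_to_kb 1024` prints 512 and `sectors_to_kb 128` prints 64, matching the two suggested values.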
107 | |||
108 | $low_water_mark is expressed in blocks of size $data_block_size. If | ||
109 | free space on the data device drops below this level then a dm event | ||
110 | will be triggered which a userspace daemon should catch allowing it to | ||
111 | extend the pool device. Only one such event will be sent. | ||
112 | Resuming a device with a new table itself triggers an event so the | ||
113 | userspace daemon can use this to detect a situation where a new table | ||
114 | already exceeds the threshold. | ||
115 | |||
116 | Thin provisioning | ||
117 | ----------------- | ||
118 | |||
119 | i) Creating a new thinly-provisioned volume. | ||
120 | |||
121 | To create a new thinly-provisioned volume you must send a message to an | ||
122 | active pool device, /dev/mapper/pool in this example. | ||
123 | |||
124 | dmsetup message /dev/mapper/pool 0 "create_thin 0" | ||
125 | |||
126 | Here '0' is an identifier for the volume, a 24-bit number. It's up | ||
127 | to the caller to allocate and manage these identifiers. If the | ||
128 | identifier is already in use, the message will fail with -EEXIST. | ||
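Since allocating identifiers is left to the caller, a script may want to check that a candidate fits in 24 bits before sending the message. A minimal sketch, assuming identifiers are written in decimal; valid_dev_id is a hypothetical name, and the check does not detect identifiers that are already in use (the pool reports those with -EEXIST).

```shell
# Succeed if $1 is a decimal number that fits in 24 bits (0..16777215).
# Hypothetical helper; it does not check whether the id is already used.
valid_dev_id() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    # The length check keeps very long digit strings out of the comparison.
    [ "${#1}" -le 8 ] && [ "$1" -le 16777215 ]
}
```

It could guard the message, for example: `valid_dev_id "$id" && dmsetup message /dev/mapper/pool 0 "create_thin $id"`.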
129 | |||
130 | ii) Using a thinly-provisioned volume. | ||
131 | |||
132 | Thinly-provisioned volumes are activated using the 'thin' target: | ||
133 | |||
134 | dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0" | ||
135 | |||
136 | The last parameter is the identifier for the thin device. | ||
137 | |||
138 | Internal snapshots | ||
139 | ------------------ | ||
140 | |||
141 | i) Creating an internal snapshot. | ||
142 | |||
143 | Snapshots are created with another message to the pool. | ||
144 | |||
145 | N.B. If the origin device that you wish to snapshot is active, you | ||
146 | must suspend it before creating the snapshot to avoid corruption. | ||
147 | This is NOT enforced at the moment, so please be careful! | ||
148 | |||
149 | dmsetup suspend /dev/mapper/thin | ||
150 | dmsetup message /dev/mapper/pool 0 "create_snap 1 0" | ||
151 | dmsetup resume /dev/mapper/thin | ||
152 | |||
153 | Here '1' is the identifier for the volume, a 24-bit number. '0' is the | ||
154 | identifier for the origin device. | ||
155 | |||
156 | ii) Using an internal snapshot. | ||
157 | |||
158 | Once created, the user doesn't have to worry about any connection | ||
159 | between the origin and the snapshot. Indeed the snapshot is no | ||
160 | different from any other thinly-provisioned device and can be | ||
161 | snapshotted itself via the same method. It's perfectly legal to | ||
162 | have only one of them active, and there's no ordering requirement on | ||
163 | activating or removing them both. (This differs from conventional | ||
164 | device-mapper snapshots.) | ||
165 | |||
166 | Activate it exactly the same way as any other thinly-provisioned volume: | ||
167 | |||
168 | dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1" | ||
169 | |||
170 | Deactivation | ||
171 | ------------ | ||
172 | |||
173 | All devices using a pool must be deactivated before the pool itself | ||
174 | can be. | ||
175 | |||
176 | dmsetup remove thin | ||
177 | dmsetup remove snap | ||
178 | dmsetup remove pool | ||
179 | |||
180 | Reference | ||
181 | ========= | ||
182 | |||
183 | 'thin-pool' target | ||
184 | ------------------ | ||
185 | |||
186 | i) Constructor | ||
187 | |||
188 | thin-pool <metadata dev> <data dev> <data block size (sectors)> \ | ||
189 | <low water mark (blocks)> [<number of feature args> [<arg>]*] | ||
190 | |||
191 | Optional feature arguments: | ||
192 | - 'skip_block_zeroing': skips the zeroing of newly-provisioned blocks. | ||
193 | |||
194 | Data block size must be between 64KB (128 sectors) and 1GB | ||
195 | (2097152 sectors) inclusive. | ||
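The constructor line can be assembled by a small shell function that also checks the block size range stated above. This is a sketch under assumptions: thin_pool_table is a hypothetical name, the table is assumed to start at sector 0 as in the Cookbook examples, and only the range check described in this document is performed.

```shell
# Print a thin-pool table line, checking the data block size range
# (128..2097152 sectors inclusive). Hypothetical helper.
# Usage: thin_pool_table <length> <metadata dev> <data dev> \
#            <data block size> <low water mark> [<feature arg>...]
thin_pool_table() {
    if [ "$#" -lt 5 ]; then
        echo "usage: thin_pool_table length metadata_dev data_dev block_size low_water_mark [feature...]" >&2
        return 1
    fi
    length=$1 metadata_dev=$2 data_dev=$3 block_size=$4 low_water_mark=$5
    shift 5

    if [ "$block_size" -lt 128 ] || [ "$block_size" -gt 2097152 ]; then
        echo "data block size must be 128..2097152 sectors" >&2
        return 1
    fi

    table="0 $length thin-pool $metadata_dev $data_dev $block_size $low_water_mark"
    # Feature arguments are preceded by their count.
    if [ "$#" -gt 0 ]; then
        table="$table $# $*"
    fi
    echo "$table"
}
```

The output is intended to be passed to dmsetup, for example `dmsetup create pool --table "$(thin_pool_table 20971520 $metadata_dev $data_dev 128 $low_water_mark skip_block_zeroing)"`.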
196 | |||
197 | |||
198 | ii) Status | ||
199 | |||
200 | <transaction id> <used metadata blocks>/<total metadata blocks> | ||
201 | <used data blocks>/<total data blocks> <held metadata root> | ||
202 | |||
203 | |||
204 | transaction id: | ||
205 | A 64-bit number used by userspace to help synchronise with metadata | ||
206 | from volume managers. | ||
207 | |||
208 | used data blocks / total data blocks: | ||
209 | If the number of free blocks drops below the pool's low water mark a | ||
210 | dm event will be sent to userspace. This event is edge-triggered and | ||
211 | it will occur only once after each resume so volume manager writers | ||
212 | should register for the event and then check the target's status. | ||
213 | |||
214 | held metadata root: | ||
215 | The location, in sectors, of the metadata root that has been | ||
216 | 'held' for userspace read access. '-' indicates there is no | ||
217 | held root. This feature is not yet implemented so '-' is | ||
218 | always returned. | ||
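A userspace daemon that has been woken by the event could work out the remaining space from the status fields described above. The sketch below is illustrative only: the function names are hypothetical, and the argument is assumed to be just the target-specific part of the status line, in the field order given above.

```shell
# Print the number of free data blocks given the thin-pool status fields
#   <transaction id> <used meta>/<total meta> <used data>/<total data> <held root>
# Hypothetical helper; pass only the target-specific part of the status line.
pool_free_data_blocks() {
    # Split the status string into positional parameters.
    set -- $1
    data=$3                 # "<used data blocks>/<total data blocks>"
    used=${data%/*}
    total=${data#*/}
    echo $((total - used))
}

# Succeed if free data blocks have dropped below the low water mark ($2).
pool_below_low_water() {
    free=$(pool_free_data_blocks "$1")
    [ "$free" -lt "$2" ]
}
```

When reading from `dmsetup status`, the start sector, length and target name are expected to come before these fields and would need to be stripped first, for example with `cut -d' ' -f4-`.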
219 | |||
220 | iii) Messages | ||
221 | |||
222 | create_thin <dev id> | ||
223 | |||
224 | Create a new thinly-provisioned device. | ||
225 | <dev id> is an arbitrary unique 24-bit identifier chosen by | ||
226 | the caller. | ||
227 | |||
228 | create_snap <dev id> <origin id> | ||
229 | |||
230 | Create a new snapshot of another thinly-provisioned device. | ||
231 | <dev id> is an arbitrary unique 24-bit identifier chosen by | ||
232 | the caller. | ||
233 | <origin id> is the identifier of the thinly-provisioned device | ||
234 | of which the new device will be a snapshot. | ||
235 | |||
236 | delete <dev id> | ||
237 | |||
238 | Deletes a thin device. Irreversible. | ||
239 | |||
240 | trim <dev id> <new size in sectors> | ||
241 | |||
242 | Delete mappings from the end of a thin device. Irreversible. | ||
243 | You might want to use this if you're reducing the size of | ||
244 | your thinly-provisioned device. In many cases, due to the | ||
245 | sharing of blocks between devices, it is not possible to | ||
246 | determine in advance how much space 'trim' will release. (In | ||
247 | future a userspace tool might be able to perform this | ||
248 | calculation.) | ||
249 | |||
250 | set_transaction_id <current id> <new id> | ||
251 | |||
252 | Userland volume managers, such as LVM, need a way to | ||
253 | synchronise their external metadata with the internal metadata of the | ||
254 | pool target. The thin-pool target offers to store an | ||
255 | arbitrary 64-bit transaction id and return it on the target's | ||
256 | status line. To avoid races you must provide what you think | ||
257 | the current transaction id is when you change it with this | ||
258 | compare-and-swap message. | ||
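As an illustration of the compare-and-swap usage, the helper below builds a message that advances the transaction id by one, taking the current id from the first status field. The function name is hypothetical, and incrementing by one is only an assumption made for the example; a volume manager is free to choose any new id.

```shell
# Build the compare-and-swap message that advances the transaction id by one,
# given the thin-pool status fields (the transaction id is the first field).
# Hypothetical helper; shell arithmetic is signed, so ids near the top of the
# 64-bit range are not handled.
next_transaction_id_message() {
    set -- $1
    current=$1
    echo "set_transaction_id $current $((current + 1))"
}
```

The result would be sent with something like `dmsetup message /dev/mapper/pool 0 "$(next_transaction_id_message "$status")"`, where $status holds the target-specific status fields.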
259 | |||
260 | 'thin' target | ||
261 | ------------- | ||
262 | |||
263 | i) Constructor | ||
264 | |||
265 | thin <pool dev> <dev id> | ||
266 | |||
267 | pool dev: | ||
268 | the thin-pool device, e.g. /dev/mapper/my_pool or 253:0 | ||
269 | |||
270 | dev id: | ||
271 | the internal device identifier of the device to be | ||
272 | activated. | ||
273 | |||
274 | The pool doesn't store any size against the thin devices. If you | ||
275 | load a thin target that is smaller than you've been using previously, | ||
276 | then you'll have no access to blocks mapped beyond the end. If you | ||
277 | load a target that is bigger than before, then extra blocks will be | ||
278 | provisioned as and when needed. | ||
279 | |||
280 | If you wish to reduce the size of your thin device and potentially | ||
281 | regain some space then send the 'trim' message to the pool. | ||
282 | |||
283 | ii) Status | ||
284 | |||
285 | <nr mapped sectors> <highest mapped sector> | ||