Diffstat (limited to 'Documentation/device-mapper/cache.rst')
-rw-r--r-- | Documentation/device-mapper/cache.rst | 337 |
1 files changed, 337 insertions, 0 deletions
diff --git a/Documentation/device-mapper/cache.rst b/Documentation/device-mapper/cache.rst
new file mode 100644
index 000000000000..f15e5254d05b
--- /dev/null
+++ b/Documentation/device-mapper/cache.rst
@@ -0,0 +1,337 @@
1 | ===== | ||
2 | Cache | ||
3 | ===== | ||
4 | |||
5 | Introduction | ||
6 | ============ | ||
7 | |||
8 | dm-cache is a device mapper target written by Joe Thornber, Heinz | ||
9 | Mauelshagen, and Mike Snitzer. | ||
10 | |||
11 | It aims to improve performance of a block device (e.g., a spindle) by | |
12 | dynamically migrating some of its data to a faster, smaller device | |
13 | (e.g., an SSD). | |
14 | |||
15 | This device-mapper solution allows us to insert this caching at | ||
16 | different levels of the dm stack, for instance above the data device for | ||
17 | a thin-provisioning pool. Caching solutions that are integrated more | ||
18 | closely with the virtual memory system should give better performance. | ||
19 | |||
20 | The target reuses the metadata library used by the thin-provisioning | |
21 | target. | |
22 | |||
23 | The decision as to what data to migrate and when is left to a plug-in | ||
24 | policy module. Several of these have been written as we experiment, | ||
25 | and we hope other people will contribute others for specific I/O | |
26 | scenarios (e.g., a VM image server). | |
27 | |||
28 | Glossary | ||
29 | ======== | ||
30 | |||
31 | Migration | ||
32 | Movement of the primary copy of a logical block from one | ||
33 | device to the other. | ||
34 | Promotion | ||
35 | Migration from slow device to fast device. | ||
36 | Demotion | ||
37 | Migration from fast device to slow device. | ||
38 | |||
39 | The origin device always contains a copy of the logical block, which | ||
40 | may be out of date or kept in sync with the copy on the cache device | ||
41 | (depending on policy). | ||
42 | |||
43 | Design | ||
44 | ====== | ||
45 | |||
46 | Sub-devices | ||
47 | ----------- | ||
48 | |||
49 | The target is constructed by passing three devices to it (along with | ||
50 | other parameters detailed later): | ||
51 | |||
52 | 1. An origin device - the big, slow one. | ||
53 | |||
54 | 2. A cache device - the small, fast one. | ||
55 | |||
56 | 3. A small metadata device - records which blocks are in the cache, | ||
57 | which are dirty, and extra hints for use by the policy object. | ||
58 | This information could be put on the cache device, but having it | ||
59 | separate allows the volume manager to configure it differently, | ||
60 | e.g. as a mirror for extra robustness. This metadata device may only | ||
61 | be used by a single cache device. | ||
62 | |||
63 | Fixed block size | ||
64 | ---------------- | ||
65 | |||
66 | The origin is divided up into blocks of a fixed size. This block size | ||
67 | is configurable when you first create the cache. Typically we've been | ||
68 | using block sizes of 256KB - 1024KB. The block size must be between 64 | ||
69 | sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB). | ||
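
The block size constraints above can be checked before a cache is created. The following is a minimal sketch, not part of dm-cache itself; the helper names are made up for illustration, and 512-byte sectors are assumed, as elsewhere in this document.

```python
SECTOR_SIZE = 512  # bytes per sector, as assumed throughout this document

MIN_BLOCK_SECTORS = 64        # 32KB
MAX_BLOCK_SECTORS = 2097152   # 1GB


def kib_to_sectors(kib: int) -> int:
    """Convert a size in KB (units of 1024 bytes) to 512-byte sectors."""
    return kib * 1024 // SECTOR_SIZE


def is_valid_block_size(sectors: int) -> bool:
    """Return True if `sectors` meets the constraints stated above:
    between 64 and 2097152 sectors, and a multiple of 64 sectors."""
    return (
        MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS
        and sectors % MIN_BLOCK_SECTORS == 0
    )
```

For instance, a 256KB block is ``kib_to_sectors(256)``, i.e. 512 sectors, which passes the check, whereas 100 sectors does not because it is not a multiple of 64.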
70 | |||
71 | Having a fixed block size simplifies the target a lot. But it is | ||
72 | something of a compromise. For instance, a small part of a block may be | ||
73 | getting hit a lot, yet the whole block will be promoted to the cache. | ||
74 | So large block sizes are bad because they waste cache space. And small | ||
75 | block sizes are bad because they increase the amount of metadata (both | ||
76 | in core and on disk). | ||
77 | |||
78 | Cache operating modes | ||
79 | --------------------- | ||
80 | |||
81 | The cache has three operating modes: writeback, writethrough and | ||
82 | passthrough. | ||
83 | |||
84 | If writeback, the default, is selected then a write to a block that is | ||
85 | cached will go only to the cache and the block will be marked dirty in | ||
86 | the metadata. | ||
87 | |||
88 | If writethrough is selected then a write to a cached block will not | ||
89 | complete until it has hit both the origin and cache devices. Clean | ||
90 | blocks should remain clean. | ||
91 | |||
92 | If passthrough is selected, useful when the cache contents are not known | ||
93 | to be coherent with the origin device, then all reads are served from | ||
94 | the origin device (all reads miss the cache) and all writes are | ||
95 | forwarded to the origin device; additionally, write hits cause cache | ||
96 | block invalidates. To enable passthrough mode the cache must be clean. | ||
97 | Passthrough mode allows a cache device to be activated without having to | ||
98 | worry about coherency. Coherency that exists is maintained, although | ||
99 | the cache will gradually cool as writes take place. If the coherency of | ||
100 | the cache can later be verified, or established through use of the | ||
101 | "invalidate_cblocks" message, the cache device can be transitioned to | ||
102 | writethrough or writeback mode while still warm. Otherwise, the cache | ||
103 | contents can be discarded prior to transitioning to the desired | ||
104 | operating mode. | ||
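
To make the three modes concrete, here is a toy model of where reads and writes are sent. It is a sketch of the behaviour described above only: the class and its names are invented for illustration, promotion and demotion are not modelled, and write misses are simply assumed to go to the origin.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCache:
    """Toy model of the routing rules described above (not dm-cache code)."""
    mode: str = "writeback"
    cached: set = field(default_factory=set)  # blocks with a copy on the cache device
    dirty: set = field(default_factory=set)   # cached blocks newer than the origin

    def read(self, block: int) -> str:
        """Return the device that serves a read of `block`."""
        if self.mode != "passthrough" and block in self.cached:
            return "cache"
        return "origin"  # passthrough: all reads miss the cache

    def write(self, block: int) -> list:
        """Return the devices a write to `block` is sent to."""
        if block not in self.cached:
            return ["origin"]  # assumption: misses go straight to the origin
        if self.mode == "writeback":
            self.dirty.add(block)  # origin copy is now out of date
            return ["cache"]
        if self.mode == "writethrough":
            return ["cache", "origin"]  # block stays clean
        if self.mode == "passthrough":
            self.cached.discard(block)  # a write hit invalidates the cache block
            return ["origin"]
        raise ValueError(f"unknown mode: {self.mode}")
```

In this model only writeback ever adds to ``dirty``, which matches the statement that writethrough and passthrough maintain a clean cache.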
105 | |||
106 | A simple cleaner policy is provided, which will clean (write back) all | |
107 | dirty blocks in a cache. It is useful when decommissioning or | |
108 | shrinking a cache. Shrinking the cache's fast device requires all cache | |
109 | blocks, in the area of the cache being removed, to be clean. If the | ||
110 | area being removed from the cache still contains dirty blocks the resize | ||
111 | will fail. Care must be taken to never reduce the volume used for the | ||
112 | cache's fast device until the cache is clean. This is of particular | ||
113 | importance if writeback mode is used. Writethrough and passthrough | ||
114 | modes already maintain a clean cache. Future support to partially clean | ||
115 | the cache, above a specified threshold, will allow for keeping the cache | ||
116 | warm and in writeback mode during resize. | ||
117 | |||
118 | Migration throttling | ||
119 | -------------------- | ||
120 | |||
121 | Migrating data between the origin and cache device uses bandwidth. | ||
122 | The user can set a throttle to prevent more than a certain amount of | ||
123 | migration occurring at any one time. Currently no account is taken | |
124 | of normal I/O traffic going to the devices. More work is needed | |
125 | here to avoid migrating during peak I/O moments. | |
126 | |||
127 | For the time being, a message "migration_threshold <#sectors>" | ||
128 | can be used to set the maximum number of sectors being migrated, | ||
129 | the default being 2048 sectors (1MB). | ||
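
As a rough illustration, the message can be assembled programmatically. This sketch only builds the argument list for ``dmsetup message`` (in the same form as the examples later in this document) and does not run it; the function names and the device name ``my_cache`` are placeholders.

```python
SECTOR_SIZE = 512  # bytes per sector

DEFAULT_MIGRATION_THRESHOLD = 2048  # sectors, the default stated above


def sectors_to_bytes(sectors: int) -> int:
    """Convert a sector count to bytes."""
    return sectors * SECTOR_SIZE


def migration_threshold_cmd(device: str, sectors: int) -> list:
    """Build the argv for setting migration_threshold on `device`.

    The list could be passed to subprocess.run() by a privileged caller;
    this sketch deliberately stops short of executing anything.
    """
    if sectors < 0:
        raise ValueError("sector count must not be negative")
    return ["dmsetup", "message", device, "0",
            "migration_threshold", str(sectors)]
```

Here ``sectors_to_bytes(DEFAULT_MIGRATION_THRESHOLD)`` gives 1048576 bytes, i.e. the 1MB default mentioned above.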
130 | |||
131 | Updating on-disk metadata | ||
132 | ------------------------- | ||
133 | |||
134 | On-disk metadata is committed every time a FLUSH or FUA bio is written. | ||
135 | If no such requests are made then commits will occur every second. This | ||
136 | means the cache behaves like a physical disk that has a volatile write | ||
137 | cache. If power is lost you may lose some recent writes. The metadata | ||
138 | should always be consistent in spite of any crash. | ||
139 | |||
140 | The 'dirty' state for a cache block changes far too frequently for us | ||
141 | to keep updating it on the fly. So we treat it as a hint. In normal | ||
142 | operation it will be written when the dm device is suspended. If the | ||
143 | system crashes, all cache blocks will be assumed dirty when restarted. | |
144 | |||
145 | Per-block policy hints | ||
146 | ---------------------- | ||
147 | |||
148 | Policy plug-ins can store a chunk of data per cache block. It's up to | ||
149 | the policy how big this chunk is, but it should be kept small. Like the | |
150 | dirty flags, this data is lost if there's a crash, so a safe fallback | |
151 | value should always be possible. | |
152 | |||
153 | Policy hints affect performance, not correctness. | ||
154 | |||
155 | Policy messaging | ||
156 | ---------------- | ||
157 | |||
158 | Policies will have different tunables, specific to each one, so we | ||
159 | need a generic way of getting and setting these. Device-mapper | ||
160 | messages are used. Refer to cache-policies.txt. | ||
161 | |||
162 | Discard bitset resolution | ||
163 | ------------------------- | ||
164 | |||
165 | We can avoid copying data during migration if we know the block has | ||
166 | been discarded. A prime example of this is when mkfs discards the | ||
167 | whole block device. We store a bitset tracking the discard state of | ||
168 | blocks. However, we allow this bitset to have a different block size | ||
169 | from the cache blocks. This is because we need to track the discard | ||
170 | state for all of the origin device (compare with the dirty bitset | ||
171 | which is just for the smaller cache device). | ||
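
The arithmetic behind this choice can be sketched as follows. The sizes are illustrative assumptions (the origin size is borrowed from the examples at the end of this document, the discard block size is invented), and the function does not reproduce how the kernel actually picks the discard block size.

```python
def bitset_bits(device_sectors: int, block_sectors: int) -> int:
    """Number of bits needed to track `device_sectors` at a granularity of
    `block_sectors` per bit (rounding up for a partial final block)."""
    if block_sectors <= 0:
        raise ValueError("block size must be positive")
    return -(-device_sectors // block_sectors)  # ceiling division


ORIGIN_SECTORS = 41943040       # origin size used in the examples below
CACHE_BLOCK_SECTORS = 512       # a 256KB cache block
DISCARD_BLOCK_SECTORS = 8192    # hypothetical, coarser discard block

bits_at_cache_granularity = bitset_bits(ORIGIN_SECTORS, CACHE_BLOCK_SECTORS)
bits_at_discard_granularity = bitset_bits(ORIGIN_SECTORS, DISCARD_BLOCK_SECTORS)
```

With these numbers, tracking the whole origin at cache-block granularity would take 81920 bits, while the coarser discard block needs 5120.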
172 | |||
173 | Target interface | ||
174 | ================ | ||
175 | |||
176 | Constructor | ||
177 | ----------- | ||
178 | |||
179 | :: | ||
180 | |||
181 | cache <metadata dev> <cache dev> <origin dev> <block size> | ||
182 | <#feature args> [<feature arg>]* | ||
183 | <policy> <#policy args> [policy args]* | ||
184 | |||
185 | ================ ======================================================= | ||
186 | metadata dev fast device holding the persistent metadata | ||
187 | cache dev fast device holding cached data blocks | ||
188 | origin dev slow device holding original data blocks | ||
189 | block size cache unit size in sectors | ||
190 | |||
191 | #feature args number of feature arguments passed | ||
192 | feature args writethrough or passthrough (The default is writeback.) | ||
193 | |||
194 | policy the replacement policy to use | ||
195 | #policy args an even number of arguments corresponding to | ||
196 | key/value pairs passed to the policy | ||
197 | policy args key/value pairs passed to the policy | ||
198 | E.g. 'sequential_threshold 1024' | ||
199 | See cache-policies.txt for details. | ||
200 | ================ ======================================================= | ||
201 | |||
202 | Optional feature arguments are: | ||
203 | |||
204 | |||
205 | ==================== ======================================================== | ||
206 | writethrough write through caching that prohibits cache block | ||
207 | content from being different from origin block content. | ||
208 | Without this argument, the default behaviour is to write | ||
209 | back cache block contents later for performance reasons, | ||
210 | so they may differ from the corresponding origin blocks. | ||
211 | |||
212 | passthrough a degraded mode useful for various cache coherency | ||
213 | situations (e.g., rolling back snapshots of | ||
214 | underlying storage). Reads and writes always go to | ||
215 | the origin. If a write goes to a cached origin | ||
216 | block, then the cache block is invalidated. | ||
217 | To enable passthrough mode the cache must be clean. | ||
218 | |||
219 | metadata2 use version 2 of the metadata. This stores the dirty | ||
220 | bits in a separate btree, which improves speed of | ||
221 | shutting down the cache. | ||
222 | |||
223 | no_discard_passdown disable passing down discards from the cache | ||
224 | to the origin's data device. | ||
225 | ==================== ======================================================== | ||
226 | |||
227 | A policy called 'default' is always registered. This is an alias for | ||
228 | the policy we currently think gives the best all-round performance. | |
229 | |||
230 | As the default policy could vary between kernels, if you are relying on | ||
231 | the characteristics of a specific policy, always request it by name. | ||
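
The constructor line can be assembled mechanically from its parts. The helper below is a sketch with an invented name and signature; it only formats the table string in the order given above and derives the two counts, without validating devices or policy names.

```python
def cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                block_size, features=(), policy="default", policy_args=None):
    """Format a dm-cache table line following the constructor layout above.

    `features` is a sequence of feature argument strings and `policy_args`
    a mapping of key/value pairs; the <#feature args> and <#policy args>
    counts are derived from them, so the latter is always even.
    """
    policy_args = dict(policy_args or {})
    fields = [start, length, "cache", metadata_dev, cache_dev, origin_dev,
              block_size, len(features), *features,
              policy, 2 * len(policy_args)]
    for key, value in policy_args.items():
        fields += [key, value]
    return " ".join(str(f) for f in fields)


table = cache_table(0, 41943040, "/dev/mapper/metadata", "/dev/mapper/ssd",
                    "/dev/mapper/origin", 1024, ["writeback"], "mq",
                    {"sequential_threshold": 1024, "random_threshold": 8})
```

The resulting string is what would be passed to ``dmsetup create <name> --table``.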
232 | |||
233 | Status | ||
234 | ------ | ||
235 | |||
236 | :: | ||
237 | |||
238 | <metadata block size> <#used metadata blocks>/<#total metadata blocks> | ||
239 | <cache block size> <#used cache blocks>/<#total cache blocks> | ||
240 | <#read hits> <#read misses> <#write hits> <#write misses> | ||
241 | <#demotions> <#promotions> <#dirty> <#features> <features>* | ||
242 | <#core args> <core args>* <policy name> <#policy args> <policy args>* | ||
243 | <cache metadata mode> | ||
244 | |||
245 | |||
246 | ========================= ===================================================== | ||
247 | metadata block size Fixed block size for each metadata block in | ||
248 | sectors | ||
249 | #used metadata blocks Number of metadata blocks used | ||
250 | #total metadata blocks Total number of metadata blocks | ||
251 | cache block size Configurable block size for the cache device | ||
252 | in sectors | ||
253 | #used cache blocks Number of blocks resident in the cache | ||
254 | #total cache blocks Total number of cache blocks | ||
255 | #read hits Number of times a READ bio has been mapped | ||
256 | to the cache | ||
257 | #read misses Number of times a READ bio has been mapped | ||
258 | to the origin | ||
259 | #write hits Number of times a WRITE bio has been mapped | ||
260 | to the cache | ||
261 | #write misses Number of times a WRITE bio has been | ||
262 | mapped to the origin | ||
263 | #demotions Number of times a block has been removed | ||
264 | from the cache | ||
265 | #promotions Number of times a block has been moved to | ||
266 | the cache | ||
267 | #dirty Number of blocks in the cache that differ | ||
268 | from the origin | ||
269 | #feature args Number of feature args to follow | ||
270 | feature args 'writethrough' (optional) | ||
271 | #core args Number of core arguments (must be even) | ||
272 | core args Key/value pairs for tuning the core | ||
273 | e.g. migration_threshold | ||
274 | policy name Name of the policy | ||
275 | #policy args Number of policy arguments to follow (must be even) | ||
276 | policy args Key/value pairs e.g. sequential_threshold | ||
277 | cache metadata mode ro if read-only, rw if read-write | ||
278 | |||
279 | In serious cases where even a read-only mode is | ||
280 | deemed unsafe no further I/O will be permitted and | ||
281 | the status will just contain the string 'Fail'. | ||
282 | The userspace recovery tools should then be used. | ||
283 | needs_check 'needs_check' if set, '-' if not set | ||
284 | A metadata operation has failed, resulting in the | ||
285 | needs_check flag being set in the metadata's | ||
286 | superblock. The metadata device must be | ||
287 | deactivated and checked/repaired before the | ||
288 | cache can be made fully operational again. | ||
289 | '-' indicates needs_check is not set. | ||
290 | ========================= ===================================================== | ||
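
The status fields are positional, with three counted variable-length groups, so a small parser helps when scripting. This is a sketch under assumptions: the function name is invented, the sample line is made up rather than captured from a real device, and the input is taken to be only the cache-specific fields (whatever ``dmsetup status`` prints before them is assumed to have been stripped).

```python
def parse_cache_status(fields: str) -> dict:
    """Parse the cache-specific status fields described above into a dict."""
    tokens = fields.split()
    if tokens == ["Fail"]:
        return {"failed": True}

    pos = 0

    def take(count: int) -> list:
        """Consume and return the next `count` tokens."""
        nonlocal pos
        chunk = tokens[pos:pos + count]
        if len(chunk) != count:
            raise ValueError("status line is shorter than expected")
        pos += count
        return chunk

    def take_ratio() -> tuple:
        """Consume one '<used>/<total>' token."""
        used, total = take(1)[0].split("/")
        return int(used), int(total)

    def take_pairs() -> dict:
        """Consume a count followed by that many key/value tokens."""
        items = take(int(take(1)[0]))
        return dict(zip(items[0::2], items[1::2]))

    status = {"failed": False}
    status["metadata_block_size"] = int(take(1)[0])
    status["metadata_used"], status["metadata_total"] = take_ratio()
    status["cache_block_size"] = int(take(1)[0])
    status["cache_used"], status["cache_total"] = take_ratio()
    for name in ("read_hits", "read_misses", "write_hits", "write_misses",
                 "demotions", "promotions", "dirty"):
        status[name] = int(take(1)[0])
    status["features"] = take(int(take(1)[0]))
    status["core_args"] = take_pairs()
    status["policy_name"] = take(1)[0]
    status["policy_args"] = take_pairs()
    status["metadata_mode"] = take(1)[0]
    rest = tokens[pos:]  # needs_check field, if present
    status["needs_check"] = bool(rest) and rest[0] == "needs_check"
    return status


# Made-up sample, for illustration only.
sample = ("8 72/4096 512 100/1000 10 20 30 40 5 6 7 "
          "1 writethrough 2 migration_threshold 2048 mq 0 rw -")
status = parse_cache_status(sample)
```

Values inside the key/value groups are left as strings, since their meaning depends on the policy in use.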
291 | |||
292 | Messages | ||
293 | -------- | ||
294 | |||
295 | Policies will have different tunables, specific to each one, so we | ||
296 | need a generic way of getting and setting these. Device-mapper | ||
297 | messages are used. (A sysfs interface would also be possible.) | ||
298 | |||
299 | The message format is:: | ||
300 | |||
301 | <key> <value> | ||
302 | |||
303 | E.g.:: | ||
304 | |||
305 | dmsetup message my_cache 0 sequential_threshold 1024 | ||
306 | |||
307 | |||
308 | Invalidation is removing an entry from the cache without writing it | ||
309 | back. Cache blocks can be invalidated via the invalidate_cblocks | ||
310 | message, which takes an arbitrary number of cblock ranges. Each cblock | ||
311 | range's end value is "one past the end", meaning 5-10 expresses a range | ||
312 | of values from 5 to 9. Each cblock must be expressed as a decimal | ||
313 | value; in the future, a variant message that takes cblock ranges | |
314 | expressed in hexadecimal may be needed to better support efficient | ||
315 | invalidation of larger caches. The cache must be in passthrough mode | ||
316 | when invalidate_cblocks is used:: | ||
317 | |||
318 | invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]* | ||
319 | |||
320 | E.g.:: | ||
321 | |||
322 | dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789 | ||
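
Because the end of each range is exclusive, it is easy to be off by one when generating these arguments. The following sketch, with an invented function name, expands arguments in the format above into the individual cblocks they cover; it mirrors the description here rather than the kernel's own parser.

```python
def expand_cblock_args(args):
    """Expand invalidate_cblocks arguments into a list of cblock numbers.

    Each argument is either a single decimal cblock ("2345") or a range
    "<begin>-<end>" where <end> is one past the last cblock covered.
    """
    cblocks = []
    for arg in args:
        if "-" in arg:
            begin, end = (int(part) for part in arg.split("-"))
            cblocks.extend(range(begin, end))  # half-open: end is excluded
        else:
            cblocks.append(int(arg))
    return cblocks
```

For example, ``expand_cblock_args(["5-10"])`` yields cblocks 5 through 9, matching the description above.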
323 | |||
324 | Examples | ||
325 | ======== | ||
326 | |||
327 | The test suite can be found here: | ||
328 | |||
329 | https://github.com/jthornber/device-mapper-test-suite | ||
330 | |||
331 | :: | ||
332 | |||
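333 | # Two alternative tables for the same device name; use one or the other. | |
334 | # Double quotes let the shell treat the trailing backslash as a | |
335 | # line continuation (inside single quotes it would be kept literally). | |
336 | dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \ | |
337 | /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0" | |
338 | dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \ | |
339 | /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ | |
340 | mq 4 sequential_threshold 1024 random_threshold 8" | |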