Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/device-mapper/cache.txt | 243 |
1 files changed, 243 insertions, 0 deletions
diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt
new file mode 100644
index 000000000000..f50470abe241
--- /dev/null
+++ b/Documentation/device-mapper/cache.txt
@@ -0,0 +1,243 @@ | |||
1 | Introduction | ||
2 | ============ | ||
3 | |||
4 | dm-cache is a device mapper target written by Joe Thornber, Heinz | ||
5 | Mauelshagen, and Mike Snitzer. | ||
6 | |||
7 | It aims to improve performance of a block device (e.g., a spindle) by | ||
8 | dynamically migrating some of its data to a faster, smaller device | ||
9 | (e.g., an SSD). | ||
10 | |||
11 | This device-mapper solution allows us to insert this caching at | ||
12 | different levels of the dm stack, for instance above the data device for | ||
13 | a thin-provisioning pool. Caching solutions that are integrated more | ||
14 | closely with the virtual memory system should give better performance. | ||
15 | |||
16 | The target reuses the metadata library used by the thin-provisioning | ||
17 | target. | ||
18 | |||
19 | The decision as to what data to migrate and when is left to a plug-in | ||
20 | policy module. Several of these have been written as we experiment, | ||
21 | and we hope other people will contribute more for specific I/O | ||
22 | scenarios (e.g., a VM image server). | ||
23 | |||
24 | Glossary | ||
25 | ======== | ||
26 | |||
27 | Migration - Movement of the primary copy of a logical block from one | ||
28 | device to the other. | ||
29 | Promotion - Migration from slow device to fast device. | ||
30 | Demotion - Migration from fast device to slow device. | ||
31 | |||
32 | The origin device always contains a copy of the logical block, which | ||
33 | may be out of date or kept in sync with the copy on the cache device | ||
34 | (depending on policy). | ||
35 | |||
36 | Design | ||
37 | ====== | ||
38 | |||
39 | Sub-devices | ||
40 | ----------- | ||
41 | |||
42 | The target is constructed by passing three devices to it (along with | ||
43 | other parameters detailed later): | ||
44 | |||
45 | 1. An origin device - the big, slow one. | ||
46 | |||
47 | 2. A cache device - the small, fast one. | ||
48 | |||
49 | 3. A small metadata device - records which blocks are in the cache, | ||
50 | which are dirty, and extra hints for use by the policy object. | ||
51 | This information could be put on the cache device, but having it | ||
52 | separate allows the volume manager to configure it differently, | ||
53 | e.g. as a mirror for extra robustness. | ||
54 | |||
55 | Fixed block size | ||
56 | ---------------- | ||
57 | |||
58 | The origin is divided up into blocks of a fixed size. This block size | ||
59 | is configurable when you first create the cache. Typically we've been | ||
60 | using block sizes of 256k - 1024k. | ||
61 | |||
62 | Having a fixed block size simplifies the target a lot. But it is | ||
63 | something of a compromise. For instance, a small part of a block may be | ||
64 | getting hit a lot, yet the whole block will be promoted to the cache. | ||
65 | So large block sizes are bad because they waste cache space. And small | ||
66 | block sizes are bad because they increase the amount of metadata (both | ||
67 | in core and on disk). | ||
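To make the trade-off concrete, here is a rough sketch in Python (not part
of the kernel sources) that counts how many blocks a cache device divides
into for a few block sizes. The 64 GiB cache size and the helper name are
assumptions chosen for illustration; the only point is that halving the
block size doubles the number of blocks the metadata has to track.

```python
SECTOR_SIZE = 512  # bytes per sector, the unit used in device-mapper tables


def cache_block_count(cache_bytes, block_bytes):
    """Return how many whole blocks of block_bytes fit in cache_bytes."""
    if block_bytes <= 0 or block_bytes % SECTOR_SIZE != 0:
        raise ValueError("block size must be a positive multiple of 512 bytes")
    return cache_bytes // block_bytes


if __name__ == "__main__":
    cache_bytes = 64 * 1024**3  # hypothetical 64 GiB cache device
    for block_kib in (256, 512, 1024):
        blocks = cache_block_count(cache_bytes, block_kib * 1024)
        print(f"{block_kib:>5} KiB blocks -> {blocks} blocks to track")
```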
68 | |||
69 | Writeback/writethrough | ||
70 | ---------------------- | ||
71 | |||
72 | The cache has two modes, writeback and writethrough. | ||
73 | |||
74 | If writeback, the default, is selected then a write to a block that is | ||
75 | cached will go only to the cache and the block will be marked dirty in | ||
76 | the metadata. | ||
77 | |||
78 | If writethrough is selected then a write to a cached block will not | ||
79 | complete until it has hit both the origin and cache devices. Clean | ||
80 | blocks should remain clean. | ||
81 | |||
82 | A simple cleaner policy is provided, which will clean (write back) all | ||
83 | dirty blocks in a cache. This is useful for decommissioning a cache. | ||
84 | |||
85 | Migration throttling | ||
86 | -------------------- | ||
87 | |||
88 | Migrating data between the origin and cache device uses bandwidth. | ||
89 | The user can set a throttle to prevent more than a certain amount of | ||
90 | migration occurring at any one time. Currently we're not taking any | ||
91 | account of normal I/O traffic going to the devices. More work needs | ||
92 | doing here to avoid migrating during those peak I/O moments. | ||
93 | |||
94 | For the time being, a message "migration_threshold <#sectors>" | ||
95 | can be used to set the maximum number of sectors being migrated, | ||
96 | the default being 204800 sectors (or 100MB). | ||
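The sector arithmetic behind that default can be checked with a small
Python sketch. It assumes 512-byte sectors; the helper names are made up
for this example and do not exist in any dm tooling.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed)
MIB = 1024 * 1024


def sectors_to_mib(sectors):
    """Convert a sector count to MiB."""
    return sectors * SECTOR_SIZE / MIB


def mib_to_sectors(mib):
    """Convert a size in whole MiB to a sector count."""
    return mib * MIB // SECTOR_SIZE


def migration_threshold_message(sectors):
    """Build the message text in the 'migration_threshold <#sectors>' form."""
    if sectors < 0:
        raise ValueError("sector count cannot be negative")
    return f"migration_threshold {sectors}"


if __name__ == "__main__":
    print(sectors_to_mib(204800))              # the documented default
    print(migration_threshold_message(mib_to_sectors(50)))
```

The resulting text would be passed to 'dmsetup message', in the same way as
the policy message shown in the Messages section.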
97 | |||
98 | Updating on-disk metadata | ||
99 | ------------------------- | ||
100 | |||
101 | On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is | ||
102 | written. If no such requests are made then commits will occur every | ||
103 | second. This means the cache behaves like a physical disk that has a | ||
104 | write cache (the same is true of the thin-provisioning target). If | ||
105 | power is lost you may lose some recent writes. The metadata should | ||
106 | always be consistent in spite of any crash. | ||
107 | |||
108 | The 'dirty' state for a cache block changes far too frequently for us | ||
109 | to keep updating it on the fly. So we treat it as a hint. In normal | ||
110 | operation it will be written when the dm device is suspended. If the | ||
111 | system crashes all cache blocks will be assumed dirty when restarted. | ||
112 | |||
113 | Per-block policy hints | ||
114 | ---------------------- | ||
115 | |||
116 | Policy plug-ins can store a chunk of data per cache block. It's up to | ||
117 | the policy how big this chunk is, but it should be kept small. Like the | ||
118 | dirty flags, this data is lost if there's a crash, so a safe fallback | ||
119 | value should always be possible. | ||
120 | |||
121 | For instance, the 'mq' policy, which is currently the default policy, | ||
122 | uses this facility to store the hit count of the cache blocks. If | ||
123 | there's a crash this information will be lost, which means the cache | ||
124 | may be less efficient until those hit counts are regenerated. | ||
125 | |||
126 | Policy hints affect performance, not correctness. | ||
127 | |||
128 | Policy messaging | ||
129 | ---------------- | ||
130 | |||
131 | Policies will have different tunables, specific to each one, so we | ||
132 | need a generic way of getting and setting these. Device-mapper | ||
133 | messages are used. Refer to cache-policies.txt. | ||
134 | |||
135 | Discard bitset resolution | ||
136 | ------------------------- | ||
137 | |||
138 | We can avoid copying data during migration if we know the block has | ||
139 | been discarded. A prime example of this is when mkfs discards the | ||
140 | whole block device. We store a bitset tracking the discard state of | ||
141 | blocks. However, we allow this bitset to have a different block size | ||
142 | from the cache blocks. This is because we need to track the discard | ||
143 | state for all of the origin device (compare with the dirty bitset | ||
144 | which is just for the smaller cache device). | ||
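A quick Python sketch shows why a coarser discard block size helps: the
discard bitset needs one bit per block across the whole origin, so a larger
discard block shrinks it. The device size below is the 41943040-sector
length used in the examples at the end of this document; the discard block
size of 4096 sectors is purely an assumption, since this text does not say
how that size is chosen.

```python
def bitset_entries(device_sectors, block_sectors):
    """One bit per block, rounding up so a partial tail block is covered."""
    if block_sectors <= 0:
        raise ValueError("block size must be positive")
    return -(-device_sectors // block_sectors)  # ceiling division


if __name__ == "__main__":
    origin_sectors = 41943040          # origin size from the examples
    cache_block_sectors = 512          # cache block size from the examples
    discard_block_sectors = 4096       # hypothetical coarser discard block

    print(bitset_entries(origin_sectors, cache_block_sectors))
    print(bitset_entries(origin_sectors, discard_block_sectors))
```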
145 | |||
146 | Target interface | ||
147 | ================ | ||
148 | |||
149 | Constructor | ||
150 | ----------- | ||
151 | |||
152 | cache <metadata dev> <cache dev> <origin dev> <block size> | ||
153 | <#feature args> [<feature arg>]* | ||
154 | <policy> <#policy args> [policy args]* | ||
155 | |||
156 | metadata dev : fast device holding the persistent metadata | ||
157 | cache dev : fast device holding cached data blocks | ||
158 | origin dev : slow device holding original data blocks | ||
159 | block size : cache unit size in sectors | ||
160 | |||
161 | #feature args : number of feature arguments passed | ||
162 | feature args : writethrough. (The default is writeback.) | ||
163 | |||
164 | policy : the replacement policy to use | ||
165 | #policy args : an even number of arguments corresponding to | ||
166 | key/value pairs passed to the policy | ||
167 | policy args : key/value pairs passed to the policy | ||
168 | E.g. 'sequential_threshold 1024' | ||
169 | See cache-policies.txt for details. | ||
170 | |||
171 | Optional feature arguments are: | ||
172 | writethrough : write through caching that prohibits cache block | ||
173 | content from being different from origin block content. | ||
174 | Without this argument, the default behaviour is to write | ||
175 | back cache block contents later for performance reasons, | ||
176 | so they may differ from the corresponding origin blocks. | ||
177 | |||
178 | A policy called 'default' is always registered. This is an alias for | ||
179 | the policy we currently think is giving best all round performance. | ||
180 | |||
181 | As the default policy could vary between kernels, if you are relying on | ||
182 | the characteristics of a specific policy, always request it by name. | ||
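The constructor layout above can be sketched as a small Python helper that
assembles a table line for 'dmsetup create --table'. This is an
illustration only: the function name and its keyword arguments are
invented here, and the leading '<start> <length> cache' fields follow the
table lines shown in the Examples section.

```python
def cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                block_size, features=(), policy="default", policy_args=None):
    """Assemble a dm-cache table line from its constructor arguments."""
    # Policy arguments are key/value pairs, so the count is always even.
    flat_policy_args = []
    for key, value in dict(policy_args or {}).items():
        flat_policy_args += [str(key), str(value)]

    fields = [
        start, length, "cache",
        metadata_dev, cache_dev, origin_dev, block_size,
        len(features), *features,
        policy, len(flat_policy_args), *flat_policy_args,
    ]
    return " ".join(str(field) for field in fields)


if __name__ == "__main__":
    print(cache_table(0, 41943040, "/dev/mapper/metadata", "/dev/mapper/ssd",
                      "/dev/mapper/origin", 1024, ["writeback"], "mq",
                      {"sequential_threshold": 1024, "random_threshold": 8}))
```

Counting the feature and policy arguments in code avoids the easy mistake
of editing the argument list by hand and forgetting to update the count.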
183 | |||
184 | Status | ||
185 | ------ | ||
186 | |||
187 | <#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses> | ||
188 | <#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache> | ||
189 | <#dirty> <#features> <features>* <#core args> <core args>* <#policy args> | ||
190 | <policy args>* | ||
191 | |||
192 | #used metadata blocks : Number of metadata blocks used | ||
193 | #total metadata blocks : Total number of metadata blocks | ||
194 | #read hits : Number of times a READ bio has been mapped | ||
195 | to the cache | ||
196 | #read misses : Number of times a READ bio has been mapped | ||
197 | to the origin | ||
198 | #write hits : Number of times a WRITE bio has been mapped | ||
199 | to the cache | ||
200 | #write misses : Number of times a WRITE bio has been | ||
201 | mapped to the origin | ||
202 | #demotions : Number of times a block has been removed | ||
203 | from the cache | ||
204 | #promotions : Number of times a block has been moved to | ||
205 | the cache | ||
206 | #blocks in cache : Number of blocks resident in the cache | ||
207 | #dirty : Number of blocks in the cache that differ | ||
208 | from the origin | ||
209 | #feature args : Number of feature args to follow | ||
210 | feature args : 'writethrough' (optional) | ||
211 | #core args : Number of core arguments (must be even) | ||
212 | core args : Key/value pairs for tuning the core | ||
213 | e.g. migration_threshold | ||
214 | #policy args : Number of policy arguments to follow (must be even) | ||
215 | policy args : Key/value pairs | ||
216 | e.g. 'sequential_threshold 1024' | ||
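As a reading aid, the status layout above can be parsed with a short
Python sketch. It assumes the caller has already stripped the leading
'<start> <length> cache' fields that 'dmsetup status' prints before the
target-specific part. The function name is invented, and the sample line
in the usage section is made up to match the layout, not captured from a
real device.

```python
COUNTERS = ("read_hits", "read_misses", "write_hits", "write_misses",
            "demotions", "promotions", "blocks_in_cache", "dirty")


def parse_cache_status(params):
    """Split the target-specific part of a dm-cache status line into a dict."""
    tokens = params.split()
    used, total = (int(n) for n in tokens[0].split("/"))
    status = {"used_metadata_blocks": used, "total_metadata_blocks": total}
    for name, token in zip(COUNTERS, tokens[1:9]):
        status[name] = int(token)

    def take_counted(pos):
        # Each variable-length group is a count followed by that many tokens.
        count = int(tokens[pos])
        items = tokens[pos + 1:pos + 1 + count]
        if len(items) != count:
            raise ValueError("status line is truncated")
        return items, pos + 1 + count

    status["features"], pos = take_counted(9)
    core, pos = take_counted(pos)
    policy, pos = take_counted(pos)
    # Core and policy arguments are key/value pairs; values stay as strings.
    status["core_args"] = dict(zip(core[0::2], core[1::2]))
    status["policy_args"] = dict(zip(policy[0::2], policy[1::2]))
    return status


if __name__ == "__main__":
    sample = ("23/4096 1500 300 900 120 5 40 35 7 0 "
              "2 migration_threshold 204800 "
              "4 sequential_threshold 1024 random_threshold 8")
    print(parse_cache_status(sample))
```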
217 | |||
218 | Messages | ||
219 | -------- | ||
220 | |||
221 | Policies will have different tunables, specific to each one, so we | ||
222 | need a generic way of getting and setting these. Device-mapper | ||
223 | messages are used. (A sysfs interface would also be possible.) | ||
224 | |||
225 | The message format is: | ||
226 | |||
227 | <key> <value> | ||
228 | |||
229 | E.g. | ||
230 | dmsetup message my_cache 0 sequential_threshold 1024 | ||
231 | |||
232 | Examples | ||
233 | ======== | ||
234 | |||
235 | The test suite can be found here: | ||
236 | |||
237 | https://github.com/jthornber/thinp-test-suite | ||
238 | |||
239 | dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \ | ||
240 | /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0" | ||
241 | dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \ | ||
242 | /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ | ||
243 | mq 4 sequential_threshold 1024 random_threshold 8" | ||