author | Martin K. Petersen <martin.petersen@oracle.com> | 2008-06-17 12:59:57 -0400 |
---|---|---|
committer | Jens Axboe <jens.axboe@oracle.com> | 2008-07-03 07:21:13 -0400 |
commit | c1c72b59941e2f5aad4b02609d7ee7b121734b8d (patch) | |
tree | a57d51e43ecefa11d183ff9b27b661c900e46af6 /Documentation/block/data-integrity.txt | |
parent | 7ba1ba12eeef0aa7113beb16410ef8b7c748e18b (diff) |
block: Data integrity infrastructure documentation
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Diffstat (limited to 'Documentation/block/data-integrity.txt')
-rw-r--r-- | Documentation/block/data-integrity.txt | 327 |
1 files changed, 327 insertions, 0 deletions
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
new file mode 100644
index 000000000000..e9dc8d86adc7
--- /dev/null
+++ b/Documentation/block/data-integrity.txt
@@ -0,0 +1,327 @@
1 | ---------------------------------------------------------------------- | ||
2 | 1. INTRODUCTION | ||
3 | |||
4 | Modern filesystems feature checksumming of data and metadata to | ||
5 | protect against data corruption. However, the detection of the | ||
6 | corruption is done at read time which could potentially be months | ||
7 | after the data was written. At that point the original data that the | ||
8 | application tried to write is most likely lost. | ||
9 | |||
10 | The solution is to ensure that the disk is actually storing what the | ||
11 | application meant it to. Recent additions to both the SCSI family | ||
12 | protocols (SBC Data Integrity Field, SCC protection proposal) as well | ||
13 | as SATA/T13 (External Path Protection) try to remedy this by adding | ||
14 | support for appending integrity metadata to an I/O. The integrity | ||
15 | metadata (or protection information in SCSI terminology) includes a | ||
16 | checksum for each sector as well as an incrementing counter that | ||
17 | ensures the individual sectors are written in the right order and, | |
18 | for some protection schemes, that the I/O is written to the right | |
19 | place on disk. | |
20 | |||
21 | Current storage controllers and devices implement various protective | ||
22 | measures, for instance checksumming and scrubbing. But these | ||
23 | technologies are working in their own isolated domains or at best | ||
24 | between adjacent nodes in the I/O path. The interesting thing about | ||
25 | DIF and the other integrity extensions is that the protection format | ||
26 | is well defined and every node in the I/O path can verify the | ||
27 | integrity of the I/O and reject it if corruption is detected. This | ||
28 | allows not only corruption prevention but also isolation of the point | ||
29 | of failure. | ||
30 | |||
31 | ---------------------------------------------------------------------- | ||
32 | 2. THE DATA INTEGRITY EXTENSIONS | ||
33 | |||
34 | As written, the protocol extensions only protect the path between | ||
35 | controller and storage device. However, many controllers actually | ||
36 | allow the operating system to interact with the integrity metadata | ||
37 | (IMD). We have been working with several FC/SAS HBA vendors to enable | ||
38 | the protection information to be transferred to and from their | ||
39 | controllers. | ||
40 | |||
41 | The SCSI Data Integrity Field works by appending 8 bytes of protection | ||
42 | information to each sector. The data + integrity metadata is stored | ||
43 | in 520 byte sectors on disk. Data + IMD are interleaved when | ||
44 | transferred between the controller and target. The T13 proposal is | ||
45 | similar. | ||
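As a rough illustration of the layout described above, the following userspace sketch models the 8-byte protection tuple and interleaves separate data and metadata buffers into 520-byte sectors, roughly what a controller would do on write. The field split (16-bit guard tag, 16-bit application tag, 32-bit reference tag) follows the T10 DIF format. The names dif_tuple and interleave_sectors are invented for this example and are not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define TUPLE_SIZE  8

/* T10 DIF tuple as it appears on the wire; all fields are big-endian. */
struct dif_tuple {
	uint8_t guard_tag[2];	/* checksum of the sector data */
	uint8_t app_tag[2];	/* opaque application tag */
	uint8_t ref_tag[4];	/* reference tag (incrementing counter) */
};

/*
 * Interleave nsect data sectors with nsect tuples.  'out' must hold
 * nsect * (SECTOR_SIZE + TUPLE_SIZE) bytes.  Hypothetical helper that
 * only models the on-disk layout.
 */
static void interleave_sectors(const uint8_t *data, const uint8_t *imd,
			       uint8_t *out, size_t nsect)
{
	assert(data != NULL && imd != NULL && out != NULL);

	for (size_t i = 0; i < nsect; i++) {
		memcpy(out, data + i * SECTOR_SIZE, SECTOR_SIZE);
		memcpy(out + SECTOR_SIZE, imd + i * TUPLE_SIZE, TUPLE_SIZE);
		out += SECTOR_SIZE + TUPLE_SIZE;
	}
}
```

Splitting the buffers on read would be the same loop with the memcpy() source and destination swapped.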
46 | |||
47 | Because it is highly inconvenient for operating systems to deal with | ||
48 | 520 (and 4104) byte sectors, we approached several HBA vendors and | ||
49 | encouraged them to allow separation of the data and integrity metadata | ||
50 | scatter-gather lists. | ||
51 | |||
52 | The controller will interleave the buffers on write and split them on | ||
53 | read. This means that Linux can DMA the data buffers to and from | |
54 | host memory without changes to the page cache. | ||
55 | |||
56 | Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs | ||
57 | is somewhat heavy to compute in software. Benchmarks found that | ||
58 | calculating this checksum had a significant impact on system | ||
59 | performance for a number of workloads. Some controllers allow a | ||
60 | lighter-weight checksum to be used when interfacing with the operating | ||
61 | system. Emulex, for instance, supports the TCP/IP checksum instead. | ||
62 | The IP checksum received from the OS is converted to the 16-bit CRC | ||
63 | when writing and vice versa. This allows the integrity metadata to be | ||
64 | generated by Linux or the application at very low cost (comparable to | ||
65 | software RAID5). | ||
66 | |||
67 | The IP checksum is weaker than the CRC in terms of detecting bit | ||
68 | errors. However, the strength is really in the separation of the data | ||
69 | buffers and the integrity metadata. These two distinct buffers must | |
70 | match up for an I/O to complete. | ||
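To make the cost difference between the two checksums discussed above more concrete, here is a minimal userspace sketch of both. It assumes the T10 DIF CRC parameters (polynomial 0x8BB7, initial value 0, no reflection) and the RFC 1071 Internet checksum. The function names are made up for this example, and the bitwise CRC loop is chosen for clarity, not speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 using the T10 DIF polynomial 0x8BB7, initial value 0. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x8BB7);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}

/*
 * 16-bit one's complement Internet checksum (RFC 1071).  'len' must be
 * even; intended for sector-sized buffers, so the 32-bit accumulator
 * cannot overflow.
 */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += (uint32_t)(buf[i] << 8 | buf[i + 1]);
	while (sum >> 16)		/* fold the carries back in */
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The CRC needs eight shift/xor steps per input byte here (or a lookup table in a faster variant), while the IP checksum is one addition per 16-bit word, which is why it is so much cheaper to compute in software.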
71 | |||
72 | The separation of the data and integrity metadata buffers as well as | ||
73 | the choice in checksums is referred to as the Data Integrity | ||
74 | Extensions. As these extensions are outside the scope of the protocol | ||
75 | bodies (T10, T13), Oracle and its partners are trying to standardize | ||
76 | them within the Storage Networking Industry Association. | ||
77 | |||
78 | ---------------------------------------------------------------------- | ||
79 | 3. KERNEL CHANGES | ||
80 | |||
81 | The data integrity framework in Linux enables protection information | ||
82 | to be pinned to I/Os and sent to/received from controllers that | ||
83 | support it. | ||
84 | |||
85 | The advantage to the integrity extensions in SCSI and SATA is that | ||
86 | they enable us to protect the entire path from application to storage | ||
87 | device. However, at the same time this is also the biggest | ||
88 | disadvantage. It means that the protection information must be in a | ||
89 | format that can be understood by the disk. | ||
90 | |||
91 | Generally Linux/POSIX applications are agnostic to the intricacies of | ||
92 | the storage devices they are accessing. The virtual filesystem switch | ||
93 | and the block layer make things like hardware sector size and | ||
94 | transport protocols completely transparent to the application. | ||
95 | |||
96 | However, this level of detail is required when preparing the | ||
97 | protection information to send to a disk. Consequently, the very | ||
98 | concept of an end-to-end protection scheme is a layering violation. | ||
99 | It is completely unreasonable for an application to be aware whether | ||
100 | it is accessing a SCSI or SATA disk. | ||
101 | |||
102 | The data integrity support implemented in Linux attempts to hide this | ||
103 | from the application. As far as the application (and to some extent | ||
104 | the kernel) is concerned, the integrity metadata is opaque information | ||
105 | that's attached to the I/O. | ||
106 | |||
107 | The current implementation allows the block layer to automatically | ||
108 | generate the protection information for any I/O. Eventually the | ||
109 | intent is to move the integrity metadata calculation to userspace for | ||
110 | user data. Metadata and other I/O that originates within the kernel | ||
111 | will still use the automatic generation interface. | ||
112 | |||
113 | Some storage devices allow each hardware sector to be tagged with a | ||
114 | 16-bit value. The owner of this tag space is the owner of the block | ||
115 | device, i.e. the filesystem in most cases. The filesystem can use | |
116 | this extra space to tag sectors as it sees fit. Because the tag | |
117 | space is limited, the block interface allows tagging bigger chunks by | ||
118 | way of interleaving. This way, 8*16 bits of information can be | ||
119 | attached to a typical 4KB filesystem block. | ||
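The interleaving idea can be sketched in userspace as follows: a 16-byte tag buffer is spread over the 2-byte application tag fields of the eight tuples that cover a 4KB block of 512-byte sectors. The tuple layout (application tag at byte offset 2) assumes T10 DIF. set_tags() and get_tags() are hypothetical helpers, not the kernel's callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE     8	/* bytes of protection information per sector */
#define APP_TAG_OFFSET 2	/* application tag follows the 2-byte guard tag */
#define TAG_SIZE       2	/* tag bytes available per sector */

/* Scatter tag_buf across the application tag fields of nsect tuples. */
static void set_tags(uint8_t *imd, size_t nsect,
		     const uint8_t *tag_buf, size_t len)
{
	assert(len <= nsect * TAG_SIZE);
	for (size_t i = 0; i < len; i++) {
		size_t off = (i / TAG_SIZE) * TUPLE_SIZE + APP_TAG_OFFSET +
			     i % TAG_SIZE;
		imd[off] = tag_buf[i];
	}
}

/* Gather the application tag fields of nsect tuples back into tag_buf. */
static void get_tags(const uint8_t *imd, size_t nsect,
		     uint8_t *tag_buf, size_t len)
{
	assert(len <= nsect * TAG_SIZE);
	for (size_t i = 0; i < len; i++) {
		size_t off = (i / TAG_SIZE) * TUPLE_SIZE + APP_TAG_OFFSET +
			     i % TAG_SIZE;
		tag_buf[i] = imd[off];
	}
}
```

Only the application tag bytes are touched; the guard and reference tags in each tuple are left as they were.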
120 | |||
121 | This also means that applications such as fsck and mkfs will need | ||
122 | access to manipulate the tags from user space. A passthrough | ||
123 | interface for this is being worked on. | ||
124 | |||
125 | |||
126 | ---------------------------------------------------------------------- | ||
127 | 4. BLOCK LAYER IMPLEMENTATION DETAILS | ||
128 | |||
129 | 4.1 BIO | ||
130 | |||
131 | The data integrity patches add a new field to struct bio when | ||
132 | CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer | ||
133 | to a struct bip which contains the bio integrity payload. Essentially | ||
134 | a bip is a trimmed down struct bio which holds a bio_vec containing | ||
135 | the integrity metadata and the required housekeeping information (bvec | ||
136 | pool, vector count, etc.). | |
137 | |||
138 | A kernel subsystem can enable data integrity protection on a bio by | ||
139 | calling bio_integrity_alloc(bio). This will allocate and attach the | ||
140 | bip to the bio. | ||
141 | |||
142 | Individual pages containing integrity metadata can subsequently be | ||
143 | attached using bio_integrity_add_page(). | ||
144 | |||
145 | bio_free() will automatically free the bip. | ||
146 | |||
147 | |||
148 | 4.2 BLOCK DEVICE | ||
149 | |||
150 | Because the format of the protection data is tied to the physical | ||
151 | disk, each block device has been extended with a block integrity | ||
152 | profile (struct blk_integrity). This optional profile is registered | ||
153 | with the block layer using blk_integrity_register(). | ||
154 | |||
155 | The profile contains callback functions for generating and verifying | ||
156 | the protection data, as well as getting and setting application tags. | ||
157 | The profile also contains a few constants to aid in completing, | ||
158 | merging and splitting the integrity metadata. | ||
159 | |||
160 | Layered block devices will need to pick a profile that's appropriate | ||
161 | for all subdevices. blk_integrity_compare() can help with that. DM | ||
162 | and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6 | ||
163 | will require extra work due to the application tag. | ||
164 | |||
165 | |||
166 | ---------------------------------------------------------------------- | ||
167 | 5.0 BLOCK LAYER INTEGRITY API | ||
168 | |||
169 | 5.1 NORMAL FILESYSTEM | ||
170 | |||
171 | The normal filesystem is unaware that the underlying block device | ||
172 | is capable of sending/receiving integrity metadata. The IMD will | ||
173 | be automatically generated by the block layer at submit_bio() time | ||
174 | in case of a WRITE. A READ request will cause the I/O integrity | ||
175 | to be verified upon completion. | ||
176 | |||
177 | IMD generation and verification can be toggled using the | ||
178 | |||
179 | /sys/block/<bdev>/integrity/write_generate | ||
180 | |||
181 | and | ||
182 | |||
183 | /sys/block/<bdev>/integrity/read_verify | ||
184 | |||
185 | flags. | ||
186 | |||
187 | |||
188 | 5.2 INTEGRITY-AWARE FILESYSTEM | ||
189 | |||
190 | A filesystem that is integrity-aware can prepare I/Os with IMD | ||
191 | attached. It can also use the application tag space if this is | ||
192 | supported by the block device. | ||
193 | |||
194 | |||
195 | int bdev_integrity_enabled(block_device, int rw); | ||
196 | |||
197 | bdev_integrity_enabled() will return 1 if the block device | ||
198 | supports integrity metadata transfer for the data direction | ||
199 | specified in 'rw'. | ||
200 | |||
201 | bdev_integrity_enabled() honors the write_generate and | ||
202 | read_verify flags in sysfs and will respond accordingly. | ||
203 | |||
204 | |||
205 | int bio_integrity_prep(bio); | ||
206 | |||
207 | To generate IMD for WRITE and to set up buffers for READ, the | ||
208 | filesystem must call bio_integrity_prep(bio). | ||
209 | |||
210 | Prior to calling this function, the bio data direction and start | ||
211 | sector must be set, and the bio should have all data pages | ||
212 | added. It is up to the caller to ensure that the bio does not | ||
213 | change while I/O is in progress. | ||
214 | |||
215 | bio_integrity_prep() should only be called if | ||
216 | bio_integrity_enabled() returned 1. | ||
217 | |||
218 | |||
219 | int bio_integrity_tag_size(bio); | ||
220 | |||
221 | If the filesystem wants to use the application tag space it will | ||
222 | first have to find out how much storage space is available. | ||
223 | Because tag space is generally limited (usually 2 bytes per | ||
224 | sector regardless of sector size), the integrity framework | ||
225 | supports interleaving the information between the sectors in an | ||
226 | I/O. | ||
227 | |||
228 | Filesystems can call bio_integrity_tag_size(bio) to find out how | ||
229 | many bytes of storage are available for that particular bio. | ||
230 | |||
231 | Another option is bdev_get_tag_size(block_device) which will | ||
232 | return the number of available bytes per hardware sector. | ||
233 | |||
234 | |||
235 | int bio_integrity_set_tag(bio, void *tag_buf, len); | ||
236 | |||
237 | After a successful return from bio_integrity_prep(), | ||
238 | bio_integrity_set_tag() can be used to attach an opaque tag | ||
239 | buffer to a bio. Obviously this only makes sense if the I/O is | ||
240 | a WRITE. | ||
241 | |||
242 | |||
243 | int bio_integrity_get_tag(bio, void *tag_buf, len); | ||
244 | |||
245 | Similarly, at READ I/O completion time the filesystem can | ||
246 | retrieve the tag buffer using bio_integrity_get_tag(). | ||
247 | |||
248 | |||
249 | 5.3 PASSING EXISTING INTEGRITY METADATA | |
250 | |||
251 | Filesystems that either generate their own integrity metadata or | ||
252 | are capable of transferring IMD from user space can use the | ||
253 | following calls: | ||
254 | |||
255 | |||
256 | struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages); | ||
257 | |||
258 | Allocates the bio integrity payload and hangs it off of the bio. | ||
259 | nr_pages indicates how many pages of protection data need to be | |
260 | stored in the integrity bio_vec list (similar to bio_alloc()). | ||
261 | |||
262 | The integrity payload will be freed at bio_free() time. | ||
263 | |||
264 | |||
265 | int bio_integrity_add_page(bio, page, len, offset); | ||
266 | |||
267 | Attaches a page containing integrity metadata to an existing | ||
268 | bio. The bio must have an existing bip, | ||
269 | i.e. bio_integrity_alloc() must have been called. For a WRITE, | ||
270 | the integrity metadata in the pages must be in a format | ||
271 | understood by the target device with the notable exception that | ||
272 | the sector numbers will be remapped as the request traverses the | ||
273 | I/O stack. This implies that the pages added using this call | ||
274 | will be modified during I/O! The first reference tag in the | ||
275 | integrity metadata must have a value of bip->bip_sector. | ||
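The sector remapping mentioned above can be pictured with a small userspace model. It assumes the T10 DIF layout (32-bit big-endian reference tag at byte offset 4 of each 8-byte tuple) and walks the metadata buffer, rewriting each reference tag from the sector number the submitter used to the sector number on the target device. remap_ref_tags() and its helpers are hypothetical names for this sketch and not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE     8
#define REF_TAG_OFFSET 4	/* after the guard tag and application tag */

static uint32_t get_be32(const uint8_t *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Rewrite the reference tag of each tuple from 'virt + i' to 'phys + i'.
 * Tuples whose tag does not hold the expected virtual value are left
 * alone.  Returns the number of tuples that were remapped.
 */
static size_t remap_ref_tags(uint8_t *imd, size_t nsect,
			     uint32_t virt, uint32_t phys)
{
	size_t remapped = 0;

	assert(imd != NULL);
	for (size_t i = 0; i < nsect; i++, virt++, phys++) {
		uint8_t *ref = imd + i * TUPLE_SIZE + REF_TAG_OFFSET;

		if (get_be32(ref) == virt) {
			put_be32(ref, phys);
			remapped++;
		}
	}
	return remapped;
}
```

Because the buffer is rewritten in place, a caller that needs the original tags after submission would have to keep its own copy.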
276 | |||
277 | Pages can be added using bio_integrity_add_page() as long as | ||
278 | there is room in the bip bio_vec array (nr_pages). | ||
279 | |||
280 | Upon completion of a READ operation, the attached pages will | ||
281 | contain the integrity metadata received from the storage device. | ||
282 | It is up to the receiver to process them and verify data | ||
283 | integrity upon completion. | ||
284 | |||
285 | |||
286 | 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY | |
287 | METADATA | ||
288 | |||
289 | To enable integrity exchange on a block device the gendisk must be | ||
290 | registered as capable: | ||
291 | |||
292 | int blk_integrity_register(gendisk, blk_integrity); | ||
293 | |||
294 | The blk_integrity struct is a template and should contain the | ||
295 | following: | ||
296 | |||
297 | static struct blk_integrity my_profile = { | ||
298 | .name = "STANDARDSBODY-TYPE-VARIANT-CSUM", | ||
299 | .generate_fn = my_generate_fn, | ||
300 | .verify_fn = my_verify_fn, | ||
301 | .get_tag_fn = my_get_tag_fn, | ||
302 | .set_tag_fn = my_set_tag_fn, | ||
303 | .tuple_size = sizeof(struct my_tuple_size), | ||
304 | .tag_size = <tag bytes per hw sector>, | ||
305 | }; | ||
306 | |||
307 | 'name' is a text string which will be visible in sysfs. This is | ||
308 | part of the userland API so choose it carefully and never change | |
309 | it. The format is standards body-type-variant-checksum. | |
310 | E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC. | ||
311 | |||
312 | 'generate_fn' generates appropriate integrity metadata (for WRITE). | ||
313 | |||
314 | 'verify_fn' verifies that the data buffer matches the integrity | ||
315 | metadata. | ||
316 | |||
317 | 'tuple_size' must be set to match the size of the integrity | ||
318 | metadata per sector. I.e. 8 for DIF and EPP. | ||
319 | |||
320 | 'tag_size' must be set to identify how many bytes of tag space | ||
321 | are available per hardware sector. For DIF this is either 2 or | ||
322 | 0 depending on the value of the Control Mode Page ATO bit. | ||
323 | |||
324 | See 5.2 for a description of get_tag_fn and set_tag_fn. | |
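To give a feel for what a generate/verify callback pair has to do, here is a userspace model for an 8-byte tuple using the IP checksum as guard tag and the sector number as reference tag. The function signatures are simplified assumptions for this sketch and do not match the kernel's callback prototypes; my_generate() and my_verify() merely stand in for generate_fn and verify_fn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512
#define TUPLE_SIZE  8

/* Internet checksum (RFC 1071) over an even-length buffer. */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += (uint32_t)(buf[i] << 8 | buf[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Stand-in for generate_fn: build one tuple per sector (for WRITE). */
static void my_generate(const uint8_t *data, uint8_t *imd, size_t nsect,
			uint32_t sector)
{
	for (size_t i = 0; i < nsect; i++) {
		uint16_t guard = ip_checksum(data + i * SECTOR_SIZE, SECTOR_SIZE);
		uint32_t ref = sector + (uint32_t)i;
		uint8_t *t = imd + i * TUPLE_SIZE;

		t[0] = (uint8_t)(guard >> 8);	/* guard tag */
		t[1] = (uint8_t)guard;
		t[2] = t[3] = 0;		/* application tag, unused here */
		t[4] = (uint8_t)(ref >> 24);	/* reference tag */
		t[5] = (uint8_t)(ref >> 16);
		t[6] = (uint8_t)(ref >> 8);
		t[7] = (uint8_t)ref;
	}
}

/* Stand-in for verify_fn: 0 if every tuple matches, -1 otherwise. */
static int my_verify(const uint8_t *data, const uint8_t *imd, size_t nsect,
		     uint32_t sector)
{
	for (size_t i = 0; i < nsect; i++) {
		uint16_t guard = ip_checksum(data + i * SECTOR_SIZE, SECTOR_SIZE);
		uint32_t ref = sector + (uint32_t)i;
		const uint8_t *t = imd + i * TUPLE_SIZE;
		uint16_t got_guard = (uint16_t)(t[0] << 8 | t[1]);
		uint32_t got_ref = (uint32_t)t[4] << 24 | (uint32_t)t[5] << 16 |
				   (uint32_t)t[6] << 8 | (uint32_t)t[7];

		if (got_guard != guard || got_ref != ref)
			return -1;
	}
	return 0;
}
```

In this model a mismatch in either the checksum or the expected sector number fails verification, which mirrors the two kinds of errors the protection information is meant to catch: corrupted data and misdirected I/O.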
325 | |||
326 | ---------------------------------------------------------------------- | ||
327 | 2007-12-24 Martin K. Petersen <martin.petersen@oracle.com> | ||