diff options
Diffstat (limited to 'Documentation/block/biodoc.txt')
-rw-r--r-- | Documentation/block/biodoc.txt | 1213 |
1 files changed, 1213 insertions, 0 deletions
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt new file mode 100644 index 000000000000..6dd274d7e1cf --- /dev/null +++ b/Documentation/block/biodoc.txt | |||
@@ -0,0 +1,1213 @@ | |||
1 | Notes on the Generic Block Layer Rewrite in Linux 2.5 | ||
2 | ===================================================== | ||
3 | |||
4 | Notes Written on Jan 15, 2002: | ||
5 | Jens Axboe <axboe@suse.de> | ||
6 | Suparna Bhattacharya <suparna@in.ibm.com> | ||
7 | |||
8 | Last Updated May 2, 2002 | ||
9 | September 2003: Updated I/O Scheduler portions | ||
10 | Nick Piggin <piggin@cyberone.com.au> | ||
11 | |||
12 | Introduction: | ||
13 | |||
14 | These are some notes describing some aspects of the 2.5 block layer in the | ||
15 | context of the bio rewrite. The idea is to bring out some of the key | ||
16 | changes and a glimpse of the rationale behind those changes. | ||
17 | |||
18 | Please mail corrections & suggestions to suparna@in.ibm.com. | ||
19 | |||
20 | Credits: | ||
21 | --------- | ||
22 | |||
23 | 2.5 bio rewrite: | ||
24 | Jens Axboe <axboe@suse.de> | ||
25 | |||
26 | Many aspects of the generic block layer redesign were driven by and evolved | ||
27 | over discussions, prior patches and the collective experience of several | ||
28 | people. See sections 8 and 9 for a list of some related references. | ||
29 | |||
30 | The following people helped with review comments and inputs for this | ||
31 | document: | ||
32 | Christoph Hellwig <hch@infradead.org> | ||
33 | Arjan van de Ven <arjanv@redhat.com> | ||
34 | Randy Dunlap <rddunlap@osdl.org> | ||
35 | Andre Hedrick <andre@linux-ide.org> | ||
36 | |||
37 | The following people helped with fixes/contributions to the bio patches | ||
38 | while it was still work-in-progress: | ||
39 | David S. Miller <davem@redhat.com> | ||
40 | |||
41 | |||
42 | Description of Contents: | ||
43 | ------------------------ | ||
44 | |||
45 | 1. Scope for tuning of logic to various needs | ||
46 | 1.1 Tuning based on device or low level driver capabilities | ||
47 | - Per-queue parameters | ||
48 | - Highmem I/O support | ||
49 | - I/O scheduler modularization | ||
50 | 1.2 Tuning based on high level requirements/capabilities | ||
51 | 1.2.1 I/O Barriers | ||
52 | 1.2.2 Request Priority/Latency | ||
53 | 1.3 Direct access/bypass to lower layers for diagnostics and special | ||
54 | device operations | ||
55 | 1.3.1 Pre-built commands | ||
56 | 2. New flexible and generic but minimalist i/o structure or descriptor | ||
57 | (instead of using buffer heads at the i/o layer) | ||
58 | 2.1 Requirements/Goals addressed | ||
59 | 2.2 The bio struct in detail (multi-page io unit) | ||
60 | 2.3 Changes in the request structure | ||
61 | 3. Using bios | ||
62 | 3.1 Setup/teardown (allocation, splitting) | ||
63 | 3.2 Generic bio helper routines | ||
64 | 3.2.1 Traversing segments and completion units in a request | ||
65 | 3.2.2 Setting up DMA scatterlists | ||
66 | 3.2.3 I/O completion | ||
67 | 3.2.4 Implications for drivers that do not interpret bios (don't handle | ||
68 | multiple segments) | ||
69 | 3.2.5 Request command tagging | ||
70 | 3.3 I/O submission | ||
71 | 4. The I/O scheduler | ||
72 | 5. Scalability related changes | ||
73 | 5.1 Granular locking: Removal of io_request_lock | ||
74 | 5.2 Prepare for transition to 64 bit sector_t | ||
75 | 6. Other Changes/Implications | ||
76 | 6.1 Partition re-mapping handled by the generic block layer | ||
77 | 7. A few tips on migration of older drivers | ||
78 | 8. A list of prior/related/impacted patches/ideas | ||
79 | 9. Other References/Discussion Threads | ||
80 | |||
81 | --------------------------------------------------------------------------- | ||
82 | |||
83 | Bio Notes | ||
84 | -------- | ||
85 | |||
86 | Let us discuss the changes in the context of how some overall goals for the | ||
87 | block layer are addressed. | ||
88 | |||
89 | 1. Scope for tuning the generic logic to satisfy various requirements | ||
90 | |||
91 | The block layer design supports adaptable abstractions to handle common | ||
92 | processing with the ability to tune the logic to an appropriate extent | ||
93 | depending on the nature of the device and the requirements of the caller. | ||
94 | One of the objectives of the rewrite was to increase the degree of tunability | ||
95 | and to enable higher level code to utilize underlying device/driver | ||
96 | capabilities to the maximum extent for better i/o performance. This is | ||
97 | important especially in the light of ever improving hardware capabilities | ||
98 | and application/middleware software designed to take advantage of these | ||
99 | capabilities. | ||
100 | |||
101 | 1.1 Tuning based on low level device / driver capabilities | ||
102 | |||
103 | Sophisticated devices with large built-in caches, intelligent i/o scheduling | ||
104 | optimizations, high memory DMA support, etc may find some of the | ||
105 | generic processing an overhead, while for less capable devices the | ||
106 | generic functionality is essential for performance or correctness reasons. | ||
107 | Knowledge of some of the capabilities or parameters of the device should be | ||
108 | used at the generic block layer to take the right decisions on | ||
109 | behalf of the driver. | ||
110 | |||
111 | How is this achieved ? | ||
112 | |||
113 | Tuning at a per-queue level: | ||
114 | |||
115 | i. Per-queue limits/values exported to the generic layer by the driver | ||
116 | |||
117 | Various parameters that the generic i/o scheduler logic uses are set at | ||
118 | a per-queue level (e.g maximum request size, maximum number of segments in | ||
119 | a scatter-gather list, hardsect size) | ||
120 | |||
121 | Some parameters that were earlier available as global arrays indexed by | ||
122 | major/minor are now directly associated with the queue. Some of these may | ||
123 | move into the block device structure in the future. Some characteristics | ||
124 | have been incorporated into a queue flags field rather than separate fields | ||
125 | in themselves. There are blk_queue_xxx functions to set the parameters, | ||
126 | rather than update the fields directly | ||
127 | |||
128 | Some new queue property settings: | ||
129 | |||
130 | blk_queue_bounce_limit(q, u64 dma_address) | ||
131 | Enable I/O to highmem pages, dma_address being the | ||
132 | limit. No highmem default. | ||
133 | |||
134 | blk_queue_max_sectors(q, max_sectors) | ||
135 | Maximum size request you can handle in units of 512 byte | ||
136 | sectors. 255 default. | ||
137 | |||
138 | blk_queue_max_phys_segments(q, max_segments) | ||
139 | Maximum physical segments you can handle in a request. 128 | ||
140 | default (driver limit). (See 3.2.2) | ||
141 | |||
142 | blk_queue_max_hw_segments(q, max_segments) | ||
143 | Maximum dma segments the hardware can handle in a request. 128 | ||
144 | default (host adapter limit, after dma remapping). | ||
145 | (See 3.2.2) | ||
146 | |||
147 | blk_queue_max_segment_size(q, max_seg_size) | ||
148 | Maximum size of a clustered segment, 64kB default. | ||
149 | |||
150 | blk_queue_hardsect_size(q, hardsect_size) | ||
151 | Lowest possible sector size that the hardware can operate | ||
152 | on, 512 bytes default. | ||
153 | |||
154 | New queue flags: | ||
155 | |||
156 | QUEUE_FLAG_CLUSTER (see 3.2.2) | ||
157 | QUEUE_FLAG_QUEUED (see 3.2.4) | ||
158 | |||
159 | |||
160 | ii. High-mem i/o capabilities are now considered the default | ||
161 | |||
162 | The generic bounce buffer logic, present in 2.4, where the block layer would | ||
163 | by default copyin/out i/o requests on high-memory buffers to low-memory buffers | ||
164 | assuming that the driver wouldn't be able to handle it directly, has been | ||
165 | changed in 2.5. The bounce logic is now applied only for memory ranges | ||
166 | for which the device cannot handle i/o. A driver can specify this by | ||
167 | setting the queue bounce limit for the request queue for the device | ||
168 | (blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out | ||
169 | where a device is capable of handling high memory i/o. | ||
170 | |||
171 | In order to enable high-memory i/o where the device is capable of supporting | ||
172 | it, the pci dma mapping routines and associated data structures have now been | ||
173 | modified to accomplish a direct page -> bus translation, without requiring | ||
174 | a virtual address mapping (unlike the earlier scheme of virtual address | ||
175 | -> bus translation). So this works uniformly for high-memory pages (which | ||
176 | do not have a correponding kernel virtual address space mapping) and | ||
177 | low-memory pages. | ||
178 | |||
179 | Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA | ||
180 | aspects and mapping of scatter gather lists, and support for 64 bit PCI. | ||
181 | |||
182 | Special handling is required only for cases where i/o needs to happen on | ||
183 | pages at physical memory addresses beyond what the device can support. In these | ||
184 | cases, a bounce bio representing a buffer from the supported memory range | ||
185 | is used for performing the i/o with copyin/copyout as needed depending on | ||
186 | the type of the operation. For example, in case of a read operation, the | ||
187 | data read has to be copied to the original buffer on i/o completion, so a | ||
188 | callback routine is set up to do this, while for write, the data is copied | ||
189 | from the original buffer to the bounce buffer prior to issuing the | ||
190 | operation. Since an original buffer may be in a high memory area that's not | ||
191 | mapped in kernel virtual addr, a kmap operation may be required for | ||
192 | performing the copy, and special care may be needed in the completion path | ||
193 | as it may not be in irq context. Special care is also required (by way of | ||
194 | GFP flags) when allocating bounce buffers, to avoid certain highmem | ||
195 | deadlock possibilities. | ||
196 | |||
197 | It is also possible that a bounce buffer may be allocated from high-memory | ||
198 | area that's not mapped in kernel virtual addr, but within the range that the | ||
199 | device can use directly; so the bounce page may need to be kmapped during | ||
200 | copy operations. [Note: This does not hold in the current implementation, | ||
201 | though] | ||
202 | |||
203 | There are some situations when pages from high memory may need to | ||
204 | be kmapped, even if bounce buffers are not necessary. For example a device | ||
205 | may need to abort DMA operations and revert to PIO for the transfer, in | ||
206 | which case a virtual mapping of the page is required. For SCSI it is also | ||
207 | done in some scenarios where the low level driver cannot be trusted to | ||
208 | handle a single sg entry correctly. The driver is expected to perform the | ||
209 | kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq | ||
210 | routines as appropriate. A driver could also use the blk_queue_bounce() | ||
211 | routine on its own to bounce highmem i/o to low memory for specific requests | ||
212 | if so desired. | ||
213 | |||
214 | iii. The i/o scheduler algorithm itself can be replaced/set as appropriate | ||
215 | |||
216 | As in 2.4, it is possible to plugin a brand new i/o scheduler for a particular | ||
217 | queue or pick from (copy) existing generic schedulers and replace/override | ||
218 | certain portions of it. The 2.5 rewrite provides improved modularization | ||
219 | of the i/o scheduler. There are more pluggable callbacks, e.g for init, | ||
220 | add request, extract request, which makes it possible to abstract specific | ||
221 | i/o scheduling algorithm aspects and details outside of the generic loop. | ||
222 | It also makes it possible to completely hide the implementation details of | ||
223 | the i/o scheduler from block drivers. | ||
224 | |||
225 | I/O scheduler wrappers are to be used instead of accessing the queue directly. | ||
226 | See section 4. The I/O scheduler for details. | ||
227 | |||
228 | 1.2 Tuning Based on High level code capabilities | ||
229 | |||
230 | i. Application capabilities for raw i/o | ||
231 | |||
232 | This comes from some of the high-performance database/middleware | ||
233 | requirements where an application prefers to make its own i/o scheduling | ||
234 | decisions based on an understanding of the access patterns and i/o | ||
235 | characteristics | ||
236 | |||
237 | ii. High performance filesystems or other higher level kernel code's | ||
238 | capabilities | ||
239 | |||
240 | Kernel components like filesystems could also take their own i/o scheduling | ||
241 | decisions for optimizing performance. Journalling filesystems may need | ||
242 | some control over i/o ordering. | ||
243 | |||
244 | What kind of support exists at the generic block layer for this ? | ||
245 | |||
246 | The flags and rw fields in the bio structure can be used for some tuning | ||
247 | from above e.g indicating that an i/o is just a readahead request, or for | ||
248 | marking barrier requests (discussed next), or priority settings (currently | ||
249 | unused). As far as user applications are concerned they would need an | ||
250 | additional mechanism either via open flags or ioctls, or some other upper | ||
251 | level mechanism to communicate such settings to block. | ||
252 | |||
253 | 1.2.1 I/O Barriers | ||
254 | |||
255 | There is a way to enforce strict ordering for i/os through barriers. | ||
256 | All requests before a barrier point must be serviced before the barrier | ||
257 | request and any other requests arriving after the barrier will not be | ||
258 | serviced until after the barrier has completed. This is useful for higher | ||
259 | level control on write ordering, e.g flushing a log of committed updates | ||
260 | to disk before the corresponding updates themselves. | ||
261 | |||
262 | A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o. | ||
263 | The generic i/o scheduler would make sure that it places the barrier request and | ||
264 | all other requests coming after it after all the previous requests in the | ||
265 | queue. Barriers may be implemented in different ways depending on the | ||
266 | driver. A SCSI driver for example could make use of ordered tags to | ||
267 | preserve the necessary ordering with a lower impact on throughput. For IDE | ||
268 | this might be two sync cache flush: a pre and post flush when encountering | ||
269 | a barrier write. | ||
270 | |||
271 | There is a provision for queues to indicate what kind of barriers they | ||
272 | can provide. This is as of yet unmerged, details will be added here once it | ||
273 | is in the kernel. | ||
274 | |||
275 | 1.2.2 Request Priority/Latency | ||
276 | |||
277 | Todo/Under discussion: | ||
278 | Arjan's proposed request priority scheme allows higher levels some broad | ||
279 | control (high/med/low) over the priority of an i/o request vs other pending | ||
280 | requests in the queue. For example it allows reads for bringing in an | ||
281 | executable page on demand to be given a higher priority over pending write | ||
282 | requests which haven't aged too much on the queue. Potentially this priority | ||
283 | could even be exposed to applications in some manner, providing higher level | ||
284 | tunability. Time based aging avoids starvation of lower priority | ||
285 | requests. Some bits in the bi_rw flags field in the bio structure are | ||
286 | intended to be used for this priority information. | ||
287 | |||
288 | |||
289 | 1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode) | ||
290 | (e.g Diagnostics, Systems Management) | ||
291 | |||
292 | There are situations where high-level code needs to have direct access to | ||
293 | the low level device capabilities or requires the ability to issue commands | ||
294 | to the device bypassing some of the intermediate i/o layers. | ||
295 | These could, for example, be special control commands issued through ioctl | ||
296 | interfaces, or could be raw read/write commands that stress the drive's | ||
297 | capabilities for certain kinds of fitness tests. Having direct interfaces at | ||
298 | multiple levels without having to pass through upper layers makes | ||
299 | it possible to perform bottom up validation of the i/o path, layer by | ||
300 | layer, starting from the media. | ||
301 | |||
302 | The normal i/o submission interfaces, e.g submit_bio, could be bypassed | ||
303 | for specially crafted requests which such ioctl or diagnostics | ||
304 | interfaces would typically use, and the elevator add_request routine | ||
305 | can instead be used to directly insert such requests in the queue or preferably | ||
306 | the blk_do_rq routine can be used to place the request on the queue and | ||
307 | wait for completion. Alternatively, sometimes the caller might just | ||
308 | invoke a lower level driver specific interface with the request as a | ||
309 | parameter. | ||
310 | |||
311 | If the request is a means for passing on special information associated with | ||
312 | the command, then such information is associated with the request->special | ||
313 | field (rather than misuse the request->buffer field which is meant for the | ||
314 | request data buffer's virtual mapping). | ||
315 | |||
316 | For passing request data, the caller must build up a bio descriptor | ||
317 | representing the concerned memory buffer if the underlying driver interprets | ||
318 | bio segments or uses the block layer end*request* functions for i/o | ||
319 | completion. Alternatively one could directly use the request->buffer field to | ||
320 | specify the virtual address of the buffer, if the driver expects buffer | ||
321 | addresses passed in this way and ignores bio entries for the request type | ||
322 | involved. In the latter case, the driver would modify and manage the | ||
323 | request->buffer, request->sector and request->nr_sectors or | ||
324 | request->current_nr_sectors fields itself rather than using the block layer | ||
325 | end_request or end_that_request_first completion interfaces. | ||
326 | (See 2.3 or Documentation/block/request.txt for a brief explanation of | ||
327 | the request structure fields) | ||
328 | |||
329 | [TBD: end_that_request_last should be usable even in this case; | ||
330 | Perhaps an end_that_direct_request_first routine could be implemented to make | ||
331 | handling direct requests easier for such drivers; Also for drivers that | ||
332 | expect bios, a helper function could be provided for setting up a bio | ||
333 | corresponding to a data buffer] | ||
334 | |||
335 | <JENS: I dont understand the above, why is end_that_request_first() not | ||
336 | usable? Or _last for that matter. I must be missing something> | ||
337 | <SUP: What I meant here was that if the request doesn't have a bio, then | ||
338 | end_that_request_first doesn't modify nr_sectors or current_nr_sectors, | ||
339 | and hence can't be used for advancing request state settings on the | ||
340 | completion of partial transfers. The driver has to modify these fields | ||
341 | directly by hand. | ||
342 | This is because end_that_request_first only iterates over the bio list, | ||
343 | and always returns 0 if there are none associated with the request. | ||
344 | _last works OK in this case, and is not a problem, as I mentioned earlier | ||
345 | > | ||
346 | |||
347 | 1.3.1 Pre-built Commands | ||
348 | |||
349 | A request can be created with a pre-built custom command to be sent directly | ||
350 | to the device. The cmd block in the request structure has room for filling | ||
351 | in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for | ||
352 | command pre-building, and the type of the request is now indicated | ||
353 | through rq->flags instead of via rq->cmd) | ||
354 | |||
355 | The request structure flags can be set up to indicate the type of request | ||
356 | in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC: | ||
357 | packet command issued via blk_do_rq, REQ_SPECIAL: special request). | ||
358 | |||
359 | It can help to pre-build device commands for requests in advance. | ||
360 | Drivers can now specify a request prepare function (q->prep_rq_fn) that the | ||
361 | block layer would invoke to pre-build device commands for a given request, | ||
362 | or perform other preparatory processing for the request. This is routine is | ||
363 | called by elv_next_request(), i.e. typically just before servicing a request. | ||
364 | (The prepare function would not be called for requests that have REQ_DONTPREP | ||
365 | enabled) | ||
366 | |||
367 | Aside: | ||
368 | Pre-building could possibly even be done early, i.e before placing the | ||
369 | request on the queue, rather than construct the command on the fly in the | ||
370 | driver while servicing the request queue when it may affect latencies in | ||
371 | interrupt context or responsiveness in general. One way to add early | ||
372 | pre-building would be to do it whenever we fail to merge on a request. | ||
373 | Now REQ_NOMERGE is set in the request flags to skip this one in the future, | ||
374 | which means that it will not change before we feed it to the device. So | ||
375 | the pre-builder hook can be invoked there. | ||
376 | |||
377 | |||
378 | 2. Flexible and generic but minimalist i/o structure/descriptor. | ||
379 | |||
380 | 2.1 Reason for a new structure and requirements addressed | ||
381 | |||
382 | Prior to 2.5, buffer heads were used as the unit of i/o at the generic block | ||
383 | layer, and the low level request structure was associated with a chain of | ||
384 | buffer heads for a contiguous i/o request. This led to certain inefficiencies | ||
385 | when it came to large i/o requests and readv/writev style operations, as it | ||
386 | forced such requests to be broken up into small chunks before being passed | ||
387 | on to the generic block layer, only to be merged by the i/o scheduler | ||
388 | when the underlying device was capable of handling the i/o in one shot. | ||
389 | Also, using the buffer head as an i/o structure for i/os that didn't originate | ||
390 | from the buffer cache unecessarily added to the weight of the descriptors | ||
391 | which were generated for each such chunk. | ||
392 | |||
393 | The following were some of the goals and expectations considered in the | ||
394 | redesign of the block i/o data structure in 2.5. | ||
395 | |||
396 | i. Should be appropriate as a descriptor for both raw and buffered i/o - | ||
397 | avoid cache related fields which are irrelevant in the direct/page i/o path, | ||
398 | or filesystem block size alignment restrictions which may not be relevant | ||
399 | for raw i/o. | ||
400 | ii. Ability to represent high-memory buffers (which do not have a virtual | ||
401 | address mapping in kernel address space). | ||
402 | iii.Ability to represent large i/os w/o unecessarily breaking them up (i.e | ||
403 | greater than PAGE_SIZE chunks in one shot) | ||
404 | iv. At the same time, ability to retain independent identity of i/os from | ||
405 | different sources or i/o units requiring individual completion (e.g. for | ||
406 | latency reasons) | ||
407 | v. Ability to represent an i/o involving multiple physical memory segments | ||
408 | (including non-page aligned page fragments, as specified via readv/writev) | ||
409 | without unecessarily breaking it up, if the underlying device is capable of | ||
410 | handling it. | ||
411 | vi. Preferably should be based on a memory descriptor structure that can be | ||
412 | passed around different types of subsystems or layers, maybe even | ||
413 | networking, without duplication or extra copies of data/descriptor fields | ||
414 | themselves in the process | ||
415 | vii.Ability to handle the possibility of splits/merges as the structure passes | ||
416 | through layered drivers (lvm, md, evms), with minimal overhead. | ||
417 | |||
418 | The solution was to define a new structure (bio) for the block layer, | ||
419 | instead of using the buffer head structure (bh) directly, the idea being | ||
420 | avoidance of some associated baggage and limitations. The bio structure | ||
421 | is uniformly used for all i/o at the block layer ; it forms a part of the | ||
422 | bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are | ||
423 | mapped to bio structures. | ||
424 | |||
425 | 2.2 The bio struct | ||
426 | |||
427 | The bio structure uses a vector representation pointing to an array of tuples | ||
428 | of <page, offset, len> to describe the i/o buffer, and has various other | ||
429 | fields describing i/o parameters and state that needs to be maintained for | ||
430 | performing the i/o. | ||
431 | |||
432 | Notice that this representation means that a bio has no virtual address | ||
433 | mapping at all (unlike buffer heads). | ||
434 | |||
435 | struct bio_vec { | ||
436 | struct page *bv_page; | ||
437 | unsigned short bv_len; | ||
438 | unsigned short bv_offset; | ||
439 | }; | ||
440 | |||
441 | /* | ||
442 | * main unit of I/O for the block layer and lower layers (ie drivers) | ||
443 | */ | ||
444 | struct bio { | ||
445 | sector_t bi_sector; | ||
446 | struct bio *bi_next; /* request queue link */ | ||
447 | struct block_device *bi_bdev; /* target device */ | ||
448 | unsigned long bi_flags; /* status, command, etc */ | ||
449 | unsigned long bi_rw; /* low bits: r/w, high: priority */ | ||
450 | |||
451 | unsigned int bi_vcnt; /* how may bio_vec's */ | ||
452 | unsigned int bi_idx; /* current index into bio_vec array */ | ||
453 | |||
454 | unsigned int bi_size; /* total size in bytes */ | ||
455 | unsigned short bi_phys_segments; /* segments after physaddr coalesce*/ | ||
456 | unsigned short bi_hw_segments; /* segments after DMA remapping */ | ||
457 | unsigned int bi_max; /* max bio_vecs we can hold | ||
458 | used as index into pool */ | ||
459 | struct bio_vec *bi_io_vec; /* the actual vec list */ | ||
460 | bio_end_io_t *bi_end_io; /* bi_end_io (bio) */ | ||
461 | atomic_t bi_cnt; /* pin count: free when it hits zero */ | ||
462 | void *bi_private; | ||
463 | bio_destructor_t *bi_destructor; /* bi_destructor (bio) */ | ||
464 | }; | ||
465 | |||
466 | With this multipage bio design: | ||
467 | |||
468 | - Large i/os can be sent down in one go using a bio_vec list consisting | ||
469 | of an array of <page, offset, len> fragments (similar to the way fragments | ||
470 | are represented in the zero-copy network code) | ||
471 | - Splitting of an i/o request across multiple devices (as in the case of | ||
472 | lvm or raid) is achieved by cloning the bio (where the clone points to | ||
473 | the same bi_io_vec array, but with the index and size accordingly modified) | ||
474 | - A linked list of bios is used as before for unrelated merges (*) - this | ||
475 | avoids reallocs and makes independent completions easier to handle. | ||
476 | - Code that traverses the req list needs to make a distinction between | ||
477 | segments of a request (bio_for_each_segment) and the distinct completion | ||
478 | units/bios (rq_for_each_bio). | ||
479 | - Drivers which can't process a large bio in one shot can use the bi_idx | ||
480 | field to keep track of the next bio_vec entry to process. | ||
481 | (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE) | ||
482 | [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying | ||
483 | bi_offset an len fields] | ||
484 | |||
485 | (*) unrelated merges -- a request ends up containing two or more bios that | ||
486 | didn't originate from the same place. | ||
487 | |||
488 | bi_end_io() i/o callback gets called on i/o completion of the entire bio. | ||
489 | |||
490 | At a lower level, drivers build a scatter gather list from the merged bios. | ||
491 | The scatter gather list is in the form of an array of <page, offset, len> | ||
492 | entries with their corresponding dma address mappings filled in at the | ||
493 | appropriate time. As an optimization, contiguous physical pages can be | ||
494 | covered by a single entry where <page> refers to the first page and <len> | ||
495 | covers the range of pages (upto 16 contiguous pages could be covered this | ||
496 | way). There is a helper routine (blk_rq_map_sg) which drivers can use to build | ||
497 | the sg list. | ||
498 | |||
499 | Note: Right now the only user of bios with more than one page is ll_rw_kio, | ||
500 | which in turn means that only raw I/O uses it (direct i/o may not work | ||
501 | right now). The intent however is to enable clustering of pages etc to | ||
502 | become possible. The pagebuf abstraction layer from SGI also uses multi-page | ||
503 | bios, but that is currently not included in the stock development kernels. | ||
504 | The same is true of Andrew Morton's work-in-progress multipage bio writeout | ||
505 | and readahead patches. | ||
506 | |||
507 | 2.3 Changes in the Request Structure | ||
508 | |||
509 | The request structure is the structure that gets passed down to low level | ||
510 | drivers. The block layer make_request function builds up a request structure, | ||
511 | places it on the queue and invokes the drivers request_fn. The driver makes | ||
512 | use of block layer helper routine elv_next_request to pull the next request | ||
513 | off the queue. Control or diagnostic functions might bypass block and directly | ||
514 | invoke underlying driver entry points passing in a specially constructed | ||
515 | request structure. | ||
516 | |||
517 | Only some relevant fields (mainly those which changed or may be referred | ||
518 | to in some of the discussion here) are listed below, not necessarily in | ||
519 | the order in which they occur in the structure (see include/linux/blkdev.h) | ||
520 | Refer to Documentation/block/request.txt for details about all the request | ||
521 | structure fields and a quick reference about the layers which are | ||
522 | supposed to use or modify those fields. | ||
523 | |||
524 | struct request { | ||
525 | struct list_head queuelist; /* Not meant to be directly accessed by | ||
526 | the driver. | ||
527 | Used by q->elv_next_request_fn | ||
528 | rq->queue is gone | ||
529 | */ | ||
530 | . | ||
531 | . | ||
532 | unsigned char cmd[16]; /* prebuilt command data block */ | ||
533 | unsigned long flags; /* also includes earlier rq->cmd settings */ | ||
534 | . | ||
535 | . | ||
536 | sector_t sector; /* this field is now of type sector_t instead of int | ||
537 | preparation for 64 bit sectors */ | ||
538 | . | ||
539 | . | ||
540 | |||
541 | /* Number of scatter-gather DMA addr+len pairs after | ||
542 | * physical address coalescing is performed. | ||
543 | */ | ||
544 | unsigned short nr_phys_segments; | ||
545 | |||
546 | /* Number of scatter-gather addr+len pairs after | ||
547 | * physical and DMA remapping hardware coalescing is performed. | ||
548 | * This is the number of scatter-gather entries the driver | ||
549 | * will actually have to deal with after DMA mapping is done. | ||
550 | */ | ||
551 | unsigned short nr_hw_segments; | ||
552 | |||
553 | /* Various sector counts */ | ||
554 | unsigned long nr_sectors; /* no. of sectors left: driver modifiable */ | ||
555 | unsigned long hard_nr_sectors; /* block internal copy of above */ | ||
556 | unsigned int current_nr_sectors; /* no. of sectors left in the | ||
557 | current segment:driver modifiable */ | ||
558 | unsigned long hard_cur_sectors; /* block internal copy of the above */ | ||
559 | . | ||
560 | . | ||
561 | int tag; /* command tag associated with request */ | ||
562 | void *special; /* same as before */ | ||
563 | char *buffer; /* valid only for low memory buffers upto | ||
564 | current_nr_sectors */ | ||
565 | . | ||
566 | . | ||
567 | struct bio *bio, *biotail; /* bio list instead of bh */ | ||
568 | struct request_list *rl; | ||
569 | } | ||
570 | |||
571 | See the rq_flag_bits definitions for an explanation of the various flags | ||
572 | available. Some bits are used by the block layer or i/o scheduler. | ||
573 | |||
574 | The behaviour of the various sector counts are almost the same as before, | ||
575 | except that since we have multi-segment bios, current_nr_sectors refers | ||
576 | to the numbers of sectors in the current segment being processed which could | ||
577 | be one of the many segments in the current bio (i.e i/o completion unit). | ||
578 | The nr_sectors value refers to the total number of sectors in the whole | ||
579 | request that remain to be transferred (no change). The purpose of the | ||
580 | hard_xxx values is for block to remember these counts every time it hands | ||
581 | over the request to the driver. These values are updated by block on | ||
582 | end_that_request_first, i.e. every time the driver completes a part of the | ||
583 | transfer and invokes block end*request helpers to mark this. The | ||
584 | driver should not modify these values. The block layer sets up the | ||
585 | nr_sectors and current_nr_sectors fields (based on the corresponding | ||
586 | hard_xxx values and the number of bytes transferred) and updates it on | ||
587 | every transfer that invokes end_that_request_first. It does the same for the | ||
588 | buffer, bio, bio->bi_idx fields too. | ||
589 | |||
590 | The buffer field is just a virtual address mapping of the current segment | ||
591 | of the i/o buffer in cases where the buffer resides in low-memory. For high | ||
592 | memory i/o, this field is not valid and must not be used by drivers. | ||
593 | |||
594 | Code that sets up its own request structures and passes them down to | ||
595 | a driver needs to be careful about interoperation with the block layer helper | ||
596 | functions which the driver uses. (Section 1.3) | ||
597 | |||
598 | 3. Using bios | ||
599 | |||
600 | 3.1 Setup/Teardown | ||
601 | |||
602 | There are routines for managing the allocation, and reference counting, and | ||
603 | freeing of bios (bio_alloc, bio_get, bio_put). | ||
604 | |||
605 | This makes use of Ingo Molnar's mempool implementation, which enables | ||
606 | subsystems like bio to maintain their own reserve memory pools for guaranteed | ||
607 | deadlock-free allocations during extreme VM load. For example, the VM | ||
608 | subsystem makes use of the block layer to writeout dirty pages in order to be | ||
609 | able to free up memory space, a case which needs careful handling. The | ||
610 | allocation logic draws from the preallocated emergency reserve in situations | ||
611 | where it cannot allocate through normal means. If the pool is empty and it | ||
612 | can wait, then it would trigger action that would help free up memory or | ||
613 | replenish the pool (without deadlocking) and wait for availability in the pool. | ||
614 | If it is in IRQ context, and hence not in a position to do this, allocation | ||
615 | could fail if the pool is empty. In general mempool always first tries to | ||
616 | perform allocation without having to wait, even if it means digging into the | ||
617 | pool as long it is not less that 50% full. | ||
618 | |||
619 | On a free, memory is released to the pool or directly freed depending on | ||
620 | the current availability in the pool. The mempool interface lets the | ||
621 | subsystem specify the routines to be used for normal alloc and free. In the | ||
622 | case of bio, these routines make use of the standard slab allocator. | ||
623 | |||
624 | The caller of bio_alloc is expected to taken certain steps to avoid | ||
625 | deadlocks, e.g. avoid trying to allocate more memory from the pool while | ||
626 | already holding memory obtained from the pool. | ||
627 | [TBD: This is a potential issue, though a rare possibility | ||
628 | in the bounce bio allocation that happens in the current code, since | ||
629 | it ends up allocating a second bio from the same pool while | ||
630 | holding the original bio ] | ||
631 | |||
632 | Memory allocated from the pool should be released back within a limited | ||
633 | amount of time (in the case of bio, that would be after the i/o is completed). | ||
634 | This ensures that if part of the pool has been used up, some work (in this | ||
635 | case i/o) must already be in progress and memory would be available when it | ||
636 | is over. If allocating from multiple pools in the same code path, the order | ||
637 | or hierarchy of allocation needs to be consistent, just the way one deals | ||
638 | with multiple locks. | ||
639 | |||
640 | The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc()) | ||
641 | for a non-clone bio. There are the 6 pools setup for different size biovecs, | ||
642 | so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the | ||
643 | given size from these slabs. | ||
644 | |||
645 | The bi_destructor() routine takes into account the possibility of the bio | ||
646 | having originated from a different source (see later discussions on | ||
647 | n/w to block transfers and kvec_cb) | ||
648 | |||
649 | The bio_get() routine may be used to hold an extra reference on a bio prior | ||
650 | to i/o submission, if the bio fields are likely to be accessed after the | ||
651 | i/o is issued (since the bio may otherwise get freed in case i/o completion | ||
652 | happens in the meantime). | ||
653 | |||
654 | The bio_clone() routine may be used to duplicate a bio, where the clone | ||
655 | shares the bio_vec_list with the original bio (i.e. both point to the | ||
656 | same bio_vec_list). This would typically be used for splitting i/o requests | ||
657 | in lvm or md. | ||
658 | |||
659 | 3.2 Generic bio helper Routines | ||
660 | |||
661 | 3.2.1 Traversing segments and completion units in a request | ||
662 | |||
663 | The macros bio_for_each_segment() and rq_for_each_bio() should be used for | ||
664 | traversing the bios in the request list (drivers should avoid directly | ||
665 | trying to do it themselves). Using these helpers should also make it easier | ||
666 | to cope with block changes in the future. | ||
667 | |||
668 | rq_for_each_bio(bio, rq) | ||
669 | bio_for_each_segment(bio_vec, bio, i) | ||
670 | /* bio_vec is now current segment */ | ||
671 | |||
672 | I/O completion callbacks are per-bio rather than per-segment, so drivers | ||
673 | that traverse bio chains on completion need to keep that in mind. Drivers | ||
674 | which don't make a distinction between segments and completion units would | ||
675 | need to be reorganized to support multi-segment bios. | ||
676 | |||
677 | 3.2.2 Setting up DMA scatterlists | ||
678 | |||
679 | The blk_rq_map_sg() helper routine would be used for setting up scatter | ||
680 | gather lists from a request, so a driver need not do it on its own. | ||
681 | |||
682 | nr_segments = blk_rq_map_sg(q, rq, scatterlist); | ||
683 | |||
684 | The helper routine provides a level of abstraction which makes it easier | ||
685 | to modify the internals of request to scatterlist conversion down the line | ||
686 | without breaking drivers. The blk_rq_map_sg routine takes care of several | ||
687 | things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER | ||
688 | is set) and correct segment accounting to avoid exceeding the limits which | ||
689 | the i/o hardware can handle, based on various queue properties. | ||
690 | |||
691 | - Prevents a clustered segment from crossing a 4GB mem boundary | ||
692 | - Avoids building segments that would exceed the number of physical | ||
693 | memory segments that the driver can handle (phys_segments) and the | ||
694 | number that the underlying hardware can handle at once, accounting for | ||
695 | DMA remapping (hw_segments) (i.e. IOMMU aware limits). | ||
696 | |||
697 | Routines which the low level driver can use to set up the segment limits: | ||
698 | |||
699 | blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of | ||
700 | hw data segments in a request (i.e. the maximum number of address/length | ||
701 | pairs the host adapter can actually hand to the device at once) | ||
702 | |||
703 | blk_queue_max_phys_segments() : Sets an upper limit on the maximum number | ||
704 | of physical data segments in a request (i.e. the largest sized scatter list | ||
705 | a driver could handle) | ||
706 | |||
707 | 3.2.3 I/O completion | ||
708 | |||
709 | The existing generic block layer helper routines end_request, | ||
710 | end_that_request_first and end_that_request_last can be used for i/o | ||
711 | completion (and setting things up so the rest of the i/o or the next | ||
712 | request can be kicked of) as before. With the introduction of multi-page | ||
713 | bio support, end_that_request_first requires an additional argument indicating | ||
714 | the number of sectors completed. | ||
715 | |||
716 | 3.2.4 Implications for drivers that do not interpret bios (don't handle | ||
717 | multiple segments) | ||
718 | |||
719 | Drivers that do not interpret bios e.g those which do not handle multiple | ||
720 | segments and do not support i/o into high memory addresses (require bounce | ||
721 | buffers) and expect only virtually mapped buffers, can access the rq->buffer | ||
722 | field. As before the driver should use current_nr_sectors to determine the | ||
723 | size of remaining data in the current segment (that is the maximum it can | ||
724 | transfer in one go unless it interprets segments), and rely on the block layer | ||
725 | end_request, or end_that_request_first/last to take care of all accounting | ||
726 | and transparent mapping of the next bio segment when a segment boundary | ||
727 | is crossed on completion of a transfer. (The end*request* functions should | ||
728 | be used if only if the request has come down from block/bio path, not for | ||
729 | direct access requests which only specify rq->buffer without a valid rq->bio) | ||
730 | |||
731 | 3.2.5 Generic request command tagging | ||
732 | |||
733 | 3.2.5.1 Tag helpers | ||
734 | |||
735 | Block now offers some simple generic functionality to help support command | ||
736 | queueing (typically known as tagged command queueing), ie manage more than | ||
737 | one outstanding command on a queue at any given time. | ||
738 | |||
739 | blk_queue_init_tags(request_queue_t *q, int depth) | ||
740 | |||
741 | Initialize internal command tagging structures for a maximum | ||
742 | depth of 'depth'. | ||
743 | |||
744 | blk_queue_free_tags((request_queue_t *q) | ||
745 | |||
746 | Teardown tag info associated with the queue. This will be done | ||
747 | automatically by block if blk_queue_cleanup() is called on a queue | ||
748 | that is using tagging. | ||
749 | |||
750 | The above are initialization and exit management, the main helpers during | ||
751 | normal operations are: | ||
752 | |||
753 | blk_queue_start_tag(request_queue_t *q, struct request *rq) | ||
754 | |||
755 | Start tagged operation for this request. A free tag number between | ||
756 | 0 and 'depth' is assigned to the request (rq->tag holds this number), | ||
757 | and 'rq' is added to the internal tag management. If the maximum depth | ||
758 | for this queue is already achieved (or if the tag wasn't started for | ||
759 | some other reason), 1 is returned. Otherwise 0 is returned. | ||
760 | |||
761 | blk_queue_end_tag(request_queue_t *q, struct request *rq) | ||
762 | |||
763 | End tagged operation on this request. 'rq' is removed from the internal | ||
764 | book keeping structures. | ||
765 | |||
766 | To minimize struct request and queue overhead, the tag helpers utilize some | ||
767 | of the same request members that are used for normal request queue management. | ||
768 | This means that a request cannot both be an active tag and be on the queue | ||
769 | list at the same time. blk_queue_start_tag() will remove the request, but | ||
770 | the driver must remember to call blk_queue_end_tag() before signalling | ||
771 | completion of the request to the block layer. This means ending tag | ||
772 | operations before calling end_that_request_last()! For an example of a user | ||
773 | of these helpers, see the IDE tagged command queueing support. | ||
774 | |||
775 | Certain hardware conditions may dictate a need to invalidate the block tag | ||
776 | queue. For instance, on IDE any tagged request error needs to clear both | ||
777 | the hardware and software block queue and enable the driver to sanely restart | ||
778 | all the outstanding requests. There's a third helper to do that: | ||
779 | |||
780 | blk_queue_invalidate_tags(request_queue_t *q) | ||
781 | |||
782 | Clear the internal block tag queue and readd all the pending requests | ||
783 | to the request queue. The driver will receive them again on the | ||
784 | next request_fn run, just like it did the first time it encountered | ||
785 | them. | ||
786 | |||
787 | 3.2.5.2 Tag info | ||
788 | |||
789 | Some block functions exist to query current tag status or to go from a | ||
790 | tag number to the associated request. These are, in no particular order: | ||
791 | |||
792 | blk_queue_tagged(q) | ||
793 | |||
794 | Returns 1 if the queue 'q' is using tagging, 0 if not. | ||
795 | |||
796 | blk_queue_tag_request(q, tag) | ||
797 | |||
798 | Returns a pointer to the request associated with tag 'tag'. | ||
799 | |||
800 | blk_queue_tag_depth(q) | ||
801 | |||
802 | Return current queue depth. | ||
803 | |||
804 | blk_queue_tag_queue(q) | ||
805 | |||
806 | Returns 1 if the queue can accept a new queued command, 0 if we are | ||
807 | at the maximum depth already. | ||
808 | |||
809 | blk_queue_rq_tagged(rq) | ||
810 | |||
811 | Returns 1 if the request 'rq' is tagged. | ||
812 | |||
813 | 3.2.5.2 Internal structure | ||
814 | |||
815 | Internally, block manages tags in the blk_queue_tag structure: | ||
816 | |||
817 | struct blk_queue_tag { | ||
818 | struct request **tag_index; /* array or pointers to rq */ | ||
819 | unsigned long *tag_map; /* bitmap of free tags */ | ||
820 | struct list_head busy_list; /* fifo list of busy tags */ | ||
821 | int busy; /* queue depth */ | ||
822 | int max_depth; /* max queue depth */ | ||
823 | }; | ||
824 | |||
825 | Most of the above is simple and straight forward, however busy_list may need | ||
826 | a bit of explaining. Normally we don't care too much about request ordering, | ||
827 | but in the event of any barrier requests in the tag queue we need to ensure | ||
828 | that requests are restarted in the order they were queue. This may happen | ||
829 | if the driver needs to use blk_queue_invalidate_tags(). | ||
830 | |||
831 | Tagging also defines a new request flag, REQ_QUEUED. This is set whenever | ||
832 | a request is currently tagged. You should not use this flag directly, | ||
833 | blk_rq_tagged(rq) is the portable way to do so. | ||
834 | |||
835 | 3.3 I/O Submission | ||
836 | |||
837 | The routine submit_bio() is used to submit a single io. Higher level i/o | ||
838 | routines make use of this: | ||
839 | |||
840 | (a) Buffered i/o: | ||
841 | The routine submit_bh() invokes submit_bio() on a bio corresponding to the | ||
842 | bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before. | ||
843 | |||
844 | (b) Kiobuf i/o (for raw/direct i/o): | ||
845 | The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and | ||
846 | maps the array to one or more multi-page bios, issuing submit_bio() to | ||
847 | perform the i/o on each of these. | ||
848 | |||
849 | The embedded bh array in the kiobuf structure has been removed and no | ||
850 | preallocation of bios is done for kiobufs. [The intent is to remove the | ||
851 | blocks array as well, but it's currently in there to kludge around direct i/o.] | ||
852 | Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc. | ||
853 | |||
854 | Todo/Observation: | ||
855 | |||
856 | A single kiobuf structure is assumed to correspond to a contiguous range | ||
857 | of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec. | ||
858 | So right now it wouldn't work for direct i/o on non-contiguous blocks. | ||
859 | This is to be resolved. The eventual direction is to replace kiobuf | ||
860 | by kvec's. | ||
861 | |||
862 | Badari Pulavarty has a patch to implement direct i/o correctly using | ||
863 | bio and kvec. | ||
864 | |||
865 | |||
866 | (c) Page i/o: | ||
867 | Todo/Under discussion: | ||
868 | |||
869 | Andrew Morton's multi-page bio patches attempt to issue multi-page | ||
870 | writeouts (and reads) from the page cache, by directly building up | ||
871 | large bios for submission completely bypassing the usage of buffer | ||
872 | heads. This work is still in progress. | ||
873 | |||
874 | Christoph Hellwig had some code that uses bios for page-io (rather than | ||
875 | bh). This isn't included in bio as yet. Christoph was also working on a | ||
876 | design for representing virtual/real extents as an entity and modifying | ||
877 | some of the address space ops interfaces to utilize this abstraction rather | ||
878 | than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf | ||
879 | abstraction, but intended to be as lightweight as possible). | ||
880 | |||
881 | (d) Direct access i/o: | ||
882 | Direct access requests that do not contain bios would be submitted differently | ||
883 | as discussed earlier in section 1.3. | ||
884 | |||
885 | Aside: | ||
886 | |||
887 | Kvec i/o: | ||
888 | |||
889 | Ben LaHaise's aio code uses a slighly different structure instead | ||
890 | of kiobufs, called a kvec_cb. This contains an array of <page, offset, len> | ||
891 | tuples (very much like the networking code), together with a callback function | ||
892 | and data pointer. This is embedded into a brw_cb structure when passed | ||
893 | to brw_kvec_async(). | ||
894 | |||
895 | Now it should be possible to directly map these kvecs to a bio. Just as while | ||
896 | cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec | ||
897 | array pointer to point to the veclet array in kvecs. | ||
898 | |||
899 | TBD: In order for this to work, some changes are needed in the way multi-page | ||
900 | bios are handled today. The values of the tuples in such a vector passed in | ||
901 | from higher level code should not be modified by the block layer in the course | ||
902 | of its request processing, since that would make it hard for the higher layer | ||
903 | to continue to use the vector descriptor (kvec) after i/o completes. Instead, | ||
904 | all such transient state should either be maintained in the request structure, | ||
905 | and passed on in some way to the endio completion routine. | ||
906 | |||
907 | |||
908 | 4. The I/O scheduler | ||
909 | I/O schedulers are now per queue. They should be runtime switchable and modular | ||
910 | but aren't yet. Jens has most bits to do this, but the sysfs implementation is | ||
911 | missing. | ||
912 | |||
913 | A block layer call to the i/o scheduler follows the convention elv_xxx(). This | ||
914 | calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, | ||
915 | xxx and xxx might not match exactly, but use your imagination. If an elevator | ||
916 | doesn't implement a function, the switch does nothing or some minimal house | ||
917 | keeping work. | ||
918 | |||
919 | 4.1. I/O scheduler API | ||
920 | |||
921 | The functions an elevator may implement are: (* are mandatory) | ||
922 | elevator_merge_fn called to query requests for merge with a bio | ||
923 | |||
924 | elevator_merge_req_fn " " " with another request | ||
925 | |||
926 | elevator_merged_fn called when a request in the scheduler has been | ||
927 | involved in a merge. It is used in the deadline | ||
928 | scheduler for example, to reposition the request | ||
929 | if its sorting order has changed. | ||
930 | |||
931 | *elevator_next_req_fn returns the next scheduled request, or NULL | ||
932 | if there are none (or none are ready). | ||
933 | |||
934 | *elevator_add_req_fn called to add a new request into the scheduler | ||
935 | |||
936 | elevator_queue_empty_fn returns true if the merge queue is empty. | ||
937 | Drivers shouldn't use this, but rather check | ||
938 | if elv_next_request is NULL (without losing the | ||
939 | request if one exists!) | ||
940 | |||
941 | elevator_remove_req_fn This is called when a driver claims ownership of | ||
942 | the target request - it now belongs to the | ||
943 | driver. It must not be modified or merged. | ||
944 | Drivers must not lose the request! A subsequent | ||
945 | call of elevator_next_req_fn must return the | ||
946 | _next_ request. | ||
947 | |||
948 | elevator_requeue_req_fn called to add a request to the scheduler. This | ||
949 | is used when the request has alrnadebeen | ||
950 | returned by elv_next_request, but hasn't | ||
951 | completed. If this is not implemented then | ||
952 | elevator_add_req_fn is called instead. | ||
953 | |||
954 | elevator_former_req_fn | ||
955 | elevator_latter_req_fn These return the request before or after the | ||
956 | one specified in disk sort order. Used by the | ||
957 | block layer to find merge possibilities. | ||
958 | |||
959 | elevator_completed_req_fn called when a request is completed. This might | ||
960 | come about due to being merged with another or | ||
961 | when the device completes the request. | ||
962 | |||
963 | elevator_may_queue_fn returns true if the scheduler wants to allow the | ||
964 | current context to queue a new request even if | ||
965 | it is over the queue limit. This must be used | ||
966 | very carefully!! | ||
967 | |||
968 | elevator_set_req_fn | ||
969 | elevator_put_req_fn Must be used to allocate and free any elevator | ||
970 | specific storate for a request. | ||
971 | |||
972 | elevator_init_fn | ||
973 | elevator_exit_fn Allocate and free any elevator specific storage | ||
974 | for a queue. | ||
975 | |||
976 | 4.2 I/O scheduler implementation | ||
977 | The generic i/o scheduler algorithm attempts to sort/merge/batch requests for | ||
978 | optimal disk scan and request servicing performance (based on generic | ||
979 | principles and device capabilities), optimized for: | ||
980 | i. improved throughput | ||
981 | ii. improved latency | ||
982 | iii. better utilization of h/w & CPU time | ||
983 | |||
984 | Characteristics: | ||
985 | |||
986 | i. Binary tree | ||
987 | AS and deadline i/o schedulers use red black binary trees for disk position | ||
988 | sorting and searching, and a fifo linked list for time-based searching. This | ||
989 | gives good scalability and good availablility of information. Requests are | ||
990 | almost always dispatched in disk sort order, so a cache is kept of the next | ||
991 | request in sort order to prevent binary tree lookups. | ||
992 | |||
993 | This arrangement is not a generic block layer characteristic however, so | ||
994 | elevators may implement queues as they please. | ||
995 | |||
996 | ii. Last merge hint | ||
997 | The last merge hint is part of the generic queue layer. I/O schedulers must do | ||
998 | some management on it. For the most part, the most important thing is to make | ||
999 | sure q->last_merge is cleared (set to NULL) when the request on it is no longer | ||
1000 | a candidate for merging (for example if it has been sent to the driver). | ||
1001 | |||
1002 | The last merge performed is cached as a hint for the subsequent request. If | ||
1003 | sequential data is being submitted, the hint is used to perform merges without | ||
1004 | any scanning. This is not sufficient when there are multiple processes doing | ||
1005 | I/O though, so a "merge hash" is used by some schedulers. | ||
1006 | |||
1007 | iii. Merge hash | ||
1008 | AS and deadline use a hash table indexed by the last sector of a request. This | ||
1009 | enables merging code to quickly look up "back merge" candidates, even when | ||
1010 | multiple I/O streams are being performed at once on one disk. | ||
1011 | |||
1012 | "Front merges", a new request being merged at the front of an existing request, | ||
1013 | are far less common than "back merges" due to the nature of most I/O patterns. | ||
1014 | Front merges are handled by the binary trees in AS and deadline schedulers. | ||
1015 | |||
1016 | iv. Handling barrier cases | ||
1017 | A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered | ||
1018 | around. That is, they must be processed after all older requests, and before | ||
1019 | any newer ones. This includes merges! | ||
1020 | |||
1021 | In AS and deadline schedulers, barriers have the effect of flushing the reorder | ||
1022 | queue. The performance cost of this will vary from nothing to a lot depending | ||
1023 | on i/o patterns and device characteristics. Obviously they won't improve | ||
1024 | performance, so their use should be kept to a minimum. | ||
1025 | |||
1026 | v. Handling insertion position directives | ||
1027 | A request may be inserted with a position directive. The directives are one of | ||
1028 | ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT. | ||
1029 | |||
1030 | ELEVATOR_INSERT_SORT is a general directive for non-barrier requests. | ||
1031 | ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue. | ||
1032 | ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and | ||
1033 | overrides the ordering requested by any previous barriers. In practice this is | ||
1034 | harmless and required, because it is used for SCSI requeueing. This does not | ||
1035 | require flushing the reorder queue, so does not impose a performance penalty. | ||
1036 | |||
1037 | vi. Plugging the queue to batch requests in anticipation of opportunities for | ||
1038 | merge/sort optimizations | ||
1039 | |||
1040 | This is just the same as in 2.4 so far, though per-device unplugging | ||
1041 | support is anticipated for 2.5. Also with a priority-based i/o scheduler, | ||
1042 | such decisions could be based on request priorities. | ||
1043 | |||
1044 | Plugging is an approach that the current i/o scheduling algorithm resorts to so | ||
1045 | that it collects up enough requests in the queue to be able to take | ||
1046 | advantage of the sorting/merging logic in the elevator. If the | ||
1047 | queue is empty when a request comes in, then it plugs the request queue | ||
1048 | (sort of like plugging the bottom of a vessel to get fluid to build up) | ||
1049 | till it fills up with a few more requests, before starting to service | ||
1050 | the requests. This provides an opportunity to merge/sort the requests before | ||
1051 | passing them down to the device. There are various conditions when the queue is | ||
1052 | unplugged (to open up the flow again), either through a scheduled task or | ||
1053 | could be on demand. For example wait_on_buffer sets the unplugging going | ||
1054 | (by running tq_disk) so the read gets satisfied soon. So in the read case, | ||
1055 | the queue gets explicitly unplugged as part of waiting for completion, | ||
1056 | in fact all queues get unplugged as a side-effect. | ||
1057 | |||
1058 | Aside: | ||
1059 | This is kind of controversial territory, as it's not clear if plugging is | ||
1060 | always the right thing to do. Devices typically have their own queues, | ||
1061 | and allowing a big queue to build up in software, while letting the device be | ||
1062 | idle for a while may not always make sense. The trick is to handle the fine | ||
1063 | balance between when to plug and when to open up. Also now that we have | ||
1064 | multi-page bios being queued in one shot, we may not need to wait to merge | ||
1065 | a big request from the broken up pieces coming by. | ||
1066 | |||
1067 | Per-queue granularity unplugging (still a Todo) may help reduce some of the | ||
1068 | concerns with just a single tq_disk flush approach. Something like | ||
1069 | blk_kick_queue() to unplug a specific queue (right away ?) | ||
1070 | or optionally, all queues, is in the plan. | ||
1071 | |||
1072 | 4.3 I/O contexts | ||
1073 | I/O contexts provide a dynamically allocated per process data area. They may | ||
1074 | be used in I/O schedulers, and in the block layer (could be used for IO statis, | ||
1075 | priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and | ||
1076 | as-iosched.c for an example of usage in an i/o scheduler. | ||
1077 | |||
1078 | |||
1079 | 5. Scalability related changes | ||
1080 | |||
1081 | 5.1 Granular Locking: io_request_lock replaced by a per-queue lock | ||
1082 | |||
1083 | The global io_request_lock has been removed as of 2.5, to avoid | ||
1084 | the scalability bottleneck it was causing, and has been replaced by more | ||
1085 | granular locking. The request queue structure has a pointer to the | ||
1086 | lock to be used for that queue. As a result, locking can now be | ||
1087 | per-queue, with a provision for sharing a lock across queues if | ||
1088 | necessary (e.g the scsi layer sets the queue lock pointers to the | ||
1089 | corresponding adapter lock, which results in a per host locking | ||
1090 | granularity). The locking semantics are the same, i.e. locking is | ||
1091 | still imposed by the block layer, grabbing the lock before | ||
1092 | request_fn execution which it means that lots of older drivers | ||
1093 | should still be SMP safe. Drivers are free to drop the queue | ||
1094 | lock themselves, if required. Drivers that explicitly used the | ||
1095 | io_request_lock for serialization need to be modified accordingly. | ||
1096 | Usually it's as easy as adding a global lock: | ||
1097 | |||
1098 | static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED; | ||
1099 | |||
1100 | and passing the address to that lock to blk_init_queue(). | ||
1101 | |||
1102 | 5.2 64 bit sector numbers (sector_t prepares for 64 bit support) | ||
1103 | |||
1104 | The sector number used in the bio structure has been changed to sector_t, | ||
1105 | which could be defined as 64 bit in preparation for 64 bit sector support. | ||
1106 | |||
1107 | 6. Other Changes/Implications | ||
1108 | |||
1109 | 6.1 Partition re-mapping handled by the generic block layer | ||
1110 | |||
1111 | In 2.5 some of the gendisk/partition related code has been reorganized. | ||
1112 | Now the generic block layer performs partition-remapping early and thus | ||
1113 | provides drivers with a sector number relative to whole device, rather than | ||
1114 | having to take partition number into account in order to arrive at the true | ||
1115 | sector number. The routine blk_partition_remap() is invoked by | ||
1116 | generic_make_request even before invoking the queue specific make_request_fn, | ||
1117 | so the i/o scheduler also gets to operate on whole disk sector numbers. This | ||
1118 | should typically not require changes to block drivers, it just never gets | ||
1119 | to invoke its own partition sector offset calculations since all bios | ||
1120 | sent are offset from the beginning of the device. | ||
1121 | |||
1122 | |||
1123 | 7. A Few Tips on Migration of older drivers | ||
1124 | |||
1125 | Old-style drivers that just use CURRENT and ignores clustered requests, | ||
1126 | may not need much change. The generic layer will automatically handle | ||
1127 | clustered requests, multi-page bios, etc for the driver. | ||
1128 | |||
1129 | For a low performance driver or hardware that is PIO driven or just doesn't | ||
1130 | support scatter-gather changes should be minimal too. | ||
1131 | |||
1132 | The following are some points to keep in mind when converting old drivers | ||
1133 | to bio. | ||
1134 | |||
1135 | Drivers should use elv_next_request to pick up requests and are no longer | ||
1136 | supposed to handle looping directly over the request list. | ||
1137 | (struct request->queue has been removed) | ||
1138 | |||
1139 | Now end_that_request_first takes an additional number_of_sectors argument. | ||
1140 | It used to handle always just the first buffer_head in a request, now | ||
1141 | it will loop and handle as many sectors (on a bio-segment granularity) | ||
1142 | as specified. | ||
1143 | |||
1144 | Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the | ||
1145 | right thing to use is bio_endio(bio, uptodate) instead. | ||
1146 | |||
1147 | If the driver is dropping the io_request_lock from its request_fn strategy, | ||
1148 | then it just needs to replace that with q->queue_lock instead. | ||
1149 | |||
1150 | As described in Sec 1.1, drivers can set max sector size, max segment size | ||
1151 | etc per queue now. Drivers that used to define their own merge functions i | ||
1152 | to handle things like this can now just use the blk_queue_* functions at | ||
1153 | blk_init_queue time. | ||
1154 | |||
1155 | Drivers no longer have to map a {partition, sector offset} into the | ||
1156 | correct absolute location anymore, this is done by the block layer, so | ||
1157 | where a driver received a request ala this before: | ||
1158 | |||
1159 | rq->rq_dev = mk_kdev(3, 5); /* /dev/hda5 */ | ||
1160 | rq->sector = 0; /* first sector on hda5 */ | ||
1161 | |||
1162 | it will now see | ||
1163 | |||
1164 | rq->rq_dev = mk_kdev(3, 0); /* /dev/hda */ | ||
1165 | rq->sector = 123128; /* offset from start of disk */ | ||
1166 | |||
1167 | As mentioned, there is no virtual mapping of a bio. For DMA, this is | ||
1168 | not a problem as the driver probably never will need a virtual mapping. | ||
1169 | Instead it needs a bus mapping (pci_map_page for a single segment or | ||
1170 | use blk_rq_map_sg for scatter gather) to be able to ship it to the driver. For | ||
1171 | PIO drivers (or drivers that need to revert to PIO transfer once in a | ||
1172 | while (IDE for example)), where the CPU is doing the actual data | ||
1173 | transfer a virtual mapping is needed. If the driver supports highmem I/O, | ||
1174 | (Sec 1.1, (ii) ) it needs to use __bio_kmap_atomic and bio_kmap_irq to | ||
1175 | temporarily map a bio into the virtual address space. | ||
1176 | |||
1177 | |||
1178 | 8. Prior/Related/Impacted patches | ||
1179 | |||
1180 | 8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp) | ||
1181 | - orig kiobuf & raw i/o patches (now in 2.4 tree) | ||
1182 | - direct kiobuf based i/o to devices (no intermediate bh's) | ||
1183 | - page i/o using kiobuf | ||
1184 | - kiobuf splitting for lvm (mkp) | ||
1185 | - elevator support for kiobuf request merging (axboe) | ||
1186 | 8.2. Zero-copy networking (Dave Miller) | ||
1187 | 8.3. SGI XFS - pagebuf patches - use of kiobufs | ||
1188 | 8.4. Multi-page pioent patch for bio (Christoph Hellwig) | ||
1189 | 8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11 | ||
1190 | 8.6. Async i/o implementation patch (Ben LaHaise) | ||
1191 | 8.7. EVMS layering design (IBM EVMS team) | ||
1192 | 8.8. Larger page cache size patch (Ben LaHaise) and | ||
1193 | Large page size (Daniel Phillips) | ||
1194 | => larger contiguous physical memory buffers | ||
1195 | 8.9. VM reservations patch (Ben LaHaise) | ||
1196 | 8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?) | ||
1197 | 8.11. Block device in page cache patch (Andrea Archangeli) - now in 2.4.10+ | ||
1198 | 8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, | ||
1199 | Badari) | ||
1200 | 8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven) | ||
1201 | 8.14 IDE Taskfile i/o patch (Andre Hedrick) | ||
1202 | 8.15 Multi-page writeout and readahead patches (Andrew Morton) | ||
1203 | 8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy) | ||
1204 | |||
1205 | 9. Other References: | ||
1206 | |||
1207 | 9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml, | ||
1208 | and Linus' comments - Jan 2001) | ||
1209 | 9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan | ||
1210 | et al - Feb-March 2001 (many of the initial thoughts that led to bio were | ||
1211 | brought up in this discusion thread) | ||
1212 | 9.3 Discussions on mempool on lkml - Dec 2001. | ||
1213 | |||