Diffstat (limited to 'Documentation/block/biovecs.rst')
| -rw-r--r-- | Documentation/block/biovecs.rst | 146 |
1 files changed, 146 insertions, 0 deletions
diff --git a/Documentation/block/biovecs.rst b/Documentation/block/biovecs.rst
new file mode 100644
index 000000000000..86fa66c87172
--- /dev/null
+++ b/Documentation/block/biovecs.rst
| @@ -0,0 +1,146 @@ | |||
| 1 | ====================================== | ||
| 2 | Immutable biovecs and biovec iterators | ||
| 3 | ====================================== | ||
| 4 | |||
| 5 | Kent Overstreet <kmo@daterainc.com> | ||
| 6 | |||
| 7 | As of 3.13, biovecs should never be modified after a bio has been submitted. | ||
| 8 | Instead, we have a new struct bvec_iter which represents a range of a biovec - | ||
| 9 | the iterator will be modified as the bio is completed, not the biovec. | ||
| 10 | |||
| 11 | More specifically, old code that needed to partially complete a bio would | ||
| 12 | update bi_sector and bi_size, and advance bi_idx to the next biovec. If it | ||
| 13 | ended up partway through a biovec, it would increment bv_offset and decrement | ||
| 14 | bv_len by the number of bytes completed in that biovec. | ||
| 15 | |||
| 16 | In the new scheme of things, everything that must be mutated in order to | ||
| 17 | partially complete a bio is segregated into struct bvec_iter: bi_sector, | ||
| 18 | bi_size and bi_idx have been moved there; and instead of modifying bv_offset | ||
| 19 | and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of | ||
| 20 | bytes completed in the current bvec. | ||
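The split between immutable biovec and mutable iterator can be sketched in user space. This is only an illustrative model, not the kernel code: `struct model_bvec`, `struct model_iter` and `model_iter_advance()` are hypothetical names, and the real structures carry more state (pages, offsets, integrity data).

```c
#include <assert.h>

/* Simplified user-space stand-ins (hypothetical); the kernel structs differ. */
struct model_bvec {
	unsigned int bv_len;		/* length of this segment in bytes */
};

struct model_iter {
	unsigned long long bi_sector;	/* position in 512-byte sectors */
	unsigned int bi_size;		/* bytes remaining in the range */
	unsigned int bi_idx;		/* index of the current bvec */
	unsigned int bi_bvec_done;	/* bytes completed in the current bvec */
};

/* Complete `bytes` of the range: only the iterator changes, never bv[]. */
void model_iter_advance(const struct model_bvec *bv, struct model_iter *it,
			unsigned int bytes)
{
	assert(bytes <= it->bi_size);

	it->bi_sector += bytes >> 9;
	it->bi_size -= bytes;

	/* Count from the start of the current bvec, then skip whole bvecs. */
	bytes += it->bi_bvec_done;
	while (bytes && bytes >= bv[it->bi_idx].bv_len) {
		bytes -= bv[it->bi_idx].bv_len;
		it->bi_idx++;
	}
	it->bi_bvec_done = bytes;
}
```

Ending partway through a bvec leaves the remainder in bi_bvec_done; bv_len in the array is never touched, which is what the old scheme could not guarantee.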
| 21 | |||
| 22 | There are a bunch of new helper macros for hiding the gory details - in | ||
| 23 | particular, presenting the illusion of partially completed biovecs so that | ||
| 24 | normal code doesn't have to deal with bi_bvec_done. | ||
| 25 | |||
| 26 | * Driver code should no longer refer to biovecs directly; we now have | ||
| 27 | bio_iovec() and bio_iter_iovec() macros that return literal struct bio_vecs, | ||
| 28 | constructed from the raw biovecs but taking into account bi_bvec_done and | ||
| 29 | bi_size. | ||
| 30 | |||
| 31 | bio_for_each_segment() has been updated to take a bvec_iter argument | ||
| 32 | instead of an integer (that corresponded to bi_idx); for a lot of code the | ||
| 33 | conversion just required changing the types of the arguments to | ||
| 34 | bio_for_each_segment(). | ||
| 35 | |||
| 36 | * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a | ||
| 37 | wrapper around bio_advance_iter() that operates on bio->bi_iter, and also | ||
| 38 | advances the bio integrity's iter if present. | ||
| 39 | |||
| 40 | There is a lower level advance function - bvec_iter_advance() - which takes | ||
| 41 | a pointer to a biovec, not a bio; this is used by the bio integrity code. | ||
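As a rough illustration of the helpers described above, the following user-space sketch builds the "current" bvec by value from the raw array plus an iterator, and then walks a range with it. All names here (`model_bvec`, `model_iter`, `model_iter_bvec()`, `model_walk()`) are hypothetical stand-ins. The model also ignores page boundaries, which the real bio_for_each_segment() does not.

```c
#include <assert.h>

/* Hypothetical user-space stand-ins for struct bio_vec and struct bvec_iter. */
struct model_bvec {
	const char *base;		/* stands in for bv_page + bv_offset */
	unsigned int bv_len;
};

struct model_iter {
	unsigned int bi_size;		/* bytes remaining in the range */
	unsigned int bi_idx;		/* index of the current bvec */
	unsigned int bi_bvec_done;	/* bytes completed in the current bvec */
};

/* Build the current segment by value, in the spirit of bio_iter_iovec(). */
struct model_bvec model_iter_bvec(const struct model_bvec *bv,
				  struct model_iter it)
{
	struct model_bvec cur;
	unsigned int left;

	assert(it.bi_size > 0);
	left = bv[it.bi_idx].bv_len - it.bi_bvec_done;

	cur.base = bv[it.bi_idx].base + it.bi_bvec_done;
	cur.bv_len = left < it.bi_size ? left : it.bi_size;
	return cur;
}

/*
 * Walk the range on a private copy of the iterator. Returns the number of
 * segments seen and stores the total number of bytes in *total.
 */
unsigned int model_walk(const struct model_bvec *bv, struct model_iter it,
			unsigned int *total)
{
	unsigned int n = 0;

	*total = 0;
	while (it.bi_size) {
		struct model_bvec cur = model_iter_bvec(bv, it);

		*total += cur.bv_len;
		n++;

		/* Advance the copy; the caller's iterator is untouched. */
		it.bi_size -= cur.bv_len;
		it.bi_bvec_done += cur.bv_len;
		if (it.bi_bvec_done == bv[it.bi_idx].bv_len) {
			it.bi_idx++;
			it.bi_bvec_done = 0;
		}
	}
	return n;
}
```

The loop terminates on bi_size alone and never consults a vector count, matching the rule that bi_size, not bi_vcnt, marks the end of a bio.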
| 42 | |||
| 43 | What's all this get us? | ||
| 44 | ======================= | ||
| 45 | |||
| 46 | Having a real iterator, and making biovecs immutable, has a number of | ||
| 47 | advantages: | ||
| 48 | |||
| 49 | * Before, iterating over bios was very awkward when you weren't processing | ||
| 50 | exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c, | ||
| 51 | which copies the contents of one bio into another. Because the biovecs | ||
| 52 | wouldn't necessarily be the same size, the old code was tricky and convoluted - | ||
| 53 | it had to walk two different bios at the same time, keeping both bi_idx and | ||
| 54 | an offset into the current biovec for each. | ||
| 55 | |||
| 56 | The new code is much more straightforward - have a look. This sort of | ||
| 57 | pattern comes up in a lot of places; a lot of drivers were essentially open | ||
| 58 | coding bvec iterators before, and having a common implementation considerably | ||
| 59 | simplifies a lot of code. | ||
| 60 | |||
| 61 | * Before, any code that might need to use the biovec after the bio had been | ||
| 62 | completed (perhaps to copy the data somewhere else, or perhaps to resubmit | ||
| 63 | it somewhere else if there was an error) had to save the entire bvec array | ||
| 64 | - again, this was being done in a fair number of places. Now it is enough to save the bvec_iter. | ||
| 65 | |||
| 66 | * Biovecs can be shared between multiple bios - a bvec iter can represent an | ||
| 67 | arbitrary range of an existing biovec, both starting and ending midway | ||
| 68 | through biovecs. This is what enables efficient splitting of arbitrary | ||
| 69 | bios. Note that this means we _only_ use bi_size to determine when we've | ||
| 70 | reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes | ||
| 71 | bi_size into account when constructing biovecs. | ||
| 72 | |||
| 73 | * Splitting bios is now much simpler. The old bio_split() didn't even work on | ||
| 74 | bios with more than a single bvec! Now, we can efficiently split arbitrary | ||
| 75 | size bios - because the new bio can share the old bio's biovec. | ||
| 76 | |||
| 77 | However, care must be taken to ensure the biovec isn't freed while the split | ||
| 78 | bio is still using it, in case the original bio completes first. Using | ||
| 79 | bio_chain() when splitting bios helps with this. | ||
| 80 | |||
| 81 | * Submitting partially completed bios is now perfectly fine - this comes up | ||
| 82 | occasionally in stacking block drivers, and various code (e.g. md and | ||
| 83 | bcache) had some ugly workarounds for this. | ||
| 84 | |||
| 85 | It used to be the case that submitting a partially completed bio would work | ||
| 86 | fine to _most_ devices, but since accessing the raw bvec array was the | ||
| 87 | norm, not all drivers would respect bi_idx and those would break. Now, | ||
| 88 | since all drivers _must_ go through the bvec iterator - and have been | ||
| 89 | audited to make sure they do - submitting partially completed bios is | ||
| 90 | perfectly fine. | ||
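The splitting point in the list above can be sketched as follows. This is a user-space model under stated assumptions, not the kernel's bio_split(): `model_iter_split()` and the other names are hypothetical, and the lifetime question (keeping the shared biovec alive, e.g. via bio_chain()) is deliberately left out.

```c
#include <assert.h>

/* Hypothetical user-space stand-ins; lifetime management is not modelled. */
struct model_bvec {
	unsigned int bv_len;
};

struct model_iter {
	unsigned long long bi_sector;	/* position in 512-byte sectors */
	unsigned int bi_size;		/* bytes remaining in the range */
	unsigned int bi_idx;		/* index of the current bvec */
	unsigned int bi_bvec_done;	/* bytes completed in the current bvec */
};

/* Advance an iterator by `bytes` without writing to the shared bv[] array. */
void model_iter_advance(const struct model_bvec *bv, struct model_iter *it,
			unsigned int bytes)
{
	assert(bytes <= it->bi_size);
	it->bi_sector += bytes >> 9;
	it->bi_size -= bytes;
	bytes += it->bi_bvec_done;
	while (bytes && bytes >= bv[it->bi_idx].bv_len) {
		bytes -= bv[it->bi_idx].bv_len;
		it->bi_idx++;
	}
	it->bi_bvec_done = bytes;
}

/*
 * Split the range after `bytes`: the returned iterator covers the front
 * part, *orig is advanced to cover the rest. Both refer to the same bv[].
 */
struct model_iter model_iter_split(const struct model_bvec *bv,
				   struct model_iter *orig, unsigned int bytes)
{
	struct model_iter front = *orig;

	assert(bytes > 0 && bytes < orig->bi_size);
	front.bi_size = bytes;		/* front ends where the rest begins */
	model_iter_advance(bv, orig, bytes);
	return front;
}
```

Neither half needs its own copy of the array, and the split point may fall in the middle of a bvec; the front half simply ends early because its bi_size runs out.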
| 91 | |||
| 92 | Other implications: | ||
| 93 | =================== | ||
| 94 | |||
| 95 | * Almost all usage of bi_idx is now incorrect and has been removed; instead, | ||
| 96 | where previously you would have used bi_idx you'd now use a bvec_iter, | ||
| 97 | probably passing it to one of the helper macros. | ||
| 98 | |||
| 99 | I.e. instead of using bio_iovec_idx() (or bio->bi_iovec[bio->bi_idx]), you | ||
| 100 | now use bio_iter_iovec(), which takes a bvec_iter and returns a | ||
| 101 | literal struct bio_vec - constructed on the fly from the raw biovec but | ||
| 102 | taking into account bi_bvec_done (and bi_size). | ||
| 103 | |||
| 104 | * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that | ||
| 105 | doesn't actually own the bio. The reason is twofold: firstly, it's not | ||
| 106 | actually needed for iterating over the bio anymore - we only use bi_size. | ||
| 107 | Secondly, when cloning a bio and reusing (a portion of) the original bio's | ||
| 108 | biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate | ||
| 109 | over all the biovecs in the new bio - which is silly as it's not needed. | ||
| 110 | |||
| 111 | So, don't use bi_vcnt anymore. | ||
| 112 | |||
| 113 | * The current interface allows the block layer to split bios as needed, so we | ||
| 114 | could eliminate a lot of complexity particularly in stacked drivers. Code | ||
| 115 | that creates bios can then create whatever size bios are convenient, and | ||
| 116 | more importantly stacked drivers don't have to deal with both their own bio | ||
| 117 | size limitations and the limitations of the underlying devices. Thus | ||
| 118 | there's no need to define ->merge_bvec_fn() callbacks for individual block | ||
| 119 | drivers. | ||
| 120 | |||
| 121 | Usage of helpers: | ||
| 122 | ================= | ||
| 123 | |||
| 124 | * The following helpers, whose names end in `_all`, can only be used | ||
| 125 | on non-BIO_CLONED bios. They are usually used by filesystem code. Drivers | ||
| 126 | shouldn't use them because the bio may have been split before it reached the | ||
| 127 | driver. | ||
| 128 | |||
| 129 | :: | ||
| 130 | |||
| 131 | bio_for_each_segment_all() | ||
| 132 | bio_first_bvec_all() | ||
| 133 | bio_first_page_all() | ||
| 134 | bio_last_bvec_all() | ||
| 135 | |||
| 136 | * The following helpers iterate over single-page segments. The passed 'struct | ||
| 137 | bio_vec' will contain a single-page IO vector during the iteration:: | ||
| 138 | |||
| 139 | bio_for_each_segment() | ||
| 140 | bio_for_each_segment_all() | ||
| 141 | |||
| 142 | * The following helpers iterate over multi-page bvecs. The passed 'struct | ||
| 143 | bio_vec' will contain a multi-page IO vector during the iteration:: | ||
| 144 | |||
| 145 | bio_for_each_bvec() | ||
| 146 | rq_for_each_bvec() | ||
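
To make the single-page versus multi-page distinction concrete, here is a small user-space sketch of how one multi-page bvec breaks down into single-page segments. It is an assumption-laden model: `MODEL_PAGE_SIZE`, `struct model_mp_bvec` and `model_single_page_lens()` are hypothetical, and a 4096-byte page is assumed.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical multi-page vector: may span several contiguous pages. */
struct model_mp_bvec {
	unsigned int bv_offset;	/* byte offset into the first page */
	unsigned int bv_len;	/* total length, may exceed one page */
};

/*
 * Store the length of each single-page segment of `bv` in lens[] (at most
 * `max` entries) and return how many were stored.
 */
unsigned int model_single_page_lens(struct model_mp_bvec bv,
				    unsigned int *lens, unsigned int max)
{
	unsigned int off = bv.bv_offset % MODEL_PAGE_SIZE;
	unsigned int left = bv.bv_len;
	unsigned int n = 0;

	while (left && n < max) {
		/* A segment never crosses a page boundary. */
		unsigned int seg = MODEL_PAGE_SIZE - off;

		if (seg > left)
			seg = left;
		lens[n++] = seg;
		left -= seg;
		off = 0;	/* later segments start at a page boundary */
	}
	return n;
}
```

In this model a bio_for_each_bvec()-style loop would see the vector once, while a bio_for_each_segment()-style loop would see one entry per element of lens[].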
