Diffstat (limited to 'Documentation/block/biovecs.txt')
-rw-r--r-- | Documentation/block/biovecs.txt | 111 |
1 files changed, 111 insertions, 0 deletions
diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
new file mode 100644
index 000000000000..74a32ad52f53
--- /dev/null
+++ b/Documentation/block/biovecs.txt
@@ -0,0 +1,111 @@

Immutable biovecs and biovec iterators:
=======================================

Kent Overstreet <kmo@daterainc.com>

As of 3.13, biovecs should never be modified after a bio has been submitted.
Instead, we have a new struct bvec_iter which represents a range of a biovec -
the iterator will be modified as the bio is completed, not the biovec.

More specifically, old code that needed to partially complete a bio would
update bi_sector and bi_size, and advance bi_idx to the next biovec. If it
ended up partway through a biovec, it would increment bv_offset and decrement
bv_len by the number of bytes completed in that biovec.

In the new scheme of things, everything that must be mutated in order to
partially complete a bio is segregated into struct bvec_iter: bi_sector,
bi_size and bi_idx have been moved there; and instead of modifying bv_offset
and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of
bytes completed in the current bvec.

There are a bunch of new helper macros for hiding the gory details - in
particular, presenting the illusion of partially completed biovecs so that
normal code doesn't have to deal with bi_bvec_done.

 * Driver code should no longer refer to biovecs directly; we now have
   bio_iovec() and bio_iter_iovec() macros that return literal struct bio_vecs,
   constructed from the raw biovecs but taking into account bi_bvec_done and
   bi_size.

   bio_for_each_segment() has been updated to take a bvec_iter argument
   instead of an integer (that corresponded to bi_idx); for a lot of code the
   conversion just required changing the types of the arguments to
   bio_for_each_segment().

 * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
   wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
   advances the bio integrity's iter if present.

   There is a lower level advance function - bvec_iter_advance() - which takes
   a pointer to a biovec, not a bio; this is used by the bio integrity code.
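
The two helpers described above can be modelled in userspace roughly as
follows. This is only a sketch, not the kernel implementation: the demo_*
names are made up for illustration, a plain buffer pointer stands in for
bv_page, and bi_sector bookkeeping is left out. It assumes the iterator never
covers more bytes than the raw array holds.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for struct bio_vec. */
struct demo_bvec {
	unsigned char *bv_buf;		/* replaces bv_page in this model */
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Hypothetical stand-in for struct bvec_iter (bi_sector omitted). */
struct demo_iter {
	unsigned int bi_size;		/* bytes remaining in the range */
	unsigned int bi_idx;		/* index of the current raw bvec */
	unsigned int bi_bvec_done;	/* bytes completed in current bvec */
};

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Build the "partially completed" view of the current bvec by value.
 * The raw array is only read, never written. Caller checks bi_size != 0.
 */
static struct demo_bvec demo_iter_bvec(const struct demo_bvec *bv,
				       struct demo_iter it)
{
	const struct demo_bvec *cur = &bv[it.bi_idx];
	struct demo_bvec view;

	view.bv_buf = cur->bv_buf;
	view.bv_offset = cur->bv_offset + it.bi_bvec_done;
	view.bv_len = demo_min(it.bi_size, cur->bv_len - it.bi_bvec_done);
	return view;
}

/* Advance the iterator by @bytes; only the iterator is modified. */
static void demo_iter_advance(const struct demo_bvec *bv,
			      struct demo_iter *it, unsigned int bytes)
{
	bytes = demo_min(bytes, it->bi_size);	/* never run past the end */

	while (bytes) {
		unsigned int len = demo_min(bytes,
				bv[it->bi_idx].bv_len - it->bi_bvec_done);

		bytes -= len;
		it->bi_size -= len;
		it->bi_bvec_done += len;

		/* Finished this raw bvec: move on to the next one. */
		if (it->bi_bvec_done == bv[it->bi_idx].bv_len) {
			it->bi_bvec_done = 0;
			it->bi_idx++;
		}
	}
}
```

With two 8 byte bvecs and a 16 byte iterator, advancing by 5 bytes leaves the
raw entry at offset 0, length 8, while demo_iter_bvec() reports offset 5,
length 3.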

What's all this get us?
=======================

Having a real iterator, and making biovecs immutable, has a number of
advantages:

 * Before, iterating over bios was very awkward when you weren't processing
   exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c,
   which copies the contents of one bio into another. Because the biovecs
   wouldn't necessarily be the same size, the old code was tricky and
   convoluted - it had to walk two different bios at the same time, keeping
   both bi_idx and an offset into the current biovec for each.

   The new code is much more straightforward - have a look. This sort of
   pattern comes up in a lot of places; a lot of drivers were essentially open
   coding bvec iterators before, and having a common implementation
   considerably simplifies a lot of code.

 * Before, any code that might need to use the biovec after the bio had been
   completed (perhaps to copy the data somewhere else, or perhaps to resubmit
   it somewhere else if there was an error) had to save the entire bvec array
   - again, this was being done in a fair number of places. Now that the
   biovec is never modified, saving a copy of the bvec_iter is enough.

 * Biovecs can be shared between multiple bios - a bvec iter can represent an
   arbitrary range of an existing biovec, both starting and ending midway
   through biovecs. This is what enables efficient splitting of arbitrary
   bios. Note that this means we _only_ use bi_size to determine when we've
   reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
   bi_size into account when constructing biovecs.

 * Splitting bios is now much simpler. The old bio_split() didn't even work on
   bios with more than a single bvec! Now, we can efficiently split arbitrary
   size bios - because the new bio can share the old bio's biovec.

   Care must be taken to ensure the biovec isn't freed while the split bio is
   still using it, in case the original bio completes first, though. Using
   bio_chain() when splitting bios helps with this.

 * Submitting partially completed bios is now perfectly fine - this comes up
   occasionally in stacking block drivers, and various code (e.g. md and
   bcache) had some ugly workarounds for this.

   It used to be the case that submitting a partially completed bio would work
   fine for _most_ devices, but since accessing the raw bvec array was the
   norm, not all drivers would respect bi_idx and those would break. Now,
   since all drivers _must_ go through the bvec iterator - and have been
   audited to make sure they are - submitting partially completed bios is
   perfectly fine.
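
To illustrate the first two points, here is a userspace sketch of copying
between two bios whose biovecs are cut at different boundaries. It is loosely
modelled on what bio_copy_data() has to do but is not the kernel code: every
demo_* name is hypothetical, buffers stand in for pages, and it assumes no
bvec has zero length. The walk uses private copies of the iterators, so both
bios can still be reused afterwards.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace stand-ins for bio_vec, bvec_iter and bio. */
struct demo_bvec {
	unsigned char *bv_buf;
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct demo_iter {
	unsigned int bi_size;		/* bytes remaining */
	unsigned int bi_idx;		/* current raw bvec */
	unsigned int bi_bvec_done;	/* bytes done in current bvec */
};

struct demo_bio {
	struct demo_bvec *bi_io_vec;
	struct demo_iter bi_iter;
};

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* View of the current bvec, adjusted for bi_bvec_done and bi_size. */
static struct demo_bvec demo_iter_bvec(const struct demo_bvec *bv,
				       struct demo_iter it)
{
	struct demo_bvec view = bv[it.bi_idx];

	view.bv_offset += it.bi_bvec_done;
	view.bv_len = demo_min(it.bi_size, view.bv_len - it.bi_bvec_done);
	return view;
}

/* Advance the iterator only; the raw bvec array is left untouched. */
static void demo_iter_advance(const struct demo_bvec *bv,
			      struct demo_iter *it, unsigned int bytes)
{
	bytes = demo_min(bytes, it->bi_size);
	while (bytes) {
		unsigned int len = demo_min(bytes,
				bv[it->bi_idx].bv_len - it->bi_bvec_done);

		bytes -= len;
		it->bi_size -= len;
		it->bi_bvec_done += len;
		if (it->bi_bvec_done == bv[it->bi_idx].bv_len) {
			it->bi_bvec_done = 0;
			it->bi_idx++;
		}
	}
}

/* Copy until either bio is exhausted; returns the number of bytes copied. */
static unsigned int demo_copy_data(const struct demo_bio *dst,
				   const struct demo_bio *src)
{
	/* Private iterator copies: neither bio's own state is changed. */
	struct demo_iter di = dst->bi_iter, si = src->bi_iter;
	unsigned int copied = 0;

	while (di.bi_size && si.bi_size) {
		struct demo_bvec dv = demo_iter_bvec(dst->bi_io_vec, di);
		struct demo_bvec sv = demo_iter_bvec(src->bi_io_vec, si);
		unsigned int n = demo_min(dv.bv_len, sv.bv_len);

		memcpy(dv.bv_buf + dv.bv_offset, sv.bv_buf + sv.bv_offset, n);
		demo_iter_advance(dst->bi_io_vec, &di, n);
		demo_iter_advance(src->bi_io_vec, &si, n);
		copied += n;
	}
	return copied;
}
```

Each step copies the smaller of the two current views and advances both
iterators by that amount, so no per-bio index and offset pair needs to be
tracked by hand.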

Other implications:
===================

 * Almost all usage of bi_idx is now incorrect and has been removed; instead,
   where previously you would have used bi_idx you'd now use a bvec_iter,
   probably passing it to one of the helper macros.

   I.e. instead of using bio_iovec_idx() (or bio->bi_io_vec[bio->bi_idx]), you
   now use bio_iter_iovec(), which takes a bvec_iter and returns a
   literal struct bio_vec - constructed on the fly from the raw biovec but
   taking into account bi_bvec_done (and bi_size).

 * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that
   doesn't actually own the bio. The reason is twofold: firstly, it's not
   actually needed for iterating over the bio anymore - we only use bi_size.
   Secondly, when cloning a bio and reusing (a portion of) the original bio's
   biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate
   over all the biovecs in the new bio - which is silly as it's not needed.

   So, don't use bi_vcnt anymore.
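
The bi_vcnt point can be seen in a small userspace sketch. It is not kernel
code: all demo_* names are hypothetical, and demo_split_front() assumes the
split point lies within the iterator's range. Two iterators share one raw
array, each covering an arbitrary byte range of it, and the number of segments
each one visits is found purely from bi_size.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for bio_vec and bvec_iter. */
struct demo_bvec {
	unsigned char *bv_buf;
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct demo_iter {
	unsigned int bi_size;		/* bytes remaining */
	unsigned int bi_idx;		/* current raw bvec */
	unsigned int bi_bvec_done;	/* bytes done in current bvec */
};

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* View of the current bvec, adjusted for bi_bvec_done and bi_size. */
static struct demo_bvec demo_iter_bvec(const struct demo_bvec *bv,
				       struct demo_iter it)
{
	struct demo_bvec view = bv[it.bi_idx];

	view.bv_offset += it.bi_bvec_done;
	view.bv_len = demo_min(it.bi_size, view.bv_len - it.bi_bvec_done);
	return view;
}

/* Advance the iterator only; the raw bvec array is left untouched. */
static void demo_iter_advance(const struct demo_bvec *bv,
			      struct demo_iter *it, unsigned int bytes)
{
	bytes = demo_min(bytes, it->bi_size);
	while (bytes) {
		unsigned int len = demo_min(bytes,
				bv[it->bi_idx].bv_len - it->bi_bvec_done);

		bytes -= len;
		it->bi_size -= len;
		it->bi_bvec_done += len;
		if (it->bi_bvec_done == bv[it->bi_idx].bv_len) {
			it->bi_bvec_done = 0;
			it->bi_idx++;
		}
	}
}

/*
 * Carve @bytes off the front of @it. The returned iterator and the
 * remainder left in @it both refer to the same raw array.
 * Assumes bytes <= it->bi_size.
 */
static struct demo_iter demo_split_front(const struct demo_bvec *bv,
					 struct demo_iter *it,
					 unsigned int bytes)
{
	struct demo_iter front = *it;

	front.bi_size = bytes;			/* may end mid-bvec */
	demo_iter_advance(bv, it, bytes);	/* may start mid-bvec */
	return front;
}

/* Count the segments an iterator visits, stopping on bi_size alone. */
static unsigned int demo_count_segments(const struct demo_bvec *bv,
					struct demo_iter it)
{
	unsigned int n = 0;

	while (it.bi_size) {
		struct demo_bvec v = demo_iter_bvec(bv, it);

		demo_iter_advance(bv, &it, v.bv_len);
		n++;
	}
	return n;
}
```

Splitting a 24 byte range over three 8 byte bvecs at byte 10 gives two
iterators that each visit two segments, so neither count matches the three
entries in the shared array.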