aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorKent Overstreet <kmo@daterainc.com>2013-08-07 17:26:21 -0400
committerKent Overstreet <kmo@daterainc.com>2013-11-24 01:33:49 -0500
commit4550dd6c6b062fc5e5b647296d55da22616123c3 (patch)
tree9ca83cc0283151980b7c2f18c0e928aa34a2a0d9
parent7988613b0e5b2638caf6cd493cc78e9595eba19c (diff)
block: Immutable bio vecs
This adds a mechanism by which we can advance a bio by an arbitrary number of bytes without modifying the biovec: bio->bi_iter.bi_bvec_done indicates the number of bytes completed in the current bvec. Various driver code still needs to be updated to not refer to the bvec directly before we can use this for interesting things, like efficient bio splitting. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: drbd-user@lists.linbit.com Cc: nbd-general@lists.sourceforge.net
-rw-r--r--Documentation/block/biovecs.txt111
-rw-r--r--drivers/block/drbd/drbd_main.c4
-rw-r--r--drivers/block/nbd.c2
-rw-r--r--fs/bio.c27
-rw-r--r--include/linux/bio.h81
-rw-r--r--include/linux/blk_types.h3
-rw-r--r--include/linux/blkdev.h4
7 files changed, 194 insertions, 38 deletions
diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
new file mode 100644
index 000000000000..74a32ad52f53
--- /dev/null
+++ b/Documentation/block/biovecs.txt
@@ -0,0 +1,111 @@
1
2Immutable biovecs and biovec iterators:
3=======================================
4
5Kent Overstreet <kmo@daterainc.com>
6
7As of 3.13, biovecs should never be modified after a bio has been submitted.
8Instead, we have a new struct bvec_iter which represents a range of a biovec -
9the iterator will be modified as the bio is completed, not the biovec.
10
11More specifically, old code that needed to partially complete a bio would
12update bi_sector and bi_size, and advance bi_idx to the next biovec. If it
13ended up partway through a biovec, it would increment bv_offset and decrement
14bv_len by the number of bytes completed in that biovec.
15
16In the new scheme of things, everything that must be mutated in order to
17partially complete a bio is segregated into struct bvec_iter: bi_sector,
18bi_size and bi_idx have been moved there; and instead of modifying bv_offset
19and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of
20bytes completed in the current bvec.
21
22There are a bunch of new helper macros for hiding the gory details - in
23particular, presenting the illusion of partially completed biovecs so that
24normal code doesn't have to deal with bi_bvec_done.
25
26 * Driver code should no longer refer to biovecs directly; we now have
27 bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
28 constructed from the raw biovecs but taking into account bi_bvec_done and
29 bi_size.
30
31 bio_for_each_segment() has been updated to take a bvec_iter argument
32 instead of an integer (that corresponded to bi_idx); for a lot of code the
33 conversion just required changing the types of the arguments to
34 bio_for_each_segment().
35
36 * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
37 wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
38 advances the bio integrity's iter if present.
39
40 There is a lower level advance function - bvec_iter_advance() - which takes
41 a pointer to a biovec, not a bio; this is used by the bio integrity code.
42
43What's all this get us?
44=======================
45
46Having a real iterator, and making biovecs immutable, has a number of
47advantages:
48
49 * Before, iterating over bios was very awkward when you weren't processing
50 exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c,
51 which copies the contents of one bio into another. Because the biovecs
52 wouldn't necessarily be the same size, the old code was tricky convoluted -
53 it had to walk two different bios at the same time, keeping both bi_idx and
54 and offset into the current biovec for each.
55
56 The new code is much more straightforward - have a look. This sort of
57 pattern comes up in a lot of places; a lot of drivers were essentially open
58 coding bvec iterators before, and having common implementation considerably
59 simplifies a lot of code.
60
61 * Before, any code that might need to use the biovec after the bio had been
62 completed (perhaps to copy the data somewhere else, or perhaps to resubmit
63 it somewhere else if there was an error) had to save the entire bvec array
64 - again, this was being done in a fair number of places.
65
66 * Biovecs can be shared between multiple bios - a bvec iter can represent an
67 arbitrary range of an existing biovec, both starting and ending midway
68 through biovecs. This is what enables efficient splitting of arbitrary
69 bios. Note that this means we _only_ use bi_size to determine when we've
70 reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
71 bi_size into account when constructing biovecs.
72
73 * Splitting bios is now much simpler. The old bio_split() didn't even work on
74 bios with more than a single bvec! Now, we can efficiently split arbitrary
75 size bios - because the new bio can share the old bio's biovec.
76
77 Care must be taken to ensure the biovec isn't freed while the split bio is
78 still using it, in case the original bio completes first, though. Using
79 bio_chain() when splitting bios helps with this.
80
81 * Submitting partially completed bios is now perfectly fine - this comes up
82 occasionally in stacking block drivers and various code (e.g. md and
83 bcache) had some ugly workarounds for this.
84
85 It used to be the case that submitting a partially completed bio would work
86 fine to _most_ devices, but since accessing the raw bvec array was the
87 norm, not all drivers would respect bi_idx and those would break. Now,
88 since all drivers _must_ go through the bvec iterator - and have been
89 audited to make sure they are - submitting partially completed bios is
90 perfectly fine.
91
92Other implications:
93===================
94
95 * Almost all usage of bi_idx is now incorrect and has been removed; instead,
96 where previously you would have used bi_idx you'd now use a bvec_iter,
97 probably passing it to one of the helper macros.
98
99 I.e. instead of using bio_iovec_idx() (or bio->bi_iovec[bio->bi_idx]), you
100 now use bio_iter_iovec(), which takes a bvec_iter and returns a
101 literal struct bio_vec - constructed on the fly from the raw biovec but
102 taking into account bi_bvec_done (and bi_size).
103
104 * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that
105 doesn't actually own the bio. The reason is twofold: firstly, it's not
106 actually needed for iterating over the bio anymore - we only use bi_size.
107 Secondly, when cloning a bio and reusing (a portion of) the original bio's
108 biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate
109 over all the biovecs in the new bio - which is silly as it's not needed.
110
111 So, don't use bi_vcnt anymore.
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index f4e5440aba05..929468e1512a 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1546,7 +1546,7 @@ static int _drbd_send_bio(struct drbd_conf *mdev, struct bio *bio)
1546 1546
1547 err = _drbd_no_send_page(mdev, bvec.bv_page, 1547 err = _drbd_no_send_page(mdev, bvec.bv_page,
1548 bvec.bv_offset, bvec.bv_len, 1548 bvec.bv_offset, bvec.bv_len,
1549 bio_iter_last(bio, iter) 1549 bio_iter_last(bvec, iter)
1550 ? 0 : MSG_MORE); 1550 ? 0 : MSG_MORE);
1551 if (err) 1551 if (err)
1552 return err; 1552 return err;
@@ -1565,7 +1565,7 @@ static int _drbd_send_zc_bio(struct drbd_conf *mdev, struct bio *bio)
1565 1565
1566 err = _drbd_send_page(mdev, bvec.bv_page, 1566 err = _drbd_send_page(mdev, bvec.bv_page,
1567 bvec.bv_offset, bvec.bv_len, 1567 bvec.bv_offset, bvec.bv_len,
1568 bio_iter_last(bio, iter) ? 0 : MSG_MORE); 1568 bio_iter_last(bvec, iter) ? 0 : MSG_MORE);
1569 if (err) 1569 if (err)
1570 return err; 1570 return err;
1571 } 1571 }
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index aa362f493216..55298db36b2d 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -278,7 +278,7 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
278 */ 278 */
279 rq_for_each_segment(bvec, req, iter) { 279 rq_for_each_segment(bvec, req, iter) {
280 flags = 0; 280 flags = 0;
281 if (!rq_iter_last(req, iter)) 281 if (!rq_iter_last(bvec, iter))
282 flags = MSG_MORE; 282 flags = MSG_MORE;
283 dprintk(DBG_TX, "%s: request %p: sending %d bytes data\n", 283 dprintk(DBG_TX, "%s: request %p: sending %d bytes data\n",
284 nbd->disk->disk_name, req, bvec.bv_len); 284 nbd->disk->disk_name, req, bvec.bv_len);
diff --git a/fs/bio.c b/fs/bio.c
index 8b7f14a95503..07b4b7afa695 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -532,13 +532,11 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
532 * most users will be overriding ->bi_bdev with a new target, 532 * most users will be overriding ->bi_bdev with a new target,
533 * so we don't set nor calculate new physical/hw segment counts here 533 * so we don't set nor calculate new physical/hw segment counts here
534 */ 534 */
535 bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
536 bio->bi_bdev = bio_src->bi_bdev; 535 bio->bi_bdev = bio_src->bi_bdev;
537 bio->bi_flags |= 1 << BIO_CLONED; 536 bio->bi_flags |= 1 << BIO_CLONED;
538 bio->bi_rw = bio_src->bi_rw; 537 bio->bi_rw = bio_src->bi_rw;
539 bio->bi_vcnt = bio_src->bi_vcnt; 538 bio->bi_vcnt = bio_src->bi_vcnt;
540 bio->bi_iter.bi_size = bio_src->bi_iter.bi_size; 539 bio->bi_iter = bio_src->bi_iter;
541 bio->bi_iter.bi_idx = bio_src->bi_iter.bi_idx;
542} 540}
543EXPORT_SYMBOL(__bio_clone); 541EXPORT_SYMBOL(__bio_clone);
544 542
@@ -808,28 +806,7 @@ void bio_advance(struct bio *bio, unsigned bytes)
808 if (bio_integrity(bio)) 806 if (bio_integrity(bio))
809 bio_integrity_advance(bio, bytes); 807 bio_integrity_advance(bio, bytes);
810 808
811 bio->bi_iter.bi_sector += bytes >> 9; 809 bio_advance_iter(bio, &bio->bi_iter, bytes);
812 bio->bi_iter.bi_size -= bytes;
813
814 if (bio->bi_rw & BIO_NO_ADVANCE_ITER_MASK)
815 return;
816
817 while (bytes) {
818 if (unlikely(bio->bi_iter.bi_idx >= bio->bi_vcnt)) {
819 WARN_ONCE(1, "bio idx %d >= vcnt %d\n",
820 bio->bi_iter.bi_idx, bio->bi_vcnt);
821 break;
822 }
823
824 if (bytes >= bio_iovec(bio).bv_len) {
825 bytes -= bio_iovec(bio).bv_len;
826 bio->bi_iter.bi_idx++;
827 } else {
828 bio_iovec(bio).bv_len -= bytes;
829 bio_iovec(bio).bv_offset += bytes;
830 bytes = 0;
831 }
832 }
833} 810}
834EXPORT_SYMBOL(bio_advance); 811EXPORT_SYMBOL(bio_advance);
835 812
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c16adb5f69f8..04e592e74c92 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -64,11 +64,38 @@
64#define bio_iovec_idx(bio, idx) (&((bio)->bi_io_vec[(idx)])) 64#define bio_iovec_idx(bio, idx) (&((bio)->bi_io_vec[(idx)]))
65#define __bio_iovec(bio) bio_iovec_idx((bio), (bio)->bi_iter.bi_idx) 65#define __bio_iovec(bio) bio_iovec_idx((bio), (bio)->bi_iter.bi_idx)
66 66
67#define bio_iter_iovec(bio, iter) ((bio)->bi_io_vec[(iter).bi_idx]) 67#define __bvec_iter_bvec(bvec, iter) (&(bvec)[(iter).bi_idx])
68 68
69#define bio_page(bio) (bio_iovec((bio)).bv_page) 69#define bvec_iter_page(bvec, iter) \
70#define bio_offset(bio) (bio_iovec((bio)).bv_offset) 70 (__bvec_iter_bvec((bvec), (iter))->bv_page)
71#define bio_iovec(bio) (*__bio_iovec(bio)) 71
72#define bvec_iter_len(bvec, iter) \
73 min((iter).bi_size, \
74 __bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
75
76#define bvec_iter_offset(bvec, iter) \
77 (__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
78
79#define bvec_iter_bvec(bvec, iter) \
80((struct bio_vec) { \
81 .bv_page = bvec_iter_page((bvec), (iter)), \
82 .bv_len = bvec_iter_len((bvec), (iter)), \
83 .bv_offset = bvec_iter_offset((bvec), (iter)), \
84})
85
86#define bio_iter_iovec(bio, iter) \
87 bvec_iter_bvec((bio)->bi_io_vec, (iter))
88
89#define bio_iter_page(bio, iter) \
90 bvec_iter_page((bio)->bi_io_vec, (iter))
91#define bio_iter_len(bio, iter) \
92 bvec_iter_len((bio)->bi_io_vec, (iter))
93#define bio_iter_offset(bio, iter) \
94 bvec_iter_offset((bio)->bi_io_vec, (iter))
95
96#define bio_page(bio) bio_iter_page((bio), (bio)->bi_iter)
97#define bio_offset(bio) bio_iter_offset((bio), (bio)->bi_iter)
98#define bio_iovec(bio) bio_iter_iovec((bio), (bio)->bi_iter)
72 99
73#define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_iter.bi_idx) 100#define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_iter.bi_idx)
74#define bio_sectors(bio) ((bio)->bi_iter.bi_size >> 9) 101#define bio_sectors(bio) ((bio)->bi_iter.bi_size >> 9)
@@ -145,16 +172,54 @@ static inline void *bio_data(struct bio *bio)
145 bvl = bio_iovec_idx((bio), (i)), i < (bio)->bi_vcnt; \ 172 bvl = bio_iovec_idx((bio), (i)), i < (bio)->bi_vcnt; \
146 i++) 173 i++)
147 174
175static inline void bvec_iter_advance(struct bio_vec *bv, struct bvec_iter *iter,
176 unsigned bytes)
177{
178 WARN_ONCE(bytes > iter->bi_size,
179 "Attempted to advance past end of bvec iter\n");
180
181 while (bytes) {
182 unsigned len = min(bytes, bvec_iter_len(bv, *iter));
183
184 bytes -= len;
185 iter->bi_size -= len;
186 iter->bi_bvec_done += len;
187
188 if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
189 iter->bi_bvec_done = 0;
190 iter->bi_idx++;
191 }
192 }
193}
194
195#define for_each_bvec(bvl, bio_vec, iter, start) \
196 for ((iter) = start; \
197 (bvl) = bvec_iter_bvec((bio_vec), (iter)), \
198 (iter).bi_size; \
199 bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
200
201
202static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
203 unsigned bytes)
204{
205 iter->bi_sector += bytes >> 9;
206
207 if (bio->bi_rw & BIO_NO_ADVANCE_ITER_MASK)
208 iter->bi_size -= bytes;
209 else
210 bvec_iter_advance(bio->bi_io_vec, iter, bytes);
211}
212
148#define __bio_for_each_segment(bvl, bio, iter, start) \ 213#define __bio_for_each_segment(bvl, bio, iter, start) \
149 for (iter = (start); \ 214 for (iter = (start); \
150 bvl = bio_iter_iovec((bio), (iter)), \ 215 (iter).bi_size && \
151 (iter).bi_idx < (bio)->bi_vcnt; \ 216 ((bvl = bio_iter_iovec((bio), (iter))), 1); \
152 (iter).bi_idx++) 217 bio_advance_iter((bio), &(iter), (bvl).bv_len))
153 218
154#define bio_for_each_segment(bvl, bio, iter) \ 219#define bio_for_each_segment(bvl, bio, iter) \
155 __bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter) 220 __bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
156 221
157#define bio_iter_last(bio, iter) ((iter).bi_idx == (bio)->bi_vcnt - 1) 222#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
158 223
159/* 224/*
160 * get a reference to a bio, so it won't disappear. the intended use is 225 * get a reference to a bio, so it won't disappear. the intended use is
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 29b5b84d8a29..d369f8f6af79 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -34,6 +34,9 @@ struct bvec_iter {
34 unsigned int bi_size; /* residual I/O count */ 34 unsigned int bi_size; /* residual I/O count */
35 35
36 unsigned int bi_idx; /* current index into bvl_vec */ 36 unsigned int bi_idx; /* current index into bvl_vec */
37
38 unsigned int bi_bvec_done; /* number of bytes completed in
39 current bvec */
37}; 40};
38 41
39/* 42/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 337b92a54658..02cb6f0ea71d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -750,9 +750,9 @@ struct req_iterator {
750 __rq_for_each_bio(_iter.bio, _rq) \ 750 __rq_for_each_bio(_iter.bio, _rq) \
751 bio_for_each_segment(bvl, _iter.bio, _iter.iter) 751 bio_for_each_segment(bvl, _iter.bio, _iter.iter)
752 752
753#define rq_iter_last(rq, _iter) \ 753#define rq_iter_last(bvec, _iter) \
754 (_iter.bio->bi_next == NULL && \ 754 (_iter.bio->bi_next == NULL && \
755 bio_iter_last(_iter.bio, _iter.iter)) 755 bio_iter_last(bvec, _iter.iter))
756 756
757#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 757#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
758# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform" 758# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform"