author Kiyoshi Ueda <k-ueda@ct.jp.nec.com> 2006-12-08 05:41:09 -0500
committer Linus Torvalds <torvalds@woody.osdl.org> 2006-12-08 11:29:09 -0500
commit 2e93ccc1933d08d32d9bde3784c3823e67b9b030
tree 1e8ad6a6444fc0a568e35f19628c89cdef9ad512
parent 81fdb096dbcedcc3b94c7e47b59362b5214891e2
[PATCH] dm: suspend: add noflush pushback
In device-mapper, I/O is sometimes queued within targets for later
processing. For example, the multipath target can be configured to store
I/O when no paths are available instead of returning it with -EIO.

This patch allows the device-mapper core to instruct a target to transfer
the contents of any such in-target queue back into the core. This frees up
the resources used by the target so the core can replace that target with
an alternative one and then resend the I/O to it. Without this patch the
only way to change the target in such circumstances involves returning the
I/O with an error back to the filesystem/application. In the multipath
case, this patch will let us add new paths for existing I/O to try after
all the existing paths have failed.

DMF_NOFLUSH_SUSPENDING
----------------------

If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.

The flag must be visible while the target is flushing pending I/Os, so it
is set before presuspend, where the flush starts, and unset after the wait
for md->pending, where the flush ends.

Target drivers can check this flag by calling dm_noflush_suspending().

DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------

A target's map() function can now return DM_MAPIO_REQUEUE to request that
the device-mapper core queue the bio.

Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to
request the same. This has been labelled 'pushback'.

The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.

dec_pending
-----------

dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.

It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
ends the I/O with -EIO if it happened.

pushback list and pushback_lock
-------------------------------

The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.

md->pushback_lock protects md->pushback. The lock should be held with irqs
disabled because dec_pending() can be called from interrupt context.

Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for the DMF_NOFLUSH_SUSPENDING flag, so md->pushback_lock
is held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback, and
the bio would be left in md->pushback.

Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.

The flag can be checked without md->pushback_lock (e.g. in the first part
of dec_pending() or in target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but returned to applications
with -EIO.

Other notes on the current patch
--------------------------------

- md->pushback is added to struct mapped_device instead of using
  md->deferred directly because md->io_lock, which protects md->deferred,
  is an rw_semaphore and can't be used in interrupt context like
  dec_pending(), and md->io_lock protects the DMF_BLOCK_IO flag of
  md->flags too.

- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG ioctl
  option is specified, because I/Os generated by lock_fs() would be pushed
  back and never return if there were no valid devices.

- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING flag
  is set, md->pushback must be flushed because I/Os may be queued to the
  list already. (flush_and_out label in dm_suspend())

Test results
------------

I have tested using the multipath target with the next patch.

The following tests are for regression/compatibility:
  - I/Os succeed when valid paths exist;
  - I/Os fail when there are no valid paths and queue_if_no_path is not
    set;
  - I/Os are queued in the multipath target when there are no valid paths
    and queue_if_no_path is set;
  - The queued I/Os above fail when suspend is issued without the
    DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
    fail.

The following tests are for the normal code path of the new pushback
feature:
  - Queued I/Os in the multipath target are flushed from the target but
    don't return when suspend is issued with the DM_NOFLUSH_FLAG ioctl
    option;
  - The I/Os above are queued in the multipath target again when resume is
    issued without path recovery;
  - The I/Os above succeed when resume is issued after path recovery or
    table load;
  - Queued I/Os in the multipath target succeed when resume is issued with
    the DM_NOFLUSH_FLAG ioctl option after table load. I/Os spanning 2
    multipath targets also succeed.

The following tests are for the error paths of the new pushback feature:
  - When bdget_disk() fails in dm_suspend(), the DMF_NOFLUSH_SUSPENDING
    flag is cleared and I/Os already queued to the pushback list are
    flushed properly.
  - When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
      o I/Os which had already been queued to the pushback list at the
        time don't return, and are re-issued at resume time;
      o I/Os which hadn't been returned at the time return with EIO.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
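
Reviewer note: the dec_pending() rules described in the commit message (pushback supersedes I/O errors while a noflush suspend is in progress, and a pushback request turns into -EIO once the flag has been cleared) can be modelled in userspace. The sketch below is only an illustration under stated assumptions: struct model_md, model_record_error() and model_finish() are hypothetical names, a plain counter stands in for md->pushback, and the place where the kernel would hold md->pushback_lock is marked with comments rather than implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DM_ENDIO_REQUEUE 2	/* same value the patch adds to dm.h */

/* Hypothetical stand-in for the parts of struct mapped_device used here. */
struct model_md {
	bool noflush_suspending;	/* models DMF_NOFLUSH_SUSPENDING */
	int pushback_len;		/* models the length of md->pushback */
};

enum model_outcome { MODEL_COMPLETED, MODEL_PUSHED_BACK };

/* Models the per-clone part of dec_pending(): record the clone's result. */
void model_record_error(struct model_md *md, int *io_error, int error)
{
	/* A saved pushback request (> 0) is kept while noflush suspending. */
	if (error && !(*io_error > 0 && md->noflush_suspending))
		*io_error = error;
}

/* Models the part of dec_pending() that runs once the last clone ended. */
enum model_outcome model_finish(struct model_md *md, int *io_error)
{
	if (*io_error == DM_ENDIO_REQUEUE) {
		/* kernel: spin_lock_irqsave(&md->pushback_lock, flags) */
		if (md->noflush_suspending)
			md->pushback_len++;	/* bio goes on md->pushback */
		else
			*io_error = -EIO;	/* suspend was interrupted */
		/* kernel: spin_unlock_irqrestore(&md->pushback_lock, flags) */
	}

	return *io_error == DM_ENDIO_REQUEUE ? MODEL_PUSHED_BACK
					     : MODEL_COMPLETED;
}
```

In this model, a requeue followed by an -EIO from another clone still ends as MODEL_PUSHED_BACK while the flag is set, whereas the same requeue with the flag cleared completes with -EIO, which matches the two outcomes listed under "Test results" for an interrupted suspend.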
-rw-r--r--  drivers/md/dm-bio-list.h        14
-rw-r--r--  drivers/md/dm.c                105
-rw-r--r--  drivers/md/dm.h                  2
-rw-r--r--  include/linux/device-mapper.h    3
4 files changed, 114 insertions, 10 deletions
diff --git a/drivers/md/dm-bio-list.h b/drivers/md/dm-bio-list.h
index bbf4615f0e30..da4349649f7f 100644
--- a/drivers/md/dm-bio-list.h
+++ b/drivers/md/dm-bio-list.h
@@ -44,6 +44,20 @@ static inline void bio_list_merge(struct bio_list *bl, struct bio_list *bl2)
 	bl->tail = bl2->tail;
 }
 
+static inline void bio_list_merge_head(struct bio_list *bl,
+				       struct bio_list *bl2)
+{
+	if (!bl2->head)
+		return;
+
+	if (bl->head)
+		bl2->tail->bi_next = bl->head;
+	else
+		bl->tail = bl2->tail;
+
+	bl->head = bl2->head;
+}
+
 static inline struct bio *bio_list_pop(struct bio_list *bl)
 {
 	struct bio *bio = bl->head;
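
Reviewer note: bio_list_merge_head() prepends its second list to the first, which is what lets dm_suspend() put pushed-back bios ahead of bios that were already deferred. A minimal userspace sketch of that ordering follows. The struct definitions are cut-down stand-ins for the kernel types, the id field and merge_demo() are hypothetical additions for illustration, and bio_list_add() is reproduced from memory of the same header, so treat it as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types; only the fields used here. */
struct bio {
	struct bio *bi_next;
	int id;			/* hypothetical tag, not a kernel field */
};

struct bio_list {
	struct bio *head;
	struct bio *tail;
};

void bio_list_init(struct bio_list *bl)
{
	bl->head = bl->tail = NULL;
}

/* Append one bio to the tail of the list. */
void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->bi_next = NULL;
	if (bl->tail)
		bl->tail->bi_next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

/* Same logic as the helper in the patch: put all of bl2 in front of bl. */
void bio_list_merge_head(struct bio_list *bl, struct bio_list *bl2)
{
	if (!bl2->head)
		return;

	if (bl->head)
		bl2->tail->bi_next = bl->head;
	else
		bl->tail = bl2->tail;

	bl->head = bl2->head;
}

/*
 * Build deferred = {3, 4} and pushback = {1, 2}, merge pushback onto the
 * head of deferred, and copy the resulting ids into out[]. Returns the
 * number of ids written (at most max).
 */
int merge_demo(int *out, int max)
{
	struct bio bios[4] = { { NULL, 1 }, { NULL, 2 }, { NULL, 3 }, { NULL, 4 } };
	struct bio_list deferred, pushback;
	struct bio *b;
	int n = 0;

	bio_list_init(&deferred);
	bio_list_init(&pushback);
	bio_list_add(&pushback, &bios[0]);
	bio_list_add(&pushback, &bios[1]);
	bio_list_add(&deferred, &bios[2]);
	bio_list_add(&deferred, &bios[3]);

	bio_list_merge_head(&deferred, &pushback);
	bio_list_init(&pushback);	/* the caller must reset bl2 itself */

	for (b = deferred.head; b && n < max; b = b->bi_next)
		out[n++] = b->id;
	return n;
}
```

Note that the helper does not clear bl2, which is why both call sites in dm_suspend() follow it with bio_list_init(&md->pushback).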
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d8544e1a4c1f..fe7c56e10435 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -68,10 +68,12 @@ union map_info *dm_get_mapinfo(struct bio *bio)
 #define DMF_FROZEN 2
 #define DMF_FREEING 3
 #define DMF_DELETING 4
+#define DMF_NOFLUSH_SUSPENDING 5
 
 struct mapped_device {
 	struct rw_semaphore io_lock;
 	struct semaphore suspend_lock;
+	spinlock_t pushback_lock;
 	rwlock_t map_lock;
 	atomic_t holders;
 	atomic_t open_count;
@@ -90,6 +92,7 @@ struct mapped_device {
 	atomic_t pending;
 	wait_queue_head_t wait;
 	struct bio_list deferred;
+	struct bio_list pushback;
 
 	/*
 	 * The current mapping.
@@ -444,23 +447,50 @@ int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo)
  * you this clearly demarcated crap.
  *---------------------------------------------------------------*/
 
+static int __noflush_suspending(struct mapped_device *md)
+{
+	return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
  */
 static void dec_pending(struct dm_io *io, int error)
 {
-	if (error)
+	unsigned long flags;
+
+	/* Push-back supersedes any I/O errors */
+	if (error && !(io->error > 0 && __noflush_suspending(io->md)))
 		io->error = error;
 
 	if (atomic_dec_and_test(&io->io_count)) {
+		if (io->error == DM_ENDIO_REQUEUE) {
+			/*
+			 * Target requested pushing back the I/O.
+			 * This must be handled before the sleeper on
+			 * suspend queue merges the pushback list.
+			 */
+			spin_lock_irqsave(&io->md->pushback_lock, flags);
+			if (__noflush_suspending(io->md))
+				bio_list_add(&io->md->pushback, io->bio);
+			else
+				/* noflush suspend was interrupted. */
+				io->error = -EIO;
+			spin_unlock_irqrestore(&io->md->pushback_lock, flags);
+		}
+
 		if (end_io_acct(io))
 			/* nudge anyone waiting on suspend queue */
 			wake_up(&io->md->wait);
 
-		blk_add_trace_bio(io->md->queue, io->bio, BLK_TA_COMPLETE);
+		if (io->error != DM_ENDIO_REQUEUE) {
+			blk_add_trace_bio(io->md->queue, io->bio,
+					  BLK_TA_COMPLETE);
+
+			bio_endio(io->bio, io->bio->bi_size, io->error);
+		}
 
-		bio_endio(io->bio, io->bio->bi_size, io->error);
 		free_io(io->md, io);
 	}
 }
@@ -480,7 +510,11 @@ static int clone_endio(struct bio *bio, unsigned int done, int error)
 
 	if (endio) {
 		r = endio(tio->ti, bio, error, &tio->info);
-		if (r < 0)
+		if (r < 0 || r == DM_ENDIO_REQUEUE)
+			/*
+			 * error and requeue request are handled
+			 * in dec_pending().
+			 */
 			error = r;
 		else if (r == DM_ENDIO_INCOMPLETE)
 			/* The target will handle the io */
@@ -554,8 +588,8 @@ static void __map_bio(struct dm_target *ti, struct bio *clone,
 				    clone->bi_sector);
 
 		generic_make_request(clone);
-	} else if (r < 0) {
-		/* error the io and bail out */
+	} else if (r < 0 || r == DM_MAPIO_REQUEUE) {
+		/* error the io and bail out, or requeue it if needed */
 		md = tio->io->md;
 		dec_pending(tio->io, r);
 		/*
@@ -952,6 +986,7 @@ static struct mapped_device *alloc_dev(int minor)
 	memset(md, 0, sizeof(*md));
 	init_rwsem(&md->io_lock);
 	init_MUTEX(&md->suspend_lock);
+	spin_lock_init(&md->pushback_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1282,10 +1317,12 @@ static void unlock_fs(struct mapped_device *md)
 int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 {
 	struct dm_table *map = NULL;
+	unsigned long flags;
 	DECLARE_WAITQUEUE(wait, current);
 	struct bio *def;
 	int r = -EINVAL;
 	int do_lockfs = suspend_flags & DM_SUSPEND_LOCKFS_FLAG ? 1 : 0;
+	int noflush = suspend_flags & DM_SUSPEND_NOFLUSH_FLAG ? 1 : 0;
 
 	down(&md->suspend_lock);
 
@@ -1294,6 +1331,13 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 
 	map = dm_get_table(md);
 
+	/*
+	 * DMF_NOFLUSH_SUSPENDING must be set before presuspend.
+	 * This flag is cleared before dm_suspend returns.
+	 */
+	if (noflush)
+		set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
+
 	/* This does not get reverted if there's an error later. */
 	dm_table_presuspend_targets(map);
 
@@ -1301,11 +1345,14 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	if (!md->suspended_bdev) {
 		DMWARN("bdget failed in dm_suspend");
 		r = -ENOMEM;
-		goto out;
+		goto flush_and_out;
 	}
 
-	/* Flush I/O to the device. */
-	if (do_lockfs) {
+	/*
+	 * Flush I/O to the device.
+	 * noflush supersedes do_lockfs, because lock_fs() needs to flush I/Os.
+	 */
+	if (do_lockfs && !noflush) {
 		r = lock_fs(md);
 		if (r)
 			goto out;
@@ -1341,6 +1388,14 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	down_write(&md->io_lock);
 	remove_wait_queue(&md->wait, &wait);
 
+	if (noflush) {
+		spin_lock_irqsave(&md->pushback_lock, flags);
+		clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
+		bio_list_merge_head(&md->deferred, &md->pushback);
+		bio_list_init(&md->pushback);
+		spin_unlock_irqrestore(&md->pushback_lock, flags);
+	}
+
 	/* were we interrupted ? */
 	r = -EINTR;
 	if (atomic_read(&md->pending)) {
@@ -1349,7 +1404,7 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 		__flush_deferred_io(md, def);
 		up_write(&md->io_lock);
 		unlock_fs(md);
-		goto out;
+		goto out; /* pushback list is already flushed, so skip flush */
 	}
 	up_write(&md->io_lock);
 
@@ -1359,6 +1414,25 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 
 	r = 0;
 
+flush_and_out:
+	if (r && noflush) {
+		/*
+		 * Because there may be already I/Os in the pushback list,
+		 * flush them before return.
+		 */
+		down_write(&md->io_lock);
+
+		spin_lock_irqsave(&md->pushback_lock, flags);
+		clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
+		bio_list_merge_head(&md->deferred, &md->pushback);
+		bio_list_init(&md->pushback);
+		spin_unlock_irqrestore(&md->pushback_lock, flags);
+
+		def = bio_list_get(&md->deferred);
+		__flush_deferred_io(md, def);
+		up_write(&md->io_lock);
+	}
+
 out:
 	if (r && md->suspended_bdev) {
 		bdput(md->suspended_bdev);
@@ -1445,6 +1519,17 @@ int dm_suspended(struct mapped_device *md)
 	return test_bit(DMF_SUSPENDED, &md->flags);
 }
 
+int dm_noflush_suspending(struct dm_target *ti)
+{
+	struct mapped_device *md = dm_table_get_md(ti->table);
+	int r = __noflush_suspending(md);
+
+	dm_put(md);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(dm_noflush_suspending);
+
 static struct block_device_operations dm_blk_dops = {
 	.open = dm_blk_open,
 	.release = dm_blk_close,
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index c307ca9a4c33..2f796b1436b2 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -36,12 +36,14 @@
  * Definitions of return values from target end_io function.
  */
 #define DM_ENDIO_INCOMPLETE	1
+#define DM_ENDIO_REQUEUE	2
 
 /*
  * Definitions of return values from target map function.
  */
 #define DM_MAPIO_SUBMITTED	0
 #define DM_MAPIO_REMAPPED	1
+#define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
 
 /*
  * Suspend feature flags
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 2e5c42346c38..499f5373e213 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -40,6 +40,7 @@ typedef void (*dm_dtr_fn) (struct dm_target *ti);
  * < 0: error
  * = 0: The target will handle the io by resubmitting it later
  * = 1: simple remap complete
+ * = 2: The target wants to push back the io
  */
 typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio,
 			  union map_info *map_context);
@@ -50,6 +51,7 @@ typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio,
  * 0 : ended successfully
  * 1 : for some reason the io has still not completed (eg,
  *     multipath target might want to requeue a failed io).
+ * 2 : The target wants to push back the io
  */
 typedef int (*dm_endio_fn) (struct dm_target *ti,
 			    struct bio *bio, int error,
@@ -188,6 +190,7 @@ int dm_wait_event(struct mapped_device *md, int event_nr);
 const char *dm_device_name(struct mapped_device *md);
 struct gendisk *dm_disk(struct mapped_device *md);
 int dm_suspended(struct mapped_device *md);
+int dm_noflush_suspending(struct dm_target *ti);
 
 /*
  * Geometry functions.
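
Reviewer note: this patch only adds the core side; the multipath user arrives in the next patch, which is not shown here. As a rough illustration of how a target's map() function might choose between the four return values, here is a userspace sketch. Everything in it is hypothetical: struct toy_target and toy_map() are invented for this note, the noflush_suspending field stands in for a call to dm_noflush_suspending(ti), and the decision order is an assumption chosen to be consistent with the test results in the commit message, not a copy of the multipath code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Return values as defined in dm.h by this patch. */
#define DM_MAPIO_SUBMITTED	0
#define DM_MAPIO_REMAPPED	1
#define DM_MAPIO_REQUEUE	2

/* Hypothetical target state; not a kernel structure. */
struct toy_target {
	bool have_path;			/* at least one usable path */
	bool queue_if_no_path;		/* configured to hold I/O in-target */
	bool noflush_suspending;	/* stand-in for dm_noflush_suspending(ti) */
	int queued;			/* bios currently held in the target */
};

/* Decide what a map() call would return for one incoming bio. */
int toy_map(struct toy_target *t)
{
	if (t->have_path)
		return DM_MAPIO_REMAPPED;	/* core submits the clone */

	if (!t->queue_if_no_path)
		return -EIO;			/* fail the I/O as before */

	if (t->noflush_suspending)
		return DM_MAPIO_REQUEUE;	/* hand the bio back to the core */

	t->queued++;				/* hold it inside the target */
	return DM_MAPIO_SUBMITTED;
}
```

With no path and queue_if_no_path set, the sketch queues in-target during normal operation and returns DM_MAPIO_REQUEUE during a noflush suspend, so the core rather than the target ends up holding the bio.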