md: the md RAID10 resync thread could cause a md RAID10 array deadlock

This message describes another md RAID10 issue, found by testing the 2.6.24 md RAID10 code with the new scsi fault injection framework.

Abstract:

When a scsi error results in a disk being disabled during RAID10 recovery, the resync thread of md RAID10 can stall. In this case the raid array is already broken, so the stall may not matter much, but I think a stall is still undesirable: once it occurs, even shutdown or reboot fails because the resource is busy.

The deadlock mechanism:

The r10bio_s structure has a "remaining" member that keeps track of the BIOs still to be handled during recovery. The "remaining" counter is incremented when a BIO is built in sync_request() and decremented when a BIO finishes in end_sync_write().

If building a BIO fails for some reason in sync_request(), "remaining" should be decremented if it has already been incremented. I found a case where this decrement is forgotten. This causes md_do_sync() to deadlock: md_do_sync() waits for md_done_sync() to be called by end_sync_write(), but end_sync_write() never calls md_done_sync() because of the "remaining" counter mismatch.

For example, the problem can be reproduced in the following case:

    Personalities : [raid10]
    md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
          3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
          [>....................]  recovery =  2.2% (45376/1959808) finish=0.7min speed=45376K/sec

In this case, sdf1 is recovering, and sdb1 and sde1 are disabled. An additional error that detaches sdd will cause a deadlock:

    md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
          3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
          [=>...................]  recovery =  5.0% (99520/1959808) finish=5.9min speed=5237K/sec

     2739 ?        S<     0:17 [md0_raid10]
    28608 ?        D<     0:00 [md0_resync]
    28629 pts/1    Ss     0:00 bash
    28830 pts/1    R+     0:00 ps ax
    31819 ?        D<     0:00 [kjournald]

The resync thread appears to keep working, but it is actually deadlocked.

Patch:

With this patch, the "remaining" counter is decremented when needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: K.Tanaka <k-tanaka@ce.jp.nec.com> 2008-03-04 17:29:37 -0500
committer: Linus Torvalds <torvalds@woody.linux-foundation.org> 2008-03-04 19:35:18 -0500
commit: a07e6ab41be179cf1ed728a4f41368435508b550 (patch)
tree: 10773de394ab861259468372099c90b3f8671292
parent: 1c830532f6b44d10a1743ccd00e990c6b83396f5 (diff)
1 files changed, 2 insertions, 0 deletions
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 8e5671d2f3d3..32389d2f18fc 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1818,6 +1818,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
                                if (j == conf->copies) {
                                        /* Cannot recover, so abort the recovery */
                                        put_buf(r10_bio);
+                                        if (rb2)
+                                                atomic_dec(&rb2->remaining);
                                        r10_bio = rb2;
                                        if (!test_and_set_bit(MD_RECOVERY_ERR, &mddev->recovery))
                                                printk(KERN_INFO "raid10: %s: insufficient working devices for recovery.\n",
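
The counter mismatch described in the commit message can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `struct sync_req`, `complete_one()` and `run_sync()` are hypothetical names, the `done` flag stands in for the point where the kernel would call md_done_sync(), and C11 atomics replace the kernel's `atomic_t` helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for r10bio_s: only the counter and a completion flag. */
struct sync_req {
    atomic_int remaining;  /* BIOs still to be handled */
    bool done;             /* set where the kernel would call md_done_sync() */
};

/* Models end_sync_write(): drop one count, signal completion when it hits zero. */
static void complete_one(struct sync_req *r)
{
    /* atomic_fetch_sub() returns the value held before the subtraction. */
    if (atomic_fetch_sub(&r->remaining, 1) == 1)
        r->done = true;
}

/*
 * Models the failing path in sync_request(): the counter is incremented
 * for a BIO that is then never submitted. Returns whether completion
 * was ever signalled.
 */
bool run_sync(bool fix_applied)
{
    struct sync_req r = { .done = false };

    atomic_init(&r.remaining, 1);        /* one BIO is already in flight */

    atomic_fetch_add(&r.remaining, 1);   /* count taken for a second BIO */
    /* ... building the second BIO fails here, so it is never submitted ... */
    if (fix_applied)
        atomic_fetch_sub(&r.remaining, 1);  /* the decrement the patch adds */

    complete_one(&r);                    /* the in-flight BIO finishes */
    return r.done;
}
```

In this model, `run_sync(true)` returns true because the counter reaches zero when the in-flight BIO completes, while `run_sync(false)` returns false: the counter is left at 1, completion is never signalled, and a waiter such as md_do_sync() would block indefinitely.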