[PATCH] drivers/md/dm-raid1.c: Fix inconsistent mirroring after interrupted recovery

dm-mirror has potential data corruption problem: while on-disk log shows that all disk contents are in-sync, actual contents of the disks are not synchronized. This problem occurs if initial recovery (synching) is interrupted and resumed. Attached patch fixes this problem. Background: rh_dec() changes the region state from RH_NOSYNC (out-of-sync) to RH_CLEAN (in-sync), which results in the corresponding bit of clean_bits being set. This is harmful if on-disk log is used and the map is removed/suspended before the initial sync is completed. The clean_bits is written down to the on-disk log at the map removal, and, upon resume, it's read and copied to sync_bits. Since the recovery process refers to the sync_bits to find a region to be recovered, the region whose state was changed from RH_NOSYNC to RH_CLEAN is no longer recovered. If you haven't applied dm-raid1-read-balancing.patch proposed in dm-devel sometimes ago, the contents of the mirrored disk just corrupt silently. If you have, balanced read may get bogus data from out-of-sync disks. The patch keeps RH_NOSYNC state unchanged. It will be changed to RH_RECOVERING when recovery starts and get reclaimed when the recovery completes. So it doesn't leak the region hash entry. Description: Keep RH_NOSYNC state unchanged when I/O on the region completes. rh_dec() changes the region state from RH_NOSYNC (out-of-sync) to RH_CLEAN (in-sync), which results in the corresponding bit of clean_bits being set. This is harmful if on-disk log is used and the map is removed/suspended before the initial sync is completed. The clean_bits is written down to the on-disk log at the map removal, and, upon resume, it's read and copied to sync_bits. Since the recovery process refers to the sync_bits to find a region to be recovered, the region whose state was changed from RH_NOSYNC to RH_CLEAN is no longer recovered. If you haven't applied dm-raid1-read-balancing.patch proposed in dm-devel sometimes ago, the contents of the mirrored disk just corrupt silently. If you have, balanced read may get bogus data from out-of-sync disks. The RH_NOSYNC region will be changed to RH_RECOVERING when recovery starts on the region and get reclaimed when the recovery completes. So it doesn't leak the region hash entry. Alasdair said: I've analysed the relevant part of the state machine and I believe that the patch is correct. (Further work on this code is still needed - this patch has the side-effect of holding onto memory unnecessarily for long periods of time under certain workloads - but better that than corrupting data.) Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
author: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> 2006-03-27 04:17:47 -0500
committer: Linus Torvalds <torvalds@g5.osdl.org> 2006-03-27 11:44:58 -0500
commit: 930d332a23682202c07df0276dd665a57755b37d (patch)
tree: 0a0202fb581a65a0b63a4a98aa725eb2088a7814 /drivers/md/dm-raid1.c
parent: 76df1c651b66bdf07d60b3d60789feb5f58d73e3 (diff)
1 files changed, 13 insertions, 1 deletions
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index 4e90f231fbfb..3d90f1b37e08 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -402,9 +402,21 @@ static void rh_dec(struct region_hash *rh, region_t region)
        spin_lock_irqsave(&rh->region_lock, flags);
        if (atomic_dec_and_test(&reg->pending)) {
+                /*
+                 * There is no pending I/O for this region.
+                 * We can move the region to corresponding list for next action.
+                 * At this point, the region is not yet connected to any list.
+                 *
+                 * If the state is RH_NOSYNC, the region should be kept off
+                 * from clean list.
+                 * The hash entry for RH_NOSYNC will remain in memory
+                 * until the region is recovered or the map is reloaded.
+                 */
+                /* do nothing for RH_NOSYNC */
                if (reg->state == RH_RECOVERING) {
                        list_add_tail(&reg->list, &rh->quiesced_regions);
-                } else {
+                } else if (reg->state == RH_DIRTY) {
                        reg->state = RH_CLEAN;
                        list_add(&reg->list, &rh->clean_regions);
                }
author	Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2006-03-27 04:17:47 -0500
committer	Linus Torvalds <torvalds@g5.osdl.org>	2006-03-27 11:44:58 -0500
commit	930d332a23682202c07df0276dd665a57755b37d (patch)
tree	0a0202fb581a65a0b63a4a98aa725eb2088a7814 /drivers/md/dm-raid1.c
parent	76df1c651b66bdf07d60b3d60789feb5f58d73e3 (diff)

diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c index 4e90f231fbfb..3d90f1b37e08 100644 --- a/drivers/md/dm-raid1.c +++ b/drivers/md/dm-raid1.c
@@ -402,9 +402,21 @@ static void rh_dec(struct region_hash *rh, region_t region)
402		402
403	spin_lock_irqsave(&rh->region_lock, flags);	403	spin_lock_irqsave(&rh->region_lock, flags);
404	if (atomic_dec_and_test(&reg->pending)) {	404	if (atomic_dec_and_test(&reg->pending)) {
		405	/*
		406	* There is no pending I/O for this region.
		407	* We can move the region to corresponding list for next action.
		408	* At this point, the region is not yet connected to any list.
		409	*
		410	* If the state is RH_NOSYNC, the region should be kept off
		411	* from clean list.
		412	* The hash entry for RH_NOSYNC will remain in memory
		413	* until the region is recovered or the map is reloaded.
		414	*/
		415
		416	/* do nothing for RH_NOSYNC */
405	if (reg->state == RH_RECOVERING) {	417	if (reg->state == RH_RECOVERING) {
406	list_add_tail(&reg->list, &rh->quiesced_regions);	418	list_add_tail(&reg->list, &rh->quiesced_regions);
407	} else {	419	} else if (reg->state == RH_DIRTY) {
408	reg->state = RH_CLEAN;	420	reg->state = RH_CLEAN;
409	list_add(&reg->list, &rh->clean_regions);	421	list_add(&reg->list, &rh->clean_regions);
410	}	422	}