author	Jonathan Brassow <jbrassow@redhat.com>	2013-05-08 19:00:54 -0400
committer	NeilBrown <neilb@suse.de>	2013-06-13 18:10:25 -0400
commit	a4dc163a55964d683f92742705c90c78c0f56c0c (patch)
tree	cbe562b8c39069ba7f0691eb06316ec087c24255 /drivers
parent	f381e71b042af910fbe5f8222792cc5092750993 (diff)
DM RAID: Fix raid_resume not reviving failed devices in all cases
When a device fails in a RAID array, it is marked as Faulty. Later, md_check_recovery is called which (through the call chain) calls 'hot_remove_disk' in order to have the personalities remove the device from use in the array.

Sometimes, it is possible for the array to be suspended before the personalities get their chance to perform 'hot_remove_disk'. This is normally not an issue. If the array is deactivated, then the failed device will be noticed when the array is reinstantiated. If the array is resumed and the disk is still missing, md_check_recovery will be called upon resume and 'hot_remove_disk' will be called at that time. However, (for dm-raid) if the device has been restored, a resume on the array would cause it to attempt to revive the device by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called, a situation is then created where the device is thought to concurrently be the replacement and the device to be replaced. Thus, the device is first sync'ed with the rest of the array (because it is the replacement device) and then marked Faulty and removed from the array (because it is also the device being replaced).

The solution is to check and see if the device had properly been removed before the array was suspended. This is done by seeing whether the device's 'raid_disk' field is -1 - a condition that implies that 'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1) -> hot_remove_disk' has been called. If 'raid_disk' is not -1, then 'hot_remove_disk' must be called to complete the removal of the previously faulty device before it can be revived via 'hot_add_disk'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Diffstat (limited to 'drivers')
-rw-r--r--	drivers/md/dm-raid.c	| 15
1 file changed, 15 insertions, 0 deletions
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 59d15ec0ba81..49f0bd510fb9 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1587,6 +1587,21 @@ static void attempt_restore_of_faulty_devices(struct raid_set *rs)
 			DMINFO("Faulty %s device #%d has readable super block."
 			       " Attempting to revive it.",
 			       rs->raid_type->name, i);
+
+			/*
+			 * Faulty bit may be set, but sometimes the array can
+			 * be suspended before the personalities can respond
+			 * by removing the device from the array (i.e. calling
+			 * 'hot_remove_disk'). If they haven't yet removed
+			 * the failed device, its 'raid_disk' number will be
+			 * '>= 0' - meaning we must call this function
+			 * ourselves.
+			 */
+			if ((r->raid_disk >= 0) &&
+			    (r->mddev->pers->hot_remove_disk(r->mddev, r) != 0))
+				/* Failed to revive this device, try next */
+				continue;
+
 			r->raid_disk = i;
 			r->saved_raid_disk = i;
 			flags = r->flags;
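
For readers following the change outside the kernel tree, below is a minimal user-space sketch of the check the patch adds, assuming a simplified device structure. The names 'raid_disk', 'hot_remove_disk' and 'hot_add_disk' mirror the patch above; struct sketch_rdev, the revive() helper and the stub callbacks are illustrative assumptions, not the actual md/dm-raid code.

/*
 * Minimal user-space illustration of the revive check described above.
 * struct sketch_rdev and the stub callbacks are assumptions made for
 * this sketch and are not the real kernel structures.
 */
#include <stdio.h>

struct sketch_rdev {
	int raid_disk;	/* becomes -1 once hot_remove_disk has completed */
	int (*hot_remove_disk)(struct sketch_rdev *r);
	int (*hot_add_disk)(struct sketch_rdev *r);
};

static int revive(struct sketch_rdev *r, int slot)
{
	/*
	 * If the array was suspended before md_check_recovery could run
	 * hot_remove_disk, raid_disk is still >= 0. Finish the removal
	 * first so the device is not treated as both the replacement and
	 * the device being replaced.
	 */
	if (r->raid_disk >= 0 && r->hot_remove_disk(r) != 0)
		return -1;	/* removal failed, skip reviving this device */

	r->raid_disk = slot;	/* safe to re-add into its slot */
	return r->hot_add_disk(r);
}

/* Stub callbacks standing in for the md personality hooks. */
static int stub_remove(struct sketch_rdev *r) { r->raid_disk = -1; return 0; }
static int stub_add(struct sketch_rdev *r)    { (void)r; return 0; }

int main(void)
{
	/* A device that failed but was never removed before the suspend. */
	struct sketch_rdev r = {
		.raid_disk = 2,
		.hot_remove_disk = stub_remove,
		.hot_add_disk = stub_add,
	};

	printf("revive returned %d, raid_disk is now %d\n",
	       revive(&r, 2), r.raid_disk);
	return 0;
}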