author     Shaohua Li <shli@kernel.org>    2012-07-30 20:03:53 -0400
committer  NeilBrown <neilb@suse.de>       2012-07-30 20:03:53 -0400
commit     9dedf60313fa4dddfd5b9b226a0ef12a512bf9dc (patch)
tree       36e8f400d7c858da776bf74f40e0ca71829ecb05 /drivers/md/raid1.c
parent     be4d3280b17bc51f23ec6ebb345728f302f80a0c (diff)
md/raid1: read balance chooses idlest disk for SSD
An SSD has no spindle, so the distance between requests means nothing, and the original distance-based algorithm can sometimes cause severe performance problems for an SSD RAID. Consider two thread groups, one accessing file A and the other file B. The first group ends up on one disk and the second on the other, because requests within a group are close together while requests between groups are far apart. In this case read balance can keep one disk very busy while the other stays relatively idle. For SSDs we should instead try to distribute requests across as many disks as possible; there is no spindle-move penalty anyway. With the patch below I see more than 50% throughput improvement in some cases, depending on the workload.

The only exception is small requests that can be merged into a big request, which typically drives higher throughput for SSDs too. Such small requests are sequential reads. Unlike on a hard disk, sequential reads that cannot be merged (for example direct IO, or reads without readahead) can be ignored for SSDs; again, there is no spindle-move penalty. Readahead dispatches small requests, and such requests can be merged. The previous patch detects sequential reads well, at least as long as the number of concurrent readers is not greater than the number of RAID disks; in that case the distance-based algorithm does not work well either.

V2: for a RAID mixing hard disks and SSDs, do not use the distance-based algorithm for random IO either. This makes the algorithm generic for any RAID containing an SSD.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
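To make the policy easier to follow before reading the diff, here is a minimal user-space sketch of the selection logic the commit message describes. It is not the kernel code: struct mirror, pick_read_disk() and their fields are hypothetical stand-ins for rdev->nr_pending, mirrors[disk].head_position and blk_queue_nonrot(), and the sequential-read and choose_first special cases are omitted for brevity.

/*
 * Simplified sketch of the read-balance policy (assumed names, not the
 * kernel implementation): if any member is non-rotational, balance on
 * queue depth; otherwise fall back to the classic closest-head choice.
 */
#include <limits.h>
#include <stdint.h>

struct mirror {
	unsigned int pending;   /* in-flight requests on this disk */
	uint64_t head_pos;      /* last known head position, in sectors */
	int nonrot;             /* non-zero if the disk is an SSD */
};

int pick_read_disk(const struct mirror *m, int ndisks, uint64_t sector)
{
	int best_dist_disk = -1, best_pending_disk = -1;
	uint64_t best_dist = UINT64_MAX;
	unsigned int min_pending = UINT_MAX;
	int has_nonrot = 0, disk;

	for (disk = 0; disk < ndisks; disk++) {
		uint64_t dist = sector > m[disk].head_pos ?
			sector - m[disk].head_pos : m[disk].head_pos - sector;

		has_nonrot |= m[disk].nonrot;

		/* An idle or perfectly positioned disk wins immediately. */
		if (m[disk].pending == 0 || dist == 0)
			return disk;

		/* Track the least-loaded disk for the SSD case. */
		if (m[disk].pending < min_pending) {
			min_pending = m[disk].pending;
			best_pending_disk = disk;
		}
		/* Track the closest disk for the all-rotational case. */
		if (dist < best_dist) {
			best_dist = dist;
			best_dist_disk = disk;
		}
	}

	return has_nonrot ? best_pending_disk : best_dist_disk;
}

With all-rotational mirrors this degenerates to the old closest-head choice; as soon as one member reports itself non-rotational, the array balances purely on pending-request count, which is what spreads the two thread groups of the example above across both disks.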
Diffstat (limited to 'drivers/md/raid1.c')
-rw-r--r--  drivers/md/raid1.c  34
1 file changed, 31 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fb96c0c2db40..d9869f25aa75 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -497,9 +497,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	const sector_t this_sector = r1_bio->sector;
 	int sectors;
 	int best_good_sectors;
-	int best_disk;
+	int best_disk, best_dist_disk, best_pending_disk;
+	int has_nonrot_disk;
 	int disk;
 	sector_t best_dist;
+	unsigned int min_pending;
 	struct md_rdev *rdev;
 	int choose_first;
 
@@ -512,8 +514,12 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
  retry:
 	sectors = r1_bio->sectors;
 	best_disk = -1;
+	best_dist_disk = -1;
 	best_dist = MaxSector;
+	best_pending_disk = -1;
+	min_pending = UINT_MAX;
 	best_good_sectors = 0;
+	has_nonrot_disk = 0;
 
 	if (conf->mddev->recovery_cp < MaxSector &&
 	    (this_sector + sectors >= conf->next_resync))
@@ -525,6 +531,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 		sector_t dist;
 		sector_t first_bad;
 		int bad_sectors;
+		unsigned int pending;
 
 		rdev = rcu_dereference(conf->mirrors[disk].rdev);
 		if (r1_bio->bios[disk] == IO_BLOCKED
@@ -583,22 +590,43 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 		} else
 			best_good_sectors = sectors;
 
+		has_nonrot_disk |= blk_queue_nonrot(bdev_get_queue(rdev->bdev));
+		pending = atomic_read(&rdev->nr_pending);
 		dist = abs(this_sector - conf->mirrors[disk].head_position);
 		if (choose_first
 		    /* Don't change to another disk for sequential reads */
 		    || conf->mirrors[disk].next_seq_sect == this_sector
 		    || dist == 0
 		    /* If device is idle, use it */
-		    || atomic_read(&rdev->nr_pending) == 0) {
+		    || pending == 0) {
 			best_disk = disk;
 			break;
 		}
+
+		if (min_pending > pending) {
+			min_pending = pending;
+			best_pending_disk = disk;
+		}
+
 		if (dist < best_dist) {
 			best_dist = dist;
-			best_disk = disk;
+			best_dist_disk = disk;
 		}
 	}
 
+	/*
+	 * If all disks are rotational, choose the closest disk. If any disk is
+	 * non-rotational, choose the disk with less pending request even the
+	 * disk is rotational, which might/might not be optimal for raids with
+	 * mixed ratation/non-rotational disks depending on workload.
+	 */
+	if (best_disk == -1) {
+		if (has_nonrot_disk)
+			best_disk = best_pending_disk;
+		else
+			best_disk = best_dist_disk;
+	}
+
 	if (best_disk >= 0) {
 		rdev = rcu_dereference(conf->mirrors[best_disk].rdev);
 		if (!rdev)