author		Paolo Valente <paolo.valente@linaro.org>	2019-01-29 06:06:32 -0500
committer	Jens Axboe <axboe@kernel.dk>	2019-01-31 14:50:23 -0500
commit		530c4cbb3c62f9e42dbf39279fb346f2d2ab4dbb (patch)
tree		aad632d87cc3978218cbad308e0b0bd250025e34 /block/bfq-iosched.c
parent		ac8b0cb415f3aa9162009d39624501d37031533b (diff)
block, bfq: unconditionally plug I/O in asymmetric scenarios
bfq detects the creation of multiple bfq_queues shortly after each
other, namely a burst of queue creations in the terminology used in
the code. If the burst is large, then no queue in the burst is granted
- either I/O-dispatch plugging when the queue remains temporarily idle
  while in service;
- or weight raising, because it causes even longer plugging.

In fact, such a plugging tends to lower throughput, while these bursts
are typically due to applications or services that spawn multiple
processes, to reach a common goal as soon as possible. Examples are a
"git grep" or the booting of a system.

Unfortunately, disabling plugging may cause a loss of service
guarantees in asymmetric scenarios, i.e., if queue weights are
differentiated or if more than one group is active.

This commit addresses this issue by no longer disabling I/O-dispatch
plugging for queues in large bursts.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Diffstat (limited to 'block/bfq-iosched.c')
-rw-r--r--	block/bfq-iosched.c	346
1 file changed, 165 insertions(+), 181 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a6fe60114ade..c1bb5e5fcdc4 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3479,191 +3479,175 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 		bfqd->wr_busy_queues == 0;
 }
 
+/*
+ * There is a case where idling must be performed not for
+ * throughput concerns, but to preserve service guarantees.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired
+ * device-throughput distribution only in a completely
+ * symmetric scenario where:
+ * (i) each of these processes must get the same throughput as
+ * the others;
+ * (ii) the I/O of each process has the same properties, in
+ * terms of locality (sequential or random), direction
+ * (reads or writes), request sizes, greediness
+ * (from I/O-bound to sporadic), and so on.
+ * In fact, in such a scenario, the drive tends to treat
+ * the requests of each of these processes in about the same
+ * way as the requests of the others, and thus to provide
+ * each of these processes with about the same throughput
+ * (which is exactly the desired throughput distribution). In
+ * contrast, in any asymmetric scenario, device idling is
+ * certainly needed to guarantee that bfqq receives its
+ * assigned fraction of the device throughput (see [1] for
+ * details).
+ * The problem is that idling may significantly reduce
+ * throughput with certain combinations of types of I/O and
+ * devices. An important example is sync random I/O, on flash
+ * storage with command queueing. So, unless bfqq falls in the
+ * above cases where idling also boosts throughput, it would
+ * be important to check conditions (i) and (ii) accurately,
+ * so as to avoid idling when not strictly needed for service
+ * guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly
+ * check condition (ii). And, in case there are active groups,
+ * it becomes very difficult to check condition (i) too. In
+ * fact, if there are active groups, then, for condition (i)
+ * to become false, it is enough that an active group contains
+ * more active processes or sub-groups than some other active
+ * group. More precisely, for condition (i) to hold because of
+ * such a group, it is not even necessary that the group is
+ * (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still
+ * have some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed
+ * their share of the throughput even after being
+ * dispatched. In this respect, it is easy to show that, if a
+ * group frequently becomes inactive while still having
+ * in-flight requests, and if, when this happens, the group is
+ * not considered in the calculation of whether the scenario
+ * is asymmetric, then the group may fail to be guaranteed its
+ * fair share of the throughput (basically because idling may
+ * not be performed for the descendant processes of the group,
+ * but it had to be). We address this issue with the
+ * following bi-modal behavior, implemented in the function
+ * bfq_symmetric_scenario().
+ *
+ * If there are groups with requests waiting for completion
+ * (as commented above, some of these groups may even be
+ * already inactive), then the scenario is tagged as
+ * asymmetric, conservatively, without checking any of the
+ * conditions (i) and (ii). So the device is idled for bfqq.
+ * This behavior matches also the fact that groups are created
+ * exactly if controlling I/O is a primary concern (to
+ * preserve bandwidth and latency guarantees).
+ *
+ * On the opposite end, if there are no groups with requests
+ * waiting for completion, then only condition (i) is actually
+ * controlled, i.e., provided that condition (i) holds, idling
+ * is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if condition (i) does not hold,
+ * then idling is allowed, and the device tends to be
+ * prevented from queueing many requests, possibly of several
+ * processes. Since there are no groups with requests waiting
+ * for completion, then, to control condition (i) it is enough
+ * to check just whether all the queues with requests waiting
+ * for completion also have the same weight.
+ *
+ * Not checking condition (ii) evidently exposes bfqq to the
+ * risk of getting less throughput than its fair share.
+ * However, for queues with the same weight, a further
+ * mechanism, preemption, mitigates or even eliminates this
+ * problem. And it does so without consequences on overall
+ * throughput. This mechanism and its benefits are explained
+ * in the next three paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternatively, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
+ *
+ * The motivation for using preemption instead of idling (for
+ * queues with the same weight) is that, by not idling,
+ * service guarantees are preserved (completely or at least in
+ * part) without minimally sacrificing throughput. And, if
+ * there is no active group, then the primary expectation for
+ * this device is probably a high throughput.
+ *
+ * We are now left only with explaining the additional
+ * compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain this
+ * compound condition, we need to add that the function
+ * bfq_symmetric_scenario checks the weights of only
+ * non-weight-raised queues, for efficiency reasons (see
+ * comments on bfq_weights_tree_add()). Then the fact that
+ * bfqq is weight-raised is checked explicitly here. More
+ * precisely, the compound condition below takes into account
+ * also the fact that, even if bfqq is being weight-raised,
+ * the scenario is still symmetric if all queues with requests
+ * waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise
+ * here, and differentiate between interactive weight raising
+ * and soft real-time weight raising.
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition stops to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
 static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
 						 struct bfq_queue *bfqq)
 {
-	/*
-	 * There is a case where idling must be performed not for
-	 * throughput concerns, but to preserve service guarantees.
-	 *
-	 * To introduce this case, we can note that allowing the drive
-	 * to enqueue more than one request at a time, and thereby
-	 * delegating de facto final scheduling decisions to the
-	 * drive's internal scheduler, entails loss of control on the
-	 * actual request service order. In particular, the critical
-	 * situation is when requests from different processes happen
-	 * to be present, at the same time, in the internal queue(s)
-	 * of the drive. In such a situation, the drive, by deciding
-	 * the service order of the internally-queued requests, does
-	 * determine also the actual throughput distribution among
-	 * these processes. But the drive typically has no notion or
-	 * concern about per-process throughput distribution, and
-	 * makes its decisions only on a per-request basis. Therefore,
-	 * the service distribution enforced by the drive's internal
-	 * scheduler is likely to coincide with the desired
-	 * device-throughput distribution only in a completely
-	 * symmetric scenario where:
-	 * (i) each of these processes must get the same throughput as
-	 * the others;
-	 * (ii) the I/O of each process has the same properties, in
-	 * terms of locality (sequential or random), direction
-	 * (reads or writes), request sizes, greediness
-	 * (from I/O-bound to sporadic), and so on.
-	 * In fact, in such a scenario, the drive tends to treat
-	 * the requests of each of these processes in about the same
-	 * way as the requests of the others, and thus to provide
-	 * each of these processes with about the same throughput
-	 * (which is exactly the desired throughput distribution). In
-	 * contrast, in any asymmetric scenario, device idling is
-	 * certainly needed to guarantee that bfqq receives its
-	 * assigned fraction of the device throughput (see [1] for
-	 * details).
-	 * The problem is that idling may significantly reduce
-	 * throughput with certain combinations of types of I/O and
-	 * devices. An important example is sync random I/O, on flash
-	 * storage with command queueing. So, unless bfqq falls in the
-	 * above cases where idling also boosts throughput, it would
-	 * be important to check conditions (i) and (ii) accurately,
-	 * so as to avoid idling when not strictly needed for service
-	 * guarantees.
-	 *
-	 * Unfortunately, it is extremely difficult to thoroughly
-	 * check condition (ii). And, in case there are active groups,
-	 * it becomes very difficult to check condition (i) too. In
-	 * fact, if there are active groups, then, for condition (i)
-	 * to become false, it is enough that an active group contains
-	 * more active processes or sub-groups than some other active
-	 * group. More precisely, for condition (i) to hold because of
-	 * such a group, it is not even necessary that the group is
-	 * (still) active: it is sufficient that, even if the group
-	 * has become inactive, some of its descendant processes still
-	 * have some request already dispatched but still waiting for
-	 * completion. In fact, requests have still to be guaranteed
-	 * their share of the throughput even after being
-	 * dispatched. In this respect, it is easy to show that, if a
-	 * group frequently becomes inactive while still having
-	 * in-flight requests, and if, when this happens, the group is
-	 * not considered in the calculation of whether the scenario
-	 * is asymmetric, then the group may fail to be guaranteed its
-	 * fair share of the throughput (basically because idling may
-	 * not be performed for the descendant processes of the group,
-	 * but it had to be). We address this issue with the
-	 * following bi-modal behavior, implemented in the function
-	 * bfq_symmetric_scenario().
-	 *
-	 * If there are groups with requests waiting for completion
-	 * (as commented above, some of these groups may even be
-	 * already inactive), then the scenario is tagged as
-	 * asymmetric, conservatively, without checking any of the
-	 * conditions (i) and (ii). So the device is idled for bfqq.
-	 * This behavior matches also the fact that groups are created
-	 * exactly if controlling I/O is a primary concern (to
-	 * preserve bandwidth and latency guarantees).
-	 *
-	 * On the opposite end, if there are no groups with requests
-	 * waiting for completion, then only condition (i) is actually
-	 * controlled, i.e., provided that condition (i) holds, idling
-	 * is not performed, regardless of whether condition (ii)
-	 * holds. In other words, only if condition (i) does not hold,
-	 * then idling is allowed, and the device tends to be
-	 * prevented from queueing many requests, possibly of several
-	 * processes. Since there are no groups with requests waiting
-	 * for completion, then, to control condition (i) it is enough
-	 * to check just whether all the queues with requests waiting
-	 * for completion also have the same weight.
-	 *
-	 * Not checking condition (ii) evidently exposes bfqq to the
-	 * risk of getting less throughput than its fair share.
-	 * However, for queues with the same weight, a further
-	 * mechanism, preemption, mitigates or even eliminates this
-	 * problem. And it does so without consequences on overall
-	 * throughput. This mechanism and its benefits are explained
-	 * in the next three paragraphs.
-	 *
-	 * Even if a queue, say Q, is expired when it remains idle, Q
-	 * can still preempt the new in-service queue if the next
-	 * request of Q arrives soon (see the comments on
-	 * bfq_bfqq_update_budg_for_activation). If all queues and
-	 * groups have the same weight, this form of preemption,
-	 * combined with the hole-recovery heuristic described in the
-	 * comments on function bfq_bfqq_update_budg_for_activation,
-	 * are enough to preserve a correct bandwidth distribution in
-	 * the mid term, even without idling. In fact, even if not
-	 * idling allows the internal queues of the device to contain
-	 * many requests, and thus to reorder requests, we can rather
-	 * safely assume that the internal scheduler still preserves a
-	 * minimum of mid-term fairness.
-	 *
-	 * More precisely, this preemption-based, idleless approach
-	 * provides fairness in terms of IOPS, and not sectors per
-	 * second. This can be seen with a simple example. Suppose
-	 * that there are two queues with the same weight, but that
-	 * the first queue receives requests of 8 sectors, while the
-	 * second queue receives requests of 1024 sectors. In
-	 * addition, suppose that each of the two queues contains at
-	 * most one request at a time, which implies that each queue
-	 * always remains idle after it is served. Finally, after
-	 * remaining idle, each queue receives very quickly a new
-	 * request. It follows that the two queues are served
-	 * alternatively, preempting each other if needed. This
-	 * implies that, although both queues have the same weight,
-	 * the queue with large requests receives a service that is
-	 * 1024/8 times as high as the service received by the other
-	 * queue.
-	 *
-	 * The motivation for using preemption instead of idling (for
-	 * queues with the same weight) is that, by not idling,
-	 * service guarantees are preserved (completely or at least in
-	 * part) without minimally sacrificing throughput. And, if
-	 * there is no active group, then the primary expectation for
-	 * this device is probably a high throughput.
-	 *
-	 * We are now left only with explaining the additional
-	 * compound condition that is checked below for deciding
-	 * whether the scenario is asymmetric. To explain this
-	 * compound condition, we need to add that the function
-	 * bfq_symmetric_scenario checks the weights of only
-	 * non-weight-raised queues, for efficiency reasons (see
-	 * comments on bfq_weights_tree_add()). Then the fact that
-	 * bfqq is weight-raised is checked explicitly here. More
-	 * precisely, the compound condition below takes into account
-	 * also the fact that, even if bfqq is being weight-raised,
-	 * the scenario is still symmetric if all queues with requests
-	 * waiting for completion happen to be
-	 * weight-raised. Actually, we should be even more precise
-	 * here, and differentiate between interactive weight raising
-	 * and soft real-time weight raising.
-	 *
-	 * As a side note, it is worth considering that the above
-	 * device-idling countermeasures may however fail in the
-	 * following unlucky scenario: if idling is (correctly)
-	 * disabled in a time period during which all symmetry
-	 * sub-conditions hold, and hence the device is allowed to
-	 * enqueue many requests, but at some later point in time some
-	 * sub-condition stops to hold, then it may become impossible
-	 * to let requests be served in the desired order until all
-	 * the requests already queued in the device have been served.
-	 */
-	bool asymmetric_scenario = (bfqq->wr_coeff > 1 &&
-				    bfqd->wr_busy_queues <
-				    bfq_tot_busy_queues(bfqd)) ||
+	return (bfqq->wr_coeff > 1 &&
+		bfqd->wr_busy_queues <
+		bfq_tot_busy_queues(bfqd)) ||
 		!bfq_symmetric_scenario(bfqd);
-
-	/*
-	 * Finally, there is a case where maximizing throughput is the
-	 * best choice even if it may cause unfairness toward
-	 * bfqq. Such a case is when bfqq became active in a burst of
-	 * queue activations. Queues that became active during a large
-	 * burst benefit only from throughput, as discussed in the
-	 * comments on bfq_handle_burst. Thus, if bfqq became active
-	 * in a burst and not idling the device maximizes throughput,
-	 * then the device must no be idled, because not idling the
-	 * device provides bfqq and all other queues in the burst with
-	 * maximum benefit. Combining this and the above case, we can
-	 * now establish when idling is actually needed to preserve
-	 * service guarantees.
-	 */
-	return asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);
 }
 
 /*