author     Paolo Valente <paolo.valente@linaro.org>   2019-01-29 06:06:32 -0500
committer  Jens Axboe <axboe@kernel.dk>               2019-01-31 14:50:23 -0500
commit     530c4cbb3c62f9e42dbf39279fb346f2d2ab4dbb (patch)
tree       aad632d87cc3978218cbad308e0b0bd250025e34 /block/bfq-iosched.c
parent     ac8b0cb415f3aa9162009d39624501d37031533b (diff)
block, bfq: unconditionally plug I/O in asymmetric scenarios
bfq detects the creation of multiple bfq_queues shortly after each
other, namely a burst of queue creations in the terminology used in the
code. If the burst is large, then no queue in the burst is granted
- either I/O-dispatch plugging when the queue remains temporarily idle
while in service;
- or weight raising, because it causes even longer plugging.
In fact, such plugging tends to lower throughput, while these bursts
are typically due to applications or services that spawn multiple
processes to reach a common goal as soon as possible. Examples are a
"git grep" or the booting of a system.
Unfortunately, disabling plugging may cause a loss of service guarantees
in asymmetric scenarios, i.e., if queue weights are differentiated or if
more than one group is active.
This commit addresses this issue by no longer disabling I/O-dispatch
plugging for queues in large bursts.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Diffstat (limited to 'block/bfq-iosched.c')
 block/bfq-iosched.c | 346
 1 file changed, 165 insertions(+), 181 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a6fe60114ade..c1bb5e5fcdc4 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3479,191 +3479,175 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 		bfqd->wr_busy_queues == 0;
 }
 
+/*
+ * There is a case where idling must be performed not for
+ * throughput concerns, but to preserve service guarantees.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired
+ * device-throughput distribution only in a completely
+ * symmetric scenario where:
+ * (i) each of these processes must get the same throughput as
+ *     the others;
+ * (ii) the I/O of each process has the same properties, in
+ *      terms of locality (sequential or random), direction
+ *      (reads or writes), request sizes, greediness
+ *      (from I/O-bound to sporadic), and so on.
+ * In fact, in such a scenario, the drive tends to treat
+ * the requests of each of these processes in about the same
+ * way as the requests of the others, and thus to provide
+ * each of these processes with about the same throughput
+ * (which is exactly the desired throughput distribution). In
+ * contrast, in any asymmetric scenario, device idling is
+ * certainly needed to guarantee that bfqq receives its
+ * assigned fraction of the device throughput (see [1] for
+ * details).
+ * The problem is that idling may significantly reduce
+ * throughput with certain combinations of types of I/O and
+ * devices. An important example is sync random I/O, on flash
+ * storage with command queueing. So, unless bfqq falls in the
+ * above cases where idling also boosts throughput, it would
+ * be important to check conditions (i) and (ii) accurately,
+ * so as to avoid idling when not strictly needed for service
+ * guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly
+ * check condition (ii). And, in case there are active groups,
+ * it becomes very difficult to check condition (i) too. In
+ * fact, if there are active groups, then, for condition (i)
+ * to become false, it is enough that an active group contains
+ * more active processes or sub-groups than some other active
+ * group. More precisely, for condition (i) to hold because of
+ * such a group, it is not even necessary that the group is
+ * (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still
+ * have some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed
+ * their share of the throughput even after being
+ * dispatched. In this respect, it is easy to show that, if a
+ * group frequently becomes inactive while still having
+ * in-flight requests, and if, when this happens, the group is
+ * not considered in the calculation of whether the scenario
+ * is asymmetric, then the group may fail to be guaranteed its
+ * fair share of the throughput (basically because idling may
+ * not be performed for the descendant processes of the group,
+ * but it had to be). We address this issue with the
+ * following bi-modal behavior, implemented in the function
+ * bfq_symmetric_scenario().
+ *
+ * If there are groups with requests waiting for completion
+ * (as commented above, some of these groups may even be
+ * already inactive), then the scenario is tagged as
+ * asymmetric, conservatively, without checking any of the
+ * conditions (i) and (ii). So the device is idled for bfqq.
+ * This behavior matches also the fact that groups are created
+ * exactly if controlling I/O is a primary concern (to
+ * preserve bandwidth and latency guarantees).
+ *
+ * On the opposite end, if there are no groups with requests
+ * waiting for completion, then only condition (i) is actually
+ * controlled, i.e., provided that condition (i) holds, idling
+ * is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if condition (i) does not hold,
+ * then idling is allowed, and the device tends to be
+ * prevented from queueing many requests, possibly of several
+ * processes. Since there are no groups with requests waiting
+ * for completion, then, to control condition (i) it is enough
+ * to check just whether all the queues with requests waiting
+ * for completion also have the same weight.
+ *
+ * Not checking condition (ii) evidently exposes bfqq to the
+ * risk of getting less throughput than its fair share.
+ * However, for queues with the same weight, a further
+ * mechanism, preemption, mitigates or even eliminates this
+ * problem. And it does so without consequences on overall
+ * throughput. This mechanism and its benefits are explained
+ * in the next three paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternatively, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
+ *
+ * The motivation for using preemption instead of idling (for
+ * queues with the same weight) is that, by not idling,
+ * service guarantees are preserved (completely or at least in
+ * part) without minimally sacrificing throughput. And, if
+ * there is no active group, then the primary expectation for
+ * this device is probably a high throughput.
+ *
+ * We are now left only with explaining the additional
+ * compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain this
+ * compound condition, we need to add that the function
+ * bfq_symmetric_scenario checks the weights of only
+ * non-weight-raised queues, for efficiency reasons (see
+ * comments on bfq_weights_tree_add()). Then the fact that
+ * bfqq is weight-raised is checked explicitly here. More
+ * precisely, the compound condition below takes into account
+ * also the fact that, even if bfqq is being weight-raised,
+ * the scenario is still symmetric if all queues with requests
+ * waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise
+ * here, and differentiate between interactive weight raising
+ * and soft real-time weight raising.
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition stops to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
 static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
 						 struct bfq_queue *bfqq)
 {
-	/*
-	 * There is a case where idling must be performed not for
-	 * throughput concerns, but to preserve service guarantees.
-	 *
-	 * To introduce this case, we can note that allowing the drive
-	 * to enqueue more than one request at a time, and thereby
-	 * delegating de facto final scheduling decisions to the
-	 * drive's internal scheduler, entails loss of control on the
-	 * actual request service order. In particular, the critical
-	 * situation is when requests from different processes happen
-	 * to be present, at the same time, in the internal queue(s)
-	 * of the drive. In such a situation, the drive, by deciding
-	 * the service order of the internally-queued requests, does
-	 * determine also the actual throughput distribution among
-	 * these processes. But the drive typically has no notion or
-	 * concern about per-process throughput distribution, and
-	 * makes its decisions only on a per-request basis. Therefore,
-	 * the service distribution enforced by the drive's internal
-	 * scheduler is likely to coincide with the desired
-	 * device-throughput distribution only in a completely
-	 * symmetric scenario where:
-	 * (i) each of these processes must get the same throughput as
-	 *     the others;
-	 * (ii) the I/O of each process has the same properties, in
-	 *      terms of locality (sequential or random), direction
-	 *      (reads or writes), request sizes, greediness
-	 *      (from I/O-bound to sporadic), and so on.
-	 * In fact, in such a scenario, the drive tends to treat
-	 * the requests of each of these processes in about the same
-	 * way as the requests of the others, and thus to provide
-	 * each of these processes with about the same throughput
-	 * (which is exactly the desired throughput distribution). In
-	 * contrast, in any asymmetric scenario, device idling is
-	 * certainly needed to guarantee that bfqq receives its
-	 * assigned fraction of the device throughput (see [1] for
-	 * details).
-	 * The problem is that idling may significantly reduce
-	 * throughput with certain combinations of types of I/O and
-	 * devices. An important example is sync random I/O, on flash
-	 * storage with command queueing. So, unless bfqq falls in the
-	 * above cases where idling also boosts throughput, it would
-	 * be important to check conditions (i) and (ii) accurately,
-	 * so as to avoid idling when not strictly needed for service
-	 * guarantees.
-	 *
-	 * Unfortunately, it is extremely difficult to thoroughly
-	 * check condition (ii). And, in case there are active groups,
-	 * it becomes very difficult to check condition (i) too. In
-	 * fact, if there are active groups, then, for condition (i)
-	 * to become false, it is enough that an active group contains
-	 * more active processes or sub-groups than some other active
-	 * group. More precisely, for condition (i) to hold because of
-	 * such a group, it is not even necessary that the group is
-	 * (still) active: it is sufficient that, even if the group
-	 * has become inactive, some of its descendant processes still
-	 * have some request already dispatched but still waiting for
-	 * completion. In fact, requests have still to be guaranteed
-	 * their share of the throughput even after being
-	 * dispatched. In this respect, it is easy to show that, if a
-	 * group frequently becomes inactive while still having
-	 * in-flight requests, and if, when this happens, the group is
-	 * not considered in the calculation of whether the scenario
-	 * is asymmetric, then the group may fail to be guaranteed its
-	 * fair share of the throughput (basically because idling may
-	 * not be performed for the descendant processes of the group,
-	 * but it had to be). We address this issue with the
-	 * following bi-modal behavior, implemented in the function
-	 * bfq_symmetric_scenario().
-	 *
-	 * If there are groups with requests waiting for completion
-	 * (as commented above, some of these groups may even be
-	 * already inactive), then the scenario is tagged as
-	 * asymmetric, conservatively, without checking any of the
-	 * conditions (i) and (ii). So the device is idled for bfqq.
-	 * This behavior matches also the fact that groups are created
-	 * exactly if controlling I/O is a primary concern (to
-	 * preserve bandwidth and latency guarantees).
-	 *
-	 * On the opposite end, if there are no groups with requests
-	 * waiting for completion, then only condition (i) is actually
-	 * controlled, i.e., provided that condition (i) holds, idling
-	 * is not performed, regardless of whether condition (ii)
-	 * holds. In other words, only if condition (i) does not hold,
-	 * then idling is allowed, and the device tends to be
-	 * prevented from queueing many requests, possibly of several
-	 * processes. Since there are no groups with requests waiting
-	 * for completion, then, to control condition (i) it is enough
-	 * to check just whether all the queues with requests waiting
-	 * for completion also have the same weight.
-	 *
-	 * Not checking condition (ii) evidently exposes bfqq to the
-	 * risk of getting less throughput than its fair share.
-	 * However, for queues with the same weight, a further
-	 * mechanism, preemption, mitigates or even eliminates this
-	 * problem. And it does so without consequences on overall
-	 * throughput. This mechanism and its benefits are explained
-	 * in the next three paragraphs.
-	 *
-	 * Even if a queue, say Q, is expired when it remains idle, Q
-	 * can still preempt the new in-service queue if the next
-	 * request of Q arrives soon (see the comments on
-	 * bfq_bfqq_update_budg_for_activation). If all queues and
-	 * groups have the same weight, this form of preemption,
-	 * combined with the hole-recovery heuristic described in the
-	 * comments on function bfq_bfqq_update_budg_for_activation,
-	 * are enough to preserve a correct bandwidth distribution in
-	 * the mid term, even without idling. In fact, even if not
-	 * idling allows the internal queues of the device to contain
-	 * many requests, and thus to reorder requests, we can rather
-	 * safely assume that the internal scheduler still preserves a
-	 * minimum of mid-term fairness.
-	 *
-	 * More precisely, this preemption-based, idleless approach
-	 * provides fairness in terms of IOPS, and not sectors per
-	 * second. This can be seen with a simple example. Suppose
-	 * that there are two queues with the same weight, but that
-	 * the first queue receives requests of 8 sectors, while the
-	 * second queue receives requests of 1024 sectors. In
-	 * addition, suppose that each of the two queues contains at
-	 * most one request at a time, which implies that each queue
-	 * always remains idle after it is served. Finally, after
-	 * remaining idle, each queue receives very quickly a new
-	 * request. It follows that the two queues are served
-	 * alternatively, preempting each other if needed. This
-	 * implies that, although both queues have the same weight,
-	 * the queue with large requests receives a service that is
-	 * 1024/8 times as high as the service received by the other
-	 * queue.
-	 *
-	 * The motivation for using preemption instead of idling (for
-	 * queues with the same weight) is that, by not idling,
-	 * service guarantees are preserved (completely or at least in
-	 * part) without minimally sacrificing throughput. And, if
-	 * there is no active group, then the primary expectation for
-	 * this device is probably a high throughput.
-	 *
-	 * We are now left only with explaining the additional
-	 * compound condition that is checked below for deciding
-	 * whether the scenario is asymmetric. To explain this
-	 * compound condition, we need to add that the function
-	 * bfq_symmetric_scenario checks the weights of only
-	 * non-weight-raised queues, for efficiency reasons (see
-	 * comments on bfq_weights_tree_add()). Then the fact that
-	 * bfqq is weight-raised is checked explicitly here. More
-	 * precisely, the compound condition below takes into account
-	 * also the fact that, even if bfqq is being weight-raised,
-	 * the scenario is still symmetric if all queues with requests
-	 * waiting for completion happen to be
-	 * weight-raised. Actually, we should be even more precise
-	 * here, and differentiate between interactive weight raising
-	 * and soft real-time weight raising.
-	 *
-	 * As a side note, it is worth considering that the above
-	 * device-idling countermeasures may however fail in the
-	 * following unlucky scenario: if idling is (correctly)
-	 * disabled in a time period during which all symmetry
-	 * sub-conditions hold, and hence the device is allowed to
-	 * enqueue many requests, but at some later point in time some
-	 * sub-condition stops to hold, then it may become impossible
-	 * to let requests be served in the desired order until all
-	 * the requests already queued in the device have been served.
-	 */
-	bool asymmetric_scenario = (bfqq->wr_coeff > 1 &&
-				    bfqd->wr_busy_queues <
-				    bfq_tot_busy_queues(bfqd)) ||
+	return (bfqq->wr_coeff > 1 &&
+		bfqd->wr_busy_queues <
+		bfq_tot_busy_queues(bfqd)) ||
 		!bfq_symmetric_scenario(bfqd);
-
-	/*
-	 * Finally, there is a case where maximizing throughput is the
-	 * best choice even if it may cause unfairness toward
-	 * bfqq. Such a case is when bfqq became active in a burst of
-	 * queue activations. Queues that became active during a large
-	 * burst benefit only from throughput, as discussed in the
-	 * comments on bfq_handle_burst. Thus, if bfqq became active
-	 * in a burst and not idling the device maximizes throughput,
-	 * then the device must no be idled, because not idling the
-	 * device provides bfqq and all other queues in the burst with
-	 * maximum benefit. Combining this and the above case, we can
-	 * now establish when idling is actually needed to preserve
-	 * service guarantees.
-	 */
-	return asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);
 }
 
 /*