author		Paolo Valente <paolo.valente@linaro.org>	2019-01-29 06:06:32 -0500
committer	Jens Axboe <axboe@kernel.dk>	2019-01-31 14:50:23 -0500
commit		530c4cbb3c62f9e42dbf39279fb346f2d2ab4dbb (patch)
tree		aad632d87cc3978218cbad308e0b0bd250025e34 /block/bfq-iosched.c
parent		ac8b0cb415f3aa9162009d39624501d37031533b (diff)
block, bfq: unconditionally plug I/O in asymmetric scenarios
bfq detects the creation of multiple bfq_queues shortly after each
other, namely a burst of queue creations in the terminology used in
the code. If the burst is large, then no queue in the burst is granted
- either I/O-dispatch plugging when the queue remains temporarily idle
  while in service;
- or weight raising, because it causes even longer plugging.

In fact, such a plugging tends to lower throughput, while these bursts
are typically due to applications or services that spawn multiple
processes, to reach a common goal as soon as possible. Examples are a
"git grep" or the booting of a system.

Unfortunately, disabling plugging may cause a loss of service
guarantees in asymmetric scenarios, i.e., if queue weights are
differentiated or if more than one group is active.

This commit addresses this issue by no longer disabling I/O-dispatch
plugging for queues in large bursts.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Diffstat (limited to 'block/bfq-iosched.c')
-rw-r--r--	block/bfq-iosched.c	346
1 file changed, 165 insertions(+), 181 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a6fe60114ade..c1bb5e5fcdc4 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3479,191 +3479,175 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 		bfqd->wr_busy_queues == 0;
 }
 
+/*
+ * There is a case where idling must be performed not for
+ * throughput concerns, but to preserve service guarantees.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired
+ * device-throughput distribution only in a completely
+ * symmetric scenario where:
+ * (i) each of these processes must get the same throughput as
+ * the others;
+ * (ii) the I/O of each process has the same properties, in
+ * terms of locality (sequential or random), direction
+ * (reads or writes), request sizes, greediness
+ * (from I/O-bound to sporadic), and so on.
+ * In fact, in such a scenario, the drive tends to treat
+ * the requests of each of these processes in about the same
+ * way as the requests of the others, and thus to provide
+ * each of these processes with about the same throughput
+ * (which is exactly the desired throughput distribution). In
+ * contrast, in any asymmetric scenario, device idling is
+ * certainly needed to guarantee that bfqq receives its
+ * assigned fraction of the device throughput (see [1] for
+ * details).
+ * The problem is that idling may significantly reduce
+ * throughput with certain combinations of types of I/O and
+ * devices. An important example is sync random I/O, on flash
+ * storage with command queueing. So, unless bfqq falls in the
+ * above cases where idling also boosts throughput, it would
+ * be important to check conditions (i) and (ii) accurately,
+ * so as to avoid idling when not strictly needed for service
+ * guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly
+ * check condition (ii). And, in case there are active groups,
+ * it becomes very difficult to check condition (i) too. In
+ * fact, if there are active groups, then, for condition (i)
+ * to become false, it is enough that an active group contains
+ * more active processes or sub-groups than some other active
+ * group. More precisely, for condition (i) to hold because of
+ * such a group, it is not even necessary that the group is
+ * (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still
+ * have some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed
+ * their share of the throughput even after being
+ * dispatched. In this respect, it is easy to show that, if a
+ * group frequently becomes inactive while still having
+ * in-flight requests, and if, when this happens, the group is
+ * not considered in the calculation of whether the scenario
+ * is asymmetric, then the group may fail to be guaranteed its
+ * fair share of the throughput (basically because idling may
+ * not be performed for the descendant processes of the group,
+ * but it had to be). We address this issue with the
+ * following bi-modal behavior, implemented in the function
+ * bfq_symmetric_scenario().
+ *
+ * If there are groups with requests waiting for completion
+ * (as commented above, some of these groups may even be
+ * already inactive), then the scenario is tagged as
+ * asymmetric, conservatively, without checking any of the
+ * conditions (i) and (ii). So the device is idled for bfqq.
+ * This behavior matches also the fact that groups are created
+ * exactly if controlling I/O is a primary concern (to
+ * preserve bandwidth and latency guarantees).
+ *
+ * On the opposite end, if there are no groups with requests
+ * waiting for completion, then only condition (i) is actually
+ * controlled, i.e., provided that condition (i) holds, idling
+ * is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if condition (i) does not hold,
+ * then idling is allowed, and the device tends to be
+ * prevented from queueing many requests, possibly of several
+ * processes. Since there are no groups with requests waiting
+ * for completion, then, to control condition (i) it is enough
+ * to check just whether all the queues with requests waiting
+ * for completion also have the same weight.
+ *
+ * Not checking condition (ii) evidently exposes bfqq to the
+ * risk of getting less throughput than its fair share.
+ * However, for queues with the same weight, a further
+ * mechanism, preemption, mitigates or even eliminates this
+ * problem. And it does so without consequences on overall
+ * throughput. This mechanism and its benefits are explained
+ * in the next three paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternatively, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
+ *
+ * The motivation for using preemption instead of idling (for
+ * queues with the same weight) is that, by not idling,
+ * service guarantees are preserved (completely or at least in
+ * part) without minimally sacrificing throughput. And, if
+ * there is no active group, then the primary expectation for
+ * this device is probably a high throughput.
+ *
+ * We are now left only with explaining the additional
+ * compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain this
+ * compound condition, we need to add that the function
+ * bfq_symmetric_scenario checks the weights of only
+ * non-weight-raised queues, for efficiency reasons (see
+ * comments on bfq_weights_tree_add()). Then the fact that
+ * bfqq is weight-raised is checked explicitly here. More
+ * precisely, the compound condition below takes into account
+ * also the fact that, even if bfqq is being weight-raised,
+ * the scenario is still symmetric if all queues with requests
+ * waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise
+ * here, and differentiate between interactive weight raising
+ * and soft real-time weight raising.
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition stops to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
 static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
 						 struct bfq_queue *bfqq)
 {
-	/*
-	 * There is a case where idling must be performed not for
-	 * throughput concerns, but to preserve service guarantees.
-	 *
-	 * To introduce this case, we can note that allowing the drive
-	 * to enqueue more than one request at a time, and thereby
-	 * delegating de facto final scheduling decisions to the
-	 * drive's internal scheduler, entails loss of control on the
-	 * actual request service order. In particular, the critical
-	 * situation is when requests from different processes happen
-	 * to be present, at the same time, in the internal queue(s)
-	 * of the drive. In such a situation, the drive, by deciding
-	 * the service order of the internally-queued requests, does
-	 * determine also the actual throughput distribution among
-	 * these processes. But the drive typically has no notion or
-	 * concern about per-process throughput distribution, and
-	 * makes its decisions only on a per-request basis. Therefore,
-	 * the service distribution enforced by the drive's internal
-	 * scheduler is likely to coincide with the desired
-	 * device-throughput distribution only in a completely
-	 * symmetric scenario where:
-	 * (i) each of these processes must get the same throughput as
-	 * the others;
-	 * (ii) the I/O of each process has the same properties, in
-	 * terms of locality (sequential or random), direction
-	 * (reads or writes), request sizes, greediness
-	 * (from I/O-bound to sporadic), and so on.
-	 * In fact, in such a scenario, the drive tends to treat
-	 * the requests of each of these processes in about the same
-	 * way as the requests of the others, and thus to provide
-	 * each of these processes with about the same throughput
-	 * (which is exactly the desired throughput distribution). In
-	 * contrast, in any asymmetric scenario, device idling is
-	 * certainly needed to guarantee that bfqq receives its
-	 * assigned fraction of the device throughput (see [1] for
-	 * details).
-	 * The problem is that idling may significantly reduce
-	 * throughput with certain combinations of types of I/O and
-	 * devices. An important example is sync random I/O, on flash
-	 * storage with command queueing. So, unless bfqq falls in the
-	 * above cases where idling also boosts throughput, it would
-	 * be important to check conditions (i) and (ii) accurately,
-	 * so as to avoid idling when not strictly needed for service
-	 * guarantees.
-	 *
-	 * Unfortunately, it is extremely difficult to thoroughly
-	 * check condition (ii). And, in case there are active groups,
-	 * it becomes very difficult to check condition (i) too. In
-	 * fact, if there are active groups, then, for condition (i)
-	 * to become false, it is enough that an active group contains
-	 * more active processes or sub-groups than some other active
-	 * group. More precisely, for condition (i) to hold because of
-	 * such a group, it is not even necessary that the group is
-	 * (still) active: it is sufficient that, even if the group
-	 * has become inactive, some of its descendant processes still
-	 * have some request already dispatched but still waiting for
-	 * completion. In fact, requests have still to be guaranteed
-	 * their share of the throughput even after being
-	 * dispatched. In this respect, it is easy to show that, if a
-	 * group frequently becomes inactive while still having
-	 * in-flight requests, and if, when this happens, the group is
-	 * not considered in the calculation of whether the scenario
-	 * is asymmetric, then the group may fail to be guaranteed its
-	 * fair share of the throughput (basically because idling may
-	 * not be performed for the descendant processes of the group,
-	 * but it had to be). We address this issue with the
-	 * following bi-modal behavior, implemented in the function
-	 * bfq_symmetric_scenario().
-	 *
-	 * If there are groups with requests waiting for completion
-	 * (as commented above, some of these groups may even be
-	 * already inactive), then the scenario is tagged as
-	 * asymmetric, conservatively, without checking any of the
-	 * conditions (i) and (ii). So the device is idled for bfqq.
-	 * This behavior matches also the fact that groups are created
-	 * exactly if controlling I/O is a primary concern (to
-	 * preserve bandwidth and latency guarantees).
-	 *
-	 * On the opposite end, if there are no groups with requests
-	 * waiting for completion, then only condition (i) is actually
-	 * controlled, i.e., provided that condition (i) holds, idling
-	 * is not performed, regardless of whether condition (ii)
-	 * holds. In other words, only if condition (i) does not hold,
-	 * then idling is allowed, and the device tends to be
-	 * prevented from queueing many requests, possibly of several
-	 * processes. Since there are no groups with requests waiting
-	 * for completion, then, to control condition (i) it is enough
-	 * to check just whether all the queues with requests waiting
-	 * for completion also have the same weight.
-	 *
-	 * Not checking condition (ii) evidently exposes bfqq to the
-	 * risk of getting less throughput than its fair share.
-	 * However, for queues with the same weight, a further
-	 * mechanism, preemption, mitigates or even eliminates this
-	 * problem. And it does so without consequences on overall
-	 * throughput. This mechanism and its benefits are explained
-	 * in the next three paragraphs.
-	 *
-	 * Even if a queue, say Q, is expired when it remains idle, Q
-	 * can still preempt the new in-service queue if the next
-	 * request of Q arrives soon (see the comments on
-	 * bfq_bfqq_update_budg_for_activation). If all queues and
-	 * groups have the same weight, this form of preemption,
-	 * combined with the hole-recovery heuristic described in the
-	 * comments on function bfq_bfqq_update_budg_for_activation,
-	 * are enough to preserve a correct bandwidth distribution in
-	 * the mid term, even without idling. In fact, even if not
-	 * idling allows the internal queues of the device to contain
-	 * many requests, and thus to reorder requests, we can rather
-	 * safely assume that the internal scheduler still preserves a
-	 * minimum of mid-term fairness.
-	 *
-	 * More precisely, this preemption-based, idleless approach
-	 * provides fairness in terms of IOPS, and not sectors per
-	 * second. This can be seen with a simple example. Suppose
-	 * that there are two queues with the same weight, but that
-	 * the first queue receives requests of 8 sectors, while the
-	 * second queue receives requests of 1024 sectors. In
-	 * addition, suppose that each of the two queues contains at
-	 * most one request at a time, which implies that each queue
-	 * always remains idle after it is served. Finally, after
-	 * remaining idle, each queue receives very quickly a new
-	 * request. It follows that the two queues are served
-	 * alternatively, preempting each other if needed. This
-	 * implies that, although both queues have the same weight,
-	 * the queue with large requests receives a service that is
-	 * 1024/8 times as high as the service received by the other
-	 * queue.
-	 *
-	 * The motivation for using preemption instead of idling (for
-	 * queues with the same weight) is that, by not idling,
-	 * service guarantees are preserved (completely or at least in
-	 * part) without minimally sacrificing throughput. And, if
-	 * there is no active group, then the primary expectation for
-	 * this device is probably a high throughput.
-	 *
-	 * We are now left only with explaining the additional
-	 * compound condition that is checked below for deciding
-	 * whether the scenario is asymmetric. To explain this
-	 * compound condition, we need to add that the function
-	 * bfq_symmetric_scenario checks the weights of only
-	 * non-weight-raised queues, for efficiency reasons (see
-	 * comments on bfq_weights_tree_add()). Then the fact that
-	 * bfqq is weight-raised is checked explicitly here. More
-	 * precisely, the compound condition below takes into account
-	 * also the fact that, even if bfqq is being weight-raised,
-	 * the scenario is still symmetric if all queues with requests
-	 * waiting for completion happen to be
-	 * weight-raised. Actually, we should be even more precise
-	 * here, and differentiate between interactive weight raising
-	 * and soft real-time weight raising.
-	 *
-	 * As a side note, it is worth considering that the above
-	 * device-idling countermeasures may however fail in the
-	 * following unlucky scenario: if idling is (correctly)
-	 * disabled in a time period during which all symmetry
-	 * sub-conditions hold, and hence the device is allowed to
-	 * enqueue many requests, but at some later point in time some
-	 * sub-condition stops to hold, then it may become impossible
-	 * to let requests be served in the desired order until all
-	 * the requests already queued in the device have been served.
-	 */
-	bool asymmetric_scenario = (bfqq->wr_coeff > 1 &&
-				    bfqd->wr_busy_queues <
-				    bfq_tot_busy_queues(bfqd)) ||
+	return (bfqq->wr_coeff > 1 &&
+		bfqd->wr_busy_queues <
+		bfq_tot_busy_queues(bfqd)) ||
 		!bfq_symmetric_scenario(bfqd);
-
-	/*
-	 * Finally, there is a case where maximizing throughput is the
-	 * best choice even if it may cause unfairness toward
-	 * bfqq. Such a case is when bfqq became active in a burst of
-	 * queue activations. Queues that became active during a large
-	 * burst benefit only from throughput, as discussed in the
-	 * comments on bfq_handle_burst. Thus, if bfqq became active
-	 * in a burst and not idling the device maximizes throughput,
-	 * then the device must no be idled, because not idling the
-	 * device provides bfqq and all other queues in the burst with
-	 * maximum benefit. Combining this and the above case, we can
-	 * now establish when idling is actually needed to preserve
-	 * service guarantees.
-	 */
-	return asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);
 }
 
 /*