Merge branch 'qdisc_bulk_dequeue'

Jesper Dangaard Brouer says:

====================
qdisc: bulk dequeue support

This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
from the qdisc layer, to implement dequeue bulking.

Patch01: "qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE"
 - Implement basic qdisc dequeue bulking
 - This time, 100% relying on BQL limits, no magic safe-guard constants

Patch02: "qdisc: dequeue bulking also pickup GSO/TSO packets"
 - Extend bulking to bulk several GSO/TSO packets
 - Separate patch, as it introduces a small regression, see test section.

We do have a patch03, which exports a userspace tunable as a BQL
tunable, that can byte-cap or disable the bulking/bursting. But we
could not agree on it internally, thus not sending it now. We
basically strive to avoid adding any new userspace tunable.

Testing patch01:
================
Demonstrating the performance improvement of qdisc dequeue bulking is
tricky, because the effect only "kicks in" once the qdisc system has a
backlog. Thus, for a backlog to form, we need either 1) to exceed the
wirespeed of the link or 2) to exceed the capability of the device
driver.

For practical use-cases, the measurable effect of this will be a
reduction in CPU usage.

01-TCP_STREAM:
--------------
Testing the effect for TCP involves disabling TSO and GSO, because TCP
already benefits from bulking, via TSO and especially for GSO segmented
packets. This patch views TSO/GSO as a separate kind of bulking, and
avoids further bulking of these packet types.

The measured perf diff benefit (at 10Gbit/s) for a single netperf
TCP_STREAM was 9.24% less CPU used on calls to _raw_spin_lock()
(mostly from sch_direct_xmit).

If my E5-2695v2(ES) CPU is tuned according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
then it is possible that a single netperf TCP_STREAM, with GSO and TSO
disabled, can utilize all bandwidth on a 10Gbit/s link. This will
then cause a standing backlog queue at the qdisc layer.

Trying to pressure the system some more, CPU util wise, I'm starting
24x TCP_STREAMs and monitoring the overall CPU utilization. This
confirms bulking saves CPU cycles when it "kicks in".

Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
 * Disabled bulking: sys:2.58% soft:8.50% idle:88.78%
 * Enabled bulking:  sys:2.43% soft:7.66% idle:89.79%

02-UDP_STREAM
-------------
The measured perf diff benefit for UDP_STREAM was 6.41% less CPU used
on calls to _raw_spin_lock(). 24x UDP_STREAM with packet size -m 1472
(to avoid sending UDP/IP fragments).

03-trafgen driver test
----------------------
The performance of the 10Gbit/s ixgbe driver is limited due to
updating the HW ring-queue tail-pointer on every packet, as
previously demonstrated with pktgen.

Using trafgen to send RAW frames from userspace (via AF_PACKET), and
forcing them through the qdisc path (with option --qdisc-path and -t0),
sending with 12 CPUs, I can demonstrate this driver layer limitation:
 * 12.8 Mpps with no qdisc bulking
 * 14.8 Mpps with qdisc bulking (full 10G-wirespeed)

Testing patch02:
================
Testing bulking of several GSO/TSO packets:

Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO packets shows no performance
regressions (requeues were in the area of 32 requeues/sec for 10G
while transmitting approx 813Kpps).

Bulking several GSOs does show a small regression or a very small
improvement (requeues were in the area of 8000 requeues/sec, for 10G
while transmitting approx 813Kpps).

Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms and 0.47ms. Bulking several GSOs
together results in a stable high-prio queue delay of 0.50ms.

Corresponding to:
 (10000*10^6)*((0.50-0.47)/10^3)/8 = 37500 bytes
 (10000*10^6)*((0.50-0.38)/10^3)/8 = 150000 bytes
 37500/1500 = 25 pkts
 150000/1500 = 100 pkts

Using igb at 100Mbit/s with GSO bulking shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms and 2.35ms,
a diff of 0.12ms corresponding to 1500 bytes at 100Mbit/s. Bulking
several GSOs together results in a stable high-prio queue delay of
2.23ms.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
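The byte and packet figures in the message follow from link rate times extra queue delay, divided by 8. A minimal sketch of that arithmetic in C; the helper names delay_to_bytes() and bytes_to_pkts() are hypothetical and exist only for this illustration, and the 1500 byte packet size is the one the message itself assumes:

```c
#include <assert.h>

/* Hypothetical helper: bytes a link of rate_bps bits/s carries during
 * delay_us microseconds.  Dividing the rate by 8 first keeps the
 * intermediate product well inside 64 bits for the rates used here.
 */
static unsigned long long delay_to_bytes(unsigned long long rate_bps,
                                         unsigned long long delay_us)
{
        return rate_bps / 8 * delay_us / 1000000ULL;
}

/* Hypothetical helper: whole packets of pkt_size bytes that fit in bytes. */
static unsigned long long bytes_to_pkts(unsigned long long bytes,
                                        unsigned long long pkt_size)
{
        assert(pkt_size > 0);
        return bytes / pkt_size;
}
```

Called with 10000000000 bit/s and delay differences of 30 us and 120 us, delay_to_bytes() gives 37500 and 150000 bytes, which bytes_to_pkts() turns into 25 and 100 packets of 1500 bytes. At 100000000 bit/s, 120 us corresponds to 1500 bytes, matching the igb case.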
author: David S. Miller <davem@davemloft.net> 2014-10-03 15:37:23 -0400
committer: David S. Miller <davem@davemloft.net> 2014-10-03 15:37:23 -0400
commit: c2bf5ec20488fb91af32f1c7f7c63f338ebacc9f
tree: ee3dc48d33d56e11df19b52c33abf2ac85667079
parent: 38df6492eb511d2a6823303cb1a194c4fe423154
parent: 808e7ac0bdef31204184904f6b3ea356a30a9ed5
2 files changed, 54 insertions, 2 deletions
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f12669819d1a..d17ed6fb2f70 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -7,6 +7,7 @@
 #include <linux/pkt_sched.h>
 #include <linux/pkt_cls.h>
 #include <linux/percpu.h>
+#include <linux/dynamic_queue_limits.h>
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
 
@@ -119,6 +120,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
        qdisc->__state &= ~__QDISC___STATE_RUNNING;
 }
 
+static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
+{
+        return qdisc->flags & TCQ_F_ONETXQUEUE;
+}
+
+static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq)
+{
+#ifdef CONFIG_BQL
+        /* Non-BQL migrated drivers will return 0, too. */
+        return dql_avail(&txq->dql);
+#else
+        return 0;
+#endif
+}
+
 static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
 {
        return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? true : false;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7c8e5d73d433..797ebef73642 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,35 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
        return 0;
 }
 
+static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
+                                            struct sk_buff *head_skb,
+                                            int bytelimit)
+{
+        struct sk_buff *skb, *tail_skb = head_skb;
+
+        while (bytelimit > 0) {
+                skb = q->dequeue(q);
+                if (!skb)
+                        break;
+
+                bytelimit -= skb->len; /* covers GSO len */
+                skb = validate_xmit_skb(skb, qdisc_dev(q));
+                if (!skb)
+                        break;
+
+                while (tail_skb->next) /* GSO list goto tail */
+                        tail_skb = tail_skb->next;
+
+                tail_skb->next = skb;
+                tail_skb = skb;
+        }
+
+        return head_skb;
+}
+
+/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
+ * A requeued skb (via q->gso_skb) can also be a SKB list.
+ */
 static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
 {
        struct sk_buff *skb = q->gso_skb;
@@ -70,10 +99,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
                } else
                        skb = NULL;
        } else {
-                if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
+                if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+                    !netif_xmit_frozen_or_stopped(txq)) {
+                        int bytelimit = qdisc_avail_bulklimit(txq);
+
                        skb = q->dequeue(q);
-                        if (skb)
+                        if (skb) {
+                                bytelimit -= skb->len;
                                skb = validate_xmit_skb(skb, qdisc_dev(q));
+                        }
+                        if (skb && qdisc_may_bulk(q))
+                                skb = try_bulk_dequeue_skb(q, skb, bytelimit);
                }
        }
 
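The byte-budget behaviour of dequeue_skb() together with try_bulk_dequeue_skb() can be tried outside the kernel. The following is a userspace model, not kernel code: struct pkt, struct fifo and every function name in it are hypothetical stand-ins for struct sk_buff, the qdisc and its dequeue op, and it leaves out validate_xmit_skb() and GSO lists entirely. It only shows how the first packet is always taken and further packets are chained while the budget stays positive.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct sk_buff: only the fields the loop needs. */
struct pkt {
        unsigned int len;
        struct pkt *next;
};

/* Stand-in for a qdisc: a plain singly linked FIFO. */
struct fifo {
        struct pkt *head;
        struct pkt *tail;
};

static void fifo_enqueue(struct fifo *q, struct pkt *p)
{
        assert(p != NULL);
        p->next = NULL;
        if (q->tail)
                q->tail->next = p;
        else
                q->head = p;
        q->tail = p;
}

static struct pkt *fifo_dequeue(struct fifo *q)
{
        struct pkt *p = q->head;

        if (!p)
                return NULL;
        q->head = p->next;
        if (!q->head)
                q->tail = NULL;
        p->next = NULL;
        return p;
}

/* Take one packet unconditionally, charge it against bytelimit, then
 * keep chaining packets via ->next while the budget is still positive.
 * The packet that drives the budget to zero or below is still sent.
 */
static struct pkt *bulk_dequeue(struct fifo *q, int bytelimit)
{
        struct pkt *head = fifo_dequeue(q);
        struct pkt *tail = head;
        struct pkt *p;

        if (!head)
                return NULL;
        bytelimit -= (int)head->len;

        while (bytelimit > 0) {
                p = fifo_dequeue(q);
                if (!p)
                        break;
                bytelimit -= (int)p->len;
                tail->next = p;
                tail = p;
        }
        return head;
}

/* Count the packets in a chain returned by bulk_dequeue(). */
static size_t chain_length(const struct pkt *p)
{
        size_t n = 0;

        for (; p; p = p->next)
                n++;
        return n;
}
```

With four 1500 byte packets queued, a budget of 3001 bytes yields a chain of three packets, because the third one is dequeued while one byte of budget remains. A budget of 0, which is what qdisc_avail_bulklimit() returns without CONFIG_BQL, yields exactly one packet, so bulking is effectively off in that case.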