aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux/skbuff.h
diff options
context:
space:
mode:
authorTom Herbert <therbert@google.com>2010-03-16 04:03:29 -0400
committerDavid S. Miller <davem@davemloft.net>2010-03-17 00:23:18 -0400
commit0a9627f2649a02bea165cfd529d7bcb625c2fcad (patch)
treee5d4424b99208c78e2b2fe6ff5a158fc21bdf782 /include/linux/skbuff.h
parent768bbedf9ca4cc4784eae2003f37abe0818fe0b0 (diff)
rps: Receive Packet Steering
This patch implements software receive side packet steering (RPS). RPS distributes the load of received packet processing across multiple CPUs. Problem statement: Protocol processing done in the NAPI context for received packets is serialized per device queue and becomes a bottleneck under high packet load. This substantially limits pps that can be achieved on a single queue NIC and provides no scaling with multiple cores. This solution queues packets early on in the receive path on the backlog queues of other CPUs. This allows protocol processing (e.g. IP and TCP) to be performed on packets in parallel. For each device (or each receive queue in a multi-queue device) a mask of CPUs is set to indicate the CPUs that can process packets. A CPU is selected on a per packet basis by hashing contents of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index into the CPU mask. The IPI mechanism is used to raise networking receive softirqs between CPUs. This effectively emulates in software what a multi-queue NIC can provide, but is generic requiring no device support. Many devices now provide a hash over the 4-tuple on a per packet basis (e.g. the Toeplitz hash). This patch allow drivers to set the HW reported hash in an skb field, and that value in turn is used to index into the RPS maps. Using the HW generated hash can avoid cache misses on the packet when steering it to a remote CPU. The CPU mask is set on a per device and per queue basis in the sysfs variable /sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical bit maps for receive queues in the device (numbered by <n>). If a device does not support multi-queue, a single variable is used for the device (rx-0). Generally, we have found this technique increases pps capabilities of a single queue device with good CPU utilization. Optimal settings for the CPU mask seem to depend on architectures and cache hierarcy. Below are some results running 500 instances of netperf TCP_RR test with 1 byte req. and resp. Results show cumulative transaction rate and system CPU utilization. e1000e on 8 core Intel Without RPS: 108K tps at 33% CPU With RPS: 311K tps at 64% CPU forcedeth on 16 core AMD Without RPS: 156K tps at 15% CPU With RPS: 404K tps at 49% CPU bnx2x on 16 core AMD Without RPS 567K tps at 61% CPU (4 HW RX queues) Without RPS 738K tps at 96% CPU (8 HW RX queues) With RPS: 854K tps at 76% CPU (4 HW RX queues) Caveats: - The benefits of this patch are dependent on architecture and cache hierarchy. Tuning the masks to get best performance is probably necessary. - This patch adds overhead in the path for processing a single packet. In a lightly loaded server this overhead may eliminate the advantages of increased parallelism, and possibly cause some relative performance degradation. We have found that masks that are cache aware (share same caches with the interrupting CPU) mitigate much of this. - The RPS masks can be changed dynamically, however whenever the mask is changed this introduces the possibility of generating out of order packets. It's probably best not change the masks too frequently. Signed-off-by: Tom Herbert <therbert@google.com> include/linux/netdevice.h | 32 ++++- include/linux/skbuff.h | 3 + net/core/dev.c | 335 +++++++++++++++++++++++++++++++++++++-------- net/core/net-sysfs.c | 225 ++++++++++++++++++++++++++++++- net/core/skbuff.c | 2 + 5 files changed, 538 insertions(+), 59 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'include/linux/skbuff.h')
-rw-r--r--include/linux/skbuff.h3
1 files changed, 3 insertions, 0 deletions
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03f816a9b659..def10b064f29 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -300,6 +300,7 @@ typedef unsigned char *sk_buff_data_t;
300 * @nfct_reasm: netfilter conntrack re-assembly pointer 300 * @nfct_reasm: netfilter conntrack re-assembly pointer
301 * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c 301 * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
302 * @skb_iif: ifindex of device we arrived on 302 * @skb_iif: ifindex of device we arrived on
303 * @rxhash: the packet hash computed on receive
303 * @queue_mapping: Queue mapping for multiqueue devices 304 * @queue_mapping: Queue mapping for multiqueue devices
304 * @tc_index: Traffic control index 305 * @tc_index: Traffic control index
305 * @tc_verd: traffic control verdict 306 * @tc_verd: traffic control verdict
@@ -375,6 +376,8 @@ struct sk_buff {
375#endif 376#endif
376#endif 377#endif
377 378
379 __u32 rxhash;
380
378 kmemcheck_bitfield_begin(flags2); 381 kmemcheck_bitfield_begin(flags2);
379 __u16 queue_mapping:16; 382 __u16 queue_mapping:16;
380#ifdef CONFIG_IPV6_NDISC_NODETYPE 383#ifdef CONFIG_IPV6_NDISC_NODETYPE