rfs: Receive Flow Steering

This patch implements receive flow steering (RFS). RFS steers received
packets for layer 3 and 4 processing to the CPU where the application
for the corresponding flow is running. RFS is an extension of Receive
Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg (or
sendmsg) the application's running CPU is stored in a hash table that
is indexed by the connection's rxhash, which is stored in the socket
structure. The rxhash is passed in skbs received on the connection
from netif_receive_skb. For each received packet, the associated
rxhash is used to look up the CPU in the hash table; if a valid CPU is
set, the packet is steered to that CPU using the RPS mechanisms.

The complication with this simple approach is that it could allow
out-of-order (OOO) packets. If threads are thrashing around CPUs or
multiple threads are trying to read from the same sockets, a quickly
changing CPU value in the hash table could cause rampant OOO packets.
We consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_flow_table is a global hash table. Each entry is just a CPU
number, and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a queue tail counter. The CPU is the "current" CPU
for a matching flow. The queue tail counter holds the value of the
tail counter of the associated CPU's backlog queue at the time of the
last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented on
dequeue, so a queue tail counter is computed as queue head count +
queue length. When a packet is enqueued on a backlog queue, the
current value of the queue tail counter is saved in the hash entry of
the rps_dev_flow_table.
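
The counter arithmetic described above can be sketched in userspace C.
This is only an illustration, not code from the patch: the type and
function names (backlog_sketch, flow_sketch, backlog_enqueue, and so
on) are hypothetical, and sampling the tail after the packet has been
added is an assumption of the sketch.

```c
#include <assert.h>

/* Hypothetical stand-in for one CPU's backlog queue (not the kernel type). */
struct backlog_sketch {
        unsigned int head;      /* queue head counter: incremented on dequeue */
        unsigned int len;       /* packets currently sitting in the queue */
};

/* Hypothetical stand-in for one rps_dev_flow entry. */
struct flow_sketch {
        unsigned int last_qtail;        /* queue tail counter at last enqueue */
};

/* Queue tail counter = queue head count + queue length. */
static inline unsigned int backlog_tail(const struct backlog_sketch *q)
{
        return q->head + q->len;
}

/*
 * Enqueue one packet for a flow. The tail is sampled after the packet
 * is added (an assumption of this sketch), so last_qtail counts the
 * packet that was just queued.
 */
static inline void backlog_enqueue(struct backlog_sketch *q,
                                   struct flow_sketch *f)
{
        q->len++;
        f->last_qtail = backlog_tail(q);
}

/* Dequeue one packet; this is the only place the head counter moves. */
static inline void backlog_dequeue(struct backlog_sketch *q)
{
        assert(q->len > 0);
        q->len--;
        q->head++;
}
```

With this bookkeeping, a flow's packets have all left the queue once
the head counter has caught up with the flow's saved last_qtail.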

And now the trick: when selecting the CPU for RPS (get_rps_cpu) the
rps_sock_flow_table and the rps_dev_flow_table for the RX queue are
consulted. When the desired CPU for the flow (found in the
rps_sock_flow_table) does not match the current CPU (found in the
rps_dev_flow_table), the current CPU is changed to the desired CPU if
one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
  rps_dev_flow_table. This checks whether the queue head has advanced
  beyond the last packet that was enqueued using this table entry.
  This guarantees that all packets queued using this entry have been
  dequeued, thus preserving in-order delivery.

Making each queue have its own rps_dev_flow_table has two advantages:
1) The queue tail counters will be written on each receive, so keeping
the table local to the interrupting CPU is good for locality. 2) This
allows lockless access to the table. The CPU number and queue tail
counter need to be accessed together under mutual exclusion from
netif_receive_skb; we assume that this is only called from device
napi_poll, which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets. It should
be usable for other flow-oriented protocols.

There are two configuration parameters for RFS. The "rps_flow_entries"
kernel init parameter sets the number of entries in the
rps_sock_flow_table. The per-rxqueue sysfs entry "rps_flow_cnt"
contains the number of entries in the rps_dev_flow_table for the
rxqueue. Both are rounded to a power of two.

The obvious benefit of RFS (over just RPS) is that it achieves CPU
locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.
However, for more complex benchmarks and for applications where cache
pressure is much higher, this technique seems to perform very well.

Below are some benchmark results which show the potential benefit of
this patch. The netperf test runs 500 instances of the netperf TCP_RR
test with 1-byte requests and responses. The RPC test is a
request/response test similar in structure to the netperf RR test,
with 100 threads on each host, but it does more work in userspace than
netperf.

e1000e on 8 core Intel
   No RFS or RPS              104K tps at 30% CPU
   No RFS (best RPS config):  290K tps at 63% CPU
   RFS                        303K tps at 61% CPU

RPC test          tps    CPU%   50/90/99% usec latency   Latency StdDev
   No RFS/RPS     103K   48%    757/900/3185             4472.35
   RPS only:      174K   73%    415/993/2468             491.66
   RFS            223K   73%    379/651/1382             315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
author: Tom Herbert <therbert@google.com> 2010-04-16 19:01:27 -0400
committer: David S. Miller <davem@davemloft.net> 2010-04-16 19:01:27 -0400
commit: fec5e652e58fa6017b2c9e06466cb2a6538de5b4 (patch)
tree: e034f2a1e7930a0a225bd30896f834ec5e09c084 /include/linux/netdevice.h
parent: b5d43998234331b9c01bd2165fdbb25115f4387f (diff)
1 files changed, 68 insertions, 1 deletions
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 55c2086e1f06..649a0252686e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,14 +530,73 @@ struct rps_map {
 };
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
+/*
+ * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
+ * tail pointer for that CPU's input queue at the time of last enqueue.
+ */
+struct rps_dev_flow {
+        u16 cpu;
+        u16 fill;
+        unsigned int last_qtail;
+};
+/*
+ * The rps_dev_flow_table structure contains a table of flow mappings.
+ */
+struct rps_dev_flow_table {
+        unsigned int mask;
+        struct rcu_head rcu;
+        struct work_struct free_work;
+        struct rps_dev_flow flows[0];
+};
+#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
+    (_num * sizeof(struct rps_dev_flow)))
+/*
+ * The rps_sock_flow_table contains mappings of flows to the last CPU
+ * on which they were processed by the application (set in recvmsg).
+ */
+struct rps_sock_flow_table {
+        unsigned int mask;
+        u16 ents[0];
+};
+#define RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \
+    (_num * sizeof(u16)))
+#define RPS_NO_CPU 0xffff
+static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
+                                        u32 hash)
+{
+        if (table && hash) {
+                unsigned int cpu, index = hash & table->mask;
+                /* We only give a hint, preemption can change cpu under us */
+                cpu = raw_smp_processor_id();
+                if (table->ents[index] != cpu)
+                        table->ents[index] = cpu;
+        }
+}
+static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
+                                       u32 hash)
+{
+        if (table && hash)
+                table->ents[hash & table->mask] = RPS_NO_CPU;
+}
+extern struct rps_sock_flow_table *rps_sock_flow_table;
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
        struct rps_map *rps_map;
+        struct rps_dev_flow_table *rps_flow_table;
        struct kobject kobj;
        struct netdev_rx_queue *first;
        atomic_t count;
 } ____cacheline_aligned_in_smp;
-#endif
+#endif /* CONFIG_RPS */
 /*
 * This structure defines the management hooks for network devices.
@@ -1333,11 +1392,19 @@ struct softnet_data {
        /* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
        struct call_single_data csd ____cacheline_aligned_in_smp;
+        unsigned int            input_queue_head;
 #endif
        struct sk_buff_head     input_pkt_queue;
        struct napi_struct      backlog;
 };
+static inline void incr_input_queue_head(struct softnet_data *queue)
+{
+#ifdef CONFIG_RPS
+        queue->input_queue_head++;
+#endif
+}
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 #define HAVE_NETIF_QUEUE
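
The steering rule from the commit message is applied in get_rps_cpu,
which is not part of this header diff. The userspace sketch below
shows how the three conditions might combine, using fields modelled on
rps_dev_flow and input_queue_head. It is an illustration under stated
assumptions, not the patch's implementation: cpu_sketch,
dev_flow_sketch, steer_sketch and SKETCH_NR_CPUS are hypothetical
names, and comparing the counters through a signed difference (so the
test keeps working when the unsigned counters wrap) is a choice made
for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NO_CPU   0xffff  /* same value as RPS_NO_CPU in the patch */
#define SKETCH_NR_CPUS  8       /* arbitrary CPU count for the sketch */

/* Hypothetical per-CPU state: online flag and backlog queue head counter. */
struct cpu_sketch {
        bool online;
        unsigned int input_queue_head;
};

/* The two rps_dev_flow fields the rule needs. */
struct dev_flow_sketch {
        uint16_t cpu;                   /* "current" CPU for the flow */
        unsigned int last_qtail;        /* queue tail counter at last enqueue */
};

/*
 * Move the flow to the desired CPU only when that cannot reorder
 * packets, then return the flow's current CPU. The result may be
 * SKETCH_NO_CPU, in which case a caller would need another way to
 * pick a CPU. Assumes any CPU number other than SKETCH_NO_CPU is
 * below SKETCH_NR_CPUS.
 */
static inline uint16_t steer_sketch(struct dev_flow_sketch *flow,
                                    uint16_t desired,
                                    const struct cpu_sketch cpus[SKETCH_NR_CPUS])
{
        uint16_t cur = flow->cpu;

        if (desired != cur &&
            (cur == SKETCH_NO_CPU ||            /* current CPU is unset */
             !cpus[cur].online ||               /* current CPU is offline */
             /* head counter >= saved tail: the old queue has drained */
             (int)(cpus[cur].input_queue_head - flow->last_qtail) >= 0))
                flow->cpu = desired;

        return flow->cpu;
}
```

While packets of the flow are still queued on the old CPU, the head
counter is behind last_qtail and the flow stays where it is. Once that
queue drains, the next packet follows the application to the desired
CPU.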