author     David S. Miller <davem@davemloft.net>  2013-06-11 00:23:57 -0400
committer  David S. Miller <davem@davemloft.net>  2013-06-11 00:23:57 -0400
commit     0a4db187a999c4a715bf56b8ab6c4705b524e4bb (patch)
tree       a33921153ae0d65ef19c1c896cfe67dbf4ae236a
parent     6f00a0229627ca189529cad3f9154ac2f9e5c7db (diff)
parent     7e15b90ff9b796d14aa0d1aabc0dbb54632c673c (diff)
Merge branch 'll_poll'
Eliezer Tamir says:

====================
This patch set adds the ability for the socket layer code to poll directly
on an Ethernet device's RX queue. This eliminates the cost of the interrupt
and context switch and, with proper tuning, allows us to get very close to
the HW latency.

This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from last year:
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to look up a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.

Performance numbers:

 setup                             TCP_RR              UDP_RR
 kernel  Config    C3/6 rx-usecs   tps  cpu%  S.dem    tps  cpu%  S.dem
 patched optimized on   100        87k  3.13  11.4     94k  3.17  10.7
 patched optimized on   0          71k  3.12  14.0     84k  3.19  12.0
 patched optimized on   adaptive   80k  3.13  12.5     90k  3.46  12.2
 patched typical   on   100        72   3.13  14.0     79k  3.17  12.8
 patched typical   on   0          60k  2.13  16.5     71k  3.18  14.0
 patched typical   on   adaptive   67k  3.51  16.7     75k  3.36  14.5
 3.9     optimized on   adaptive   25k  1.0   12.7     28k  0.98  11.2
 3.9     typical   off  0          48k  1.09   7.3     52k  1.11   4.18
 3.9     typical   off  adaptive   35k  1.12   4.08    38k  0.65   5.49
 3.9     optimized off  adaptive   40k  0.82   4.83    43k  0.70   5.23
 3.9     optimized off  0          57k  1.17   4.08    62k  1.04   3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down config
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive, 100 us
When C3/6 states were turned on (via BIOS) the performance governor was used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100 (the
first line in the table above) was tracked during the evolution of the
patches and has never varied by more than 1%.

Design:

A global hash table that allows us to look up a struct napi by a unique id
was added. A napi_id field was added both to struct sk_buff and struct sk.
This is used to track which NAPI we need to poll for a specific socket.

The device driver marks every incoming skb with this id. This is propagated
to the sk when the socket is looked up in the protocol handler.

When the socket code does not find any more data on the socket queue, it
now may call ndo_ll_poll, which will crank the device's rx queue and feed
incoming packets to the stack directly from the context of the socket.

A sysctl value (net.core.low_latency_poll) controls how many microseconds
we busy-wait before giving up; setting it to 0 globally disables busy-polling.
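As a quick orientation, the socket-side busy-wait loop described above boils
down to the following condensed, editorial sketch of the sk_poll_ll() helper
added in include/net/ll_poll.h below (statistics and a few checks omitted;
this sketch is not part of the original posting):

    /* condensed sketch of sk_poll_ll() from include/net/ll_poll.h */
    static inline bool sk_poll_ll_sketch(struct sock *sk, int nonblock)
    {
        cycles_t end_time = ll_end_time(); /* now + net.core.low_latency_poll us */
        struct napi_struct *napi;
        int rc = false;

        rcu_read_lock_bh();                /* napi hash is RCU protected */
        napi = napi_by_id(sk->sk_napi_id); /* id the driver stamped on the skb */
        if (!napi || !napi->dev->netdev_ops->ndo_ll_poll)
            goto out;

        do {
            /* let the driver clean its rx queue from socket context */
            rc = napi->dev->netdev_ops->ndo_ll_poll(napi);
            if (rc == LL_FLUSH_FAILED)
                break;                     /* permanent failure, e.g. device down */
        } while (skb_queue_empty(&sk->sk_receive_queue) &&
                 can_poll_ll(end_time) && !nonblock);

        rc = !skb_queue_empty(&sk->sk_receive_queue);
    out:
        rcu_read_unlock_bh();
        return rc;
    }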
Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and ndo_ll_poll
is highly device / configuration dependent, we do this inside the Ethernet
driver. For example, when packets for high priority connections are sent to
separate rx queues, you might not need locking between napi poll and
ndo_ll_poll at all.

For ixgbe we only lock the RX queue. ndo_ll_poll does not touch the
interrupt state or the TX queues. (Earlier versions of this patch set did
touch them, but this design is simpler and works better.)

If a queue is actively polled by a socket (on another CPU), napi poll will
not service it, but will wait until the queue can be locked and cleaned
before doing a napi_complete(). If a socket can't lock the queue because
another CPU has it, either from napi or from another socket polling on the
queue, the socket code can busy-wait on the socket's skb queue.

ndo_ll_poll does not give preferential treatment to data from the calling
socket vs. data from others, so if another CPU is polling, you will see
your data on this socket's queue when it arrives.

ndo_ll_poll is called with local BHs disabled, so it won't race on the same
CPU with net_rx_action, which calls the napi poll method.

2. napi_hash:
The napi hash mechanism uses RCU. napi_by_id() must be called under
rcu_read_lock(). After a call to napi_hash_del(), the caller must take care
to wait an rcu grace period before freeing the memory containing the napi
struct. (ixgbe already had this because the queue vector structure uses rcu
to protect the statistics counters in it.)
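From a driver author's point of view, the wiring described above reduces to
three hooks. The sketch below is an editorial condensation of the ixgbe
changes in this series; all example_* names are hypothetical placeholders,
not symbols from the patch:

    #include <net/ll_poll.h>

    /* 1. Make each NAPI context discoverable by sockets. */
    static void example_enable_queue(struct example_q_vector *qv)
    {
        /* after netif_napi_add(); ixgbe does this in ixgbe_alloc_q_vector() */
        napi_hash_add(&qv->napi);
        napi_enable(&qv->napi);
    }

    /* 2. Stamp every received skb with the NAPI id in the normal rx path. */
    static void example_receive_skb(struct example_q_vector *qv, struct sk_buff *skb)
    {
        skb_mark_ll(skb, &qv->napi);
        napi_gro_receive(&qv->napi, skb);
    }

    /* 3. ndo_ll_poll: clean a small budget of packets from socket context,
     *    with whatever locking the device needs against its NAPI poll.
     */
    static int example_low_latency_recv(struct napi_struct *napi)
    {
        struct example_q_vector *qv =
            container_of(napi, struct example_q_vector, napi);
        int found;

        if (example_device_is_down(qv))
            return LL_FLUSH_FAILED;   /* permanent failure */
        if (!example_qv_lock_poll(qv))
            return LL_FLUSH_BUSY;     /* NAPI or another socket owns the queue */

        found = example_clean_rx_queue(qv, 4);   /* small budget */

        example_qv_unlock_poll(qv);
        return found;                 /* packets fed to the stack */
    }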
How to test:

1. The patch set should apply cleanly to net-next.
   (Don't forget to configure CONFIG_NET_LL_RX_POLL.)

2. The ethtool -c setting for rx-usecs should be on the order of 100.

3. Use ethtool -K to disable GRO and LRO.
   (You are encouraged to try it both ways. If you find that your workload
   does better with GRO on, do tell us.)

4. The sysctl value net.core.low_latency_poll controls how long (in us) to
   busy-wait for more data. You are encouraged to play with this and see
   what works for you. The default is now 0, so you need to set it to turn
   the feature on. I recommend a value around 50.

5. The benchmark thread and IRQ should be bound to separate cores. Both
   cores should be on the same CPU NUMA node as the NIC. When the app and
   the IRQ run on the same CPU you get a small penalty. If interrupt
   coalescing is set to a low value this penalty can be very large.

6. If you suspect that your machine is not configured properly, use
   numademo to make sure that the CPU to memory BW is OK:
       numademo 128m memcpy
   Local copy numbers should be more than 8GB/s on a properly configured
   machine.

Change log:

v10
- removed select/poll support (we will work on this some more and try again)

v9
- correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai
- more int -> bool changes, reported by Eric Dumazet
- better mask testing in sock_poll(), reported by Eric Dumazet

v8
- split out udp and select/poll into separate patches;
  what used to be patch 2/5 is now three patches
- type corrections from Amir Vadai and Cong Wang:
  one unsigned long that was left when changing to cycles_t, int -> bool
- more detailed patch descriptions

v7
- suggested by Ben Hutchings and Eric Dumazet: type fixes, static for
  globals in net/core.c, avoid napi_id collisions in napi_hash_add()

v6
- many small fixes suggested by Eric Dumazet: data locality, typos,
  documentation; protect napi_hash insert/delete with a spinlock
  (napi_gen_id is no longer atomic_t since it's only accessed with the
  spinlock held)
- added IPv6 TCP and UDP support (only minimally tested)

v5
- corrections suggested by Ben Hutchings: fixed typos, moved the config
  option and sysctl value from IPv4 to net
- moved sk_mark_ll() to the protocol handlers
- removed global id mechanism, replaced with a hashed napi_id, based on a
  code sample from Eric Dumazet.
  Note that ixgbe_free_q_vector() already waits an rcu grace period before
  freeing the q_vector, so nothing additional needs to be done when adding
  a call to napi_hash_del().
- simple poll/select support

v4
- removed separate config option for TCP, as suggested by Eric Dumazet
- added a linux mib counter for packets received through the low latency
  path, as suggested by Andi Kleen
- re-allow module unloading, remove module param, use a global generation
  id instead to prevent the use of a stale napi pointer, as suggested by
  Eric Dumazet
- updated Documentation/networking/ip-sysctl.txt text

v3
- coding style changes suggested by Dave Miller

v2
- the sysctl knob is now in microseconds; the default value is now 0 (off)
- for now the code depends at configure time on CONFIG_X86_TSC
- the napi reference in struct skb is now a union with the dma cookie,
  since the former is only used on RX and the latter on TX, as suggested
  by Eric Dumazet
- we do a better job at honoring non-blocking operations
- removed busy-polling support for tcp_read_sock()
- removed dynamic disabling of GRO
- coding style fixes
- disallow unloading the device module after the feature has been used

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings, Alexander Duyck,
Eric Geisler, Jason Neighbors, Yadong Li, Mike Polehn, Anil Vasudevan,
Don Wood

Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-rw-r--r--  Documentation/sysctl/net.txt                      |   7
-rw-r--r--  drivers/net/ethernet/intel/ixgbe/ixgbe.h          | 134
-rw-r--r--  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c  |  40
-rw-r--r--  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c      |   2
-rw-r--r--  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  69
-rw-r--r--  include/linux/netdevice.h                         |  32
-rw-r--r--  include/linux/skbuff.h                            |   8
-rw-r--r--  include/net/ll_poll.h                             | 148
-rw-r--r--  include/net/sock.h                                |   4
-rw-r--r--  include/uapi/linux/snmp.h                         |   1
-rw-r--r--  net/Kconfig                                       |  12
-rw-r--r--  net/core/datagram.c                               |   4
-rw-r--r--  net/core/dev.c                                    |  59
-rw-r--r--  net/core/skbuff.c                                 |   4
-rw-r--r--  net/core/sock.c                                   |   6
-rw-r--r--  net/core/sysctl_net_core.c                        |  10
-rw-r--r--  net/ipv4/proc.c                                   |   1
-rw-r--r--  net/ipv4/tcp.c                                    |   5
-rw-r--r--  net/ipv4/tcp_ipv4.c                               |   2
-rw-r--r--  net/ipv4/udp.c                                    |   6
-rw-r--r--  net/ipv6/tcp_ipv6.c                               |   2
-rw-r--r--  net/ipv6/udp.c                                    |   6
-rw-r--r--  net/socket.c                                      |   6
23 files changed, 556 insertions, 12 deletions
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index c1f8640c2fc8..85ab72dcdc3c 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -50,6 +50,13 @@ The maximum number of packets that kernel can handle on a NAPI interrupt,
 it's a Per-CPU variable.
 Default: 64
 
+low_latency_poll
+----------------
+Low latency busy poll timeout. (needs CONFIG_NET_LL_RX_POLL)
+Approximate time in us to spin waiting for packets on the device queue.
+Recommended value is 50. May increase power usage.
+Default: 0 (off)
+
 rmem_default
 ------------
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index ca932387a80f..fb098b46c6a6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -52,6 +52,11 @@
 #include <linux/dca.h>
 #endif
 
+#include <net/ll_poll.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+#define LL_EXTENDED_STATS
+#endif
 /* common prefix used by pr_<> macros */
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -182,6 +187,11 @@ struct ixgbe_rx_buffer {
 struct ixgbe_queue_stats {
  u64 packets;
  u64 bytes;
+#ifdef LL_EXTENDED_STATS
+ u64 yields;
+ u64 misses;
+ u64 cleaned;
+#endif /* LL_EXTENDED_STATS */
 };
 
 struct ixgbe_tx_queue_stats {
@@ -356,9 +366,133 @@ struct ixgbe_q_vector {
  struct rcu_head rcu; /* to avoid race with update stats on free */
  char name[IFNAMSIZ + 9];
 
+#ifdef CONFIG_NET_LL_RX_POLL
+ unsigned int state;
+#define IXGBE_QV_STATE_IDLE 0
+#define IXGBE_QV_STATE_NAPI 1 /* NAPI owns this QV */
+#define IXGBE_QV_STATE_POLL 2 /* poll owns this QV */
+#define IXGBE_QV_LOCKED (IXGBE_QV_STATE_NAPI | IXGBE_QV_STATE_POLL)
+#define IXGBE_QV_STATE_NAPI_YIELD 4 /* NAPI yielded this QV */
+#define IXGBE_QV_STATE_POLL_YIELD 8 /* poll yielded this QV */
+#define IXGBE_QV_YIELD (IXGBE_QV_STATE_NAPI_YIELD | IXGBE_QV_STATE_POLL_YIELD)
+#define IXGBE_QV_USER_PEND (IXGBE_QV_STATE_POLL | IXGBE_QV_STATE_POLL_YIELD)
+ spinlock_t lock;
+#endif /* CONFIG_NET_LL_RX_POLL */
+
  /* for dynamic allocation of rings associated with this q_vector */
  struct ixgbe_ring ring[0] ____cacheline_internodealigned_in_smp;
 };
+#ifdef CONFIG_NET_LL_RX_POLL
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+
+ spin_lock_init(&q_vector->lock);
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+}
+
+/* called from the device poll routine to get ownership of a q_vector */
+static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+ int rc = true;
+ spin_lock(&q_vector->lock);
+ if (q_vector->state & IXGBE_QV_LOCKED) {
+ WARN_ON(q_vector->state & IXGBE_QV_STATE_NAPI);
+ q_vector->state |= IXGBE_QV_STATE_NAPI_YIELD;
+ rc = false;
+#ifdef LL_EXTENDED_STATS
+ q_vector->tx.ring->stats.yields++;
+#endif
+ } else
+ /* we don't care if someone yielded */
+ q_vector->state = IXGBE_QV_STATE_NAPI;
+ spin_unlock(&q_vector->lock);
+ return rc;
+}
+
+/* returns true is someone tried to get the qv while napi had it */
+static inline bool ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+ int rc = false;
+ spin_lock(&q_vector->lock);
+ WARN_ON(q_vector->state & (IXGBE_QV_STATE_POLL |
+ IXGBE_QV_STATE_NAPI_YIELD));
+
+ if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+ rc = true;
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+ spin_unlock(&q_vector->lock);
+ return rc;
+}
+
+/* called from ixgbe_low_latency_poll() */
+static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+ int rc = true;
+ spin_lock_bh(&q_vector->lock);
+ if ((q_vector->state & IXGBE_QV_LOCKED)) {
+ q_vector->state |= IXGBE_QV_STATE_POLL_YIELD;
+ rc = false;
+#ifdef LL_EXTENDED_STATS
+ q_vector->rx.ring->stats.yields++;
+#endif
+ } else
+ /* preserve yield marks */
+ q_vector->state |= IXGBE_QV_STATE_POLL;
+ spin_unlock_bh(&q_vector->lock);
+ return rc;
+}
+
+/* returns true if someone tried to get the qv while it was locked */
+static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+ int rc = false;
+ spin_lock_bh(&q_vector->lock);
+ WARN_ON(q_vector->state & (IXGBE_QV_STATE_NAPI));
+
+ if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+ rc = true;
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+ spin_unlock_bh(&q_vector->lock);
+ return rc;
+}
+
+/* true if a socket is polling, even if it did not get the lock */
+static inline bool ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+ WARN_ON(!(q_vector->state & IXGBE_QV_LOCKED));
+ return q_vector->state & IXGBE_QV_USER_PEND;
+}
+#else /* CONFIG_NET_LL_RX_POLL */
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+}
+
+static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+ return true;
+}
+
+static inline bool ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline bool ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+#endif /* CONFIG_NET_LL_RX_POLL */
+
 #ifdef CONFIG_IXGBE_HWMON
 
 #define IXGBE_HWMON_TYPE_LOC 0
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index d3754722adb4..24e2e7aafda2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1054,6 +1054,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
  data[i] = 0;
  data[i+1] = 0;
  i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = 0;
+ data[i+1] = 0;
+ data[i+2] = 0;
+ i += 3;
+#endif
  continue;
  }
 
@@ -1063,6 +1069,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
  data[i+1] = ring->stats.bytes;
  } while (u64_stats_fetch_retry_bh(&ring->syncp, start));
  i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = ring->stats.yields;
+ data[i+1] = ring->stats.misses;
+ data[i+2] = ring->stats.cleaned;
+ i += 3;
+#endif
  }
  for (j = 0; j < IXGBE_NUM_RX_QUEUES; j++) {
  ring = adapter->rx_ring[j];
@@ -1070,6 +1082,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
  data[i] = 0;
  data[i+1] = 0;
  i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = 0;
+ data[i+1] = 0;
+ data[i+2] = 0;
+ i += 3;
+#endif
  continue;
  }
 
@@ -1079,6 +1097,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
  data[i+1] = ring->stats.bytes;
  } while (u64_stats_fetch_retry_bh(&ring->syncp, start));
  i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = ring->stats.yields;
+ data[i+1] = ring->stats.misses;
+ data[i+2] = ring->stats.cleaned;
+ i += 3;
+#endif
  }
 
  for (j = 0; j < IXGBE_MAX_PACKET_BUFFERS; j++) {
@@ -1115,12 +1139,28 @@ static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
  p += ETH_GSTRING_LEN;
  sprintf(p, "tx_queue_%u_bytes", i);
  p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+ sprintf(p, "tx_q_%u_napi_yield", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "tx_q_%u_misses", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "tx_q_%u_cleaned", i);
+ p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
  }
  for (i = 0; i < IXGBE_NUM_RX_QUEUES; i++) {
  sprintf(p, "rx_queue_%u_packets", i);
  p += ETH_GSTRING_LEN;
  sprintf(p, "rx_queue_%u_bytes", i);
  p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+ sprintf(p, "rx_q_%u_ll_poll_yield", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "rx_q_%u_misses", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "rx_q_%u_cleaned", i);
+ p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
  }
  for (i = 0; i < IXGBE_MAX_PACKET_BUFFERS; i++) {
  sprintf(p, "tx_pb_%u_pxon", i);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index ef5f7a678ce1..90b4e1089ecc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -811,6 +811,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
  /* initialize NAPI */
  netif_napi_add(adapter->netdev, &q_vector->napi,
  ixgbe_poll, 64);
+ napi_hash_add(&q_vector->napi);
 
  /* tie q_vector and adapter together */
  adapter->q_vector[v_idx] = q_vector;
@@ -931,6 +932,7 @@ static void ixgbe_free_q_vector(struct ixgbe_adapter *adapter, int v_idx)
  adapter->rx_ring[ring->queue_index] = NULL;
 
  adapter->q_vector[v_idx] = NULL;
+ napi_hash_del(&q_vector->napi);
  netif_napi_del(&q_vector->napi);
 
  /*
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d30fbdd81fca..047ebaaf0141 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1504,7 +1504,9 @@ static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
 {
  struct ixgbe_adapter *adapter = q_vector->adapter;
 
- if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL))
+ if (ixgbe_qv_ll_polling(q_vector))
+ netif_receive_skb(skb);
+ else if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL))
  napi_gro_receive(&q_vector->napi, skb);
  else
  netif_rx(skb);
@@ -1892,9 +1894,9 @@ dma_sync:
  * expensive overhead for IOMMU access this provides a means of avoiding
  * it by maintaining the mapping of the page to the syste.
  *
- * Returns true if all work is completed without reaching budget
+ * Returns amount of work completed
  **/
-static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
+static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
  struct ixgbe_ring *rx_ring,
  const int budget)
 {
@@ -1976,6 +1978,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
  }
 
 #endif /* IXGBE_FCOE */
+ skb_mark_ll(skb, &q_vector->napi);
  ixgbe_rx_skb(q_vector, skb);
 
  /* update budget accounting */
@@ -1992,9 +1995,43 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
  if (cleaned_count)
  ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);
 
- return (total_rx_packets < budget);
+ return total_rx_packets;
 }
 
+#ifdef CONFIG_NET_LL_RX_POLL
+/* must be called with local_bh_disable()d */
+static int ixgbe_low_latency_recv(struct napi_struct *napi)
+{
+ struct ixgbe_q_vector *q_vector =
+ container_of(napi, struct ixgbe_q_vector, napi);
+ struct ixgbe_adapter *adapter = q_vector->adapter;
+ struct ixgbe_ring *ring;
+ int found = 0;
+
+ if (test_bit(__IXGBE_DOWN, &adapter->state))
+ return LL_FLUSH_FAILED;
+
+ if (!ixgbe_qv_lock_poll(q_vector))
+ return LL_FLUSH_BUSY;
+
+ ixgbe_for_each_ring(ring, q_vector->rx) {
+ found = ixgbe_clean_rx_irq(q_vector, ring, 4);
+#ifdef LL_EXTENDED_STATS
+ if (found)
+ ring->stats.cleaned += found;
+ else
+ ring->stats.misses++;
+#endif
+ if (found)
+ break;
+ }
+
+ ixgbe_qv_unlock_poll(q_vector);
+
+ return found;
+}
+#endif /* CONFIG_NET_LL_RX_POLL */
+
 /**
  * ixgbe_configure_msix - Configure MSI-X hardware
  * @adapter: board private structure
@@ -2550,6 +2587,9 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
  ixgbe_for_each_ring(ring, q_vector->tx)
  clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring);
 
+ if (!ixgbe_qv_lock_napi(q_vector))
+ return budget;
+
  /* attempt to distribute budget to each queue fairly, but don't allow
  * the budget to go below 1 because we'll exit polling */
  if (q_vector->rx.count > 1)
@@ -2558,9 +2598,10 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
  per_ring_budget = budget;
 
  ixgbe_for_each_ring(ring, q_vector->rx)
- clean_complete &= ixgbe_clean_rx_irq(q_vector, ring,
- per_ring_budget);
+ clean_complete &= (ixgbe_clean_rx_irq(q_vector, ring,
+ per_ring_budget) < per_ring_budget);
 
+ ixgbe_qv_unlock_napi(q_vector);
  /* If all work not completed, return budget and keep polling */
  if (!clean_complete)
  return budget;
@@ -3747,16 +3788,25 @@ static void ixgbe_napi_enable_all(struct ixgbe_adapter *adapter)
 {
  int q_idx;
 
- for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++)
+ for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++) {
+ ixgbe_qv_init_lock(adapter->q_vector[q_idx]);
  napi_enable(&adapter->q_vector[q_idx]->napi);
+ }
 }
 
 static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
 {
  int q_idx;
 
- for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++)
+ local_bh_disable(); /* for ixgbe_qv_lock_napi() */
+ for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++) {
  napi_disable(&adapter->q_vector[q_idx]->napi);
+ while (!ixgbe_qv_lock_napi(adapter->q_vector[q_idx])) {
+ pr_info("QV %d locked\n", q_idx);
+ mdelay(1);
+ }
+ }
+ local_bh_enable();
 }
 
 #ifdef CONFIG_IXGBE_DCB
@@ -7177,6 +7227,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 #ifdef CONFIG_NET_POLL_CONTROLLER
  .ndo_poll_controller = ixgbe_netpoll,
 #endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ .ndo_ll_poll = ixgbe_low_latency_recv,
+#endif
 #ifdef IXGBE_FCOE
  .ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
  .ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8f967e34142b..2ecb96d9a1e5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -324,12 +324,15 @@ struct napi_struct {
  struct sk_buff *gro_list;
  struct sk_buff *skb;
  struct list_head dev_list;
+ struct hlist_node napi_hash_node;
+ unsigned int napi_id;
 };
 
 enum {
  NAPI_STATE_SCHED, /* Poll is scheduled */
  NAPI_STATE_DISABLE, /* Disable pending */
  NAPI_STATE_NPSVC, /* Netpoll - don't dequeue from poll_list */
+ NAPI_STATE_HASHED, /* In NAPI hash */
 };
 
 enum gro_result {
@@ -446,6 +449,32 @@ extern void __napi_complete(struct napi_struct *n);
 extern void napi_complete(struct napi_struct *n);
 
 /**
+ * napi_by_id - lookup a NAPI by napi_id
+ * @napi_id: hashed napi_id
+ *
+ * lookup @napi_id in napi_hash table
+ * must be called under rcu_read_lock()
+ */
+extern struct napi_struct *napi_by_id(unsigned int napi_id);
+
+/**
+ * napi_hash_add - add a NAPI to global hashtable
+ * @napi: napi context
+ *
+ * generate a new napi_id and store a @napi under it in napi_hash
+ */
+extern void napi_hash_add(struct napi_struct *napi);
+
+/**
+ * napi_hash_del - remove a NAPI from global table
+ * @napi: napi context
+ *
+ * Warning: caller must observe rcu grace period
+ * before freeing memory containing @napi
+ */
+extern void napi_hash_del(struct napi_struct *napi);
+
+/**
  * napi_disable - prevent NAPI from scheduling
  * @n: napi context
  *
@@ -943,6 +972,9 @@ struct net_device_ops {
  gfp_t gfp);
  void (*ndo_netpoll_cleanup)(struct net_device *dev);
 #endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ int (*ndo_ll_poll)(struct napi_struct *dev);
+#endif
  int (*ndo_set_vf_mac)(struct net_device *dev,
  int queue, u8 *mac);
  int (*ndo_set_vf_vlan)(struct net_device *dev,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9995834d2cb6..400d82ae2b03 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -386,6 +386,7 @@ typedef unsigned char *sk_buff_data_t;
  * @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
  * @dma_cookie: a cookie to one of several possible DMA operations
  * done by skb DMA functions
+ * @napi_id: id of the NAPI struct this skb came from
  * @secmark: security marking
  * @mark: Generic packet mark
  * @dropcount: total number of sk_receive_queue overflows
@@ -500,8 +501,11 @@ struct sk_buff {
  /* 7/9 bit hole (depending on ndisc_nodetype presence) */
  kmemcheck_bitfield_end(flags2);
 
-#ifdef CONFIG_NET_DMA
- dma_cookie_t dma_cookie;
+#if defined CONFIG_NET_DMA || defined CONFIG_NET_LL_RX_POLL
+ union {
+ unsigned int napi_id;
+ dma_cookie_t dma_cookie;
+ };
 #endif
 #ifdef CONFIG_NETWORK_SECMARK
  __u32 secmark;
diff --git a/include/net/ll_poll.h b/include/net/ll_poll.h
new file mode 100644
index 000000000000..bc262f88173f
--- /dev/null
+++ b/include/net/ll_poll.h
@@ -0,0 +1,148 @@
+/*
+ * Low Latency Sockets
+ * Copyright(c) 2013 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * Author: Eliezer Tamir
+ *
+ * Contact Information:
+ * e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+ */
+
+/*
+ * For now this depends on CONFIG_X86_TSC
+ */
+
+#ifndef _LINUX_NET_LL_POLL_H
+#define _LINUX_NET_LL_POLL_H
+
+#include <linux/netdevice.h>
+#include <net/ip.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+
+struct napi_struct;
+extern unsigned long sysctl_net_ll_poll __read_mostly;
+
+/* return values from ndo_ll_poll */
+#define LL_FLUSH_FAILED -1
+#define LL_FLUSH_BUSY -2
+
+/* we don't mind a ~2.5% imprecision */
+#define TSC_MHZ (tsc_khz >> 10)
+
+static inline cycles_t ll_end_time(void)
+{
+ return TSC_MHZ * ACCESS_ONCE(sysctl_net_ll_poll) + get_cycles();
+}
+
+static inline bool sk_valid_ll(struct sock *sk)
+{
+ return sysctl_net_ll_poll && sk->sk_napi_id &&
+ !need_resched() && !signal_pending(current);
+}
+
+static inline bool can_poll_ll(cycles_t end_time)
+{
+ return !time_after((unsigned long)get_cycles(),
+ (unsigned long)end_time);
+}
+
+static inline bool sk_poll_ll(struct sock *sk, int nonblock)
+{
+ cycles_t end_time = ll_end_time();
+ const struct net_device_ops *ops;
+ struct napi_struct *napi;
+ int rc = false;
+
+ /*
+ * rcu read lock for napi hash
+ * bh so we don't race with net_rx_action
+ */
+ rcu_read_lock_bh();
+
+ napi = napi_by_id(sk->sk_napi_id);
+ if (!napi)
+ goto out;
+
+ ops = napi->dev->netdev_ops;
+ if (!ops->ndo_ll_poll)
+ goto out;
+
+ do {
+
+ rc = ops->ndo_ll_poll(napi);
+
+ if (rc == LL_FLUSH_FAILED)
+ break; /* permanent failure */
+
+ if (rc > 0)
+ /* local bh are disabled so it is ok to use _BH */
+ NET_ADD_STATS_BH(sock_net(sk),
+ LINUX_MIB_LOWLATENCYRXPACKETS, rc);
+
+ } while (skb_queue_empty(&sk->sk_receive_queue)
+ && can_poll_ll(end_time) && !nonblock);
+
+ rc = !skb_queue_empty(&sk->sk_receive_queue);
+out:
+ rcu_read_unlock_bh();
+ return rc;
+}
+
+/* used in the NIC receive handler to mark the skb */
+static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi)
+{
+ skb->napi_id = napi->napi_id;
+}
+
+/* used in the protocol hanlder to propagate the napi_id to the socket */
+static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
+{
+ sk->sk_napi_id = skb->napi_id;
+}
+
+#else /* CONFIG_NET_LL_RX_POLL */
+
+static inline cycles_t ll_end_time(void)
+{
+ return 0;
+}
+
+static inline bool sk_valid_ll(struct sock *sk)
+{
+ return false;
+}
+
+static inline bool sk_poll_ll(struct sock *sk, int nonblock)
+{
+ return false;
+}
+
+static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi)
+{
+}
+
+static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
+{
+}
+
+static inline bool can_poll_ll(cycles_t end_time)
+{
+ return false;
+}
+
+#endif /* CONFIG_NET_LL_RX_POLL */
+#endif /* _LINUX_NET_LL_POLL_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 66772cf8c3c5..ac8e1818380c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -229,6 +229,7 @@ struct cg_proto;
  * @sk_omem_alloc: "o" is "option" or "other"
  * @sk_wmem_queued: persistent queue size
  * @sk_forward_alloc: space allocated forward
+ * @sk_napi_id: id of the last napi context to receive data for sk
  * @sk_allocation: allocation mode
  * @sk_sndbuf: size of send buffer in bytes
  * @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
@@ -325,6 +326,9 @@ struct sock {
 #ifdef CONFIG_RPS
  __u32 sk_rxhash;
 #endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ unsigned int sk_napi_id;
+#endif
  atomic_t sk_drops;
  int sk_rcvbuf;
 
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index df2e8b4f9c03..26cbf76f8058 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -253,6 +253,7 @@ enum
  LINUX_MIB_TCPFASTOPENLISTENOVERFLOW, /* TCPFastOpenListenOverflow */
  LINUX_MIB_TCPFASTOPENCOOKIEREQD, /* TCPFastOpenCookieReqd */
  LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES, /* TCPSpuriousRtxHostQueues */
+ LINUX_MIB_LOWLATENCYRXPACKETS, /* LowLatencyRxPackets */
  __LINUX_MIB_MAX
 };
 
diff --git a/net/Kconfig b/net/Kconfig
index 523e43e6da1b..d6a9ce6e1800 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -243,6 +243,18 @@ config NETPRIO_CGROUP
  Cgroup subsystem for use in assigning processes to network priorities on
  a per-interface basis
 
+config NET_LL_RX_POLL
+ bool "Low Latency Receive Poll"
+ depends on X86_TSC
+ default n
+ ---help---
+ Support Low Latency Receive Queue Poll.
+ (For network card drivers which support this option.)
+ When waiting for data in read or poll call directly into the the device driver
+ to flush packets which may be pending on the device queues into the stack.
+
+ If unsure, say N.
+
 config BQL
  boolean
  depends on SYSFS
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b71423db7785..9cbaba98ce4c 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -56,6 +56,7 @@
 #include <net/sock.h>
 #include <net/tcp_states.h>
 #include <trace/events/skb.h>
+#include <net/ll_poll.h>
 
 /*
  * Is a socket 'connection oriented' ?
@@ -207,6 +208,9 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
  }
  spin_unlock_irqrestore(&queue->lock, cpu_flags);
 
+ if (sk_valid_ll(sk) && sk_poll_ll(sk, flags & MSG_DONTWAIT))
+ continue;
+
  /* User doesn't want to wait */
  error = -EAGAIN;
  if (!timeo)
diff --git a/net/core/dev.c b/net/core/dev.c
index 9c18557f93c6..fa007dba6beb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@
 #include <linux/inetdevice.h>
 #include <linux/cpu_rmap.h>
 #include <linux/static_key.h>
+#include <linux/hashtable.h>
 
 #include "net-sysfs.h"
 
@@ -166,6 +167,12 @@ static struct list_head offload_base __read_mostly;
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+/* protects napi_hash addition/deletion and napi_gen_id */
+static DEFINE_SPINLOCK(napi_hash_lock);
+
+static unsigned int napi_gen_id;
+static DEFINE_HASHTABLE(napi_hash, 8);
+
 seqcount_t devnet_rename_seq;
 
 static inline void dev_base_seq_inc(struct net *net)
@@ -4136,6 +4143,58 @@ void napi_complete(struct napi_struct *n)
 }
 EXPORT_SYMBOL(napi_complete);
 
+/* must be called under rcu_read_lock(), as we dont take a reference */
+struct napi_struct *napi_by_id(unsigned int napi_id)
+{
+ unsigned int hash = napi_id % HASH_SIZE(napi_hash);
+ struct napi_struct *napi;
+
+ hlist_for_each_entry_rcu(napi, &napi_hash[hash], napi_hash_node)
+ if (napi->napi_id == napi_id)
+ return napi;
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(napi_by_id);
+
+void napi_hash_add(struct napi_struct *napi)
+{
+ if (!test_and_set_bit(NAPI_STATE_HASHED, &napi->state)) {
+
+ spin_lock(&napi_hash_lock);
+
+ /* 0 is not a valid id, we also skip an id that is taken
+ * we expect both events to be extremely rare
+ */
+ napi->napi_id = 0;
+ while (!napi->napi_id) {
+ napi->napi_id = ++napi_gen_id;
+ if (napi_by_id(napi->napi_id))
+ napi->napi_id = 0;
+ }
+
+ hlist_add_head_rcu(&napi->napi_hash_node,
+ &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+
+ spin_unlock(&napi_hash_lock);
+ }
+}
+EXPORT_SYMBOL_GPL(napi_hash_add);
+
+/* Warning : caller is responsible to make sure rcu grace period
+ * is respected before freeing memory containing @napi
+ */
+void napi_hash_del(struct napi_struct *napi)
+{
+ spin_lock(&napi_hash_lock);
+
+ if (test_and_clear_bit(NAPI_STATE_HASHED, &napi->state))
+ hlist_del_rcu(&napi->napi_hash_node);
+
+ spin_unlock(&napi_hash_lock);
+}
+EXPORT_SYMBOL_GPL(napi_hash_del);
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
  int (*poll)(struct napi_struct *, int), int weight)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 73f57a0e1523..4a4181e16c1a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -733,6 +733,10 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
  new->vlan_tci = old->vlan_tci;
 
  skb_copy_secmark(new, old);
+
+#ifdef CONFIG_NET_LL_RX_POLL
+ new->napi_id = old->napi_id;
+#endif
 }
 
 /*
diff --git a/net/core/sock.c b/net/core/sock.c
index 88868a9d21da..788c0da5eed1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -139,6 +139,8 @@
 #include <net/tcp.h>
 #endif
 
+#include <net/ll_poll.h>
+
 static DEFINE_MUTEX(proto_list_mutex);
 static LIST_HEAD(proto_list);
 
@@ -2284,6 +2286,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
  sk->sk_stamp = ktime_set(-1L, 0);
 
+#ifdef CONFIG_NET_LL_RX_POLL
+ sk->sk_napi_id = 0;
+#endif
+
  /*
  * Before updating sk_refcnt, we must commit prior changes to memory
  * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 741db5fc7806..4b48f39582b0 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -19,6 +19,7 @@
 #include <net/ip.h>
 #include <net/sock.h>
 #include <net/net_ratelimit.h>
+#include <net/ll_poll.h>
 
 static int one = 1;
 
@@ -284,6 +285,15 @@ static struct ctl_table net_core_table[] = {
  .proc_handler = flow_limit_table_len_sysctl
  },
 #endif /* CONFIG_NET_FLOW_LIMIT */
+#ifdef CONFIG_NET_LL_RX_POLL
+ {
+ .procname = "low_latency_poll",
+ .data = &sysctl_net_ll_poll,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax
+ },
+#endif
 #endif /* CONFIG_NET */
  {
  .procname = "netdev_budget",
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 2a5bf86d2415..6577a1149a47 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -273,6 +273,7 @@ static const struct snmp_mib snmp4_net_list[] = {
  SNMP_MIB_ITEM("TCPFastOpenListenOverflow", LINUX_MIB_TCPFASTOPENLISTENOVERFLOW),
  SNMP_MIB_ITEM("TCPFastOpenCookieReqd", LINUX_MIB_TCPFASTOPENCOOKIEREQD),
  SNMP_MIB_ITEM("TCPSpuriousRtxHostQueues", LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES),
+ SNMP_MIB_ITEM("LowLatencyRxPackets", LINUX_MIB_LOWLATENCYRXPACKETS),
  SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bc4246940f6c..46ed9afd1f5e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
+#include <net/ll_poll.h>
 
 int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 
@@ -1553,6 +1554,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
  struct sk_buff *skb;
  u32 urg_hole = 0;
 
+ if (sk_valid_ll(sk) && skb_queue_empty(&sk->sk_receive_queue)
+ && (sk->sk_state == TCP_ESTABLISHED))
+ sk_poll_ll(sk, nonblock);
+
  lock_sock(sk);
 
  err = -ENOTCONN;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 289039b4d8de..1063bb83e342 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -75,6 +75,7 @@
 #include <net/netdma.h>
 #include <net/secure_seq.h>
 #include <net/tcp_memcontrol.h>
+#include <net/ll_poll.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -1993,6 +1994,7 @@ process:
  if (sk_filter(sk, skb))
  goto discard_and_relse;
 
+ sk_mark_ll(sk, skb);
  skb->dev = NULL;
 
  bh_lock_sock_nested(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c7338ec79cc0..2955b25aee6d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -109,6 +109,7 @@
 #include <trace/events/udp.h>
 #include <linux/static_key.h>
 #include <trace/events/skb.h>
+#include <net/ll_poll.h>
 #include "udp_impl.h"
 
 struct udp_table udp_table __read_mostly;
@@ -1709,7 +1710,10 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
  sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
 
  if (sk != NULL) {
- int ret = udp_queue_rcv_skb(sk, skb);
+ int ret;
+
+ sk_mark_ll(sk, skb);
+ ret = udp_queue_rcv_skb(sk, skb);
  sock_put(sk);
 
  /* a return value > 0 means to resubmit the input, but
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0a17ed9eaf39..5cffa5c3e6b8 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -63,6 +63,7 @@
 #include <net/inet_common.h>
 #include <net/secure_seq.h>
 #include <net/tcp_memcontrol.h>
+#include <net/ll_poll.h>
 
 #include <asm/uaccess.h>
 
@@ -1498,6 +1499,7 @@ process:
  if (sk_filter(sk, skb))
  goto discard_and_relse;
 
+ sk_mark_ll(sk, skb);
  skb->dev = NULL;
 
  bh_lock_sock_nested(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index b5808539cd5c..f77e34c5a0e2 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -46,6 +46,7 @@
 #include <net/ip6_checksum.h>
 #include <net/xfrm.h>
 #include <net/inet6_hashtables.h>
+#include <net/ll_poll.h>
 
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -841,7 +842,10 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
  */
  sk = __udp6_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
  if (sk != NULL) {
- int ret = udpv6_queue_rcv_skb(sk, skb);
+ int ret;
+
+ sk_mark_ll(sk, skb);
+ ret = udpv6_queue_rcv_skb(sk, skb);
  sock_put(sk);
 
  /* a return value > 0 means to resubmit the input, but
diff --git a/net/socket.c b/net/socket.c
index 3ebdcb805c51..21fd29f63ed2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -104,6 +104,12 @@
 #include <linux/route.h>
 #include <linux/sockios.h>
 #include <linux/atalk.h>
+#include <net/ll_poll.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+unsigned long sysctl_net_ll_poll __read_mostly;
+EXPORT_SYMBOL_GPL(sysctl_net_ll_poll);
+#endif
 
 static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
 static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,