path: root/net/ipv4/tcp_input.c
Commit message [Author, Age]
...
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net [David S. Miller, 2015-07-23]
|\ Conflicts: net/bridge/br_mdb.c. The br_mdb.c conflict was a function call being removed to fix a bug in 'net' but whose signature was changed in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: don't use F-RTO on non-recurring timeouts [Yuchung Cheng, 2015-07-15]
| | Currently F-RTO may repeatedly send new data packets on non-recurring timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682) should only be used on either new recovery or recurring timeouts. This slows recovery during frequent timeout and repair, because we prioritize sending new data packets instead of repairing the holes when the bandwidth is already scarce. Fix it by correcting the test of a new recovery episode. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: track success and failure of TCP PMTU probing [Rick Jones, 2015-07-22]
| | Track success and failure of TCP PMTU probing. Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: update congestion state first before raising cwnd [Yuchung Cheng, 2015-07-09]
| | The congestion state and cwnd can be updated in the wrong order. For example, upon receiving a dubious ACK, we incorrectly raise the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because the state is still Open, then enter recovery state to reduce cwnd. For another example, if the ACK indicates spurious timeout or retransmits, we first revert the cwnd reduction and congestion state back to Open state. But we don't raise the cwnd even though the ACK does not indicate any congestion. To fix this problem we should first call tcp_fastretrans_alert() to process the dubious ACK and update the congestion state, then call tcp_may_raise_cwnd() that raises cwnd based on the current state. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: v1 always send a quick ack when quickacks are enabled [Jon Maxwell, 2015-07-09]
| | V1 of this patch contains Eric Dumazet's suggestion to move the per dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric. I ran some tests and after setting the "ip route change quickack 1" knob there were still many delayed ACKs sent. This occurred because when icsk_ack.quick=0 the !icsk_ack.pingpong value is subsequently ignored as tcp_in_quickack_mode() checks both these values. The condition for a quick ack to trigger requires that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently only icsk_ack.pingpong is controlled by the knob. But the icsk_ack.quick value changes dynamically depending on heuristics. The crux of the matter is that delayed acks still cannot be entirely disabled even with the RTAX_QUICKACK per dst knob enabled. This patch ensures that a quick ack is always sent when the RTAX_QUICKACK per dst knob is turned on.
| | The "ip route change quickack 1" knob was recently added to enable quickacks. It was modeled around the TCP_QUICKACK setsockopt() option. The issue is that even with "ip route change quickack 1" enabled we still see delayed ACKs under some conditions. It would be nice to be able to completely disable delayed ACKs. Here is an example:
| |     # netstat -s|grep dela
| |     3 delayed acks sent
| | For all routes enable the knob:
| |     # ip route change quickack 1
| | Generate some traffic across a slow link and we still see the delayed acks:
| |     # netstat -s|grep dela
| |     106 delayed acks sent
| |     1 delayed acks further delayed because of locked socket
| | The issue is that both the "ip route change quickack 1" knob and the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0. However at the business end in the __tcp_ack_snd_check() routine, tcp_in_quickack_mode() checks that both icsk_ack.quick != 0 and icsk_ack.pingpong=0 in order to trigger a quickack. As icsk_ack.quick is determined by heuristics it can be 0. When that occurs the icsk_ack.pingpong value is ignored and a delayed ACK is sent regardless. This patch moves the RTAX_QUICKACK per dst check into the tcp_in_quickack_mode() routine which ensures that a quickack is always sent when the quickack knob is enabled for that dst. Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
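The before/after condition described in this commit message can be sketched as a pair of predicates. This is a minimal Python model rather than the kernel code: `AckState`, `in_quickack_mode_old`, and `in_quickack_mode_new` are hypothetical names, and `dst_quickack` stands in for the per-dst RTAX_QUICKACK metric.

```python
from dataclasses import dataclass


@dataclass
class AckState:
    quick: int       # models icsk_ack.quick, a budget driven by heuristics
    pingpong: bool   # models icsk_ack.pingpong, True for interactive sessions


def in_quickack_mode_old(ack, dst_quickack):
    # Before the patch the route knob is not consulted here; it only
    # clears pingpong elsewhere, so quick == 0 still delays the ACK.
    return ack.quick != 0 and not ack.pingpong


def in_quickack_mode_new(ack, dst_quickack):
    # After the patch the route knob short-circuits the heuristics.
    return dst_quickack or (ack.quick != 0 and not ack.pingpong)
```

With `quick == 0` and the knob enabled, the old predicate reports a delayed ACK while the new one reports a quick ACK, which matches the symptom in the netstat output above.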
* | tcp: PRR uses CRB mode by default and SS mode conditionally [Yuchung Cheng, 2015-07-08]
| | PRR slow start is often too aggressive especially when drops are caused by traffic policers. The policers mainly use token bucket to enforce the rate so sending (twice) faster than the delivery rate causes excessive drops. This patch changes PRR to the conservative reduction bound (CRB) mode in RFC 6937 by default. CRB follows the packet conservation rule to send at most the delivery rate by default. But if many packets are lost and the pipe is empty, CRB may take N round trips to repair N losses. We conditionally turn on slow start mode if all these conditions are met to speed up the recovery:
| |   1) on the second round or later in recovery
| |   2) retransmission sent in the previous round is delivered on this ACK
| |   3) no retransmission is marked lost on this ACK
| | By using packet conservation by default, this change reduces the loss retransmits significantly on networks that deploy traffic policers, up to 20% reduction of overall loss rate. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
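The choice between proportional reduction, CRB, and conditional slow start can be sketched as one function that returns how many packets an ACK allows the sender to transmit. This is a simplified Python model based on the RFC 6937 formulas, not the kernel function: `prr_sndcnt` and its parameters are hypothetical names, conditions 2 and 3 are passed in as flags (condition 1 is assumed to follow from condition 2), and the fast-retransmit special case is omitted.

```python
def prr_sndcnt(ssthresh, prior_cwnd, in_flight, prr_delivered, prr_out,
               newly_acked_sacked, retrans_data_acked, lost_retrans):
    """Packets this ACK allows us to send during recovery (sketch)."""
    prr_delivered += newly_acked_sacked
    delta = ssthresh - in_flight
    if delta < 0:
        # Still above ssthresh: reduce proportionally to delivered data,
        # rounding the target up as RFC 6937 does with CEIL().
        target = (ssthresh * prr_delivered + prior_cwnd - 1) // prior_cwnd
        sndcnt = target - prr_out
    elif retrans_data_acked and not lost_retrans:
        # Slow start mode: a retransmission was delivered and none was lost.
        sndcnt = min(delta,
                     max(prr_delivered - prr_out, newly_acked_sacked) + 1)
    else:
        # CRB (default): send no more than what was just delivered.
        sndcnt = min(delta, newly_acked_sacked)
    return max(sndcnt, 0)
```

For example, with `ssthresh=10`, `in_flight=6`, and two packets newly delivered, CRB allows 2 packets, while the slow start branch allows 3 when three packets had been delivered and sent before this ACK.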
* | tcp: reduce cwnd if retransmit is lost in CA_Loss [Yuchung Cheng, 2015-07-08]
|/ If the retransmission in CA_Loss is lost again, we should not continue to slow start or raise cwnd in congestion avoidance mode. Instead we should enter fast recovery and use PRR to reduce cwnd, following the principle in RFC5681: "... or the loss of a retransmission, should be taken as two indications of congestion and, therefore, cwnd (and ssthresh) MUST be lowered twice in this case." This is especially important to reduce loss when the CA_Loss state was caused by a traffic policer dropping the entire inflight. The CA_Loss state has a problem where a loss of L packets causes the sender to send a burst of L packets. So a policer that's dropping most packets in a given RTT can cause a huge retransmit storm. By contrast, PRR includes logic to bound the number of outbound packets that result from a given ACK. So switching to CA_Recovery on lost retransmits in CA_Loss avoids this retransmit storm problem when in CA_Loss. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fill shinfo->gso_size at last moment [Eric Dumazet, 2015-06-11]
| In commit cd7d8498c9a5 ("tcp: change tcp_skb_pcount() location") we stored gso_segs in a temporary cache hot location. This patch does the same for gso_size. This saves 2 cache line misses in the tcp xmit path for the last packet that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fill shinfo->gso_type at last moment [Eric Dumazet, 2015-06-11]
| Our goal is to touch skb_shinfo(skb) only when absolutely needed, to avoid two cache line misses in TCP output path for last skb that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...). A packet is GSO only when skb_shinfo(skb)->gso_size is not zero. We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for non GSO packets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: export tcp_enter_cwr() [Kenneth Klette Jonassen, 2015-06-11]
| Upcoming tcp_cdg uses tcp_enter_cwr() to initiate PRR. Export this function so that CDG can be compiled as a module. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: David Hayes <davihay@ifi.uio.no> Cc: Andreas Petlund <apetlund@simula.no> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net [David S. Miller, 2015-05-23]
|\ Conflicts: drivers/net/ethernet/cadence/macb.c, drivers/net/phy/phy.c, include/linux/skbuff.h, net/ipv4/tcp.c, net/switchdev/switchdev.c. Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zynq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix a potential deadlock in tcp_get_info() [Eric Dumazet, 2015-05-22]
| | Taking socket spinlock in tcp_get_info() can deadlock, as inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i], while packet processing can use the reverse locking order. We could avoid this locking for TCP_LISTEN states, but lockdep would certainly get confused as all TCP sockets share same lockdep classes.
| |     [ 523.722504] ======================================================
| |     [ 523.728706] [ INFO: possible circular locking dependency detected ]
| |     [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
| |     [ 523.739202] -------------------------------------------------------
| |     [ 523.745474] ss/18032 is trying to acquire lock:
| |     [ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
| |     [ 523.758129]
| |     [ 523.758129] but task is already holding lock:
| |     [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
| |     [ 523.774661]
| |     [ 523.774661] which lock already depends on the new lock.
| |     [ 523.774661]
| |     [ 523.782850]
| |     [ 523.782850] the existing dependency chain (in reverse order) is:
| |     [ 523.790326] -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
| |     [ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270
| |     [ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
| |     [ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
| |     [ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
| |     [ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
| |     [ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
| |     [ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
| |     [ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
| |     [ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
| |     [ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
| |     [ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420
| | Let's use u64_sync infrastructure instead. As a bonus, 64bit arches get optimized, as these are nop for them. Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: don't over-send F-RTO probes [Yuchung Cheng, 2015-05-19]
| | After sending the new data packets to probe (step 2), F-RTO may incorrectly send more probes if the next ACK advances SND_UNA and does not sack new packet. However F-RTO RFC 5682 probes at most once. This bug may cause sender to always send new data instead of repairing holes, inducing longer HoL blocking on the receiver for the application. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: only undo on partial ACKs in CA_Loss [Yuchung Cheng, 2015-05-19]
| | Undo based on TCP timestamps should only happen on ACKs that advance SND_UNA, according to the Eifel algorithm in RFC 3522:
| |   Section 3.2: (4) If the value of the Timestamp Echo Reply field of the acceptable ACK's Timestamps option is smaller than the value of RetransmitTS, then proceed to step (5),
| |   Section Terminology: We use the term 'acceptable ACK' as defined in [RFC793]. That is an ACK that acknowledges previously unacknowledged data.
| | This is because upon receiving an out-of-order packet, the receiver returns the last timestamp that advances RCV_NXT, not the current timestamp of the packet in the DUPACK. Without checking the flag, the DUPACK will cause tcp_packet_delayed() to return true and tcp_try_undo_loss() will revert cwnd reduction. Note that we check the condition in CA_Recovery already by only calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or tcp_try_undo_recovery() if snd_una crosses high_seq. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
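The Eifel check and the added "advances SND_UNA" guard can be sketched as follows. This is a simplified Python model: the kernel's undo path checks more state (undo markers, retransmit counts), `may_undo_loss` is a hypothetical name, and timestamps are treated as 32-bit values that may wrap.

```python
def tcp_packet_delayed(retrans_stamp, rcv_tsecr):
    """True if the echoed timestamp predates the first retransmission."""
    if not retrans_stamp:
        return True
    # Serial-number comparison: rcv_tsecr is "before" retrans_stamp when
    # the 32-bit difference, read as a signed value, is negative.
    diff = (rcv_tsecr - retrans_stamp) & 0xFFFFFFFF
    return rcv_tsecr != 0 and diff >= 0x80000000


def may_undo_loss(snd_una_advanced, retrans_stamp, rcv_tsecr):
    # The fix: a DUPACK echoes an old timestamp by design, so only an
    # ACK that advances SND_UNA may trigger a timestamp-based undo.
    return snd_una_advanced and tcp_packet_delayed(retrans_stamp, rcv_tsecr)
```

A DUPACK that echoes timestamp 900 after a retransmission stamped 1000 satisfies `tcp_packet_delayed()` but is rejected by `may_undo_loss()` because it does not advance SND_UNA.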
* | tcp: allow one skb to be received per socket under memory pressure [Eric Dumazet, 2015-05-17]
| | While testing tight tcp_mem settings, I found tcp sessions could be stuck because we do not allow even one skb to be received on them. By allowing one skb to be received, we introduce fairness and eventually force memory hogs to release their allocation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: introduce tcp_under_memory_pressure() [Eric Dumazet, 2015-05-17]
| | Introduce an optimized version of sk_under_memory_pressure() for TCP. Our intent is to use it in fast paths. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: adjust window probe timers to safer values [Eric Dumazet, 2015-05-09]
| | With the advent of small rto timers in datacenter TCP (ip route ... rto_min x), the following can happen:
| |   1) Qdisc is full, transmit fails. TCP sets a timer based on icsk_rto to retry the transmit, without exponential backoff. With low icsk_rto, and a lot of sockets, all cpus are servicing timer interrupts like crazy. Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL).
| |   2) Receivers can send zero windows if they don't drain their receive queue. TCP sends zero window probes, based on icsk_rto current value, with exponential backoff. With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in some cases), sender can abort in less than one or two minutes! If receiver stops the sender, it obviously doesn't care about a very tight rto. Probability of dropping the ACK reopening the window is not worth the risk.
| | Let's change the base timer to be at least 200ms (TCP_RTO_MIN) for these events (but not normal RTO based retransmits). A followup patch adds a new SNMP counter, as it would have helped a lot diagnosing this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
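The effect of flooring the probe timer base can be sketched in a few lines. This is a Python model in milliseconds rather than jiffies; `probe0_base` and `probe0_when` are named after the idea, not copied from the kernel, and the 120-second cap is an assumption standing in for whatever maximum the caller passes.

```python
TCP_RTO_MIN_MS = 200        # floor named in the commit message
TCP_RTO_MAX_MS = 120_000    # assumed cap for this sketch


def probe0_base(icsk_rto_ms):
    """Base interval for probes: never below TCP_RTO_MIN."""
    return max(icsk_rto_ms, TCP_RTO_MIN_MS)


def probe0_when(icsk_rto_ms, backoff, max_when_ms=TCP_RTO_MAX_MS):
    """Delay before the next zero-window probe, with exponential backoff."""
    return min(probe0_base(icsk_rto_ms) << backoff, max_when_ms)
```

With a 5 ms RTO, the first probe now waits 200 ms instead of 5 ms, and the third backoff step waits 1600 ms instead of 40 ms, so a stalled receiver gets far more time before the retry limit is reached.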
* | tcp: provide SYN headers for passive connections [Eric Dumazet, 2015-05-05]
| | This patch allows a server application to get the TCP SYN headers for its passive connections. This is useful if the server is doing fingerprinting of clients based on SYN packet contents. Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN. The first is used on a socket to enable saving the SYN headers for child connections. This can be set before or after the listen() call. The latter is used to retrieve the SYN headers for passive connections, if the parent listener has enabled TCP_SAVE_SYN. TCP_SAVED_SYN is read once, it frees the saved SYN headers. The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP headers. Original patch was written by Tom Herbert, I changed it to not hold a full skb (and associated dst and conntracking reference). We have used such a patch for about 3 years at Google. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
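From user space, the two options might be used roughly as below. This is a hedged Python sketch: the numeric fallbacks 27 and 28 are taken from Linux's uapi tcp.h and should be verified on the target system, the helper names are hypothetical, and `syn_summary` assumes an IPv6 header with no extension headers.

```python
import socket
import struct

# Python's socket module may not export these; the numeric fallbacks are
# an assumption based on Linux's include/uapi/linux/tcp.h.
TCP_SAVE_SYN = getattr(socket, "TCP_SAVE_SYN", 27)
TCP_SAVED_SYN = getattr(socket, "TCP_SAVED_SYN", 28)


def enable_save_syn(listener):
    """Ask the kernel to keep SYN headers for children of this listener."""
    listener.setsockopt(socket.IPPROTO_TCP, TCP_SAVE_SYN, 1)


def read_saved_syn(conn, bufsize=256):
    """Fetch the saved headers once; the kernel frees them on this read."""
    return conn.getsockopt(socket.IPPROTO_TCP, TCP_SAVED_SYN, bufsize)


def syn_summary(headers):
    """Return (ip_version, src_port, dst_port, syn_flag) from the headers."""
    version = headers[0] >> 4
    # IPv4 encodes its header length in the IHL field; for IPv6 this
    # sketch assumes the fixed 40-byte header only.
    ip_len = (headers[0] & 0x0F) * 4 if version == 4 else 40
    tcp = headers[ip_len:]
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    return version, src_port, dst_port, bool(tcp[13] & 0x02)
```

A server would call `enable_save_syn()` on the listening socket, then `syn_summary(read_saved_syn(conn))` on each accepted connection; a second read on the same connection is expected to return nothing because the headers are freed.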
* | tcp: invoke pkts_acked hook on every ACK [Kenneth Klette Jonassen, 2015-05-03]
| | Invoking pkts_acked is currently conditioned on FLAG_ACKED: receiving a cumulative ACK of new data, or ACK with SYN flag set. Remove this condition so that CC may get RTT measurements from all SACKs. Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: improve RTT from SACK for CC [Kenneth Klette Jonassen, 2015-05-03]
| | tcp_sacktag_one() always picks the earliest sequence SACKed for RTT. This might not make sense for congestion control in cases where:
| |   1. ACKs are lost, i.e. a SACK following a lost SACK covers both new and old segments at the receiver.
| |   2. The receiver disregards the RFC 5681 recommendation to immediately ACK out-of-order segments.
| | Give congestion control an RTT for the latest segment SACKed, which is the most accurate RTT estimate, but preserve the conservative RTT for RTO. Removes the call to skb_mstamp_get() in tcp_sacktag_one(). Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
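The two RTT samples described here can be illustrated with a small function. This is a Python sketch, not kernel code: `sack_rtts` is a hypothetical helper that takes the send times (in microseconds) of the segments newly SACKed by one ACK.

```python
def sack_rtts(now_us, sacked_send_times_us):
    """Return (rto_rtt_us, cc_rtt_us) for one ACK's newly SACKed segments."""
    first_sent = min(sacked_send_times_us)  # earliest segment: largest RTT
    last_sent = max(sacked_send_times_us)   # latest segment: freshest RTT
    # RTO keeps the conservative sample; congestion control gets the
    # sample from the most recently sent segment.
    return now_us - first_sent, now_us - last_sent
```

If a SACK arriving at t=1000 covers segments sent at t=100, 400, and 900 (for example because an earlier SACK was lost), the RTO sample is 900 us while congestion control sees 100 us.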
* | tcp: move struct tcp_sacktag_state to tcp_ack() [Kenneth Klette Jonassen, 2015-05-03]
|/ Later patch passes two values set in tcp_sacktag_one() to tcp_clean_rtx_queue(). Prepare passing them via struct tcp_sacktag_state. Acked-by: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: update reordering first before detecting loss [Yuchung Cheng, 2015-04-29]
| tcp_mark_lost_retrans is not used when FACK is disabled. Since tcp_update_reordering may disable FACK, it should be called first before tcp_mark_lost_retrans. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add tcpi_bytes_received to tcp_info [Eric Dumazet, 2015-04-29]
| This patch tracks total number of payload bytes received on a TCP socket. This is the sum of all changes done to tp->rcv_nxt. RFC4898 named this: tcpEStatsAppHCThruOctetsReceived. This is a 64bit field, and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow). Note that tp->bytes_received was placed near tp->rcv_nxt for best data locality and minimal performance impact. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Eric Salo <salo@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add tcpi_bytes_acked to tcp_info [Eric Dumazet, 2015-04-29]
| This patch tracks total number of bytes acked for a TCP socket. This is the sum of all changes done to tp->snd_una, and allows for precise tracking of delivered data. RFC4898 named this: tcpEStatsAppHCThruOctetsAcked. This is a 64bit field, and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow). Note that tp->bytes_acked was placed near tp->snd_una for best data locality and minimal performance impact. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Eric Salo <salo@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
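Reading the two new counters through TCP_INFO getsockopt() might look like the sketch below. The byte offset and minimum length are assumptions about the struct tcp_info layout at the time of these patches (two adjacent 64-bit fields, tcpi_bytes_acked followed by tcpi_bytes_received); check them against the linux/tcp.h of the target kernel. The function names are hypothetical.

```python
import socket
import struct

TCP_INFO = getattr(socket, "TCP_INFO", 11)
# Assumed layout: tcpi_bytes_acked at byte 120, tcpi_bytes_received
# right after it, so the struct must be at least 136 bytes long.
BYTES_ACKED_OFFSET = 120
TCP_INFO_MIN_LEN = 136


def parse_byte_counters(info):
    """Return (bytes_acked, bytes_received), or None on older kernels."""
    if len(info) < TCP_INFO_MIN_LEN:
        return None
    return struct.unpack_from("=QQ", info, BYTES_ACKED_OFFSET)


def read_byte_counters(sock):
    """Query a connected TCP socket for its byte counters."""
    info = sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 256)
    return parse_byte_counters(info)
```

Returning `None` for a short buffer keeps the helper usable on kernels that predate these fields, since such kernels return a shorter struct.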
* tcp: add memory barriers to write space paths [jbaron@akamai.com, 2015-04-21]
| Ensure that we either see that the buffer has write space in tcp_poll() or that we perform a wakeup from the input side. Did not run into any actual problem here, but thought that we should make things explicit. Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix bogus RTT for CC when retransmissions are acked [Kenneth Klette Jonassen, 2015-04-13]
| Since retransmitted segments are not used for RTT estimation, previously SACKed segments present in the rtx queue are used. This estimation can be several times larger than the actual RTT. When a cumulative ack covers both previously SACKed and retransmitted segments, CC may thus get a bogus RTT. Such segments previously had an RTT estimation in tcp_sacktag_one(), so it seems reasonable to not reuse them in tcp_clean_rtx_queue() at all. Afaik, this has had no effect on SRTT/RTO because of Karn's check. Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: RFC7413 option support for Fast Open client [Daniel Lee, 2015-04-07]
| Fast Open has been using an experimental option with a magic number (RFC6994). This patch makes the client by default use the RFC7413 option (34) to get and send Fast Open cookies. This patch makes the client solicit cookies from a given server first with the RFC7413 option. If that fails to elicit a cookie, then it tries the RFC6994 experimental option. If that also fails, it uses the RFC7413 option on all subsequent connect attempts. If the server returns a Fast Open cookie then the client caches the form of the option that successfully elicited a cookie, and uses that form on later connects when it presents that cookie. The idea is to gradually obsolete the use of experimental options as the servers and clients upgrade, while keeping the interoperability meanwhile. Signed-off-by: Daniel Lee <Longinus00@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
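The fallback order for soliciting a cookie can be modeled as a small per-server cache. This Python sketch follows the prose of the commit message only; `CookieCache`, its fields, and the string labels are hypothetical and do not mirror the kernel's metrics cache.

```python
RFC7413 = "rfc7413"            # official option kind 34
EXPERIMENTAL = "rfc6994-exp"   # experimental option with magic number


class CookieCache:
    """Per-server Fast Open state (sketch)."""

    def __init__(self):
        self.cookie = None
        self.cookie_form = None
        self.failed_requests = 0   # cookie requests that got no cookie

    def next_option(self):
        """Option form to use on the next connect to this server."""
        if self.cookie is not None:
            return self.cookie_form      # present the cookie as obtained
        # First try RFC7413, then the experimental form once, then
        # RFC7413 on all subsequent attempts.
        return EXPERIMENTAL if self.failed_requests == 1 else RFC7413

    def on_synack(self, form_used, cookie):
        """Record the outcome of a connect that used form_used."""
        if cookie is not None:
            self.cookie, self.cookie_form = cookie, form_used
        elif self.cookie is None:
            self.failed_requests = min(self.failed_requests + 1, 2)
```

A server that only understands the experimental option yields a cookie on the second attempt, after which the client keeps using the experimental form with that cookie.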
* tcp: RFC7413 option support for Fast Open server [Daniel Lee, 2015-04-07]
| Fast Open has been using the experimental option with a magic number (RFC6994) to request and grant Fast Open cookies. This patch enables the server to support the official IANA option 34 in RFC7413 in addition. The change has passed all existing Fast Open tests with both old and new options at Google. Signed-off-by: Daniel Lee <Longinus00@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net [David S. Miller, 2015-04-06]
|\ Conflicts: drivers/net/ethernet/mellanox/mlx4/cmd.c, net/core/fib_rules.c, net/ipv4/fib_frontend.c. The fib_rules.c and fib_frontend.c conflicts were locking adjustments in 'net' overlapping addition and removal of code in 'net-next'. The mlx4 conflict was a bug fix in 'net' happening in the same place a constant was being replaced with a more suitable macro. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix FRTO undo on cumulative ACK of SACKed range [Neal Cardwell, 2015-04-02]
| | On processing cumulative ACKs, the FRTO code was not checking the SACKed bit, meaning that there could be a spurious FRTO undo on a cumulative ACK of a previously SACKed skb. The FRTO code should only consider a cumulative ACK to indicate that an original/unretransmitted skb is newly ACKed if the skb was not yet SACKed. The effect of the spurious FRTO undo would typically be to make the connection think that all previously-sent packets were in flight when they really weren't, leading to a stall and an RTO. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: e33099f96d99c ("tcp: implement RFC5682 F-RTO") Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: coding style: comparison for inequality with NULL [Ian Morris, 2015-04-03]
| | The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: coding style: comparison for equality with NULL [Ian Morris, 2015-04-03]
| | The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: tcp_syn_flood_action() can be static [Eric Dumazet, 2015-03-29]
| | After commit 1fb6f159fd21 ("tcp: add tcp_conn_request"), tcp_syn_flood_action() is no longer used from IPv6. We can make it static, by moving it above tcp_conn_request(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: fix ipv4 mapped request socks [Eric Dumazet, 2015-03-25]
| | ss should display ipv4 mapped request sockets like this:
| |     tcp SYN-RECV 0 0 ::ffff:192.168.0.1:8080 ::ffff:192.0.2.1:35261
| | and not like this:
| |     tcp SYN-RECV 0 0 192.168.0.1:8080 192.0.2.1:35261
| | We should init ireq->ireq_family based on listener sk_family, not the actual protocol carried by SYN packet. This means we can set ireq_family in inet_reqsk_alloc(). Fixes: 3f66b083a5b7 ("inet: introduce ireq_family") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
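The display difference comes down to which address family the request socket records. A small Python sketch, with the hypothetical helper `ss_address`, shows how picking the family from the listener rather than from the SYN packet changes the printed form of the same IPv4 endpoint.

```python
import ipaddress
import socket


def ss_address(listener_family, ipv4, port):
    """Format an IPv4 endpoint as seen through a listener of a given family."""
    addr = ipaddress.IPv4Address(ipv4)   # validates the dotted quad
    if listener_family == socket.AF_INET6:
        # A dual-stack listener sees IPv4 peers as IPv4-mapped IPv6.
        return f"::ffff:{addr}:{port}"
    return f"{addr}:{port}"
```

`ss_address(socket.AF_INET6, "192.168.0.1", 8080)` gives the `::ffff:` form that the commit message expects ss to show for request sockets of an IPv6 listener.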
* | inet: drop prev pointer handling in request sock [Eric Dumazet, 2015-03-20]
| | When request socks are put in ehash table, the whole notion of having a previous request to update dl_next is pointless. Also, following patch will get rid of big purge timer, so we want to delete a request sock without holding listener lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: fix request sock refcounting [Eric Dumazet, 2015-03-17]
| | While testing last patch series, I found req sock refcounting was wrong. We must set skc_refcnt to 1 for all request socks added in hashes, but also on request sockets created by FastOpen or syncookies. It is tricky because we need to defer this initialization so that future RCU lookups do not try to take a refcount on a not yet fully initialized request socket. Also get rid of ireq_refcnt alias. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 13854e5a6046 ("inet: add proper refcounting to request sock") Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: rename struct tcp_request_sock listener [Eric Dumazet, 2015-03-17]
| | The listener field in struct tcp_request_sock is a pointer back to the listener. We now have req->rsk_listener, so TCP only needs one boolean and not a full pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: add rsk_listener field to struct request_sock [Eric Dumazet, 2015-03-17]
| | Once we are able to look up request sockets in ehash table, we'll need to get access to the listener which created this request. This avoids doing a lookup to find the listener, which makes SO_REUSEPORT more robust, and is needed once we no longer queue request sock into a listener private queue. Note that 'struct tcp_request_sock'->listener could be reduced to a single bit, as TFO listener should match req->rsk_listener. TFO will no longer need to hold a reference on the listener. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: uninline inet_reqsk_alloc() [Eric Dumazet, 2015-03-17]
| | inet_reqsk_alloc() is becoming fat and should not be inlined. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: add sk_listener argument to inet_reqsk_alloc() [Eric Dumazet, 2015-03-17]
| | listener socket can be used to set net pointer, and will be later used to hold a reference on listener. Add a const qualifier to first argument (struct request_sock_ops *), and factorize all write_pnet(&ireq->ireq_net, sock_net(sk)); Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: uninline tcp_oow_rate_limited() [Eric Dumazet, 2015-03-17]
| | tcp_oow_rate_limited() is hardly used in fast path, there is no point inlining it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: move tcp_openreq_init() to tcp_input.cEric Dumazet2015-03-17
| | This big helper is called once from tcp_conn_request(), so there is no point having it in an include. The compiler will inline it anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: fill request sock ir_iif for IPv4Eric Dumazet2015-03-14
| | Once request socks are in the ehash table, they will need a valid ir_iif field. This is currently true only for IPv6. This patch extends support to IPv4 as well. This means inet_diag_fill_req() can now properly use ir_iif, which is better for IPv6 link locals anyway, as request sockets and established sockets will propagate a consistent netlink idiag_if. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: add missing ireq_net & ir_cookie initializationsEric Dumazet2015-03-12
| | I forgot to update dccp_v6_conn_request() & cookie_v6_check(). They both need to set ireq->ireq_net and ireq->ir_cookie. Let's clear ireq->ir_cookie in inet_reqsk_alloc(). Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 33cf7c90fe2f ("net: add real socket cookies") Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix CONFIG_NET_NS=n compilationEric Dumazet2015-03-11
| | I forgot to use write_pnet() in three locations. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 33cf7c90fe2f9 ("net: add real socket cookies") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add real socket cookiesEric Dumazet2015-03-11
|/ A long-standing problem in netlink socket dumps is the use of kernel socket addresses as cookies. 1) It is a security concern. 2) Sockets can be reused quite quickly, so there is no guarantee that a cookie is used once and identifies a flow. 3) Request socks, established socks, and timewait socks for a given flow have different cookies. Part of our effort to bring better TCP statistics requires switching to a different allocator. In this patch, I chose to use a per-network-namespace 64-bit generator, and to use it only when a socket needs to be dumped to netlink. (This might be refined later if needed.) Note that I tried to carry cookies from request socks to established socks, then to timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eric Salo <salo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix tcp_should_expand_sndbuf() to use tcp_packets_in_flight()Neal Cardwell2015-02-22
| tcp_should_expand_sndbuf() does not expand the send buffer if we have filled the congestion window. However, it should use tcp_packets_in_flight() instead of tp->packets_out to make this check. Testing has established that the difference matters a lot if there are many SACKed packets, causing a needless performance shortfall. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
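To see why the choice of counter matters, here is a minimal user-space sketch of the two checks. The struct and the `demo_*` function names are hypothetical; only the field names mirror the `tcp_sock` counters, and the in-flight formula is the usual packets_out minus SACKed and lost packets plus retransmissions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the per-connection counters. */
struct demo_tp {
    uint32_t packets_out;   /* segments sent and not yet cumulatively ACKed */
    uint32_t sacked_out;    /* of those, segments already SACKed */
    uint32_t lost_out;      /* of those, segments marked lost */
    uint32_t retrans_out;   /* retransmitted segments still outstanding */
    uint32_t snd_cwnd;      /* congestion window, in segments */
};

/* Segments believed to still be in the network. */
static uint32_t demo_packets_in_flight(const struct demo_tp *tp)
{
    assert(tp->sacked_out + tp->lost_out <= tp->packets_out);
    return tp->packets_out - (tp->sacked_out + tp->lost_out) + tp->retrans_out;
}

/* Old check: treats every unACKed segment as occupying the window. */
static bool demo_cwnd_filled_old(const struct demo_tp *tp)
{
    return tp->packets_out >= tp->snd_cwnd;
}

/* New check: only segments actually in flight count against the window. */
static bool demo_cwnd_filled_new(const struct demo_tp *tp)
{
    return demo_packets_in_flight(tp) >= tp->snd_cwnd;
}
```

With 10 packets out, 6 of them SACKed, and a window of 10, the old check reports the window as full while the new one sees only 4 segments in flight, so the send buffer may still be expanded.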
* tcp: mitigate ACK loops for connections as tcp_sockNeal Cardwell2015-02-08
| Ensure that in state ESTABLISHED, where the connection is represented by a tcp_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers or ACK numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. There is already a similar (although global) rate-limiting mechanism for "challenge ACKs". When deciding whether to send a challenge ACK, we first consult the new per-connection rate limit, and then the global rate limit. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacksNeal Cardwell2015-02-08
| Helpers for mitigating ACK loops by rate-limiting dupacks sent in response to incoming out-of-window packets. This patch includes: - rate-limiting logic - sysctl to control how often we allow dupacks to out-of-window packets - SNMP counter for cases where we rate-limited our dupack sending The rate-limiting logic in this patch decides to not send dupacks in response to out-of-window segments if (a) they are SYNs or pure ACKs and (b) the remote endpoint is sending them faster than the configured rate limit. We rate-limit our responses rather than blocking them entirely or resetting the connection, because legitimate connections can rely on dupacks in response to some out-of-window segments. For example, zero window probes are typically sent with a sequence number that is below the current window, and ZWPs thus expect to elicit a dupack in response. We allow dupacks in response to TCP segments with data, because these may be spurious retransmissions for which the remote endpoint wants to receive DSACKs. This is safe because segments with data can't realistically be part of ACK loops, which by their nature consist of each side sending pure/data-less ACKs to each other. The dupack interval is controlled by a new sysctl knob, tcp_invalid_ratelimit, given in milliseconds, in case an administrator needs to dial this upward in the face of a high-rate DoS attack. The name and units are chosen to be analogous to the existing knob for ICMP, icmp_ratelimit. The default value for tcp_invalid_ratelimit is 500ms, which allows at most one such dupack per 500ms. This is chosen to be 2x faster than the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule 2.4). We allow the extra 2x factor because network delay variations can cause packets sent at 1 second intervals to be compressed and arrive much closer together.
Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dctcp: loosen requirement to assert ECT(0) during 3WHSFlorian Westphal2015-02-02
| One deployment requirement of DCTCP is to be able to run in a DC setting along with TCP traffic. As Glenn Judd's NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter" [1] (tba) explains, one way to solve this on the switch side is to split DCTCP and TCP traffic into two queues per switch port based on the DSCP: one queue solely intended for DCTCP traffic and one for non-DCTCP traffic. For the DCTCP queue, there's the marking threshold K as explained in commit e3118e8359bb ("net: tcp: add DCTCP congestion control algorithm") for RED marking ECT(0) packets with CE. For the non-DCTCP queue, there's, for example, a classic tail drop queue. As already explained in e3118e8359bb, running DCTCP at scale when not marking SYN/SYN-ACK packets with ECT(0) has severe consequences: for non-ECT(0) packets, traversing the RED marking DCTCP queue will result in a severe reduction of connection probability. This is because the DCTCP queue is dominated by ECT(0) traffic, and switches handle non-ECT traffic in the RED marking queue after passing K as drops, where K is usually a low watermark in order to leave enough tailroom for bursts. Splitting DCTCP traffic among several queues (ECN and non-ECN queue) is considered a terrible idea in the network community as it splits single flows across multiple network paths. Therefore, commit e3118e8359bb implements this on Linux as ECT(0) marked traffic, as we argue that marking all packets of a DCTCP flow is the only viable solution and also doesn't speak against the draft. However, recently, a DCTCP implementation for FreeBSD also landed in their mainline kernel [2]. In order to let it play well together with Linux' DCTCP, we would need to loosen the requirement that ECT(0) has to be asserted during the 3WHS, as this is not implemented in FreeBSD. This simplifies the ECN test and lets DCTCP work together with FreeBSD.
Joint work with Daniel Borkmann. [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595f Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Signed-off-by: David S. Miller <davem@davemloft.net>