path: root/net/ipv4
Commit message (Author, Age)
...
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next (David S. Miller, 2018-10-09)

    Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next tree:

    1) Support for matching on ipsec policy already set in the route, from Florian Westphal.
    2) Split set destruction into deactivate and destroy phase to make it fit better into the transaction infrastructure, also from Florian. This includes a patch to warn on imbalance when setting the new activate and deactivate interfaces.
    3) Release transaction list from the workqueue to remove expensive synchronize_rcu() from configuration plane path. This speeds up configuration plane quite a bit. From Florian Westphal.
    4) Add new xfrm/ipsec extension, this new extension allows you to match for ipsec tunnel keys such as source and destination address, spi and reqid. From Máté Eckl and Florian Westphal.
    5) Add secmark support, this includes connsecmark too, patches from Christian Gottsche.
    6) Allow specifying remaining bytes in xt_quota, from Chenbo Feng. One follow-up patch to calm a clang warning for this one, from Nathan Chancellor.
    7) Flush conntrack entries based on layer 3 family, from Kristian Evensen.
    8) New revision for cgroups2 to shrink the path field.
    9) Get rid of obsolete need_conntrack(), as a result of recent demodularization work.
    10) Use WARN_ON instead of BUG_ON, from Florian Westphal.
    11) Remove unused exported symbol in nf_nat_ipv4_fn(), from Florian.
    12) Remove superfluous check for timeout netlink parser and dump functions in layer 4 conntrack helpers.
    13) Remove redundant rcu read side locks in NAT redirect, from Taehee Yoo.
    14) Pass nf_hook_state structure to error handlers, patch from Florian Westphal.
    15) Remove ->new() interface from layer 4 protocol trackers. Place them in the ->packet() interface. From Florian.
    16) Place conntrack ->error() handling in the ->packet() interface. Patches from Florian Westphal.
    17) Remove unused parameter in the pernet initialization path, also from Florian.
    18) Remove additional parameter to specify layer 3 protocol when looking up for protocol tracker. From Florian.
    19) Shrink array of layer 4 protocol trackers, from Florian.
    20) Check for linear skb only once from the ALG NAT mangling codebase, from Taehee Yoo.
    21) Use rhashtable_walk_enter() instead of deprecated rhashtable_walk_init(), also from Taehee.
    22) No need to flush all conntracks when only one single address is gone, from Tan Hu.
    23) Remove redundant check for NAT flags in flowtable code, from Taehee Yoo.
    24) Use rhashtable_lookup() instead of rhashtable_lookup_fast() from netfilter codebase, since rcu read lock side is already assumed in this path.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: masquerade: don't flush all conntracks if only one address deleted on device (Tan Hu, 2018-09-28)

    We configured iptables as below, which only allowed incoming data on established connections:

        iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT
        iptables -t mangle -P PREROUTING DROP

    When deleting a secondary address, the current masquerade implementation would flush all conntracks on this device. All the established connections on the primary address would also be deleted, and subsequent incoming data on those connections would be wrongly dropped because it was identified as a NEW connection.

    So when an address is deleted, only the connections related to that address should be flushed.

    Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
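The difference between the two flush policies described in this commit can be illustrated with a small user-space sketch. This is not the kernel code: `struct ct_entry`, `flush_by_device()` and `flush_by_address()` are hypothetical names invented here, and a real conntrack entry carries far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a conntrack entry. */
struct ct_entry {
    int      ifindex;    /* device the masqueraded flow leaves through */
    uint32_t masq_addr;  /* source address picked by masquerade */
    int      alive;      /* 1 while the entry is still tracked */
};

/* Old policy: drop every live entry on the device. Returns entries flushed. */
size_t flush_by_device(struct ct_entry *tbl, size_t n, int ifindex)
{
    size_t flushed = 0;

    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].alive && tbl[i].ifindex == ifindex) {
            tbl[i].alive = 0;
            flushed++;
        }
    }
    return flushed;
}

/* New policy: drop only live entries on the device that use the deleted address. */
size_t flush_by_address(struct ct_entry *tbl, size_t n, int ifindex, uint32_t addr)
{
    size_t flushed = 0;

    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].alive && tbl[i].ifindex == ifindex &&
            tbl[i].masq_addr == addr) {
            tbl[i].alive = 0;
            flushed++;
        }
    }
    return flushed;
}
```

With the address-based policy, flows masqueraded to the primary address stay tracked when a secondary address goes away, so their packets keep matching ESTABLISHED.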
| * | netfilter: nf_nat_ipv4: remove obsolete EXPORT_SYMBOL (Florian Westphal, 2018-09-17)

    There are no external callers anymore; the previous change just forgot to also remove the EXPORT_SYMBOL().

    Fixes: 9971a514ed269 ("netfilter: nf_nat: add nat type hooks to nat core")
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | net: Update netconf dump handlers for strict data checking (David Ahern, 2018-10-08)

    Update inet_netconf_dump_devconf, inet6_netconf_dump_devconf, and mpls_netconf_dump_devconf for strict data checking. If the flag is set, the dump request is expected to have a netconfmsg struct as the header. The struct only has the family member and no attributes can be appended.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Acked-by: Christian Brauner <christian@brauner.io>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | rtnetlink: Update fib dumps for strict data checking (David Ahern, 2018-10-08)

    Add helper to check netlink message for route dumps. If the strict flag is set the dump request is expected to have an rtmsg struct as the header. All elements of the struct are expected to be 0 with the exception of rtm_flags (which is used by both ipv4 and ipv6 dumps) and no attributes can be appended. rtm_flags can only have RTM_F_CLONED and RTM_F_PREFIX set.

    Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute, and ip6mr_rtm_dumproute to call this helper if strict data checking is enabled.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
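The header rule in this commit message can be sketched in user space as follows. This is an illustration only, not the helper the patch adds: `struct rtmsg_sketch` is a local mirror of the rtmsg layout so the example builds without kernel headers, `fib_dump_hdr_ok()` is a hypothetical name, and leaving `rtm_family` unchecked is an assumption made here (the message does not single it out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the rtmsg header fields (sketch only). */
struct rtmsg_sketch {
    uint8_t  rtm_family;
    uint8_t  rtm_dst_len;
    uint8_t  rtm_src_len;
    uint8_t  rtm_tos;
    uint8_t  rtm_table;
    uint8_t  rtm_protocol;
    uint8_t  rtm_scope;
    uint8_t  rtm_type;
    uint32_t rtm_flags;
};

/* Same numeric values as RTM_F_CLONED and RTM_F_PREFIX in the rtnetlink uapi header. */
#define SKETCH_RTM_F_CLONED 0x200u
#define SKETCH_RTM_F_PREFIX 0x800u

/*
 * Returns true when a strict-mode route dump request looks acceptable:
 * no trailing attributes, every checked field zero, and only the two
 * permitted flag bits set. rtm_family is deliberately not checked here
 * (assumption of this sketch).
 */
bool fib_dump_hdr_ok(const struct rtmsg_sketch *rtm, size_t attr_len)
{
    assert(rtm != NULL);

    if (attr_len != 0)
        return false;
    if (rtm->rtm_dst_len || rtm->rtm_src_len || rtm->rtm_tos ||
        rtm->rtm_table || rtm->rtm_protocol || rtm->rtm_scope ||
        rtm->rtm_type)
        return false;
    if (rtm->rtm_flags & ~(SKETCH_RTM_F_CLONED | SKETCH_RTM_F_PREFIX))
        return false;
    return true;
}
```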
* | | rtnetlink: Update ipmr_rtm_dumplink for strict data checking (David Ahern, 2018-10-08)

    Update ipmr_rtm_dumplink for strict data checking. If the flag is set, the dump request is expected to have an ifinfomsg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Acked-by: Christian Brauner <christian@brauner.io>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net/ipv4: Update inet_dump_ifaddr for strict data checking (David Ahern, 2018-10-08)

    Update inet_dump_ifaddr for strict data checking. If the flag is set, the dump request is expected to have an ifaddrmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the IFA_TARGET_NETNSID attribute is supported. Follow-on patches can add support for other fields (e.g., honor ifa_index and only return data for the given device index).

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Acked-by: Christian Brauner <christian@brauner.io>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Add extack to nlmsg_parse (David Ahern, 2018-10-08)

    Make sure extack is passed to nlmsg_parse where easy to do so. Most of these are dump handlers leveraging the extack in the netlink_callback.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Acked-by: Christian Brauner <christian@brauner.io>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | udp: gro behind static key (Willem de Bruijn, 2018-10-05)

    Avoid the socket lookup cost in udp_gro_receive if no socket has a udp tunnel callback configured.

    udp_sk(sk)->gro_receive requires a registration with setup_udp_tunnel_sock, which enables the static key.

    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Move free of dst_metrics to helper (David Ahern, 2018-10-05)

    Move the refcounting and potential free of dst metrics for ipv4 and ipv6 to a common helper.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: common metrics init helper for dst_entry (David Ahern, 2018-10-05)

    ipv4 and ipv6 both use refcounted metrics if FIB entries have metrics set. Move the common initialization code to a helper and use it for both protocols.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Move free of fib_metrics to helper (David Ahern, 2018-10-05)

    Move the refcounting and potential free of dst metrics associated with a fib entry to a helper and use it in both ipv4 and ipv6.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: common metrics init helper for FIB entries (David Ahern, 2018-10-05)

    Consolidate initialization of ipv4 and ipv6 metrics when fib entries are created into a single helper, ip_fib_metrics_init, that handles the call to ip_metrics_convert.

    If no metrics are defined for the fib entry, then the metrics are set to dst_default_metrics.

    Signed-off-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2018-10-04)

    Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net' overlapped the renaming of a netlink attribute in net-next.

    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: fix use-after-free in ip_cmsg_recv_dstaddr() (Eric Dumazet, 2018-10-03)

    Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy, do not do it.

    Fixes: 2efd4fca703a ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Acked-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
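The bug class named in this commit (a header pointer taken before a call that may move the packet data) can be modelled in plain user-space C. The sketch below is an analogy, not kernel code: `struct pkt`, `pkt_may_pull()` and `read_after_pull()` are hypothetical, and the pull helper always reallocates in order to model the worst case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packet buffer; head may move whenever more data is pulled. */
struct pkt {
    unsigned char *head;
    size_t len;  /* bytes currently valid at head */
};

/*
 * Stand-in for a "make sure `need` bytes are addressable" call. It always
 * copies the data to a fresh allocation and frees the old one, so any
 * pointer derived from the old head is dangling afterwards.
 * Returns 0 on success, -1 on allocation failure.
 */
int pkt_may_pull(struct pkt *p, size_t need)
{
    size_t cap;
    unsigned char *fresh;

    assert(p != NULL);
    cap = need > p->len ? need : p->len;
    if (cap == 0)
        cap = 1;
    fresh = calloc(1, cap);
    if (!fresh)
        return -1;
    if (p->len)
        memcpy(fresh, p->head, p->len);
    free(p->head);
    p->head = fresh;
    if (need > p->len)
        p->len = need;
    return 0;
}

/*
 * Safe pattern: derive the header pointer only after the pull.
 * The buggy pattern would read p->head into a local before calling
 * pkt_may_pull() and dereference that stale local afterwards.
 */
int read_after_pull(struct pkt *p, size_t need, size_t off, unsigned char *out)
{
    const unsigned char *hdr;

    if (off >= need || pkt_may_pull(p, need) != 0)
        return -1;
    hdr = p->head;  /* taken after the pull, so it points at live memory */
    *out = hdr[off];
    return 0;
}
```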
| * | inet: make sure to grab rcu_read_lock before using ireq->ireq_opt (Eric Dumazet, 2018-10-02)

    Timer handlers do not imply rcu_read_lock(), so my recent fix triggered a LOCKDEP warning when a SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt usages instead of guessing what is done by callers, since it is not worth the pain.

    Get rid of ireq_opt_deref() helper since it hides the logic without real benefit, since it is now a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec (David S. Miller, 2018-10-02)

    Steffen Klassert says:

    ====================
    pull request (net): ipsec 2018-10-01

    1) Validate address prefix lengths in the xfrm selector, otherwise we may hit undefined behaviour in the address matching functions if the prefix is too big for the given address family.
    2) Fix skb leak on local message size errors. From Thadeu Lima de Souza Cascardo.
    3) We currently reset the transport header back to the network header after a transport mode transformation is applied. This leads to an incorrect transport header when multiple transport mode transformations are applied. Reset the transport header only after all transformations are already applied to fix this. From Sowmini Varadhan.
    4) We only support one offloaded xfrm, so reset crypto_done after the first transformation in xfrm_input(). Otherwise we may call the wrong input method for subsequent transformations. From Sowmini Varadhan.
    5) Fix NULL pointer dereference when skb_dst_force clears the dst_entry. skb_dst_force does not really force a dst refcount anymore, it might clear it instead. xfrm code did not expect this, add a check to not dereference skb_dst() if it was cleared by skb_dst_force.
    6) Validate xfrm template mode, otherwise we can get a stack-out-of-bounds read in xfrm_state_find. From Sean Tranchetti.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | xfrm: reset transport header back to network header after all input transforms have been applied (Sowmini Varadhan, 2018-09-04)

    A policy may have been set up with multiple transforms (e.g., ESP and ipcomp). In this situation, the ingress IPsec processing iterates in xfrm_input() and applies each transform in turn, processing the nexthdr to find any additional xfrm that may apply.

    This patch resets the transport header back to network header only after the last transformation so that subsequent xfrms can find the correct transport header.

    Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
    Suggested-by: Steffen Klassert <steffen.klassert@secunet.com>
    Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
    Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | tcp/dccp: fix lockdep issue when SYN is backlogged (Eric Dumazet, 2018-10-01)

    In normal SYN processing, packets are handled without listener lock and in RCU protected ingress path.

    But syzkaller is known to be able to trick us and SYN packets might be processed in process context, after being queued into socket backlog.

    In commit 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt") I made a very stupid fix, that happened to work mostly because of the regular path being RCU protected.

    Really the thing protecting ireq->ireq_opt is RCU read lock, and the pseudo request refcnt is not relevant.

    This patch extends what I did in commit 449809a66c1d ("tcp/dccp: block BH for SYN processing") by adding an extra rcu_read_{lock|unlock} pair in the paths that might be taken when processing SYN from socket backlog (thus possibly in process context).

    Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net-tcp: /proc/sys/net/ipv4/tcp_probe_interval is a u32 not int (Maciej Żenczykowski, 2018-09-26)

    (fix documentation and sysctl access to treat it as such)

    Tested:
        # zcat /proc/config.gz | egrep ^CONFIG_HZ
        CONFIG_HZ_1000=y
        CONFIG_HZ=1000
        # echo $[(1<<32)/1000 + 1] | tee /proc/sys/net/ipv4/tcp_probe_interval
        4294968
        tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument
        # echo $[(1<<32)/1000] | tee /proc/sys/net/ipv4/tcp_probe_interval
        4294967
        # echo 0 | tee /proc/sys/net/ipv4/tcp_probe_interval
        # echo -1 | tee /proc/sys/net/ipv4/tcp_probe_interval
        -1
        tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument

    Signed-off-by: Maciej Żenczykowski <maze@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
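The accepted and rejected values in the test transcript are consistent with a bound of the largest u32 divided by HZ, so that the interval in seconds multiplied by HZ still fits in 32 bits. The sketch below encodes that reading; the bound itself is an assumption inferred from the transcript, and `probe_interval_ok()` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when `seconds` is non-negative and seconds * hz still fits
 * in an unsigned 32-bit value (assumed bound: UINT32_MAX / hz).
 */
bool probe_interval_ok(int64_t seconds, uint32_t hz)
{
    assert(hz > 0);
    return seconds >= 0 && (uint64_t)seconds <= UINT32_MAX / hz;
}
```

With hz = 1000 the limit works out to 4294967295 / 1000 = 4294967 (integer division), which matches the last value the transcript shows being accepted.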
* | | | ipv4: Allow sending multicast packets on specific i/f using VRF socket (Robert Shearman, 2018-10-03)

    It is useful to be able to use the same socket for listening in a specific VRF, as for sending multicast packets out of a specific interface. However, the bound device on the socket currently takes precedence and results in the packets not being sent.

    Relax the condition on overriding the output interface to use for sending packets out of UDP, raw and ping sockets to allow multicast packets to be sent using the specified multicast interface.

    Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
    Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
    Reviewed-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: do not release socket ownership in tcp_close() (Eric Dumazet, 2018-10-03)

    syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk)); in tcp_close()

    While a socket is being closed, it is very possible other threads find it in rtnetlink dump.

    tcp_get_info() will acquire the socket lock for a short amount of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);), enough to trigger the warning.

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: inet_rtm_getroute() - use new style struct initializer instead of memset (Maciej Żenczykowski, 2018-10-02)

    Signed-off-by: Maciej Żenczykowski <maze@google.com>
    Reviewed-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ip_rt_get_source() - use new style struct initializer instead of memset (Maciej Żenczykowski, 2018-10-02)

    (allows for better compiler optimization)

    Signed-off-by: Maciej Żenczykowski <maze@google.com>
    Reviewed-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
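For readers unfamiliar with the pattern these two commits refer to, the sketch below contrasts memset-then-assign with a C99 designated initializer. `struct flow_key` and both function names are hypothetical and much smaller than the kernel's flow structures; the point is only that members not named in the initializer are zero-initialized, so every field ends up with the same value either way.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lookup key, loosely modelled on a flow description. */
struct flow_key {
    uint32_t daddr;
    uint32_t saddr;
    int      oif;
    uint8_t  tos;
};

/* Older style: clear the whole object, then assign the interesting members. */
struct flow_key key_with_memset(uint32_t daddr, int oif)
{
    struct flow_key k;

    assert(oif >= 0);  /* sketch-only precondition */
    memset(&k, 0, sizeof(k));
    k.daddr = daddr;
    k.oif = oif;
    return k;
}

/* Newer style: name the members that matter; the rest are zero-initialized. */
struct flow_key key_with_initializer(uint32_t daddr, int oif)
{
    assert(oif >= 0);  /* sketch-only precondition */
    struct flow_key k = { .daddr = daddr, .oif = oif };
    return k;
}
```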
* | | | tcp/fq: move back to CLOCK_MONOTONIC (Eric Dumazet, 2018-10-02)

    In the recent TCP/EDT patch series, I switched TCP and sch_fq clocks from MONOTONIC to TAI, in order to meet the choice done earlier for sch_etf packet scheduler.

    But sure enough, this broke some setups where the TAI clock jumps forward (by almost 50 years), as reported by Leonard Crestez.

    If we want to converge later, we'll probably need to add an skb field to differentiate the clock bases, or a socket option.

    In the meantime, a UDP application will need to use CLOCK_MONOTONIC base for its SCM_TXTIME timestamps if using fq packet scheduler.

    Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
    Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
    Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Leonard Crestez <leonard.crestez@nxp.com>
    Tested-by: Leonard Crestez <leonard.crestez@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: adjust rcv zerocopy hints based on frag sizes (Soheil Hassas Yeganeh, 2018-10-02)

    When SKBs are coalesced, we can have SKBs with different frag sizes. Some with PAGE_SIZE and some not with PAGE_SIZE. Since recv_skip_hint is always set to the full SKB size, it can overestimate the amount that should be read using normal read for coalesced packets.

    Change the recv_skip_hint so that it only includes the first frags that are not of PAGE_SIZE.

    Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: set recv_skip_hint when tcp_inq is less than PAGE_SIZE (Soheil Hassas Yeganeh, 2018-10-02)

    When we have less than PAGE_SIZE of data on receive queue, we set recv_skip_hint to 0. Instead, set it to the actual number of bytes available.

    Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next (David S. Miller, 2018-10-02)

    Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2018-10-01

    1) Make xfrmi_get_link_net() static to silence a sparse warning. From Wei Yongjun.
    2) Remove an unused esph pointer definition in esp_input(). From Haishuang Yan.
    3) Allow the NIC driver to quietly refuse xfrm offload in case it does not support it, the SA is created without offload in this case. From Shannon Nelson.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | esp: remove redundant define esph (Haishuang Yan, 2018-08-29)

    The pointer 'esph' is defined but is never used, hence it is redundant and can be removed.

    Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
    Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | | | tcp: start receiver buffer autotuning sooner (Yuchung Cheng, 2018-10-01)

    Previously receiver buffer auto-tuning starts after receiving one advertised window amount of data. After the initial receiver buffer was raised by patch a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), the receiver buffer may take too long to start raising. To address this issue, this patch lowers the initial number of bytes expected to be received to roughly the expected sender's initial window.

    Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Wei Wang <weiwan@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | tcp: up initial rmem to 128KB and SYN rwin to around 64KB (Yuchung Cheng, 2018-09-29)

    Previously TCP initial receive buffer is ~87KB by default and the initial receive window is ~29KB (20 MSS). This patch changes the two numbers to 128KB and ~64KB (rounding down to the multiples of MSS) respectively. The patch also simplifies the calculations so that the two numbers are directly controlled by sysctl tcp_rmem[1]:

    1) Initial receiver buffer budget (sk_rcvbuf): while this should be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf() always overrode it and set a larger size when a new connection was established.

    2) Initial receive window in SYN: previously it is set to 20 packets if MSS <= 1460. The number 20 was based on the initial congestion window of 10: the receiver needs twice that amount to avoid being limited by the receive window upon out-of-order delivery in the first window burst. But since this only applies if the receiving MSS <= 1460, a connection using a large MTU (e.g. to utilize receiver zero-copy) may be limited by the receive window.

    With this patch TCP memory configuration is more straightforward and more properly sized to modern high-speed networks by default. Several popular stacks have been announcing 64KB rwin in SYNs as well.

    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Wei Wang <weiwan@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
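The "rounding down to the multiples of MSS" step can be worked through with a few lines of C. This is a sketch of the arithmetic only: `initial_rwin()` is a hypothetical helper, and the 65535-byte ceiling used in the example is an assumption standing in for "around 64KB" (the largest window that fits in an unscaled 16-bit field), not a value taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Largest multiple of `mss` that does not exceed `ceiling`. */
uint32_t initial_rwin(uint32_t ceiling, uint32_t mss)
{
    assert(mss > 0);
    return (ceiling / mss) * mss;
}
```

For an MSS of 1460 and the assumed 65535-byte ceiling this gives 44 * 1460 = 64240 bytes, compared with the previous 20 * 1460 = 29200 bytes (the ~29KB mentioned above).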
* | | | | net-ipv4: remove 2 always zero parameters from ipv4_redirect() (Maciej Żenczykowski, 2018-09-26)

    (the parameters in question are mark and flow_flags)

    Reviewed-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: Maciej Żenczykowski <maze@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu() (Maciej Żenczykowski, 2018-09-26)

    (the parameters in question are mark and flow_flags)

    Reviewed-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: Maciej Żenczykowski <maze@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2018-09-25)

    Version bump conflict in batman-adv, take what's in net-next.

    iavf conflict, adjustment of netdev_ops in net-next conflicting with poll controller method removal in net.

    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | ip_tunnel: be careful when accessing the inner header (Paolo Abeni, 2018-09-24)

    Cong noted that we need the same checks introduced by commit 76c0ddd8c3a6 ("ip6_tunnel: be careful when accessing the inner header") even for ipv4 tunnels.

    Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
    Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net (Peter Oskolkov, 2018-09-21)

    Currently, ip[6]frag_high_thresh sysctl values in new namespaces are hard-limited to those of the root/init ns.

    There are at least two use cases when it would be desirable to set the high_thresh values higher in a child namespace vs the global hard limit:

    - a security/ddos protection policy may lower the thresholds in the root/init ns but allow for a special exception in a child namespace
    - testing: a test running in a namespace may want to set these thresholds higher in its namespace than what is in the root/init ns

    The new behavior:

        # ip netns add testns
        # ip netns exec testns bash
        # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
        net.ipv4.ipfrag_high_thresh = 9000000
        # sysctl net.ipv4.ipfrag_high_thresh
        net.ipv4.ipfrag_high_thresh = 9000000
        # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
        net.ipv6.ip6frag_high_thresh = 9000000
        # sysctl net.ipv6.ip6frag_high_thresh
        net.ipv6.ip6frag_high_thresh = 9000000

    The old behavior:

        # ip netns add testns
        # ip netns exec testns bash
        # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
        net.ipv4.ipfrag_high_thresh = 9000000
        # sysctl net.ipv4.ipfrag_high_thresh
        net.ipv4.ipfrag_high_thresh = 4194304
        # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
        net.ipv6.ip6frag_high_thresh = 9000000
        # sysctl net.ipv6.ip6frag_high_thresh
        net.ipv6.ip6frag_high_thresh = 4194304

    Signed-off-by: Peter Oskolkov <posk@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net/ipv4: avoid compile error in fib_info_nh_uses_dev (Eric Dumazet, 2018-09-21)

    net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev':
    net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable]
    cc1: all warnings being treated as errors

    Fixes: 78f2756c5fc0 ("net/ipv4: Move device validation to helper")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: David Ahern <dsahern@gmail.com>
    Reviewed-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | tcp: switch tcp_internal_pacing() to tcp_wstamp_ns (Eric Dumazet, 2018-09-21)

    Now that TCP keeps track of tcp_wstamp_ns, recording the earliest departure time of the next packet, we can remove duplicate code from tcp_internal_pacing().

    This removes one ktime_get_tai_ns() call, and a divide.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | tcp: switch tcp and sch_fq to new earliest departure time model    Eric Dumazet    2018-09-21
| | | | |
| | | | | TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq no longer
| | | | | has to do it.
| | | | |
| | | | | Thanks to this model, TCP can get more accurate RTT samples, since
| | | | | pacing no longer inflates them.
| | | | |
| | | | | This has the nice effect of removing some delays caused by the FQ
| | | | | quantum mechanism, which inflated max/P99 latencies.
| | | | |
| | | | | Also we might relax TCP Small Queue tight limits in the future, since
| | | | | this new model allows TCP to build bigger batches, since sch_fq (or a
| | | | | device with earliest departure time offload) ensures these packets
| | | | | will be delivered on time.
| | | | |
| | | | | Note that other protocols are not converted (they will probably never
| | | | | be) so sch_fq still supports SO_MAX_PACING_RATE.
| | | | |
| | | | | Tested:
| | | | |
| | | | | Test showing FQ pacing quantum artifact for low-rate flows, adding
| | | | | unexpected throttles for RPC flows, inflating max and P99 latencies.
| | | | |
| | | | | The parameters chosen here are to show what happens typically when a
| | | | | TCP flow has a reduced pacing rate (this can be caused by a reduced
| | | | | cwin after a few losses, and/or an rtt above a few ms).
| | | | |
| | | | | MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
| | | | |
| | | | | Before :
| | | | | $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
| | | | | MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
| | | | | Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
| | | | | 19,82.78,5279,3825,482.02
| | | | |
| | | | | After :
| | | | | $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
| | | | | MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
| | | | | Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
| | | | | 20,49.94,128,63,3.18
| | | | |
| | | | | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
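
Editorial sketch (not part of the commit): the earliest departure time model described in this message can be illustrated in userspace C roughly as follows. The struct, its fields, and flow_stamp_packet() are hypothetical names chosen for illustration; only the idea of a per-socket nanosecond departure clock comes from the message above, and the kernel code may differ in detail.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-flow state; wstamp_ns plays the role of tcp_wstamp_ns. */
struct flow {
    uint64_t wstamp_ns;    /* earliest departure time of the next packet */
    uint64_t pacing_rate;  /* bytes per second, assumed non-zero */
};

/*
 * Stamp one packet of `len` bytes and advance the flow's departure clock.
 * Returns the earliest departure time assigned to this packet.
 */
static uint64_t flow_stamp_packet(struct flow *f, uint64_t now_ns, uint32_t len)
{
    uint64_t edt;

    /* Never schedule in the past: an idle flow restarts from "now". */
    if (f->wstamp_ns < now_ns)
        f->wstamp_ns = now_ns;

    edt = f->wstamp_ns;

    /* The next packet may leave once this one has been paced out. */
    f->wstamp_ns += (uint64_t)len * NSEC_PER_SEC / f->pacing_rate;
    return edt;
}
```

In this sketch the sender owns the departure clock, so a queue or device only has to hold each packet until its stamp; it no longer needs to know the flow's pacing rate.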
* | | | | tcp: switch internal pacing timer to CLOCK_TAI    Eric Dumazet    2018-09-21
| | | | |
| | | | | Next patch will use tcp_wstamp_ns to feed internal TCP pacing timer,
| | | | | so switch to CLOCK_TAI to share same base.
| | | | |
| | | | | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | tcp: provide earliest departure time in skb->tstamp    Eric Dumazet    2018-09-21
| | | | |
| | | | | Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns, from usec
| | | | | units to nsec units.
| | | | |
| | | | | Do not clear skb->tstamp before entering IP stacks in TX, so that
| | | | | qdisc or devices can implement pacing based on the earliest departure
| | | | | time instead of socket sk->sk_pacing_rate.
| | | | |
| | | | | Packets are fed with tcp_wstamp_ns, and the following patch will
| | | | | update tcp_wstamp_ns when both TCP and sch_fq switch to the earliest
| | | | | departure time mechanism.
| | | | |
| | | | | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | tcp: add tcp_wstamp_ns socket field    Eric Dumazet    2018-09-21
| | | | |
| | | | | TCP will soon provide earliest departure time on TX skbs. It needs to
| | | | | track this in a new variable.
| | | | |
| | | | | tcp_mstamp_refresh() needs to update this variable, and became too
| | | | | big to stay an inline.
| | | | |
| | | | | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | tcp: introduce tcp_skb_timestamp_us() helper    Eric Dumazet    2018-09-21
| | | | |
| | | | | There are a few places where TCP reads skb->skb_mstamp expecting a
| | | | | value in usec units.
| | | | |
| | | | | skb->tstamp (aka skb->skb_mstamp) will soon store a CLOCK_TAI nsec
| | | | | value.
| | | | |
| | | | | Add tcp_skb_timestamp_us() to provide proper conversion when needed.
| | | | |
| | | | | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
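
Editorial sketch (not part of the commit): a conversion helper of the kind described above might look like the following in userspace C. struct pkt and pkt_timestamp_us() are hypothetical stand-ins, assuming the stored value is in nanoseconds and callers want whole microseconds; the kernel helper operates on struct sk_buff and may be written differently.

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Hypothetical stand-in for an skb whose timestamp is kept in nanoseconds. */
struct pkt {
    uint64_t mstamp_ns;
};

/* Return the packet timestamp in microseconds, truncating the remainder. */
static inline uint64_t pkt_timestamp_us(const struct pkt *p)
{
    return p->mstamp_ns / NSEC_PER_USEC;
}
```

Routing every microsecond reader through one accessor keeps the unit change in a single place instead of scattering divisions across callers.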
* | | | | ipv4: remove redundant null pointer check before kfree_skb    zhong jiang    2018-09-21
| | | | |
| | | | | kfree_skb has taken the null pointer into account, hence it is safe
| | | | | to remove the redundant null pointer check before kfree_skb.
| | | | |
| | | | | Signed-off-by: zhong jiang <zhongjiang@huawei.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | netfilter: nft_fib: Convert nft_fib4_eval to new dev helper    David Ahern    2018-09-20
| | | | |
| | | | | Convert nft_fib4_eval to the new device checking helper and remove
| | | | | the duplicate code.
| | | | |
| | | | | Signed-off-by: David Ahern <dsahern@gmail.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | netfilter: rpfilter: Convert rpfilter_lookup_reverse to new dev helper    David Ahern    2018-09-20
| | | | |
| | | | | Convert rpfilter_lookup_reverse to the new device checking helper and
| | | | | remove the duplicate code.
| | | | |
| | | | | Signed-off-by: David Ahern <dsahern@gmail.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net/ipv4: Move device validation to helper    David Ahern    2018-09-20
| | | | |
| | | | | Move the device matching check in __fib_validate_source to a helper
| | | | | and export it for use by netfilter modules.
| | | | |
| | | | | Code move only; no functional change intended.
| | | | |
| | | | | Signed-off-by: David Ahern <dsahern@gmail.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net    David S. Miller    2018-09-18
|\| | | |
| | | | | Two new tls tests added in parallel in both net and net-next.
| | | | |
| | | | | Used Stephen Rothwell's linux-next resolution.
| | | | |
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | udp4: fix IP_CMSG_CHECKSUM for connected sockets    Paolo Abeni    2018-09-16
| | | | |
| | | | | commit 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary
| | | | | conversion") left out the early demux path for connected sockets. As
| | | | | a result IP_CMSG_CHECKSUM gives wrong values for such sockets when
| | | | | GRO is not enabled/available.
| | | | |
| | | | | This change addresses the issue by moving the csum conversion to a
| | | | | common helper and using that helper in both the default and the early
| | | | | demux rx path.
| | | | |
| | | | | Fixes: 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion")
| | | | | Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | gso_segment: Reset skb->mac_len after modifying network header    Toke Høiland-Jørgensen    2018-09-13
| | | | |
| | | | | When splitting a GSO segment that consists of encapsulated packets,
| | | | | the skb->mac_len of the segments can end up being set wrong, causing
| | | | | packet drops in particular when using act_mirred and ifb interfaces
| | | | | in combination with a qdisc that splits GSO packets.
| | | | |
| | | | | This happens because at the time skb_segment() is called,
| | | | | network_header will point to the inner header, throwing off the
| | | | | calculation in skb_reset_mac_len(). The network_header is
| | | | | subsequently adjusted by the outer IP gso_segment handlers, but they
| | | | | don't set the mac_len.
| | | | |
| | | | | Fix this by adding skb_reset_mac_len() calls to both the IPv4 and
| | | | | IPv6 gso_segment handlers, after they modify the network_header.
| | | | |
| | | | | Many thanks to Eric Dumazet for his help in identifying the cause of
| | | | | the bug.
| | | | |
| | | | | Acked-by: Dave Taht <dave.taht@gmail.com>
| | | | | Reviewed-by: Eric Dumazet <edumazet@google.com>
| | | | | Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
| | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
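
Editorial sketch (not part of the commit): the stale mac_len problem described above can be modelled in userspace C with plain header offsets. struct pkt_hdrs and both functions are hypothetical; the only assumption taken from the message is that mac_len is derived from the distance between the MAC header and the network header, so it must be recomputed whenever the network header moves.

```c
#include <stdint.h>

/* Hypothetical model of the header offsets tracked for one packet. */
struct pkt_hdrs {
    uint16_t mac_header;      /* offset of the link-layer header */
    uint16_t network_header;  /* offset of the network header */
    uint16_t mac_len;         /* cached length of the link-layer header */
};

/* Recompute the cached length from the current offsets. */
static void reset_mac_len(struct pkt_hdrs *p)
{
    p->mac_len = (uint16_t)(p->network_header - p->mac_header);
}

/* Move the network header and refresh mac_len, mirroring the fix. */
static void set_network_header(struct pkt_hdrs *p, uint16_t offset)
{
    p->network_header = offset;
    reset_mac_len(p);
}
```

If mac_len is computed while network_header still points at the inner header (say offset 64) and the outer handler later moves it back to 14 without a reset, the cached value stays at 64; recomputing after the move gives 14.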