aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAge
...
| * | net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()David S. Miller2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: Add helper inet_csk_update_pmtu().David S. Miller2012-07-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This abstracts away the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). So we try to rebuild the socket cached route after the method invocation if necessary. This isn't used by SCTP because it needs to cache dsts per-transport, and thus will need it's own local version of this helper. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: implement RFC 5961 4.2Eric Dumazet2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the RFC 5691 mitigation against Blind Reset attack using SYN bit. Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop incoming packet, instead of resetting the session. Add a new SNMP counter to count number of challenge acks sent in response to SYN packets. (netstat -s | grep TCPSYNChallenge) Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session because of a SYN flag. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: implement RFC 5961 3.2Eric Dumazet2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the RFC 5691 mitigation against Blind Reset attack using RST bit. Idea is to validate incoming RST sequence, to match RCV.NXT value, instead of previouly accepted window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND) If sequence is in window but not an exact match, send a "challenge ACK", so that the other part can resend an RST with the appropriate sequence. Add a new sysctl, tcp_challenge_ack_limit, to limit number of challenge ACK sent per second. Add a new SNMP counter to count number of challenge acks sent. (netstat -s | grep TCPChallengeACK) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: make sock diag per-namespaceAndrey Vagin2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this patch sock_diag works for init_net only and dumps information about sockets from all namespaces. This patch expands sock_diag for all name-spaces. It creates a netlink kernel socket for each netns and filters data during dumping. v2: filter accoding with netns in all places remove an unused variable. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pavel Emelyanov <xemul@parallels.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: add OFO snmp countersEric Dumazet2012-07-17
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add three SNMP TCP counters, to better track TCP behavior at global stage (netstat -s), when packets are received Out Of Order (OFO) TCPOFOQueue : Number of packets queued in OFO queue TCPOFODrop : Number of packets meant to be queued in OFO but dropped because socket rcvbuf limit hit. TCPOFOMerge : Number of packets in OFO that were merged with other packets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Don't store a rule pointer in fib_result.David S. Miller2012-07-13
| | | | | | | | | | | | | | | | | | | | We only use it to fetch the rule's tclassid, so just store the tclassid there instead. This also decreases the size of fib_result by a full 8 bytes on 64-bit. On 32-bits it's a wash. Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: add LAST_ACK as a valid state for TSQEric Dumazet2012-07-13
| | | | | | | | | | | | | | | | | | | | | | Socket state LAST_ACK should allow TSQ to send additional frames, or else we rely on incoming ACKS or timers to send them. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Remove tb_peers from fib_table.David S. Miller2012-07-12
| | | | | | | | | | | | No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Put proper checks into icmp_socket_deliver().David S. Miller2012-07-12
| | | | | | | | | | | | | | | | | | All handler->err() routines expect that we've done a pskb_may_pull() test to make sure that IP header length + 8 bytes can be safely pulled. Reported-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Fix warnings in ip_do_redirect() for some configurations.David S. Miller2012-07-12
| | | | | | | | | | Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Remove checks for dst_ops->redirect being NULL.David S. Miller2012-07-12
| | | | | | | | | | | | No longer necessary. Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Add dummy dst_ops->redirect method where needed.David S. Miller2012-07-12
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Kill ip_rt_redirect().David S. Miller2012-07-12
| | | | | | | | | | | | | | No longer needed, as the protocol handlers now all properly propagate the redirect back into the routing code. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Add redirect support to all protocol icmp error handlers.David S. Miller2012-07-12
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions.David S. Miller2012-07-12
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect.David S. Miller2012-07-11
| | | | | | | | | | | | All of the redirect acceptance policy is now contained within. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Rearrange arguments to ip_rt_redirect()David S. Miller2012-07-11
| | | | | | | | | | | | | | | | Pass in the SKB rather than just the IP addresses, so that policy and other aspects can reside in ip_rt_redirect() rather then icmp_redirect(). Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Pull redirect instantiation out into a helper function.David S. Miller2012-07-11
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Deliver ICMP redirects to sockets too.David S. Miller2012-07-11
| | | | | | | | | | | | And thus, we can remove the ping_err() hack. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Pull icmp socket delivery out into a helper function.David S. Miller2012-07-11
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: TCP Small QueuesEric Dumazet2012-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Fix out of bounds access to tcpm_valsAlexander Duyck2012-07-11
| | | | | | | | | | | | | | | | | | | | The recent patch "tcp: Maintain dynamic metrics in local cache." introduced an out of bounds access due to what appears to be a typo. I believe this change should resolve the issue by replacing the access to RTAX_CWND with TCP_METRIC_CWND. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Fix non-kernel-doc comments with kernel-doc start markerBen Hutchings2012-07-11
| | | | | | | | | | Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Fix (nearly-)kernel-doc comments for various functionsBen Hutchings2012-07-11
| | | | | | | | | | | | | | | | Fix incorrect start markers, wrapped summary lines, missing section breaks, incorrect separators, and some name mismatches. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Remove inetpeer from routes.David S. Miller2012-07-11
| | | | | | | | | | | | No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Calling ->cow_metrics() now is a bug.David S. Miller2012-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nothing every writes to ipv4 metrics any longer. PMTU is stored in rt->rt_pmtu. Dynamic TCP metrics are stored in a special TCP metrics cache, completely outside of the routes. Therefore ->cow_metrics() can simply nothing more than a WARN_ON trigger so we can catch anyone who tries to add new writes to ipv4 route metrics. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Kill dst_copy_metrics() call from ipv4_blackhole_route().David S. Miller2012-07-11
| | | | | | | | | | | | | | | | Blackhole routes have a COW metrics operation that returns NULL always, therefore this dst_copy_metrics() call did absolutely nothing. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Enforce max MTU metric at route insertion time.David S. Miller2012-07-11
| | | | | | | | | | | | Rather than at every struct rtable creation. Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Maintain redirect and PMTU info in struct rtable again.David S. Miller2012-07-11
| | | | | | | | | | | | | | Maintaining this in the inetpeer entries was not the right way to do this at all. Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo().David S. Miller2012-07-11
| | | | | | | | | | | | Nobody provides non-zero values any longer. Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: Kill FLOWI_FLAG_PRECOW_METRICS.David S. Miller2012-07-11
| | | | | | | | | | | | | | | | No longer needed. TCP writes metrics, but now in it's own special cache that does not dirty the route metrics. Therefore there is no longer any reason to pre-cow metrics in this way. Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: Minimize use of cached route inetpeer.David S. Miller2012-07-11
| | | | | | | | | | | | | | | | | | | | | | | | Only use it in the absolutely required cases: 1) COW'ing metrics 2) ipv4 PMTU 3) ipv4 redirects Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: Remove ->get_peer() method.David S. Miller2012-07-11
| | | | | | | | | | | | No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Remove tw->tw_peerDavid S. Miller2012-07-11
| | | | | | | | | | | | No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Move timestamps from inetpeer to metrics cache.David S. Miller2012-07-11
| | | | | | | | | | | | With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Don't report route RTT metric value in cache dumps.David S. Miller2012-07-11
| | | | | | | | | | | | | | We don't maintain it dynamically any longer, so reporting it would be extremely misleading. Report zero instead. Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Maintain dynamic metrics in local cache.David S. Miller2012-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Maintain a local hash table of TCP dynamic metrics blobs. Computed TCP metrics are no longer maintained in the route metrics. The table uses RCU and an extremely simple hash so that it has low latency and low overhead. A simple hash is legitimate because we only make metrics blobs for fully established connections. Some tweaking of the default hash table sizes, metric timeouts, and the hash chain length limit certainly could use some tweaking. But the basic design seems sound. With help from Eric Dumazet and Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Abstract back handling peer aliveness test into helper function.David S. Miller2012-07-10
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Move dynamnic metrics handling into seperate file.David S. Miller2012-07-10
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of git://1984.lsi.us.es/nf-nextDavid S. Miller2012-07-07
|\ \
| * | netfilter: nf_conntrack: generalize nf_ct_l4proto_netPablo Neira Ayuso2012-07-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch generalizes nf_ct_l4proto_net by splitting it into chunks and moving the corresponding protocol part to where it really belongs to. To clarify, note that we follow two different approaches to support per-net depending if it's built-in or run-time loadable protocol tracker. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
| * | netfilter: nf_ct_icmp: add icmp_kmemdup[_compat]_sysctl_table functionGao feng2012-06-27
| | | | | | | | | | | | | | | | | | | | | | | | Split sysctl function into smaller chucks to cleanup code and prepare patches to reduce ifdef pollution. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_conntrack: prepare l4proto->init_net cleanupGao feng2012-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | l4proto->init contain quite redundant code. We can simplify this by adding a new parameter l3proto. This patch prepares that code simplification. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | ipv4: Avoid overhead when no custom FIB rules are installed.David S. Miller2012-07-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the user hasn't actually installed any custom rules, or fiddled with the default ones, don't go through the whole FIB rules layer. It's just pure overhead. Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check the individual tables by hand, one by one. Also, move fib_num_tclassid_users into the ipv4 network namespace. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: defer fib_compute_spec_dst() callEric Dumazet2012-07-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ip_options_compile() can avoid calling fib_compute_spec_dst() by default, and perform the call only if needed. David suggested to add a helper to make the call only once. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: No need to set generic neighbour pointer.David S. Miller2012-07-05
| | | | | | | | | | | | | | | | | | Nobody reads it any longer. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Add optional SKB arg to dst_ops->neigh_lookup().David S. Miller2012-07-05
| | | | | | | | | | | | | | | | | | | | | Causes the handler to use the daddr in the ipv4/ipv6 header when the route gateway is unspecified (local subnet). Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Do delayed neigh confirmation.David S. Miller2012-07-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a dst_confirm() happens, mark the confirmation as pending in the dst. Then on the next packet out, when we have the neigh in-hand, do the update. This removes the dependency in dst_confirm() of dst's having an attached neigh. While we're here, remove the explicit 'dst' NULL check, all except 2 or 3 call sites ensure it's not NULL. So just fix those cases up. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: Don't report neigh uptodate state in rtcache procfs.David S. Miller2012-07-05
| | | | | | | | | | | | | | | | | | | | | | | | Soon routes will not have a cached neigh attached, nor will we be able to necessarily go directly to a neigh from an arbitrary route. Signed-off-by: David S. Miller <davem@davemloft.net>