path: root/include/net
Commit message (Author, Age)
* net-tcp: Fast Open client - cookie cache (Yuchung Cheng, 2012-07-19)
| With help from Eric Dumazet, add Fast Open metrics to the tcp metrics cache. The basic ones are MSS and the cookies. A later patch will cache more to handle unfriendly middleboxes. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-tcp: Fast Open base (Yuchung Cheng, 2012-07-19)
| This patch implements the common code for both the client and server. 1. TCP Fast Open option processing. Since Fast Open does not have an option number assigned by IANA yet, it shares the experimental option code 254 by implementing draft-ietf-tcpm-experimental-options with a 16-bit magic number 0xF989. This enables global experiments without clashing over the scarce (2) experimental option codes available for TCP. When (and if) the draft becomes a standard, the client should switch to the newly assigned option number while the server supports both numbers during the transition. 2. The new sysctl tcp_fastopen. 3. A placeholder init function. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
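The shared experimental option layout described above can be sketched in user-space C. This is a minimal illustration, not the kernel code: it assumes the layout from draft-ietf-tcpm-experimental-options (kind byte 254, a length byte, then the 16-bit magic 0xF989 in network byte order, then the cookie). The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXP_OPT_KIND   254      /* shared experimental TCP option kind */
#define EXP_OPT_MAGIC  0xF989   /* 16-bit magic identifying Fast Open */
#define EXP_OPT_BASE   4        /* kind + length + 2-byte magic */

/* Write the option into buf; returns bytes written, or 0 if it does not fit. */
size_t tfo_exp_option_write(uint8_t *buf, size_t buflen,
                            const uint8_t *cookie, size_t cookie_len)
{
    size_t optlen = EXP_OPT_BASE + cookie_len;

    assert(buf != NULL);
    if (optlen > 255 || optlen > buflen)
        return 0;
    buf[0] = EXP_OPT_KIND;
    buf[1] = (uint8_t)optlen;            /* length covers the whole option */
    buf[2] = (uint8_t)(EXP_OPT_MAGIC >> 8);
    buf[3] = (uint8_t)(EXP_OPT_MAGIC & 0xff);
    if (cookie_len)
        memcpy(buf + EXP_OPT_BASE, cookie, cookie_len);
    return optlen;
}

/* Returns the cookie length if opt is a Fast Open experimental option,
 * or -1 if the kind, length, or magic does not match. */
int tfo_exp_option_parse(const uint8_t *opt, size_t len, const uint8_t **cookie)
{
    assert(opt != NULL && cookie != NULL);
    if (len < EXP_OPT_BASE || opt[0] != EXP_OPT_KIND)
        return -1;
    if (opt[1] < EXP_OPT_BASE || opt[1] > len)
        return -1;
    if (((opt[2] << 8) | opt[3]) != EXP_OPT_MAGIC)
        return -1;                       /* some other experiment's option */
    *cookie = opt + EXP_OPT_BASE;
    return opt[1] - EXP_OPT_BASE;
}
```

Checking the magic before touching the payload is what lets several experiments share kind 254 without misreading each other's options.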
* net: Fix warnings in dst_ops.h (David S. Miller, 2012-07-19)
| include/net/dst_ops.h:28:20: warning: ‘struct sock’ declared inside parameter list Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: tcp: remove per net tcp_sock (Eric Dumazet, 2012-07-19)
| tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICs, because many cpus contend for the socket lock and, once the socket lock is acquired, extra false sharing on various socket fields slows down the operations. To better resist attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node). Additional features: 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting the SOCK_USE_WRITE_QUEUE socket flag speeds up sock_wfree(). 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, the sysctl_wmem_default value was copied at boot time, so any further change would not affect the tcp_sock limit.) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: use seqlock for nh_exceptions (Julian Anastasov, 2012-07-19)
| Use a global seqlock for the nh_exceptions. Call fnhe_oldest with the right hash chain. Correct the diff value for dst_set_expires. v2: after suggestions from Eric Dumazet: * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe * continue daddr search in rt_bind_exception v3: * remove the daddr check before seqlock in rt_bind_exception * restart lookup in rt_bind_exception on detected seqlock change, as suggested by David Miller Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: add ipv6_addr_hash() helper (Eric Dumazet, 2012-07-18)
| Introduce the ipv6_addr_hash() helper, which XORs all bits of an IPv6 address, with an optimized x86_64 version. Use it in the flow dissector, as suggested by Andrew McGregor, to reduce hash collision probabilities in fq_codel (and other users of the flow dissector). Use it in ip6_tunnel.c and use more bit shuffling, as suggested by David Laight, as the existing hash was ignoring most of the bits. Use it in sunrpc and use more bit shuffling, using hash_32(). Use it in net/ipv6/addrconf.c, using hash_32() as well. As a cleanup, use it in net/ipv4/tcp_metrics.c. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew McGregor <andrewmcgr@gmail.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
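The XOR fold described above can be illustrated with a small user-space sketch. It is not the kernel helper: `struct addr128` and both function names are hypothetical, and the 64-bit variant uses `memcpy` rather than unaligned pointer casts so that it stays portable. The point it shows is that folding two 64-bit words gives the same 32-bit result as XORing the four 32-bit words, with fewer operations on 64-bit machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit address: four 32-bit words. */
struct addr128 {
    uint32_t w[4];
};

/* Generic version: XOR the four 32-bit words together. */
uint32_t addr128_hash_xor32(const struct addr128 *a)
{
    assert(a != NULL);
    return a->w[0] ^ a->w[1] ^ a->w[2] ^ a->w[3];
}

/* 64-bit version: XOR two long words, then fold the halves. */
uint32_t addr128_hash_xor64(const struct addr128 *a)
{
    uint64_t q[2];
    uint64_t x;

    assert(a != NULL);
    memcpy(q, a->w, sizeof q);      /* avoids alignment assumptions */
    x = q[0] ^ q[1];
    return (uint32_t)(x ^ (x >> 32));
}
```

A plain XOR fold keeps every input bit in play but does no mixing, which is presumably why several of the callers listed above pass the result through hash_32() afterwards.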
* net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel. (Saurabh, 2012-07-18)
| Incorporated David and Steffen's comments. Add a hook in the rx-path of xfrm4_mode_tunnel for the VTI tunnel module. Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reviewed-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: fix inet6_csk_xmit() (Eric Dumazet, 2012-07-18)
| We should provide a struct flowi6 pointer to inet6_csk_route_socket, so that inet6_csk_xmit() works correctly instead of sending garbage. Also add some consts. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'nexthop_exceptions' (David S. Miller, 2012-07-17)
|\ These patches implement the final mechanism necessary to really allow us to go without the route cache in ipv4. We need a place to have long-term storage of PMTU/redirect information which is independent of the routes themselves, yet does not get us back into a situation where we have to write to metrics or anything like that. For this we use a "next-hop exception" table in the FIB nexthops. The one thing I desperately want to avoid is having to create clone routes in the FIB trie for this purpose, because that is very expensive. However, I'm willing to entertain such an idea later if this current scheme proves to have downsides that the FIB trie variant would not have. In order to accommodate any such scheme, we need to be able to produce a full flow key at PMTU/redirect time. That required an adjustment of the interface call-sites used to propagate these events. For a PMTU/redirect with a fully specified socket, we pass that socket and use it to produce the flow key. Otherwise we use a passed in SKB to formulate the key. There are two cases that need to be distinguished, ICMP message processing (in which case the IP header is at skb->data) and output packet processing (mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)). We also have to make the code able to handle the case where the dst itself passed into the dst_ops->{update_pmtu,redirect} method is invalidated. This matters for calls from sockets that have cached that route. We provide an inet{,6} helper function for this purpose, and edit SCTP specially since it caches routes at the transport rather than socket level. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: Add FIB nexthop exceptions. (David S. Miller, 2012-07-17)
| | In a regime where we have subnetted route entries, we need a way to keep persistent storage for destination specific learned values such as redirects and PMTU values. This is implemented here via nexthop exceptions. The initial implementation is a 2048 entry hash table with reclaiming starting at chain length 5. A more sophisticated scheme can be devised if that proves necessary. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() (David S. Miller, 2012-07-17)
| | This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: Adjust PMTU updates to accomodate route invalidation. (David S. Miller, 2012-07-16)
| | This adjusts the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Add helper inet6_csk_update_pmtu(). (David S. Miller, 2012-07-16)
| | This is the ipv6 version of inet_csk_update_pmtu(). Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: Add helper inet_csk_update_pmtu(). (David S. Miller, 2012-07-16)
| | This abstracts away the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). So we try to rebuild the socket cached route after the method invocation if necessary. This isn't used by SCTP because it needs to cache dsts per-transport, and thus will need its own local version of this helper. Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: implement RFC 5961 3.2 (Eric Dumazet, 2012-07-17)
| | Implement the RFC 5961 mitigation against Blind Reset attacks using the RST bit. The idea is to validate the incoming RST sequence number so that it must match the RCV.NXT value exactly, instead of falling anywhere in the previously accepted window: (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND). If the sequence is in window but not an exact match, send a "challenge ACK", so that the peer can resend an RST with the appropriate sequence. Add a new sysctl, tcp_challenge_ack_limit, to limit the number of challenge ACKs sent per second. Add a new SNMP counter to count the number of challenge ACKs sent. (netstat -s | grep TCPChallengeACK) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
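The RST check described above reduces to a three-way decision, sketched here in user-space C. The enum, the function names, and the wraparound helper are illustrative assumptions rather than the kernel's code, and the per-second rate limit from tcp_challenge_ack_limit is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

enum rst_action {
    RST_DROP,           /* sequence outside the receive window */
    RST_ACCEPT,         /* exact match with RCV.NXT: reset the connection */
    RST_CHALLENGE_ACK   /* in window but not exact: ask the peer to retry */
};

/* True if sequence number a precedes b, modulo 2^32. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

enum rst_action classify_rst(uint32_t seg_seq, uint32_t rcv_nxt,
                             uint32_t rcv_wnd)
{
    /* The signed-difference comparison only holds for windows < 2^31. */
    assert(rcv_wnd < (UINT32_C(1) << 31));

    if (seg_seq == rcv_nxt)
        return RST_ACCEPT;
    /* RCV.NXT <= SEG.SEQ < RCV.NXT + RCV.WND, with wraparound. */
    if (!seq_before(seg_seq, rcv_nxt) &&
        seq_before(seg_seq, rcv_nxt + rcv_wnd))
        return RST_CHALLENGE_ACK;
    return RST_DROP;
}
```

A blind attacker who can only guess a sequence number somewhere in the window now triggers a challenge ACK instead of a reset, while a legitimate peer answers the challenge with an RST carrying the exact sequence.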
* | net: make sock diag per-namespace (Andrey Vagin, 2012-07-17)
|/ Before this patch sock_diag works for init_net only and dumps information about sockets from all namespaces. This patch expands sock_diag for all name-spaces. It creates a netlink kernel socket for each netns and filters data during dumping. v2: filter according to netns in all places; remove an unused variable. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pavel Emelyanov <xemul@parallels.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next (David S. Miller, 2012-07-14)
|\ John Linville says: ==================== Several drivers see updates: mwifiex, ath9k, iwlwifi, brcmsmac, wlcore/wl12xx/wl18xx, and a handful of others. The bcma bus got a lot of attention from Hauke Mehrtens. The cfg80211 component gets a flurry of patches for multi-channel support, and the mac80211 component gets the first few VHT (11ac) and 60GHz (11ad) patches. This also includes the removal of the iwmc3200 drivers, since the hardware never became available to normal people. Additionally, the NFC subsystem gets a series of updates. According to Samuel, "Here are the interesting bits: - A better error management for the HCI stack. - An LLCP "late" binding implementation for a better NFC SAP usage. SAPs are now reserved only when there's a client for it. - Support for Sony RC-S360 (a.k.a. PaSoRi) pn533 based dongle. We can read and write NFC tags and also establish a p2p link with this dongle now. - A few LLCP fixes." Finally, this includes another pull of the fixes from the wireless tree in order to resolve some merge issues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem (John W. Linville, 2012-07-12)
| |\
| | * NFC: Allow HCI driver to pre-open pipes to some gates (Eric Lapuyade, 2012-07-09)
| | | Some NFC chips will statically create and open pipes for both standard and proprietary gates. The driver can now pass this information to HCI such that HCI will not attempt to create and open them, but will instead directly use the passed pipe ids. Signed-off-by: Eric Lapuyade <eric.lapuyade@intel.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| | * NFC: Driver failure API (Eric Lapuyade, 2012-07-09)
| | | This API should be used by drivers, HCI, SHDLC or NCI stacks to report an unrecoverable error. Signed-off-by: Eric Lapuyade <eric.lapuyade@intel.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| | * NFC: Prepare asynchronous error management for driver and shdlc (Eric Lapuyade, 2012-07-09)
| | | Signed-off-by: Eric Lapuyade <eric.lapuyade@intel.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| | * cfg80211: bitrate calculation for 60g (Vladimir Kondratiev, 2012-07-05)
| | | The 60g band uses an MCS scheme different from the .11n one, so the bitrate should be calculated differently. Signed-off-by: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * {nl,cfg}80211: support high bitrates (Vladimir Kondratiev, 2012-07-05)
| | | Until now, a u16 value was used to represent the bitrate value. With VHT bitrates this becomes too small. Introduce a new 32-bit bitrate attribute. nl80211 will report both the new and the old attribute, unless the bitrate doesn't fit into the old u16 attribute, in which case only the new one will be reported. User space tools are encouraged to prefer the 32-bit attribute, if available (since it won't be available on older kernels). Signed-off-by: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com> [reword commit message and comments a bit] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
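The reporting rule described above (always emit the 32-bit value, and emit the 16-bit value only when it fits) can be sketched as follows. The struct and function names are hypothetical and stand in for the two netlink attributes; the units of the bitrate value are left abstract.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two bitrate attributes in one report. */
struct bitrate_attrs {
    bool     has_rate16;   /* legacy attribute present */
    uint16_t rate16;
    bool     has_rate32;   /* new attribute present */
    uint32_t rate32;
};

/* Producer side: always report 32 bits, add 16 bits only if it fits. */
struct bitrate_attrs report_bitrate(uint32_t bitrate)
{
    struct bitrate_attrs a = { false, 0, true, bitrate };

    if (bitrate <= UINT16_MAX) {
        a.has_rate16 = true;
        a.rate16 = (uint16_t)bitrate;
    }
    return a;
}

/* Consumer side: prefer the 32-bit value, fall back for older producers. */
uint32_t read_bitrate(const struct bitrate_attrs *a)
{
    assert(a != NULL && (a->has_rate16 || a->has_rate32));
    return a->has_rate32 ? a->rate32 : a->rate16;
}
```

Omitting the 16-bit attribute on overflow, rather than truncating it, means an old tool sees a missing value instead of a silently wrong one.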
| | * mac80211: add TX prepare API (Johannes Berg, 2012-07-03)
| | | Some drivers require setup before being able to send management frames in managed mode, in particular in multi-channel cases. Introduce API to allow the drivers to do such setup while being able to sleep waiting for the setup to finish in the device. This isn't possible inside the TX call since that can't sleep. A future patch may also restructure the TX retry to wait for the driver to report the frame status, as suggested by Arik in http://mid.gmane.org/CA+XVXffKSEL6ZQPQ98x-zO-NL2=TNF1uN==mprRyUmAaRn254g@mail.gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * mac80211: reduce IEEE80211_TX_MAX_RATES (Thomas Huehn, 2012-07-03)
| | | IEEE80211_TX_MAX_RATES can be reduced from 5 to 4 as there is no current hardware supporting a rate chain with 5 multi rate stages (mrr), so 4 mrr stages are sufficient. The memory that is freed within the ieee80211_tx_info struct will be used in the upcoming Transmission Power Control (TPC) implementation. Suggested-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Thomas Huehn <thomas@net.t-labs.tu-berlin.de> [reword commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * mac80211: remove tx_frags driver callback (Johannes Berg, 2012-07-02)
| | | The implementation of tx_frags is buggy due to not handling queue stop, and there's no driver implementing it, so remove it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * cfg80211: add 802.11ad (60gHz band) support (Vladimir Kondratiev, 2012-07-02)
| | | Add enumerations for both cfg80211 and nl80211. This expands wiphy.bands etc. arrays. Extend channel <-> frequency translation to cover the 60g band and modify the rate check logic since there are no legacy mandatory rates (only MCS is used). Signed-off-by: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * cfg80211/mac80211: remove .get_channel (Michal Kazior, 2012-06-29)
| | | We do not need it anymore since cfg80211 tracks monitor channel and monitor channel type. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * cfg80211: track monitor interfaces count (Michal Kazior, 2012-06-29)
| | | Implements .set_monitor_enabled(wiphy, enabled). Notifies the driver upon a change of interface layout. If only monitor interfaces remain present, it is called with the 2nd argument being true. If a non-monitor interface appears, the 2nd argument is false. The driver is notified only upon change. This makes it more obvious that cfg80211 supports a single monitor channel. Once we implement multi-channel we don't want to allow setting the monitor channel while other interface types are running. Otherwise it would be ambiguous once we start considering num_different_channels. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * cfg80211: track ibss fixed channel (Michal Kazior, 2012-06-29)
| | | IBSS may hop between channels. It is necessary to account for this special case when considering interface combinations. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * cfg80211: add channel tracking for AP and mesh (Michal Kazior, 2012-06-29)
| | | We need to know which channel is used by a running AP and mesh for channel context accounting and finding matching/active interface combination. STA/IBSS have current_bss already which allows us to check which channel a vif is tuned to. Non-fixed channel IBSS can be handled with additional changes. Monitor mode is going to be handled differently. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * Merge remote-tracking branch 'wireless-next/master' into mac80211-next (Johannes Berg, 2012-06-28)
| | |\
| | * | cfg80211: allow advertising VHT capabilities (Mahesh Palivela, 2012-06-28)
| | | | Allow drivers to advertise their VHT capabilities and export them to userspace via nl80211. Signed-off-by: Mahesh Palivela <maheshp@posedge.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | mac80211: don't expose ieee80211_add_srates_ie() (Johannes Berg, 2012-06-28)
| | | | This and ieee80211_add_ext_srates_ie() aren't exported, so can't be used by drivers anyway, but there's also no reason that they should be, so make them private to mac80211 and use sdata instead of vif arguments. Acked-by: Arik Nemtsov <arik@wizery.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | cfg80211: don't allow WoWLAN support without CONFIG_PM (Johannes Berg, 2012-06-27)
| | | | When CONFIG_PM is disabled, no device can possibly support WoWLAN since it can't go to sleep to start with. Due to this, mac80211 had even rejected the hardware registration. By making all the code and data for WoWLAN depend on CONFIG_PM we can promote this runtime error to a compile-time error. Add #ifdef around all WoWLAN code to remove it in systems that don't need it as they never suspend. Cc: Kalle Valo <kvalo@qca.qualcomm.com> Acked-by: Luciano Coelho <coelho@ti.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* | | | ipv4: Don't store a rule pointer in fib_result. (David S. Miller, 2012-07-13)
|/ / / We only use it to fetch the rule's tclassid, so just store the tclassid there instead. This also decreases the size of fib_result by a full 8 bytes on 64-bit. On 32-bit it's a wash. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: Remove tb_peers from fib_table. (David S. Miller, 2012-07-12)
| | | No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect(). (David S. Miller, 2012-07-12)
| | | And delete rt6_redirect(), since it is no longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: Add redirect support to all protocol icmp error handlers. (David S. Miller, 2012-07-12)
| | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: Add ip6_redirect() and ip6_sk_redirect() helper functions. (David S. Miller, 2012-07-12)
| | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: Move bulk of redirect handling into rt6_redirect(). (David S. Miller, 2012-07-12)
| | | This sets things up so that we can have the protocol error handlers call down into the ipv6 route code for redirects just as ipv4 already does. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: Export ndisc option parsing from ndisc.c (David S. Miller, 2012-07-12)
| | | This is going to be used internally by the rt6 redirect code. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: Kill ip_rt_redirect(). (David S. Miller, 2012-07-12)
| | | No longer needed, as the protocol handlers now all properly propagate the redirect back into the routing code. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions. (David S. Miller, 2012-07-12)
| | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect. (David S. Miller, 2012-07-11)
| | | All of the redirect acceptance policy is now contained within. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: Rearrange arguments to ip_rt_redirect() (David S. Miller, 2012-07-11)
| | | Pass in the SKB rather than just the IP addresses, so that policy and other aspects can reside in ip_rt_redirect() rather than icmp_redirect(). Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: TCP Small Queues (Eric Dumazet, 2012-07-11)
| | | This introduces TSQ (TCP Small Queues). The goal of TSQ is to reduce the number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc is not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2: it can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have a reduction of buffering per bulk sender: < 1ms on Gbit (instead of 50ms with TSO), < 8ms on 100Mbit (instead of 132 ms). I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and socket autotuning on both sides no longer uses 4 MBytes. As the skb destructor cannot restart xmit itself (as the qdisc lock might be taken at this point), we delegate the work to a tasklet. We use one tasklet per cpu for performance reasons. If the tasklet finds a socket owned by the user, it sets the TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments.
| | | [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
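The two limits described above (a per-socket byte cap below the TCP layer, and TSO packets capped to half of it) can be sketched as plain arithmetic. This is an illustration of the rule as stated in the message, not the kernel implementation: the function names are hypothetical, and the tasklet and TSQ_OWNED handling are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* May another packet be handed to the qdisc/device layers?
 * wmem_alloc: bytes of this socket currently queued below TCP.
 * limit:      the tcp_limit_output_bytes style cap. */
bool tsq_may_queue(uint32_t wmem_alloc, uint32_t limit)
{
    assert(limit > 0);
    return wmem_alloc < limit;
}

/* Largest TSO packet allowed: half the limit, so that two packets fit
 * under the cap, but never more than the device's GSO maximum. */
uint32_t tsq_tso_cap(uint32_t limit, uint32_t gso_max)
{
    uint32_t half = limit / 2;

    assert(limit > 0);
    return half < gso_max ? half : gso_max;
}
```

With a limit of 40000 and a GSO maximum of 65536 the cap comes out at 20000, which matches the 40000/2 figure in the message; with a 128KB limit the GSO maximum stays the binding constraint.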
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2012-07-11)
|\ \ \ Conflicts: net/batman-adv/bridge_loop_avoidance.c net/batman-adv/bridge_loop_avoidance.h net/batman-adv/soft-interface.c net/mac80211/mlme.c With merge help from Antonio Quartulli (batman-adv) and Stephen Rothwell (drivers/net/usb/qmi_wwan.c). The net/mac80211/mlme.c conflict seemed easy enough, accounting for a conversion to some new tracing macros. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | netfilter: nf_ct_ecache: fix crash with multiple containers, one shutting down (Pablo Neira Ayuso, 2012-07-09)
| | | | Hans reports that he's still hitting: BUG: unable to handle kernel NULL pointer dereference at 000000000000027c IP: [<ffffffff813615db>] netlink_has_listeners+0xb/0x60 PGD 0 Oops: 0000 [#3] PREEMPT SMP CPU 0 It happens when adding a number of containers that do: nfct_query(h, NFCT_Q_CREATE, ct); and most likely one namespace shuts down. This problem was supposed to be fixed by: 70e9942 netfilter: nf_conntrack: make event callback registration per-netns Still, it was missing one rcu_access_pointer to check if the callback is set or not. Reported-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | | ipv6: optimize ipv6 addresses compares (Eric Dumazet, 2012-07-11)
| | | | On 64 bit arches having efficient unaligned accesses (e.g. x86_64) we can use long words to reduce the number of instructions for free. Joe Perches suggested changing ipv6_masked_addr_cmp() to return a bool instead of 'int', to make sure ipv6_masked_addr_cmp() cannot be used in a sorting function. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
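The long-word comparison described above can be sketched in portable user-space C. The struct and function names are hypothetical, and `memcpy` is used in place of the unaligned pointer loads the message refers to. Both functions return a bool, so the masked compare only says whether the addresses differ under the mask and gives no ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit address as raw bytes. */
struct addr16 {
    uint8_t b[16];
};

/* Load the 16 bytes as two 64-bit words. */
static void load_words(const struct addr16 *a, uint64_t w[2])
{
    memcpy(w, a->b, sizeof a->b);
}

/* Equality in two XORs and one OR instead of four 32-bit compares. */
bool addr16_equal(const struct addr16 *a1, const struct addr16 *a2)
{
    uint64_t x[2], y[2];

    assert(a1 != NULL && a2 != NULL);
    load_words(a1, x);
    load_words(a2, y);
    return ((x[0] ^ y[0]) | (x[1] ^ y[1])) == 0;
}

/* True if a1 and a2 differ in any bit selected by mask. */
bool addr16_masked_differs(const struct addr16 *a1, const struct addr16 *mask,
                           const struct addr16 *a2)
{
    uint64_t x[2], y[2], m[2];

    assert(a1 != NULL && mask != NULL && a2 != NULL);
    load_words(a1, x);
    load_words(a2, y);
    load_words(mask, m);
    return (((x[0] ^ y[0]) & m[0]) | ((x[1] ^ y[1]) & m[1])) != 0;
}
```

Because XOR and AND act on each bit independently, the result does not depend on the host's byte order.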