aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAge
* tcp: no md5sig option size check bugDmitry Popov2010-08-07
| | | | | | | | | tcp_parse_md5sig_option doesn't check md5sig option (TCPOPT_MD5SIG) length, but tcp_v[46]_inbound_md5_hash assume that it's at least 16 bytes long. Signed-off-by: Dmitry Popov <dp@highloadlab.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2010-08-03
|\ | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/e1000e/hw.h net/bridge/br_device.c net/bridge/br_input.c
| * tcp: cookie transactions setsockopt memory leakDmitry Popov2010-07-31
| | | | | | | | | | | | | | | | | | | | | | There is a bug in do_tcp_setsockopt(net/ipv4/tcp.c), TCP_COOKIE_TRANSACTIONS case. In some cases (when tp->cookie_values == NULL) new tcp_cookie_values structure can be allocated (at cvp), but not bound to tp->cookie_values. So a memory leak occurs. Signed-off-by: Dmitry Popov <dp@highloadlab.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twiceChangli Gao2010-08-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 6c79bf0f2440fd250c8fce8d9b82fcf03d4e8350 subtracts PPPOE_SES_HLEN from mtu at the front of ip_fragment(). So the later subtraction should be removed. The MTU of 802.1q is also 1500, so MTU should not be changed. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo> ---- net/ipv4/ip_output.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Add getsockopt support for TCP thin-streamsJosh Hunt2010-08-02
| | | | | | | | | | | | | | | | | | | | | | Initial TCP thin-stream commit did not add getsockopt support for the new socket options: TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK. This adds support for them. Signed-off-by: Josh Hunt <johunt@akamai.com> Tested-by: Andreas Petlund <apetlund@simula.no> Acked-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2010-08-02
|\ \ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
| * | netfilter: nf_nat: don't check if the tuple is unique when there isn't any ↵Changli Gao2010-08-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | other choice The tuple got from unique_tuple() doesn't need to be really unique, so the check for the unique tuple isn't necessary, when there isn't any other choice. Eliminating the unnecessary nf_nat_used_tuple() can save some CPU cycles too. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: nf_nat: make unique_tuple return voidChangli Gao2010-08-02
| | | | | | | | | | | | | | | | | | | | | | | | The only user of unique_tuple() get_unique_tuple() doesn't care about the return value of unique_tuple(), so make unique_tuple() return void (nothing). Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: nf_nat: use local variable hdrlenChangli Gao2010-08-02
| | | | | | | | | | | | | | | | | | | | | Use local variable hdrlen instead of ip_hdrlen(skb). Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: {ip,ip6,arp}_tables: dont block bottom half more than necessaryEric Dumazet2010-08-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently disable BH for the whole duration of get_counters() On machines with a lot of cpus and large tables, this might be too long. We can disable preemption during the whole function, and disable BH only while fetching counters for the current cpu. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: iptables: use skb->len for accountingChangli Gao2010-07-23
| | | | | | | | | | | | | | | | | | | | | Use skb->len for accounting as xt_quota does. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: arptables: use arp_hdr_len()Changli Gao2010-07-23
| | | | | | | | | | | | | | | | | | | | | use arp_hdr_len(). Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: nf_nat_core: merge the same linesChangli Gao2010-07-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | proto->unique_tuple() will be called finally, if the previous calls fail. This patch checks the false condition of (range->flags &IP_NAT_RANGE_PROTO_RANDOM) instead to avoid duplicate line of code: proto->unique_tuple(). Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: ipt_REJECT: avoid touching dst refEric Dumazet2010-07-05
| | | | | | | | | | | | | | | | | | | | | We can avoid a pair of atomic ops in ipt_REJECT send_reset() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: ipt_REJECT: postpone the checksum calculation.Changli Gao2010-07-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | postpone the checksum calculation, then if the output NIC supports checksum offloading, we can utlize it. And though the output NIC doesn't support checksum offloading, but we'll mangle this packet, this can free us from updating the checksum, as the checksum calculation occurs later. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
* | | net: RTA_MARK additionEric Dumazet2010-07-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new rt attribute, RTA_MARK, and use it in rt_fill_info()/inet_rtm_getroute() to support following commands : ip route get 192.168.20.110 mark NUMBER ip route get 192.168.20.108 from 192.168.20.110 iif eth1 mark NUMBER ip route list cache [192.168.20.110] mark NUMBER Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: remove last uses of __attribute__((packed))Gustavo F. Padovan2010-07-21
| | | | | | | | | | | | | | | | | | | | | Network code uses the __packed macro instead of __attribute__((packed)). Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'master' of ↵David S. Miller2010-07-20
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/vhost/net.c net/bridge/br_device.c Fix merge conflict in drivers/vhost/net.c with guidance from Stephen Rothwell. Revert the effects of net-2.6 commit 573201f36fd9c7c6d5218cdcd9948cee700b277d since net-next-2.6 has fixes that make bridge netpoll work properly thus we don't need it disabled. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: fix crash in tcp_xmit_retransmit_queueIlpo Järvinen2010-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It can happen that there are no packets in queue while calling tcp_xmit_retransmit_queue(). tcp_write_queue_head() then returns NULL and that gets deref'ed to get sacked into a local var. There is no work to do if no packets are outstanding so we just exit early. This oops was introduced by 08ebd1721ab8fd (tcp: remove tp->lost_out guard to make joining diff nicer). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de> Tested-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipmr: Don't leak memory if fib lookup fails.Ben Greear2010-07-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was detected using two mcast router tables. The pimreg for the second interface did not have a specific mrule, so packets received by it were handled by the default table, which had nothing configured. This caused the ipmr_fib_lookup to fail, causing the memory leak. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | rfs: call sock_rps_record_flow() in tcp_splice_read()Changli Gao2010-07-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rfs: call sock_rps_record_flow() in tcp_splice_read() call sock_rps_record_flow() in tcp_splice_read(), so the applications using splice(2) or sendfile(2) can utilize RFS. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- net/ipv4/tcp.c | 1 + 1 file changed, 1 insertion(+) Signed-off-by: David S. Miller <davem@davemloft.net>
* | | inet, inet6: make tcp_sendmsg() and tcp_sendpage() through inet_sendmsg() ↵Changli Gao2010-07-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and inet_sendpage() a new boolean flag no_autobind is added to structure proto to avoid the autobind calls when the protocol is TCP. Then sock_rps_record_flow() is called int the TCP's sendmsg() and sendpage() pathes. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/inet_common.h | 4 ++++ include/net/sock.h | 1 + include/net/tcp.h | 8 ++++---- net/ipv4/af_inet.c | 15 +++++++++------ net/ipv4/tcp.c | 11 +++++------ net/ipv4/tcp_ipv4.c | 3 +++ net/ipv6/af_inet6.c | 8 ++++---- net/ipv6/tcp_ipv6.c | 3 +++ 8 files changed, 33 insertions(+), 20 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net/ipv4: EXPORT_SYMBOL cleanupsEric Dumazet2010-07-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | gre: propagate ipv6 transport classStephen Hemminger2010-07-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes IPV6 over IPv4 GRE tunnel propagate the transport class field from the underlying IPV6 header to the IPV4 Type Of Service field. Without the patch, all IPV6 packets in tunnel look the same to QoS. This assumes that IPV6 transport class is exactly the same as IPv4 TOS. Not sure if that is always the case? Maybe need to mask off some bits. The mask and shift to get tclass is copied from ipv6/datagram.c Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'master' of ↵David S. Miller2010-07-07
|\| | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
| * | xfrm: fix xfrm by MARK logicPeter Kosyh2010-07-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While using xfrm by MARK feature in 2.6.34 - 2.6.35 kernels, the mark is always cleared in flowi structure via memset in _decode_session4 (net/ipv4/xfrm4_policy.c), so the policy lookup fails. IPv6 code is affected by this bug too. Signed-off-by: Peter Kosyh <p.kosyh@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net/ipv4/ip_output.c: Removal of unused variable in ip_fragment()George Kadianakis2010-07-07
| | | | | | | | | | | | | | | | | | | | | Removal of unused integer variable in ip_fragment(). Signed-off-by: George Kadianakis <desnacked@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: use skb_dst_copy() in ip_copy_metadata()Eric Dumazet2010-07-05
| |/ |/| | | | | | | | | | | Avoid touching dst refcount in ip_fragment(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2010-07-03
|\ \ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
| * | netfilter: ipt_LOG/ip6t_LOG: add option to print decoded MAC headerPatrick McHardy2010-06-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The LOG targets print the entire MAC header as one long string, which is not readable very well: IN=eth0 OUT= MAC=00:15:f2:24:91:f8:00:1b:24:dc:61:e6:08:00 ... Add an option to decode known header formats (currently just ARPHRD_ETHER devices) in their individual fields: IN=eth0 OUT= MACSRC=00:1b:24:dc:61:e6 MACDST=00:15:f2:24:91:f8 MACPROTO=0800 ... IN=eth0 OUT= MACSRC=00:1b:24:dc:61:e6 MACDST=00:15:f2:24:91:f8 MACPROTO=86dd ... The option needs to be explicitly enabled by userspace to avoid breaking existing parsers. Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: ipt_LOG/ip6t_LOG: remove comparison within loopPatrick McHardy2010-06-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the comparison within the loop to print the macheader by prepending the colon to all but the first printk. Based on suggestion by Jan Engelhardt <jengelh@medozas.de>. Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | netfilter: nf_nat: support user-specified SNAT rules in LOCAL_INPatrick McHardy2010-06-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 2.6.34 introduced 'conntrack zones' to deal with cases where packets from multiple identical networks are handled by conntrack/NAT. Packets are looped through veth devices, during which they are NATed to private addresses, after which they can continue normally through the stack and possibly have NAT rules applied a second time. This works well, but is needlessly complicated for cases where only a single SNAT/DNAT mapping needs to be applied to these packets. In that case, all that needs to be done is to assign each network to a seperate zone and perform NAT as usual. However this doesn't work for packets destined for the machine performing NAT itself since its corrently not possible to configure SNAT mappings for the LOCAL_IN chain. This patch adds a new INPUT chain to the NAT table and changes the targets performing SNAT to be usable in that chain. Example usage with two identical networks (192.168.0.0/24) on eth0/eth1: iptables -t raw -A PREROUTING -i eth0 -j CT --zone 1 iptables -t raw -A PREROUTING -i eth0 -j MARK --set-mark 1 iptables -t raw -A PREROUTING -i eth1 -j CT --zone 2 iptabels -t raw -A PREROUTING -i eth1 -j MARK --set-mark 2 iptables -t nat -A INPUT -m mark --mark 1 -j NETMAP --to 10.0.0.0/24 iptables -t nat -A POSTROUTING -m mark --mark 1 -j NETMAP --to 10.0.0.0/24 iptables -t nat -A INPUT -m mark --mark 2 -j NETMAP --to 10.0.1.0/24 iptables -t nat -A POSTROUTING -m mark --mark 2 -j NETMAP --to 10.0.1.0/24 iptables -t raw -A PREROUTING -d 10.0.0.0/24 -j CT --zone 1 iptables -t raw -A OUTPUT -d 10.0.0.0/24 -j CT --zone 1 iptables -t raw -A PREROUTING -d 10.0.1.0/24 -j CT --zone 2 iptables -t raw -A OUTPUT -d 10.0.1.0/24 -j CT --zone 2 iptables -t nat -A PREROUTING -d 10.0.0.0/24 -j NETMAP --to 192.168.0.0/24 iptables -t nat -A OUTPUT -d 10.0.0.0/24 -j NETMAP --to 192.168.0.0/24 iptables -t nat -A PREROUTING -d 10.0.1.0/24 -j NETMAP --to 192.168.0.0/24 iptables -t nat -A OUTPUT -d 10.0.1.0/24 -j NETMAP --to 192.168.0.0/24 Signed-off-by: Patrick McHardy <kaber@trash.net>
* | | fragment: add fast path for in-order fragmentsChangli Gao2010-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add fast path for in-order fragments As the fragments are sent in order in most of OSes, such as Windows, Darwin and FreeBSD, it is likely the new fragments are at the end of the inet_frag_queue. In the fast path, we check if the skb at the end of the inet_frag_queue is the prev we expect. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/inet_frag.h | 1 + net/ipv4/ip_fragment.c | 12 ++++++++++++ net/ipv6/reassembly.c | 11 +++++++++++ 3 files changed, 24 insertions(+) Signed-off-by: David S. Miller <davem@davemloft.net>
* | | snmp: 64bit ipstats_mib for all archesEric Dumazet2010-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/net/snmp and /proc/net/netstat expose SNMP counters. Width of these counters is either 32 or 64 bits, depending on the size of "unsigned long" in kernel. This means user program parsing these files must already be prepared to deal with 64bit values, regardless of user program being 32 or 64 bit. This patch introduces 64bit snmp values for IPSTAT mib, where some counters can wrap pretty fast if they are 32bit wide. # netstat -s|egrep "InOctets|OutOctets" InOctets: 244068329096 OutOctets: 244069348848 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: tso_fragment() might avoid GFP_ATOMICEric Dumazet2010-06-29
| | | | | | | | | | | | | | | | | | | | | | | | We can pass a gfp argument to tso_fragment() and avoid GFP_ATOMIC allocations sometimes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: use this_cpu_ptr()Eric Dumazet2010-06-29
| | | | | | | | | | | | | | | | | | | | | use this_cpu_ptr(p) instead of per_cpu_ptr(p, smp_processor_id()) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | syncookies: add support for ECNFlorian Westphal2010-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allows use of ECN when syncookies are in effect by encoding ecn_ok into the syn-ack tcp timestamp. While at it, remove a uneeded #ifdef CONFIG_SYN_COOKIES. With CONFIG_SYN_COOKIES=nm want_cookie is ifdef'd to 0 and gcc removes the "if (0)". Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | syncookies: do not store rcv_wscale in tcp timestampFlorian Westphal2010-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As pointed out by Fernando Gont there is no need to encode rcv_wscale into the cookie. We did not use the restored rcv_wscale anyway; it is recomputed via tcp_select_initial_window(). Thus we can save 4 bits in the ts option space by removing rcv_wscale. In case window scaling was not supported, we set the (invalid) wscale value 0xf. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | snmp: add align parameter to snmp_mib_init()Eric Dumazet2010-06-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for 64bit snmp counters for some mibs, add an 'align' parameter to snmp_mib_init(), instead of assuming mibs only contain 'unsigned long' fields. Callers can use __alignof__(type) to provide correct alignment. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | arp: RCU change in arp_solicit()Eric Dumazet2010-06-26
| | | | | | | | | | | | | | | | | | | | | Avoid two atomic ops in arp_solicit() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: do not send reset to already closed socketsKonstantin Khorenko2010-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | i've found that tcp_close() can be called for an already closed socket, but still sends reset in this case (tcp_send_active_reset()) which seems to be incorrect. Moreover, a packet with reset is sent with different source port as original port number has been already cleared on socket. Besides that incrementing stat counter for LINUX_MIB_TCPABORTONCLOSE also does not look correct in this case. Initially this issue was found on 2.6.18-x RHEL5 kernel, but the same seems to be true for the current mainstream kernel (checked on 2.6.35-rc3). Please, correct me if i missed something. How that happens: 1) the server receives a packet for socket in TCP_CLOSE_WAIT state that triggers a tcp_reset(): Call Trace: <IRQ> [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8 [<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08 [<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a [<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43 [<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259 [<ffffffff80037131>] ip_local_deliver+0x200/0x2f4 [<ffffffff8003843c>] ip_rcv+0x64c/0x69f [<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa [<ffffffff80032eca>] process_backlog+0x90/0xec [<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1 [<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce [<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0 [<ffffffff8006334c>] call_softirq+0x1c/0x28 [<ffffffff80070476>] do_softirq+0x2c/0x85 [<ffffffff80070441>] do_IRQ+0x149/0x152 [<ffffffff80062665>] ret_from_intr+0x0/0xa <EOI> [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303 [<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303 [<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e [<ffffffff8006a263>] do_page_fault+0x49a/0x7dc [<ffffffff80066487>] thread_return+0x89/0x174 [<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c [<ffffffff80062e39>] error_exit+0x0/0x84 tcp_rcv_state_process() ... // (sk_state == TCP_CLOSE_WAIT here) ... /* step 2: check RST bit */ if(th->rst) { tcp_reset(sk); goto discard; } ... --------------------------------- tcp_rcv_state_process tcp_reset tcp_done tcp_set_state(sk, TCP_CLOSE); inet_put_port __inet_put_port inet_sk(sk)->num = 0; sk->sk_shutdown = SHUTDOWN_MASK; 2) After that the process (socket owner) tries to write something to that socket and "inet_autobind" sets a _new_ (which differs from the original!) port number for the socket: Call Trace: [<ffffffff80255a12>] inet_bind_hash+0x33/0x5f [<ffffffff80257180>] inet_csk_get_port+0x216/0x268 [<ffffffff8026bcc9>] inet_autobind+0x22/0x8f [<ffffffff80049140>] inet_sendmsg+0x27/0x57 [<ffffffff8003a9d9>] do_sock_write+0xae/0xea [<ffffffff80226ac7>] sock_writev+0xdc/0xf6 [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe [<ffffffff8001fb49>] __pollwait+0x0/0xdd [<ffffffff8008d533>] default_wake_function+0x0/0xe [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e [<ffffffff800f0b49>] do_readv_writev+0x163/0x274 [<ffffffff80066538>] thread_return+0x13a/0x174 [<ffffffff800145d8>] tcp_poll+0x0/0x1c9 [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3 [<ffffffff800f0dd0>] sys_writev+0x49/0xe4 [<ffffffff800622dd>] tracesys+0xd5/0xe0 3) sendmsg fails at last with -EPIPE (=> 'write' returns -EPIPE in userspace): F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3 Call Trace: [<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87 [<ffffffff80033300>] release_sock+0x10/0xae [<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7 [<ffffffff8026bd32>] inet_autobind+0x8b/0x8f [<ffffffff8003a9d9>] do_sock_write+0xae/0xea [<ffffffff80226ac7>] sock_writev+0xdc/0xf6 [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe [<ffffffff8001fb49>] __pollwait+0x0/0xdd [<ffffffff8008d533>] default_wake_function+0x0/0xe [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e [<ffffffff800f0b49>] do_readv_writev+0x163/0x274 [<ffffffff80066538>] thread_return+0x13a/0x174 [<ffffffff800145d8>] tcp_poll+0x0/0x1c9 [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3 [<ffffffff800f0dd0>] sys_writev+0x49/0xe4 [<ffffffff800622dd>] tracesys+0xd5/0xe0 tcp_sendmsg() ... /* Wait for a connection to finish. */ if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) { int old_state = sk->sk_state; if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) { if (f_d && (err == -EPIPE)) { printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, " "sk_err=%d, sk_shutdown=%d\n", sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state, sk->sk_err, sk->sk_shutdown); dump_stack(); } goto out_err; } } ... 4) Then the process (socket owner) understands that it's time to close that socket and does that (and thus triggers sending reset packet): Call Trace: ... [<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6 [<ffffffff80034698>] ip_output+0x351/0x384 [<ffffffff80251ae9>] dst_output+0x0/0xe [<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2 [<ffffffff80095700>] vprintk+0x21/0x33 [<ffffffff800070f0>] check_poison_obj+0x2e/0x206 [<ffffffff80013587>] poison_obj+0x36/0x45 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d [<ffffffff80023481>] dbg_redzone1+0x1c/0x25 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d [<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8 [<ffffffff80023405>] tcp_transmit_skb+0x764/0x786 [<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d [<ffffffff80258ff1>] tcp_close+0x39a/0x960 [<ffffffff8026be12>] inet_release+0x69/0x80 [<ffffffff80059b31>] sock_release+0x4f/0xcf [<ffffffff80059d4c>] sock_close+0x2c/0x30 [<ffffffff800133c9>] __fput+0xac/0x197 [<ffffffff800252bc>] filp_close+0x59/0x61 [<ffffffff8001eff6>] sys_close+0x85/0xc7 [<ffffffff800622dd>] tracesys+0xd5/0xe0 So, in brief: * a received packet for socket in TCP_CLOSE_WAIT state triggers tcp_reset() which clears inet_sk(sk)->num and put socket into TCP_CLOSE state * an attempt to write to that socket forces inet_autobind() to get a new port (but the write itself fails with -EPIPE) * tcp_close() called for socket in TCP_CLOSE state sends an active reset via socket with newly allocated port This adds an additional check in tcp_close() for already closed sockets. We do not want to send anything to closed sockets. Signed-off-by: Konstantin Khorenko <khorenko@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'master' of ↵David S. Miller2010-06-23
|\ \ \ | | |/ | |/| | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/ipv4/ip_output.c
| * | udp: Fix bogus UFO packet generationHerbert Xu2010-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been reported that the new UFO software fallback path fails under certain conditions with NFS. I tracked the problem down to the generation of UFO packets that are smaller than the MTU. The software fallback path simply discards these packets. This patch fixes the problem by not generating such packets on the UFO path. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net - IP_NODEFRAG option for IPv4 socketJiri Olsa2010-06-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | this patch is implementing IP_NODEFRAG option for IPv4 socket. The reason is, there's no other way to send out the packet with user customized header of the reassembly part. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | syncookies: check decoded options against sysctl settingsFlorian Westphal2010-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Discard the ACK if we find options that do not match current sysctl settings. Previously it was possible to create a connection with sack, wscale, etc. enabled even if the feature was disabled via sysctl. Also remove an unneeded call to tcp_sack_reset() in cookie_check_timestamp: Both call sites (cookie_v4_check, cookie_v6_check) zero "struct tcp_options_received", hand it to tcp_parse_options() (which does not change tcp_opt->num_sacks/dsack) and then call cookie_check_timestamp(). Even if num_sacks/dsacks were changed, the structure is allocated on the stack and after cookie_check_timestamp returns only a few selected members are copied to the inet_request_sock. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | inetpeer: restore small inet_peer structuresEric Dumazet2010-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Addition of rcu_head to struct inet_peer added 16bytes on 64bit arches. Thats a bit unfortunate, since old size was exactly 64 bytes. This can be solved, using an union between this rcu_head an four fields, that are normally used only when a refcount is taken on inet_peer. rcu_head is used only when refcnt=-1, right before structure freeing. Add a inet_peer_refcheck() function to check this assertion for a while. We can bring back SLAB_HWCACHE_ALIGN qualifier in kmem cache creation. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | inetpeer: do not use zero refcnt for freed entriesEric Dumazet2010-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Followup of commit aa1039e73cc2 (inetpeer: RCU conversion) Unused inet_peer entries have a null refcnt. Using atomic_inc_not_zero() in rcu lookups is not going to work for them, and slow path is taken. Fix this using -1 marker instead of 0 for deleted entries. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipfrag : frag_kfree_skb() cleanupEric Dumazet2010-06-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | Third param (work) is unused, remove it. Remove __inline__ and inline qualifiers. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ip_frag: Remove some atomic opsEric Dumazet2010-06-15
| | | | | | | | | | | | | | | | | | | | | Instead of doing one atomic operation per frag, we can factorize them. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | inetpeer: RCU conversionEric Dumazet2010-06-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | inetpeer currently uses an AVL tree protected by an rwlock. It's possible to make most lookups use RCU 1) Add a struct rcu_head to struct inet_peer 2) add a lookup_rcu_bh() helper to perform lockless and opportunistic lookup. This is a normal function, not a macro like lookup(). 3) Add a limit to number of links followed by lookup_rcu_bh(). This is needed in case we fall in a loop. 4) add an smp_wmb() in link_to_pool() right before node insert. 5) make unlink_from_pool() use atomic_cmpxchg() to make sure it can take last reference to an inet_peer, since lockless readers could increase refcount, even while we hold peers.lock. 6) Delay struct inet_peer freeing after rcu grace period so that lookup_rcu_bh() cannot crash. 7) inet_getpeer() first attempts lockless lookup. Note this lookup can fail even if target is in AVL tree, but a concurrent writer can let tree in a non correct form. If this attemps fails, lock is taken a regular lookup is performed again. 8) convert peers.lock from rwlock to a spinlock 9) Remove SLAB_HWCACHE_ALIGN when peer_cachep is created, because rcu_head adds 16 bytes on 64bit arches, doubling effective size (64 -> 128 bytes) In a future patch, this is probably possible to revert this part, if rcu field is put in an union to share space with rid, ip_id_count, tcp_ts & tcp_ts_stamp. These fields being manipulated only with refcnt > 0. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>