aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAge
* IP: Send an ICMP "Fragment Reassembly Timeout" message when enabling ↵Shan Wei2010-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | connection track No matter whether connection track is enabled, an end host should send an ICMPv4 "Fragment Reassembly Timeout" message when defrag timeout. The reasons are following two points: 1. RFC 792 says: >>>> >> > > If a host reassembling a fragmented datagram cannot complete the >>>> >> > > reassembly due to missing fragments within its time limit it >>>> >> > > discards the datagram, and it may send a time exceeded message. >>>> >> > > >>>> >> > > If fragment zero is not available then no time exceeded need be >>>> >> > > sent at all. >>>> >> > > >>>> >> > > Read more: http://www.faqs.org/rfcs/rfc792.html#ixzz0aOXRD7Wp 2. Patrick McHardy also agrees with this opinion. :-) About the discussion of this opinion, refer to http://patchwork.ozlabs.org/patch/41649 The patch fixed the problem like this: When enabling connection track, fragments are received at PRE_ROUTING HOOK. If they are failed to reassemble, ip_expire() will be called. Before sending an ICMP "Fragment Reassembly Timeout" message, the patch searches router table to get the destination entry only for host type. The patch has been tested on both host type and route type. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* icmp: move icmp_err_convert[] to .rodataAlexey Dobriyan2010-01-23
| | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: constify MIB name tablesAlexey Dobriyan2010-01-23
| | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2010-01-23
|\ | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
| * Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller2010-01-23
| |\
| | * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2010-01-12
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits) sky2: Fix oops in sky2_xmit_frame() after TX timeout Documentation/3c509: document ethtool support af_packet: Don't use skb after dev_queue_xmit() vxge: use pci_dma_mapping_error to test return value netfilter: ebtables: enforce CAP_NET_ADMIN e1000e: fix and commonize code for setting the receive address registers e1000e: e1000e_enable_tx_pkt_filtering() returns wrong value e1000e: perform 10/100 adaptive IFS only on parts that support it e1000e: don't accumulate PHY statistics on PHY read failure e1000e: call pci_save_state() after pci_restore_state() netxen: update version to 4.0.72 netxen: fix set mac addr netxen: fix smatch warning netxen: fix tx ring memory leak tcp: update the netstamp_needed counter when cloning sockets TI DaVinci EMAC: Handle emac module clock correctly. dmfe/tulip: Let dmfe handle DM910x except for SPARC on-board chips ixgbe: Fix compiler warning about variable being used uninitialized netfilter: nf_ct_ftp: fix out of bounds read in update_nl_seq() mv643xx_eth: don't include cache padding in rx desc buffer size ... Fix trivial conflict in drivers/scsi/cxgb3i/cxgb3i_offload.c
| | * \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2009-12-30
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (74 commits) Revert "b43: Enforce DMA descriptor memory constraints" iwmc3200wifi: fix array out-of-boundary access wl1251: timeout one too soon in wl1251_boot_run_firmware() mac80211: fix propagation of failed hardware reconfigurations mac80211: fix race with suspend and dynamic_ps_disable_work ath9k: fix missed error codes in the tx status check ath9k: wake hardware during AMPDU TX actions ath9k: wake hardware for interface IBSS/AP/Mesh removal ath9k: fix suspend by waking device prior to stop cfg80211: fix error path in cfg80211_wext_siwscan wl1271_cmd.c: cleanup char => u8 iwlwifi: Storage class should be before const qualifier ath9k: Storage class should be before const qualifier cfg80211: fix race between deauth and assoc response wireless: remove remaining qual code rt2x00: Add USB ID for Linksys WUSB 600N rev 2. ath5k: fix SWI calibration interrupt storm mac80211: fix ibss join with fixed-bssid libertas: Remove carrier signaling from the scan code orinoco: fix GFP_KERNEL in orinoco_set_key with interrupts disabled ...
| * | | | netlink: With opcode INET_DIAG_BC_S_LE dport was compared in inet_diag_bc_run()Roel Kluin2010-01-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The s-port should be compared. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | ipv4: don't remove /proc/net/rt_acctAlexey Dobriyan2010-01-17
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | /proc/net/rt_acct is not created if NET_CLS_ROUTE=n. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | ipv4: allow warming up the ARP cache with request type gratuitous ARPOctavian Purdila2010-01-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the per device ARP_ACCEPT option is enable, currently we only allow creating new ARP cache entries for response type gratuitous ARP. Allowing gratuitous ARP to create new ARP entries (not only to update existing ones) is useful when we want to avoid unnecessary delays for the first packet of a stream. This patch allows request type gratuitous ARP to create new ARP cache entries as well. This is useful when we want to populate the ARP cache entries for a large number of hosts on the same LAN. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: spread __net_init, __net_exitAlexey Dobriyan2010-01-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __net_init/__net_exit are apparently not going away, so use them to full extent. In some cases __net_init was removed, because it was called from __net_exit code. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: account SYN-ACK timeouts & retransmissionsOctavian Purdila2010-01-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we don't increment SYN-ACK timeouts & retransmissions although we do increment the same stats for SYN. We seem to have lost the SYN-ACK accounting with the introduction of tcp_syn_recv_timer (commit 2248761e in the netdev-vger-cvs tree). This patch fixes this issue. In the process we also rename the v4/v6 syn/ack retransmit functions for clarity. We also add a new request_socket operations (syn_ack_timeout) so we can keep code in inet_connection_sock.c protocol agnostic. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | ipv4: Use less conflicting local var name in change_nexthops() loop macro.David S. Miller2010-01-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As noticed by H Hartley Sweeten, since change_nexthops() uses 'nh' as it's iterator variable, it can conflict with other existing local vars. Use "nexthop_nh" to avoid the conflict and make it easier to figure out where this magic variable comes from. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: Generalized TTL Security MechanismStephen Hemminger2010-01-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the kernel portions needed to implement RFC 5082 Generalized TTL Security Mechanism (GTSM). It is a lightweight security measure against forged packets causing DoS attacks (for BGP). This is already implemented the same way in BSD kernels. For the necessary Quagga patch http://www.gossamer-threads.com/lists/quagga/dev/17389 Description from Cisco http://www.cisco.com/en/US/docs/ios/12_3t/12_3t7/feature/guide/gt_btsh.html It does add one byte to each socket structure, but I did a little rearrangement to reuse a hole (on 64 bit), but it does grow the structure on 32 bit This should be documented on ip(4) man page and the Glibc in.h file also needs update. IPV6_MINHOPLIMIT should also be added (although BSD doesn't support that). Only TCP is supported, but could also be added to UDP, DCCP, SCTP if desired. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'master' of ↵David S. Miller2010-01-11
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/benet/be_cmds.h include/linux/sysctl.h
| * | | ip: fix mc_loop checks for tunnels with multicast outer addressesOctavian Purdila2010-01-06
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we have L3 tunnels with different inner/outer families (i.e. IPV4/IPV6) which use a multicast address as the outer tunnel destination address, multicast packets will be loopbacked back to the sending socket even if IP*_MULTICAST_LOOP is set to disabled. The mc_loop flag is present in the family specific part of the socket (e.g. the IPv4 or IPv4 specific part). setsockopt sets the inner family mc_loop flag. When the packet is pushed through the L3 tunnel it will eventually be processed by the outer family which if different will check the flag in a different part of the socket then it was set. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: restore ip source validationJamal Hadi Salim2009-12-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when using policy routing and the skb mark: there are cases where a back path validation requires us to use a different routing table for src ip validation than the one used for mapping ingress dst ip. One such a case is transparent proxying where we pretend to be the destination system and therefore the local table is used for incoming packets but possibly a main table would be used on outbound. Make the default behavior to allow the above and if users need to turn on the symmetry via sysctl src_valid_mark Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: RFC3069, private VLAN proxy arp supportJesper Dangaard Brouer2010-01-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is to be used together with switch technologies, like RFC3069, that where the individual ports are not allowed to communicate with each other, but they are allowed to talk to the upstream router. As described in RFC 3069, it is possible to allow these hosts to communicate through the upstream router by proxy_arp'ing. This patch basically allow proxy arp replies back to the same interface (from which the ARP request/solicitation was received). Tunable per device via proc "proxy_arp_pvlan": /proc/sys/net/ipv4/conf/*/proxy_arp_pvlan This switch technology is known by different vendor names: - In RFC 3069 it is called VLAN Aggregation. - Cisco and Allied Telesyn call it Private VLAN. - Hewlett-Packard call it Source-Port filtering or port-isolation. - Ericsson call it MAC-Forced Forwarding (RFC Draft). Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Add rtnetlink init_rcvwnd to set the TCP initial receive windowlaurent chavey2009-12-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add rtnetlink init_rcvwnd to set the TCP initial receive window size advertised by passive and active TCP connections. The current Linux TCP implementation limits the advertised TCP initial receive window to the one prescribed by slow start. For short lived TCP connections used for transaction type of traffic (i.e. http requests), bounding the advertised TCP initial receive window results in increased latency to complete the transaction. Support for setting initial congestion window is already supported using rtnetlink init_cwnd, but the feature is useless without the ability to set a larger TCP initial receive window. The rtnetlink init_rcvwnd allows increasing the TCP initial receive window, allowing TCP connection to advertise larger TCP receive window than the ones bounded by slow start. Signed-off-by: Laurent Chavey <chavey@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: Slightly optimize tcp_sendmsgKrishna Kumar2009-12-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Slightly optimize tcp_sendmsg since NETIF_F_SG is used many times iteratively in the loop. The only other modification is to change: } else if (i == MAX_SKB_FRAGS || (!i && !(sk->sk_route_caps & NETIF_F_SG))) { to: } else if (i == MAX_SKB_FRAGS || !sg) { The reason why this change is correct: this code (other than the MAX_SKB_FRAGS case) executes only due to the else part of: "if (skb_tailroom(skb) > 0) {" - i.e. there was no space in the skb to put the data inline. Hence SG is false is a sufficient condition, and there is no way a fragment can be added to the skb. Changelog: - Added the above explanation for the change Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: Remove unrequired operations in tcp_push()Krishna Kumar2009-12-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unrequired operations in tcp_push() Changelog: Removed a temporary skb variable from tcp_push() Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: Remove check in __tcp_push_pending_framesKrishna Kumar2009-12-23
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_push checks tcp_send_head and calls __tcp_push_pending_frames, which again checks tcp_send_head, and this unnecessary check is done for every other caller of __tcp_push_pending_frames. Remove tcp_send_head check in __tcp_push_pending_frames and add the check to tcp_push_pending_frames. Other functions call __tcp_push_pending_frames only when tcp_send_head would evaluate to true. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2009-12-16
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: (42 commits) nfsd: remove pointless paths in file headers nfsd: move most of nfsfh.h to fs/nfsd nfsd: remove unused field rq_reffh nfsd: enable V4ROOT exports nfsd: make V4ROOT exports read-only nfsd: restrict filehandles accepted in V4ROOT case nfsd: allow exports of symlinks nfsd: filter readdir results in V4ROOT case nfsd: filter lookup results in V4ROOT case nfsd4: don't continue "under" mounts in V4ROOT case nfsd: introduce export flag for v4 pseudoroot nfsd: let "insecure" flag vary by pseudoflavor nfsd: new interface to advertise export features nfsd: Move private headers to source directory vfs: nfsctl.c un-used nfsd #includes lockd: Remove un-used nfsd headers #includes s390: remove un-used nfsd #includes sparc: remove un-used nfsd #includes parsic: remove un-used nfsd #includes compat.c: Remove dependence on nfsd private headers ...
| * Merge commit 'v2.6.32-rc8' into HEADJ. Bruce Fields2009-11-23
| |\
| * | nfs: new subdir Documentation/filesystems/nfsJ. Bruce Fields2009-10-27
| | | | | | | | | | | | | | | | | | | | | | | | We're adding enough nfs documentation that it may as well have its own subdirectory. Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* | | Merge branch 'master' of ↵David S. Miller2009-12-16
|\ \ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6
| * | | netfilter: fix crashes in bridge netfilter caused by fragment jumpsPatrick McHardy2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When fragments from bridge netfilter are passed to IPv4 or IPv6 conntrack and a reassembly queue with the same fragment key already exists from reassembling a similar packet received on a different device (f.i. with multicasted fragments), the reassembled packet might continue on a different codepath than where the head fragment originated. This can cause crashes in bridge netfilter when a fragment received on a non-bridge device (and thus with skb->nf_bridge == NULL) continues through the bridge netfilter code. Add a new reassembly identifier for packets originating from bridge netfilter and use it to put those packets in insolated queues. Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14805 Reported-and-Tested-by: Chong Qiao <qiaochong@loongson.cn> Signed-off-by: Patrick McHardy <kaber@trash.net>
* | | | tcp: Revert per-route SACK/DSACK/TIMESTAMP changes.David S. Miller2009-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It creates a regression, triggering badness for SYN_RECV sockets, for example: [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293 [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000 [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32) [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR> CR: 24002442 XER: 00000000 [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000 This is likely caused by the change in the 'estab' parameter passed to tcp_parse_options() when invoked by the functions in net/ipv4/tcp_minisocks.c But even if that is fixed, the ->conn_request() changes made in this patch series is fundamentally wrong. They try to use the listening socket's 'dst' to probe the route settings. The listening socket doesn't even have a route, and you can't get the right route (the child request one) until much later after we setup all of the state, and it must be done by hand. This stuff really isn't ready, so the best thing to do is a full revert. This reverts the following commits: f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d 022c3f7d82f0f1c68018696f2f027b87b9bb45c2 1aba721eba1d84a2defce45b950272cee1e6c72a cda42ebd67ee5fdf09d7057b5a4584d36fe8a335 345cda2fd695534be5a4494f1b59da9daed33663 dc343475ed062e13fc260acccaab91d7d80fd5b2 05eaade2782fb0c90d3034fd7a7d5a16266182bb 6a2a2d6bf8581216e08be15fcb563cfd6c430e1e Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | udp: udp_lib_get_port() fixEric Dumazet2009-12-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we can have a large udp hash table, udp_lib_get_port() loop should be converted to a do {} while (cond) form, or we dont enter it at all if hash table size is exactly 65536. Reported-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller2009-12-11
|\| | | | | | | | | | | | | | | | | | | Conflicts: include/net/tcp.h
| * | | Merge branch 'for-linus' of ↵Linus Torvalds2009-12-09
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...
| | * \ \ Merge branch 'for-next' into for-linusJiri Kosina2009-12-07
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: kernel/irq/chip.c
| | | * | | tree-wide: fix assorted typos all over the placeAndré Goddard Rosa2009-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | That is "success", "unknown", "through", "performance", "[re|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | | | tcp: fix retrans_stamp advancing in error casesIlpo Järvinen2009-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It can happen, that tcp_retransmit_skb fails due to some error. In such cases we might end up into a state where tp->retrans_out is zero but that's only because we removed the TCPCB_SACKED_RETRANS bit from a segment but couldn't retransmit it because of the error that happened. Therefore some assumptions that retrans_out checks are based do not necessarily hold, as there still can be an old retransmission but that is only visible in TCPCB_EVER_RETRANS bit. As retransmission happen in sequential order (except for some very rare corner cases), it's enough to check the head skb for that bit. Main reason for all this complexity is the fact that connection dying time now depends on the validity of the retrans_stamp, in particular, that successive retransmissions of a segment must not advance retrans_stamp under any conditions. It seems after quick thinking that this has relatively low impact as eventually TCP will go into CA_Loss and either use the existing check for !retrans_stamp case or send a retransmission successfully, setting a new base time for the dying timer (can happen only once). At worst, the dying time will be approximately the double of the intented time. In addition, tcp_packet_delayed() will return wrong result (has some cc aspects but due to rarity of these errors, it's hardly an issue). One of retrans_stamp clearing happens indirectly through first going into CA_Open state and then a later ACK lets the clearing to happen. Thus tcp_try_keep_open has to be modified too. Thanks to Damian Lukowski <damian@tvk.rwth-aachen.de> for hinting that this possibility exists (though the particular case discussed didn't after all have it happening but was just a debug patch artifact). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | tcp: Stalling connections: Move timeout calculation routineDamian Lukowski2009-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch moves retransmits_timed_out() from include/net/tcp.h to tcp_timer.c, where it is used. Reported-by: Frederic Leroy <fredo@starox.org> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | [PATCH] tcp: documents timewait refcnt tricks Eric Dumazet2009-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds kerneldoc for inet_twsk_unhash() & inet_twsk_bind_unhash(). With help from Randy Dunlap. Suggested-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | tcp: Fix a connect() race with timewait socketsEric Dumazet2009-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we find a timewait connection in __inet_hash_connect() and reuse it for a new connection request, we have a race window, releasing bind list lock and reacquiring it in __inet_twsk_kill() to remove timewait socket from list. Another thread might find the timewait socket we already chose, leading to list corruption and crashes. Fix is to remove timewait socket from bind list before releasing the bind lock. Note: This problem happens if sysctl_tcp_tw_reuse is set. Reported-by: kapil dakhane <kdakhane@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | tcp: Fix a connect() race with timewait socketsEric Dumazet2009-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | First patch changes __inet_hash_nolisten() and __inet6_hash() to get a timewait parameter to be able to unhash it from ehash at same time the new socket is inserted in hash. This makes sure timewait socket wont be found by a concurrent writer in __inet_check_established() Reported-by: kapil dakhane <kdakhane@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | tcp: Remove runtime check that can never be true.David S. Miller2009-12-08
|/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC even warns about it, as reported by Andrew Morton: net/ipv4/tcp.c: In function 'do_tcp_getsockopt': net/ipv4/tcp.c:2544: warning: comparison is always false due to limited range of data type Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds2009-12-08
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits) mac80211: fix reorder buffer release iwmc3200wifi: Enable wimax core through module parameter iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter iwmc3200wifi: Coex table command does not expect a response iwmc3200wifi: Update wiwi priority table iwlwifi: driver version track kernel version iwlwifi: indicate uCode type when fail dump error/event log iwl3945: remove duplicated event logging code b43: fix two warnings ipw2100: fix rebooting hang with driver loaded cfg80211: indent regulatory messages with spaces iwmc3200wifi: fix NULL pointer dereference in pmkid update mac80211: Fix TX status reporting for injected data frames ath9k: enable 2GHz band only if the device supports it airo: Fix integer overflow warning rt2x00: Fix padding bug on L2PAD devices. WE: Fix set events not propagated b43legacy: avoid PPC fault during resume b43: avoid PPC fault during resume tcp: fix a timewait refcnt race ... Fix up conflicts due to sysctl cleanups (dead sysctl_check code and CTL_UNNUMBERED removed) in kernel/sysctl_check.c net/ipv4/sysctl_net_ipv4.c net/ipv6/addrconf.c net/sctp/sysctl.c
| * | | | | tcp: fix a timewait refcnt raceEric Dumazet2009-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After TCP RCU conversion, tw->tw_refcnt should not be set to 1 in inet_twsk_alloc(). It allows a RCU reader to get this timewait socket, while we not yet stabilized it. Only choice we have is to set tw_refcnt to 0 in inet_twsk_alloc(), then atomic_add() it later, once everything is done. Location of this atomic_add() is tricky, because we dont want another writer to find this timewait in ehash, while tw_refcnt is still zero ! Thanks to Kapil Dakhane tests and reports. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | tcp: connect() race with timewait reuseEric Dumazet2009-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Its currently possible that several threads issuing a connect() find the same timewait socket and try to reuse it, leading to list corruptions. Condition for bug is that these threads bound their socket on same address/port of to-be-find timewait socket, and connected to same target. (SO_REUSEADDR needed) To fix this problem, we could unhash timewait socket while holding ehash lock, to make sure lookups/changes will be serialized. Only first thread finds the timewait socket, other ones find the established socket and return an EADDRNOTAVAIL error. This second version takes into account Evgeniy's review and makes sure inet_twsk_put() is called outside of locked sections. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | tcp: diag: Dont report negative values for rx queueEric Dumazet2009-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both netlink and /proc/net/tcp interfaces can report transient negative values for rx queue. ss -> State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB -6 6 127.0.0.1:45956 127.0.0.1:3333 netstat -> tcp 4294967290 6 127.0.0.1:37784 127.0.0.1:3333 ESTABLISHED This is because we dont lock socket while computing tp->rcv_nxt - tp->copied_seq, and another CPU can update copied_seq before rcv_next in RX path. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge branch 'master' of ↵David S. Miller2009-12-03
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
| | * | | | | netfilter: net/ipv[46]/netfilter: Move && and || to end of previous lineJoe Perches2009-11-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compile tested only. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | netfilter: remove unneccessary checks from netlink notifiersPatrick McHardy2009-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NETLINK_URELEASE notifier is only invoked for bound sockets, so there is no need to check ->pid again. Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | netfilter: nf_nat_helper: tidy up adjust_tcp_sequenceHannes Eder2009-11-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The variable 'other_way' gets initialized but is not read afterwards, so remove it. Pass the right arguments to a pr_debug call. While being at tidy up a bit and it fix this checkpatch warning: WARNING: suspect code indent for conditional statements Signed-off-by: Hannes Eder <heder@google.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | netfilter: remove synchronize_net() calls in ip_queue/ip6_queueEric Dumazet2009-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nf_unregister_queue_handlers() already does a synchronize_rcu() call, we dont need to do it again in callers. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | | | | | net: Batch inet_twsk_purgeEric W. Biederman2009-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function walks the whole hashtable so there is no point in passing it a network namespace. Instead I purge all timewait sockets from dead network namespaces that I find. If the namespace is one of the once I am trying to purge I am guaranteed no new timewait sockets can be formed so this will get them all. If the namespace is one I am not acting for it might form a few more but I will call inet_twsk_purge again and shortly to get rid of them. In any even if the network namespace is dead timewait sockets are useless. Move the calls of inet_twsk_purge into batch_exit routines so that if I am killing a bunch of namespaces at once I will just call inet_twsk_purge once and save a lot of redundant unnecessary work. My simple 4k network namespace exit test the cleanup time dropped from roughly 8.2s to 1.6s. While the time spent running inet_twsk_purge fell to about 2ms. 1ms for ipv4 and 1ms for ipv6. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | net: Use rcu lookups in inet_twsk_purge.Eric W. Biederman2009-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While we are looking up entries to free there is no reason to take the lock in inet_twsk_purge. We have to drop locks and restart occassionally anyway so adding a few more in case we get on the wrong list because of a timewait move is no big deal. At the same time not taking the lock for long periods of time is much more polite to the rest of the users of the hash table. In my test configuration of killing 4k network namespaces this change causes 4k back to back runs of inet_twsk_purge on an empty hash table to go from roughly 20.7s to 3.3s, and the total time to destroy 4k network namespaces goes from roughly 44s to 3.3s. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>