litmus-rt.git - The LITMUS^RT kernel.

	Commit message (Collapse)	Author	Age
*	[IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_xmit().	Yasuyuki Kozakai	2007-04-26
\| \| \| \| \| \| \| \|	This enables to add IPv4/IPv6 specific handling later, Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_rcv().	Yasuyuki Kozakai	2007-04-26
\| \| \| \| \| \| \| \|	This enables to add IPv4/IPv6 specific handling later, Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_err().	Yasuyuki Kozakai	2007-04-26
\| \| \| \| \| \| \| \|	This enables to add IPv4/IPv6 specific error handling later, Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV6]: Decentralize EXPORT_SYMBOLs.	YOSHIFUJI Hideaki	2007-04-26
\| \| \| \|	Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
*	[NETLINK]: Mirror UDP MSG_TRUNC semantics.	David S. Miller	2007-04-26
\| \| \| \| \| \| \| \| \|	If the user passes MSG_TRUNC in via msg_flags, return the full packet size not the truncated size. Idea from Herbert Xu and Thomas Graf. Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: convert network timestamps to ktime_t	Eric Dumazet	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We currently use a special structure (struct skb_timeval) and plain 'struct timeval' to store packet timestamps in sk_buffs and struct sock. This has some drawbacks : - Fixed resolution of micro second. - Waste of space on 64bit platforms where sizeof(struct timeval)=16 I suggest using ktime_t that is a nice abstraction of high resolution time services, currently capable of nanosecond resolution. As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits a 8 byte shrink of this structure on 64bit architectures. Some other structures also benefit from this size reduction (struct ipq in ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...) Once this ktime infrastructure adopted, we can more easily provide nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or SO_TIMESTAMPNS/SCM_TIMESTAMPNS) Note : this patch includes a bug correction in compat_sock_get_timestamp() where a "err = 0;" was missing (so this syscall returned -ENOENT instead of 0) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> CC: Stephen Hemminger <shemminger@linux-foundation.org> CC: John find <linux.kernel@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: div64_64 consolidate (rev3)	Stephen Hemminger	2007-04-26
\| \| \| \| \| \| \|	Here is the current version of the 64 bit divide common code. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: Convert xtime.tv_sec to get_seconds()	James Morris	2007-04-26
\| \| \| \| \| \| \| \|	Where appropriate, convert references to xtime.tv_sec to the get_seconds() helper function. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[PKTGEN]: fix device name handling	Stephen Hemminger	2007-04-26
\| \| \| \| \| \| \| \| \| \| \|	Since devices can change name and other wierdness, don't hold onto a copy of device name, instead use pointer to output device. Fix a couple of leaks in error handling path as well. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[PKTGEN]: don't use __constant_htonl()	Stephen Hemminger	2007-04-26
\| \| \| \| \| \| \| \| \|	The existing htonl() macro is smart enough to do the same code as using __constant_htonl() and it looks cleaner. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[PKTGEN]: use random32	Stephen Hemminger	2007-04-26
\| \| \| \| \| \| \| \|	Can use random32() now. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[PKTGEN]: use pr_debug	Stephen Hemminger	2007-04-26
\| \| \| \| \| \| \| \|	Remove private debug macro and replace with standard version Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: Keep sk_backlog near sk_lock	Eric Dumazet	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sk_backlog is a critical field of struct sock. (known famous words) It is (ab)used in hot paths, in particular in release_sock(), tcp_recvmsg(), tcp_v4_rcv(), sk_receive_skb(). It really makes sense to place it next to sk_lock, because sk_backlog is only used after sk_lock locked (and thus memory cache line in L1 cache). This should reduce cache misses and sk_lock acquisition time. (In theory, we could only move the head pointer near sk_lock, and leaving tail far away, because 'tail' is normally not so hot, but keep it simple :) ) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: FRTO undo response falls back to ratehalving one if ECEd	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Undoing ssthresh is disabled in fastretrans_alert whenever FLAG_ECE is set by clearing prior_ssthresh. The clearing does not protect FRTO because FRTO operates before fastretrans_alert. Moving the clearing of prior_ssthresh earlier seems to be a suboptimal solution to the FRTO case because then FLAG_ECE will cause a second ssthresh reduction in try_to_open (the first occurred when FRTO was entered). So instead, FRTO falls back immediately to the rate halving response, which switches TCP to CA_CWR state preventing the latter reduction of ssthresh. If the first ECE arrived before the ACK after which FRTO is able to decide RTO as spurious, prior_ssthresh is already cleared. Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be set also in the following ACKs resulting in rate halving response that sees TCP is already in CA_CWR, which again prevents an extra ssthresh reduction on that round-trip. If the first ECE arrived before RTO, ssthresh has already been adapted and prior_ssthresh remains cleared on entry because TCP is in CA_CWR (the same applies also to a case where FRTO is entered more than once and ECE comes in the middle). High_seq must not be touched after tcp_enter_cwr because CWR round-trip calculation depends on it. I believe that after this patch, FRTO should be ECN-safe and even able to take advantage of synergy benefits. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Complete icsk-to-local-variable change (in tcp_enter_cwr)	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \|	A local variable for icsk was created but this change was missing. Spotted by Jarek Poplawski. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Add two new spurious RTO responses to FRTO	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	New sysctl tcp_frto_response is added to select amongst these responses: - Rate halving based; reuses CA_CWR state (default) - Very conservative; used to be the only one available (=1) - Undo cwr; undoes ssthresh and cwnd reductions (=2) The response with rate halving requires a new parameter to tcp_enter_cwr because FRTO has already reduced ssthresh and doing a second reduction there has to be prevented. In addition, to keep things nice on 80 cols screen, a local variable was added. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Correct reordering detection change (no FRTO case)	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \|	The reordering detection must work also when FRTO has not been used at all which was the original intention of mine, just the expression of the idea was flawed. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Keep copied_seq, rcv_wup and rcv_next together.	Eric Dumazet	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I noticed in oprofile study a cache miss in tcp_rcv_established() to read copied_seq. ffffffff80400a80 <tcp_rcv_established>: /* tcp_rcv_established total: 4034293 2.0400 */ 55493 0.0281 :ffffffff80400bc9: mov 0x4c8(%r12),%eax copied_seq 543103 0.2746 :ffffffff80400bd1: cmp 0x3e0(%r12),%eax rcv_nxt if (tp->copied_seq == tp->rcv_nxt && len - tcp_header_len <= tp->ucopy.len) { In this function, the cache line 0x4c0 -> 0x500 is used only for this reading 'copied_seq' field. rcv_wup and copied_seq should be next to rcv_nxt field, to lower number of active cache lines in hot paths. (tcp_rcv_established(), tcp_poll(), ...) As you suggested, I changed tcp_create_openreq_child() so that these fields are changed together, to avoid adding a new store buffer stall. Patch is 64bit friendly (no new hole because of alignment constraints) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: struct *sock argument renamed: sp -> sk	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \|	In general, TCP code uses "sk" for struct sock pointer. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Add RFC3742 Limited Slow-Start, controlled by variable ↵	John Heffner	2007-04-26
\| \| \| \| \| \| \|	sysctl_tcp_max_ssthresh. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] YeAH-TCP: algorithm implementation	Angelo P. Castellani	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	YeAH-TCP is a sender-side high-speed enabled TCP congestion control algorithm, which uses a mixed loss/delay approach to compute the congestion window. It's design goals target high efficiency, internal, RTT and Reno fairness, resilience to link loss while keeping network elements load as low as possible. For further details look here: http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf Signed-off-by: Angelo P. Castellani <angelo.castellani@gmail.con> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: SACK enhanced FRTO	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implements the SACK-enhanced FRTO given in RFC4138 using the variant given in Appendix B. RFC4138, Appendix B: "This means that in order to declare timeout spurious, the TCP sender must receive an acknowledgment for non-retransmitted segment between SND.UNA and RecoveryPoint in algorithm step 3. RecoveryPoint is defined in conservative SACK-recovery algorithm [RFC3517]" The basic version of the FRTO algorithm can still be used also when SACK is enabled. To enabled SACK-enhanced version, tcp_frto sysctl is set to 2. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Prevent reordering adjustments during FRTO	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \|	To be honest, I'm not too sure how the reord stuff works in the first place but this seems necessary. When FRTO has been active, the one and only retransmission could be unnecessary but the state and sending order might not be what the sacktag code expects it to be (to work correctly). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Fake cwnd for ssthresh callback	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP without FRTO would be in Loss state with small cwnd. FRTO, however, leaves cwnd (typically) to a larger value which causes ssthresh to become too large in case RTO is triggered again compared to what conventional recovery would do. Because consecutive RTOs result in only a single ssthresh reduction, RTO+cumulative ACK+RTO pattern is required to trigger this event. A large comment is included for congestion control module writers trying to figure out what CA_EVENT_FRTO handler should do because there exists a remote possibility of incompatibility between FRTO and module defined ssthresh functions. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Reverse RETRANS bit clearing logic	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously RETRANS bits were cleared on the entry to FRTO. We postpone that into tcp_enter_frto_loss, which is really the place were the clearing should be done anyway. This allows simplification of the logic from a clearing loop to the head skb clearing only. Besides, the other changes made in the previous patches to tcp_use_frto made it impossible for the non-SACKed FRTO to be entered if other than the head has been rexmitted. With SACK-enhanced FRTO (and Appendix B), however, there can be a number retransmissions in flight when RTO expires (same thing could happen before this patchset also with non-SACK FRTO). To not introduce any jumpiness into the packet counting during FRTO, instead of clearing RETRANS bits from skbs during entry, do it later on. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Entry is allowed only during (New)Reno like recovery	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This interpretation comes from RFC4138: "If the sender implements some loss recovery algorithm other than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD NOT be entered when earlier fast recovery is underway." I think the RFC means to say (especially in the light of Appendix B) that ...recovery is underway (not just fast recovery) or was underway when it was interrupted by an earlier (F-)RTO that hasn't yet been resolved (snd_una has not advanced enough). Thus, my interpretation is that whenever TCP has ever retransmitted other than head, basic version cannot be used because then the order assumptions which are used as FRTO basis do not hold. NewReno has only the head segment retransmitted at a time. Therefore, walk up to the segment that has not been SACKed, if that segment is not retransmitted nor anything before it, we know for sure, that nothing after the non-SACKed segment should be either. This assumption is valid because TCPCB_EVER_RETRANS does not leave holes but each non-SACKed segment is rexmitted in-order. Check for retrans_out > 1 avoids more expensive walk through the skb list, as we can know the result beforehand: F-RTO will not be allowed. SACKed skb can turn into non-SACked only in the extremely rare case of SACK reneging, in this case we might fail to detect retransmissions if there were them for any other than head. To get rid of that feature, whole rexmit queue would have to be walked (always) or FRTO should be prevented when SACK reneging happens. Of course RTO should still trigger after reneging which makes this issue even less likely to show up. And as long as the response is as conservative as it's now, nothing bad happens even then. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Prevent unrelated cwnd adjustment while using FRTO	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \|	FRTO controls cwnd when it still processes the ACK input or it has just reverted back to conventional RTO recovery; the normal rules apply when FRTO has reverted to standard congestion control. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: frto_counter modulo-op converted to two assignments	Ilpo Järvinen	2007-04-26
\| \| \| \| \|	Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Don't enter to fast recovery while using FRTO	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because TCP is not in Loss state during FRTO recovery, fast recovery could be triggered by accident. Non-SACK FRTO is more robust than not yet included SACK-enhanced version (that can receiver high number of duplicate ACKs with SACK blocks during FRTO), at least with unidirectional transfers, but under extraordinary patterns fast recovery can be incorrectly triggered, e.g., Data loss+ACK losses => cumulative ACK with enough SACK blocks to meet sacked_out >= dupthresh condition). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Response should reset also snd_cwnd_cnt	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \|	Since purpose is to reduce CWND, we prevent immediate growth. This is not a major issue nor is "the correct way" specified anywhere. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: fixes fallback to conventional recovery	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The FRTO detection did not care how ACK pattern affects to cwnd calculation of the conventional recovery. This caused incorrect setting of cwnd when the fallback becames necessary. The knowledge tcp_process_frto() has about the incoming ACK is now passed on to tcp_enter_frto_loss() in allowed_segments parameter that gives the number of segments that must be added to packets-in-flight while calculating the new cwnd. Instead of snd_una we use FLAG_DATA_ACKED in duplicate ACK detection because RFC4138 states (in Section 2.2): If the first acknowledgment after the RTO retransmission does not acknowledge all of the data that was retransmitted in step 1, the TCP sender reverts to the conventional RTO recovery. Otherwise, a malicious receiver acknowledging partial segments could cause the sender to declare the timeout spurious in a case where data was lost. If the next ACK after RTO is duplicate, we do not retransmit anything, which is equal to what conservative conventional recovery does in such case. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Ignore some uninteresting ACKs	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \|	Handles RFC4138 shortcoming (in step 2); it should also have case c) which ignores ACKs that are not duplicates nor advance window (opposite dir data, winupdate). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Use Disorder state during operation instead of Open	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Retransmission counter assumptions are to be changed. Forcing reason to do this exist: Using sysctl in check would be racy as soon as FRTO starts to ignore some ACKs (doing that in the following patches). Userspace may disable it at any moment giving nice oops if timing is right. frto_counter would be inaccessible from userspace, but with SACK enhanced FRTO retrans_out can include other than head, and possibly leaving it non-zero after spurious RTO, boom again. Luckily, solution seems rather simple: never go directly to Open state but use Disorder instead. This does not really change much, since TCP could anyway change its state to Disorder during FRTO using path tcp_fastretrans_alert -> tcp_try_to_open (e.g., when a SACK block makes ACK dubious). Besides, Disorder seems to be the state where TCP should be if not recovering (in Recovery or Loss state) while having some retransmissions in-flight (see tcp_try_to_open), which is exactly what happens with FRTO. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Consecutive RTOs keep prior_ssthresh and ssthresh	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In case a latency spike causes more than one RTO, the later should not cause the already reduced ssthresh to propagate into the prior_ssthresh since FRTO declares all such RTOs spurious at once or none of them. In treating of ssthresh, we mimic what tcp_enter_loss() does. The previous state (in frto_counter) must be available until we have checked it in tcp_enter_frto(), and also ACK information flag in process_frto(). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Comment cleanup & improvement	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \| \| \| \| \|	Moved comments out from the body of process_frto() to the head (preferred way; see Documentation/CodingStyle). Bonus: it's much easier to read in this compacted form. FRTO algorithm and implementation is described in greater detail. For interested reader, more information is available in RFC4138. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \|	In addition, removed inline. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Separated response from FRTO detection algorithm	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \|	FRTO spurious RTO detection algorithm (RFC4138) does not include response to a detected spurious RTO but can use different response algorithms. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Incorrectly clears TCPCB_EVER_RETRANS bit	Ilpo Järvinen	2007-04-26
\| \| \| \| \| \| \| \|	FRTO was slightly too brave... Should only clear TCPCB_SACKED_RETRANS bit. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETLINK]: Infinite recursion in netlink.	Alexey Kuznetsov	2007-04-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reply to NETLINK_FIB_LOOKUP messages were misrouted back to kernel, which resulted in infinite recursion and stack overflow. The bug is present in all kernel versions since the feature appeared. The patch also makes some minimal cleanup: 1. Return something consistent (-ENOENT) when fib table is missing 2. Do not crash when queue is empty (does not happen, but yet) 3. Put result of lookup Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
*	IPv6: fix Routing Header Type 0 handling thinko	YOSHIFUJI Hideaki	2007-04-24
\| \| \| \| \| \| \| \|	Oops, thinko. The test for accempting a RH0 was exatly the wrong way around. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	[IPV6]: Disallow RH0 by default.	YOSHIFUJI Hideaki	2007-04-24
\| \| \| \| \| \| \| \| \|	A security issue is emerging. Disallow Routing Header Type 0 by default as we have been doing for IPv4. Note: We allow RH2 by default because it is harmless. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[XFRM]: beet: fix pseudo header length value	Patrick McHardy	2007-04-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	draft-nikander-esp-beet-mode-07.txt is not entirely clear on how the length value of the pseudo header should be calculated, it states "The Header Length field contains the length of the pseudo header, IPv4 options, and padding in 8 octets units.", but also states "Length in octets (Header Len + 1) * 8". draft-nikander-esp-beet-mode-08-pre1.txt [1] clarifies this, the header length should not include the first 8 byte. This change affects backwards compatibility, but option encapsulation didn't work until very recently anyway. [1] http://users.piuha.net/jmelen/BEET/draft-nikander-esp-beet-mode-08-pre1.txt Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Congestion control initialization.	Stephen Hemminger	2007-04-24
\| \| \| \| \| \| \| \| \| \| \| \| \|	Change to defer congestion control initialization. If setsockopt() was used to change TCP_CONGESTION before connection is established, then protocols that use sequence numbers to keep track of one RTT interval (vegas, illinois, ...) get confused. Change the init hook to be called after handshake. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	RPC: Fix the TCP resend semantics for NFSv4	Trond Myklebust	2007-04-21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix a regression due to the patch "NFS: disconnect before retrying NFSv4 requests over TCP" The assumption made in xprt_transmit() that the condition "req->rq_bytes_sent == 0 and request is on the receive list" should imply that we're dealing with a retransmission is false. Firstly, it may simply happen that the socket send queue was full at the time the request was initially sent through xprt_transmit(). Secondly, doing this for each request that was retransmitted implies that we disconnect and reconnect for _every_ request that happened to be retransmitted irrespective of whether or not a disconnection has already occurred. Fix is to move this logic into the call_status request timeout handler. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	[NETLINK]: Don't attach callback to a going-away netlink socket	Denis Lunev	2007-04-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a race between netlink_dump_start() and netlink_release() that can lead to the situation when a netlink socket with non-zero callback is freed. Here it is: CPU1: CPU2 netlink_release(): netlink_dump_start(): sk = netlink_lookup(); /* OK / netlink_remove(); spin_lock(&nlk->cb_lock); if (nlk->cb) { / false / ... } spin_unlock(&nlk->cb_lock); spin_lock(&nlk->cb_lock); if (nlk->cb) { / false / ... } nlk->cb = cb; spin_unlock(&nlk->cb_lock); ... sock_orphan(sk); / * proceed with releasing * the socket */ The proposal it to make sock_orphan before detaching the callback in netlink_release() and to check for the sock to be SOCK_DEAD in netlink_dump_start() before setting a new callback. Signed-off-by: Denis Lunev <den@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IrDA]: Correctly handling socket error	Olaf Kirch	2007-04-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes an oops first reported in mid 2006 - see http://lkml.org/lkml/2006/8/29/358 The cause of this bug report is that when an error is signalled on the socket, irda_recvmsg_stream returns without removing a local wait_queue variable from the socket's sk_sleep queue. This causes havoc further down the road. In response to this problem, a patch was made that invoked sock_orphan on the socket when receiving a disconnect indication. This is not a good fix, as this sets sk_sleep to NULL, causing applications sleeping in recvmsg (and other places) to oops. This is against the latest net-2.6 and should be considered for -stable inclusion. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Samuel Ortiz <samuel@sortiz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[SCTP]: Do not interleave non-fragments when in partial delivery	Vlad Yasevich	2007-04-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The way partial delivery is currently implemnted, it is possible to intereleave a message (either from another steram, or unordered) that is not part of partial delivery process. The only way to this is for a message to not be a fragment and be 'in order' or unorderd for a given stream. This will result in bypassing the reassembly/ordering queues where things live duing partial delivery, and the message will be delivered to the socket in the middle of partial delivery. This is a two-fold problem, in that: 1. the app now must check the stream-id and flags which it may not be doing. 2. this clearing partial delivery state from the association and results in ulp hanging. This patch is a band-aid over a much bigger problem in that we don't do stream interleave. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPSEC] af_key: Fix thinko in pfkey_xfrm_policy2msg()	David S. Miller	2007-04-18
\| \| \| \| \| \| \| \| \|	Make sure to actually assign the determined mode to rq->sadb_x_ipsecrequest_mode. Noticed by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	Linus Torvalds	2007-04-17
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [BRIDGE]: Unaligned access when comparing ethernet addresses [SCTP]: Unmap v4mapped addresses during SCTP_BINDX_REM_ADDR operation. [SCTP]: Fix assertion (!atomic_read(&sk->sk_rmem_alloc)) failed message [NET]: Set a separate lockdep class for neighbour table's proxy_queue [NET]: Fix UDP checksum issue in net poll mode. [KEY]: Fix conversion between IPSEC_MODE_xxx and XFRM_MODE_xxx. [NET]: Get rid of alloc_skb_from_cache
\| *	[BRIDGE]: Unaligned access when comparing ethernet addresses	Evgeny Kravtsunov	2007-04-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	compare_ether_addr() implicitly requires that the addresses passed are 2-bytes aligned in memory. This is not true for br_stp_change_bridge_id() and br_stp_recalculate_bridge_id() in which one of the addresses is unsigned char *, and thus may not be 2-bytes aligned. Signed-off-by: Evgeny Kravtsunov <emkravts@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Pavel Emelianov <xemul@openvz.org>