author	Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>	2008-11-25 00:20:15 -0500
committer	David S. Miller <davem@davemloft.net>	2008-11-25 00:20:15 -0500
commit	832d11c5cd076abc0aa1eaf7be96c81d1a59ce41
tree	95b22ad16d1ff414cab39578ed8c927c2ce08723	/net/core/skbuff.c
parent	f58b22fd3c16444edc393a217a74208f1894b601
tcp: Try to restore large SKBs while SACK processing
During SACK processing, most of the benefits of TSO are eaten by the SACK
blocks that one-by-one fragment SKBs to MSS sized chunks. Then we're in
trouble when cleanup work has to be done for them once a large cumulative
ACK arrives. Try to return to the pre-split state already while more and
more SACK info gets discovered, by combining a newly discovered SACK area
with the previous skb if that one is SACKed as well.

This approach has a number of benefits:

1) The processing overhead is spread more equally over the RTT
2) The write queue has fewer skbs to process (this affects everything
   which has to walk the queue past the sacked areas)
3) The write queue is consistent the whole time, so no other part of TCP
   has to be aware of this (this was not the case with some other approach
   that was, well, quite intrusive all around).
4) clean_rtx_queue can release most of the pages with a single put_page
   instead of the previous PAGE_SIZE/mss+1 calls

In case a hole is fully filled by the new SACK block, we attempt to
combine the next skb too, which allows construction of skbs that are even
larger than what TSO split them to, and it handles the hole-on-every-nth-
segment patterns that often occur during slow start overshoot pretty
nicely. Though for this to be really useful a retransmission would also
have to get lost, since cumulative ACKs advance one hole at a time in the
most typical case.

TODO: handle upwards-only merging. That should be rather easy when a
segment is fully sacked, but I'm leaving it as a future work item (it
won't make a very large difference anyway, since this current approach
already covers quite a lot of the normal cases).

I was earlier thinking of some sophisticated way of tracking the
timestamps of the first and the last segment, but later realized that it
won't be necessary at all to store the timestamp of the last segment. The
cases that can occur are basically either:

1) ambiguous => no sensible measurement can be taken anyway
2) non-ambiguous due to reordering => having the timestamp of the last
   segment there just skews things further off than it does good, since
   the ACK got triggered by one of the holes (besides some subtle issues
   that would make determining the right hole/skb an even harder problem).
   Anyway, it has nothing to do with this change then.

I chose to route some abnormal-looking cases through goto noop; some of
them could be handled differently (e.g., by stopping the walk at that
skb). In general, they either shouldn't happen at all or are rare enough
to make no difference in practice.

In theory this change (as a whole) could cause some macroscale (global)
regression because of cache misses that are taken over the round-trip
time, but it very likely comes out ahead because of far fewer (local)
cache misses for the other write queue walkers and for the big
recovery-clearing cumulative ACK.

Worth noting that these benefits would be very easy to get also without
TSO/GSO being on, as long as the data is in pages so that we can merge
them. Currently I won't let that happen, because a DSACK split at a
fragment would mess up the pcounts due to sk_can_gso in
tcp_set_skb_tso_segs. Once DSACK fragmentation gets avoided, some of these
conditions can be made less strict.

TODO: I will probably have to convert the excessive pointer passing to
struct sacktag_state... :-)

My testing revealed that a considerable amount of skbs couldn't be shifted
because they were cloned (most likely still awaiting tx reclaim)...

[The rest is considered future work instead, since I got repeatable
EFAULTs from tcpdump's recvfrom when I added pskb_expand_head to deal with
clones, so I separated that into another, later patch]

...To counter that, I gave up on the fifth advantage:

5) When growing the previous SACK block, fewer allocs for new skbs are
   done; basically a new alloc is needed only when a new hole is detected
   and when the previous skb runs out of frag space

...which now only happens if reclaim is fast enough to dispose of the
clone before the SACK block comes in (the window is RTT long), otherwise
we'll have to alloc some.

With clones being handled I got these numbers (they will be somewhat worse
without that), taken with fine-grained mibs:

	TCPSackShifted                   398
	TCPSackMerged                    877
	TCPSackShiftFallback             320
	TCPSACKCOLLAPSEFALLBACKGSO         0
	TCPSACKCOLLAPSEFALLBACKSKBBITS     0
	TCPSACKCOLLAPSEFALLBACKSKBDATA     0
	TCPSACKCOLLAPSEFALLBACKBELOW       0
	TCPSACKCOLLAPSEFALLBACKFIRST       1
	TCPSACKCOLLAPSEFALLBACKPREVBITS  318
	TCPSACKCOLLAPSEFALLBACKMSS         1
	TCPSACKCOLLAPSEFALLBACKNOHEAD      0
	TCPSACKCOLLAPSEFALLBACKSHIFT       0
	TCPSACKCOLLAPSENOOPSEQ             0
	TCPSACKCOLLAPSENOOPSMALLPCOUNT     0
	TCPSACKCOLLAPSENOOPSMALLLEN        0
	TCPSACKCOLLAPSEHOLE               12

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
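For readers outside the TCP code, here is a minimal, purely illustrative sketch of how a SACK-processing caller might use the new helper. It is not part of this patch; the function name try_shift_into_prev and the direct use of sk_write_queue/__skb_unlink are hypothetical, and the real caller on the TCP side (which also updates socket write-memory accounting and tcp_skb_pcount bookkeeping) is not shown in this skbuff.c-limited view.

	#include <linux/skbuff.h>
	#include <net/sock.h>

	/* Hypothetical caller sketch: 'prev' is an already-SACKed skb that
	 * precedes 'skb' in the write queue, and the first 'len' bytes of
	 * 'skb' have just been found SACKed.  Caller guarantees that
	 * len <= skb->len (skb_shift() BUG()s otherwise) and that skb
	 * carries only paged data (no linear head).
	 */
	static bool try_shift_into_prev(struct sock *sk, struct sk_buff *prev,
					struct sk_buff *skb, unsigned int len)
	{
		/* Returns 0 on any fallback, e.g. cloned skbs or prev
		 * running out of frag slots.
		 */
		if (!skb_shift(prev, skb, len))
			return false;

		/* skb_shift() has already adjusted len/data_len/truesize on
		 * both skbs; per its kerneldoc it is the caller's job to
		 * free skb once everything has been shifted out of it.
		 */
		if (!skb->len) {
			__skb_unlink(skb, &sk->sk_write_queue);
			__kfree_skb(skb);
		}
		return true;
	}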
Diffstat (limited to 'net/core/skbuff.c')
-rw-r--r--	net/core/skbuff.c	140
1 file changed, 140 insertions(+), 0 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 267185a848f6..844b8abeb18c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2018,6 +2018,146 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
 	skb_split_no_header(skb, skb1, len, pos);
 }
 
+/* Shifting from/to a cloned skb is a no-go.
+ *
+ * TODO: handle cloned skbs by using pskb_expand_head()
+ */
+static int skb_prepare_for_shift(struct sk_buff *skb)
+{
+	return skb_cloned(skb);
+}
+
+/**
+ * skb_shift - Shifts paged data partially from skb to another
+ * @tgt: buffer into which tail data gets added
+ * @skb: buffer from which the paged data comes
+ * @shiftlen: shift up to this many bytes
+ *
+ * Attempts to shift up to shiftlen worth of bytes, which may be less than
+ * the length of the skb, from @skb to @tgt. Returns the number of bytes
+ * shifted. It's up to the caller to free @skb if everything was shifted.
+ *
+ * If @tgt runs out of frags, the whole operation is aborted.
+ *
+ * @skb cannot include anything else but paged data, while @tgt is allowed
+ * to have non-paged data as well.
+ *
+ * TODO: full sized shift could be optimized but that would need
+ * specialized skb free'er to handle frags without up-to-date nr_frags.
+ */
+int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
+{
+	int from, to, merge, todo;
+	struct skb_frag_struct *fragfrom, *fragto;
+
+	BUG_ON(shiftlen > skb->len);
+	BUG_ON(skb_headlen(skb));	/* Would corrupt stream */
+
+	todo = shiftlen;
+	from = 0;
+	to = skb_shinfo(tgt)->nr_frags;
+	fragfrom = &skb_shinfo(skb)->frags[from];
+
+	/* Actual merge is delayed until the point when we know we can
+	 * commit all, so that we don't have to undo partial changes
+	 */
+	if (!to ||
+	    !skb_can_coalesce(tgt, to, fragfrom->page, fragfrom->page_offset)) {
+		merge = -1;
+	} else {
+		merge = to - 1;
+
+		todo -= fragfrom->size;
+		if (todo < 0) {
+			if (skb_prepare_for_shift(skb) ||
+			    skb_prepare_for_shift(tgt))
+				return 0;
+
+			fragto = &skb_shinfo(tgt)->frags[merge];
+
+			fragto->size += shiftlen;
+			fragfrom->size -= shiftlen;
+			fragfrom->page_offset += shiftlen;
+
+			goto onlymerged;
+		}
+
+		from++;
+	}
+
+	/* Skip full, not-fitting skb to avoid expensive operations */
+	if ((shiftlen == skb->len) &&
+	    (skb_shinfo(skb)->nr_frags - from) > (MAX_SKB_FRAGS - to))
+		return 0;
+
+	if (skb_prepare_for_shift(skb) || skb_prepare_for_shift(tgt))
+		return 0;
+
+	while ((todo > 0) && (from < skb_shinfo(skb)->nr_frags)) {
+		if (to == MAX_SKB_FRAGS)
+			return 0;
+
+		fragfrom = &skb_shinfo(skb)->frags[from];
+		fragto = &skb_shinfo(tgt)->frags[to];
+
+		if (todo >= fragfrom->size) {
+			*fragto = *fragfrom;
+			todo -= fragfrom->size;
+			from++;
+			to++;
+
+		} else {
+			get_page(fragfrom->page);
+			fragto->page = fragfrom->page;
+			fragto->page_offset = fragfrom->page_offset;
+			fragto->size = todo;
+
+			fragfrom->page_offset += todo;
+			fragfrom->size -= todo;
+			todo = 0;
+
+			to++;
+			break;
+		}
+	}
+
+	/* Ready to "commit" this state change to tgt */
+	skb_shinfo(tgt)->nr_frags = to;
+
+	if (merge >= 0) {
+		fragfrom = &skb_shinfo(skb)->frags[0];
+		fragto = &skb_shinfo(tgt)->frags[merge];
+
+		fragto->size += fragfrom->size;
+		put_page(fragfrom->page);
+	}
+
+	/* Reposition in the original skb */
+	to = 0;
+	while (from < skb_shinfo(skb)->nr_frags)
+		skb_shinfo(skb)->frags[to++] = skb_shinfo(skb)->frags[from++];
+	skb_shinfo(skb)->nr_frags = to;
+
+	BUG_ON(todo > 0 && !skb_shinfo(skb)->nr_frags);
+
+onlymerged:
+	/* Most likely the tgt won't ever need its checksum anymore, skb on
+	 * the other hand might need it if it needs to be resent
+	 */
+	tgt->ip_summed = CHECKSUM_PARTIAL;
+	skb->ip_summed = CHECKSUM_PARTIAL;
+
+	/* Yak, is it really working this way? Some helper please? */
+	skb->len -= shiftlen;
+	skb->data_len -= shiftlen;
+	skb->truesize -= shiftlen;
+	tgt->len += shiftlen;
+	tgt->data_len += shiftlen;
+	tgt->truesize += shiftlen;
+
+	return shiftlen;
+}
+
 /**
  * skb_prepare_seq_read - Prepare a sequential read of skb data
  * @skb: the buffer to read
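For reference, the merge decision at the top of skb_shift() hinges on skb_can_coalesce(), which only allows folding the first shifted chunk into @tgt's last fragment when the new data starts on the same page, immediately where that fragment ends. The sketch below is a paraphrase of that helper as it appears in include/linux/skbuff.h around this kernel version, reproduced purely as a reader aid and not part of this patch; the exact signature may differ slightly.

	/* Paraphrase of skb_can_coalesce(); reader aid only. */
	static inline int skb_can_coalesce(struct sk_buff *skb, int i,
					   struct page *page, int off)
	{
		if (i) {
			struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];

			/* Mergeable only if the new chunk begins on the same
			 * page, directly after the last existing fragment.
			 */
			return page == frag->page &&
			       off == frag->page_offset + frag->size;
		}
		return 0;
	}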