litmus-rt.git/net/dccp, branch v2015.1

inet: fix possible panic in reqsk_queue_unlink()

2015-04-24T15:39:15+00:00

[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[ 3897.931025] IP: [] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
timer"), the listener spinlock was held and this race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in the non-fastopen case may or may not
consume the req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
to handle this properly.

(Same remark for dccp_check_req() and its callers)
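
The fix described above can be sketched in user-space C. This is a minimal
illustration, not the kernel code: struct req, struct queue, queue_unlink(),
queue_drop() and req_put() are hypothetical stand-ins, the real functions take
locks that are omitted here, and the two racing callers are modelled as two
sequential calls. The point shown is that unlink reports whether it actually
removed the entry, and the queue's reference is released only in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request sock; not the kernel structure. */
struct req {
	struct req *next;
	int refcnt;
	bool freed;		/* set instead of really freeing memory */
};

/* Hypothetical stand-in for the listener's request queue. */
struct queue {
	struct req *head;
};

/* Release one reference; mark the object freed when the last one goes. */
static void req_put(struct req *r)
{
	assert(r->refcnt > 0);
	if (--r->refcnt == 0)
		r->freed = true;
}

/* Link the request; the queue takes its own reference. */
static void queue_add(struct queue *q, struct req *r)
{
	r->next = q->head;
	q->head = r;
	r->refcnt++;
}

/* Do not assume the request is still linked: report what happened. */
static bool queue_unlink(struct queue *q, struct req *r)
{
	struct req **pp;

	for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == r) {
			*pp = r->next;
			r->next = NULL;
			return true;
		}
	}
	return false;
}

/* Drop the queue's reference only if this caller did the unlink. */
static bool queue_drop(struct queue *q, struct req *r)
{
	bool found = queue_unlink(q, r);

	if (found)
		req_put(r);
	return found;
}
```

In this sketch, a second queue_drop() on the same request returns false and
leaves the refcount untouched, so the loser of the race neither dereferences
a missing entry nor releases a reference it does not own.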

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet 
Reported-by: Yuchung Cheng 
Signed-off-by: David S. Miller

tcp/dccp: get rid of central timewait timer

2015-04-13T20:40:05+00:00

Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale; the code is ugly and a source of huge latencies
(typically 30 ms have been seen, with cpus spinning on the death_lock spinlock).

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

In the following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24).

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While the test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
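
As a quick check of the "about 90%" figure, the gain can be recomputed from
the four super_netperf results quoted above. The sketch below assumes the two
runs on each side are simply averaged; mean2() and percent_gain() are
hypothetical helper names introduced for illustration.

```c
#include <assert.h>

/* Average of two runs. */
static double mean2(double a, double b)
{
	return (a + b) / 2.0;
}

/* Relative gain of 'after' over 'before', in percent. */
static double percent_gain(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Calling percent_gain(mean2(419594, 437171), mean2(810442, 800992)) gives
roughly 88%, in line with the rounded figure in the message.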

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

inet: fix double request socket freeing

2015-03-24T01:40:48+00:00

Erik Hugne reported the following error:

I'm hitting this warning on latest net-next when i try to SSH into a machine
with eth0 added to a bridge (but i think the problem is older than that)

Steps to reproduce:
node2 ~ # brctl addif br0 eth0
[  223.758785] device eth0 entered promiscuous mode
node2 ~ # ip link set br0 up
[  244.503614] br0: port 1(eth0) entered forwarding state
[  244.505108] br0: port 1(eth0) entered forwarding state
node2 ~ # [  251.160159] ------------[ cut here ]------------
[  251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
[  251.162077] Modules linked in:
[  251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
[  251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  251.164078]  ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
[  251.165084]  0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
[  251.166195]  ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
[  251.167203] Call Trace:
[  251.167533]  [] dump_stack+0x45/0x57
[  251.168206]  [] warn_slowpath_common+0x85/0xc0
[  251.169239]  [] warn_slowpath_null+0x15/0x20
[  251.170271]  [] tcp_v4_err+0x6b1/0x720
[  251.171408]  [] ? _raw_read_lock_irq+0x3/0x10
[  251.172589]  [] ? inet_del_offload+0x40/0x40
[  251.173366]  [] icmp_socket_deliver+0x65/0xb0
[  251.174134]  [] icmp_unreach+0xc2/0x280
[  251.174820]  [] icmp_rcv+0x2bd/0x3a0
[  251.175473]  [] ip_local_deliver_finish+0x82/0x1e0
[  251.176282]  [] ip_local_deliver+0x88/0x90
[  251.177004]  [] ip_rcv_finish+0xf0/0x310
[  251.177693]  [] ip_rcv+0x2dc/0x390
[  251.178336]  [] __netif_receive_skb_core+0x713/0xa20
[  251.179170]  [] __netif_receive_skb+0x1a/0x80
[  251.179922]  [] process_backlog+0x94/0x120
[  251.180639]  [] net_rx_action+0x1e2/0x310
[  251.181356]  [] __do_softirq+0xa7/0x290
[  251.182046]  [] run_ksoftirqd+0x19/0x30
[  251.182726]  [] smpboot_thread_fn+0x153/0x1d0
[  251.183485]  [] ? SyS_setgroups+0x130/0x130
[  251.184228]  [] kthread+0xee/0x110
[  251.184871]  [] ? kthread_create_on_node+0x1b0/0x1b0
[  251.185690]  [] ret_from_fork+0x58/0x90
[  251.186385]  [] ? kthread_create_on_node+0x1b0/0x1b0
[  251.187216] ---[ end trace c947fc7b24e42ea1 ]---
[  259.542268] br0: port 1(eth0) entered forwarding state

Remove the double calls to reqsk_put().

[edumazet] :

I got confused because reqsk_timer_handler() _has_ to call
reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
the timer handler holds a reference on req.

Signed-off-by: Fan Du 
Signed-off-by: Eric Dumazet 
Reported-by: Erik Hugne 
Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: David S. Miller

ipv6: dccp: handle ICMP messages on DCCP_NEW_SYN_RECV request sockets

2015-03-23T20:52:26+00:00

dccp_v6_err() can restrict lookups to the ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

ipv4: dccp: handle ICMP messages on DCCP_NEW_SYN_RECV request sockets

2015-03-23T20:52:26+00:00

dccp_v4_err() can restrict lookups to the ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

The new dccp_req_err() helper is exported so that we can use it in IPv6
in the following patch.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

inet: remove sk_listener parameter from syn_ack_timeout()

2015-03-23T20:52:25+00:00

It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.
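
A minimal user-space sketch of the resulting shape, under stated assumptions:
struct listener, struct request and the synack_timeouts counter are
hypothetical stand-ins, not the kernel types. It shows why the extra parameter
is redundant (the listener is reachable through the request) and why the
request pointer can be const even though the listener is updated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical listener with a counter the timeout path bumps. */
struct listener {
	unsigned long synack_timeouts;
};

/* Hypothetical request that records which listener it belongs to. */
struct request {
	struct listener *listener;
};

/*
 * Only the request is passed in. The request itself is not modified,
 * so the pointer can be const; the listener it points to is not const
 * and can still be updated through it.
 */
static void syn_ack_timeout(const struct request *req)
{
	assert(req != NULL && req->listener != NULL);
	req->listener->synack_timeouts++;
}
```

The const qualifier applies to the request object only, so the compiler still
accepts the write to the listener reached through req->listener.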

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

inet: get rid of central tcp/dccp listener timer

2015-03-20T16:40:25+00:00

One of the major issues for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awfully long times with the socket lock held,
meaning that other cpus needing this lock have to spin for hundreds of ms.

SYNACKs are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With the following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more than what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

inet: drop prev pointer handling in request sock

2015-03-20T16:40:25+00:00

When request socks are put in the ehash table, the whole notion
of having a previous request to update dl_next is pointless.

Also, the following patch will get rid of the big purge timer,
so we want to delete a request sock without holding the listener lock.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

inet: request sock should init IPv6/IPv4 addresses

2015-03-19T02:00:35+00:00

In order to be able to use sk_ehashfn() for request socks,
we need to initialize their IPv6/IPv4 addresses.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

ipv6: get rid of __inet6_hash()

2015-03-19T02:00:35+00:00

We can now use inet_hash() and __inet_hash() instead of private
functions.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller