litmus-rt.git/net/sched, branch test

net_sched: invoke ->attach() after setting dev->qdisc

2015-05-27T18:09:55+00:00

For the mq qdisc, we add the per-tx-queue qdiscs to the root qdisc
for display purposes. However, that happens too early, before the
new dev->qdisc is finally set. As a result, q->list points to the
old root qdisc, which is freed right before the new one is
assigned.

Fix this by moving ->attach() after setting dev->qdisc.
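
The ordering problem can be illustrated with a small user-space
model. This is only a sketch: the toy_* types and functions are
hypothetical stand-ins, not the kernel's structures or the actual
qdisc_graft() code.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_qdisc {
    struct toy_qdisc *listed_on;  /* root whose list this qdisc joined */
};

struct toy_dev {
    struct toy_qdisc *qdisc;      /* current root qdisc of the device */
};

/* Models ->attach(): link a child to whatever root is current. */
static void toy_attach(struct toy_dev *dev, struct toy_qdisc *child)
{
    child->listed_on = dev->qdisc;
}

/* Buggy ordering: attach runs while dev->qdisc is still the old root. */
static void toy_graft_buggy(struct toy_dev *dev, struct toy_qdisc *new_root,
                            struct toy_qdisc *child)
{
    toy_attach(dev, child);
    dev->qdisc = new_root;
}

/* Fixed ordering: publish the new root first, then attach. */
static void toy_graft_fixed(struct toy_dev *dev, struct toy_qdisc *new_root,
                            struct toy_qdisc *child)
{
    dev->qdisc = new_root;
    toy_attach(dev, child);
}
```

In the buggy ordering the child ends up linked to the old root,
which is about to be freed. In the fixed ordering it is linked to
the new one.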

For the record, this fixes the following crash:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
 list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
 CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
  ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
  ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
 Call Trace:
  [] dump_stack+0x4c/0x65
  [] warn_slowpath_common+0x9c/0xb6
  [] ? __list_del_entry+0x5a/0x98
  [] warn_slowpath_fmt+0x46/0x48
  [] ? dev_graft_qdisc+0x5e/0x6a
  [] __list_del_entry+0x5a/0x98
  [] list_del+0xe/0x2d
  [] qdisc_list_del+0x1e/0x20
  [] qdisc_destroy+0x30/0xd6
  [] qdisc_graft+0x11d/0x243
  [] tc_get_qdisc+0x1a6/0x1d4
  [] ? mark_lock+0x2e/0x226
  [] rtnetlink_rcv_msg+0x181/0x194
  [] ? rtnl_lock+0x17/0x19
  [] ? rtnl_lock+0x17/0x19
  [] ? __rtnl_unlock+0x17/0x17
  [] netlink_rcv_skb+0x4d/0x93
  [] rtnetlink_rcv+0x26/0x2d
  [] netlink_unicast+0xcb/0x150
  [] ? might_fault+0x59/0xa9
  [] netlink_sendmsg+0x4fa/0x51c
  [] sock_sendmsg_nosec+0x12/0x1d
  [] sock_sendmsg+0x29/0x2e
  [] ___sys_sendmsg+0x1b4/0x23a
  [] ? native_sched_clock+0x35/0x37
  [] ? sched_clock_local+0x12/0x72
  [] ? sched_clock_cpu+0x9e/0xb7
  [] ? current_kernel_time+0xe/0x32
  [] ? lock_release_holdtime.part.29+0x71/0x7f
  [] ? read_seqcount_begin.constprop.27+0x5f/0x76
  [] ? trace_hardirqs_on_caller+0x17d/0x199
  [] ? __fget_light+0x50/0x78
  [] __sys_sendmsg+0x42/0x60
  [] SyS_sendmsg+0x12/0x1c
  [] system_call_fastpath+0x12/0x6f
 ---[ end trace ef29d3fb28e97ae7 ]---

In the long term, we probably need to clean up the qdisc_graft()
code in case it hides other bugs like this.

Fixes: 95dc19299f74 ("pkt_sched: give visibility to mq slave qdiscs")
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: sched: fix call_rcu() race on classifier module unloads

2015-05-21T22:48:18+00:00

Vijay reported that a loop as simple as ...

  while true; do
    tc qdisc add dev foo root handle 1: prio
    tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
    tc qdisc del dev foo root
    rmmod cls_u32
  done

... will panic the kernel. Moreover, he bisected the change
apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
locking to netlink_lookup() and seq walker").

Removing synchronize_net() from the netlink socket path that
triggers the qdisc removal seems to have uncovered an RCU and
module reference count race in the tc API. Since the RCU
conversion was done after e341694e3eb5 ("netlink: Convert
netlink_lookup() to use RCU protected hash table"), which
originally added the synchronize_net(), hitting the bug was
less likely until now (though not impossible):

When qdiscs that i) support attaching classifiers and
ii) have at least one of them attached get deleted, they
invoke tcf_destroy_chain(), and thus call into the ->destroy()
handler of a classifier module.

After the RCU conversion, all classifiers that have an internal
prio list unlink them and initiate freeing via call_rcu()
deferral.

Meanwhile, tcf_destroy() already releases the reference to the
tp->ops->owner module before the queued RCU callback handler
has been invoked.

A subsequent rmmod of the classifier module is then not prevented,
since all module references have already been dropped.

By the time the kernel invokes the RCU callback handler from
the module, that function address is invalid.

One way to fix it would be to add an rcu_barrier() to
unregister_tcf_proto_ops() to wait for all pending call_rcu()s
to complete.

synchronize_rcu() is not appropriate as under heavy RCU
callback load, registered call_rcu()s could be deferred
longer than a grace period. In case we don't have any pending
call_rcu()s, the barrier is allowed to return immediately.
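
The effect of the barrier can be modelled in user space. This is
only a sketch: every toy_* name is hypothetical, and draining a
queue here merely stands in for rcu_barrier() waiting until the
pending call_rcu() callbacks have run.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CBS 8

/* Hypothetical model of a classifier module and its callbacks. */
struct toy_module {
    bool loaded;
    int cbs_run_while_loaded;   /* safe invocations */
    int cbs_run_after_unload;   /* would jump into freed module text */
};

struct toy_cb_queue {
    struct toy_module *owner[TOY_MAX_CBS];
    size_t n;
};

/* Models call_rcu(): queue a callback that lives in module m. */
static bool toy_call_rcu(struct toy_cb_queue *q, struct toy_module *m)
{
    if (q->n == TOY_MAX_CBS)
        return false;
    q->owner[q->n++] = m;
    return true;
}

/* Runs everything queued so far and records whether it was safe. */
static void toy_run_pending_callbacks(struct toy_cb_queue *q)
{
    for (size_t i = 0; i < q->n; i++) {
        struct toy_module *m = q->owner[i];

        if (m->loaded)
            m->cbs_run_while_loaded++;
        else
            m->cbs_run_after_unload++;
    }
    q->n = 0;
}

/* Models rcu_barrier(): returns only once pending callbacks are done. */
static void toy_rcu_barrier(struct toy_cb_queue *q)
{
    toy_run_pending_callbacks(q);
}

/* Models unregistering the classifier ops followed by rmmod. */
static void toy_unregister_and_unload(struct toy_cb_queue *q,
                                      struct toy_module *m, bool use_barrier)
{
    if (use_barrier)
        toy_rcu_barrier(q);
    m->loaded = false;
}
```

With use_barrier set, no callback is left to run after the module
is gone. Without it, a later call to toy_run_pending_callbacks()
counts invocations against an unloaded module.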

Since we came here via unregister_tcf_proto_ops(), there
are no users of a given classifier anymore. Further nested
call_rcu()s pointing into the module space are not being
done anywhere.

Only cls_bpf_delete_prog() may schedule a work item, to
unlock pages eventually, but that is not in the range/context
of cls_bpf anymore.

Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
Reported-by: Vijay Subramanian 
Signed-off-by: Daniel Borkmann 
Cc: John Fastabend 
Cc: Eric Dumazet 
Cc: Thomas Graf 
Cc: Jamal Hadi Salim 
Cc: Alexei Starovoitov 
Tested-by: Vijay Subramanian 
Acked-by: Alexei Starovoitov 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

net_sched: gred: use correct backlog value in WRED mode

2015-05-11T17:26:26+00:00

In WRED mode, the backlog for a single virtual queue (VQ) should not be
used to determine queue behavior; instead the backlog is summed across
all VQs. This sum is currently used when calculating the average queue
lengths. It also needs to be used when determining if the queue's hard
limit has been reached, or when reporting each VQ's backlog via netlink.
q->backlog will only be used if the queue switches out of WRED mode.
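
A minimal sketch of the rule described above, assuming a fixed
number of VQs and summing in a loop for clarity. The toy_* names
are hypothetical and do not reproduce the gred data structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_VQS 3

/* Hypothetical model: one backlog counter per virtual queue. */
struct toy_gred {
    bool wred_mode;
    unsigned int vq_backlog[TOY_NUM_VQS];
};

/* Backlog that should drive decisions for virtual queue idx. */
static unsigned int toy_gred_backlog(const struct toy_gred *t, size_t idx)
{
    unsigned int sum = 0;

    if (!t->wred_mode)
        return t->vq_backlog[idx];      /* per-VQ backlog */

    for (size_t i = 0; i < TOY_NUM_VQS; i++)
        sum += t->vq_backlog[i];        /* WRED: sum across all VQs */
    return sum;
}

/* Hard-limit check using the mode-dependent backlog. */
static bool toy_hard_limit_reached(const struct toy_gred *t, size_t idx,
                                   unsigned int pkt_len, unsigned int limit)
{
    return toy_gred_backlog(t, idx) + pkt_len > limit;
}
```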

Signed-off-by: David Ward 
Signed-off-by: David S. Miller

net_sched: fix a use-after-free in tc_ctl_tfilter()

2015-05-09T20:14:04+00:00

When tcf_destroy() returns true, tp could be already destroyed,
we should not use tp->next after that.
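
The general pattern is to read the next pointer before the entry
may be freed. A minimal user-space sketch with hypothetical toy_*
names follows. It is not the tc_ctl_tfilter() code itself.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical singly linked chain entry. */
struct toy_tp {
    struct toy_tp *next;
    int nfilters;
};

/* Frees tp and returns true when it has no filters left. */
static bool toy_destroy(struct toy_tp *tp)
{
    if (tp->nfilters > 0)
        return false;
    free(tp);
    return true;
}

/* Removes empty entries from the chain; returns how many were removed. */
static int toy_prune(struct toy_tp **head)
{
    struct toy_tp **back = head;
    int removed = 0;

    while (*back) {
        struct toy_tp *tp = *back;
        struct toy_tp *next = tp->next;  /* read before tp may be freed */

        if (toy_destroy(tp)) {
            *back = next;                /* tp is gone, do not touch it */
            removed++;
        } else {
            back = &tp->next;
        }
    }
    return removed;
}
```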

In the long term, we should probably convert the tp list to list_head.

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
Acked-by: Jamal Hadi Salim 
Signed-off-by: David S. Miller

codel: fix maxpacket/mtu confusion

2015-05-04T02:17:40+00:00

In the presence of TSO/GSO/GRO packets, codel at low rates can be
quite useless. In the following example, not a single packet was ever
dropped, while the average delay in the codel queue is ~100 ms!

qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
 Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 13626b 3p requeues 0
  count 0 lastcount 0 ldelay 96.9ms drop_next 0us
  maxpacket 9084 ecn_mark 0 drop_overlimit 0

This comes from a confusion of what should be the minimal backlog. It is
pretty clear it is not 64KB or whatever max GSO packet ever reached the
qdisc.

The intent of codel was to use the MTU of the device.
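
The effect of the threshold choice can be sketched as follows. The
function name and parameters are hypothetical, and the 1500-byte
MTU and 64KB values are illustrative assumptions. The sketch only
models the "queue is fine, do not drop" test.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the queue is considered fine: either the
 * sojourn time is below target, or the backlog is no larger than
 * the minimal backlog we are willing to keep queued.
 */
static bool toy_queue_is_ok(uint32_t sojourn_us, uint32_t target_us,
                            uint32_t backlog_bytes,
                            uint32_t min_backlog_bytes)
{
    return sojourn_us < target_us || backlog_bytes <= min_backlog_bytes;
}
```

With a sojourn time of 96900 us, a target of 5000 us and a backlog
of 13626 bytes, a 64KB minimal backlog makes the queue look fine
forever, while a 1500-byte minimal backlog lets codel react.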

After the fix, we finally drop some packets, and rtt/cwnd of my single
TCP flow are meeting our expectations.

qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
 Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
 backlog 6056b 3p requeues 0
  count 1 lastcount 1 ldelay 36.3ms drop_next 0us
  maxpacket 10598 ecn_mark 0 drop_overlimit 0

Signed-off-by: Eric Dumazet 
Cc: Kathleen Nichols 
Cc: Dave Taht 
Cc: Van Jacobson 
Signed-off-by: David S. Miller

net: sched: act_connmark: don't zap skb->nfct

2015-04-29T18:56:40+00:00

This action is meant to be passive, i.e. we should not alter
skb->nfct: If nfct is present just leave it alone.

Compile tested only.

Cc: Jamal Hadi Salim 
Signed-off-by: Florian Westphal 
Acked-by: Pablo Neira Ayuso 
Signed-off-by: David S. Miller

act_mirred: Fix bogus header when redirecting from VLAN

2015-04-17T17:29:28+00:00

When you redirect a VLAN device to any device, you end up with
crap in af_packet on the xmit path because hard_header_len is
not equal to skb->mac_len.  So the redirected packet contains
four extra bytes at the start, which then get interpreted as
part of the MAC address.

This patch fixes this by only pushing skb->mac_len.  We also
need to fix ifb because it tries to undo the pushing done by
act_mirred.
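
The offset arithmetic can be sketched in user space. The toy_*
names are hypothetical, and the concrete lengths (a 14-byte
Ethernet header and an 18-byte hard_header_len on the VLAN device)
are assumptions chosen to reproduce the four-byte difference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the relevant skb offsets into one buffer. */
struct toy_skb {
    size_t mac_header;  /* where the MAC header really starts */
    size_t data_off;    /* current start of packet data */
    size_t mac_len;     /* length of the MAC header actually present */
};

/* Models skb_push(): expose len more bytes in front of the data. */
static bool toy_push(struct toy_skb *skb, size_t len)
{
    if (len > skb->data_off)
        return false;   /* not enough headroom */
    skb->data_off -= len;
    return true;
}
```

Pushing mac_len lands exactly on the MAC header. Pushing the larger
hard_header_len lands four bytes before it, so those bytes are
misread as the start of the header.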

Signed-off-by: Herbert Xu 
Acked-by: Alexei Starovoitov 
Signed-off-by: David S. Miller

bpf: fix bpf helpers to use skb->mac_header relative offsets

2015-04-16T18:08:49+00:00

As a short-term solution, let's fix the bpf helper functions to use
skb->mac_header relative offsets instead of skb->data, so that the
same eBPF programs work with cls_bpf and act_bpf on both the ingress
and egress qdisc paths. We need to ensure that mac_header is set
before calling into programs. This is effectively the first option
from the discussion referenced below.

A more long-term solution for LD_ABS|LD_IND instructions will be more
intrusive but also more beneficial than this, and will be implemented
later, as it's too risky at this point in time.

I.e., we plan to look into the option of moving skb_pull() out of
eth_type_trans() and into netif_receive_skb() as has been suggested
as second option. Meanwhile, this solution ensures ingress can be
used with eBPF, too, and that we won't run into ABI troubles later.
For dealing with negative offsets inside eBPF helper functions,
we've implemented bpf_skb_clone_unwritable() to test for unwritable
headers.
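
Why the base pointer matters can be sketched in user space. The
toy_* names and the concrete offsets are hypothetical. The model
assumes that on ingress the data pointer has already been pulled
past a 14-byte MAC header, while on egress it still points at it.

```c
#include <stddef.h>

#define TOY_BUF_LEN 64

/* Hypothetical packet model with two base offsets into one buffer. */
struct toy_pkt {
    unsigned char buf[TOY_BUF_LEN];
    size_t mac_header;  /* offset of the MAC header */
    size_t data;        /* offset of the current data pointer */
};

/* Fills the buffer with its own indices so loads are easy to check. */
static void toy_pkt_init(struct toy_pkt *p, size_t mac_header, size_t data)
{
    for (size_t i = 0; i < TOY_BUF_LEN; i++)
        p->buf[i] = (unsigned char)i;
    p->mac_header = mac_header;
    p->data = data;
}

/* Load relative to the data pointer: depends on the path taken. */
static unsigned char toy_load_data_rel(const struct toy_pkt *p, size_t off)
{
    return p->buf[p->data + off];
}

/* Load relative to the MAC header: same byte on both paths. */
static unsigned char toy_load_mac_rel(const struct toy_pkt *p, size_t off)
{
    return p->buf[p->mac_header + off];
}
```

The same offset yields the same byte on both paths only when it is
taken relative to the MAC header.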

Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
Signed-off-by: Alexei Starovoitov 
Signed-off-by: Daniel Borkmann 
Signed-off-by: David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

2015-04-14T19:44:14+00:00

The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.

Signed-off-by: David S. Miller

net: use jump label patching for ingress qdisc in __netif_receive_skb_core

2015-04-13T17:34:40+00:00

Even if we only make use of classifiers and actions on the egress
path, we still go into handle_ing() and execute additional
per-packet code for the ingress qdisc, just to realize that
nothing is attached on ingress.

Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.

Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when an ingress qdisc is being either
initialized or destroyed.
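
The enable/disable logic can be modelled in user space with a plain
counter. This is only a sketch with hypothetical toy_* names: a
real static key patches the branch in the code itself, which a
counter check does not reproduce.

```c
#include <stdbool.h>

/* Models the static key: number of ingress qdiscs currently attached. */
static int toy_ingress_needed;

/* Called when an ingress qdisc is initialized. */
static void toy_inc_ingress_queue(void)
{
    toy_ingress_needed++;
}

/* Called when an ingress qdisc is destroyed. */
static void toy_dec_ingress_queue(void)
{
    toy_ingress_needed--;
}

/*
 * Models the receive fast path: the ingress handling is skipped
 * entirely while no ingress qdisc exists anywhere. Returns true
 * when the ingress handler ran.
 */
static bool toy_receive(int *ingress_calls)
{
    if (toy_ingress_needed > 0) {
        (*ingress_calls)++;
        return true;
    }
    return false;
}
```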

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
Signed-off-by: David S. Miller