| Commit message (Collapse) | Author | Age |
... | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Introduce xmit_mode 'netif_receive' for pktgen which generates the
packets using familiar pktgen commands, but feeds them into
netif_receive_skb() instead of ndo_start_xmit().
Default mode is called 'start_xmit'.
It is designed to test netif_receive_skb and ingress qdisc
performace only. Make sure to understand how it works before
using it for other rx benchmarking.
Sample script 'pktgen.sh':
\#!/bin/bash
function pgset() {
local result
echo $1 > $PGDEV
result=`cat $PGDEV | fgrep "Result: OK:"`
if [ "$result" = "" ]; then
cat $PGDEV | fgrep Result:
fi
}
[ -z "$1" ] && echo "Usage: $0 DEV" && exit 1
ETH=$1
PGDEV=/proc/net/pktgen/kpktgend_0
pgset "rem_device_all"
pgset "add_device $ETH"
PGDEV=/proc/net/pktgen/$ETH
pgset "xmit_mode netif_receive"
pgset "pkt_size 60"
pgset "dst 198.18.0.1"
pgset "dst_mac 90:e2:ba:ff:ff:ff"
pgset "count 10000000"
pgset "burst 32"
PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"
cat /proc/net/pktgen/$ETH
Usage:
$ sudo ./pktgen.sh eth2
...
Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
43033682pps 20656Mb/sec (20656167360bps) errors: 10000000
Raw netif_receive_skb speed should be ~43 million packet
per second on 3.7Ghz x86 and 'perf report' should look like:
37.69% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core
25.81% kpktgend_0 [kernel.vmlinux] [k] kfree_skb
7.22% kpktgend_0 [kernel.vmlinux] [k] ip_rcv
5.68% kpktgend_0 [pktgen] [k] pktgen_thread_worker
If fib_table_lookup is seen on top, it means skb was processed
by the stack. To benchmark netif_receive_skb only make sure
that 'dst_mac' of your pktgen script is different from
receiving device mac and it will be dropped by ip_rcv
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags,
with a negation of the flag like !NO_TIMESTAMP.
Also document the option flag NO_TIMESTAMP.
Fixes: afb84b626184 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
More accurately, listen all netns that have a nsid assigned into the netns
where the netlink socket is opened.
For this purpose, a netlink socket option is added:
NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
socket will receive netlink notifications from all netns that have a nsid
assigned into the netns where the socket has been opened. The nsid is sent
to userland via an anscillary data.
With this patch, a daemon needs only one socket to listen many netns. This
is useful when the number of netns is high.
Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
the field nsid is valid or not. skb->cb is initialized to 0 on skb
allocation, thus we are sure that we will never send a nsid 0 by error to
the userland.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Before this patch, nsid were protected by the rtnl lock. The goal of this
patch is to be able to find a nsid without needing to hold the rtnl lock.
The next patch will introduce a netlink socket option to listen to all
netns that have a nsid assigned into the netns where the socket is opened.
Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
avoid a recursive lock (nsid are notified via rtnl). This was the main
reason of the previous patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
There is no functional change with this patch. It will ease the refactoring
of the locking system that protects nsids and the support of the netlink
socket option NETLINK_LISTEN_ALL_NSID.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
In a following commit, a new function will be introduced to only lookup for
a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
existing function is renamed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
The goal of this commit is to prepare the rework of the locking of nsnid
protection.
After this patch, rtnl_net_notifyid() will not call anymore __peernet2id(),
ie no idr_* operation into this function.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
All callers of this function expect a nsid, not an error.
Thus, returns NETNSA_NSID_NOT_ASSIGNED in case of error so that callers
don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Under tcp memory pressure, calling epoll_wait() in edge triggered
mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
even when there is sufficient memory available to continue making
progress. The problem is that when __sk_mem_schedule() returns 0
under memory pressure, we do not set the SOCK_NOSPACE flag in the
tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
SOCK_NOSPACE is used to trigger wakeups when incoming acks create
sufficient new space in the write queue, all outstanding packets
are acked, but we never wake up with the the EPOLLOUT that we are
expecting from epoll_wait().
This issue is currently limited to epoll() when used in edge trigger
mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
This is sufficient for poll()/select() and epoll() in level trigger
mode. However, in edge trigger mode, epoll() is relying on the write
path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
can only call epoll_wait() after read/write return -EAGAIN. Thus, in
the case of the socket write, we are relying on the fact that
tcp_sendmsg()/network write paths are going to issue a wakeup for
us at some point in the future when we get -EAGAIN.
Normally, epoll() edge trigger works fine when we've exceeded the
sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
we return -EAGAIN from the write path b/c we are over the tcp memory
limits and not b/c we are over the sndbuf, we are never going to get
another wakeup.
I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule()
will return 0, or failure more readily with SO_SNDBUF:
1) create socket and set SO_SNDBUF to N
2) add socket as edge trigger
3) write to socket and block in epoll on -EAGAIN
4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem
The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
to waking up outstanding waiters is also used to expand the size of
the sk->sndbuf. However, we will not expand it by setting it in this
case because tcp_should_expand_sndbuf(), ensures that no expansion
occurs when we are under tcp memory pressure.
Note that we could still hang if sk->sk_wmem_queue is 0, when we get
the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
are waiting for and event that will never happen. I believe
that this case is harder to hit (and did not hit in my testing),
in that over the tcp 'soft' memory limits, we continue to guarantee a
minimum write buffer size. Perhaps, we could return -ENOSPC in this
case, or maybe we simply issue a wakeup in this case, such that we
keep retrying the write. Note that this case is not specific to
epoll() ET, but rather would affect blocking sockets as well. So I
view this patch as bringing epoll() edge-trigger into sync with the
current poll()/select()/epoll() level trigger and blocking sockets
behavior.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Seccomp has always been a special candidate when it comes to preparation
of its filters in seccomp_prepare_filter(). Due to the extra checks and
filter rewrite it partially duplicates code and has BPF internals exposed.
This patch adds a generic API inside the BPF code code that seccomp can use
and thus keep it's filter preparation code minimal and better maintainable.
The other side-effect is that now classic JITs can add seccomp support as
well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.
Tested with seccomp and BPF test suites.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
When seccomp BPF was added, it was discussed to add __GFP_NOWARN
flag for their configuration path as f.e. up to 32K allocations are
more prone to fail under stress. As we're going to reuse BPF API,
add __GFP_NOWARN flags where larger kmalloc() and friends allocations
could fail.
It doesn't make much sense to pass around __GFP_NOWARN everywhere as
an extra argument only for seccomp while we just as well could run
into similar issues for socket filters, where it's not desired to
have a user application throw a WARN() due to allocation failure.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Remove the calls to bpf_check_classic(), bpf_convert_filter() and
bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
instead.
seccomp_check_filter() is passed to bpf_prepare_filter() so that it
gets called from there, after bpf_check_classic().
We can now remove exposure of two internal classic BPF functions
previously used by seccomp. The export of bpf_check_classic() symbol,
previously known as sk_chk_filter(), was there since pre git times,
and no in-tree module was using it, therefore remove it.
Joint work with Daniel Borkmann.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
This is in preparation for use by the seccomp code, the rationale is
not to duplicate additional code within the seccomp layer, but instead,
have it abstracted and hidden within the classic BPF API.
As an interim step, this now also makes bpf_prepare_filter() visible
(not as exported symbol though), so that seccomp can reuse that code
path instead of reimplementing it.
Joint work with Daniel Borkmann.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
> net/core/skbuff.c:4108:13: sparse: incorrect type in assignment (different base types)
> net/ipv6/mcast_snoop.c:63 ipv6_mc_check_exthdrs() warn: unsigned 'offset' is never less than zero.
Introduced by 9afd85c9e4552b276e2f4cfefd622bdeeffbbf26
("net: Export IGMP/MLD message validation code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
improves ingress+u32 performance from 22.4 Mpps to 22.9 Mpps
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Some users of flow keys (well just sch_choke now) need to pass
flow_keys in skbuff cb, and use them for exact comparisons of flows
so that skb->hash is not sufficient. In order to increase size of
the flow_keys structure, we introduce another structure for
the purpose of passing flow keys in skbuff cb. We limit this structure
to sixteen bytes, and we will technically treat this as a digest of
flow_keys struct hence its name flow_keys_digest. In the first
incaranation we just copy the flow_keys structure up to 16 bytes--
this is the same information previously passed in the cb. In the
future, we'll adapt this for larger flow_keys and could use something
like SHA-1 over the whole flow_keys to improve the quality of the
digest.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
This calls flow_disect and __skb_get_hash to procure a hash for a
packet. Input includes a key to initialize jhash. This function
does not set skb->hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.
Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.
In two cpu case when both cores are receiving traffic on the same
device and go into the same ingress+u32 the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \ \ \ \ \ \
| |_|_|_|_|_|/
|/| | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A rather largish update for everything time and timer related:
- Cache footprint optimizations for both hrtimers and timer wheel
- Lower the NOHZ impact on systems which have NOHZ or timer migration
disabled at runtime.
- Optimize run time overhead of hrtimer interrupt by making the clock
offset updates smarter
- hrtimer cleanups and removal of restrictions to tackle some
problems in sched/perf
- Some more leap second tweaks
- Another round of changes addressing the 2038 problem
- First step to change the internals of clock event devices by
introducing the necessary infrastructure
- Allow constant folding for usecs/msecs_to_jiffies()
- The usual pile of clockevent/clocksource driver updates
The hrtimer changes contain updates to sched, perf and x86 as they
depend on them plus changes all over the tree to cleanup API changes
and redundant code, which got copied all over the place. The y2038
changes touch s390 to remove the last non 2038 safe code related to
boot/persistant clock"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
clocksource: Increase dependencies of timer-stm32 to limit build wreckage
timer: Minimize nohz off overhead
timer: Reduce timer migration overhead if disabled
timer: Stats: Simplify the flags handling
timer: Replace timer base by a cpu index
timer: Use hlist for the timer wheel hash buckets
timer: Remove FIFO "guarantee"
timers: Sanitize catchup_timer_jiffies() usage
hrtimer: Allow hrtimer::function() to free the timer
seqcount: Introduce raw_write_seqcount_barrier()
seqcount: Rename write_seqcount_barrier()
hrtimer: Fix hrtimer_is_queued() hole
hrtimer: Remove HRTIMER_STATE_MIGRATE
selftest: Timers: Avoid signal deadlock in leap-a-day
timekeeping: Copy the shadow-timekeeper over the real timekeeper last
clockevents: Check state instead of mode in suspend/resume path
selftests: timers: Add leap-second timer edge testing to leap-a-day.c
ntp: Do leapsecond adjustment in adjtimex read path
time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
ntp: Introduce and use SECS_PER_DAY macro instead of 86400
...
|
| |\ \ \ \ \ \
| | | |/ / / /
| | |/| | | |
| | | | | | |
| | | | | | | |
Make sure the upstream fixes are applied before adding further
modifications.
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20150414203503.165258315@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f7685831e0
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.
This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.
alloc_skb_with_frags is the same.
The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.
V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer
Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |_|_|_|_|/
|/| | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
deactivating swap
Jeff Layton reported the following;
[ 74.232485] ------------[ cut here ]------------
[ 74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
[ 74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
[ 74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
[ 74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 74.245546] 0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
[ 74.246786] 0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
[ 74.248175] 0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
[ 74.249483] Call Trace:
[ 74.249872] [<ffffffff8179263d>] dump_stack+0x45/0x57
[ 74.250703] [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0
[ 74.251655] [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20
[ 74.252585] [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80
[ 74.253519] [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc]
[ 74.254537] [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
[ 74.255610] [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs]
[ 74.256582] [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80
[ 74.257496] [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0
[ 74.258318] [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140
[ 74.259253] [<ffffffff81798dae>] system_call_fastpath+0x12/0x71
[ 74.260158] ---[ end trace 2530722966429f10 ]---
The warning in question was unnecessary but with Jeff's series the rules
are also clearer. This patch removes the warning and updates the comment
to explain why sk_mem_reclaim() may still be called.
[jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon
points out that it's not needed.]
Cc: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |_|_|_|/
|/| | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Commit 70008aa50e92 ("skbuff: convert to skb_orphan_frags") replaced
open coded tests of SKBTX_DEV_ZEROCOPY and skb_copy_ubufs with calls
to helper function skb_orphan_frags. Apply that to the last remaining
open coded site.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |_|_|/
|/| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This reverts commit f96dee13b8e10f00840124255bed1d8b4c6afd6f.
It isn't right, ethtool is meant to manage one PHY instance
per netdevice at a time, and this is selected by the SET
command. Therefore by definition the GET command must only
return the settings for the configured and selected PHY.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When trying to configure the settings for PHY1, using commands
like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
modify other settings apart from the speed of the PHY1, in the
above case.
The ethtool seems to query the settings for PHY0, and use this
as the base to apply the new settings to the PHY1. This is
causing the other settings of the PHY 1 to be wrongly
configured.
The issue is caused by the '_ethtool_get_settings()' API, which
gets called because of the 'ETHTOOL_GSET' command, is clearing
the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
memset. This clears all the parameters (if any) passed for the
'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
with 'cmd->phy_address' as '0'.
The '_ethtool_get_settings()' is called from other files in the
'net/core'. So the fix is applied to the 'ethtool_get_settings()'
which is only called in the context of the 'ethtool'.
Signed-off-by: Arun Parameswaran <aparames@broadcom.com>
Reviewed-by: Ray Jui <rjui@broadcom.com>
Reviewed-by: Scott Branden <sbranden@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |/ /
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Before the patch, the command 'ip link add bond2 type bond mode 802.3ad'
causes the kernel to send a rtnl message for the bond2 interface, with an
ifindex 0.
'ip monitor' shows:
0: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 state DOWN group default
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
9: bond2@NONE: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
[snip]
The patch fixes the spotted bug by checking in bond driver if the interface
is registered before calling the notifier chain.
It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the
future.
Fixes: d4261e565000 ("bonding: create netlink event when bonding option is changed")
CC: Jiri Pirko <jiri@resnulli.us>
Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Usually, RTM_NEWxxx is returned on a get (same as a dump).
Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The code in __netdev_upper_dev_link() has an over-stringent
loop detection logic that actually prevents valid configurations
from working correctly.
In particular, the logic returns an error if an upper device
is already in the list of all upper devices for a given dev.
This particular check seems to be a overzealous as it disallows
perfectly valid configurations. For example:
# ip l a link eth0 name eth0.10 type vlan id 10
# ip l a dev br0 typ bridge
# ip l s eth0.10 master br0
# ip l s eth0 master br0 <--- Will fail
If you switch the last two commands (add eth0 first), then both
will succeed. If after that, you remove eth0 and try to re-add
it, it will fail!
It appears to be enough to simply check adj_list to keeps things
safe.
I've tried stacking multiple devices multiple times in all different
combinations, and either rx_handler registration prevented the stacking
of the device linking cought the error.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit c243d7e20996254f89c28d4838b5feca735c030d.
That patch is solving a non-existant problem while creating a
real problem. Just because a socket is allocated in the init
name space doesn't mean that it gets hashed in the init name space.
When we unhash it the name space must be the same as the one
we had when we hashed it. So this patch is completely bogus
and causes socket leaks.
Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.
Libraries like libnl will wait forever for NLMSG_DONE.
Fixes: e5a55a898720 ("net: create generic bridge ops")
Fixes: 815cccbf10b2 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Sathya Perla <sathya.perla@emulex.com>
CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
CC: Ajit Khaparde <ajit.khaparde@emulex.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: intel-wired-lan@lists.osuosl.org
CC: Jiri Pirko <jiri@resnulli.us>
CC: Scott Feldman <sfeldma@gmail.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 567e4b79731c ("net: rfs: add hash collision detection") had one
mistake :
RPS_NO_CPU is no longer the marker for invalid cpu in set_rps_cpu()
and get_rps_cpu(), as @next_cpu was the result of an AND with
rps_cpu_mask
This bug showed up on a host with 72 cpus :
next_cpu was 0x7f, and the code was trying to access percpu data of an
non existent cpu.
In a follow up patch, we might get rid of compares against nr_cpu_ids,
if we init the tables with 0. This is silly to test for a very unlikely
condition that exists only shortly after table initialization, as
we got rid of rps_reset_sock_flow() and similar functions that were
writing this RPS_NO_CPU magic value at flow dismantle : When table is
old enough, it never contains this value anymore.
Fixes: 567e4b79731c ("net: rfs: add hash collision detection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When I added pfmemalloc support in build_skb(), I forgot netlink
was using build_skb() with a vmalloc() area.
In this patch I introduce __build_skb() for netlink use,
and build_skb() is a wrapper handling both skb->head_frag and
skb->pfmemalloc
This means netlink no longer has to hack skb->head_frag
[ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 1567.700067] Dumping ftrace buffer:
[ 1567.700067] (ftrace buffer empty)
[ 1567.700067] Modules linked in:
[ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
[ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
[ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
[ 1567.700067] RSP: 0018:ffff8802467779d8 EFLAGS: 00010202
[ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
[ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
[ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
[ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
[ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
[ 1567.700067] FS: 00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
[ 1567.700067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
[ 1567.700067] Stack:
[ 1567.700067] ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
[ 1567.700067] ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
[ 1567.700067] ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
[ 1567.700067] Call Trace:
[ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
[ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
[ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
[ 1567.774369] sock_write_iter (net/socket.c:823)
[ 1567.774369] ? sock_sendmsg (net/socket.c:806)
[ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
[ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 1567.774369] ? default_llseek (fs/read_write.c:487)
[ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
[ 1567.774369] vfs_write (fs/read_write.c:539)
[ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
[ 1567.774369] ? SyS_read (fs/read_write.c:577)
[ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
build_skb() should look at the page pfmemalloc status.
If set, this means page allocator allocated this page in the
expectation it would help to free other pages. Networking
stack can do that only if skb->pfmemalloc is also set.
Also, we must refrain using high order pages from the pfmemalloc
reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
them. Under memory pressure, using order-0 pages is probably the best
strategy.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In commit 04ffcb255f22 ("net: Add ndo_gso_check") Tom originally
added the 'dev' argument to be able to call ndo_gso_check().
Then later, when generalizing this in commit 5f35227ea34b
("net: Generalize ndo_gso_check to ndo_features_check")
Jesse removed the call to ndo_gso_check() in netif_needs_gso()
by calling the new ndo_features_check() in a different place.
This made the 'dev' argument unused.
Remove the unused argument and go back to the code as before.
Cc: Tom Herbert <therbert@google.com>
Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
> Le 15/04/2015 15:57, Herbert Xu a écrit :
> >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
> [snip]
> >Subject: skbuff: Do not scrub skb mark within the same name space
> >
> >The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
> Maybe add a Fixes tag?
> Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
>
> >harmonize cleanup done on skb on rx path") broke anyone trying to
> >use netfilter marking across IPv4 tunnels. While most of the
> >fields that are cleared by skb_scrub_packet don't matter, the
> >netfilter mark must be preserved.
> >
> >This patch rearranges skb_scurb_packet to preserve the mark field.
> nit: s/scurb/scrub
>
> Else it's fine for me.
Sure.
PS I used the wrong email for James the first time around. So
let me repeat the question here. Should secmark be preserved
or cleared across tunnels within the same name space? In fact,
do our security models even support name spaces?
---8<---
The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
harmonize cleanup done on skb on rx path") broke anyone trying to
use netfilter marking across IPv4 tunnels. While most of the
fields that are cleared by skb_scrub_packet don't matter, the
netfilter mark must be preserved.
This patch rearranges skb_scrub_packet to preserve the mark field.
Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
| |
This patch reverts commit b8fb4e0648a2ab3734140342002f68fb0c7d1602
because the secmark must be preserved even when a packet crosses
namespace boundaries. The reason is that security labels apply to
the system as a whole and is not per-namespace.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For the short-term solution, lets fix bpf helper functions to use
skb->mac_header relative offsets instead of skb->data in order to
get the same eBPF programs with cls_bpf and act_bpf work on ingress
and egress qdisc path. We need to ensure that mac_header is set
before calling into programs. This is effectively the first option
from below referenced discussion.
More long term solution for LD_ABS|LD_IND instructions will be more
intrusive but also more beneficial than this, and implemented later
as it's too risky at this point in time.
I.e., we plan to look into the option of moving skb_pull() out of
eth_type_trans() and into netif_receive_skb() as has been suggested
as second option. Meanwhile, this solution ensures ingress can be
used with eBPF, too, and that we won't run into ABI troubles later.
For dealing with negative offsets inside eBPF helper functions,
we've implemented bpf_skb_clone_unwritable() to test for unwriteable
headers.
Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
| |
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Al Viro says:
====================
netdev-related stuff in vfs.git
There are several commits sitting in vfs.git that probably ought to go in
via net-next.git. First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the ->sendmsg()
and ->recvmsg() patches a while ago. It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter. The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO. Next there's
afs and rxrpc stuff from dhowells. And then there's sanitizing kernel_sendmsg
et.al. + missing inlined helper for "how much data is left in msg->msg_iter" -
this stuff is used in e.g. cifs stuff, but it belongs in net-next.
That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem
I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| | |
convert open-coded instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.
Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.
Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when a ingress qdisc is being either
initialized or destructed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-04-11
This series contains updates to iflink, ixgbe and ixgbevf.
The entire set of changes come from Vlad Zolotarov to ultimately add
the ethtool ops to VF driver to allow querying the RSS indirection table
and RSS random key.
Currently we support only 82599 and x540 devices. On those devices, VFs
share the RSS redirection table and hash key with a PF. Letting the VF
query this information may introduce some security risks, therefore this
feature will be disabled by default.
The new netdev op allows a system administrator to change the default
behaviour with "ip link set" command. The relevant iproute2 patch has
already been sent and awaits for this series upstream.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add configuration setting for drivers to allow/block an RSS Redirection
Table and a Hash Key querying for discrete VFs.
On some devices VF share the mentioned above information with PF and
querying it may adduce a theoretical security risk. We want to let a
system administrator to decide if he/she wants to take this risk or not.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
With recent adoption of skc_cookie in struct sock_common,
struct tcp_timewait_sock size increased from 192 to 200 bytes
on 64bit arches. SLAB rounds then to 256 bytes.
It is time to drop SLAB_HWCACHE_ALIGN constraint for twsk_slab.
This saves about 12 MB of memory on typical configuration reaching
262144 timewait sockets, and has no noticeable impact on performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| | |
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When an FDB entry is added or deleted the information about VLAN
is not passed to listening applications like 'bridge monitor fdb'.
With this patch VLAN ID is passed if it was set in the original
netlink message.
Also remove an unused bdev variable.
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| | |
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|