| Commit message (Collapse) | Author | Age |
... | |
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
reason. It looks like it's only a convenience, so remove kmemleak.h
from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
don't already #include it. Also remove <linux/kmemleak.h> from source
files that do not use it.
This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
be good to run it through the 0day bot for other $ARCHes. I have
neither the horsepower nor the storage space for the other $ARCHes.
Update: This patch has been extensively build-tested by both the 0day
bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
for which patches are included here (in v2).
[ slab.h is the second most used header file after module.h; kernel.h is
right there with slab.h. There could be some minor error in the
counting due to some #includes having comments after them and I didn't
combine all of those. ]
[akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au> [2 build failures]
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [2 build failures]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
since linker might place next to it a non zero value preventing a change
to ip6frag_low_thresh.
ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematuraly break user scripts wanting to change it.
Since specifying a minimal value of 0 for proc_doulongvec_minmax()
is moot, let's remove these zero values in all defrag units.
Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After commit 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates
done by __ip_append_data()") and commit 1f4c6eb24029 ("ipv6:
factorize sk_wmem_alloc updates done by __ip6_append_data()"),
when transmitting sub MTU datagram, an addtional, unneeded atomic
operation is performed in ip*_append_data() to update wmem_alloc:
in the above condition the delta is 0.
The above cause small but measurable performance regression in UDP
xmit tput test with packet size below MTU.
This change avoids such overhead updating wmem_alloc only if
wmem_alloc_delta is non zero.
The error path is left intentionally unmodified: it's a slow path
and simplicity is preferred to performances.
Fixes: 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()")
Fixes: 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this patch, when one packet is hashed into path [1]
(hash <= nh_upper_bound) and it's neigh is dead, it will try
path [2]. However, if path [2]'s neigh is alive but it's
hash > nh_upper_bound, it will not return this alive path.
This packet will never be sent even if path [2] is alive.
3.3.3.1/24:
nexthop via 1.1.1.254 dev eth1 weight 1 <--[1] (dead neigh)
nexthop via 2.2.2.254 dev eth2 weight 1 <--[2]
With sysctl_fib_multipath_use_neigh set is supposed to find an
available path respecting to the l3/l4 hash. But if there is
no available route with this hash, it should at least return
an alive route even with other hash.
This patch is to fix it by processing fib_multipath_use_neigh
earlier than the hash check, so that it will at least return
an alive route if there is when fib_multipath_use_neigh is
enabled. It's also compatible with before when there are alive
routes with the l3/l4 hash.
Fixes: a6db4494d218 ("net: ipv4: Consider failed nexthops in multipath routes")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- fix missed rebuild of TRIM_UNUSED_KSYMS
- fix rpm-pkg for GNU tar >= 1.29
- include scripts/dtc/include-prefixes/* to kernel header deb-pkg
- add -no-integrated-as option ealier to fix building with Clang
- fix netfilter Makefile for parallel building
* tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
netfilter: nf_nat_snmp_basic: add correct dependency to Makefile
kbuild: rpm-pkg: Support GNU tar >= 1.29
builddeb: Fix header package regarding dtc source links
kbuild: set no-integrated-as before incl. arch Makefile
kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
nf_nat_snmp_basic_main.c includes a generated header, but the
necessary dependency is missing in Makefile. This could cause
build error in parallel building.
Remove a weird line, and add a correct one.
Fixes: cc2d58634e0f ("netfilter: nf_nat_snmp_basic: use asn1 decoder library")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We want to use dev_set_mtu() regardless of how we calculate
the mtu value.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2018-03-29
1) Fix a rcu_read_lock/rcu_read_unlock imbalance
in the error path of xfrm_local_error().
From Taehee Yoo.
2) Some VTI MTU fixes. From Stefano Brivio.
3) Fix a too early overwritten skb control buffer
on xfrm transport mode.
Please note that this pull request has a merge conflict
in net/ipv4/ip_tunnel.c.
The conflict is between
commit f6cc9c054e77 ("ip_tunnel: Emit events for post-register MTU changes")
from the net tree and
commit 24fc79798b8d ("ip_tunnel: Clamp MTU to bounds on new link")
from the ipsec tree.
It can be solved as it is currently done in linux-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Don't hardcode a MTU value on vti tunnel initialization,
ip_tunnel_newlink() is able to deal with this already. See also
commit ffc2b6ee4174 ("ip_gre: fix IFLA_MTU ignored on NEWLINK").
Fixes: 1181412c1a67 ("net/ipv4: VTI support new module for ip_vti.")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Otherwise, it's possible to specify invalid MTU values directly
on creation of a link (via 'ip link add'). This is already
prevented on subsequent MTU changes by commit b96f9afee4eb
("ipv4/6: use core net MTU range checking").
Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This re-introduces the effect of commit a32452366b72 ("vti4:
Don't count header length twice.") which was accidentally
reverted by merge commit f895f0cfbb77 ("Merge branch 'master' of
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec").
The commit message from Steffen Klassert said:
We currently count the size of LL_MAX_HEADER and struct iphdr
twice for vti4 devices, this leads to a wrong device mtu.
The size of LL_MAX_HEADER and struct iphdr is already counted in
ip_tunnel_bind_dev(), so don't do it again in vti_tunnel_init().
And this is still the case now: ip_tunnel_bind_dev() already
accounts for the header length of the link layer (not
necessarily LL_MAX_HEADER, if the output device is found), plus
one IP header.
For example, with a vti device on top of veth, with MTU of 1500,
the existing implementation would set the initial vti MTU to
1332, accounting once for LL_MAX_HEADER (128, included in
hard_header_len by vti) and twice for the same IP header (once
from hard_header_len, once from ip_tunnel_bind_dev()).
It should instead be 1480, because ip_tunnel_bind_dev() is able
to figure out that the output device is veth, so no additional
link layer header is attached, and will properly count one
single IP header.
The existing issue had the side effect of avoiding PMTUD for
most xfrm policies, by arbitrarily lowering the initial MTU.
However, the only way to get a consistent PMTU value is to let
the xfrm PMTU discovery do its course, and commit d6af1a31cc72
("vti: Add pmtu handling to vti_xmit.") now takes care of local
delivery cases where the application ignores local socket
notifications.
Fixes: b9959fd3b0fa ("vti: switch to new ip tunnel code")
Fixes: f895f0cfbb77 ("Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Currently, the SMC experimental TCP option in a SYN packet is lost on
the server side when SYN Cookies are active. However, the corresponding
SYNACK sent back to the client contains the SMC option. This causes an
inconsistent view of the SMC capabilities on the client and server.
This patch disables the SMC option in the SYNACK when SYN Cookies are
active to avoid this issue.
Fixes: 60e2a7780793b ("tcp: TCP experimental option for SMC")
Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |\ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Don't pick fixed hash implementation for NFT_SET_EVAL sets, otherwise
userspace hits EOPNOTSUPP with valid rules using the meter statement,
from Florian Westphal.
2) If you send a batch that flushes the existing ruleset (that contains
a NAT chain) and the new ruleset definition comes with a new NAT
chain, don't bogusly hit EBUSY. Also from Florian.
3) Missing netlink policy attribute validation, from Florian.
4) Detach conntrack template from skbuff if IP_NODEFRAG is set on,
from Paolo Abeni.
5) Cache device names in flowtable object, otherwise we may end up
walking over devices going aways given no rtnl_lock is held.
6) Fix incorrect net_device ingress with ingress hooks.
7) Fix crash when trying to read more data than available in UDP
packets from the nf_socket infrastructure, from Subash.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
skb_header_pointer will copy data into a buffer if data is non linear,
otherwise it will return a pointer in the linear section of the data.
nf_sk_lookup_slow_v{4,6} always copies data of size udphdr but later
accesses memory within the size of tcphdr (th->doff) in case of TCP
packets. This causes a crash when running with KASAN with the following
call stack -
BUG: KASAN: stack-out-of-bounds in xt_socket_lookup_slow_v4+0x524/0x718
net/netfilter/xt_socket.c:178
Read of size 2 at addr ffffffe3d417a87c by task syz-executor/28971
CPU: 2 PID: 28971 Comm: syz-executor Tainted: G B W O 4.9.65+ #1
Call trace:
[<ffffff9467e8d390>] dump_backtrace+0x0/0x428 arch/arm64/kernel/traps.c:76
[<ffffff9467e8d7e0>] show_stack+0x28/0x38 arch/arm64/kernel/traps.c:226
[<ffffff946842d9b8>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffff946842d9b8>] dump_stack+0xd4/0x124 lib/dump_stack.c:51
[<ffffff946811d4b0>] print_address_description+0x68/0x258 mm/kasan/report.c:248
[<ffffff946811d8c8>] kasan_report_error mm/kasan/report.c:347 [inline]
[<ffffff946811d8c8>] kasan_report.part.2+0x228/0x2f0 mm/kasan/report.c:371
[<ffffff946811df44>] kasan_report+0x5c/0x70 mm/kasan/report.c:372
[<ffffff946811bebc>] check_memory_region_inline mm/kasan/kasan.c:308 [inline]
[<ffffff946811bebc>] __asan_load2+0x84/0x98 mm/kasan/kasan.c:739
[<ffffff94694d6f04>] __tcp_hdrlen include/linux/tcp.h:35 [inline]
[<ffffff94694d6f04>] xt_socket_lookup_slow_v4+0x524/0x718 net/netfilter/xt_socket.c:178
Fix this by copying data into appropriate size headers based on protocol.
Fixes: a583636a83ea ("inet: refactor inet[6]_lookup functions to take skb")
Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |/ /
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The ipv4 nf_ct code currently skips the nf_conntrak_in() call
for fragmented packets. As a results later matches/target can end
up manipulating template ct entry instead of 'real' ones.
Exploiting the above, syzbot found a way to trigger the following
splat:
WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 4242 Comm: syzkaller027971 Not tainted 4.16.0-rc2+ #243
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x24d lib/dump_stack.c:53
panic+0x1e4/0x41c kernel/panic.c:183
__warn+0x1dc/0x200 kernel/panic.c:547
report_bug+0x211/0x2d0 lib/bug.c:184
fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
fixup_bug arch/x86/kernel/traps.c:247 [inline]
do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957
RIP: 0010:xt_cluster_hash net/netfilter/xt_cluster.c:55 [inline]
RIP: 0010:xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
RSP: 0018:ffff8801d2f6f2d0 EFLAGS: 00010293
RAX: ffff8801af700540 RBX: 0000000000000000 RCX: ffffffff84a2d1e1
RDX: 0000000000000000 RSI: ffff8801d2f6f478 RDI: ffff8801cafd336a
RBP: ffff8801d2f6f2e8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801b03b3d18
R13: ffff8801cafd3300 R14: dffffc0000000000 R15: ffff8801d2f6f478
ipt_do_table+0xa91/0x19b0 net/ipv4/netfilter/ip_tables.c:296
iptable_filter_hook+0x65/0x80 net/ipv4/netfilter/iptable_filter.c:41
nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
nf_hook include/linux/netfilter.h:243 [inline]
NF_HOOK include/linux/netfilter.h:286 [inline]
raw_send_hdrinc.isra.17+0xf39/0x1880 net/ipv4/raw.c:432
raw_sendmsg+0x14cd/0x26b0 net/ipv4/raw.c:669
inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:639
SYSC_sendto+0x361/0x5c0 net/socket.c:1748
SyS_sendto+0x40/0x50 net/socket.c:1716
do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x441b49
RSP: 002b:00007ffff5ca8b18 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000441b49
RDX: 0000000000000030 RSI: 0000000020ff7000 RDI: 0000000000000003
RBP: 00000000006cc018 R08: 000000002066354c R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000403470
R13: 0000000000403500 R14: 0000000000000000 R15: 0000000000000000
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
Instead of adding checks for template ct on every target/match
manipulating skb->_nfct, simply drop the template ct when skipping
nf_conntrack_in().
Fixes: 7b4fdf77a450ec ("netfilter: don't track fragmented packets")
Reported-and-tested-by: syzbot+0346441ae0545cfcea3a@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
For tunnels created with IFLA_MTU, MTU of the netdevice is set by
rtnl_create_link() (called from rtnl_newlink()) before the device is
registered. However without IFLA_MTU that's not done.
rtnl_newlink() proceeds by calling struct rtnl_link_ops.newlink, which
via ip_tunnel_newlink() calls register_netdevice(), and that emits
NETDEV_REGISTER. Thus any listeners that inspect the netdevice get the
MTU of 0.
After ip_tunnel_newlink() corrects the MTU after registering the
netdevice, but since there's no event, the listeners don't get to know
about the MTU until something else happens--such as a NETDEV_UP event.
That's not ideal.
So instead of setting the MTU directly, go through dev_set_mtu(), which
takes care of distributing the necessary NETDEV_PRECHANGEMTU and
NETDEV_CHANGEMTU events.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
While testing my inet defrag changes, I found that the senders
could spend ~20% of cpu cycles in skb_set_owner_w() updating
sk->sk_wmem_alloc for every fragment they cook.
The solution to this problem is to use alloc_skb() instead
of sock_wmalloc() and manually perform a single sk_wmem_alloc change.
Similar change for IPv6 is provided in following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Exchange messages with hardware to program the TLS session
CPL handlers for messages received from chip.
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: Michael Werner <werner@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-03-31
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add raw BPF tracepoint API in order to have a BPF program type that
can access kernel internal arguments of the tracepoints in their
raw form similar to kprobes based BPF programs. This infrastructure
also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
returns an anon-inode backed fd for the tracepoint object that allows
for automatic detach of the BPF program resp. unregistering of the
tracepoint probe on fd release, from Alexei.
2) Add new BPF cgroup hooks at bind() and connect() entry in order to
allow BPF programs to reject, inspect or modify user space passed
struct sockaddr, and as well a hook at post bind time once the port
has been allocated. They are used in FB's container management engine
for implementing policy, replacing fragile LD_PRELOAD wrapper
intercepting bind() and connect() calls that only works in limited
scenarios like glibc based apps but not for other runtimes in
containerized applications, from Andrey.
3) BPF_F_INGRESS flag support has been added to sockmap programs for
their redirect helper call bringing it in line with cls_bpf based
programs. Support is added for both variants of sockmap programs,
meaning for tx ULP hooks as well as recv skb hooks, from John.
4) Various improvements on BPF side for the nfp driver, besides others
this work adds BPF map update and delete helper call support from
the datapath, JITing of 32 and 64 bit XADD instructions as well as
offload support of bpf_get_prandom_u32() call. Initial implementation
of nfp packet cache has been tackled that optimizes memory access
(see merge commit for further details), from Jakub and Jiong.
5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
API has been done in order to prepare to use print_bpf_insn() soon
out of perf tool directly. This makes the print_bpf_insn() API more
generic and pushes the env into private data. bpftool is adjusted
as well with the print_bpf_insn() argument removal, from Jiri.
6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
Format). The latter will reuse the current BPF verifier log as
well, thus bpf_verifier_log() is further generalized, from Martin.
7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
and write support has been added in similar fashion to existing
IPv6 IPV6_TCLASS socket option we already have, from Nikita.
8) Fixes in recent sockmap scatterlist API usage, which did not use
sg_init_table() for initialization thus triggering a BUG_ON() in
scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
uses a small helper sg_init_marker() to properly handle the affected
cases, from Prashant.
9) Let the BPF core follow IDR code convention and therefore use the
idr_preload() and idr_preload_end() helpers, which would also help
idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
pressure, from Shaohua.
10) Last but not least, a spelling fix in an error message for the
BPF cookie UID helper under BPF sample code, from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.
Specifically it can e.g. inspect what port was allocated and if it
doesn't satisfy some policy, BPF program can force sys_bind to fail and
return EPERM to user.
Another example of usage is recording the IP:port pair to some map to
use it in later calls to sys_connect. E.g. if some TCP server inside
cgroup was bound to some IP:port_n, it can be recorded to a map. And
later when some TCP client inside same cgroup is trying to connect to
127.0.0.1:port_n, BPF hook for sys_connect can override the destination
and connect application to IP:port_n instead of 127.0.0.1:port_n. That
helps forcing all applications inside a cgroup to use desired IP and not
break those applications if they e.g. use localhost to communicate
between each other.
== Implementation details ==
Post-hooks are implemented as two new attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
Separate attach types for IPv4 and IPv6 are introduced to avoid access
to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
`inet6_bind()` since those fields might not make sense in such cases.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
== The problem ==
See description of the problem in the initial patch of this patch set.
== The solution ==
The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.
It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.
Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
significantly;
* there is no use-case for port.
As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.
Support is added for IPv4 and IPv6, for TCP and UDP.
IPv4 and IPv6 have separate attach types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.
== Implementation notes ==
The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.
bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.
bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
and __inet6_bind() since all call-sites, where bpf_bind() is called,
already hold socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Refactor `bind()` code to make it ready to be called from BPF helper
function `bpf_bind()` (will be added soon). Implementation of
`inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These function can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.
New functions have two additional arguments.
`force_bind_address_no_port` forces binding to IP only w/o checking
`inet_sock.bind_address_no_port` field. It'll allow to bind local end of
a connection to desired IP in `bpf_bind()` w/o changing
`bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
can return an error and we'd need to restore original value of
`bind_address_no_port` in that case if we changed this before calling to
the helper.
`with_lock` specifies whether to lock socket when working with `struct
sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
behavior is preserved. But it will be set to `false` for `bpf_bind()`
use-case. The reason is all call-sites, where `bpf_bind()` will be
called, already hold that socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
== The problem ==
There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured. Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.
Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.
== The solution ==
The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).
It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.
The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.
== Implementation notes ==
[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.
[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".
There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add support for the BPF_F_INGRESS flag in sk_msg redirect helper.
To do this add a scatterlist ring for receiving socks to check
before calling into regular recvmsg call path. Additionally, because
the poll wakeup logic only checked the skb recv queue we need to
add a hook in TCP stack (similar to write side) so that we have
a way to wake up polling socks when a scatterlist is redirected
to that sock.
After this all that is needed is for the redirect helper to
push the scatterlist into the psock receive queue.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.
By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.
Note that after the fast path added by Changli Gao in commit
d6bebca92c66 ("fragment: add fast path for in-order fragments")
this change wont help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when slow path is entered, since we perform
a linear scan of a potentially long list.
Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
spinlock before calling icmp_send()")
While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)
We can use skb_get(head) here, all we want is to make sure skb wont
be freed by another cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonnably well under pressure.
Current memory tracking is using one atomic_t and integers.
Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.
Note that this patch avoids an overflow error, if high_thresh was set
to ~2GB, since this test in inet_frag_alloc() was never true :
if (... || frag_mem_limit(nf) > nf->high_thresh)
Tested:
$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
<frag DDOS>
$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880
$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds 3317150 0.0
IpReasmFails 3317112 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This function is obsolete, after rhashtable addition to inet defrag.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This refactors ip_expire() since one indentation level is removed.
Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDOS, the ICMP message wont be sent because of rate limits.
Fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 should have the same
issue than the one we fixed in commit ec4fbd64751d
("inet: frag: release spinlock before calling icmp_send()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()
Also since we use rhashtable we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405ab ("inet: frag: don't account number
of fragment queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We need to call inet_frags_init() before register_pernet_subsys(),
as a prereq for following patch ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In order to simplify the API, add a pointer to struct inet_frags.
This will allow us to make things less complex.
These functions no longer have a struct inet_frags parameter :
inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We will soon initialize one rhashtable per struct netns_frags
in inet_frags_init_net().
This patch changes the return value to eventually propagate an
error.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree. This batch comes with more input sanitization for xtables to
address bug reports from fuzzers, preparation works to the flowtable
infrastructure and assorted updates. In no particular order, they are:
1) Make sure userspace provides a valid standard target verdict, from
Florian Westphal.
2) Sanitize error target size, also from Florian.
3) Validate that last rule in basechain matches underflow/policy since
userspace assumes this when decoding the ruleset blob that comes
from the kernel, from Florian.
4) Consolidate hook entry checks through xt_check_table_hooks(),
patch from Florian.
5) Cap ruleset allocations at 512 mbytes, 134217728 rules and reject
very large compat offset arrays, so we have a reasonable upper limit
and fuzzers don't exercise the oom-killer. Patches from Florian.
6) Several WARN_ON checks on xtables mutex helper, from Florian.
7) xt_rateest now has a hashtable per net, from Cong Wang.
8) Consolidate counter allocation in xt_counters_alloc(), from Florian.
9) Earlier xt_table_unlock() call in {ip,ip6,arp,eb}tables, patch
from Xin Long.
10) Set FLOW_OFFLOAD_DIR_* to IP_CT_DIR_* definitions, patch from
Felix Fietkau.
11) Consolidate code through flow_offload_fill_dir(), also from Felix.
12) Inline ip6_dst_mtu_forward() just like ip_dst_mtu_maybe_forward()
to remove a dependency with flowtable and ipv6.ko, from Felix.
13) Cache mtu size in flow_offload_tuple object, this is safe for
forwarding as f87c10a8aa1e describes, from Felix.
14) Rename nf_flow_table.c to nf_flow_table_core.o, to simplify too
modular infrastructure, from Felix.
15) Add rt0, rt2 and rt4 IPv6 routing extension support, patch from
Ahmed Abdelsalam.
16) Remove unused parameter in nf_conncount_count(), from Yi-Hung Wei.
17) Support for counting only to nf_conncount infrastructure, patch
from Yi-Hung Wei.
18) Add strict NFT_CT_{SRC_IP,DST_IP,SRC_IP6,DST_IP6} key datatypes
to nft_ct.
19) Use boolean as return value from ipt_ah and from IPVS too, patch
from Gustavo A. R. Silva.
20) Remove useless parameters in nfnl_acct_overquota() and
nf_conntrack_broadcast_help(), from Taehee Yoo.
21) Use ipv6_addr_is_multicast() from xt_cluster, also from Taehee Yoo.
22) Statify nf_tables_obj_lookup_byhandle, patch from Fengguang Wu.
23) Fix typo in xt_limit, from Geert Uytterhoeven.
24) Do no use VLAs in Netfilter code, again from Gustavo.
25) Use ADD_COUNTER from ebtables, from Taehee Yoo.
26) Bitshift support for CONNMARK and MARK targets, from Jack Ma.
27) Use pr_*() and add pr_fmt(), from Arushi Singhal.
28) Add synproxy support to ctnetlink.
29) ICMP type and IGMP matching support for ebtables, patches from
Matthias Schiffer.
30) Support for the revision infrastructure to ebtables, from
Bernie Harris.
31) String match support for ebtables, also from Bernie.
32) Documentation for the new flowtable infrastructure.
33) Use generic comparison functions in ebt_stp, from Joe Perches.
34) Demodularize filter chains in nftables.
35) Register conntrack hooks in case nftables NAT chain is added.
36) Merge assignments with return in a couple of spots in the
Netfilter codebase, also from Arushi.
37) Document that xtables percpu counters are stored in the same
memory area, from Ben Hutchings.
38) Revert mark_source_chains() sanity checks that break existing
rulesets, from Florian Westphal.
39) Use is_zero_ether_addr() in the ipset codebase, from Joe Perches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
underflow/policy"
This reverts commit 0d7df906a0e78079a02108b06d32c3ef2238ad25.
Valdis Kletnieks reported that xtables is broken in linux-next since
0d7df906a0e78 ("netfilter: x_tables: ensure last rule in base chain
matches underflow/policy"), as kernel rejects the (well-formed) ruleset:
[ 64.402790] ip6_tables: last base chain position 1136 doesn't match underflow 1344 (hook 1)
mark_source_chains is not the correct place for such a check, as it
terminates evaluation of a chain once it sees an unconditional verdict
(following rules are known to be unreachable). It seems preferrable to
fix libiptc instead, so remove this check again.
Fixes: 0d7df906a0e78 ("netfilter: x_tables: ensure last rule in base chain matches underflow/policy")
Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Register conntrack hooks if the user adds NAT chains. Users get confused
with the existing behaviour since they will see no packets hitting this
chain until they add the first rule that refers to conntrack.
This patch adds new ->init() and ->free() indirections to chain types
that can be used by NAT chains to invoke the conntrack dependency.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
One module per supported filter chain family type takes too much memory
for very little code - too much modularization - place all chain filter
definitions in one single file.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Use WARN_ON() instead since it should not happen that neither family
goes over NFPROTO_NUMPROTO nor there is already a chain of this type
already registered.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Use nft_ prefix. By when I added chain types, I forgot to use the
nftables prefix. Rename enum nft_chain_type to enum nft_chain_types too,
otherwise there is an overlap.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This patch exposes synproxy information per-conntrack. Moreover, send
sequence adjustment events once server sends us the SYN,ACK packet, so
we can synchronize the sequence adjustment too for packets going as
reply from the server, as part of the synproxy logic.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Harmless from kernel point of view, but again iptables assumes that
this is true when decoding ruleset coming from kernel.
If a (syzkaller generated) ruleset doesn't have the underflow/policy
stored as the last rule in the base chain, then iptables will abort()
because it doesn't find the chain policy.
libiptc assumes that the policy is the last rule in the basechain, which
is only true for iptables-generated rulesets.
Unfortunately this needs code duplication -- the functions need the
struct layout of the rule head, but that is different for
ip/ip6/arptables.
NB: pr_warn could be pr_debug but in case this break rulesets somehow its
useful to know why blob was rejected.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
should have no impact, function still always returns 0.
This patch is only to ease review.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
allows to have size checks in a single spot.
This is supposed to reduce oom situations when fuzz-testing xtables.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Allow followup patch to change on location instead of three.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Userspace must provide a valid verdict to the standard target.
The verdict can be either a jump (signed int > 0), or a return code.
Allowed return codes are either RETURN (pop from stack), NF_ACCEPT, DROP
and QUEUE (latter is allowed for legacy reasons).
Jump offsets (verdict > 0) are checked in more detail later on when
loop-detection is performed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Now it's doing cleanup_entry for oldinfo under the xt_table lock,
but it's not really necessary. After the replacement job is done
in xt_replace_table, oldinfo is not used elsewhere any more, and
it can be freed without xt_table lock safely.
The important thing is that rtnl_lock is called in some xt_target
destroy, which means rtnl_lock, a big lock is used in xt_table
lock, a smaller one. It usually could be the reason why a dead
lock may happen.
Besides, all xt_target/match checkentry is called out of xt_table
lock. It's better also to move all cleanup_entry calling out of
xt_table lock, just as do_replace_finish does for ebtables.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Return statements in functions returning bool should use
true/false instead of 1/0.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Add checking to call to call_fib_entry_notifiers for IPv4 route replace.
Allows a notifier handler to fail the replace.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|