aboutsummaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAge
* packet: free packet_rollover after synchronize_netWillem de Bruijn2015-06-21
| | | | | | | | | | | | | Destruction of the po->rollover must be delayed until there are no more packets in flight that can access it. The field is destroyed in packet_release, before synchronize_net. Delay using rcu. Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter: Remove spurios included of netfilter.hEric W Biederman2015-06-18
| | | | | | | | | While testing my netfilter changes I noticed several files where recompiling unncessarily because they unncessarily included netfilter.h. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: don't pull include/linux/netfilter.h from netns headersPablo Neira Ayuso2015-06-18
| | | | | | | | | | | | | | | | | | | | | | | This pulls the full hook netfilter definitions from all those that include net_namespace.h. Instead let's just include the bare minimum required in the new linux/netfilter_defs.h file, and use it from the netfilter netns header files. I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit this compilation error: In file included from include/linux/netfilter_defs.h:4:0, from include/net/netns/netfilter.h:4, from include/net/net_namespace.h:22, from include/linux/netdevice.h:43, from net/netfilter/nfnetlink_queue_core.c:23: include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type struct in_addr in; And also explicit include linux/netfilter.h in several spots. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* netfilter: use forward declaration instead of including linux/proc_fs.hPablo Neira Ayuso2015-06-18
| | | | | | | | | | | | | | | | We don't need to pull the full definitions in that file, a simple forward declaration is enough. Moreover, include linux/procfs.h from nf_synproxy_core, otherwise this hits a compilation error due to missing declarations, ie. net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’: net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration] if (!proc_create("synproxy", S_IRUGO, net->proc_net_stat, ^ Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* net: sched: Simplify em_ipset_matchEric W. Biederman2015-06-18
| | | | | | | | em->net is always set and always available, use it in preference to dev_net(skb->dev). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: Kill unused copies of RCV_SKB_FAILEric W. Biederman2015-06-18
| | | | | | | | This appears to have been a dead macro in both nfnetlink_log.c and nfnetlink_queue_core.c since these pieces of code were added in 2005. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: bridge: split ipv6 code into separated filePablo Neira Ayuso2015-06-18
| | | | | | | | | Resolve compilation breakage when CONFIG_IPV6 is not set by moving the IPv6 code into a separated br_netfilter_ipv6.c file. Fixes: efb6de9b4ba0 ("netfilter: bridge: forward IPv6 fragmented packets") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.cPablo Neira Ayuso2015-06-18
| | | | | | To prepare separation of the IPv6 code into different file. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: xt_socket: add XT_SOCKET_RESTORESKMARK flagHarout Hedeshian2015-06-18
| | | | | | | | | | | | | | | | | | | | | | | | | xt_socket is useful for matching sockets with IP_TRANSPARENT and taking some action on the matching packets. However, it lacks the ability to match only a small subset of transparent sockets. Suppose there are 2 applications, each with its own set of transparent sockets. The first application wants all matching packets dropped, while the second application wants them forwarded somewhere else. Add the ability to retore the skb->mark from the sk_mark. The mark is only restored if a matching socket is found and the transparent / nowildcard conditions are satisfied. Now the 2 hypothetical applications can differentiate their sockets based on a mark value set with SO_MARK. iptables -t mangle -I PREROUTING -m socket --transparent \ --restore-skmark -j action iptables -t mangle -A action -m mark --mark 10 -j action2 iptables -t mangle -A action -m mark --mark 11 -j action3 Signed-off-by: Harout Hedeshian <harouth@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nfnetlink_queue: add security context informationRoman Kubiak2015-06-18
| | | | | | | | | | | | This patch adds an additional attribute when sending packet information via netlink in netfilter_queue module. It will send additional security context data, so that userspace applications can verify this context against their own security databases. Signed-off-by: Roman Kubiak <r.kubiak@samsung.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* bpf: disallow bpf tc programs access current->pid,uidAlexei Starovoitov2015-06-15
| | | | | | | | Accessing current->pid/uid from cls_bpf may lead to misleading results and should not be used when TC classifiers need accurate information about pid/uid. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock_diag: implement a get_info handler for inetCraig Gallek2015-06-15
| | | | | | | | | | | | | | | | | | | | | This get_info handler will simply dispatch to the appropriate existing inet protocol handler. This patch also includes a new netlink attribute (INET_DIAG_PROTOCOL). This attribute is currently only used for multicast messages. Without this attribute, there is no way of knowing the IP protocol used by the socket information being broadcast. This attribute is not necessary in the 'dump' variant of this protocol (though it could easily be added) because dump requests are issued for specific family/protocol pairs. Tested: ss -E (note, the -E option has not yet been merged into the upstream version of ss). Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock_diag: specify info_size per inet protocolCraig Gallek2015-06-15
| | | | | | | | | | | | | | | Previously, there was no clear distinction between the inet protocols that used struct tcp_info to report information and those that didn't. This change adds a specific size attribute to the inet_diag_handler struct which defines these interfaces. This will make dispatching sock_diag get_info requests identical for all inet protocols in a following patch. Tested: ss -au Tested: ss -at Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock_diag: define destruction multicast groupsCraig Gallek2015-06-15
| | | | | | | | | | | | | | | These groups will contain socket-destruction events for AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP. Near the end of socket destruction, a check for listeners is performed. In the presence of a listener, rather than completely cleanup the socket, a unit of work will be added to a private work queue which will first broadcast information about the socket and then finish the cleanup operation. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/core: Add reading VF statistics through the PF netdeviceEran Ben Elisha2015-06-15
| | | | | | | | | | | Add ndo_get_vf_stats where the PF retrieves and fills the VFs traffic statistics. We encode the VF stats in a nested manner to allow for future extensions. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: del external_learned fdbs from device on flush or ageoutScott Feldman2015-06-15
| | | | | | | | | | | | | | | | | | | | | | | | We need to delete from offload the device externally learnded fdbs when any one of these events happen: 1) Bridge ages out fdb. (When bridge is doing ageing vs. device doing ageing. If device is doing ageing, it would send SWITCHDEV_FDB_DEL directly). 2) STP state change flushes fdbs on port. 3) User uses sysfs interface to flush fdbs from bridge or bridge port: echo 1 >/sys/class/net/BR_DEV/bridge/flush echo 1 >/sys/class/net/BR_PORT/brport/flush 4) Offload driver send event SWITCHDEV_FDB_DEL to delete fdb entry. For rocker, we can now get called to delete fdb entry in wait and nowait contexts, so set NOWAIT flag when deleting fdb entry. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'nfc-next-4.2-1' of ↵David S. Miller2015-06-15
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next Samuel Ortiz says: ==================== NFC 4.2 pull request This is the NFC pull request for 4.2. - NCI drivers can now define their own handlers for processing proprietary NCI responses and notifications. - NFC vendors can use a dedicated netlink API to send their own proprietary commands, like e.g. all commands needed to implement vendor specific manufacturing tools. - A new generic NCI over UART driver against which any NCI chipset running on top of a serial interface can register. - The st21nfcb driver is renamed to st-nci as it can and will support most of ST Microelectronics NCI chipsets. - The st21nfcb driver can put its CLF in hibernate mode and save significant amount of power. - A few st21nfcb minor fixes. - The NXP NCI driver now supports ACPI enumeration. - The Marvell NCI driver now supports both USB and serial physical interfaces. - The Marvell NCI drivers also supports NCI frames being muxed over HCI. This is a setting that can be defined by a DT property. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * NFC: nci: remove current SLEEP mode managementVincent Cuissard2015-06-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | NCI deactivate management was modified to support all NCI deactivation type. Problem is that all the API are not ready yet for it. Problem is that with current code, when neard asks to deactivate the tag it sends a deactivate SLEEP but nobody will then send a IDLE deactivate. This IDLE deactivate is mandatory since NFC controller can only be unlocked by DH. Signed-off-by: Vincent Cuissard <cuissard@marvell.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nfcmrvl: add UART driverVincent Cuissard2015-06-11
| | | | | | | | | | | | | | Add support of Marvell NFC chip controlled over UART Signed-off-by: Vincent Cuissard <cuissard@marvell.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: add generic uart supportVincent Cuissard2015-06-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some NFC controller supports UART as host interface. As with SPI, a lot of code can be shared between vendor drivers. This patch add the generic support of UART and provides some extension API for vendor specific needs. This code is strongly inspired by the Bluetooth HCI ldisc implementation. NCI UART vendor drivers will have to register themselves to this layer via nci_uart_register. Underlying tty will have to be configured from user land thanks to an ioctl. Signed-off-by: Vincent Cuissard <cuissard@marvell.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: Export nci_req_completeSamuel Ortiz2015-06-10
| | | | | | | | | | | | Drivers implementing proprietary ops may need it now. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: netlink: Implement vendor command supportSamuel Ortiz2015-06-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vendor commands are passed from userspace through the NFC_CMD_VENDOR netlink command, allowing driver and hardware specific operations implementations like for example RF tuning or production line calibration. Drivers will associate a set of vendor commands to a vendor id, which could typically be an OUI. The netlink kernel implementation will try to match the received vendor id and sub command attributes with the registered ones. When such match is found, the driver defined sub command routine is called. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: Move close ops call in nci_close_deviceChristophe Ricard2015-06-08
| | | | | | | | | | | | | | | | | | When closing the device some data (proprietary commands) might be sent. The core state machine needs to be set for correct command execution. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: Add nci_prop_cmd allowing to send proprietary nci cmdChristophe Ricard2015-06-08
| | | | | | | | | | | | | | | | Handle allowing to send proprietary nci commands anywhere in the nci state machine. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: Add nci init ops for early device initializationChristophe Ricard2015-06-08
| | | | | | | | | | | | | | | | Some device may need to execute some proprietary commands in order to "wake-up"; Before the nci state initialization. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: Add NCI_RESET return code check before setupChristophe Ricard2015-06-08
| | | | | | | | | | | | | | setup was executed in any case, even if NCI_RESET failed. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: Handle proprietary response and notificationsSamuel Ortiz2015-06-08
| | | | | | | | | | | | | | | | Allow for drivers to explicitly define handlers for each proprietary notifications and responses they expect to support. Reviewed-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * NFC: nci: hci: Fix releasing uninitialized skbsJoe Perches2015-06-08
| | | | | | | | | | | | | | | | | | | | | | | | | | Several of these goto exit; uses should be direct returns as skb is not yet initialized by nci_hci_get_param(). Miscellanea: o Use !memcmp instead of memcmp() == 0 o Remove unnecessary goto from if () {... goto exit;} else {...} exit: Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
* | bridge: use either ndo VLAN ops or switchdev VLAN ops to install MASTER vlansScott Feldman2015-06-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | v2: Move struct switchdev_obj automatics to inner scope where there used. v1: To maintain backward compatibility with the existing iproute2 "bridge vlan" command, let bridge's setlink/dellink handler call into either the port driver's 8021q ndo ops or the port driver's bridge_setlink/dellink ops. This allows port driver to choose 8021q ops or the newer bridge_setlink/dellink ops when implementing VLAN add/del filtering on the device. The iproute "bridge vlan" command does not need to be modified. To summarize using the "bridge vlan" command examples, we have: 1) bridge vlan add|del vid VID dev DEV Here iproute2 sets MASTER flag. Bridge's bridge_setlink/dellink is called. Vlan is set on bridge for port. If port driver implements ndo 8021q ops, call those to port driver can install vlan filter on device. Otherwise, if port driver implements bridge_setlink/dellink ops, call those to install vlan filter to device. This option only works if port is bridged. 2) bridge vlan add|del vid VID dev DEV master Same as 1) 3) bridge vlan add|del vid VID dev DEV self Bridge's bridge_setlink/dellink isn't called. Port driver's bridge_setlink/dellink is called, if implemented. This option works if port is bridged or not. If port is not bridged, a VLAN can still be added/deleted to device filter using this variant. 4) bridge vlan add|del vid VID dev DEV master self This is a combination of 1) and 3), but will only work if port is bridged. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: allow networking programs to use bpf_trace_printk() for debuggingAlexei Starovoitov2015-06-15
| | | | | | | | | | | | | | | | | | | | | | bpf_trace_printk() is a helper function used to debug eBPF programs. Let socket and TC programs use it as well. Note, it's DEBUG ONLY helper. If it's used in the program, the kernel will print warning banner to make sure users don't use it in production. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: introduce current->pid, tgid, uid, gid, comm accessorsAlexei Starovoitov2015-06-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF programs attached to kprobes need to filter based on current->pid, uid and other fields, so introduce helper functions: u64 bpf_get_current_pid_tgid(void) Return: current->tgid << 32 | current->pid u64 bpf_get_current_uid_gid(void) Return: current_gid << 32 | current_uid bpf_get_current_comm(char *buf, int size_of_buf) stores current->comm into buf They can be used from the programs attached to TC as well to classify packets based on current task fields. Update tracex2 example to print histogram of write syscalls for each process instead of aggregated for all. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2015-06-15
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This a bit large (and late) patchset that contains Netfilter updates for net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of x_tables percpu ruleset copy and rework of the nf_tables netdev support. More specifically, they are: 1) Warn the user when there is a better protocol conntracker available, from Marcelo Ricardo Leitner. 2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard Thaler. This comes with several patches to prepare the change in first place. 3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This is not needed anymore since now we use the largest fragment size to refragment, from Florian Westphal. 4) Restore vlan tag when refragmenting in br_netfilter, also from Florian. 5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another follow up patch to refine it from Eric Dumazet. 6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik. 7) Get rid of parens in Netfilter Kconfig files. 8) Attach the net_device to the basechain as opposed to the initial per table approach in the nf_tables netdev family. 9) Subscribe to netdev events to detect the removal and registration of a device that is referenced by a basechain. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: nf_tables_netdev: unregister hooks on net_device removalPablo Neira Ayuso2015-06-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case the net_device is gone, we have to unregister the hooks and put back the reference on the net_device object. Once it comes back, register them again. This also covers the device rename case. This patch also adds a new flag to indicate that the basechain is disabled, so their hooks are not registered. This flag is used by the netdev family to handle the case where the net_device object is gone. Currently this flag is not exposed to userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables: add nft_register_basechain() and ↵Pablo Neira Ayuso2015-06-15
| | | | | | | | | | | | | | | | | | | | | | | | nft_unregister_basechain() This wrapper functions take care of hook registration for basechains. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables: attach net_device to basechainPablo Neira Ayuso2015-06-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device is part of the hook configuration, so instead of a global configuration per table, set it to each of the basechain that we create. This patch reworks ebddf1a8d78a ("netfilter: nf_tables: allow to bind table to net_device"). Note that this adds a dev_name field in the nft_base_chain structure which is required the netdev notification subscription that follows up in a patch to handle gone net_devices. Suggested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: x_tables: remove XT_TABLE_INFO_SZ and a dereference.Eric Dumazet2015-06-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After Florian patches, there is no need for XT_TABLE_INFO_SZ anymore : Only one copy of table is kept, instead of one copy per cpu. We also can avoid a dereference if we put table data right after xt_table_info. It reduces register pressure and helps compiler. Then, we attempt a kmalloc() if total size is under order-3 allocation, to reduce TLB pressure, as in many cases, rules fit in 32 KB. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | Merge branch 'master' of git://blackhole.kfki.hu/nf-nextPablo Neira Ayuso2015-06-15
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jozsef Kadlecsik says: ==================== ipset patches for nf-next Please consider to apply the next bunch of patches for ipset. First comes the small changes, then the bugfixes and at the end the RCU related patches. * Use MSEC_PER_SEC consistently instead of the number. * Use SET_WITH_*() helpers to test set extensions from Sergey Popovich. * Check extensions attributes before getting extensions from Sergey Popovich. * Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich. * Make sure we always return line number on batch in the case of error from Sergey Popovich. * Check CIDR value only when attribute is given from Sergey Popovich. * Fix cidr handling for hash:*net* types, reported by Jonathan Johnson. * Fix parallel resizing and listing of the same set so that the original set is kept for the whole dumping. * Make sure listing doesn't grab a set which is just being destroyed. * Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU. * Replace rwlock_t with spinlock_t in "struct ip_set", change the locking in the core and simplifications in the timeout routines. * Introduce RCU locking in bitmap:* types with a slight modification in the logic on how an element is added. * Introduce RCU locking in hash:* types. This is the most complex part of the changes. * Introduce RCU locking in list type where standard rculist is used. * Fix coding styles reported by checkpatch.pl. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: ipset: Fix coding styles reported by checkpatch.plJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Introduce RCU locking in list typeJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | Standard rculist is used. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Introduce RCU locking in hash:* typesJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Three types of data need to be protected in the case of the hash types: a. The hash buckets: standard rcu pointer operations are used. b. The element blobs in the hash buckets are stored in an array and a bitmap is used for book-keeping to tell which elements in the array are used or free. c. Networks per cidr values and the cidr values themselves are stored in fix sized arrays and need no protection. The values are modified in such an order that in the worst case an element testing is repeated once with the same cidr value. The ipset hash approach uses arrays instead of lists and therefore is incompatible with rhashtable. Performance is tested by Jesper Dangaard Brouer: Simple drop in FORWARD ~~~~~~~~~~~~~~~~~~~~~~ Dropping via simple iptables net-mask match:: iptables -t raw -N simple || iptables -t raw -F simple iptables -t raw -I simple -s 198.18.0.0/15 -j DROP iptables -t raw -D PREROUTING -j simple iptables -t raw -I PREROUTING -j simple Drop performance in "raw": 11.3Mpps Generator: sending 12.2Mpps (tx:12264083 pps) Drop via original ipset in RAW table ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create a set with lots of elements:: sudo ./ipset destroy test echo "create test hash:ip hashsize 65536" > test.set for x in `seq 0 255`; do for y in `seq 0 255`; do echo "add test 198.18.$x.$y" >> test.set done done sudo ./ipset restore < test.set Dropping via ipset:: iptables -t raw -F iptables -t raw -N net198 || iptables -t raw -F net198 iptables -t raw -I net198 -m set --match-set test src -j DROP iptables -t raw -I PREROUTING -j net198 Drop performance in "raw" with ipset: 8Mpps Perf report numbers ipset drop in "raw":: + 24.65% ksoftirqd/1 [ip_set] [k] ip_set_test - 21.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_lock_bh - _raw_read_lock_bh + 99.88% ip_set_test - 19.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_unlock_bh - _raw_read_unlock_bh + 99.72% ip_set_test + 4.31% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_kadt + 2.27% ksoftirqd/1 [ixgbe] [k] ixgbe_fetch_rx_buffer + 2.18% ksoftirqd/1 [ip_tables] [k] ipt_do_table + 1.81% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_test + 1.61% ksoftirqd/1 [kernel.kallsyms] [k] __netif_receive_skb_core + 1.44% ksoftirqd/1 [kernel.kallsyms] [k] build_skb + 1.42% ksoftirqd/1 [kernel.kallsyms] [k] ip_rcv + 1.36% ksoftirqd/1 [kernel.kallsyms] [k] __local_bh_enable_ip + 1.16% ksoftirqd/1 [kernel.kallsyms] [k] dev_gro_receive + 1.09% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_unlock + 0.96% ksoftirqd/1 [ixgbe] [k] ixgbe_clean_rx_irq + 0.95% ksoftirqd/1 [kernel.kallsyms] [k] __netdev_alloc_frag + 0.88% ksoftirqd/1 [kernel.kallsyms] [k] kmem_cache_alloc + 0.87% ksoftirqd/1 [xt_set] [k] set_match_v3 + 0.85% ksoftirqd/1 [kernel.kallsyms] [k] inet_gro_receive + 0.83% ksoftirqd/1 [kernel.kallsyms] [k] nf_iterate + 0.76% ksoftirqd/1 [kernel.kallsyms] [k] put_compound_page + 0.75% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_lock Drop via ipset in RAW table with RCU-locking ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ With RCU locking, the RW-lock is gone. Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Introduce RCU locking in bitmap:* typesJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's nothing much required because the bitmap types use atomic bit operations. However the logic of adding elements slightly changed: first the MAC address updated (which is not atomic), then the element activated (added). The extensions may call kfree_rcu() therefore we call rcu_barrier() at module removal. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Prepare the ipset core to use RCU at set levelJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace rwlock_t with spinlock_t in "struct ip_set" and change the locking accordingly. Convert the comment extension into an rcu-avare object. Also, simplify the timeout routines. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter:ipset Remove rbtree from hash:net,ifaceJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | Remove rbtree in order to introduce RCU instead of rwlock in ipset Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Make sure listing doesn't grab a set which is just being ↵Jozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | destroyed. There was a small window when all sets are destroyed and a concurrent listing of all sets could grab a set which is just being destroyed. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Fix parallel resizing and listing of the same setJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When elements added to a hash:* type of set and resizing triggered, parallel listing could start to list the original set (before resizing) and "continue" with listing the new set. Fix it by references and using the original hash table for listing. Therefore the destroying of the original hash table may happen from the resizing or listing functions. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Fix cidr handling for hash:*net* typesJozsef Kadlecsik2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit "Simplify cidr handling for hash:*net* types" broke the cidr handling for the hash:*net* types when the sets were used by the SET target: entries with invalid cidr values were added to the sets. Reported by Jonathan Johnson. Testsuite entry is added to verify the fix. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Check CIDR value only when attribute is givenSergey Popovich2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no reason to check CIDR value regardless attribute specifying CIDR is given. Initialize cidr array in element structure on element structure declaration to let more freedom to the compiler to optimize initialization right before element structure is used. Remove local variables cidr and cidr2 for netnet and netportnet hashes as we do not use packed cidr value for such set types and can store value directly in e.cidr[]. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Make sure we always return line number on batchSergey Popovich2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if we return with generic IPSET_ERR_PROTOCOL it is good idea to return line number if we called in batch mode. Moreover we are not always exiting with IPSET_ERR_PROTOCOL. For example hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED or IPSET_ERR_INVALID_CIDR. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Permit CIDR equal to the host address CIDR in IPv6Sergey Popovich2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Permit userspace to supply CIDR length equal to the host address CIDR length in netlink message. Prohibit any other CIDR length for IPv6 variant of the set. Also return -IPSET_ERR_HASH_RANGE_UNSUPPORTED instead of generic -IPSET_ERR_PROTOCOL in IPv6 variant of hash:ip,port,net when IPSET_ATTR_IP_TO attribute is given. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * | netfilter: ipset: Check extensions attributes before getting extensions.Sergey Popovich2015-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make all extensions attributes checks within ip_set_get_extensions() and reduce number of duplicated code. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>