aboutsummaryrefslogtreecommitdiffstats
path: root/net/core/dev.c
Commit message (Collapse)AuthorAge
* net: fix problem in reading sock TX queueTom Herbert2010-07-14
| | | | | | | | | | | | | | | | | | | | | Fix problem in reading the tx_queue recorded in a socket. In dev_pick_tx, the TX queue is read by doing a check with sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get. The problem is that there is not mutual exclusion across these calls in the socket so it it is possible that the queue in the sock can be invalidated after sk_tx_queue_recorded is called so that sk_tx_queue get returns -1, which sets 65535 in queue_index and thus dev_pick_tx returns 65536 which is a bogus queue and can cause crash in dev_queue_xmit. We fix this by only calling sk_tx_queue_get which does the proper checks. The interface is that sk_tx_queue_get returns the TX queue if the sock argument is non-NULL and TX queue is recorded, else it returns -1. sk_tx_queue_recorded is no longer used so it can be completely removed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: skb_tx_hash() fix relative to skb_orphan_try()Eric Dumazet2010-07-14
| | | | | | | | | | | | | | | | | | | | commit fc6055a5ba31e2 (net: Introduce skb_orphan_try()) added early orphaning of skbs. This unfortunately added a performance regression in skb_tx_hash() in case of stacked devices (bonding, vlans, ...) Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to spread tx packets to multiple NIC queues on multiqueue devices. skb_tx_hash() in this case only uses skb->protocol, same value for all flows. skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash() can use this saved sk_hash value to compute its internal hash value. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: decreasing real_num_tx_queues needs to flush qdiscJohn Fastabend2010-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reducing real_num_queues needs to flush the qdisc otherwise skbs with queue_mappings greater then real_num_tx_queues can be sent to the underlying driver. The flow for this is, dev_queue_xmit() dev_pick_tx() skb_tx_hash() => hash using real_num_tx_queues skb_set_queue_mapping() ... qdisc_enqueue_root() => enqueue skb on txq from hash ... dev->real_num_tx_queues -= n ... sch_direct_xmit() dev_hard_start_xmit() ndo_start_xmit(skb,dev) => skb queue set with old hash skbs are enqueued on the qdisc with skb->queue_mapping set 0 < queue_mappings < real_num_tx_queues. When the driver decreases real_num_tx_queues skb's may be dequeued from the qdisc with a queue_mapping greater then real_num_tx_queues. This fixes a case in ixgbe where this was occurring with DCB and FCoE. Because the driver is using queue_mapping to map skbs to tx descriptor rings we can potentially map skbs to rings that no longer exist. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: deliver skbs on inactive slaves to exact matchesJohn Fastabend2010-06-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the accelerated receive path for VLAN's will drop packets if the real device is an inactive slave and is not one of the special pkts tested for in skb_bond_should_drop(). This behavior is different then the non-accelerated path and for pkts over a bonded vlan. For example, vlanx -> bond0 -> ethx will be dropped in the vlan path and not delivered to any packet handlers at all. However, bond0 -> vlanx -> ethx and bond0 -> ethx will be delivered to handlers that match the exact dev, because the VLAN path checks the real_dev which is not a slave and netif_recv_skb() doesn't drop frames but only delivers them to exact matches. This patch adds a sk_buff flag which is used for tagging skbs that would previously been dropped and allows the skb to continue to skb_netif_recv(). Here we add logic to check for the deliver_no_wcard flag and if it is set only deliver to handlers that match exactly. This makes both paths above consistent and gives pkt handlers a way to identify skbs that come from inactive slaves. Without this patch in some configurations skbs will be delivered to handlers with exact matches and in others be dropped out right in the vlan path. I have tested the following 4 configurations in failover modes and load balancing modes. # bond0 -> ethx # vlanx -> bond0 -> ethx # bond0 -> vlanx -> ethx # bond0 -> ethx | vlanx -> -- Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Print num_rx_queues imbalance warning only when there are allocated queuesTim Gardner2010-06-09
| | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/591416 There are a number of network drivers (bridge, bonding, etc) that are not yet receive multi-queue enabled and use alloc_netdev(), so don't print a num_rx_queues imbalance warning in that case. Also, only print the warning once for those drivers that _are_ multi-queue enabled. Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
* net: fix conflict between null_or_orig and null_or_bondJohn Fastabend2010-06-02
| | | | | | | | | | | | | | | | | If a skb is received on an inactive bond that does not meet the special cases checked for by skb_bond_should_drop it should only be delivered to exact matches as the comment in netif_receive_skb() says. However because null_or_bond could also be null this is not always true. This patch renames null_or_bond to orig_or_bond and initializes it to orig_dev. This keeps the intent of null_or_bond to pass frames received on VLAN interfaces stacked on bonding interfaces without invalidating the statement for null_or_orig. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2010-05-25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits) drivers/net/usb/asix.c: Fix pointer cast. be2net: Bug fix to avoid disabling bottom half during firmware upgrade. proc_dointvec: write a single value hso: add support for new products Phonet: fix potential use-after-free in pep_sock_close() ath9k: remove VEOL support for ad-hoc ath9k: change beacon allocation to prefer the first beacon slot sock.h: fix kernel-doc warning cls_cgroup: Fix build error when built-in macvlan: do proper cleanup in macvlan_common_newlink() V2 be2net: Bug fix in init code in probe net/dccp: expansion of error code size ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep wireless: fix sta_info.h kernel-doc warnings wireless: fix mac80211.h kernel-doc warnings iwlwifi: testing the wrong variable in iwl_add_bssid_station() ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs() ath9k_htc: dereferencing before check in hif_usb_tx_cb() rt2x00: Fix rt2800usb TX descriptor writing. rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions. ...
| * net-2.6 : V2 - fix dev_get_valid_nameDaniel Lezcano2010-05-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the commit: commit d90310243fd750240755e217c5faa13e24f41536 Author: Octavian Purdila <opurdila@ixiacom.com> Date: Wed Nov 18 02:36:59 2009 +0000 net: device name allocation cleanups introduced a bug when there is a hash collision making impossible to rename a device with eth%d. This bug is very hard to reproduce and appears rarely. The problem is coming from we don't pass a temporary buffer to __dev_alloc_name but 'dev->name' which is modified by the function. A detailed explanation is here: http://marc.info/?l=linux-netdev&m=127417784011987&w=2 Changelog: V2 : replaced strings comparison by pointers comparison Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Reviewed-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix problem in dequeuing from input_pkt_queueTom Herbert2010-05-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix some issues introduced in batch skb dequeuing for input_pkt_queue. The primary issue it that the queue head must be incremented only after a packet has been processed, that is only after __netif_receive_skb has been called. This is needed for the mechanism to prevent OOO packet in RFS. Also when flushing the input_pkt_queue and process_queue, the process queue should be done first to prevent OOO packets. Because the input_pkt_queue has been effectively split into two queues, the calculation of the tail ptr is no longer correct. The correct value would be head+input_pkt_queue->len+process_queue->len. To avoid this calculation we added an explict input_queue_tail in softnet_data. The tail value is simply incremented when queuing to input_pkt_queue. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Expose all network devices in a namespaces in sysfsEric W. Biederman2010-05-21
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit aaf8cdc34ddba08122f02217d9d684e2f9f5d575. Drivers like the ipw2100 call device_create_group when they are initialized and device_remove_group when they are shutdown. Moving them between namespaces deletes their sysfs groups early. In particular the following call chain results. netdev_unregister_kobject -> device_del -> kobject_del -> sysfs_remove_dir With sysfs_remove_dir recursively deleting all of it's subdirectories, and nothing adding them back. Ouch! Therefore we need to call something that ultimate calls sysfs_mv_dir as that sysfs function can move sysfs directories between namespaces without deleting their subdirectories or their contents. Allowing us to avoid placing extra boiler plate into every driver that does something interesting with sysfs. Currently the function that provides that capability is device_rename. That is the code works without nasty side effects as originally written. So remove the misguided fix for moving devices between namespaces. The bug in the kobject layer that inspired it has now been recognized and fixed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* net: add a noref bit on skb dstEric Dumazet2010-05-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use low order bit of skb->_skb_dst to tell dst is not refcounted. Change _skb_dst to _skb_refdst to make sure all uses are catched. skb_dst() returns the dst, regardless of noref bit set or not, but with a lockdep check to make sure a noref dst is not given if current user is not rcu protected. New skb_dst_set_noref() helper to set an notrefcounted dst on a skb. (with lockdep check) skb_dst_drop() drops a reference only if skb dst was refcounted. skb_dst_force() helper is used to force a refcount on dst, when skb is queued and not anymore RCU protected. Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in sock_queue_rcv_skb(), in __nf_queue(). Use skb_dst_force() in dev_requeue_skb(). Note: dst_use_noref() still dirties dst, we might transform it later to do one dirtying per jiffies. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rps: avoid one atomic in enqueue_to_backlogEric Dumazet2010-05-17
| | | | | | | | | | | If CONFIG_SMP=y, then we own a queue spinlock, we can avoid the atomic test_and_set_bit() from napi_schedule_prep(). We now have same number of atomic ops per netif_rx() calls than with pre-RPS kernel. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Consistent skb timestampingEric Dumazet2010-05-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With RPS inclusion, skb timestamping is not consistent in RX path. If netif_receive_skb() is used, its deferred after RPS dispatch. If netif_rx() is used, its done before RPS dispatch. This can give strange tcpdump timestamps results. I think timestamping should be done as soon as possible in the receive path, to get meaningful values (ie timestamps taken at the time packet was delivered by NIC driver to our stack), even if NAPI already can defer timestamping a bit (RPS can help to reduce the gap) Tom Herbert prefer to sample timestamps after RPS dispatch. In case sampling is expensive (HPET/acpi_pm on x86), this makes sense. Let admins switch from one mode to another, using a new sysctl, /proc/sys/net/core/netdev_tstamp_prequeue Its default value (1), means timestamps are taken as soon as possible, before backlog queueing, giving accurate timestamps. Setting a 0 value permits to sample timestamps when processing backlog, after RPS dispatch, to lower the load of the pre-RPS cpu. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: adjust handle_macvlan to pass port struct to hookJiri Pirko2010-05-16
| | | | | | | | | | Now there's null check here and also again in the hook. Looking at bridge bits which are simmilar, port structure is rcu_dereferenced right away in handle_bridge and passed to hook. Looks nicer. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2010-05-12
|\ | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/wireless/ath/ar9170/usb.c drivers/scsi/iscsi_tcp.c net/ipv4/ipmr.c
| * veth: Dont kfree_skb() after dev_forward_skb()Eric Dumazet2010-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of congestion, netif_rx() frees the skb, so we must assume dev_forward_skb() also consume skb. Bug introduced by commit 445409602c092 (veth: move loopback logic to common location) We must change dev_forward_skb() to always consume skb, and veth to not double free it. Bug report : http://marc.info/?l=linux-netdev&m=127310770900442&w=3 Reported-by: Martín Ferrari <martin.ferrari@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: Various optimizationsEric Dumazet2010-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce ____napi_schedule() helper for callers in irq disabled contexts. rps_trigger_softirq() becomes a leaf function. Use container_of() in process_backlog() instead of accessing per_cpu address. Use a custom inlined version of __napi_complete() in process_backlog() to avoid one locked instruction : only current cpu owns and manipulates this napi, and NAPI_STATE_SCHED is the only possible flag set on backlog. we can use a plain write instead of clear_bit(), and we dont need an smp_mb() memory barrier, since RPS is on, backlog is protected by a spinlock. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix softnet_statChangli Gao2010-05-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context without any protection. And enqueue_to_backlog should update the netdev_rx_stat of the target CPU. This patch renames softnet_data.total to softnet_data.processed: the number of packets processed in uppper levels(IP stacks). softnet_stat data is moved into softnet_data. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/linux/netdevice.h | 17 +++++++---------- net/core/dev.c | 26 ++++++++++++-------------- net/sched/sch_generic.c | 2 +- 3 files changed, 20 insertions(+), 25 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: batch skb dequeueing from softnet input_pkt_queueChangli Gao2010-04-27
| | | | | | | | | | | | | | | | | | | | | | | | batch skb dequeueing from softnet input_pkt_queue to reduce potential lock contention when RPS is enabled. Note: in the worst case, the number of packets in a softnet_data may be double of netdev_max_backlog. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: reimplement softnet_data.output_queue as a FIFO queueChangli Gao2010-04-27
| | | | | | | | | | | | | | | | | | | | | | | | | | reimplement softnet_data.output_queue as a FIFO queue to keep the fairness among the qdiscs rescheduled. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ---- include/linux/netdevice.h | 1 + net/core/dev.c | 22 ++++++++++++---------- 2 files changed, 13 insertions(+), 10 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: optimize rps_get_cpu()Changli Gao2010-04-25
| | | | | | | | | | | | | | | | | | | | optimize rps_get_cpu(). don't initialize ports when we can get the ports. one memory access for ports than two. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Orphan and de-dst skbs earlier in xmit path.David S. Miller2010-04-22
| | | | | | | | | | | | | | | | | | This way GSO packets don't get handled differently. With help from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
* | rps: immediate send IPI in process_backlog()Eric Dumazet2010-04-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If some skb are queued to our backlog, we are delaying IPI sending at the end of net_rx_action(), increasing latencies. This defeats the queueing, since we want to quickly dispatch packets to the pool of worker cpus, then eventually deeply process our packets. It's better to send IPI before processing our packets in upper layers, from process_backlog(). Change the _and_disable_irq suffix to _and_enable_irq(), since we enable local irq in net_rps_action(), sorry for the confusion. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Introduce skb_orphan_try()Eric Dumazet2010-04-22
| | | | | | | | | | | | | | | | At this point, skb->destructor is not the original one (stored in DEV_GSO_CB(skb)->destructor) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2010-04-21
|\| | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-6000.c net/core/dev.c
| * net: Fix an RCU warning in dev_pick_tx()David Howells2010-04-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following RCU warning in dev_pick_tx(): =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- net/core/dev.c:1993 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by swapper/0: #0: (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278 #1: (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc stack backtrace: Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4 Call Trace: <IRQ> [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2 [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1 [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4 [<ffffffff81350c34>] ip6_output2+0x256/0x261 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d [<ffffffff81368acb>] mld_sendpack+0x273/0x39d [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d [<ffffffff81052099>] ? mark_held_locks+0x52/0x70 [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288 [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278 [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278 [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288 [<ffffffff81035531>] ? __do_softirq+0x69/0x140 [<ffffffff8103556a>] __do_softirq+0xa2/0x140 [<ffffffff81002e0c>] call_softirq+0x1c/0x28 [<ffffffff81004b54>] do_softirq+0x38/0x80 [<ffffffff81034f06>] irq_exit+0x45/0x47 [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96 [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86 [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78 [<ffffffff810096b6>] ? mwait_idle+0x65/0x78 [<ffffffff810011cb>] cpu_idle+0x4d/0x83 [<ffffffff81380b05>] rest_init+0xb9/0xc0 [<ffffffff81380a4c>] ? rest_init+0x0/0xc0 [<ffffffff8168dcf0>] start_kernel+0x392/0x39d [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7 [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb An rcu_dereference() should be an rcu_dereference_bh(). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: dev_pick_tx() fixEric Dumazet2010-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When dev_pick_tx() caches tx queue_index on a socket, we must check socket dst_entry matches skb one, or risk a crash later, as reported by Denys Fedorysychenko, if old packets are in flight during a route change, involving devices with different number of queues. Bug introduced by commit a4ee3ce3 (net: Use sk_tx_queue_mapping for connected sockets) Reported-by: Denys Fedorysychenko <nuclearcat@nuclearcat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: emphasize rtnl lock required in call_netdevice_notifiersJiri Pirko2010-04-20
| | | | | | | | | | | | | | | | | | Since netdev_chain is guarded by rtnl_lock, ASSERT_RTNL should be present here to make sure that all callers of call_netdevice_notifiers does the locking properly. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: consistent rxhashEric Dumazet2010-04-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case we compute a software skb->rxhash, we can generate a consistent hash : Its value will be the same in both flow directions. This helps some workloads, like conntracking, since the same state needs to be accessed in both directions. tbench + RFS + this patch gives better results than tbench with default kernel configuration (no RPS, no RFS) Also fixed some sparse warnings. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: cleanupsEric Dumazet2010-04-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct softnet_data holds many queues, so consistent use "sd" name instead of "queue" is better. Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog() Adds a _and_irq_disable suffix to net_rps_action() name, as David suggested. incr_input_queue_head() becomes input_queue_head_incr() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: shortcut net_rps_action()Eric Dumazet2010-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if RPS is not active. Tom Herbert used two bitmasks to hold information needed to send IPI, but a single LIFO list seems more appropriate. Move all RPS logic into net_rps_action() to cleanup net_rx_action() code (remove two ifdefs) Move rps_remote_softirq_cpus into softnet_data to share its first cache line, filling an existing hole. In a future patch, we could call net_rps_action() from process_backlog() to make sure we send IPI before handling this cpu backlog. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Introduce skb_orphan_try()Eric Dumazet2010-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Transmitted skb might be attached to a socket and a destructor, for memory accounting purposes. Traditionally, this destructor is called at tx completion time, when skb is freed. When tx completion is performed by another cpu than the sender, this forces some cache lines to change ownership. XPS was an attempt to give tx completion to initial cpu. David idea is to call destructor right before giving skb to device (call to ndo_start_xmit()). Because device queues are usually small, orphaning skb before tx completion is not a big deal. Some drivers already do this, we could do it in upper level. There is one known exception to this early orphaning, called tx timestamping. It needs to keep a reference to socket until device can give a hardware or software timestamp. This patch adds a skb_orphan_try() helper, to centralize all exceptions to early orphaning in one spot, and use it in dev_hard_start_xmit(). "tbench 16" results on a Nehalem machine (2 X5570 @ 2.93GHz) before: Throughput 4428.9 MB/sec 16 procs after: Throughput 4448.14 MB/sec 16 procs UDP should get even better results, its destructor being more complex, since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove time limit in process_backlog()Eric Dumazet2010-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - There is no point to enforce a time limit in process_backlog(), since other napi instances dont follow same rule. We can exit after only one packet processed... The normal quota of 64 packets per napi instance should be the norm, and net_rx_action() already has its own time limit. Note : /proc/net/core/dev_weight can be used to tune this 64 default value. - Use DEFINE_PER_CPU_ALIGNED for softnet_data definition. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: rps_sock_flow_table is mostly readEric Dumazet2010-04-17
| | | | | | | | | | Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rfs: Receive Flow SteeringTom Herbert2010-04-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS). The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash which is stored in the socket structure. The rxhash is passed in skb's received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table, if a valid CPU is set then the packet is steered to that CPU using the RPS mechanisms. The convolution of the simple approach is that it would potentially allow OOO packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets-- we consider this a non-starter. To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table. rps_sock_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows. rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry. Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is enqueued on a backlog queue, the current value of the queue tail counter is saved in the hash entry of the rps_dev_flow_table. And now the trick: when selecting the CPU for RPS (get_rps_cpu) the rps_sock_flow table and the rps_dev_flow table for the RX queue are consulted. When the desired CPU for the flow (found in the rps_sock_flow table) does not match the current CPU (found in the rps_dev_flow table), the current CPU is changed to the desired CPU if one of the following is true: - The current CPU is unset (equal to RPS_NO_CPU) - Current CPU is offline - The current CPU's queue head counter >= queue tail counter in the rps_dev_flow table. This checks if the queue tail has advanced beyond the last packet that was enqueued using this table entry. This guarantees that all packets queued using this entry have been dequeued, thus preserving in order delivery. Making each queue have its own rps_dev_flow table has two advantages: 1) the tail queue counters will be written on each receive, so keeping the table local to interrupting CPU s good for locality. 2) this allows lockless access to the table-- the CPU number and queue tail counter need to be accessed together under mutual exclusion from netif_receive_skb, we assume that this is only called from device napi_poll which is non-reentrant. This patch implements RFS for TCP and connected UDP sockets. It should be usable for other flow oriented protocols. There are two configuration parameters for RFS. The "rps_flow_entries" kernel init parameter sets the number of entries in the rps_sock_flow_table, the per rxqueue sysfs entry "rps_flow_cnt" contains the number of entries in the rps_dev_flow table for the rxqueue. Both are rounded to power of two. The obvious benefit of RFS (over just RPS) is that it achieves CPU locality between the receive processing for a flow and the applications processing; this can result in increased performance (higher pps, lower latency). The benefits of RFS are dependent on cache hierarchy, application load, and other factors. On simple benchmarks, we don't necessarily see improvement and sometimes see degradation. However, for more complex benchmarks and for applications where cache pressure is much higher this technique seems to perform very well. Below are some benchmark results which show the potential benfit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. The RPC test is an request/response test similar in structure to netperf RR test ith 100 threads on each host, but does more work in userspace that netperf. e1000e on 8 core Intel No RFS or RPS 104K tps at 30% CPU No RFS (best RPS config): 290K tps at 63% CPU RFS 303K tps at 61% CPU RPC test tps CPU% 50/90/99% usec latency Latency StdDev No RFS/RPS 103K 48% 757/900/3185 4472.35 RPS only: 174K 73% 415/993/2468 491.66 RFS 223K 73% 379/651/1382 315.61 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: netif_rx() must disable preemptionEric Dumazet2010-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric Paris reported netif_rx() is calling smp_processor_id() from preemptible context, in particular when caller is ip_dev_loopback_xmit(). RPS commit added this smp_processor_id() call, this patch makes sure preemption is disabled. rps_get_cpus() wants rcu_read_lock() anyway, we can dot it a bit earlier. Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: uninline skb_bond_should_drop()Eric Dumazet2010-04-13
| | | | | | | | | | | | | | | | | | | | skb_bond_should_drop() is too big to be inlined. This patch reduces kernel text size, and its compilation time as well (shrinking include/linux/netdevice.h) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sk_dst_cache RCUificationEric Dumazet2010-04-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this work. sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock) This rwlock is readlocked for a very small amount of time, and dst entries are already freed after RCU grace period. This calls for RCU again :) This patch converts sk_dst_lock to a spinlock, and use RCU for readers. __sk_dst_get() is supposed to be called with rcu_read_lock() or if socket locked by user, so use appropriate rcu_dereference_check() condition (rcu_read_lock_held() || sock_owned_by_user(sk)) This patch avoids two atomic ops per tx packet on UDP connected sockets, for example, and permits sk_dst_lock to be much less dirtied. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Dont use netdev_warn()Eric Dumazet2010-04-13
| | | | | | | | | | | | | | | | | | | | | | Dont use netdev_warn() in dev_cap_txqueue() and get_rps_cpu() so that we can catch following warnings without crash. bond0.2240 received packet on queue 6, but number of RX queues is 1 bond0.2240 received packet on queue 11, but number of RX queues is 1 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2010-04-11
|\| | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/stmmac/stmmac_main.c drivers/net/wireless/wl12xx/wl1271_cmd.c drivers/net/wireless/wl12xx/wl1271_main.c drivers/net/wireless/wl12xx/wl1271_spi.c net/core/ethtool.c net/mac80211/scan.c
| * include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* | net: Add a missing local_irq_enable()Eric Dumazet2010-04-05
| | | | | | | | | | | | | | | | As noticed by Changli Gao, we must call local_irq_enable() after rps_unlock() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: fixed missed rps_unlockTom Herbert2010-04-05
| | | | | | | | | | | | | | | | Fix spin_unlock_irq which needs to be rps_unlock. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: convert multicast list to list_headJiri Pirko2010-04-03
| | | | | | | | | | | | | | | | | | | | | | | | | | Converts the list and the core manipulating with it to be the same as uc_list. +uses two functions for adding/removing mc address (normal and "global" variant) instead of a function parameter. +removes dev_mcast.c completely. +exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for manipulation with lists on a sandbox (used in bonding and 80211 drivers) Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: move address list functions to a separate fileJiri Pirko2010-04-03
| | | | | | | | | | | | | | +little renaming of unicast functions to be smooth with multicast ones Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: illegal_highdma() fixEric Dumazet2010-04-02
| | | | | | | | | | | | | | | | | | | | | | | | Followup to commit 5acbbd428db47b12f137a8a2aa96b3c0a96b744e (net: change illegal_highdma to use dma_mask) If dev->dev.parent is NULL, we should not try to dereference it. Dont force inline illegal_highdma() as its pretty big now. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: change illegal_highdma to use dma_maskFUJITA Tomonori2010-04-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Robert Hancock pointed out two problems about NETIF_F_HIGHDMA: -Many drivers only set the flag when they detect they can use 64-bit DMA, since otherwise they could receive DMA addresses that they can't handle (which on platforms without IOMMU/SWIOTLB support is fatal). This means that if 64-bit support isn't available, even buffers located below 4GB will get copied unnecessarily. -Some drivers set the flag even though they can't actually handle 64-bit DMA, which would mean that on platforms without IOMMU/SWIOTLB they would get a DMA mapping error if the memory they received happened to be located above 4GB. http://lkml.org/lkml/2010/3/3/530 We can use the dma_mask if we need bouncing or not here. Then we can safely fix drivers that misuse NETIF_F_HIGHDMA. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: keep the old behavior on SMP without rpsChangli Gao2010-04-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | keep the old behavior on SMP without rps RPS introduces a lock operation to per cpu variable input_pkt_queue on SMP whenever rps is enabled or not. On SMP without RPS, this lock isn't needed at all. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- net/core/dev.c | 42 ++++++++++++++++++++++++++++-------------- 1 file changed, 28 insertions(+), 14 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: __netif_receive_skb should be staticEric Dumazet2010-03-29
| | | | | | | | | | Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: add CONFIG_RPSEric Dumazet2010-03-25
| | | | | | | | | | | | | | | | | | | | | | RPS currently depends on SMP and SYSFS Adding a CONFIG_RPS makes sense in case this requirement changes in the future. This patch saves about 1500 bytes of kernel text in case SMP is on but SYSFS is off. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>