aboutsummaryrefslogtreecommitdiffstats
path: root/net/sched
Commit message (Collapse)AuthorAge
* net_sched: fix dequeuer fairnessjamal2011-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Results on dummy device can be seen in my netconf 2011 slides. These results are for a 10Gige IXGBE intel nic - on another i5 machine, very similar specs to the one used in the netconf2011 results. It turns out - this is a hell lot worse than dummy and so this patch is even more beneficial for 10G. Test setup: ---------- System under test sending packets out. Additional box connected directly dropping packets. Installed prio qdisc on the eth device and default netdev default length of 1000 used as is. The 3 prio bands each were set to 100 (didnt factor in the results). 5 packet runs were made and the middle 3 picked. results ------- The "cpu" column indicates the which cpu the sample was taken on, The "Pkt runx" carries the number of packets a cpu dequeued when forced to be in the "dequeuer" role. The "avg" for each run is the number of times each cpu should be a "dequeuer" if the system was fair. 3.0-rc4 (plain) cpu Pkt run1 Pkt run2 Pkt run3 ================================================ cpu0 21853354 21598183 22199900 cpu1 431058 473476 393159 cpu2 481975 477529 458466 cpu3 23261406 23412299 22894315 avg 11506948 11490372 11486460 3.0-rc4 with patch and default weight 64 cpu Pkt run1 Pkt run2 Pkt run3 ================================================ cpu0 13205312 13109359 13132333 cpu1 10189914 10159127 10122270 cpu2 10213871 10124367 10168722 cpu3 13165760 13164767 13096705 avg 11693714 11639405 11630008 As you can see the system is still not perfect but is a lot better than what it was before... At the moment we use the old backlog weight, weight_p which is 64 packets. It seems to be reasonably fine with that value. The system could be made more fair if we reduce the weight_p (as per my presentation), but we are going to affect the shared backlog weight. Unless deemed necessary, I think the default value is fine. If not we could add yet another knob. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: introduce ip_is_fragment helper inline functionPaul Gortmaker2011-06-21
| | | | | | | | | | | There are enough instances of this: iph->frag_off & htons(IP_MF | IP_OFFSET) that a helper function is probably warranted. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: remove mm.h inclusion from netdevice.hAlexey Dobriyan2011-06-21
| | | | | | | | | | | | | | | | | Remove linux/mm.h inclusion from netdevice.h -- it's unused (I've checked manually). To prevent mm.h inclusion via other channels also extract "enum dma_data_direction" definition into separate header. This tiny piece is what gluing netdevice.h with mm.h via "netdevice.h => dmaengine.h => dma-mapping.h => scatterlist.h => mm.h". Removal of mm.h from scatterlist.h was tried and was found not feasible on most archs, so the link was cutoff earlier. Hope people are OK with tiny include file. Note, that mm_types.h is still dragged in, but it is a separate story. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2011-06-21
|\ | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rxon.c drivers/net/wireless/rtlwifi/pci.c net/netfilter/ipvs/ip_vs_core.c
| * net: Rework netdev_drivername() to avoid warning.David S. Miller2011-06-06
| | | | | | | | | | | | | | | | | | | | | | | | | | This interface uses a temporary buffer, but for no real reason. And now can generate warnings like: net/sched/sch_generic.c: In function dev_watchdog net/sched/sch_generic.c:254:10: warning: unused variable drivername Just return driver->name directly or "". Reported-by: Connor Hansen <cmdkhh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: Compute and store minimum ifinfo dump sizeGreg Rose2011-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The message size allocated for rtnl ifinfo dumps was limited to a single page. This is not enough for additional interface info available with devices that support SR-IOV and caused a bug in which VF info would not be displayed if more than approximately 40 VFs were created per interface. Implement a new function pointer for the rtnl_register service that will calculate the amount of data required for the ifinfo dump and allocate enough data to satisfy the request. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* | net: remove interrupt.h inclusion from netdevice.hAlexey Dobriyan2011-06-07
|/ | | | | | | | * remove interrupt.g inclusion from netdevice.h -- not needed * fixup fallout, add interrupt.h and hardirq.h back where needed. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sch_sfq: fix peek() implementationEric Dumazet2011-05-25
| | | | | | | | | | | | | | | Since commit eeaeb068f139 (sch_sfq: allow big packets and be fair), sfq_peek() can return a different skb that would be normally dequeued by sfq_dequeue() [ if current slot->allot is negative ] Use generic qdisc_peek_dequeued() instead of custom implementation, to get consistent result. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
* sch_sfq: avoid giving spurious NET_XMIT_CN signalsEric Dumazet2011-05-23
| | | | | | | | | | | | | | | | | | | | | | While chasing a possible net_sched bug, I found that IP fragments have litle chance to pass a congestioned SFQ qdisc : - Say SFQ qdisc is full because one flow is non responsive. - ip_fragment() wants to send two fragments belonging to an idle flow. - sfq_enqueue() queues first packet, but see queue limit reached : - sfq_enqueue() drops one packet from 'big consumer', and returns NET_XMIT_CN. - ip_fragment() cancel remaining fragments. This patch restores fairness, making sure we return NET_XMIT_CN only if we dropped a packet from the same flow. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: avoid synchronize_rcu() in dev_deactivate_manyEric Dumazet2011-05-22
| | | | | | | | | | | | | | | | | | | | | dev_deactivate_many() issues one synchronize_rcu() call after qdiscs set to noop_qdisc. This call is here to make sure they are no outstanding qdisc-less dev_queue_xmit calls before returning to caller. But in dismantle phase, we dont have to wait, because we wont activate again the device, and we are going to wait one rcu grace period later in rollback_registered_many(). After this patch, device dismantle uses one synchronize_net() and one rcu_barrier() call only, so we have a ~30% speedup and a smaller RTNL latency. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net>, CC: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds2011-05-20
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits) macvlan: fix panic if lowerdev in a bond tg3: Add braces around 5906 workaround. tg3: Fix NETIF_F_LOOPBACK error macvlan: remove one synchronize_rcu() call networking: NET_CLS_ROUTE4 depends on INET irda: Fix error propagation in ircomm_lmp_connect_response() irda: Kill set but unused variable 'bytes' in irlan_check_command_param() irda: Kill set but unused variable 'clen' in ircomm_connect_indication() rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport() be2net: Kill set but unused variable 'req' in lancer_fw_download() irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication() atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined. rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer(). rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler() rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection() rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window() pkt_sched: Kill set but unused variable 'protocol' in tc_classify() isdn: capi: Use pr_debug() instead of ifdefs. tg3: Update version to 3.119 tg3: Apply rx_discards fix to 5719/5720 ... Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c as per Davem.
| * networking: NET_CLS_ROUTE4 depends on INETRandy Dunlap2011-05-19
| | | | | | | | | | | | | | | | | | | | | | IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning is produced, so fix it by making NET_CLS_ROUTE4 depend on INET. warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * pkt_sched: Kill set but unused variable 'protocol' in tc_classify()David S. Miller2011-05-19
| | | | | | | | | | | | | | I checked the history and this has been like this since the beginning of time. Signed-off-by: David S. Miller <davem@davemloft.net>
| * inet: constify ip headers and in6_addrEric Dumazet2011-04-22
| | | | | | | | | | | | | | | | Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'master' of ↵David S. Miller2011-04-11
| |\ | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/smsc911x.c
| * | pkt_sched: QFQ - quick fair queue schedulerstephen hemminger2011-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an implementation of the Quick Fair Queue scheduler developed by Fabio Checconi. The same algorithm is already implemented in ipfw in FreeBSD. Fabio had an earlier version developed on Linux, I just cleaned it up. Thanks to Eric Dumazet for testing this under load. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net,act_police,rcu: remove rcu_barrier()Lai Jiangshan2011-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no callback of this module maybe queued since we use kfree_rcu(), we can safely remove the rcu_barrier(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | | net,rcu: convert call_rcu(tcf_police_free_rcu) to kfree_rcu()Lai Jiangshan2011-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [PATCH 05/17] net,rcu: convert call_rcu(tcf_police_free_rcu) to kfree_rcu() The rcu callback tcf_police_free_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(tcf_police_free_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | | net,rcu: convert call_rcu(tcf_common_free_rcu) to kfree_rcu()Lai Jiangshan2011-05-08
| |/ |/| | | | | | | | | | | | | | | | | The rcu callback tcf_common_free_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(tcf_common_free_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | Fix common misspellingsLucas De Marchi2011-03-31
|/ | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* ipv4: Remove flowi from struct rtable.David S. Miller2011-03-05
| | | | | | | | | | The only necessary parts are the src/dst addresses, the interface indexes, the TOS, and the mark. The rest is unnecessary bloat, which amounts to nearly 50 bytes on 64-bit. Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2011-03-04
|\ | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x.h
| * net: Fix more stale on-stack list_head objects.Eric W. Biederman2011-02-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: Eric W. Biederman <ebiederm@xmission.com> In the beginning with batching unreg_list was a list that was used only once in the lifetime of a network device (I think). Now we have calls using the unreg_list that can happen multiple times in the life of a network device like dev_deactivate and dev_close that are also using the unreg_list. In addition in unregister_netdevice_queue we also do a list_move because for devices like veth pairs it is possible that unregister_netdevice_queue will be called multiple times. So I think the change below to fix dev_deactivate which Eric D. missed will fix this problem. Now to go test that. Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: reduce fifo qdisc sizeEric Dumazet2011-03-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Because of various alignements [SLUB / qdisc], we use 512 bytes of memory for one {p|b}fifo qdisc, instead of 256 bytes on 64bit arches and 192 bytes on 32bit ones. Move the "u32 limit" inside "struct Qdisc" (no impact on other qdiscs) Change qdisc_alloc(), first trying a regular allocation before an oversized one. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sched: protocol only needed when CONFIG_NET_CLS_ACT is enabledHagen Paul Pfeifer2011-02-25
| | | | | | | | | | Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_netem: Need to include vmalloc.hDavid S. Miller2011-02-25
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_choke: add choke_skb_cbEric Dumazet2011-02-25
| | | | | | | | | | | | | | | | | | | | | | | | Better document choke skb->cb[] use, like we did in netem and sfb This adds a compile time check to make sure we dont exhaust skb->cb[] space. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netem: update version and cleanupstephen hemminger2011-02-25
| | | | | | | | | | | | | | | | Get rid of debug message that are not useful, and enable the log messages in case of error. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netem: revised correlated loss generatorstephen hemminger2011-02-25
| | | | | | | | | | | | | | | | | | | | | | This is a patch originated with Stefano Salsano and Fabio Ludovici. It provides several alternative loss models for use with netem. This patch adds two state machine based loss models. See: http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Revert "sch_netem: Remove classful functionality"stephen hemminger2011-02-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Many users have wanted the old functionality that was lost to be able to use pfifo as inner qdisc for netem. The reason that netem could not be classful with the older API was because of the limitations of the old dequeue/requeue interface; now that qdisc API has a peek function, there is no longer a problem with using any inner qdisc's. This reverts commit 02201464119334690fe209849843881b8e9cfa9f. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netem: define NETEM_DIST_MAXstephen hemminger2011-02-25
| | | | | | | | | | | | | | | | | | Rather than magic constant in code, expose the maximum size of packet distribution table in API. In iproute2, q_netem defines MAX_DIST as 16K already. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netem: use vmalloc for distribution tablestephen hemminger2011-02-25
| | | | | | | | | | | | | | | | The netem probability table can be large (up to 64K bytes) which may be too large to allocate in one contiguous chunk. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netem: cleanup dump codestephen hemminger2011-02-25
| | | | | | | | | | | | | | Use nla_put_nested to update netlink attribute value. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | em_meta: fix sparse warningstephen hemminger2011-02-23
| | | | | | | | | | | | | | gfp_t needs to be cast to integer. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mqprio: cleanupsstephen hemminger2011-02-23
| | | | | | | | | | | | | | | | | | * make qdisc_ops local * add sparse annotation about expected unlock/unlock in dump_class_stats * fix indentation Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: SFB flow schedulerEric Dumazet2011-02-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the Stochastic Fair Blue scheduler, based on work from : W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue Management Algorithms. U. Michigan CSE-TR-387-99, April 1999. http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf This implementation is based on work done by Juliusz Chroboczek General SFB algorithm can be found in figure 14, page 15: B[l][n] : L x N array of bins (L levels, N bins per level) enqueue() Calculate hash function values h{0}, h{1}, .. h{L-1} Update bins at each level for i = 0 to L - 1 if (B[i][h{i}].qlen > bin_size) B[i][h{i}].p_mark += p_increment; else if (B[i][h{i}].qlen == 0) B[i][h{i}].p_mark -= p_decrement; p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark); if (p_min == 1.0) ratelimit(); else mark/drop with probabilty p_min; I did the adaptation of Juliusz code to meet current kernel standards, and various changes to address previous comments : http://thread.gmane.org/gmane.linux.network/90225 http://thread.gmane.org/gmane.linux.network/90375 Default flow classifier is the rxhash introduced by RPS in 2.6.35, but we can use an external flow classifier if wanted. tc qdisc add dev $DEV parent 1:11 handle 11: \ est 0.5sec 2sec sfb limit 128 tc filter add dev $DEV protocol ip parent 11: handle 3 \ flow hash keys dst divisor 1024 Notes: 1) SFB default child qdisc is pfifo_fast. It can be changed by another qdisc but a child qdisc MUST not drop a packet previously queued. This is because SFB needs to handle a dequeued packet in order to maintain its virtual queue states. pfifo_head_drop or CHOKe should not be used. 2) ECN is enabled by default, unlike RED/CHOKe/GRED With help from Patrick McHardy & Andi Kleen Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Juliusz Chroboczek <Juliusz.Chroboczek@pps.jussieu.fr> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Patrick McHardy <kaber@trash.net> CC: Andi Kleen <andi@firstfloor.org> CC: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cls_u32: fix sparse warningsstephen hemminger2011-02-22
| | | | | | | | | | | | | | | | The variable _data is used in asm-generic to define sections which causes sparse warnings, so just rename the variable. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_mqprio: Always set num_tc to 0 in mqprio_destroy()Ben Hutchings2011-02-14
| | | | | | | | | | | | | | | | | | All the cleanup code in mqprio_destroy() is currently conditional on priv->qdiscs being non-null, but that condition should only apply to the per-queue qdisc cleanup. We should always set the number of traffic classes back to 0 here. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
* | sch_choke: Need linux/vmalloc.hDavid S. Miller2011-02-03
| | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* | sched: CHOKe flow schedulerstephen hemminger2011-02-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative packet scheduler based on the Random Exponential Drop (RED) algorithm. The core idea is: For every packet arrival: Calculate Qave if (Qave < minth) Queue the new packet else Select randomly a packet from the queue if (both packets from same flow) then Drop both the packets else if (Qave > maxth) Drop packet else Admit packet with proability p (same as RED) See also: Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active queue management scheme for approximating fair bandwidth allocation", Proceeding of INFOCOM'2000, March 2000. Help from: Eric Dumazet <eric.dumazet@gmail.com> Patrick McHardy <kaber@trash.net> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sfq: deadlock in error pathstephen hemminger2011-02-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The change to allow divisor to be a parameter (in 2.6.38-rc1) commit 817fb15dfd988d8dda916ee04fa506f0c466b9d6 introduced a possible deadlock caught by sparse. The scheduler tree lock was left locked in the case of an incorrect divisor value. Simplest fix is to move test outside of lock which also solves problem of partial update. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: sch_mqprio: dont leak kernel memoryEric Dumazet2011-01-26
| | | | | | | | | | | | | | | | | | mqprio_dump() should make sure all fields of struct tc_mqprio_qopt are initialized. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2011-01-24
|\| | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/sched/sch_hfsc.c net/sched/sch_htb.c net/sched/sch_tbf.c
| * Merge branch 'master' of ↵David S. Miller2011-01-24
| |\ | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
| * | net_sched: accurate bytes/packets stats/ratesEric Dumazet2011-01-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 44b8288308ac9d (net_sched: pfifo_head_drop problem), we fixed a problem with pfifo_head drops that incorrectly decreased sch->bstats.bytes and sch->bstats.packets Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a previously enqueued packet, and bstats cannot be changed, so bstats/rates are not accurate (over estimated) This patch changes the qdisc_bstats updates to be done at dequeue() time instead of enqueue() time. bstats counters no longer account for dropped frames, and rates are more correct, since enqueue() bursts dont have effect on dequeue() rate. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net_sched: TCQ_F_CAN_BYPASS generalizationEric Dumazet2011-01-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now qdisc stab is handled before TCQ_F_CAN_BYPASS test in __dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to other qdiscs than pfifo_fast : pfifo, bfifo, pfifo_head_drop and sfq SFQ is special because it can have external classifiers, and in these cases, we cannot bypass queue discipline (packet could be dropped by classifier) without admin asking it, or further changes. Its worth doing this, especially for SFQ, avoiding dirtying memory in case no packets are already waiting in queue. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net_sched: RCU conversion of stabEric Dumazet2011-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch converts stab qdisc management to RCU, so that we can perform the qdisc_calculate_pkt_len() call before getting qdisc lock. This shortens the lock's held time in __dev_xmit_skb(). This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding lot of cache misses and so reducing latencies. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net_sched: move TCQ_F_THROTTLED flagEric Dumazet2011-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 371121057607e (net: QDISC_STATE_RUNNING dont need atomic bit ops) I moved QDISC_STATE_RUNNING flag to __state container, located in the cache line containing qdisc lock and often dirtied fields. I now move TCQ_F_THROTTLED bit too, so that we let first cache line read mostly, and shared by all cpus. This should speedup HTB/CBQ for example. Not using test_bit()/__clear_bit()/__test_and_set_bit allows to use an "unsigned int" for __state container, reducing by 8 bytes Qdisc size. Introduce helpers to hide implementation details. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net_sched: sfq: allow divisor to be a parameterEric Dumazet2011-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SFQ currently uses a 1024 slots hash table, and its internal structure (sfq_sched_data) allocation needs order-1 page on x86_64 Allow tc command to specify a divisor value (hash table size), between 1 and 65536. If no value is provided, assume the 1024 default size. This allows admins to setup smaller (or bigger) SFQ for specific needs. This also brings back sfq_sched_data allocations to order-0 ones, saving 3KB per SFQ qdisc. Jesper uses ~55.000 SFQ in one machine, this patch should free 165 MB of memory. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'master' of ↵David S. Miller2011-01-20
|\ \ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6