aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux
Commit message (Collapse)AuthorAge
* page allocator: do not check NUMA node ID when the caller knows the node is ↵Mel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | valid Callers of alloc_pages_node() can optionally specify -1 as a node to mean "allocate from the current node". However, a number of the callers in fast paths know for a fact their node is valid. To avoid a comparison and branch, this patch adds alloc_pages_exact_node() that only checks the nid with VM_BUG_ON(). Callers that know their node is valid are then converted. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: do not sanity check order in the fast pathMel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | No user of the allocator API should be passing in an order >= MAX_ORDER but we check for it on each and every allocation. Delete this check and make it a VM_BUG_ON check further down the call path. [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask()Mel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The start of a large patch series to clean up and optimise the page allocator. The performance improvements are in a wide range depending on the exact machine but the results I've seen so fair are approximately; kernbench: 0 to 0.12% (elapsed time) 0.49% to 3.20% (sys time) aim9: -4% to 30% (for page_test and brk_test) tbench: -1% to 4% hackbench: -2.5% to 3.45% (mostly within the noise though) netperf-udp -1.34% to 4.06% (varies between machines a bit) netperf-tcp -0.44% to 5.22% (varies between machines a bit) I haven't sysbench figures at hand, but previously they were within the -0.5% to 2% range. On netperf, the client and server were bound to opposite number CPUs to maximise the problems with cache line bouncing of the struct pages so I expect different people to report different results for netperf depending on their exact machine and how they ran the test (different machines, same cpus client/server, shared cache but two threads client/server, different socket client/server etc). I also measured the vmlinux sizes for a single x86-based config with CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config is based on the Debian Lenny kernel config so I expect it to be reasonably typical. This patch: __alloc_pages_internal is the core page allocator function but essentially it is an alias of __alloc_pages_nodemask. Naming a publicly available and exported function "internal" is also a big ugly. This patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old nodemask function. Warning - This patch renames an exported symbol. No kernel driver is affected by external drivers calling __alloc_pages_internal() should change the call to __alloc_pages_nodemask() without any alteration of parameters. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset,mm: update tasks' mems_allowed in timeMiao Xie2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix allocating page cache/slab object on the unallowed node when memory spread is set by updating tasks' mems_allowed after its cpuset's mems is changed. In order to update tasks' mems_allowed in time, we must modify the code of memory policy. Because the memory policy is applied in the process's context originally. After applying this patch, one task directly manipulates anothers mems_allowed, and we use alloc_lock in the task_struct to protect mems_allowed and memory policy of the task. But in the fast path, we didn't use lock to protect them, because adding a lock may lead to performance regression. But if we don't add a lock,the task might see no nodes when changing cpuset's mems_allowed to some non-overlapping set. In order to avoid it, we set all new allowed nodes, then clear newly disallowed ones. [lee.schermerhorn@hp.com: The rework of mpol_new() to extract the adjusting of the node mask to apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind() with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local allocation. Fix this by adding the check for MPOL_PREFERRED and empty node mask to mpol_new_mpolicy(). Remove the now unneeded 'nodes = NULL' from mpol_new(). Note that mpol_new_mempolicy() is always called with a non-NULL 'nodes' parameter now that it has been removed from mpol_new(). Therefore, we don't need to test nodes for NULL before testing it for 'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to verify this assumption.] [lee.schermerhorn@hp.com: I don't think the function name 'mpol_new_mempolicy' is descriptive enough to differentiate it from mpol_new(). This function applies cpuset set context, usually constraining nodes to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag is set, it also translates the nodes. So I settled on 'mpol_set_nodemask()', because the comment block for mpol_new() mentions that we need to call this function to "set nodes". Some additional minor line length, whitespace and typo cleanup.] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: clean up get_user_pages_fast() documentationNick Piggin2009-06-16
| | | | | | | | | | | | | | Move more documentation for get_user_pages_fast into the new kerneldoc comment. Add some comments for get_user_pages as well. Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't there initially because it was once a static inline function. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Andy Grover <andy.grover@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: add radix_tree_prev_hole()Wu Fengguang2009-06-16
| | | | | | | | | | | | | The counterpart of radix_tree_next_hole(). To be used by context readahead. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* readahead: record mmap read-around states in file_ra_stateWu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | | Mmap read-around now shares the same code style and data structure with readahead code. This also removes do_page_cache_readahead(). Its last user, mmap read-around, has been changed to call ra_submit(). The no-readahead-if-congested logic is dumped by the way. Users will be pretty sensitive about the slow loading of executables. So it's unfavorable to disabled mmap read-around on a congested queue. [akpm@linux-foundation.org: coding-style fixes] Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* readahead: make mmap_miss an unsigned intWu Fengguang2009-06-16
| | | | | | | | | | | | | | | This makes the performance impact of possible mmap_miss wrap around to be temporary and tolerable: i.e. MMAP_LOTSAMISS=100 extra readarounds. Otherwise if ever mmap_miss wraps around to negative, it takes INT_MAX cache misses to bring it back to normal state. During the time mmap readaround will be _enabled_ for whatever wild random workload. That's almost permanent performance impact. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: consolidate init_mm definitionAlexey Dobriyan2009-06-16
| | | | | | | | | | | | | | | | | | | * create mm/init-mm.c, move init_mm there * remove INIT_MM, initialize init_mm with C99 initializer * unexport init_mm on all arches: init_mm is already unexported on x86. One strange place is some OMAP driver (drivers/video/omap/) which won't build modular, but it's already wants get_vm_area() export. Somebody should look there. [akpm@linux-foundation.org: add missing #includes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Frysinger <vapier.adi@gmail.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* firmware_map: fix hang with x86/32bitYinghai Lu2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13484 Peer reported: | The bug is introduced from kernel 2.6.27, if E820 table reserve the memory | above 4G in 32bit OS(BIOS-e820: 00000000fff80000 - 0000000120000000 | (reserved)), system will report Int 6 error and hang up. The bug is caused by | the following code in drivers/firmware/memmap.c, the resource_size_t is 32bit | variable in 32bit OS, the BUG_ON() will be invoked to result in the Int 6 | error. I try the latest 32bit Ubuntu and Fedora distributions, all hit this | bug. |====== |static int firmware_map_add_entry(resource_size_t start, resource_size_t end, | const char *type, | struct firmware_map_entry *entry) and it only happen with CONFIG_PHYS_ADDR_T_64BIT is not set. it turns out we need to pass u64 instead of resource_size_t for that. [akpm@linux-foundation.org: add comment] Reported-and-tested-by: Peer Chen <pchen@nvidia.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* time: move PIT_TICK_RATE to linux/timex.hArnd Bergmann2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | PIT_TICK_RATE is currently defined in four architectures, but in three different places. While linux/timex.h is not the perfect place for it, it is still a reasonable replacement for those drivers that traditionally use asm/timex.h to get CLOCK_TICK_RATE and expect it to be the PIT frequency. Note that for Alpha, the actual value changed from 1193182UL to 1193180UL. This is unlikely to make a difference, and probably can only improve accuracy. There was a discussion on the correct value of CLOCK_TICK_RATE a few years ago, after which every existing instance was getting changed to 1193182. According to the specification, it should be 1193181.818181... Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'timers-for-linus-migration' of ↵Linus Torvalds2009-06-15
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: Logic to move non pinned timers timers: /proc/sys sysctl hook to enable timer migration timers: Identifying the existing pinned timers timers: Framework for identifying pinned timers timers: allow deferrable timers for intervals tv2-tv5 to be deferred Fix up conflicts in kernel/sched.c and kernel/timer.c manually
| * timers: Logic to move non pinned timersArun R Bharadwaj2009-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: This patch migrates all non pinned timers and hrtimers to the current idle load balancer, from all the idle CPUs. Timers firing on busy CPUs are not migrated. While migrating hrtimers, care should be taken to check if migrating a hrtimer would result in a latency or not. So we compare the expiry of the hrtimer with the next timer interrupt on the target cpu and migrate the hrtimer only if it expires *after* the next interrupt on the target cpu. So, added a clockevents_get_next_event() helper function to return the next_event on the target cpu's clock_event_device. [ tglx: cleanups and simplifications ] Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * timers: /proc/sys sysctl hook to enable timer migrationArun R Bharadwaj2009-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: This patch creates the /proc/sys sysctl interface at /proc/sys/kernel/timer_migration Timer migration is enabled by default. To disable timer migration, when CONFIG_SCHED_DEBUG = y, echo 0 > /proc/sys/kernel/timer_migration Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * timers: Framework for identifying pinned timersArun R Bharadwaj2009-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: This patch creates a new framework for identifying cpu-pinned timers and hrtimers. This framework is needed because pinned timers are expected to fire on the same CPU on which they are queued. So it is essential to identify these and not migrate them, in case there are any. For regular timers, the currently existing add_timer_on() can be used queue pinned timers and subsequently mod_timer_pinned() can be used to modify the 'expires' field. For hrtimers, new modes HRTIMER_ABS_PINNED and HRTIMER_REL_PINNED are added to queue cpu-pinned hrtimer. [ tglx: use .._PINNED mode argument instead of creating tons of new functions ] Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | Merge branch 'timers-for-linus-clocksource' of ↵Linus Torvalds2009-06-15
|\ \ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-clocksource' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: prevent selection of low resolution clocksourse also for nohz=on clocksource: sanity check sysfs clocksource changes
| * | clocksource: prevent selection of low resolution clocksourse also for nohz=onThomas Gleixner2009-06-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3f68535adad (clocksource: sanity check sysfs clocksource changes) prevents selection of non high resolution capable clocksources when high resolution mode is active, but did not take into account that the same rules apply for highres=off nohz=on. Check the tick device mode instead of hrtimer_hres_active() to verify whether the system needs to be protected from a switch to jiffies or other non highres capable clock sources. Reported-by: Luming Yu <luming.yu@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | clocksource: sanity check sysfs clocksource changesjohn stultz2009-06-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thomas, Andrew and Ingo pointed out that we don't have any safety checks in the clocksource sysfs entries to make sure sysadmins don't try to change the clocksource to a non high-res timer capable clocksource (such as jiffies) when high-res timers (HRT) is enabled. Doing so will likely hang a system. Correct this by filtering non HRT clocksources from available_clocksources and not accepting non HRT clocksources with HRT enabled. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | Merge branch 'timers-for-linus-ntp' of ↵Linus Torvalds2009-06-15
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-ntp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ntp: fix comment typos ntp: adjust SHIFT_PLL to improve NTP convergence
| * | | ntp: fix comment typosjohn stultz2009-05-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bernhard Schiffner noticed I had a few comment typos in this patch, (note: to save embarrassment, when making typos, avoid copying and pasting them) so this patch corrects them. [ Impact: cleanup ] Reported-by: Bernhard Schiffner <bernhard@schiffner-limbach.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: riel@redhat.com Cc: akpm@linux-foundation.org LKML-Reference: <1242090794.7214.131.camel@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | ntp: adjust SHIFT_PLL to improve NTP convergencejohn stultz2009-05-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The conversion to the ntpv4 reference model f19923937321244e7dc334767eb4b67e0e3d5c74 ("ntp: convert to the NTP4 reference model") in 2.6.19 added nanosecond resolution the adjtimex interface, but also changed the "stiffness" of the frequency adjustments, causing NTP convergence time to greatly increase. SHIFT_PLL, which reduces the stiffness of the freq adjustments, was designed to be inversely linked to HZ, and the reference value of 4 was designed for Unix systems using HZ=100. However Linux's clock steering code mostly independent of HZ. So this patch reduces the SHIFT_PLL value from 4 to 2, which causes NTPd behavior to match kernels prior to 2.6.19, greatly reducing convergence times, and improving close synchronization through environmental thermal changes. The patch also changes some l's to L's in nearby code to avoid misreading 50l as 501. [ Impact: tweak NTP algorithm for faster convergence ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: zippel@linux-m68k.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <200905051956.n45JuVo9025575@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds2009-06-15
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1244 commits) pkt_sched: Rename PSCHED_US2NS and PSCHED_NS2US ipv4: Fix fib_trie rebalancing Bluetooth: Fix issue with uninitialized nsh.type in DTL-1 driver Bluetooth: Fix Kconfig issue with RFKILL integration PIM-SM: namespace changes ipv4: update ARPD help text net: use a deferred timer in rt_check_expire ieee802154: fix kconfig bool/tristate muckup bonding: initialization rework bonding: use is_zero_ether_addr bonding: network device names are case sensative bonding: elminate bad refcount code bonding: fix style issues bonding: fix destructor bonding: remove bonding read/write semaphore bonding: initialize before registration bonding: bond_create always called with default parameters x_tables: Convert printk to pr_err netfilter: conntrack: optional reliable conntrack event delivery list_nulls: add hlist_nulls_add_head and hlist_nulls_del ...
| * \ \ \ Merge branch 'master' of ↵David S. Miller2009-06-15
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/scsi/fcoe/fcoe.c net/core/drop_monitor.c net/core/net-traces.c
| * | | | | list_nulls: add hlist_nulls_add_head and hlist_nulls_delPablo Neira Ayuso2009-06-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the hlist_nulls_add_head() function which is based on hlist_nulls_add_head_rcu() but without the use of rcu_assign_pointer(). It also adds hlist_nulls_del which is exactly the same like hlist_nulls_del_rcu(). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | | | | Merge branch 'master' of ↵David S. Miller2009-06-11
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
| | * \ \ \ \ Merge branch 'master' of ↵Patrick McHardy2009-06-11
| | |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
| | * | | | | | netfilter: xt_socket: added new revision of the 'socket' match supporting flagsLaszlo Attila Toth2009-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the XT_SOCKET_TRANSPARENT flag is set, enabled 'transparent' socket option is required for the socket to be matched. Signed-off-by: Laszlo Attila Toth <panther@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | | netfilter: passive OS fingerprint xtables matchEvgeniy Polyakov2009-06-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Passive OS fingerprinting netfilter module allows to passively detect remote OS and perform various netfilter actions based on that knowledge. This module compares some data (WS, MSS, options and it's order, ttl, df and others) from packets with SYN bit set with dynamically loaded OS fingerprints. Fingerprint matching rules can be downloaded from OpenBSD source tree or found in archive and loaded via netfilter netlink subsystem into the kernel via special util found in archive. Archive contains library file (also attached), which was shipped with iptables extensions some time ago (at least when ipt_osf existed in patch-o-matic). Following changes were made in this release: * added NLM_F_CREATE/NLM_F_EXCL checks * dropped _rcu list traversing helpers in the protected add/remove calls * dropped unneded structures, debug prints, obscure comment and check Fingerprints can be downloaded from http://www.openbsd.org/cgi-bin/cvsweb/src/etc/pf.os or can be found in archive Example usage: -d switch removes fingerprints Please consider for inclusion. Thank you. Passive OS fingerprint homepage (archives, examples): http://www.ioremap.net/projects/osf Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | | netfilter: xt_NFQUEUE: queue balancing supportFlorian Westphal2009-06-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds support for specifying a range of queues instead of a single queue id. Flows will be distributed across the given range. This is useful for multicore systems: Instead of having a single application read packets from a queue, start multiple instances on queues x, x+1, .. x+n. Each instance can process flows independently. Packets for the same connection are put into the same queue. Signed-off-by: Holger Eitzenberger <heitzenberger@astaro.com> Signed-off-by: Florian Westphal <fwestphal@astaro.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | | netfilter: x_tables: added hook number into match extension parameter structure.Evgeniy Polyakov2009-06-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | | netfilter: conntrack: replace notify chain by function pointerPablo Neira Ayuso2009-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the notify chain infrastructure and replace it by a simple function pointer. This issue has been mentioned in the mailing list several times: the use of the notify chain adds too much overhead for something that is only used by ctnetlink. This patch also changes nfnetlink_send(). It seems that gfp_any() returns GFP_KERNEL for user-context request, like those via ctnetlink, inside the RCU read-side section which is not valid. Using GFP_KERNEL is also evil since netlink may schedule(), this leads to "scheduling while atomic" bug reports. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | netfilter: conntrack: remove events flags from userspace exposed filePablo Neira Ayuso2009-06-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch moves the event flags from linux/netfilter/nf_conntrack_common.h to net/netfilter/nf_conntrack_ecache.h. This flags are not of any use from userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | netfilter: nf_ct_tcp: TCP simultaneous open supportJozsef Kadlecsik2009-06-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch below adds supporting TCP simultaneous open to conntrack. The unused LISTEN state is replaced by a new state (SYN_SENT2) denoting the second SYN sent from the reply direction in the new case. The state table is updated and the function tcp_in_window is modified to handle simultaneous open. The functionality can fairly easily be tested by socat. A sample tcpdump recording 23:21:34.244733 IP (tos 0x0, ttl 64, id 49224, offset 0, flags [DF], proto TCP (6), length 60) 192.168.0.254.2020 > 192.168.0.1.2020: S, cksum 0xe75f (correct), 3383710133:3383710133(0) win 5840 <mss 1460,sackOK,timestamp 173445629 0,nop,wscale 7> 23:21:34.244783 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40) 192.168.0.1.2020 > 192.168.0.254.2020: R, cksum 0x0253 (correct), 0:0(0) ack 3383710134 win 0 23:21:36.038680 IP (tos 0x0, ttl 64, id 28092, offset 0, flags [DF], proto TCP (6), length 60) 192.168.0.1.2020 > 192.168.0.254.2020: S, cksum 0x704b (correct), 2634546729:2634546729(0) win 5840 <mss 1460,sackOK,timestamp 824213 0,nop,wscale 1> 23:21:36.038777 IP (tos 0x0, ttl 64, id 49225, offset 0, flags [DF], proto TCP (6), length 60) 192.168.0.254.2020 > 192.168.0.1.2020: S, cksum 0xb179 (correct), 3383710133:3383710133(0) ack 2634546730 win 5840 <mss 1460,sackOK,timestamp 173447423 824213,nop,wscale 7> 23:21:36.038847 IP (tos 0x0, ttl 64, id 28093, offset 0, flags [DF], proto TCP (6), length 52) 192.168.0.1.2020 > 192.168.0.254.2020: ., cksum 0xebad (correct), ack 3383710134 win 2920 <nop,nop,timestamp 824213 173447423> and the corresponding netlink events: [NEW] tcp 6 120 SYN_SENT src=192.168.0.254 dst=192.168.0.1 sport=2020 dport=2020 [UNREPLIED] src=192.168.0.1 dst=192.168.0.254 sport=2020 dport=2020 [UPDATE] tcp 6 120 LISTEN src=192.168.0.254 dst=192.168.0.1 sport=2020 dport=2020 src=192.168.0.1 dst=192.168.0.254 sport=2020 dport=2020 [UPDATE] tcp 6 60 SYN_RECV src=192.168.0.254 dst=192.168.0.1 sport=2020 dport=2020 src=192.168.0.1 dst=192.168.0.254 sport=2020 dport=2020 [UPDATE] tcp 6 432000 ESTABLISHED src=192.168.0.254 dst=192.168.0.1 sport=2020 dport=2020 src=192.168.0.1 dst=192.168.0.254 sport=2020 dport=2020 [ASSURED] The RST packet was dropped in the raw table, thus it did not reach conntrack. nfnetlink_conntrack is unpatched so it shows the new SYN_SENT2 state as the old unused LISTEN. With TCP simultaneous open support we satisfy REQ-2 in RFC 5382 ;-) . Additional minor correction in this patch is that in order to catch uninitialized reply directions, "td_maxwin == 0" is used instead of "td_end == 0" because the former can't be true except in uninitialized state while td_end may accidentally be equal to zero in the mid of a connection. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>
| | * | | | | | netfilter: conntrack: add support for DCCP handshake sequence to ctnetlinkPablo Neira Ayuso2009-05-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds CTA_PROTOINFO_DCCP_HANDSHAKE_SEQ that exposes the u64 handshake sequence number to user-space. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * | | | | | | Merge branch 'linux-2.6.31.y' of ↵David S. Miller2009-06-11
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/inaky/wimax
| | * | | | | | | wimax/i2400m: rename misleading I2400M_PL_PAD to I2400M_PL_ALIGNInaky Perez-Gonzalez2009-06-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The constant is being use as an alignment factor, not as a padding factor; made reading/reviewing the code quite confusing. Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
| * | | | | | | | mISDN: cleanup mISDNhw.hKarsten Keil2009-06-11
| | |_|/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unused stuff. Signed-off-by: Karsten Keil <keil@b1-systems.de>
| * | | | | | | mdio: Expose 10GBASE-T MDI-X status via ethtoolBen Hutchings2009-06-11
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is available in a standard MDIO register in 10GBASE-T PHYs. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | rfkill: don't impose global states on resume (just restore the previous states)Alan Jenkins2009-06-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once rfkill-input is disabled, the "global" states will only be used as default initial states. Since the states will always be the same after resume, we shouldn't generate events on resume. Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: John W. Linville <linville@tuxdriver.com>
| * | | | | | rfkill: remove set_global_sw_stateAlan Jenkins2009-06-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rfkill_set_global_sw_state() (previously rfkill_set_default()) will no longer be exported by the rewritten rfkill core. Instead, platform drivers which can provide persistent soft-rfkill state across power-down/reboot should indicate their initial state by calling rfkill_set_sw_state() before registration. Otherwise, they will be initialized to a default value during registration by a set_block call. We remove existing calls to rfkill_set_sw_state() which happen before registration, since these had no effect in the old model. If these drivers do have persistent state, the calls can be put back (subject to testing :-). This affects hp-wmi and acer-wmi. Drivers with persistent state will affect the global state only if rfkill-input is enabled. This is required, otherwise booting with wireless soft-blocked and pressing the wireless-toggle key once would have no apparent effect. This special case will be removed in future along with rfkill-input, in favour of a more flexible userspace daemon (see Documentation/feature-removal-schedule.txt). Now rfkill_global_states[n].def is only used to preserve global states over EPO, it is renamed to ".sav". Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: John W. Linville <linville@tuxdriver.com>
| * | | | | | mac80211: do not pass PS frames out of mac80211 againJohannes Berg2009-06-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to handle powersave frames properly we had needed to pass these out to the device queues again, and introduce the skb->requeue bit. This, however, also has unnecessary overhead by needing to 'clean up' already tried frames, and this clean-up code is also buggy when software encryption is used. Instead of sending the frames via the master netdev queue again, simply put them into the pending queue. This also fixes a problem where frames for that particular station could be reordered when some were still on the software queues and older ones are re-injected into the software queue after them. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
| * | | | | | net/libertas: remove GPIO-CS handling in SPI interface codeSebastian Andrzej Siewior2009-06-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes the dependency on GPIO framework and lets the SPI host driver handle the chip select. The SPI host driver is required to keep the CS active for the entire message unless cs_change says otherwise. This patch collects the two/three single SPI transfers into a message. Also the delay in read path in case use_dummy_writes are not used is moved into the SPI host driver. Tested-by: Mike Rapoport <mike@compulab.co.il> Tested-by: Andrey Yurovsky <andrey@cozybit.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Dan Williams <dcbw@redhat.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
| * | | | | | rfkill: include err.hJohannes Berg2009-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we use ERR_PTR and similar macros, we need to include linux/err.h. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | e1000e: Expose MDI-X status via ethtool changeChaitanya Lala2009-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ethtool is a standard way of getting information about ethernet interfaces. We enhance ethtool kernel interface & e1000e to make the MDI-X status readable via ethtool in userspace. Signed-off-by: Chaitanya Lala <clala@riverbed.com> Signed-off-by: Arthur Jones <ajones@riverbed.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | net: add NL802154 interface for configuration of 802.15.4 devicesSergey Lapin2009-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a netlink interface for configuration of IEEE 802.15.4 device. Also this interface specifies events notification sent by devices towards higher layers. Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Signed-off-by: Sergey Lapin <slapin@ossfans.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | Add constants for the ieee 802.15.4 stackSergey Lapin2009-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IEEE 802.15.4 stack requires several constants to be defined/adjusted. Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Signed-off-by: Sergey Lapin <slapin@ossfans.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | netdevice.h: Use frag list abstraction interfaces.David S. Miller2009-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | skbuff: Add frag list abstraction interfaces.David S. Miller2009-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the hope that these can be used to eliminate direct references to the frag list implementation. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | isdn: rename capi_ctr_reseted() to capi_ctr_down()Tilman Schmidt2009-06-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the name of the Kernel CAPI exported function capi_ctr_reseted() to something representing its purpose better. Impact: renaming, no functional change Signed-off-by: Tilman Schmidt <tilman@imap.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | net: skb_shared_info optimizationEric Dumazet2009-06-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skb_dma_unmap() is quite expensive for small packets, because we use two different cache lines from skb_shared_info. One to access nr_frags, one to access dma_maps[0] Instead of dma_maps being an array of MAX_SKB_FRAGS + 1 elements, let dma_head alone in a new dma_head field, close to nr_frags, to reduce cache lines misses. Tested on my dev machine (bnx2 & tg3 adapters), nice speedup ! Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>