aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* tracing: Separate raw syscall from syscall tracerLai Jiangshan2009-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current syscall tracer mixes raw syscalls and real syscalls. echo 1 > events/syscalls/enable And we get these from the output: (XXXX insteads " grep-20914 [001] 588211.446347" .. etc) XXXX: sys_read(fd: 3, buf: 80609a8, count: 7000) XXXX: sys_enter: NR 3 (3, 80609a8, 7000, a, 1000, bfce8ef8) XXXX: sys_read -> 0x138 XXXX: sys_exit: NR 3 = 312 XXXX: sys_read(fd: 3, buf: 8060ae0, count: 7000) XXXX: sys_enter: NR 3 (3, 8060ae0, 7000, a, 1000, bfce8ef8) XXXX: sys_read -> 0x138 XXXX: sys_exit: NR 3 = 312 There are 2 drawbacks here. A) two almost identical records are saved in ringbuffer when a syscall enters or exits. (4 records for every syscall) This wastes precious space in the ring buffer. B) the lines including "sys_enter/sys_exit" produces hardly any useful information for the output (no labels). The user can use this method to prevent these drawbacks: echo 1 > events/syscalls/enable echo 0 > events/syscalls/sys_enter/enable echo 0 > events/syscalls/sys_exit/enable But this is not user friendly. So we separate raw syscall from syscall tracer. After this fix applied: syscall tracer's output (echo 1 > events/syscalls/enable): XXXX: sys_read(fd: 3, buf: bfe87d88, count: 200) XXXX: sys_read -> 0x200 XXXX: sys_fstat64(fd: 3, statbuf: bfe87c98) XXXX: sys_fstat64 -> 0x0 XXXX: sys_close(fd: 3) raw syscall tracer's output (echo 1 > events/raw_syscalls/enable): XXXX: sys_enter: NR 175 (0, bf92bf18, bf92bf98, 8, b748cff4, bf92bef8) XXXX: sys_exit: NR 175 = 0 XXXX: sys_enter: NR 175 (2, bf92bf98, 0, 8, b748cff4, bf92bef8) XXXX: sys_exit: NR 175 = 0 XXXX: sys_enter: NR 3 (9, bf927f9c, 4000, b77e2518, b77dce60, bf92bff8) Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4AEFC37C.5080609@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* ring-buffer-benchmark: Add parameters to set produce/consumer prioritiesSteven Rostedt2009-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Running the ring-buffer-benchmark's threads at the lowest priority may work well for keeping it in the background, but it is not appropriate for the benchmarks. This patch adds 4 parameters to the module: consumer_fifo consumer_nice producer_fifo producer_nice By default the consumer and producer still run at nice +19. If the *_fifo options are set, they will override the *_nice values. modprobe ring_buffer_benchmark consumer_nice=0 producer_fifo=10 The above will set the consumer thread to a nice value of 0, and the producer thread to a RT SCHED_FIFO priority of 10. Note, this patch also fixes a bug where calling set_user_nice on the consumer thread would oops the kernel when the parameter "disable_reader" is set. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing, function tracer: Clean up strstrip() usageIngo Molnar2009-11-23
| | | | | | | | | | | | | | | | Clean up strstrip() usage - which also addresses this build warning: kernel/trace/ftrace.c: In function 'ftrace_pid_write': kernel/trace/ftrace.c:3004: warning: ignoring return value of 'strstrip', declared with attribute warn_unused_result Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* ring-buffer benchmark: Run producer/consumer threads at nice +19Ingo Molnar2009-11-23
| | | | | | | | | | | | | | | | | | | | The ring-buffer benchmark threads run on nice 0 by default, using up a lot of CPU time and slowing down the system: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1024 root 20 0 0 0 0 D 95.3 0.0 4:01.67 rb_producer 1023 root 20 0 0 0 0 R 93.5 0.0 2:54.33 rb_consumer 21569 mingo 40 0 14852 1048 772 R 3.6 0.1 0:00.05 top 1 root 40 0 4080 928 668 S 0.0 0.0 0:23.98 init Renice them to +19 to make them less intrusive. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* tracing: Remove the stale include/trace/power.hJosh Stone2009-11-18
| | | | | | | | | | | | | | | Commit 6161352 moved the power tracing to include/trace/events/, but left the old header behind. No one is using the old header, and its declarations are now incorrect, so it should be removed. Signed-off-by: Josh Stone <jistone@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1258578415-14752-1-git-send-email-jistone@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* tracing: Only print objcopy version warning once from recordmcountSteven Rostedt2009-11-17
| | | | | | | | | | | | | | | | | | | | | | If the user has an older version of objcopy, that can not handle converting local symbols to global and vice versa, then some functions will not be part of the dynamic function tracer. The current code in recordmcount.pl will print a warning in this case. Unfortunately, there exists lots of files that may have this issue with older objcopys and this will cause a warning for every file compiled with this issue. This patch solves this overwhelming output by creating a .tmp_quiet_recordmcount file on the first instance the warning is encountered. The warning will not print if this file exists. The temp file is deleted at the beginning of the compile to ensure that the warning will happen once again on new compiles (because the issue is still present). Reported-by: Andrew Morton <akpm@linux-foundation.org> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Prevent build warning: 'ftrace_graph_buf' defined but not usedLai Jiangshan2009-11-17
| | | | | | | | Prevent build warning when CONFIG_FUNCTION_GRAPH_TRACER is not set. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4AF24381.5060307@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* ring-buffer: Move access to commit_page up into function usedSteven Rostedt2009-11-17
| | | | | | | | | | | With the change of the way we process commits. Where a commit only happens at the outer most level, and that we don't need to worry about a commit ending after the rb_start_commit() has been called, the code use to grab the commit page before the tail page to prevent a possible race. But this race no longer exists with the rb_start_commit() rb_end_commit() interface. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: do not disable interrupts for trace_clock_localSteven Rostedt2009-11-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disabling interrupts in trace_clock_local takes quite a performance hit to the recording of traces. Using perf top we see: ------------------------------------------------------------------------------ PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 2842.00 - 40.4% : trace_clock_local 1043.00 - 14.8% : rb_reserve_next_event 784.00 - 11.1% : ring_buffer_lock_reserve 600.00 - 8.5% : __rb_reserve_next 579.00 - 8.2% : rb_end_commit 440.00 - 6.3% : ring_buffer_unlock_commit 290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark] 155.00 - 2.2% : debug_smp_processor_id 117.00 - 1.7% : trace_recursive_unlock 103.00 - 1.5% : ring_buffer_event_data 28.00 - 0.4% : do_gettimeofday 22.00 - 0.3% : _spin_unlock_irq 14.00 - 0.2% : native_read_tsc 11.00 - 0.2% : getnstimeofday Where trace_clock_local is 40% of the tracing, and the time for recording a trace according to ring_buffer_benchmark is 210ns. After converting the interrupts to preemption disabling we have from perf top: ------------------------------------------------------------------------------ PerfTop: 1084 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 1277.00 - 16.8% : native_read_tsc 1148.00 - 15.1% : rb_reserve_next_event 896.00 - 11.8% : ring_buffer_lock_reserve 688.00 - 9.1% : __rb_reserve_next 664.00 - 8.8% : rb_end_commit 563.00 - 7.4% : ring_buffer_unlock_commit 508.00 - 6.7% : _spin_unlock_irq 365.00 - 4.8% : debug_smp_processor_id 321.00 - 4.2% : trace_clock_local 303.00 - 4.0% : ring_buffer_producer_thread [ring_buffer_benchmark] 273.00 - 3.6% : native_sched_clock 122.00 - 1.6% : trace_recursive_unlock 113.00 - 1.5% : sched_clock 101.00 - 1.3% : ring_buffer_event_data 53.00 - 0.7% : tick_nohz_stop_sched_tick Where trace_clock_local drops from 40% to only taking 4% of the total time. The trace time also goes from 210ns down to 179ns (31ns). I talked with Peter Zijlstra about the impact that sched_clock may have without having interrupts disabled, and he told me that if a timer interrupt comes in, sched_clock may report a wrong time. Balancing a seldom incorrect timestamp with a 15% performance boost, I'll take the performance boost. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* ring-buffer: Add multiple iterations between benchmark timestampsSteven Rostedt2009-11-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ring_buffer_benchmark does a gettimeofday after every write to the ring buffer in its measurements. This adds the overhead of the call to gettimeofday to the measurements and does not give an accurate picture of the length of time it takes to record a trace. This was first noticed with perf top: ------------------------------------------------------------------------------ PerfTop: 679 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 1673.00 - 27.8% : trace_clock_local 806.00 - 13.4% : do_gettimeofday 590.00 - 9.8% : rb_reserve_next_event 554.00 - 9.2% : native_read_tsc 431.00 - 7.2% : ring_buffer_lock_reserve 365.00 - 6.1% : __rb_reserve_next 355.00 - 5.9% : rb_end_commit 322.00 - 5.4% : getnstimeofday 268.00 - 4.5% : ring_buffer_unlock_commit 262.00 - 4.4% : ring_buffer_producer_thread [ring_buffer_benchmark] 113.00 - 1.9% : read_tsc 91.00 - 1.5% : debug_smp_processor_id 69.00 - 1.1% : trace_recursive_unlock 66.00 - 1.1% : ring_buffer_event_data 25.00 - 0.4% : _spin_unlock_irq And the length of each write to the ring buffer measured at 310ns. This patch adds a new module parameter called "write_interval" which is defaulted to 50. This is the number of writes performed between timestamps. After this patch perf top shows: ------------------------------------------------------------------------------ PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 2842.00 - 40.4% : trace_clock_local 1043.00 - 14.8% : rb_reserve_next_event 784.00 - 11.1% : ring_buffer_lock_reserve 600.00 - 8.5% : __rb_reserve_next 579.00 - 8.2% : rb_end_commit 440.00 - 6.3% : ring_buffer_unlock_commit 290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark] 155.00 - 2.2% : debug_smp_processor_id 117.00 - 1.7% : trace_recursive_unlock 103.00 - 1.5% : ring_buffer_event_data 28.00 - 0.4% : do_gettimeofday 22.00 - 0.3% : _spin_unlock_irq 14.00 - 0.2% : native_read_tsc 11.00 - 0.2% : getnstimeofday do_gettimeofday dropped from 13% usage to a mere 0.4%! (using the default 50 interval) The measurement for each timestamp went from 310ns to 210ns. That's 100ns (1/3rd) overhead that the gettimeofday call was introducing. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* kprobes: Sanitize struct kretprobe_instance allocationsAnanth N Mavinakayanahalli2009-11-02
| | | | | | | | | | | | | | | | | | | | For as long as kretprobes have existed, we've allocated NR_CPUS instances of kretprobe_instance structures. With the default value of CONFIG_NR_CPUS increasing on certain architectures, we are potentially wasting kernel memory. See http://sourceware.org/bugzilla/show_bug.cgi?id=10839#c3 for more details. Use a saner num_possible_cpus() instead of NR_CPUS for allocation. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: fweisbec@gmail.com LKML-Reference: <20091030135310.GA22230@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* tracing: Fix to use __always_unused attributeLi Zefan2009-11-02
| | | | | | | | | | | | | ____ftrace_check_##name() is used for compile-time check on F_printk() only, so it should be marked as __unused instead of __used. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <4AEE2D01.4010305@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* compiler: Introduce __always_unusedLi Zefan2009-11-02
| | | | | | | | | | | | | | | I wrote some code which is used as compile-time checker, and the code should be elided after compile. So I need to annotate the code as "always unused", compared to "maybe unused". Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <4AEE2CEC.8040206@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* tracing: Exit with error if a weak function is used in recordmcount.plLi Hong2009-10-29
| | | | | | | | | | | | If a weak function is used as a relocation reference for mcount callers and that function is overridden, it will cause ftrace to fail at run time. The current code should prevent a weak function from being used, but if one is, the code should exit with an error to fail at compile time. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091028050743.GH30758@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Move conditional into update_funcs() in recordmcount.plLi Hong2009-10-29
| | | | | | | | | | Move all the condition validations into the function update_funcs(). Also update_funcs should not die if $ref_func is undefined for there may be more than one valid section in an object file. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091028050703.GG30758@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Add regex for weak functions in recordmcount.plLi Hong2009-10-29
| | | | | | | | | | | | Add a variable to contain the regex needed to find weak functions in the 'nm' output. This will allow other archs to easily override it. Also rename the regex variable $nm_regex to $local_regex to be more descriptive. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091028050619.GF30758@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Move mcount section search to front of loop in recordmcount.plLi Hong2009-10-29
| | | | | | | | | | Move the mcount section check to the beginning of the objdump read loop. This makes the code easier to follow since the search for the mcount section is performed first before the mcount callers are processed. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091028050523.GE30758@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Fix objcopy revision check in recordmcount.plLi Hong2009-10-29
| | | | | | | | | | | | The current logic to check objcopy's version is incorrect. This patch fixes the algorithm and disables the use of local functions as a reference if the objcopy version does not support static to global conversions. Also remove some usused variables. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091028050421.GD30758@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Check absolute path of input file in recordmcount.plLi Hong2009-10-29
| | | | | | | | | | | | The ftrace.c file may reference the mcount function and this may interfere with the recordmcount.pl processing. To avoid this, the code does not process the kernel/trace/ftrace.o. But currently the check is against a relative path. This patch modifies the check to succeed if the path is an absolute path. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091028050332.GC30758@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Correct the check for number of arguments in recordmcount.plLi Hong2009-10-29
| | | | | | | | | The number of arguments passed into recordmcount.pl is 10, but the code checks if only 7 are passed in. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091027065733.GB22032@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Amend documentation in recordmcount.pl to reflect implementationLi Hong2009-10-29
| | | | | | | | | | | | | The documentation currently says we will use the first function in a section as a reference. The actual algorithm is: choose the first global function we meet as a reference. If there is none, choose the first local one. Change the documentation to be consistent with the code. Also add several other clarifications. Signed-off-by: Li Hong <lihong.hi@gmail.com> LKML-Reference: <20091028050138.GA30758@uhli> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* Merge branch 'tracing/urgent' into tracing/coreIngo Molnar2009-10-29
|\ | | | | | | | | | | Merge reason: Pick up fixes and move base from -rc1 to -rc5. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing: Remove cpu arg from the rb_time_stamp() functionJiri Olsa2009-10-24
| | | | | | | | | | | | | | | | | | | | | | The cpu argument is not used inside the rb_time_stamp() function. Plus fix a typo. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233647.118547500@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing: Fix comment typo and documentation exampleJiri Olsa2009-10-24
| | | | | | | | | | | | | | | | | | | | | | Trivial patch to fix a documentation example and to fix a comment. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.871719877@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing: Fix trace_seq_printf() return valueJiri Olsa2009-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | trace_seq_printf() return value is a little ambiguous. It currently returns the length of the space available in the buffer. printf usually returns the amount written. This is not adequate here, because: trace_seq_printf(s, ""); is perfectly legal, and returning 0 would indicate that it failed. We can always see the amount written by looking at the before and after values of s->len. This is not quite the same use as printf. We only care if the string was successfully written to the buffer or not. Make trace_seq_printf() return 0 if the trace oversizes the buffer's free space, 1 otherwise. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.631787612@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * tracing: Update *ppos instead of filp->f_posJiri Olsa2009-10-24
| | | | | | | | | | | | | | | | | | | | | | | | Instead of directly updating filp->f_pos we should update the *ppos argument. The filp->f_pos gets updated within the file_pos_write() function called from sys_write(). Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.399670810@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linusLinus Torvalds2009-10-22
| |\ | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: move virtrng_remove to .devexit.text move virtballoon_remove to .devexit.text virtio_blk: Revert serial number support virtio: let header files include virtio_ids.h virtio_blk: revert QUEUE_FLAG_VIRT addition
| | * move virtrng_remove to .devexit.textUwe Kleine-König2009-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function virtrng_remove is used only wrapped by __devexit_p so define it using __devexit. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Acked-by: Sam Ravnborg <sam@ravnborg.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michael S. Tsirkin <mst@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| | * move virtballoon_remove to .devexit.textUwe Kleine-König2009-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function virtballoon_remove is used only wrapped by __devexit_p so define it using __devexit. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Acked-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| | * virtio_blk: Revert serial number supportRusty Russell2009-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts "Add serial number support for virtio_blk, V4a". Turns out that virtio_pci, lguest and s/390 all have an 8 bit limit on virtio config space, so noone could ever use this. This is coming back later in a cleaner form. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: john cooper <john.cooper@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com>
| | * virtio: let header files include virtio_ids.hChristian Borntraeger2009-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rusty, commit 3ca4f5ca73057a617f9444a91022d7127041970a virtio: add virtio IDs file moved all device IDs into a single file. While the change itself is a very good one, it can break userspace applications. For example if a userspace tool wanted to get the ID of virtio_net it used to include virtio_net.h. This does no longer work, since virtio_net.h does not include virtio_ids.h. This patch moves all "#include <linux/virtio_ids.h>" from the C files into the header files, making the header files compatible with the old ones. In addition, this patch exports virtio_ids.h to userspace. CC: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| | * virtio_blk: revert QUEUE_FLAG_VIRT additionChristoph Hellwig2009-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems like the addition of QUEUE_FLAG_VIRT caueses major performance regressions for Fedora users: https://bugzilla.redhat.com/show_bug.cgi?id=509383 https://bugzilla.redhat.com/show_bug.cgi?id=505695 while I can't reproduce those extreme regressions myself I think the flag is wrong. Rationale: QUEUE_FLAG_VIRT expands to QUEUE_FLAG_NONROT which casus the queue unplugged immediately. This is not a good behaviour for at least qemu and kvm where we do have significant overhead for every I/O operations. Even with all the latested speeups (native AIO, MSI support, zero copy) we can only get native speed for up to 128kb I/O requests we already are down to 66% of native performance for 4kb requests even on my laptop running the Intel X25-M SSD for which the QUEUE_FLAG_NONROT was designed. If we ever get virtio-blk overhead low enough that this flag makes sense it should only be set based on a feature flag set by the host. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2009-10-22
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits) niu: VLAN_ETH_HLEN should be used to make sure that the whole MAC header was copied to the head buffer in the Vlan packets case KS8851: Fix ks8851_set_rx_mode() for IFF_MULTICAST KS8851: Fix MAC address write order KS8851: Add soft reset at probe time net: fix section mismatch in fec.c net: Fix struct inet_timewait_sock bitfield annotation tcp: Try to catch MSG_PEEK bug net: Fix IP_MULTICAST_IF bluetooth: static lock key fix bluetooth: scheduling while atomic bug fix tcp: fix TCP_DEFER_ACCEPT retrans calculation tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT tcp: accept socket after TCP_DEFER_ACCEPT period Revert "tcp: fix tcp_defer_accept to consider the timeout" AF_UNIX: Fix deadlock on connecting to shutdown socket ethoc: clear only pending irqs ethoc: inline regs access vmxnet3: use dev_dbg, fix build for CONFIG_BLOCK=n virtio_net: use dev_kfree_skb_any() in free_old_xmit_skbs() be2net: fix support for PCI hot plug ...
| | * | niu: VLAN_ETH_HLEN should be used to make sure that the whole MAC header was ↵Joyce Yu2009-10-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | copied to the head buffer in the Vlan packets case Signed-off-by: Joyce Yu <joyce.yu@sun.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | KS8851: Fix ks8851_set_rx_mode() for IFF_MULTICASTBen Dooks2009-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ks8851_set_rx_mode() the case handling IFF_MULTICAST was also setting the RXCR1_AE bit by accident. This meant that all unicast frames where being accepted by the device. Remove RXCR1_AE from this case. Note, RXCR1_AE was also masking a problem with setting the MAC address properly, so needs to be applied after fixing the MAC write order. Fixes a bug reported by Doong, Ping of Micrel. This version of the patch avoids setting RXCR1_ME for all cases. Signed-off-by: Ben Dooks <ben@simtec.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | KS8851: Fix MAC address write orderBen Dooks2009-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The MAC address register was being written in the wrong order, so add a new address macro to convert mac-address byte to register address and a ks8851_wrreg8() function to write each byte without having to worry about any difficult byte swapping. Fixes a bug reported by Doong, Ping of Micrel. Signed-off-by: Ben Dooks <ben@simtec.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | KS8851: Add soft reset at probe timeBen Dooks2009-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Issue a full soft reset at probe time. This was reported by Doong Ping of Micrel, but no explanation of why this is necessary or what bug it is fixing. Add it as it does not seem to hurt the current driver and ensures that the device is in a known state when we start setting it up. Signed-off-by: Ben Dooks <ben@simtec.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: fix section mismatch in fec.cSteven King2009-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fec_enet_init is called by both fec_probe and fec_resume, so it shouldn't be marked as __init. Signed-off-by: Steven King <sfking@fdwdc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: Fix struct inet_timewait_sock bitfield annotationEric Dumazet2009-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9e337b0f (net: annotate inet_timewait_sock bitfields) added 4/8 bytes in struct inet_timewait_sock. Fix this by declaring tw_ipv6_offset in the 'flags' bitfield The 14 bits hole is named tw_pad to make it cleary apparent. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: Try to catch MSG_PEEK bugHerbert Xu2009-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch tries to print out more information when we hit the MSG_PEEK bug in tcp_recvmsg. It's been around since at least 2005 and it's about time that we finally fix it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: Fix IP_MULTICAST_IFEric Dumazet2009-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls. This function should be called only with RTNL or dev_base_lock held, or reader could see a corrupt hash chain and eventually enter an endless loop. Fix is to call dev_get_by_index()/dev_put(). If this happens to be performance critical, we could define a new dev_exist_by_index() function to avoid touching dev refcount. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | bluetooth: static lock key fixDave Young2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When shutdown ppp connection, lockdep waring about non-static key will happen, it is caused by the lock is not initialized properly at that time. Fix with tuning the lock/skb_queue_head init order [ 94.339261] INFO: trying to register non-static key. [ 94.342509] the code is fine but needs lockdep annotation. [ 94.342509] turning off the locking correctness validator. [ 94.342509] Pid: 0, comm: swapper Not tainted 2.6.31-mm1 #2 [ 94.342509] Call Trace: [ 94.342509] [<c0248fbe>] register_lock_class+0x58/0x241 [ 94.342509] [<c024b5df>] ? __lock_acquire+0xb57/0xb73 [ 94.342509] [<c024ab34>] __lock_acquire+0xac/0xb73 [ 94.342509] [<c024b7fa>] ? lock_release_non_nested+0x17b/0x1de [ 94.342509] [<c024b662>] lock_acquire+0x67/0x84 [ 94.342509] [<c04cd1eb>] ? skb_dequeue+0x15/0x41 [ 94.342509] [<c054a857>] _spin_lock_irqsave+0x2f/0x3f [ 94.342509] [<c04cd1eb>] ? skb_dequeue+0x15/0x41 [ 94.342509] [<c04cd1eb>] skb_dequeue+0x15/0x41 [ 94.342509] [<c054a648>] ? _read_unlock+0x1d/0x20 [ 94.342509] [<c04cd641>] skb_queue_purge+0x14/0x1b [ 94.342509] [<fab94fdc>] l2cap_recv_frame+0xea1/0x115a [l2cap] [ 94.342509] [<c024b5df>] ? __lock_acquire+0xb57/0xb73 [ 94.342509] [<c0249c04>] ? mark_lock+0x1e/0x1c7 [ 94.342509] [<f8364963>] ? hci_rx_task+0xd2/0x1bc [bluetooth] [ 94.342509] [<fab95346>] l2cap_recv_acldata+0xb1/0x1c6 [l2cap] [ 94.342509] [<f8364997>] hci_rx_task+0x106/0x1bc [bluetooth] [ 94.342509] [<fab95295>] ? l2cap_recv_acldata+0x0/0x1c6 [l2cap] [ 94.342509] [<c02302c4>] tasklet_action+0x69/0xc1 [ 94.342509] [<c022fbef>] __do_softirq+0x94/0x11e [ 94.342509] [<c022fcaf>] do_softirq+0x36/0x5a [ 94.342509] [<c022fe14>] irq_exit+0x35/0x68 [ 94.342509] [<c0204ced>] do_IRQ+0x72/0x89 [ 94.342509] [<c02038ee>] common_interrupt+0x2e/0x34 [ 94.342509] [<c024007b>] ? pm_qos_add_requirement+0x63/0x9d [ 94.342509] [<c038e8a5>] ? acpi_idle_enter_bm+0x209/0x238 [ 94.342509] [<c049d238>] cpuidle_idle_call+0x5c/0x94 [ 94.342509] [<c02023f8>] cpu_idle+0x4e/0x6f [ 94.342509] [<c0534153>] rest_init+0x53/0x55 [ 94.342509] [<c0781894>] start_kernel+0x2f0/0x2f5 [ 94.342509] [<c0781091>] i386_start_kernel+0x91/0x96 Reported-by: Oliver Hartkopp <oliver@hartkopp.net> Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Tested-by: Oliver Hartkopp <oliver@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | bluetooth: scheduling while atomic bug fixDave Young2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to driver core changes dev_set_drvdata will call kzalloc which should be in might_sleep context, but hci_conn_add will be called in atomic context Like dev_set_name move dev_set_drvdata to work queue function. oops as following: Oct 2 17:41:59 darkstar kernel: [ 438.001341] BUG: sleeping function called from invalid context at mm/slqb.c:1546 Oct 2 17:41:59 darkstar kernel: [ 438.001345] in_atomic(): 1, irqs_disabled(): 0, pid: 2133, name: sdptool Oct 2 17:41:59 darkstar kernel: [ 438.001348] 2 locks held by sdptool/2133: Oct 2 17:41:59 darkstar kernel: [ 438.001350] #0: (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.+.}, at: [<faa1d2f5>] lock_sock+0xa/0xc [l2cap] Oct 2 17:41:59 darkstar kernel: [ 438.001360] #1: (&hdev->lock){+.-.+.}, at: [<faa20e16>] l2cap_sock_connect+0x103/0x26b [l2cap] Oct 2 17:41:59 darkstar kernel: [ 438.001371] Pid: 2133, comm: sdptool Not tainted 2.6.31-mm1 #2 Oct 2 17:41:59 darkstar kernel: [ 438.001373] Call Trace: Oct 2 17:41:59 darkstar kernel: [ 438.001381] [<c022433f>] __might_sleep+0xde/0xe5 Oct 2 17:41:59 darkstar kernel: [ 438.001386] [<c0298843>] __kmalloc+0x4a/0x15a Oct 2 17:41:59 darkstar kernel: [ 438.001392] [<c03f0065>] ? kzalloc+0xb/0xd Oct 2 17:41:59 darkstar kernel: [ 438.001396] [<c03f0065>] kzalloc+0xb/0xd Oct 2 17:41:59 darkstar kernel: [ 438.001400] [<c03f04ff>] device_private_init+0x15/0x3d Oct 2 17:41:59 darkstar kernel: [ 438.001405] [<c03f24c5>] dev_set_drvdata+0x18/0x26 Oct 2 17:41:59 darkstar kernel: [ 438.001414] [<fa51fff7>] hci_conn_init_sysfs+0x40/0xd9 [bluetooth] Oct 2 17:41:59 darkstar kernel: [ 438.001422] [<fa51cdc0>] ? hci_conn_add+0x128/0x186 [bluetooth] Oct 2 17:41:59 darkstar kernel: [ 438.001429] [<fa51ce0f>] hci_conn_add+0x177/0x186 [bluetooth] Oct 2 17:41:59 darkstar kernel: [ 438.001437] [<fa51cf8a>] hci_connect+0x3c/0xfb [bluetooth] Oct 2 17:41:59 darkstar kernel: [ 438.001442] [<faa20e87>] l2cap_sock_connect+0x174/0x26b [l2cap] Oct 2 17:41:59 darkstar kernel: [ 438.001448] [<c04c8df5>] sys_connect+0x60/0x7a Oct 2 17:41:59 darkstar kernel: [ 438.001453] [<c024b703>] ? lock_release_non_nested+0x84/0x1de Oct 2 17:41:59 darkstar kernel: [ 438.001458] [<c028804b>] ? might_fault+0x47/0x81 Oct 2 17:41:59 darkstar kernel: [ 438.001462] [<c028804b>] ? might_fault+0x47/0x81 Oct 2 17:41:59 darkstar kernel: [ 438.001468] [<c033361f>] ? __copy_from_user_ll+0x11/0xce Oct 2 17:41:59 darkstar kernel: [ 438.001472] [<c04c9419>] sys_socketcall+0x82/0x17b Oct 2 17:41:59 darkstar kernel: [ 438.001477] [<c020329d>] syscall_call+0x7/0xb Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: fix TCP_DEFER_ACCEPT retrans calculationJulian Anastasov2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix TCP_DEFER_ACCEPT conversion between seconds and retransmission to match the TCP SYN-ACK retransmission periods because the time is converted to such retransmissions. The old algorithm selects one more retransmission in some cases. Allow up to 255 retransmissions. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPTJulian Anastasov2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT users to not retransmit SYN-ACKs during the deferring period if ACK from client was received. The goal is to reduce traffic during the deferring period. When the period is finished we continue with sending SYN-ACKs (at least one) but this time any traffic from client will change the request to established socket allowing application to terminate it properly. Also, do not drop acked request if sending of SYN-ACK fails. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: accept socket after TCP_DEFER_ACCEPT periodJulian Anastasov2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Willy Tarreau and many other folks in recent years were concerned what happens when the TCP_DEFER_ACCEPT period expires for clients which sent ACK packet. They prefer clients that actively resend ACK on our SYN-ACK retransmissions to be converted from open requests to sockets and queued to the listener for accepting after the deferring period is finished. Then application server can decide to wait longer for data or to properly terminate the connection with FIN if read() returns EAGAIN which is an indication for accepting after the deferring period. This change still can have side effects for applications that expect always to see data on the accepted socket. Others can be prepared to work in both modes (with or without TCP_DEFER_ACCEPT period) and their data processing can ignore the read=EAGAIN notification and to allocate resources for clients which proved to have no data to send during the deferring period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not as a timeout) to wait for data will notice clients that didn't send data for 3 seconds but that still resend ACKs. Thanks to Willy Tarreau for the initial idea and to Eric Dumazet for the review and testing the change. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | Revert "tcp: fix tcp_defer_accept to consider the timeout"David S. Miller2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 6d01a026b7d3009a418326bdcf313503a314f1ea. Julian Anastasov, Willy Tarreau and Eric Dumazet have come up with a more correct way to deal with this. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | AF_UNIX: Fix deadlock on connecting to shutdown socketTomoki Sekiyama2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found a deadlock bug in UNIX domain socket, which makes able to DoS attack against the local machine by non-root users. How to reproduce: 1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct namespace(*), and shutdown(2) it. 2. Repeat connect(2)ing to the listening socket from the other sockets until the connection backlog is full-filled. 3. connect(2) takes the CPU forever. If every core is taken, the system hangs. PoC code: (Run as many times as cores on SMP machines.) int main(void) { int ret; int csd; int lsd; struct sockaddr_un sun; /* make an abstruct name address (*) */ memset(&sun, 0, sizeof(sun)); sun.sun_family = PF_UNIX; sprintf(&sun.sun_path[1], "%d", getpid()); /* create the listening socket and shutdown */ lsd = socket(AF_UNIX, SOCK_STREAM, 0); bind(lsd, (struct sockaddr *)&sun, sizeof(sun)); listen(lsd, 1); shutdown(lsd, SHUT_RDWR); /* connect loop */ alarm(15); /* forcely exit the loop after 15 sec */ for (;;) { csd = socket(AF_UNIX, SOCK_STREAM, 0); ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun)); if (-1 == ret) { perror("connect()"); break; } puts("Connection OK"); } return 0; } (*) Make sun_path[0] = 0 to use the abstruct namespace. If a file-based socket is used, the system doesn't deadlock because of context switches in the file system layer. Why this happens: Error checks between unix_socket_connect() and unix_wait_for_peer() are inconsistent. The former calls the latter to wait until the backlog is processed. Despite the latter returns without doing anything when the socket is shutdown, the former doesn't check the shutdown state and just retries calling the latter forever. Patch: The patch below adds shutdown check into unix_socket_connect(), so connect(2) to the shutdown socket will return -ECONREFUSED. Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> Signed-off-by: Masanori Yoshida <masanori.yoshida.tv@hitachi.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | ethoc: clear only pending irqsThomas Chou2009-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixed the problem of dropped packets due to lost of interrupt requests. We should only clear what was pending at the moment we read the irq source reg. Signed-off-by: Thomas Chou <thomas@wytron.com.tw> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | ethoc: inline regs accessThomas Chou2009-10-19
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Thomas Chou <thomas@wytron.com.tw> Signed-off-by: David S. Miller <davem@davemloft.net>