|  | Commit message (Collapse) | Author | Age | 
|---|
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Commit 65a64464349883891e21e74af16c05d6e1eeb4e9 ("HWPOISON: Allow
schedule_on_each_cpu() from keventd") which allows schedule_on_each_cpu()
to be called from keventd added a race condition.  schedule_on_each_cpu()
may race with cpu hotplug and end up executing the function twice on a
cpu.
Fix it by moving direct execution into the section protected with
get/put_online_cpus().  While at it, update code such that direct
execution is done after works have been scheduled for all other cpus and
drop unnecessary cpu != orig test from flush loop.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | 
| |\  
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  highmem: Fix debug_kmap_atomic() to also handle KM_IRQ_PTE, KM_NMI, and KM_NMI_PTE
  highmem: Fix race in debug_kmap_atomic() which could cause warn_count to underflow
  rcu: Fix long-grace-period race between forcing and initialization
  uids: Prevent tear down race | 
| | | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | | Very long RCU read-side critical sections (50 milliseconds or
so) can cause a race between force_quiescent_state() and
rcu_start_gp() as follows on kernel builds with multi-level
rcu_node hierarchies:
1.	CPU 0 calls force_quiescent_state(), sees that there is a
	grace period in progress, and acquires ->fsqlock.
2.	CPU 1 detects the end of the grace period, and so
	cpu_quiet_msk_finish() sets rsp->completed to rsp->gpnum.
	This operation is carried out under the root rnp->lock,
	but CPU 0 has not yet acquired that lock.  Note that
	rsp->signaled is still RCU_SAVE_DYNTICK from the last
	grace period.
3.	CPU 1 calls rcu_start_gp(), but no one wants a new grace
	period, so it drops the root rnp->lock and returns.
4.	CPU 0 acquires the root rnp->lock and picks up rsp->completed
	and rsp->signaled, then drops rnp->lock.  It then enters the
	RCU_SAVE_DYNTICK leg of the switch statement.
5.	CPU 2 invokes call_rcu(), and now needs a new grace period.
	It calls rcu_start_gp(), which acquires the root rnp->lock, sets
	rsp->signaled to RCU_GP_INIT (too bad that CPU 0 is already in
	the RCU_SAVE_DYNTICK leg of the switch statement!)  and starts
	initializing the rcu_node hierarchy.  If there are multiple
	levels to the hierarchy, it will drop the root rnp->lock and
	initialize the lower levels of the hierarchy.
6.	CPU 0 notes that rsp->completed has not changed, which permits
        both CPU 2 and CPU 0 to try updating it concurrently.  If CPU 0's
	update prevails, later calls to force_quiescent_state() can
	count old quiescent states against the new grace period, which
	can in turn result in premature ending of grace periods.
	Not good.
This patch adds an RCU_GP_IDLE state for rsp->signaled that is
set initially at boot time and any time a grace period ends.
This prevents CPU 0 from getting into the workings of
force_quiescent_state() in step 4.  Additional locking and
checks prevent the concurrent update of rsp->signaled in step 6.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1256742889199-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | 
| | | Ingo triggered the following warning:
WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
Hardware name: System Product Name
ODEBUG: init active object type: timer_list
Modules linked in:
Pid: 2619, comm: dmesg Tainted: G        W  2.6.32-rc5-tip+ #5298
Call Trace:
 [<81035443>] warn_slowpath_common+0x6a/0x81
 [<8120e483>] ? debug_print_object+0x42/0x50
 [<81035498>] warn_slowpath_fmt+0x29/0x2c
 [<8120e483>] debug_print_object+0x42/0x50
 [<8120ec2a>] __debug_object_init+0x279/0x2d7
 [<8120ecb3>] debug_object_init+0x13/0x18
 [<810409d2>] init_timer_key+0x17/0x6f
 [<81041526>] free_uid+0x50/0x6c
 [<8104ed2d>] put_cred_rcu+0x61/0x72
 [<81067fac>] rcu_do_batch+0x70/0x121
debugobjects warns about an enqueued timer being initialized. If
CONFIG_USER_SCHED=y the user management code uses delayed work to
remove the user from the hash table and tear down the sysfs objects.
free_uid is called from RCU and initializes/schedules delayed work if
the usage count of the user_struct is 0. The init/schedule happens
outside of the uidhash_lock protected region which allows a concurrent
caller of find_user() to reference the about to be destroyed
user_struct w/o preventing the work from being scheduled. If the next
free_uid call happens before the work timer expired then the active
timer is initialized and the work scheduled again.
The race was introduced in commit 5cb350ba (sched: group scheduling,
sysfs tunables) and made more prominent by commit 3959214f (sched:
delayed cleanup of user_struct)
Move the init/schedule_delayed_work inside of the uidhash_lock
protected region to prevent the race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: stable@kernel.org | 
| |\ \  
| | | 
| | | 
| | | 
| | | 
| | | 
| | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: try_one_irq() must be called with irq disabled | 
| | | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | | Prarit reported:
=================================
[ INFO: inconsistent lock state ]
2.6.32-rc5 #1
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
 (&irq_desc_lock_class){?.-...}, at: [<ffffffff810c264e>] try_one_irq+0x32/0x138
{IN-HARDIRQ-W} state was registered at:
 [<ffffffff81095160>] __lock_acquire+0x2fc/0xd5d
 [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d
 [<ffffffff814cdadd>] _spin_lock+0x40/0x89
 [<ffffffff810c3389>] handle_level_irq+0x30/0x105
 [<ffffffff81014e0e>] handle_irq+0x95/0xb7
 [<ffffffff810141bd>] do_IRQ+0x6a/0xe0
 [<ffffffff81012813>] ret_from_intr+0x0/0x16
irq event stamp: 195096
hardirqs last  enabled at (195096): [<ffffffff814cd7f7>] _spin_unlock_irq+0x3a/0x5c
hardirqs last disabled at (195095): [<ffffffff814cdbdd>] _spin_lock_irq+0x29/0x95
softirqs last  enabled at (195088): [<ffffffff81068c92>] __do_softirq+0x1c1/0x1ef
softirqs last disabled at (195093): [<ffffffff8101304c>] call_softirq+0x1c/0x30
other info that might help us debug this:
1 lock held by swapper/0:
 #0:  (kernel/irq/spurious.c:21){+.-...}, at: [<ffffffff81070cf2>]
run_timer_softirq+0x1a9/0x315
stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.32-rc5 #1
Call Trace:
 <IRQ>  [<ffffffff81093e94>] valid_state+0x187/0x1ae
 [<ffffffff81093fe4>] mark_lock+0x129/0x253
 [<ffffffff810951d4>] __lock_acquire+0x370/0xd5d
 [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d
 [<ffffffff814cdadd>] _spin_lock+0x40/0x89
 [<ffffffff810c264e>] try_one_irq+0x32/0x138
 [<ffffffff810c2795>] poll_all_shared_irqs+0x41/0x6d
 [<ffffffff810c27dd>] poll_spurious_irqs+0x1c/0x49
 [<ffffffff81070d82>] run_timer_softirq+0x239/0x315
 [<ffffffff81068bd3>] __do_softirq+0x102/0x1ef
 [<ffffffff8101304c>] call_softirq+0x1c/0x30
 [<ffffffff81014b65>] do_softirq+0x59/0xca
 [<ffffffff810686ad>] irq_exit+0x58/0xae
 [<ffffffff81029b84>] smp_apic_timer_interrupt+0x94/0xba
 [<ffffffff81012a33>] apic_timer_interrupt+0x13/0x20
The reason is that try_one_irq() is called from hardirq context with
interrupts disabled and from softirq context (poll_all_shared_irqs())
with interrupts enabled.
Disable interrupts before calling it from poll_all_shared_irqs().
Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
LKML-Reference: <1257563773-4620-1-git-send-email-yong.zhang0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> | 
| | | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | | root_task_group_empty is used only with FAIR_GROUP_SCHED
so if we use other scheduler options we get:
  kernel/sched.c:314: warning: 'root_task_group_empty' defined but not used
So move CONFIG_FAIR_GROUP_SCHED up that it covers
root_task_group_empty().
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091026192414.GB5321@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| |/ /  
| |   
| |   
| |   
| |   
| |   
| |   
| |   
| |   
| |   
| |   
| |   
| |   
| | | Fix variable name in sched.c kernel-doc notation.
Fixes this DocBook warning:
 Warning(kernel/sched.c:2008): No description found for parameter
 'p' Warning(kernel/sched.c:2008): Excess function parameter 'k'
 description in 'kthread_bind'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <4AF4B1BC.8020604@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| |\ \  
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c
  sched: Disable SD_PREFER_LOCAL at node level
  sched: Fix boot crash by zalloc()ing most of the cpu masks
  sched: Strengthen buddies and mitigate buddy induced latencies | 
| | | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | | Eric Paris reported that commit
f685ceacab07d3f6c236f04803e2f2f0dbcc5afb causes boot time
PREEMPT_DEBUG complaints.
 [    4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
 [    4.593043] caller is task_hot+0x86/0xd0
Since kthread_bind() messes with scheduler internals, move the
body to sched.c, and lock the runqueue.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Tested-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256813310.7574.3.camel@marge.simson.net>
[ v2: fix !SMP build and clean up ]
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | | I got a boot crash when forcing cpumasks offstack on 32 bit,
because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask
wasn't zeroed).
AFAICT the others need to be zeroed too: only
nohz.ilb_grp_nohz_mask is initialized before use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <200911022037.21282.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | 
| | | | This patch restores the effectiveness of LAST_BUDDY in preventing
pgsql+oltp from collapsing due to wakeup preemption. It also
switches LAST_BUDDY to exclusively do what it does best, namely
mitigate the effects of aggressive wakeup preemption, which
improves vmark throughput markedly, and restores mysql+oltp
scalability.
Since buddies are about scalability, enable them beginning at the
point where we begin expanding sched_latency, namely
sched_nr_latency. Previously, buddies were cleared aggressively,
which seriously reduced their effectiveness. Not clearing
aggressively however, produces a small drop in mysql+oltp
throughput immediately after peak, indicating that LAST_BUDDY is
actually doing some harm. This is right at the point where X on the
desktop in competition with another load wants low latency service.
Ergo, do not enable until we need to scale.
To mitigate latency induced by buddies, or by a task just missing
wakeup preemption, check latency at tick time.
Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via
CACHE_HOT_BUDDY.
Supporting performance tests:
 tip   = v2.6.32-rc5-1497-ga525b32
 tipx  = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
 tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs
(Three run averages except where noted.)
 vmark:
 ------
 tip           108466 messages per second
 tip+          125307 messages per second
 tip+x         125335 messages per second
 tipx          117781 messages per second
 2.6.31.3      122729 messages per second
 mysql+oltp:
 -----------
 clients          1        2        4        8       16       32       64        128    256
 ..........................................................................................
 tip        9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
 tip+      10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
 tipx       9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
 2.6.31.3   8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86
 pgsql+oltp:
 -----------
 clients          1        2        4        8       16       32       64      128      256
 ..........................................................................................
 tip       13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
 tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
 tip+x     13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
 tipx      13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
 2.6.31.3  13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| |\ \ \  
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ftrace: Fix unmatched locking in ftrace_regex_write()
  ring-buffer: Synchronize resizing buffer with reader lock | 
| | | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | | When a command is passed to the set_ftrace_filter, then
the ftrace_regex_lock is still held going back to user space.
 # echo 'do_open : foo' > set_ftrace_filter
 (still holding ftrace_regex_lock when returning to user space!)
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4AEF7F8A.3080300@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> | 
| | | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | | We got a sudden panic when we reduced the size of the
ringbuffer.
We can reproduce the panic by the following steps:
echo 1 > events/sched/enable
cat trace_pipe > /dev/null &
while ((1))
do
echo 12000 > buffer_size_kb
echo 512 > buffer_size_kb
done
(not more than 5 seconds, panic ...)
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4AF01735.9060409@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> | 
| |\ \ \ \  
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: Remove some debug messages producing too much noise
  PM: Fix warning on suspend errors
  PM / Hibernate: Add newline to load_image() fail path
  PM / Hibernate: Fix error handling in save_image()
  PM / Hibernate: Fix blkdev refleaks
  PM / yenta: Split resume into early and late parts (rev. 4) | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | Finish a line by \n when load_image fails in the middle of loading.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | There are too many retval variables in save_image(). Thus error return
value from snapshot_read_next() may be ignored and only part of the
snapshot (successfully) written.
Remove 'error' variable, invert the condition in the do-while loop
and convert the loop to use only 'ret' variable.
Switch the rest of the function to consider only 'ret'.
Also make sure we end printed line by \n if an error occurs.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | While cruising through the swsusp code I found few blkdev reference
leaks of resume_bdev.
swsusp_read: remove blkdev_put altogether. Some fail paths do
             not do that.
swsusp_check: make sure we always put a reference on fail paths
software_resume: all fail paths between swsusp_check and swsusp_read
                 omit swsusp_close. Add it in those cases. And since
                 swsusp_read doesn't drop the reference anymore, do
                 it here unconditionally.
[rjw: Fixed a small coding style issue.]
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> | 
| |/ / / /  
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | |   
| | | | | nr_processes() returns the sum of the per cpu counter process_counts for
all online CPUs. This counter is incremented for the current CPU on
fork() and decremented for the current CPU on exit(). Since a process
does not necessarily fork and exit on the same CPU the process_count for
an individual CPU can be either positive or negative and effectively has
no meaning in isolation.
Therefore calculating the sum of process_counts over only the online
CPUs omits the processes which were started or stopped on any CPU which
has since been unplugged. Only the sum of process_counts across all
possible CPUs has meaning.
The only caller of nr_processes() is proc_root_getattr() which
calculates the number of links to /proc as
        stat->nlink = proc_root.nlink + nr_processes();
You don't have to be all that unlucky for the nr_processes() to return a
negative value leading to a negative number of links (or rather, an
apparently enormous number of links). If this happens then you can get
failures where things like "ls /proc" start to fail because they got an
-EOVERFLOW from some stat() call.
Example with some debugging inserted to show what goes on:
        # ps haux|wc -l
        nr_processes: CPU0:     90
        nr_processes: CPU1:     1030
        nr_processes: CPU2:     -900
        nr_processes: CPU3:     -136
        nr_processes: TOTAL:    84
        proc_root_getattr. nlink 12 + nr_processes() 84 = 96
        84
        # echo 0 >/sys/devices/system/cpu/cpu1/online
        # ps haux|wc -l
        nr_processes: CPU0:     85
        nr_processes: CPU2:     -901
        nr_processes: CPU3:     -137
        nr_processes: TOTAL:    -953
        proc_root_getattr. nlink 12 + nr_processes() -953 = -941
        75
        # stat /proc/
        nr_processes: CPU0:     84
        nr_processes: CPU2:     -901
        nr_processes: CPU3:     -137
        nr_processes: TOTAL:    -954
        proc_root_getattr. nlink 12 + nr_processes() -954 = -942
          File: `/proc/'
          Size: 0               Blocks: 0          IO Block: 1024   directory
        Device: 3h/3d   Inode: 1           Links: 4294966354
        Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
        Access: 2009-11-03 09:06:55.000000000 +0000
        Modify: 2009-11-03 09:06:55.000000000 +0000
        Change: 2009-11-03 09:06:55.000000000 +0000
I'm not 100% convinced that the per_cpu regions remain valid for offline
CPUs, although my testing suggests that they do. If not then I think the
correct solution would be to aggregate the process_count for a given CPU
into a global base value in cpu_down().
This bug appears to pre-date the transition to git and it looks like it
may even have been present in linux-2.6.0-test7-bk3 since it looks like
the code Rusty patched in http://lwn.net/Articles/64773/ was already
wrong.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | 
| |\ \ \ \  
| | |_|/  
| |/| |   
| | | |   
| | | |   
| | | |   
| | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: Fix spurious wakeup for requeue_pi really | 
| | | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | 
| | | | | The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
NULL test) nor does it use the wake_list of futex_wake() which where
the reason for commit 41890f2 (futex: Handle spurious wake up)
See debugging discussing on LKML Message-ID: <4AD4080C.20703@us.ibm.com>
The changes in this fix to the wait_requeue_pi path were considered to
be a likely unecessary, but harmless safety net. But it turns out that
due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined
as EAGAIN we built an endless loop in the code path which returns
correctly EWOULDBLOCK.
Spurious wakeups in wait_requeue_pi code path are unlikely so we do
the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let
it deal with the spurious wakeup.
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
LKML-Reference: <4AE23C74.1090502@us.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> | 
| |\ \ \ \  
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf tools: Remove -Wcast-align
  perf tools: Fix compatibility with libelf 0.8 and autodetect
  perf events: Don't generate events for the idle task when exclude_idle is set
  perf events: Fix swevent hrtimer sampling by keeping track of remaining time when enabling/disabling swevent hrtimers | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | Getting samples for the idle task is often not interesting, so
don't generate them when exclude_idle is set for the event in
question.
Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <ye8pr8fmlq7.fsf@camel16.daimi.au.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | when enabling/disabling swevent hrtimers
Make the hrtimer based events work for sysprof.
Whenever a swevent is scheduled out, the hrtimer is canceled.
When it is scheduled back in, the timer is restarted. This
happens every scheduler tick, which means the timer never
expired because it was getting repeatedly restarted over and
over with the same period.
To fix that, save the remaining time when disabling; when
reenabling, use that saved time as the period instead of the
user-specified sampling period.
Also, move the starting and stopping of the hrtimers to helper
functions instead of duplicating the code.
Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
LKML-Reference: <ye8vdi7mluz.fsf@camel16.daimi.au.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| |\ \ \ \ \  
| | |_|/ /  
| |/| | |   
| | | | |   
| | | | |   
| | | | |   
| | | | |   
| | | | |   
| | | | |   
| | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Remove cpu arg from the rb_time_stamp() function
  tracing: Fix comment typo and documentation example
  tracing: Fix trace_seq_printf() return value
  tracing: Update *ppos instead of filp->f_pos | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | The cpu argument is not used inside the rb_time_stamp() function.
Plus fix a typo.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233647.118547500@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | Trivial patch to fix a documentation example and to fix a
comment.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233646.871719877@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | trace_seq_printf() return value is a little ambiguous. It
currently returns the length of the space available in the
buffer. printf usually returns the amount written. This is not
adequate here, because:
  trace_seq_printf(s, "");
is perfectly legal, and returning 0 would indicate that it
failed.
We can always see the amount written by looking at the before
and after values of s->len. This is not quite the same use as
printf. We only care if the string was successfully written to
the buffer or not.
Make trace_seq_printf() return 0 if the trace oversizes the
buffer's free space, 1 otherwise.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233646.631787612@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | 
| | | | | | Instead of directly updating filp->f_pos we should update the *ppos
argument. The filp->f_pos gets updated within the file_pos_write()
function called from sys_write().
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233646.399670810@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| |\ \ \ \ \  
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  sched: move rq_weight data array out of .percpu
  percpu: allow pcpu_alloc() to be called with IRQs off | 
| | | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | 
| | | | | | | Commit 34d76c41 introduced percpu array update_shares_data, size of which
being proportional to NR_CPUS. Unfortunately this blows up ia64 for large
NR_CPUS configuration, as ia64 allows only 64k for .percpu section.
Fix this by allocating this array dynamically and keep only pointer to it
percpu.
The per-cpu handling doesn't impose significant performance penalty on
potentially contented path in tg_shares_up().
...
ffffffff8104337c:       65 48 8b 14 25 20 cd    mov    %gs:0xcd20,%rdx
ffffffff81043383:       00 00
ffffffff81043385:       48 c7 c0 00 e1 00 00    mov    $0xe100,%rax
ffffffff8104338c:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
ffffffff81043393:       00
ffffffff81043394:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
ffffffff8104339b:       00
ffffffff8104339c:       48 01 d0                add    %rdx,%rax
ffffffff8104339f:       49 8d 94 24 08 01 00    lea    0x108(%r12),%rdx
ffffffff810433a6:       00
ffffffff810433a7:       b9 ff ff ff ff          mov    $0xffffffff,%ecx
ffffffff810433ac:       48 89 45 b0             mov    %rax,-0x50(%rbp)
ffffffff810433b0:       bb 00 04 00 00          mov    $0x400,%ebx
ffffffff810433b5:       48 89 55 c0             mov    %rdx,-0x40(%rbp)
...
After:
...
ffffffff8104337c:       65 8b 04 25 28 cd 00    mov    %gs:0xcd28,%eax
ffffffff81043383:       00
ffffffff81043384:       48 98                   cltq
ffffffff81043386:       49 8d bc 24 08 01 00    lea    0x108(%r12),%rdi
ffffffff8104338d:       00
ffffffff8104338e:       48 8b 15 d3 7f 76 00    mov    0x767fd3(%rip),%rdx        # ffffffff817ab368 <update_shares_data>
ffffffff81043395:       48 8b 34 c5 00 ee 6d    mov    -0x7e921200(,%rax,8),%rsi
ffffffff8104339c:       81
ffffffff8104339d:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
ffffffff810433a4:       00
ffffffff810433a5:       b9 ff ff ff ff          mov    $0xffffffff,%ecx
ffffffff810433aa:       48 89 7d c0             mov    %rdi,-0x40(%rbp)
ffffffff810433ae:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
ffffffff810433b5:       00
ffffffff810433b6:       bb 00 04 00 00          mov    $0x400,%ebx
ffffffff810433bb:       48 01 f2                add    %rsi,%rdx
ffffffff810433be:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
...
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Tejun Heo <tj@kernel.org> | 
| |\ \ \ \ \ \  
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixes:
  param: fix setting arrays of bool
  param: fix NULL comparison on oom
  param: fix lots of bugs with writing charp params from sysfs, by leaking mem. | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | We create a dummy struct kernel_param on the stack for parsing each
array element, but we didn't initialize the flags word.  This matters
for arrays of type "bool", where the flag indicates if it really is
an array of bools or unsigned int (old-style).
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | kp->arg is always true: it's the contents of that pointer we care about.
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org | 
| | | |/ / / /  
| |/| | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | |   
| | | | | | | e180a6b7759a "param: fix charp parameters set via sysfs" fixed the case
where charp parameters written via sysfs were freed, leaving drivers
accessing random memory.
Unfortunately, storing a flag in the kparam struct was a bad idea: it's
rodata so setting it causes an oops on some archs.  But that's not all:
1) module_param_array() on charp doesn't work reliably, since we use an
   uninitialized temporary struct kernel_param.
2) there's a fundamental race if a module uses this parameter and then
   it's changed: they will still access the old, freed, memory.
The simplest fix (ie. for 2.6.32) is to never free the memory.  This
prevents all these problems, at cost of a memory leak.  In practice, there
are only 18 places where a charp is writable via sysfs, and all are
root-only writable.
Reported-by: Takashi Iwai <tiwai@suse.de>
Cc: Sitsofe Wheeler <sitsofe@yahoo.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org | 
| |\ \ \ \ \ \  
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
  HWPOISON: fix invalid page count in printk output
  HWPOISON: Allow schedule_on_each_cpu() from keventd
  HWPOISON: fix/proc/meminfo alignment
  HWPOISON: fix oops on ksm pages
  HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page
  HWPOISON: return early on non-LRU pages
  HWPOISON: Add brief hwpoison description to Documentation
  HWPOISON: Clean up PR_MCE_KILL interface | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | Right now when calling schedule_on_each_cpu() from keventd there
is a deadlock because it tries to schedule a work item on the current CPU
too. This happens via lru_add_drain_all() in hwpoison.
Just call the function for the current CPU in this case. This is actually
faster too.
Debugging with Fengguang Wu & Max Asbock
Signed-off-by: Andi Kleen <ak@linux.intel.com> | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | While writing the manpage I noticed some shortcomings in the
current interface.
- Define symbolic names for all the different values
- Boundary check the kill mode values
- For symmetry add a get interface too. This allows library
code to get/set the current state.
- For consistency define a PR_MCE_KILL_DEFAULT value
Signed-off-by: Andi Kleen <ak@linux.intel.com> | 
| |\ \ \ \ \ \ \  
| | |_|_|_|/ /  
| |/| | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: Move drop_futex_key_refs out of spinlock'ed region
  rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang
  rcu: Stopgap fix for synchronize_rcu_expedited() for TREE_PREEMPT_RCU
  rcu: Prevent RCU IPI storms in presence of high call_rcu() load
  futex: Check for NULL keys in match_futex
  futex: Handle spurious wake up | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | When requeuing tasks from one futex to another, the reference held
by the requeued task to the original futex location needs to be
dropped eventually.
Dropping the reference may ultimately lead to a call to
"iput_final" and subsequently call into filesystem- specific code -
which may be non-atomic.
It is therefore safer to defer this drop operation until after the
futex_hash_bucket spinlock has been dropped.
Originally-From: Helge Bahmann <hcb@chaoticmind.net>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: <stable@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
Cc: Sven-Thorsten Dietrich <sdietrich@novell.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <4AD7A298.5040802@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | If the following sequence of events occurs, then
TREE_PREEMPT_RCU will hang waiting for a grace period to
complete, eventually OOMing the system:
o	A TREE_PREEMPT_RCU build of the kernel is booted on a system
	with more than 64 physical CPUs present (32 on a 32-bit system).
	Alternatively, a TREE_PREEMPT_RCU build of the kernel is booted
	with RCU_FANOUT set to a sufficiently small value that the
	physical CPUs populate two or more leaf rcu_node structures.
o	A task is preempted in an RCU read-side critical section
	while running on a CPU corresponding to a given leaf rcu_node
	structure.
o	All CPUs corresponding to this same leaf rcu_node structure
	record quiescent states for the current grace period.
o	All of these same CPUs go offline (hence the need for enough
	physical CPUs to populate more than one leaf rcu_node structure).
	This causes the preempted task to be moved to the root rcu_node
	structure.
At this point, there is nothing left to cause the quiescent
state to be propagated up the rcu_node tree, so the current
grace period never completes.
The simplest fix, especially after considering the deadlock
possibilities, is to detect this situation when the last CPU is
offlined, and to set that CPU's ->qsmask bit in its leaf
rcu_node structure.  This will cause the next invocation of
force_quiescent_state() to end the grace period.
Without this fix, this hang can be triggered in an hour or so on
some machines with rcutorture and random CPU onlining/offlining.
With this fix, these same machines pass a full 10 hours of this
sort of abuse.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <20091015162614.GA19131@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | For the short term, map synchronize_rcu_expedited() to
synchronize_rcu() for TREE_PREEMPT_RCU and to
synchronize_sched_expedited() for TREE_RCU.
Longer term, there needs to be a real expedited grace period for
TREE_PREEMPT_RCU, but candidate patches to date are considerably
more complex and intrusive.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: npiggin@suse.de
Cc: jens.axboe@oracle.com
LKML-Reference: <12555405592331-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | As the number of callbacks on a given CPU rises, invoke
force_quiescent_state() only every blimit number of callbacks
(defaults to 10,000), and even then only if no other CPU has
invoked force_quiescent_state() in the meantime.
This should fix the performance regression reported by Nick.
Reported-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: jens.axboe@oracle.com
LKML-Reference: <12555405592133-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | If userspace tries to perform a requeue_pi on a non-requeue_pi waiter,
it will find the futex_q->requeue_pi_key to be NULL and OOPS.
Check for NULL in match_futex() instead of doing explicit NULL pointer
checks on all call sites.  While match_futex(NULL, NULL) returning
false is a little odd, it's still correct as we expect valid key
references.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dinakar Guniguntala <dino@in.ibm.com>
CC: John Stultz <johnstul@us.ibm.com>
Cc: stable@kernel.org
LKML-Reference: <4AD60687.10306@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | The futex code does not handle spurious wake up in futex_wait and
futex_wait_requeue_pi.
The code assumes that any wake up which was not caused by futex_wake /
requeue or by a timeout was caused by a signal wake up and returns one
of the syscall restart error codes.
In case of a spurious wake up the signal delivery code which deals
with the restart error codes is not invoked and we return that error
code to user space. That causes applications which actually check the
return codes to fail. Blaise reported that on preempt-rt a python test
program run into a exception trap. -rt exposed that due to a built in
spurious wake up accelerator :)
Solve this by checking signal_pending(current) in the wake up path and
handle the spurious wake up case w/o returning to user space.
Reported-by: Blaise Gassend <blaise@willowgarage.com>
Debugged-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
LKML-Reference: <new-submission> | 
| |\ \ \ \ \ \ \  
| | |_|_|_|/ /  
| |/| | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf timechart: Improve the visual appearance of scheduler delays
  perf timechart: Fix the wakeup-arrows that point to non-visible processes
  perf top: Fix --delay_secs 0 division by zero
  perf tools: Bump version to 0.0.2
  perf_event: Adjust frequency and unthrottle for non-group-leader events | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | The loop in perf_ctx_adjust_freq checks the frequency of sampling
event counters, and adjusts the event interval and unthrottles the
event if required, and resets the interrupt count for the event.
However, at present it only looks at group leaders.
This means that a sampling event that is not a group leader will
eventually get throttled, once its interrupt count reaches
sysctl_perf_event_sample_rate/HZ --- and that is guaranteed to
happen, if the event is active for long enough, since the interrupt
count never gets reset.  Once it is throttled it never gets
unthrottled, so it basically just stops working at that point.
This fixes it by making perf_ctx_adjust_freq use ctx->event_list
rather than ctx->group_list.  The existing spin_lock/spin_unlock
around the loop makes it unnecessary to put rcu_read_lock/
rcu_read_unlock around the list_for_each_entry_rcu().
Reported-by: Mark W. Krentel <krentel@cs.rice.edu>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <19157.26731.855609.165622@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu> | 
| |\ \ \ \ \ \ \  
| | |_|_|_|_|/  
| |/| | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | |   
| | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Do less agressive buddy clearing
  sched: Disable SD_PREFER_LOCAL for MC/CPU domains | 
| | | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | 
| | | | | | | | Yanmin reported a hackbench regression due to:
 > commit de69a80be32445b0a71e8e3b757e584d7beb90f7
 > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
 > Date:   Thu Sep 17 09:01:20 2009 +0200
 >
 >     sched: Stop buddies from hogging the system
I really liked de69a80b, and it affecting hackbench shows I wasn't
crazy ;-)
So hackbench is a multi-cast, with one sender spraying multiple
receivers, who in their turn don't spray back.
This would be exactly the scenario that patch 'cures'. Previously
we would not clear the last buddy after running the next task,
allowing the sender to get back to work sooner than it otherwise
ought to have been, increasing latencies for other tasks.
Now, since those receivers don't poke back, they don't enforce the
buddy relation, which means there's nothing to re-elect the sender.
Cure this by less agressively clearing the buddy stats. Only clear
buddies when they were not chosen. It should still avoid a buddy
sticking around long after its served its time.
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Galbraith <efault@gmx.de>
LKML-Reference: <1255084986.8802.46.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu> |