aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
...
| * | | | | | | | | | | | | | | sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()Kirill Tkhai2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently used hrtimer_try_to_cancel() is racy: raw_spin_lock(&rq->lock) ... dl_task_timer raw_spin_lock(&rq->lock) ... raw_spin_lock(&rq->lock) ... switched_from_dl() ... ... hrtimer_try_to_cancel() ... ... switched_to_fair() ... ... ... ... ... ... ... ... raw_spin_unlock(&rq->lock) ... (asquired) ... ... ... ... ... ... do_exit() ... ... schedule() ... ... raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock) ... ... ... raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock) ... ... (asquired) put_task_struct() ... ... free_task_struct() ... ... ... ... raw_spin_unlock(&rq->lock) ... (asquired) ... ... ... ... ... (use after free) ... So, let's implement 100% guaranteed way to cancel the timer and let's be sure we are safe even in very unlikely situations. rq unlocking does not limit the area of switched_from_dl() use, because this has already been possible in pull_dl_task() below. Let's consider the safety of of this unlocking. New code in the patch is working when hrtimer_try_to_cancel() fails. This means the callback is running. In this case hrtimer_cancel() is just waiting till the callback is finished. Two 1) Since we are in switched_from_dl(), new class is not dl_sched_class and new prio is not less MAX_DL_PRIO. So, the callback returns early; it's right after !dl_task() check. After that hrtimer_cancel() returns back too. The above is: raw_spin_lock(rq->lock); ... ... dl_task_timer() ... raw_spin_lock(rq->lock); switched_from_dl() ... hrtimer_try_to_cancel() ... raw_spin_unlock(rq->lock); ... hrtimer_cancel() ... ... raw_spin_unlock(rq->lock); ... return HRTIMER_NORESTART; ... ... raw_spin_lock(rq->lock); ... 2) But the below is also possible: dl_task_timer() raw_spin_lock(rq->lock); ... raw_spin_unlock(rq->lock); raw_spin_lock(rq->lock); ... switched_from_dl() ... hrtimer_try_to_cancel() ... ... return HRTIMER_NORESTART; raw_spin_unlock(rq->lock); ... hrtimer_cancel(); ... raw_spin_lock(rq->lock); ... In this case hrtimer_cancel() returns immediately. Very unlikely case, just to mention. Nobody can manipulate the task, because check_class_changed() is always called with pi_lock locked. Nobody can force the task to participate in (concurrent) priority inheritance schemes (the same reason). All concurrent task operations require pi_lock, which is held by us. No deadlocks with dl_task_timer() are possible, because it returns right after !dl_task() check (it does nothing). If we receive a new dl_task during the time of unlocked rq, we just don't have to do pull_dl_task() in switched_from_dl() further. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> [ Added comments] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched: Use WARN_ONCE for the might_sleep() TASK_RUNNING testPeter Zijlstra2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In some cases this can trigger a true flood of output. Requested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | audit, sched/wait: Fixup kauditd_thread() wait loopPeter Zijlstra2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kauditd_thread wait loop is a bit iffy; it has a number of problems: - calls try_to_freeze() before schedule(); you typically want the thread to re-evaluate the sleep condition when unfreezing, also freeze_task() issues a wakeup. - it unconditionally does the {add,remove}_wait_queue(), even when the sleep condition is false. Use wait_event_freezable() that does the right thing. Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Eric Paris <eparis@redhat.com> Cc: oleg@redhat.com Cc: Eric Paris <eparis@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141002102251.GA6324@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/wait: Fix a kthread race with wait_woken()Peter Zijlstra2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a race between kthread_stop() and the new wait_woken() that can result in a lack of progress. CPU 0 | CPU 1 | rfcomm_run() | kthread_stop() ... | if (!test_bit(KTHREAD_SHOULD_STOP)) | | set_bit(KTHREAD_SHOULD_STOP) | wake_up_process() wait_woken() | wait_for_completion() set_current_state(INTERRUPTIBLE) | if (!WQ_FLAG_WOKEN) | schedule_timeout() | | After which both tasks will wait.. forever. Fix this by having wait_woken() check for kthread_should_stop() but only for kthreads (obviously). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched: Exclude cond_resched() from nested sleep testPeter Zijlstra2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cond_resched() is a preemption point, not strictly a blocking primitive, so exclude it from the ->state test. In particular, preemption preserves task_struct::state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: oleg@redhat.com Cc: Alex Elder <alex.elder@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Axel Lin <axel.lin@ingics.com> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched: Debug nested sleepsPeter Zijlstra2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Validate we call might_sleep() with TASK_RUNNING, which catches places where we nest blocking primitives, eg. mutex usage in a wait loop. Since all blocking is arranged through task_struct::state, nesting this will cause the inner primitive to set TASK_RUNNING and the outer will thus not block. Another observed problem is calling a blocking function from schedule()->sched_submit_work()->blk_schedule_flush_plug() which will then destroy the task state for the actual __schedule() call that comes after it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: oleg@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched, modules: Fix nested sleep in add_unformed_module()Peter Zijlstra2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a genuine bug in add_unformed_module(), we cannot use blocking primitives inside a wait loop. So rewrite the wait_event_interruptible() usage to use the fresh wait_woken() stuff. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: oleg@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20140924082242.458562904@infradead.org [ So this is probably complex to backport and the race wasn't reported AFAIK, so not marked for -stable. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched, smp: Correctly deal with nested sleepsPeter Zijlstra2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | smp_hotplug_thread::{setup,unpark} functions can sleep too, so be consistent and do the same for all callbacks. __might_sleep+0x74/0x80 kmem_cache_alloc_trace+0x4e/0x1c0 perf_event_alloc+0x55/0x450 perf_event_create_kernel_counter+0x2f/0x100 watchdog_nmi_enable+0x8d/0x160 watchdog_enable+0x45/0x90 smpboot_thread_fn+0xec/0x2b0 kthread+0xe4/0x100 ret_from_fork+0x7c/0xb0 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: oleg@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140924082242.392279328@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched, exit: Deal with nested sleepsPeter Zijlstra2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end up calling potential sleeps before we reset it. Not strictly a bug since we're guaranteed to exit the loop and not call schedule(); put in annotations to quiet might_sleep(). WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270 Call Trace: [<ffffffff81694991>] dump_stack+0x4e/0x7a [<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0 [<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50 [<ffffffff810bca6e>] __might_sleep+0x7e/0x90 [<ffffffff811a1c15>] might_fault+0x55/0xb0 [<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10 [<ffffffff8109a804>] do_wait+0x104/0x270 [<ffffffff8109b837>] SyS_wait4+0x77/0x100 [<ffffffff8169d692>] system_call_fastpath+0x16/0x1b Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: umgwanakikbuti@gmail.com Cc: ilya.dryomov@inktank.com Cc: Alex Elder <alex.elder@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Axel Lin <axel.lin@ingics.com> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Guillaume Morin <guillaume@morinfr.org> Cc: Ionut Alexa <ionut.m.alexa@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/wait: Provide infrastructure to deal with nested blockingPeter Zijlstra2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a few places that call blocking primitives from wait loops, provide infrastructure to support this without the typical task_struct::state collision. We record the wakeup in wait_queue_t::flags which leaves task_struct::state free to be used by others. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | locking/mutex: Don't assume TASK_RUNNINGPeter Zijlstra2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're going to make might_sleep() test for TASK_RUNNING, because blocking without TASK_RUNNING will destroy the task state by setting it to TASK_RUNNING. There are a few occasions where its 'valid' to call blocking primitives (and mutex_lock in particular) and not have TASK_RUNNING, typically such cases are right before we set TASK_RUNNING anyhow. Robustify the code by not assuming this; this has the beneficial side effect of allowing optional code emission for fixing the above might_sleep() false positives. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/deadline: Don't balance during wakeup if wakee is pinnedWanpeng Li2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use nr_cpus_allowed to bail from select_task_rq() when only one cpu can be used, and saves some cycles for pinned tasks. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/deadline: Don't check SD_BALANCE_FORKWanpeng Li2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to do balance during fork since SCHED_DEADLINE tasks can't fork. This patch avoid the SD_BALANCE_FORK check. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng.li@linux.intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/deadline: Ensure that updates to exclusive cpusets don't break ACJuri Lelli2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | How we deal with updates to exclusive cpusets is currently broken. As an example, suppose we have an exclusive cpuset composed of two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it up to the allowed bandwidth. If we want now to modify cpusetA's cpumask, we have to check that removing a cpu's amount of bandwidth doesn't break AC guarantees. This thing isn't checked in the current code. This patch fixes the problem above, denying an update if the new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks that are currently active. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/deadline: Fix bandwidth check/update when migrating tasks between ↵Juri Lelli2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | exclusive cpusets Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks affinity (performing what is commonly called clustered scheduling). Unfortunately, such thing is currently broken for two reasons: - No check is performed when the user tries to attach a task to an exlusive cpuset (recall that exclusive cpusets have an associated maximum allowed bandwidth). - Bandwidths of source and destination cpusets are not correctly updated after a task is migrated between them. This patch fixes both things at once, as they are opposite faces of the same coin. The check is performed in cpuset_can_attach(), as there aren't any points of failure after that function. The updated is split in two halves. We first reserve bandwidth in the destination cpuset, after we pass the check in cpuset_can_attach(). And we then release bandwidth from the source cpuset when the task's affinity is actually changed. Even if there can be time windows when sched_setattr() may erroneously fail in the source cpuset, we are fine with it, as we can't perfom an atomic update of both cpusets at once. Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Vincent Legout <vincent@legout.info> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: michael@amarulasolutions.com Cc: luca.abeni@unitn.it Cc: Li Zefan <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: cgroups@vger.kernel.org Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/deadline: Do not try to push tasks if pinned task switches to dlWanpeng Li2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118): | If rq has already had 2 or more pushable tasks and we try to add a | pinned task then call of push_rt_task will just waste a time. Just switched pinned task is not able to be pushed. If the rq has had several dl tasks before they have already been considered as candidates to be pushed (or pulled). This patch implements the same behavior as rt class which introduced by commit 10447917551e ("sched/rt: Do not try to push tasks if pinned task switches to RT"). Suggested-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched: Kill task_preempt_count()Oleg Nesterov2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | task_preempt_count() is pointless if preemption counter is per-cpu, currently this is x86 only. It is only valid if the task is not running, and even in this case the only info it can provide is the state of PREEMPT_ACTIVE bit. Change its single caller to check p->on_rq instead, this should be the same if p->state != TASK_RUNNING, and kill this helper. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Alexander Graf <agraf@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/20141008183348.GC17495@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched: Make finish_task_switch() return 'struct rq *'Oleg Nesterov2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both callers of finish_task_switch() need to recalculate this_rq() and pass it as an argument, plus __schedule() does this again after context_switch(). It would be simpler to call this_rq() once in finish_task_switch() and return the this rq to the callers. Note: probably "int cpu" in __schedule() should die; it is not used and both rcu_note_context_switch() and wq_worker_sleeping() do not really need this argument. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141009193232.GB5408@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched: Fix schedule_tail() to disable preemptionOleg Nesterov2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | finish_task_switch() enables preemption, so post_schedule(rq) can be called on the wrong (and even dead) CPU. Afaics, nothing really bad can happen, but in this case we can wrongly clear rq->post_schedule on that CPU. And this simply looks wrong in any case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141008193644.GA32055@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/numa: Check all nodes when placing a pseudo-interleaved groupRik van Riel2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In pseudo-interleaved numa_groups, all tasks try to relocate to the group's preferred_nid. When a group is spread across multiple NUMA nodes, this can lead to tasks swapping their location with other tasks inside the same group, instead of swapping location with tasks from other NUMA groups. This can keep NUMA groups from converging. Examining all nodes, when dealing with a task in a pseudo-interleaved NUMA group, avoids this problem. Note that only CPUs in nodes that improve the task or group score are examined, so the loop isn't too bad. Tested-by: Vinod Chegu <chegu_vinod@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Vinod Chegu" <chegu_vinod@hp.com> Cc: mgorman@suse.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/numa: Find the preferred nid with complex NUMA topologyRik van Riel2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On systems with complex NUMA topologies, the node scoring is adjusted to allow workloads to converge on nodes that are near each other. The way a task group's preferred nid is determined needs to be adjusted, in order for the preferred_nid to be consistent with group_weight scoring. This ensures that we actually try to converge workloads on adjacent nodes. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/numa: Calculate node scores in complex NUMA topologiesRik van Riel2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to do task placement on systems with complex NUMA topologies, it is necessary to count the faults on nodes nearby the node that is being examined for a potential move. In case of a system with a backplane interconnect, we are dealing with groups of NUMA nodes; each of the nodes within a group is the same number of hops away from nodes in other groups in the system. Optimal placement on this topology is achieved by counting all nearby nodes equally. When comparing nodes A and B at distance N, nearby nodes are those at distances smaller than N from nodes A or B. Placement strategy on a system with a glueless mesh NUMA topology needs to be different, because there are no natural groups of nodes determined by the hardware. Instead, when dealing with two nodes A and B at distance N, N >= 2, there will be intermediate nodes at distance < N from both nodes A and B. Good placement can be achieved by right shifting the faults on nearby nodes by the number of hops from the node being scored. In this context, a nearby node is any node less than the maximum distance in the system away from the node. Those nodes are skipped for efficiency reasons, there is no real policy reason to do so. Placement policy on directly connected NUMA systems is not affected. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/numa: Prepare for complex topology placementRik van Riel2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preparatory patch for adding NUMA placement on systems with complex NUMA topology. Also fix a potential divide by zero in group_weight() Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/numa: Classify the NUMA topology of a systemRik van Riel2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Smaller NUMA systems tend to have all NUMA nodes directly connected to each other. This includes the degenerate case of a system with just one node, ie. a non-NUMA system. Larger systems can have two kinds of NUMA topology, which affects how tasks and memory should be placed on the system. On glueless mesh systems, nodes that are not directly connected to each other will bounce traffic through intermediary nodes. Task groups can be run closer to each other by moving tasks from a node to an intermediary node between it and the task's preferred node. On NUMA systems with backplane controllers, the intermediary hops are incapable of running programs. This creates "islands" of nodes that are at an equal distance to anywhere else in the system. Each kind of topology requires a slightly different placement algorithm; this patch provides the mechanism to detect the kind of NUMA topology of a system. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> [ Changed to use kernel/sched/sched.h ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Link: http://lkml.kernel.org/r/1413530994-9732-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | sched/numa: Export info needed for NUMA balancing on complex topologiesRik van Riel2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export some information that is necessary to do placement of tasks on systems with multi-level NUMA topologies. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413530994-9732-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | | | | | | | | | | | Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2014-12-09
|\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events update from Ingo Molnar: "On the kernel side there's few changes, the one that stands out is PEBS machine state sampling support on x86, by Stephane Eranian. On the tooling side: User visible tooling changes: - Don't open the DWARF info multiple times, keeping instead a dwfl handle in struct dso, greatly speeding up 'perf report' on powerpc. (Sukadev Bhattiprolu) - Introduce PARSE_OPT_DISABLED option flag and use it to avoid showing undersired options in tools that provides frontends to 'perf record', like sched, kvm, etc (Namhyung Kim) - Fallback to kallsyms when using the minimal 'ELF' loader (Arnaldo Carvalho de Melo) - Fix annotation with kcore (Adrian Hunter) - Support source line numbers in annotate using a hotkey (Andi Kleen) - Callchain improvements including: * Enable printing the srcline in the history * Make get_srcline fall back to sym+offset (Andi Kleen) - TUI hist_entry browser fixes, including showing missing overhead value for first level callchain. Detected comparing the output of --stdio/--gui (that matched) with --tui, that had this problem. (Namhyung Kim) - Support handling complete branch stacks as histograms (Andi Kleen) Tooling infrastructure changes: - Prep work for supporting per-pkg and snapshot counters in 'perf stat' (Jiri Olsa) - 'perf stat' refactorings, moving stuff from it to evsel.c to use in per-pkg/snapshot format changes (Jiri Olsa) - Add per-pkg format file parsing (Matt Fleming) - Clean up libelf feature support code (Namhyung Kim) - Add gzip decompression support for kernel modules (Namhyung Kim) - More prep patches for Intel PT, including a a thread stack and more stuff made available via the database export mechanism (Adrian Hunter) - More Intel PT work, including a facility to export sample data (comms, threads, symbol names, etc) in a database friendly way, with an script to use this to create a postgresql database. (Adrian Hunter) - Make sure that thread->mg->machine points to the machine where the thread exists (it was being set only for the kmaps kernel modules case, do it as well for the mmaps) and use it to shorten function signatures (Arnaldo Carvalho de Melo) ... and lots of other fixes and smaller improvements" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits) perf report: In branch stack mode use address history sorting perf report: Add --branch-history option perf callchain: Support handling complete branch stacks as histograms perf stat: Add support for snapshot counters perf stat: Add support for per-pkg counters perf tools: Remove perf_evsel__read interface perf stat: Use read_counter in read_counter_aggr perf stat: Make read_counter work over the thread dimension perf stat: Use perf_evsel__read_cb in read_counter perf tools: Add snapshot format file parsing perf tools: Add per-pkg format file parsing perf evsel: Introduce perf_evsel__read_cb function perf evsel: Introduce perf_counts_values__scale function perf evsel: Introduce perf_evsel__compute_deltas function perf tools: Allow to force redirect pr_debug to stderr. perf tools: Fix segfault due to invalid kernel dso access perf callchain: Make get_srcline fall back to sym+offset perf symbols: Move bfd_demangle stubbing to its only user perf callchain: Enable printing the srcline in the history perf tools: Collapse first level callchain entry if it has sibling ...
| * | | | | | | | | | | | | | | | perf: Improve the perf_sample_data struct layoutPeter Zijlstra2014-11-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch reorders fields in the perf_sample_data struct in order to minimize the number of cachelines touched in perf_sample_data_init(). It also removes some intializations which are redundant with the code in kernel/events/core.c Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1411559322-16548-7-git-send-email-eranian@google.com Cc: cebbert.lkml@gmail.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: jolsa@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | | | | | | | perf: Add ability to sample machine state on interruptStephane Eranian2014-11-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable capture of interrupted machine state for each sample. Registers to sample are passed per event in the sample_regs_intr bitmask. To sample interrupt machine state, the PERF_SAMPLE_INTR_REGS must be passed in sample_type. The list of available registers is arch dependent and provided by asm/perf_regs.h Registers are laid out as u64 in the order of the bit order of sample_intr_regs. This patch also adds a new ABI version PERF_ATTR_SIZE_VER4 because we extend the perf_event_attr struct with a new u64 field. Reviewed-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: cebbert.lkml@gmail.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/1411559322-16548-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | | | | | | | | | | | | Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds2014-12-09
|\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ | |_|_|_|_|_|_|_|_|/ / / / / / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "These are the main changes in this cycle: - Streamline RCU's use of per-CPU variables, shifting from "cpu" arguments to functions to "this_"-style per-CPU variable accessors. - signal-handling RCU updates. - real-time updates. - torture-test updates. - miscellaneous fixes. - documentation updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) rcu: Fix FIXME in rcu_tasks_kthread() rcu: More info about potential deadlocks with rcu_read_unlock() rcu: Optimize cond_resched_rcu_qs() rcu: Add sparse check for RCU_INIT_POINTER() documentation: memory-barriers.txt: Correct example for reorderings documentation: Add atomic_long_t to atomic_ops.txt documentation: Additional restriction for control dependencies documentation: Document RCU self test boot params rcutorture: Fix rcu_torture_cbflood() memory leak rcutorture: Remove obsolete kversion param in kvm.sh rcutorture: Remove stale test configurations rcutorture: Enable RCU self test in configs rcutorture: Add early boot self tests torture: Run Linux-kernel binary out of results directory cpu: Avoid puts_pending overflow rcu: Remove "cpu" argument to rcu_cleanup_after_idle() rcu: Remove "cpu" argument to rcu_prepare_for_idle() rcu: Remove "cpu" argument to rcu_needs_cpu() rcu: Remove "cpu" argument to rcu_note_context_switch() rcu: Remove "cpu" argument to rcu_preempt_check_callbacks() ...
| * | | | | | | | | | | | | | | | Merge branch 'rcu/next' of ↵Ingo Molnar2014-11-20
| |\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ | | |_|_|_|_|_|_|_|_|/ / / / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Streamline RCU's use of per-CPU variables, shifting from "cpu" arguments to functions to "this_"-style per-CPU variable accessors. - Signal-handling RCU updates. - Real-time updates. - Torture-test updates. - Miscellaneous fixes. - Documentation updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | | | | | | | | | | | | | | |
| | | \ \ \ \ \ \ \ \ \ \ \ \ \ \
| | | \ \ \ \ \ \ \ \ \ \ \ \ \ \
| | | \ \ \ \ \ \ \ \ \ \ \ \ \ \
| | | \ \ \ \ \ \ \ \ \ \ \ \ \ \
| | | \ \ \ \ \ \ \ \ \ \ \ \ \ \
| | | \ \ \ \ \ \ \ \ \ \ \ \ \ \
| | | \ \ \ \ \ \ \ \ \ \ \ \ \ \
| | *-------. \ \ \ \ \ \ \ \ \ \ \ \ \ \ Merge branches 'torture.2014.11.03a', 'cpu.2014.11.03a', 'doc.2014.11.13a', ↵Paul E. McKenney2014-11-13
| | |\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'fixes.2014.11.13a', 'signal.2014.10.29a' and 'rt.2014.10.29a' into HEAD cpu.2014.11.03a: Changes for per-CPU variables. doc.2014.11.13a: Documentation updates. fixes.2014.11.13a: Miscellaneous fixes. signal.2014.10.29a: Signal changes. rt.2014.10.29a: Real-time changes. torture.2014.11.03a: torture-test changes.
| | | | | | | * | | | | | | | | | | | | | | rcu: Fix for rcuo online-time-creation reorganization bugPaul E. McKenney2014-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs) contains checks for the case where CPUs are brought online out of order, re-wiring the rcuo leader-follower relationships as needed. Unfortunately, this rewiring was broken. This apparently went undetected due to the tendency of systems to bring CPUs online in order. This commit nevertheless fixes the rewiring. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | | | | | * | | | | | | | | | | | | | | rcu: Kick rcuo kthreads after their CPU goes offlinePaul E. McKenney2014-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a no-CBs CPU were to post an RCU callback with interrupts disabled after it entered the idle loop for the last time, there might be no deferred wakeup for the corresponding rcuo kthreads. This commit therefore adds a set of calls to do_nocb_deferred_wakeup() after the CPU has gone completely offline. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | | | | | * | | | | | | | | | | | | | | rcu: Remove redundant TREE_PREEMPT_RCU config optionPranith Kumar2014-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PREEMPT_RCU and TREE_PREEMPT_RCU serve the same function after TINY_PREEMPT_RCU has been removed. This patch removes TREE_PREEMPT_RCU and uses PREEMPT_RCU config option in its place. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | | | | | * | | | | | | | | | | | | | | rcu: Unify boost and kthread prioritiesClark Williams2014-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename CONFIG_RCU_BOOST_PRIO to CONFIG_RCU_KTHREAD_PRIO and use this value for both the per-CPU kthreads (rcuc/N) and the rcu boosting threads (rcub/n). Also, create the module_parameter rcutree.kthread_prio to be used on the kernel command line at boot to set a new value (rcutree.kthread_prio=N). Signed-off-by: Clark Williams <clark.williams@gmail.com> [ paulmck: Ported to rcu/dev, applied Paul Bolle and Peter Zijlstra feedback. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | | | | | * | | | | | | | | | | | | | | rcu: Avoid IPIing idle CPUs from synchronize_sched_expedited()Paul E. McKenney2014-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, synchronize_sched_expedited() sends IPIs to all online CPUs, even those that are idle or executing in nohz_full= userspace. Because idle CPUs and nohz_full= userspace CPUs are in extended quiescent states, there is no need to IPI them in the first place. This commit therefore avoids IPIing CPUs that are already in extended quiescent states. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | | | | | * | | | | | | | | | | | | | | rcu: Move RCU_BOOST variable declarations, eliminating #ifdefPaul E. McKenney2014-10-28
| | | | | |_|/ / / / / / / / / / / / / / / | | | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are some RCU_BOOST-specific per-CPU variable declarations that are needlessly defined under #ifdef in kernel/rcu/tree.c. This commit therefore moves these declarations into a pre-existing #ifdef in kernel/rcu/tree_plugin.h. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | | | | * | | | | | | | | | | | | | | signal: Document the RCU protection of ->sighandOleg Nesterov2014-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __cleanup_sighand() frees sighand without RCU grace period. This is correct but this looks "obviously buggy" and constantly confuses the readers, add the comments to explain how this works. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | | | | * | | | | | | | | | | | | | | signal: Exit RCU read-side critical section on each pass through loopPaul E. McKenney2014-10-29
| | | | | |/ / / / / / / / / / / / / / / | | | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kill_pid_info() can potentially loop indefinitely if tasks are created and deleted sufficiently quickly, and if this happens, this function will remain in a single RCU read-side critical section indefinitely. This commit therefore exits the RCU read-side critical section on each pass through the loop. Because a race must happen to retry the loop, this should have no performance impact in the common case. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | | | * | | | | | | | | | | | | | | rcu: Fix FIXME in rcu_tasks_kthread()Paul E. McKenney2014-11-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit affines rcu_tasks_kthread() to the housekeeping CPUs in CONFIG_NO_HZ_FULL builds. This is just a default, so systems administrators are free to put this kthread somewhere else if they wish. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | | | * | | | | | | | | | | | | | | rcu: Remove CONFIG_RCU_CPU_STALL_VERBOSEPaul E. McKenney2014-10-28
| | | | |/ / / / / / / / / / / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The CONFIG_RCU_CPU_STALL_VERBOSE Kconfig parameter causes preemptible RCU's CPU stall warnings to dump out any preempted tasks that are blocking the current RCU grace period. This information is useful, and the default has been CONFIG_RCU_CPU_STALL_VERBOSE=y for some years. It is therefore time for this commit to remove this Kconfig parameter, so that future kernel builds will always act as if CONFIG_RCU_CPU_STALL_VERBOSE=y. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | | * | | | | | | | | | | | | | | | cpu: Avoid puts_pending overflowPaul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A long string of get_online_cpus() with each followed by a put_online_cpu() that fails to acquire cpu_hotplug.lock can result in overflow of the cpu_hotplug.puts_pending counter. Although this is perhaps improbably, a system with absolutely no CPU-hotplug operations will have an arbitrarily long time in which this overflow could occur. This commit therefore adds overflow checks to get_online_cpus() and try_get_online_cpus(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Remove "cpu" argument to rcu_cleanup_after_idle()Paul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "cpu" argument to rcu_cleanup_after_idle() is always the current CPU, so drop it. This moves the smp_processor_id() from the caller to rcu_cleanup_after_idle(), saving argument-passing overhead. Again, the anticipated cross-CPU uses of these functions has been replaced by NO_HZ_FULL. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Remove "cpu" argument to rcu_prepare_for_idle()Paul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "cpu" argument to rcu_prepare_for_idle() is always the current CPU, so drop it. This in turn allows two of the uses of "cpu" in this function to be replaced with a this_cpu_ptr() and the third by smp_processor_id(), replacing that of the call to rcu_prepare_for_idle(). Again, the anticipated cross-CPU uses of these functions has been replaced by NO_HZ_FULL. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Remove "cpu" argument to rcu_needs_cpu()Paul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "cpu" argument to rcu_needs_cpu() is always the current CPU, so drop it. This in turn allows the "cpu" argument to rcu_cpu_has_callbacks() to be removed, which allows the uses of "cpu" in both functions to be replaced with a this_cpu_ptr(). Again, the anticipated cross-CPU uses of these functions has been replaced by NO_HZ_FULL. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Remove "cpu" argument to rcu_note_context_switch()Paul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "cpu" argument to rcu_note_context_switch() is always the current CPU, so drop it. This in turn allows the "cpu" argument to rcu_preempt_note_context_switch() to be removed, which allows the sole use of "cpu" in both functions to be replaced with a this_cpu_ptr(). Again, the anticipated cross-CPU uses of these functions has been replaced by NO_HZ_FULL. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()Paul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because rcu_preempt_check_callbacks()'s argument is guaranteed to always be the current CPU, drop the argument and replace per_cpu() with __this_cpu_read(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Remove "cpu" argument to rcu_pending()Paul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because rcu_pending()'s argument is guaranteed to always be the current CPU, drop the argument and replace per_cpu_ptr() with this_cpu_ptr(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Remove "cpu" argument to rcu_check_callbacks()Paul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "cpu" argument was kept around on the off-chance that RCU might offload scheduler-clock interrupts. However, this offload approach has been replaced by NO_HZ_FULL, which offloads -all- RCU processing from qualifying CPUs. It is therefore time to remove the "cpu" argument to rcu_check_callbacks(), which this commit does. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
| | | * | | | | | | | | | | | | | | | rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_dataPaul E. McKenney2014-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rcu_data per-CPU variable has a number of fields that are atomically manipulated, potentially by any CPU. This situation can result in false sharing with per-CPU variables that have the misfortune of being allocated adjacent to rcu_data in memory. This commit therefore changes the DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to avoid this false sharing. Reported-by: Christoph Lameter <cl@linux.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>