| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will place at least one task per core.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The 'struct sched_group_capacity::capacity_orig' field is no longer used
in the scheduler so we can remove it.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQ
The cpu_capacity_orig will be used:
- to detect when the capacity of a CPU has been noticeably reduced so we can
trig load balance to look for a CPU with better capacity. As an example, we
can detect when a CPU handles a significant amount of irq
(with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
scheduler whereas CPUs, which are really idle, are available.
- evaluate the available capacity for CFS tasks
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\
| |
| |
| |
| |
| | |
applying new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.
I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.
Signed-off-by: Brian Silverman <brian@peloton-tech.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: austin@peloton-tech.com
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Thiscontains misc fixes: preempt_schedule_common() and io_schedule()
recursion fixes, sched/dl fixes, a completion_done() revert, two
sched/rt fixes and a comment update patch"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Avoid obvious configuration fail
sched/autogroup: Fix failure to set cpu.rt_runtime_us
sched/dl: Do update_rq_clock() in yield_task_dl()
sched: Prevent recursion in io_schedule()
sched/completion: Serialize completion_done() with complete()
sched: Fix preempt_schedule_common() triggering tracing recursion
sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()
sched: Make dl_task_time() use task_rq_lock()
sched: Clarify ordering between task_rq_lock() and move_queued_task()
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If the CPU is running a realtime task that does not round-robin
with another realtime task of equal priority, there is no point
in keeping the scheduler tick going. After all, whenever the
scheduler tick runs, the kernel will just decide not to
reschedule.
Extend sched_can_stop_tick() to recognize these situations, and
inform the rest of the kernel that the scheduler tick can be
stopped.
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@redhat.com
Cc: mtosatti@redhat.com
Link: http://lkml.kernel.org/r/20150216152349.6a8ed824@annuminas.surriel.com
[ Small cleanliness tweak. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Setting the root group's cpu.rt_runtime_us to 0 is a bad thing; it
would disallow the kernel creating RT tasks.
One can of course still set it to 1, which will (likely) still wreck
your kernel, but at least make it clear that setting it to 0 is not
good.
Collect both sanity checks into the one place while we're there.
Suggested-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150209112715.GO24151@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Because task_group() uses a cache of autogroup_task_group(), whose
output depends on sched_class, switching classes can generate
problems.
In particular, when started as fair, the cache points to the
autogroup, so when switching to RT the tg_rt_schedulable() test fails
for every cpu.rt_{runtime,period}_us change because now the autogroup
has tasks and no runtime.
Furthermore, going back to the previous semantics of varying
task_group() with sched_class has the down-side that the sched_debug
output varies as well, even though the task really is in the
autogroup.
Therefore add an autogroup exception to tg_has_rt_tasks() -- such that
both (all) task_group() usages in sched/core now have one. And remove
all the remnants of the variable task_group() output.
Reported-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Fixes: 8323f26ce342 ("sched: Fix race in task_group()")
Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.
Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.
This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().
Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.
So:
- use ->in_iowait to detect recursion. Set it earlier, and restore
it to the old value.
- move the call to "raw_rq" after the call to blk_flush_plug().
As this is some sort of per-cpu thing, we want some chance that
we are on the right CPU
- When io_schedule() is called recurively, use blk_schedule_flush_plug()
which cannot further recurse.
- as this makes io_schedule() a lot more complex and as io_schedule()
must match io_schedule_timeout(), but all the changes in io_schedule_timeout()
and make io_schedule a simple wrapper for that.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Since the function graph tracer needs to disable preemption, it might
call preempt_schedule() after reenabling it if something triggered the
need for rescheduling in between.
Therefore we can't trace preempt_schedule() itself because we would
face a function tracing recursion otherwise as the tracer is always
called before PREEMPT_ACTIVE gets set to prevent that recursion. This is
why preempt_schedule() is tagged as "notrace".
But the same issue applies to every function called by preempt_schedule()
before PREEMPT_ACTIVE is actually set. And preempt_schedule_common() is
one such example. Unfortunately we forgot to tag it as notrace as well
and as a result we are encountering tracing recursion since it got
introduced by:
a18b5d0181923 ("sched: Fix missing preemption opportunity")
Let's fix that by applying the appropriate function tag to
preempt_schedule_common().
Reported-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1424110807-15057-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Kirill reported that a dl task can be throttled and dequeued at the
same time. This happens, when it becomes throttled in schedule(),
which is called to go to sleep:
current->state = TASK_INTERRUPTIBLE;
schedule()
deactivate_task()
dequeue_task_dl()
update_curr_dl()
start_dl_timer()
__dequeue_task_dl()
prev->on_rq = 0;
This invalidates the assumption from commit 0f397f2c90ce ("sched/dl:
Fix race in dl_task_timer()"):
"The only reason we don't strictly need ->pi_lock now is because
we're guaranteed to have p->state == TASK_RUNNING here and are
thus free of ttwu races".
And therefore we have to use the full task_rq_lock() here.
This further amends the fact that we forgot to update the rq lock loop
for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach
scheduler to understand TASK_ON_RQ_MIGRATING state").
Reported-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|/
|
|
|
|
|
|
|
|
|
|
|
| |
There was a wee bit of confusion around the exact ordering here;
clarify things.
Reported-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20150217121258.GM5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main scheduler changes in this cycle were:
- various sched/deadline fixes and enhancements
- rescheduling latency fixes/cleanups
- rework the rq->clock code to be more consistent and more robust.
- minor micro-optimizations
- ->avg.decay_count fixes
- add a stack overflow check to might_sleep()
- idle-poll handler fix, possibly resulting in power savings
- misc smaller updates and fixes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/Documentation: Remove unneeded word
sched/wait: Introduce wait_on_bit_timeout()
sched: Pull resched loop to __schedule() callers
sched/deadline: Remove cpu_active_mask from cpudl_find()
sched: Fix hrtick_start() on UP
sched/deadline: Avoid pointless __setscheduler()
sched/deadline: Fix stale yield state
sched/deadline: Fix hrtick for a non-leftmost task
sched/deadline: Modify cpudl::free_cpus to reflect rd->online
sched/idle: Add missing checks to the exit condition of cpu_idle_poll()
sched: Fix missing preemption opportunity
sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target
sched/debug: Print rq->clock_task
sched/core: Rework rq->clock update skips
sched/core: Validate rq_clock*() serialization
sched/core: Remove check of p->sched_class
sched/fair: Fix sched_entity::avg::decay_count initialization
sched/debug: Fix potential call to __ffs(0) in sched_show_task()
sched/debug: Check for stack overflow in ___might_sleep()
sched/fair: Fix the dealing with decay_count in __synchronize_entity_decay()
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
__schedule() disables preemption during its job and re-enables it
afterward without doing a preemption check to avoid recursion.
But if an event happens after the context switch which requires
rescheduling, we need to check again if a task of a higher priority
needs the CPU. A preempt irq can raise such a situation. To handle that,
__schedule() loops on need_resched().
But preempt_schedule_*() functions, which call __schedule(), also loop
on need_resched() to handle missed preempt irqs. Hence we end up with
the same loop happening twice.
Lets simplify that by attributing the need_resched() loop responsibility
to all __schedule() callers.
There is a risk that the outer loop now handles reschedules that used
to be handled by the inner loop with the added overhead of caller details
(inc/dec of PREEMPT_ACTIVE, irq save/restore) but assuming those inner
rescheduling loop weren't too frequent, this shouldn't matter. Especially
since the whole preemption path is now losing one loop in any case.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1422404652-29067-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The commit 177ef2a6315e ("sched/deadline: Fix a precision problem in
the microseconds range") forgot to change the UP version of
hrtick_start(), do so now.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Fixes: 177ef2a6315e ("sched/deadline: Fix a precision problem in the microseconds range")
[ Fixed the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There is no need to dequeue/enqueue and push/pull if there are
no scheduling parameters changed for the DL class.
Both fair and RT classes already check if parameters changed for
them to avoid unnecessary overhead. This patch add the parameters
changed test for the DL class in order to reduce overhead.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Fixed up the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-5-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |\
| | |
| | |
| | |
| | |
| | | |
new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If an interrupt fires in cond_resched(), between the call to __schedule()
and the PREEMPT_ACTIVE count decrementation, and that interrupt sets
TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored
due to the PREEMPT_ACTIVE count. This kind of scenario, with irq preemption
being delayed because it's interrupting a preempt-disabled area, is
usually fixed up after preemption is re-enabled back with an explicit
call to preempt_schedule().
This is what preempt_enable() does but a raw preempt count decrement as
performed by __preempt_count_sub(PREEMPT_ACTIVE) doesn't handle delayed
preemption check. Therefore when such a race happens, the rescheduling
is going to be delayed until the next scheduler or preemption entrypoint.
This can be a problem for scheduler latency sensitive workloads.
Lets fix that by consolidating cond_resched() with preempt_schedule()
internals.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@kernel.org>
Original-patch-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1421946484-9298-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | | |
Merge all pending fixes and refresh the tree, before applying new changes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The original purpose of rq::skip_clock_update was to avoid 'costly' clock
updates for back to back wakeup-preempt pairs. The big problem with it
has always been that the rq variable is unaware of the context and
causes indiscrimiate clock skips.
Rework the entire thing and create a sense of context by only allowing
schedule() to skip clock updates. (XXX can we measure the cost of the
added store?)
By ensuring only schedule can ever skip an update, we guarantee we're
never more than 1 tick behind on the update.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Search all usage of p->sched_class in sched/core.c, no one check it
before use, so it seems that every task must belong to one sched_class.
Signed-off-by: Yao Dongdong <yaodongdong@huawei.com>
[ Moved the early class assignment to make it boot. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1419835303-28958-1-git-send-email-yaodongdong@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Child has the same decay_count as parent. If it's not zero,
we add it to parent's cfs_rq->removed_load:
wake_up_new_task()->set_task_cpu()->migrate_task_rq_fair().
Child's load is a just garbade after copying of parent,
it hasn't been on cfs_rq yet, and it must not be added to
cfs_rq::removed_load in migrate_task_rq_fair().
The patch moves sched_entity::avg::decay_count intialization
in sched_fork(). So, migrate_task_rq_fair() does not change
removed_load.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418644618.6074.13.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
"struct task_struct"->state is "volatile long" and __ffs() warns that
"Undefined if no bit exists, so code should check against 0 first."
Therefore, at expression
state = p->state ? __ffs(p->state) + 1 : 0;
in sched_show_task(), CPU might see "p->state" before "?" as "non-zero"
but "p->state" after "?" as "zero", which could result in
"state >= sizeof(stat_nam)" being true and bogus '?' is printed.
This patch changes "state" from "unsigned int" to "unsigned long" and
save "p->state" before calling __ffs(), in order to avoid potential call
to __ffs(0).
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/201412052131.GCE35924.FVHFOtLOJOMQFS@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Sometimes a "BUG: sleeping function called from invalid context"
message is not indicative of locking problems, but is the result
of a stack overflow corrupting the thread info.
Witness http://oss.sgi.com/archives/xfs/2014-02/msg00325.html
for example, which took a few go-rounds to sort out.
If we're printing the warning, things are wonky already, and
it'd be informative to check for the stack end corruption at this
point, too.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/5490B158.4060005@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- AMD range breakpoints support:
Extend breakpoint tools and core to support address range through
perf event with initial backend support for AMD extended
breakpoints.
The syntax is:
perf record -e mem:addr/len:type
For example set write breakpoint from 0x1000 to 0x1200 (0x1000 + 512)
perf record -e mem:0x1000/512:w
- event throttling/rotating fixes
- various event group handling fixes, cleanups and general paranoia
code to be more robust against bugs in the future.
- kernel stack overhead fixes
User-visible tooling side changes:
- Show precise number of samples in at the end of a 'record' session,
if processing build ids, since we will then traverse the whole
perf.data file and see all the PERF_RECORD_SAMPLE records,
otherwise stop showing the previous off-base heuristicly counted
number of "samples" (Namhyung Kim).
- Support to read compressed module from build-id cache (Namhyung
Kim)
- Enable sampling loads and stores simultaneously in 'perf mem'
(Stephane Eranian)
- 'perf diff' output improvements (Namhyung Kim)
- Fix error reporting for evsel pgfault constructor (Arnaldo Carvalho
de Melo)
Tooling side infrastructure changes:
- Cache eh/debug frame offset for dwarf unwind (Namhyung Kim)
- Support parsing parameterized events (Cody P Schafer)
- Add support for IP address formats in libtraceevent (David Ahern)
Plus other misc fixes"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
perf: Decouple unthrottling and rotating
perf: Drop module reference on event init failure
perf: Use POLLIN instead of POLL_IN for perf poll data in flag
perf: Fix put_event() ctx lock
perf: Fix move_group() order
perf: Fix event->ctx locking
perf: Add a bit of paranoia
perf symbols: Convert lseek + read to pread
perf tools: Use perf_data_file__fd() consistently
perf symbols: Support to read compressed module from build-id cache
perf evsel: Set attr.task bit for a tracking event
perf header: Set header version correctly
perf record: Show precise number of samples
perf tools: Do not use __perf_session__process_events() directly
perf callchain: Cache eh/debug frame offset for dwarf unwind
perf tools: Provide stub for missing pthread_attr_setaffinity_np
perf evsel: Don't rely on malloc working for sz 0
tools lib traceevent: Add support for IP address formats
perf ui/tui: Show fatal error message only if exists
perf tests: Fix typo in sample-parsing.c
...
|
| |\ \ \ \
| | | | | |
| | | | | |
| | | | | | |
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |/ / /
| |/| | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.
The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.
Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.
This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.
Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.
We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to assertain which __perf_regs
element to use).
One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\ \ \ \ \
| |_|/ / /
|/| | | /
| | |_|/
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix deadline parameter modification handling
sched/wait: Remove might_sleep() from wait_event_cmd()
sched: Fix crash if cpuset_cpumask_can_shrink() is passed an empty cpumask
sched/fair: Avoid using uninitialized variable in preferred_group_nid()
|
| | |/
| |/|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to
use in switched_from_dl()") removed the hrtimer_try_cancel() function
call out from init_dl_task_timer(), which gets called from
__setparam_dl().
The result is that we can now re-init the timer while its active --
this is bad and corrupts timer state.
Furthermore; changing the parameters of an active deadline task is
tricky in that you want to maintain guarantees, while immediately
effective change would allow one to circumvent the CBS guarantees --
this too is bad, as one (bad) task should not be able to affect the
others.
Rework things to avoid both problems. We only need to initialize the
timer once, so move that to __sched_fork() for new tasks.
Then make sure __setparam_dl() doesn't affect the current running
state but only updates the parameters used to calculate the next
scheduling period -- this guarantees the CBS functions as expected
(albeit slightly pessimistic).
This however means we need to make sure __dl_clear_params() needs to
reset the active state otherwise new (and tasks flipping between
classes) will not properly (re)compute their first instance.
Todo: close class flipping CBS hole.
Todo: implement delayed BW release.
Reported-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Luca Abeni <luca.abeni@unitn.it>
Fixes: 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While creating an exclusive cpuset, we passed cpuset_cpumask_can_shrink()
an empty cpumask (cur), and dl_bw_of(cpumask_any(cur)) made boom with it:
CPU: 0 PID: 6942 Comm: shield.sh Not tainted 3.19.0-master #19
Hardware name: MEDIONPC MS-7502/MS-7502, BIOS 6.00 PG 12/26/2007
task: ffff880224552450 ti: ffff8800caab8000 task.ti: ffff8800caab8000
RIP: 0010:[<ffffffff81073846>] [<ffffffff81073846>] cpuset_cpumask_can_shrink+0x56/0xb0
[...]
Call Trace:
[<ffffffff810cb82a>] validate_change+0x18a/0x200
[<ffffffff810cc877>] cpuset_write_resmask+0x3b7/0x720
[<ffffffff810c4d58>] cgroup_file_write+0x38/0x100
[<ffffffff811d953a>] kernfs_fop_write+0x12a/0x180
[<ffffffff8116e1a3>] vfs_write+0xb3/0x1d0
[<ffffffff8116ed06>] SyS_write+0x46/0xb0
[<ffffffff8159ced6>] system_call_fastpath+0x16/0x1b
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Fixes: f82f80426f7a ("sched/deadline: Ensure that updates to exclusive cpusets don't break AC")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422417235.5716.5.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop not actually sleep when it
calls schedule.
However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).
And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.
In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont. But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.
This fixes both cases:
- don't set TASK_RUNNING when the nested condition happens (note: even
if WARN_ONCE() only _warns_ once, the return value isn't whether the
warning happened, but whether the condition for the warning was true.
So despite the warning only happening once, the "if (WARN_ON(..))"
would trigger for every nested sleep.
- in the cases where we knowingly disable the warning by using
"sched_annotate_sleep()", don't change the task state (that is used
for all core scheduling decisions), instead use '->task_state_change'
that is used for the debugging decision itself.
(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)
Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>,
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When allocating space for load_balance_mask, in sched_init, when
CPUMASK_OFFSTACK is set, we've managed to spill over
KMALLOC_MAX_SIZE on our 6144 core machine. The patch below
breaks up the allocations so that they don't overflow the max
alloc size. It also allocates the masks on the the node from
which they'll most commonly be accessed, to minimize remote
accesses on NUMA machines.
Suggested-by: George Beshers <gbeshers@sgi.com>
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: George Beshers <gbeshers@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pis_alive() like other
users do.
Note: we need some helpers to cleanup the code like this. And it seems
that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.
This instruments might_sleep() checks to catch places that nest
blocking primitives - such as mutex usage in a wait loop. Such
bugs can result in hard to debug races/hangs.
Another category of invalid nesting that this facility will detect
is the calling of blocking functions from within schedule() ->
sched_submit_work() -> blk_schedule_flush_plug().
There's some potential for false positives (if secondary blocking
primitives themselves are not ready yet for this facility), but the
kernel will warn once about such bugs per bootup, so the warning
isn't much of a nuisance.
This feature comes with a number of fixes, for problems uncovered
with it, so no messages are expected normally.
- Another round of sched/numa optimizations and refinements, for
CONFIG_NUMA_BALANCING=y.
- Another round of sched/dl fixes and refinements.
Plus various smaller fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched: Add missing rcu protection to wake_up_all_idle_cpus
sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
sched/numa: Init numa balancing fields of init_task
sched/deadline: Remove unnecessary definitions in cpudeadline.h
sched/cpupri: Remove unnecessary definitions in cpupri.h
sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
sched/fair: Fix stale overloaded status in the busiest group finding logic
sched: Move p->nr_cpus_allowed check to select_task_rq()
sched/completion: Document when to use wait_for_completion_io_*()
sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
sched/fair: Kill task_struct::numa_entry and numa_group::task_list
sched: Refactor task_struct to use numa_faults instead of numa_* pointers
sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
sched/deadline: Reschedule from switched_from_dl() after a successful pull
sched/deadline: Push task away if the deadline is equal to curr during wakeup
sched/deadline: Add deadline rq status print
sched/deadline: Fix artificial overrun introduced by yield_task_dl()
sched/rt: Clean up check_preempt_equal_prio()
sched/core: Use dl_bw_of() under rcu_read_lock_sched()
sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Locklessly doing is_idle_task(rq->curr) is only okay because of
RCU protection. The older variant of the broken code checked
rq->curr == rq->idle instead and therefore didn't need RCU.
Fixes: f6be8af1c95d ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Chuansheng Liu <chuansheng.liu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.
Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Nobody iterates over numa_group::task_list, this just confuses the readers.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |\
| | |
| | |
| | |
| | |
| | | |
applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch simplifies task_struct by removing the four numa_* pointers
in the same array and replacing them with the array pointer. By doing this,
on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on
x86_64).
A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.
All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
As per commit f10e00f4bf36 ("sched/dl: Use dl_bw_of() under
rcu_read_lock_sched()"), dl_bw_of() has to be protected by
rcu_read_lock_sched().
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414497286-28824-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Currently used hrtimer_try_to_cancel() is racy:
raw_spin_lock(&rq->lock)
... dl_task_timer raw_spin_lock(&rq->lock)
... raw_spin_lock(&rq->lock) ...
switched_from_dl() ... ...
hrtimer_try_to_cancel() ... ...
switched_to_fair() ... ...
... ... ...
... ... ...
raw_spin_unlock(&rq->lock) ... (asquired)
... ... ...
... ... ...
do_exit() ... ...
schedule() ... ...
raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock)
... ... ...
raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock)
... ... (asquired)
put_task_struct() ... ...
free_task_struct() ... ...
... ... raw_spin_unlock(&rq->lock)
... (asquired) ...
... ... ...
... (use after free) ...
So, let's implement 100% guaranteed way to cancel the timer and let's
be sure we are safe even in very unlikely situations.
rq unlocking does not limit the area of switched_from_dl() use, because
this has already been possible in pull_dl_task() below.
Let's consider the safety of of this unlocking. New code in the patch
is working when hrtimer_try_to_cancel() fails. This means the callback
is running. In this case hrtimer_cancel() is just waiting till the
callback is finished. Two
1) Since we are in switched_from_dl(), new class is not dl_sched_class and
new prio is not less MAX_DL_PRIO. So, the callback returns early; it's
right after !dl_task() check. After that hrtimer_cancel() returns back too.
The above is:
raw_spin_lock(rq->lock); ...
... dl_task_timer()
... raw_spin_lock(rq->lock);
switched_from_dl() ...
hrtimer_try_to_cancel() ...
raw_spin_unlock(rq->lock); ...
hrtimer_cancel() ...
... raw_spin_unlock(rq->lock);
... return HRTIMER_NORESTART;
... ...
raw_spin_lock(rq->lock); ...
2) But the below is also possible:
dl_task_timer()
raw_spin_lock(rq->lock);
...
raw_spin_unlock(rq->lock);
raw_spin_lock(rq->lock); ...
switched_from_dl() ...
hrtimer_try_to_cancel() ...
... return HRTIMER_NORESTART;
raw_spin_unlock(rq->lock); ...
hrtimer_cancel(); ...
raw_spin_lock(rq->lock); ...
In this case hrtimer_cancel() returns immediately. Very unlikely case,
just to mention.
Nobody can manipulate the task, because check_class_changed() is
always called with pi_lock locked. Nobody can force the task to
participate in (concurrent) priority inheritance schemes (the same reason).
All concurrent task operations require pi_lock, which is held by us.
No deadlocks with dl_task_timer() are possible, because it returns
right after !dl_task() check (it does nothing).
If we receive a new dl_task during the time of unlocked rq, we just
don't have to do pull_dl_task() in switched_from_dl() further.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
[ Added comments]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In some cases this can trigger a true flood of output.
Requested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.
In particular, preemption preserves task_struct::state.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Validate we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, eg. mutex usage in a wait loop.
Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.
Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.
This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
exclusive cpusets
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:
- No check is performed when the user tries to attach a task to
an exlusive cpuset (recall that exclusive cpusets have an
associated maximum allowed bandwidth).
- Bandwidths of source and destination cpusets are not correctly
updated after a task is migrated between them.
This patch fixes both things at once, as they are opposite faces
of the same coin.
The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
task_preempt_count() is pointless if preemption counter is per-cpu,
currently this is x86 only. It is only valid if the task is not
running, and even in this case the only info it can provide is the
state of PREEMPT_ACTIVE bit.
Change its single caller to check p->on_rq instead, this should be
the same if p->state != TASK_RUNNING, and kill this helper.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/20141008183348.GC17495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Both callers of finish_task_switch() need to recalculate this_rq()
and pass it as an argument, plus __schedule() does this again after
context_switch().
It would be simpler to call this_rq() once in finish_task_switch()
and return the this rq to the callers.
Note: probably "int cpu" in __schedule() should die; it is not used
and both rcu_note_context_switch() and wq_worker_sleeping() do not
really need this argument.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009193232.GB5408@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|