move the switch_from/switch_to trace macros so that the
scheduling overhead shows up more clearly in the traces
Helps when debugging wake-up related crashes.
Other quantum-based plugins also require the present flag. Hence,
move it to the core data structure
This provides and hooks up a new made-from-scratch sched_trace()
implementation based on Feather-Trace and ftdev.
- TICK
- SCHED1
- SCHED2
- RELEASE
- CXS
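
The event classes above suggest paired start/end timestamp points. A
hypothetical sketch of how such points can be wrapped around
Feather-Trace triggers (the macro names, event IDs, and the
save_timestamp callback are illustrative assumptions, not taken from
this patch):

    /* Each overhead class gets a start/end pair of timestamp points,
     * each backed by a Feather-Trace trigger; the IDs are made up. */
    #define TS_SCHED_START  ft_event0(100, save_timestamp)
    #define TS_SCHED_END    ft_event0(101, save_timestamp)
    #define TS_CXS_START    ft_event0(104, save_timestamp)
    #define TS_CXS_END      ft_event0(105, save_timestamp)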
- remove outdated comment
- reorder stack_in_use marker to be in front of finish_lock_switch()
- add TRACE()s to try_to_wake_up()
This caused LITMUS^RT real-time tasks to be missed, which
in turn caused all kinds of inconsistent state.
This change fixes a race where a job could be executed on more than one
CPU, which led to random crashes.
Many things can't be done from within the scheduler.
This framework should make it easier to defer them.
The old version had a significant window for races with
interrupt handlers executing on other CPUs.
A proper real-time migration works as follows:
1) transition to best-effort task
2) migrate to target CPU
3) update real-time parameters
4) transition to real-time task
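
A hypothetical userspace sketch of that sequence (the liblitmus-style
names task_mode(), set_rt_task_param(), BACKGROUND_TASK, LITMUS_RT_TASK
and struct rt_task are assumed, not verified against this tree):

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_setaffinity() */
    #include <unistd.h>     /* getpid() */
    #include <litmus.h>     /* assumed liblitmus header */

    int migrate_rt_task(int target_cpu, struct rt_task *params)
    {
            cpu_set_t set;

            /* 1) transition to best-effort task */
            if (task_mode(BACKGROUND_TASK) != 0)
                    return -1;

            /* 2) migrate to target CPU */
            CPU_ZERO(&set);
            CPU_SET(target_cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                    return -1;

            /* 3) update real-time parameters */
            if (set_rt_task_param(getpid(), params) != 0)
                    return -1;

            /* 4) transition to real-time task */
            return task_mode(LITMUS_RT_TASK);
    }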
The SRP implementation did not correctly handle various
suspension-related scenarios. The need for SRP blocking is now
tested on each scheduling event. This ensures mutual exclusion
under the SRP even in the face of unexpected suspensions, for
example due to IO.
This is useful for releasing exactly n tasks without having to
call complete() repeatedly.
This introduces the core changes ported from LITMUS 2007.
The kernel seems to work under QEMU, but many bugs probably remain.
setting cpu share to 1 causes hangs, as reported in:
http://bugzilla.kernel.org/show_bug.cgi?id=9779
as the default share is 1024, the values of 0 and 1 can indeed
cause problems. Limit it to 2 or higher values.
These values can only be set by the root user - but still it
makes sense to protect against nonsensical values.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
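
A minimal sketch of such a guard in the shares-write handler (whether
the real change rejects or silently clamps is an assumption; rejection
is shown):

    /* shares of 0 or 1 are known to hang the box (bugzilla #9779) */
    if (shareval < 2)
            return -EINVAL;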
The show_task function invoked by sysrq-t et al displays the
pid and parent's pid of each task. It seems more useful to
show the actual process hierarchy here than who is using
ptrace on each process.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
touch softlockup watchdog after idling.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and
sched_move_task()) must handle a given task differently in case it's the
'rq->curr' task on its run-queue. The task_running() interface is not
suitable for determining such tasks on platforms with one of the
following options:
#define __ARCH_WANT_UNLOCKED_CTXSW
#define __ARCH_WANT_INTERRUPTS_ON_CTXSW
because it uses 'p->oncpu == 1' as its criterion, and such a task is
not necessarily 'rq->curr'.
The detailed explanation is available here:
https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
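
A sketch of the predicate such services need instead (the helper name
is assumed): compare against rq->curr directly rather than relying on
p->oncpu.

    /* p is this rq's current task iff rq->curr says so; p->oncpu may
     * still be 1 for a task that is merely mid-context-switch on
     * __ARCH_WANT_UNLOCKED_CTXSW platforms. */
    static inline int task_is_current(struct rq *rq, struct task_struct *p)
    {
            return rq->curr == p;
    }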
some platforms have sched_clock() implementations that cannot be called
very early during wakeup. If it is called anyway, it might hang or crash
in hard-to-debug ways. So only call update_rq_clock() [which calls
sched_clock()] if sched_init() has already been called. (rq->idle is
NULL before the scheduler is initialized.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
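
A sketch of the guard this describes, at the point where the clock
would otherwise be updated:

    /* rq->idle is only set once sched_init() has run, so it doubles
     * as an "is the scheduler initialized?" flag here */
    if (rq->idle)
            update_rq_clock(rq);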
style cleanup of various changes that were done recently.
no code changed:
text data bss dec hex filename
23680 2542 28 26250 668a sched.o.before
23680 2542 28 26250 668a sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Luiz Fernando N. Capitulino reported that sched_rr_get_interval()
crashes for SCHED_OTHER tasks that are on an idle runqueue.
The fix is to return a 0 timeslice for tasks that are on an idle
runqueue. (and which are not running, obviously)
this also shrinks the code a bit:
text data bss dec hex filename
47903 3934 336 52173 cbcd sched.o.before
47885 3934 336 52155 cbbb sched.o.after
Reported-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for
us, which provided a cpu accounting resource controller. This feature would be
useful if someone wants to group tasks only for accounting purposes and doesn't
really want to exercise any control over their cpu consumption.
The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
these differences:
- Removed load average information. I felt it needs more thought (esp
to deal with SMP and virtualized platforms) and can be added for
2.6.25 after more discussions.
- Convert group cpu usage to be nanosecond accurate (as rest of the cfs
stats are) and invoke cpuacct_charge() from the respective scheduler
classes
- Make accounting scalable on SMP systems by splitting the usage
counter to be per-cpu
- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
code is not big enough to warrant a new file and also this rightly
needs to live inside the scheduler. Also things like accessing
rq->lock while reading cpu usage becomes easier if the code lived in
kernel/sched.c)
The patch also modifies the cpu controller not to provide the same accounting
information.
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
some simple tests like cpuspin (spin on the cpu), ran several tasks in
the same group and timed them. Compared their time stamps with
cpuacct.usage.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
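
A sketch of the per-cpu accounting shape described above (field and
helper names are assumptions based on this description):

    struct cpuacct {
            struct cgroup_subsys_state css;
            u64 *cpuusage;          /* per-cpu usage, in nanoseconds */
    };

    /* called from the scheduler classes whenever a task accrues time */
    void cpuacct_charge(struct task_struct *tsk, u64 cputime)
    {
            struct cpuacct *ca = task_ca(tsk);  /* task -> its group */
            u64 *usage = percpu_ptr(ca->cpuusage, task_cpu(tsk));

            *usage += cputime;      /* per-cpu counter: scalable on SMP */
    }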
move __sched_text_start/end to sched.h. No code changed:
text data bss dec hex filename
26582 2310 28 28920 70f8 sched.o.before
26582 2310 28 28920 70f8 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
clean up sd_alloc_ctl_cpu_table() definition.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
reorder SCHED_FEAT_ bits so that the used ones come first. Makes
tuning instructions easier.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpu_down() code is ok wrt sched_idle_next() placing the 'idle' task not
at the beginning of the queue.
So get rid of activate_idle_task() and make use of activate_task() instead.
activate_idle_task() is the same as activate_task(), apart from a
redundant update_rq_clock(rq) call.
Code size goes down:
text data bss dec hex filename
47853 3934 336 52123 cb9b sched.o.before
47828 3934 336 52098 cb82 sched.o.after
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core
system; these crashes can only be explained by runqueue corruption.
There is a narrow SMP race in __set_task_cpu(): after ->cpu is set to
a new value, task_rq_lock(p, ...) can be successfully executed on another
CPU. We must ensure that updates of per-task data have been completed by
this moment.
this bug has been hiding in the Linux scheduler for an eternity (we never
had any explicit barrier for task->cpu in set_task_cpu() - so the bug was
introduced in 2.5.1), but only became visible via set_task_cfs_rq() being
accidentally put after the task->cpu update. It also probably needs a
sufficiently out-of-order CPU to trigger.
Reported-by: Grant Wilson <grant.wilson@zen.co.uk>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
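
The described fix amounts to a write barrier between the per-task
updates and the ->cpu store; a sketch of the resulting helper:

    static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
    {
            set_task_cfs_rq(p, cpu);
    #ifdef CONFIG_SMP
            /* make all prior per-task stores visible before ->cpu
             * changes, since task_rq_lock(p, ...) on another CPU keys
             * off ->cpu */
            smp_wmb();
            task_thread_info(p)->cpu = cpu;
    #endif
    }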
Suppose that a SCHED_FIFO task does

	switch_uid(new_user);

Now, p->se.cfs_rq and p->se.parent both point into the old
user_struct->tg, because sched_move_task() doesn't call set_task_cfs_rq()
for the !fair_sched_class case.

Suppose that the old user_struct/task_group is freed/reused, and the
task then does

	sched_setscheduler(SCHED_NORMAL);

__setscheduler() sets fair_sched_class, but doesn't update
->se.cfs_rq/parent, which still point to the freed memory.

This means that check_preempt_wakeup() doing

	while (!is_same_group(se, pse)) {
		se = parent_entity(se);
		pse = parent_entity(pse);
	}

may OOPS in a similar way if rq->curr or p did something like the above.

Perhaps we need something like the patch below; note that
__setscheduler() can't do set_task_cfs_rq().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
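
One plausible shape of "the patch below" (abridged; locking and
requeueing omitted, and the set_task_cfs_rq() signature is assumed):
have sched_move_task() refresh the CFS pointers for every task, not
only fair_sched_class ones.

    /* unconditionally repoint ->se.cfs_rq / ->se.parent at the task's
     * (possibly new) group, so that a later sched_setscheduler() never
     * inherits stale pointers */
    set_task_cfs_rq(tsk, task_cpu(tsk));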
Currently the scheduler checks for PF_VCPU to decide if this timeslice
has to be accounted as guest time. On s390 host interrupts are not
disabled during guest execution. This causes these interrupts to be
accounted as guest time if CONFIG_VIRT_CPU_ACCOUNTING is set. Solution
is to check if an interrupt triggered account_system_time. As the tick
is timer interrupt based, we have to subtract hardirq_offset.
I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on
x86_64. Seems to work.
CC: Avi Kivity <avi@qumranet.com>
CC: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
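
A sketch of the described check inside account_system_time() (the
exact predicate is an assumption):

    /* account as guest time only if no host interrupt is on the
     * stack once the tick's own hardirq_offset is subtracted */
    if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
            account_guest_time(p, cputime);
            return;
    }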
Revert 62d0df64065e7c135d0002f069444fbdfc64768f.
This was originally intended as a simple initial example of how to create a
control groups subsystem; it wasn't intended for mainline, but I didn't make
this clear enough to Andrew.
The CFS cgroup subsystem now has better functionality for the per-cgroup usage
accounting (based directly on CFS stats) than the "usage" status file in this
patch, and the "load" status file is rather simplistic - although having a
per-cgroup load average report would be a useful feature, I don't believe this
patch actually provides it. If it gets into the final 2.6.24 we'd probably
have to support this interface forever.
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a proper prototype for migration_init() in
include/linux/sched.h
Since there's no point in always returning 0 to a caller that doesn't check
the return value, it also changes the function to return void.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
SMP balancing is done with IRQs disabled and can iterate the full rq.
When rqs are large this can cause large irq-latencies. Limit the nr of
iterations on each run.
This fixes a scheduling latency regression reported by the -rt folks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
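
A sketch of such a cap (the sysctl name is an assumption): bound the
scan inside balance_tasks() so that irqs-off sections stay short.

    const_debug unsigned int sysctl_sched_nr_migrate = 32;

    /* inside balance_tasks()'s scan over the runqueue: */
    if (loops++ > sysctl_sched_nr_migrate)
            break;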
remove PREEMPT_RESTRICT. (this is a separate commit so that any
regression related to the removal itself is bisectable)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
PREEMPT_RESTRICT was a method aimed at reducing the amount of
wakeup-related preemption. It has a disadvantage though: it can prevent
legitimate wakeups if a task is 'unlucky' enough to be hit too early by a
tick that clears peer_preempt.
Now that the wakeup preemption has been cleaned up we don't seem to have
excessive preemptions anymore, so this feature can be turned off (and
removed in the next patch)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1) hardcoded 1000000000 value is used five times in places where
NSEC_PER_SEC might be more readable.
2) A conversion from nsec to msec uses the hardcoded 1000000 value,
which is a candidate for NSEC_PER_MSEC.
no code changed:
text data bss dec hex filename
44359 3326 36 47721 ba69 sched.o.before
44359 3326 36 47721 ba69 sched.o.after
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
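
The substitution in miniature:

    #include <linux/time.h> /* NSEC_PER_SEC, NSEC_PER_MSEC */

    static inline unsigned long long ns_to_msec(unsigned long long ns)
    {
            return ns / NSEC_PER_MSEC;      /* was: ns / 1000000 */
    }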
Yanmin Zhang reported an aim7 regression and bisected it down to:
| commit 38ad464d410dadceda1563f36bdb0be7fe4c8938
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Mon Oct 15 17:00:02 2007 +0200
|
| sched: uniform tunings
|
| use the same defaults on both UP and SMP.
fix this by reintroducing similar SMP tunings. This resolves
the regression.
(also update the comments to match the ilog2(nr_cpus) tuning effect)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fallout of recent commits: small coding style fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds a cpu.usage file to the CFS cgroup that reports CPU usage in
milliseconds for that cgroup's tasks.
[ mingo@elte.hu: style cleanups. ]
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra noticed that the rcu_head object need not be present
in every cfs_rq of a group. Move it to the task_group structure
instead.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
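
The resulting shape, roughly (other task_group fields elided):

    struct task_group {
            struct sched_entity **se;       /* per-CPU schedulable entities */
            struct cfs_rq **cfs_rq;         /* per-CPU runqueues */
            unsigned long shares;
            struct rcu_head rcu;            /* one per group, was per cfs_rq */
    };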
This patch:

	commit 9b5b77512dce239fa168183fa71896712232e95a
	Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
	Date:   Mon Oct 15 17:00:09 2007 +0200

	    sched: clean up code under CONFIG_FAIR_GROUP_SCHED

introduced an assumption of the existence of CPU0 via this line:

	cfs_rq = tg->cfs_rq[0];

If you have no CPU0, that will be NULL. The fix seems to be just to
take whatever cfs_rq queue comes out of the for_each_possible_cpu()
loop, since they're all equally good for the destruction operation.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
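
A sketch of that approach (the surrounding teardown code is assumed):

    struct cfs_rq *cfs_rq = NULL;
    int i;

    /* whichever possible CPU's cfs_rq we end up with is fine for
     * the destruction path; no dependence on CPU0 existing */
    for_each_possible_cpu(i)
            cfs_rq = tg->cfs_rq[i];

    BUG_ON(!cfs_rq);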
account_guest_time() can become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At the moment, a lot of load balancing code that is irrelevant to
non-SMP systems gets included in non-SMP builds.
This patch addresses this issue and reduces the binary size on non-SMP
systems:
text data bss dec hex filename
10983 28 1192 12203 2fab sched.o.before
10739 28 1192 11959 2eb7 sched.o.after
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At the moment, balance_tasks() provides low level functionality for both
move_tasks() and move_one_task() (indirectly) via the load_balance()
function (in the sched_class interface) which also provides dual
functionality. This dual functionality complicates the interfaces and
internal mechanisms and makes the run time overhead of operations that
are called with two run queue locks held.
This patch addresses this issue and reduces the overhead of these
operations.
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- replace "cont" with "cgrp" in a few places in the CFS cgroup code,
- use write_uint rather than write for cpu.shares write function
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A full register dump along with stack backtrace would make the
"scheduling while atomic" message more helpful. Use show_regs() instead
of dump_stack() for this. We already know we're atomic in here (that is
why this function was called) so show_regs()'s atomicity expectations
are guaranteed.
Also, modify the output of the "BUG: scheduling while atomic:" header a
bit to keep task->comm and task->pid together and preempt_count() after
them.
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
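
A sketch of the resulting report path (the get_irq_regs() fallback is
an assumption about how a register set is obtained here):

    static noinline void __schedule_bug(struct task_struct *prev)
    {
            struct pt_regs *regs = get_irq_regs();

            /* comm and pid kept together, preempt_count() after them */
            printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
                    prev->comm, prev->pid, preempt_count());

            /* we are known to be atomic here, so show_regs() is safe */
            if (regs)
                    show_regs(regs);
            else
                    dump_stack();
    }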
clean up sched_domain_debug().
this also shrinks the code a bit:
text data bss dec hex filename
50474 4306 480 55260 d7dc sched.o.before
50404 4306 480 55190 d796 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeff Dike noticed that wait_for_completion_interruptible()'s prototype
had a mismatched fastcall.
Fix this by removing the fastcall attributes from all the completion APIs.
Found-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
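
For one of the affected prototypes the change looks like:

    -extern int fastcall wait_for_completion_interruptible(struct completion *x);
    +extern int wait_for_completion_interruptible(struct completion *x);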