aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched.c
Commit message (Collapse)AuthorAge
* [PATCH] struct seq_operations and struct file_operations constificationHelge Deller2006-12-07
| | | | | | | | | | | | | | - move some file_operations structs into the .rodata section - move static strings from policy_types[] array into the .rodata section - fix generic seq_operations usages, so that those structs may be defined as "const" as well [akpm@osdl.org: couple of fixes] Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: correct output of show_state()Chris Caputo2006-12-07
| | | | | | | | | | | | | | | | | | | | | | | | At present show_state prints a header the does not match the output of show_task, as follows: - sibling task PC pid father child younger older init S 00000000 0 1 0 2 (NOTLB) - This patch corrects the output of show_state so that the header is aligned with the data, ala: - free sibling task PC stack pid father child younger older init S 00000000 0 1 0 2 (NOTLB) - Signed-off-by: Chris Caputo <ccaputo@alt.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] hotplug CPU: clean up hotcpu_notifier() useIngo Molnar2006-12-07
| | | | | | | | | | | | | | | | | | There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn, prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus generating compiler warnings of unused symbols, hence forcing people to add #ifdefs. the compiler can skip truly unused functions just fine: text data bss dec hex filename 1624412 728710 3674856 6027978 5bfaca vmlinux.before 1624412 728710 3674856 6027978 5bfaca vmlinux.after [akpm@osdl.org: topology.c fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sleep profilingIngo Molnar2006-12-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement prof=sleep profiling. TASK_UNINTERRUPTIBLE sleeps will be taken as a profile hit, and every millisecond spent sleeping causes a profile-hit for the call site that initiated the sleep. Sample readprofile output on i386: 306 ps2_sendbyte 1.3973 432 call_usermodehelper_keys 1.9548 484 ps2_command 0.6453 790 __driver_attach 4.7879 1593 msleep 44.2500 3976 sync_buffer 64.1290 4076 do_lookup 12.4648 8587 sync_page 122.6714 20820 total 0.0067 (NOTE: architectures need to check whether get_wchan() can be called from deep within the wakeup path.) akpm: we need to mark more functions __sched. lock_sock(), msleep(), others.. akpm: the contention in do_lookup() is a surprise. Presumably doing disk reads for directory contents while holding i_mutex. [akpm@osdl.org: various fixes] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] lockdep: print current locks on in_atomic warningsPeter Zijlstra2006-12-07
| | | | | | | | | | Add debug_show_held_locks(current) to __might_sleep() and schedule(); this makes finding the offending lock leak easier. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SysRq-X: show blocked tasksIngo Molnar2006-12-07
| | | | | | | | | | Add SysRq-X support: show blocked (TASK_UNINTERRUPTIBLE) tasks only. Useful for debugging IO stalls. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Add include/linux/freezer.h and move definitions from sched.hNigel Cunningham2006-12-07
| | | | | | | | | | | | | Move process freezing functions from include/linux/sched.h to freezer.h, so that modifications to the freezer or the kernel configuration don't require recompiling just about everything. [akpm@osdl.org: fix ueagle driver] Signed-off-by: Nigel Cunningham <nigel@suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] readjust comments of task_timeslice for kernel docBorislav Petkov2006-10-20
| | | | | | | Signed-off-by: Borislav Petkov <petkov@math.uni-muenster.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: likely profilingNick Piggin2006-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This likely profiling is pretty fun. I found a few possible problems in sched.c. This patch may be not measurable, but when I did measure long ago, nooping (un)likely cost a couple of % on scheduler heavy benchmarks, so it all adds up. Tweak some branch hints: - the 2nd 64 bits in the bitmask is likely to be populated, because it contains the first 28 bits (nearly 3/4) of the normal priorities. (ratio of 669669:691 ~= 1000:1). - it isn't unlikely that context switching switches to another process. it might be very rapidly switching to and from the idle process (ratio of 475815:419004 and 471330:423544). Let the branch predictor decide. - preempt_enable seems to be very often called in a nested preempt_disable or with interrupts disabled (ratio of 3567760:87965 ~= 40:1) Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Daniel Walker <dwalker@mvista.com> Cc: Hua Zhong <hzhong@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] scheduler: NUMA aware placement of sched_group_allnodesChristoph Lameter2006-10-03
| | | | | | | | | | | | When the per cpu sched domains are build then they also need to be placed on the node where the cpu resides otherwise we will have frequent off node accesses which will slow down the system. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: fixing wrong comment for find_idlest_cpu()Satoru Takeuchi2006-10-03
| | | | | | | | | Fixing wrong comment for find_idlest_cpu(). Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: cleanup sched_group cpu_power setupSiddha, Suresh B2006-10-03
| | | | | | | | | | | | | | | | | | | | | Up to now sched group's cpu_power for each sched domain is initialized independently. This made the setup code ugly as the new sched domains are getting added. Make the sched group cpu_power setup code generic, by using domain child field and new domain flag in sched_domain. For most of the sched domains(except NUMA), sched group's cpu_power is now computed generically using the domain properties of itself and of the child domain. sched groups in NUMA domains are setup little differently and hence they don't use this generic mechanism. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: introduce child field in sched_domainSiddha, Suresh B2006-10-03
| | | | | | | | | | | | | | | Introduce the child field in sched_domain struct and use it in sched_balance_self(). We will also use this field in cleaning up the sched group cpu_power setup(done in a different patch) code. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: don't print migration cost when only 1 CPUDave Jones2006-10-03
| | | | | | | | | | If only a single CPU is present, printing this doesn't make much sense. Signed-off-by: Dave Jones <davej@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: remove unnecessary sched group allocationsSiddha, Suresh B2006-10-03
| | | | | | | | | | | | | Remove dynamic sched group allocations for MC and SMP domains. These allocations can easily fail on big systems(1024 or so CPUs) and we can live with out these dynamic allocations. [akpm@osdl.org: build fix] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: force /sbin/init off isolated cpusNick Piggin2006-10-03
| | | | | | | | | | | | | | | | | | | | | | | | Force /sbin/init off isolated cpus (unless every CPU is specified as an isolcpu). Users seem to think that the isolated CPUs shouldn't have much running on them to begin with. That's fair enough: intuitive, I guess. It also means that the cpu affinity masks of tasks will not include isolcpus by default, which is also more intuitive, perhaps. /sbin/init is spawned from the boot CPU's idle thread, and /sbin/init starts the rest of userspace. So if the boot CPU is specified to be an isolcpu, then prior to this patch, all of userspace will be run there. (throw in a couple of plausible devinit -> cpuinit conversions I spotted while we're here). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Dimitri Sivanich <sivanich@sgi.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpumask: export cpu_online_map and cpu_possible_map consistentlyGreg Banks2006-10-02
| | | | | | | | | | cpumask: ensure that the cpu_online_map and cpu_possible_map bitmasks, and hence all the macros in <linux/cpumask.h> that require them, are available to modules for all supported combinations of architecture and CONFIG_SMP. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] csa: convert CONFIG tag for extended accounting routinesJay Lan2006-10-01
| | | | | | | | | | | | | | | | | | There were a few accounting data/macros that are used in CSA but are #ifdef'ed inside CONFIG_BSD_PROCESS_ACCT. This patch is to change those ifdef's from CONFIG_BSD_PROCESS_ACCT to CONFIG_TASK_XACCT. A few defines are moved from kernel/acct.c and include/linux/acct.h to kernel/tsacct.c and include/linux/tsacct_kern.h. Signed-off-by: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] introduce TASK_DEAD stateOleg Nesterov2006-09-29
| | | | | | | | | | | | | | | | I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kill PF_DEAD flagOleg Nesterov2006-09-29
| | | | | | | | | | After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we don't need PF_DEAD any longer. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] set EXIT_DEAD state in do_exit(), not in schedule()Oleg Nesterov2006-09-29
| | | | | | | | | | | | | | | | | | | schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD to ensure that the exiting task will be deactivated. Note that this EXIT_DEAD is in fact a "random" value, we can use any bit except normal TASK_XXX values. It is better to set this state in do_exit() along with PF_DEAD flag and remove that check in schedule(). We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it can not change task's ->state: the 'state' argument of try_to_wake_up() can't have EXIT_DEAD bit. And in case when try_to_wake_up() sees a stale value of ->state == TASK_RUNNING it will do nothing. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched_setscheduler: fix? policy checksOleg Nesterov2006-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I am not sure this patch is correct: I can't understand what the current code does, and I don't know what it was supposed to do. The comment says: * can't change policy, except between SCHED_NORMAL * and SCHED_BATCH: The code: if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) && (policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) && But this is equivalent to: if ( (is_rt_policy(policy) && has_rt_policy(p)) && which means something different. We can't _decrease_ the current ->rt_priority with such a check (if rlim[RLIMIT_RTPRIO] == 0). Probably, it was supposed to be: if ( !(policy == SCHED_NORMAL && p->policy == SCHED_BATCH) && !(policy == SCHED_BATCH && p->policy == SCHED_NORMAL) this matches the comment, but strange: it doesn't allow to _drop_ the realtime priority when rlim[RLIMIT_RTPRIO] == 0. I think the right check would be: /* can't set/change rt policy */ if (is_rt_policy(policy) && policy != p->policy && !rlim_rtprio) return -EPERM; Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] introduce is_rt_policy() helperOleg Nesterov2006-09-29
| | | | | | | | | | | | Imho, makes the code a bit easier to read. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] do_sched_setscheduler(): don't take tasklist_lockOleg Nesterov2006-09-29
| | | | | | | | | | | | | Use rcu locks instead. sched_setscheduler() now takes ->siglock before reading ->signal->rlim[]. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] check return value of cpu_callbackAkinobu Mita2006-09-29
| | | | | | | | | | | | | Spawing ksoftirqd, migration, or watchdog, and calling init_timers_cpu() may fail with small memory. If it happens in initcalls, kernel NULL pointer dereference happens later. This patch makes crash happen immediately in such cases. It seems a bit better than getting kernel NULL pointer dereference later. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix longstanding load balancing bug in the schedulerChristoph Lameter2006-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The scheduler will stop load balancing if the most busy processor contains processes pinned via processor affinity. The scheduler currently only does one search for busiest cpu. If it cannot pull any tasks away from the busiest cpu because they were pinned then the scheduler goes into a corner and sulks leaving the idle processors idle. F.e. If you have processor 0 busy running four tasks pinned via taskset, there are none on processor 1 and one just started two processes on processor 2 then the scheduler will not move one of the two processes away from processor 2. This patch fixes that issue by forcing the scheduler to come out of its corner and retrying the load balancing by considering other processors for load balancing. This patch was originally developed by John Hawkes and discussed at http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2. I have removed extraneous material and gone back to equipping struct rq with the cpu the queue is associated with since this makes the patch much easier and it is likely that others in the future will have the same difficulty of figuring out which processor owns which runqueue. The overhead added through these patches is a single word on the stack if the kernel is configured to support 32 cpus or less (32 bit). For 32 bit environments the maximum number of cpus that can be configued is 255 which would result in the use of 32 bytes additional on the stack. On IA64 up to 1k cpus can be configured which will result in the use of 128 additional bytes on the stack. The maximum additional cache footprint is one cacheline. Typically memory use will be much less than a cacheline and the additional cpumask will be placed on the stack in a cacheline that already contains other local variable. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: John Hawkes <hawkes@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] revert "Drop tasklist lock in do_sched_setscheduler"Oleg Nesterov2006-08-27
| | | | | | | | | | | | | | sched_setscheduler() looks at ->signal->rlim[]. It is unsafe do dereference ->signal unless tasklist_lock or ->siglock is held (or p == current). We pin the task structure, but this can't prevent from release_task()->__exit_signal() which sets ->signal = NULL. Restore tasklist_lock across the setscheduler call. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pi-futex: missing pi_waiters plist initializationHeiko Carstens2006-07-31
| | | | | | | | | | | | | | | | | | | | | | Initialize init task's pi_waiters plist. Otherwise cpu hotplug of cpu 0 might crash, since rt_mutex_getprio() accesses an uninitialized list head. call chain which led to crash: take_cpu_down sched_idle_next __setscheduler rt_mutex_getprio Using PLIST_HEAD_INIT in the INIT_TASK macro doesn't work unfortunately, since the pi_waiters member is only conditionally present. Cc: Arjan van de Ven <arjan@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fix cond_resched() fixJim Houston2006-07-31
| | | | | | | | | | | | | | In cond_resched_lock() it calls __resched_legal() before dropping the spin lock. __resched_legal() will always finds the preempt_count non-zero and will prevent the call to __cond_resched(). The attached patch adds a parameter to __resched_legal() with the expected preempt_count value. Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: build_sched_domains() fixSiddha, Suresh B2006-07-31
| | | | | | | | | | | Use the correct groups while initializing sched groups power for allnodes_domain. This fixes the crash observed while creating exclusive cpusets. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-and-tested-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] per-task-delay-accounting: cpu delay collection via schedstatsChandra Seetharaman2006-07-15
| | | | | | | | | | | | | | | | | Make the task-related schedstats functions callable by delay accounting even if schedstats collection isn't turned on. This removes the dependency of delay accounting on schedstats. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] per-task-delay-accounting: sync block I/O and swapin delay collectionShailabh Nagar2006-07-15
| | | | | | | | | | | | | | | | | | | | | Unlike earlier iterations of the delay accounting patches, now delays are only collected for the actual I/O waits rather than try and cover the delays seen in I/O submission paths. Account separately for block I/O delays incurred as a result of swapin page faults whose frequency can be affected by the task/process' rss limit. Hence swapin delays can act as feedback for rss limit changes independent of I/O priority changes. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] lockdep: core, fix rq-lock handling on __ARCH_WANT_UNLOCKED_CTXSWIngo Molnar2006-07-15
| | | | | | | | | | | | | | | | | On platforms that have __ARCH_WANT_UNLOCKED_CTXSW set and want to implement lock validator support there's a bug in rq->lock handling: in this case we dont 'carry over' the runqueue lock into another task - but still we did a spinlock_release() of it. Fix this by making the spinlock_release() in context_switch() dependent on !__ARCH_WANT_UNLOCKED_CTXSW. (Reported by Ralf Baechle on MIPS, which has __ARCH_WANT_UNLOCKED_CTXSW. This fixes a lockdep-internal BUG message on such platforms.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] small kernel/sched.c cleanupAndreas Mohr2006-07-10
| | | | | | | | | | - constify and optimize stat_nam (thanks to Michael Tokarev!) - spelling and comment fixes Signed-off-by: Andreas Mohr <andi@lisas.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: fix bug in __migrate_task()Peter Williams2006-07-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In the function __migrate_task(), deactivate_task() followed by activate_task() is used to move the task from one run queue to another. This has two undesirable effects: 1. The task's priority is recalculated. (Nowhere else in the scheduler code is the priority recalculated for a change of CPU.) 2. The task's time stamp is set to the current time. At the very least, this makes the adjustment of the time stamp before the call to deactivate_task() redundant but I believe the problem is more serious as the time stamp now holds the time of the queue change instead of the time at which the task was woken. In addition, unless dest_rq is the same queue as "current" is on the time stamp could be inaccurate due to inter CPU drift. Solution: Replace the call to activate_task() with one to __activate_task(). Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: cleanup, convert sched.c-internal typedefs to structIngo Molnar2006-07-03
| | | | | | | | | | | | | | | | | | | convert: - runqueue_t to 'struct rq' - prio_array_t to 'struct prio_array' - migration_req_t to 'struct migration_req' I was the one who added these but they are both against the kernel coding style and also were used inconsistently at places. So just get rid of them at once, now that we are flushing the scheduler patch-queue anyway. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: cleanup, remove task_t, convert to struct task_structIngo Molnar2006-07-03
| | | | | | | | | | | | cleanup: remove task_t and convert all the uses to struct task_struct. I introduced it for the scheduler anno and it was a mistake. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: clean up fallout of recent changesIngo Molnar2006-07-03
| | | | | | | | | | | | | | | | | | | | | | Clean up some of the impact of recent (and not so recent) scheduler changes: - turning macros into nice inline functions - sanitizing and unifying variable definitions - whitespace, style consistency, 80-lines, comment correctness, spelling and curly braces police Due to the macro hell and variable placement simplifications there's even 26 bytes of .text saved: text data bss dec hex filename 25510 4153 192 29855 749f sched.o.before 25484 4153 192 29829 7485 sched.o.after [akpm@osdl.org: build fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] lockdep: annotate scheduler runqueue locksIngo Molnar2006-07-03
| | | | | | | | | | Teach per-CPU runqueue locks and recursive locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] lockdep: prove spinlock rwlock locking correctnessIngo Molnar2006-07-03
| | | | | | | | | | Use the lock validator framework to prove spinlock and rwlock locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] lockdep: irqtrace subsystem, coreIngo Molnar2006-07-03
| | | | | | | | | | | | Accurate hard-IRQ-flags and softirq-flags state tracing. This allows us to attach extra functionality to IRQ flags on/off events (such as trace-on/off). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] lockdep: better lock debuggingIngo Molnar2006-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Generic lock debugging: - generalized lock debugging framework. For example, a bug in one lock subsystem turns off debugging in all lock subsystems. - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from the mutex/rtmutex debugging code: it caused way too much prototype hackery, and lockdep will give the same information anyway. - ability to do silent tests - check lock freeing in vfree too. - more finegrained debugging options, to allow distributions to turn off more expensive debugging features. There's no separate 'held mutexes' list anymore - but there's a 'held locks' stack within lockdep, which unifies deadlock detection across all lock classes. (this is independent of the lockdep validation stuff - lockdep first checks whether we are holding a lock already) Here are the current debugging options: CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_LOCK_ALLOC=y which do: config DEBUG_MUTEXES bool "Mutex debugging, basic checks" config DEBUG_LOCK_ALLOC bool "Detect incorrect freeing of live mutexes" Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cond_resched() fixAndrew Morton2006-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a bug identified by Zou Nan hai <nanhai.zou@intel.com>: If the system is in state SYSTEM_BOOTING, and need_resched() is true, cond_resched() returns true even though it didn't reschedule. Consequently need_resched() remains true and JBD locks up. Fix that by teaching cond_resched() to only return true if it really did call schedule(). cond_resched_lock() and cond_resched_softirq() have a problem too. If we're in SYSTEM_BOOTING state and need_resched() is true, these functions will drop the lock and will then try to call schedule(), but the SYSTEM_BOOTING state will prevent schedule() from being called. So on return, need_resched() will still be true, but cond_resched_lock() has to return 1 to tell the caller that the lock was dropped. The caller will probably lock up. Bottom line: if these functions dropped the lock, they _must_ call schedule() to clear need_resched(). Make it so. Also, uninline __cond_resched(). It's largeish, and slowpath. Acked-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] rtmutex: Propagate priority settings into PI lock chainsThomas Gleixner2006-06-27
| | | | | | | | | | | | | | | | | When the priority of a task, which is blocked on a lock, changes we must propagate this change into the PI lock chain. Therefor the chain walk code is changed to get rid of the references to current to avoid false positives in the deadlock detector, as setscheduler might be called by a task which holds the lock on which the task whose priority is changed is blocked. Also add some comments about the get/put_task_struct usage to avoid confusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Drop tasklist lock in do_sched_setschedulerThomas Gleixner2006-06-27
| | | | | | | | | | | | | | There is no need to hold tasklist_lock across the setscheduler call, when we pin the task structure with get_task_struct(). Interrupts are disabled in setscheduler anyway and the permission checks do not need interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pi-futex: scheduler support for piIngo Molnar2006-06-27
| | | | | | | | | | | | | | | | | | | | | | Add framework to boost/unboost the priority of RT tasks. This consists of: - caching the 'normal' priority in ->normal_prio - providing a functions to set/get the priority of the task - make sched_setscheduler() aware of boosting The effective_prio() cleanups also fix a priority-calculation bug pointed out by Andrey Gelman, in set_user_nice(). has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrey Gelman <agelman@012.net.il> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] BUG() if setscheduler is called from interrupt contextSteven Rostedt2006-06-27
| | | | | | | | | | | | | | | | | Thomas Gleixner is adding the call to a rtmutex function in setscheduler. This call grabs a spin_lock that is not always protected by interrupts disabled. So this means that setscheduler cant be called from interrupt context. To prevent this from happening in the future, this patch adds a BUG_ON(in_interrupt()) in that function. (Thanks to akpm <aka. Andrew Morton> for this suggestion). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: uninline task_rq_lock()Oleg Nesterov2006-06-27
| | | | | | | | | | | | Saves 543 bytes from sched.o (gcc 3.3.3). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: mc/smt power savings sched policySiddha, Suresh B2006-06-27
| | | | | | | | | | | | | | | | | | | | | | sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in /sys/devices/system/cpu/ control the MC/SMT power savings policy for the scheduler. Based on the values (1-enable, 0-disable) for these controls, sched groups cpu power will be determined for different domains. When power savings policy is enabled and under light load conditions, scheduler will minimize the physical packages/cpu cores carrying the load and thus conserving power(with a perf impact based on the workload characteristics... see OLS 2005 CMP kernel scheduler paper for more details..) Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched_domai: Allocate sched_group structures dynamicallySrivatsa Vaddagiri2006-06-27
| | | | | | | | | | | | | | | | | | As explained here: http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2 there is a problem with sharing sched_group structures between two separate sched_group structures for different sched_domains. The patch has been tested and found to avoid the kernel lockup problem described in above URL. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>