aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched.c
Commit message (Collapse)AuthorAge
* sched: reorder SCHED_FEAT_ bitsIngo Molnar2007-11-15
| | | | | | | reorder SCHED_FEAT_ bits so that the used ones come first. Makes tuning instructions easier. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: remove activate_idle_task()Dmitry Adamushko2007-11-15
| | | | | | | | | | | | | | | | | | cpu_down() code is ok wrt sched_idle_next() placing the 'idle' task not at the beginning of the queue. So get rid of activate_idle_task() and make use of activate_task() instead. It is the same as activate_task(), except for the update_rq_clock(rq) call that is redundant. Code size goes down: text data bss dec hex filename 47853 3934 336 52123 cb9b sched.o.before 47828 3934 336 52098 cb82 sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix __set_task_cpu() SMP raceDmitry Adamushko2007-11-15
| | | | | | | | | | | | | | | | | | | | Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core system, which crashes can only be explained via runqueue corruption. there is a narrow SMP race in __set_task_cpu(): after ->cpu is set up to a new value, task_rq_lock(p, ...) can be successfuly executed on another CPU. We must ensure that updates of per-task data have been completed by this moment. this bug has been hiding in the Linux scheduler for an eternity (we never had any explicit barrier for task->cpu in set_task_cpu() - so the bug was introduced in 2.5.1), but only became visible via set_task_cfs_rq() being accidentally put after the task->cpu update. It also probably needs a sufficiently out-of-order CPU to trigger. Reported-by: Grant Wilson <grant.wilson@zen.co.uk> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHEDOleg Nesterov2007-11-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Suppose that the SCHED_FIFO task does switch_uid(new_user); Now, p->se.cfs_rq and p->se.parent both point into the old user_struct->tg because sched_move_task() doesn't call set_task_cfs_rq() for !fair_sched_class case. Suppose that old user_struct/task_group is freed/reused, and the task does sched_setscheduler(SCHED_NORMAL); __setscheduler() sets fair_sched_class, but doesn't update ->se.cfs_rq/parent which point to the freed memory. This means that check_preempt_wakeup() doing while (!is_same_group(se, pse)) { se = parent_entity(se); pse = parent_entity(pse); } may OOPS in a similar way if rq->curr or p did something like above. Perhaps we need something like the patch below, note that __setscheduler() can't do set_task_cfs_rq(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix accounting of interrupts during guest execution on s390Christian Borntraeger2007-11-15
| | | | | | | | | | | | | | | | | Currently the scheduler checks for PF_VCPU to decide if this timeslice has to be accounted as guest time. On s390 host interrupts are not disabled during guest execution. This causes theses interrupts to be accounted as guest time if CONFIG_VIRT_CPU_ACCOUNTING is set. Solution is to check if an interrupt triggered account_system_time. As the tick is timer interrupt based, we have to subtract hardirq_offset. I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on x86_64. Seems to work. CC: Avi Kivity <avi@qumranet.com> CC: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* revert "Task Control Groups: example CPU accounting subsystem"Andrew Morton2007-11-14
| | | | | | | | | | | | | | | | | | | Revert 62d0df64065e7c135d0002f069444fbdfc64768f. This was originally intended as a simple initial example of how to create a control groups subsystem; it wasn't intended for mainline, but I didn't make this clear enough to Andrew. The CFS cgroup subsystem now has better functionality for the per-cgroup usage accounting (based directly on CFS stats) than the "usage" status file in this patch, and the "load" status file is rather simplistic - although having a per-cgroup load average report would be a useful feature, I don't believe this patch actually provides it. If it gets into the final 2.6.24 we'd probably have to support this interface for ever. Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: proper prototype for kernel/sched.c:migration_init()Adrian Bunk2007-11-09
| | | | | | | | | | | | This patch adds a proper prototype for migration_init() in include/linux/sched.h Since there's no point in always returning 0 to a caller that doesn't check the return value it also changes the function to return void. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: avoid large irq-latencies in smp-balancingPeter Zijlstra2007-11-09
| | | | | | | | | | | | | SMP balancing is done with IRQs disabled and can iterate the full rq. When rqs are large this can cause large irq-latencies. Limit the nr of iterations on each run. This fixes a scheduling latency regression reported by the -rt folks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: remove PREEMPT_RESTRICTIngo Molnar2007-11-09
| | | | | | | remove PREEMPT_RESTRICT. (this is a separate commit so that any regression related to the removal itself is bisectable) Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: turn off PREEMPT_RESTRICTIngo Molnar2007-11-09
| | | | | | | | | | | | | PREEMPT_RESTRICT was a method aimed at reducing the amount of wakeup related preemption. It has a disadvantage though, it can prevent legitimate wakeups if a task is 'unlucky' to be hit too early by a tick that clears peer_preempt. Now that the wakeup preemption has been cleaned up we dont seem to have excessive preemptions anymore, so this feature can be turned off. (and removed in the next patch) Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: cleanup, use NSEC_PER_MSEC and NSEC_PER_SECEric Dumazet2007-11-09
| | | | | | | | | | | | | | | | | 1) hardcoded 1000000000 value is used five times in places where NSEC_PER_SEC might be more readable. 2) A conversion from nsec to msec uses the hardcoded 1000000 value, which is a candidate for NSEC_PER_MSEC. no code changed: text data bss dec hex filename 44359 3326 36 47721 ba69 sched.o.before 44359 3326 36 47721 ba69 sched.o.after Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: reintroduce SMP tunings againIngo Molnar2007-11-09
| | | | | | | | | | | | | | | | | | | Yanmin Zhang reported an aim7 regression and bisected it down to: | commit 38ad464d410dadceda1563f36bdb0be7fe4c8938 | Author: Ingo Molnar <mingo@elte.hu> | Date: Mon Oct 15 17:00:02 2007 +0200 | | sched: uniform tunings | | use the same defaults on both UP and SMP. fix this by reintroducing similar SMP tunings again. This resolves the regression. (also update the comments to match the ilog2(nr_cpus) tuning effect) Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix style in kernel/sched.cIngo Molnar2007-10-29
| | | | | | fallout of recent commits: small coding style fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: report CPU usage in CFS cgroup directoriesPaul Menage2007-10-29
| | | | | | | | | | Adds a cpu.usage file to the CFS cgroup that reports CPU usage in milliseconds for that cgroup's tasks [ mingo@elte.hu: style cleanups. ] Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: move rcu_head to task_group structSrivatsa Vaddagiri2007-10-29
| | | | | | | | | Peter Zijlstra noticed that the rcu_head object need not be present in every cfs_rq of a group. Move it to the task_group structure instead. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix incorrect assumption that cpu 0 existsJames Bottomley2007-10-29
| | | | | | | | | | | | | | | | | | | | | This patch: commit 9b5b77512dce239fa168183fa71896712232e95a Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Mon Oct 15 17:00:09 2007 +0200 sched: clean up code under CONFIG_FAIR_GROUP_SCHED Introduced an assumption of the existence of CPU0 via this line cfs_rq = tg->cfs_rq[0]; If you have no CPU0, that will be NULL. The fix seems to be just to take whatever cfs_rq queue comes out of the for_each_possible_cpu() loop, since they're all equally good for the destruction operation. Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: make kernel/sched.c:account_guest_time() staticAdrian Bunk2007-10-29
| | | | | | | account_guest_time() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: isolate SMP balancing code a bit morePeter Williams2007-10-24
| | | | | | | | | | | | | | | At the moment, a lot of load balancing code that is irrelevant to non SMP systems gets included during non SMP builds. This patch addresses this issue and reduces the binary size on non SMP systems: text data bss dec hex filename 10983 28 1192 12203 2fab sched.o.before 10739 28 1192 11959 2eb7 sched.o.after Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: reduce balance-tasks overheadPeter Williams2007-10-24
| | | | | | | | | | | | | | | At the moment, balance_tasks() provides low level functionality for both move_tasks() and move_one_task() (indirectly) via the load_balance() function (in the sched_class interface) which also provides dual functionality. This dual functionality complicates the interfaces and internal mechanisms and makes the run time overhead of operations that are called with two run queue locks held. This patch addresses this issue and reduces the overhead of these operations. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up some control group codePaul Menage2007-10-24
| | | | | | | | | - replace "cont" with "cgrp" in a few places in the CFS cgroup code, - use write_uint rather than write for cpu.shares write function Signed-off-by: Paul Menage <menage@google.com> Acked-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: use show_regs() to improve __schedule_bug() outputSatyam Sharma2007-10-24
| | | | | | | | | | | | | | | A full register dump along with stack backtrace would make the "scheduling while atomic" message more helpful. Use show_regs() instead of dump_stack() for this. We already know we're atomic in here (that is why this function was called) so show_regs()'s atomicity expectations are guaranteed. Also, modify the output of the "BUG: scheduling while atomic:" header a bit to keep task->comm and task->pid together and preempt_count() after them. Signed-off-by: Satyam Sharma <satyam@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up sched_domain_debug()Ingo Molnar2007-10-24
| | | | | | | | | | | | clean up sched_domain_debug(). this also shrinks the code a bit: text data bss dec hex filename 50474 4306 480 55260 d7dc sched.o.before 50404 4306 480 55190 d796 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix fastcall mismatch in completion APIsIngo Molnar2007-10-24
| | | | | | | | | | Jeff Dike noticed that wait_for_completion_interruptible()'s prototype had a mismatched fastcall. Fix this by removing the fastcall attributes from all the completion APIs. Found-by: Jeff Dike <jdike@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix sched_domain sysctl registration againMilton Miller2007-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 029190c515f15f512ac85de8fc686d4dbd0ae731 (cpuset sched_load_balance flag) was not tested SCHED_DEBUG enabled as committed as it dereferences NULL when used and it reordered the sysctl registration to cause it to never show any domains or their tunables. Fixes: 1) restore arch_init_sched_domains ordering we can't walk the domains before we build them presently we register cpus with empty directories (no domain directories or files). 2) make unregister_sched_domain_sysctl do nothing when already unregistered detach_destroy_domains is now called one set of cpus at a time unregister_syctl dereferences NULL if called with a null. While the the function would always dereference null if called twice, in the previous code it was always called once and then was followed a register. So only the hidden bug of the sysctl_root_table not being allocated followed by an attempt to free it would have shown the error. 3) always call unregister and register in partition_sched_domains The code is "smart" about unregistering only needed domains. Since we aren't guaranteed any calls to unregister, always unregister. Without calling register on the way out we will not have a table or any sysctl tree. 4) warn if register is called without unregistering The previous table memory is lost, leaving pointers to the later freed memory in sysctl and leaking the memory of the tables. Before this patch on a 2-core 4-thread box compiled for SMT and NUMA, the domains appear empty (there are actually 3 levels per cpu). And as soon as two domains a null pointer is dereferenced (unreliable in this case is stack garbage): bu19a:~# ls -R /proc/sys/kernel/sched_domain/ /proc/sys/kernel/sched_domain/: cpu0 cpu1 cpu2 cpu3 /proc/sys/kernel/sched_domain/cpu0: /proc/sys/kernel/sched_domain/cpu1: /proc/sys/kernel/sched_domain/cpu2: /proc/sys/kernel/sched_domain/cpu3: bu19a:~# mkdir /dev/cpuset bu19a:~# mount -tcpuset cpuset /dev/cpuset/ bu19a:~# cd /dev/cpuset/ bu19a:/dev/cpuset# echo 0 > sched_load_balance bu19a:/dev/cpuset# mkdir one bu19a:/dev/cpuset# echo 1 > one/cpus bu19a:/dev/cpuset# echo 0 > one/sched_load_balance Unable to handle kernel paging request for data at address 0x00000018 Faulting instruction address: 0xc00000000006b608 NIP: c00000000006b608 LR: c00000000006b604 CTR: 0000000000000000 REGS: c000000018d973f0 TRAP: 0300 Not tainted (2.6.23-bml) MSR: 9000000000009032 <EE,ME,IR,DR> CR: 28242442 XER: 00000000 DAR: 0000000000000018, DSISR: 0000000040000000 TASK = c00000001912e340[1987] 'bash' THREAD: c000000018d94000 CPU: 2 .. NIP [c00000000006b608] .unregister_sysctl_table+0x38/0x110 LR [c00000000006b604] .unregister_sysctl_table+0x34/0x110 Call Trace: [c000000018d97670] [c000000007017270] 0xc000000007017270 (unreliable) [c000000018d97720] [c000000000058710] .detach_destroy_domains+0x30/0xb0 [c000000018d977b0] [c00000000005cf1c] .partition_sched_domains+0x1bc/0x230 [c000000018d97870] [c00000000009fdc4] .rebuild_sched_domains+0xb4/0x4c0 [c000000018d97970] [c0000000000a02e8] .update_flag+0x118/0x170 [c000000018d97a80] [c0000000000a1768] .cpuset_common_file_write+0x568/0x820 [c000000018d97c00] [c00000000009d95c] .cgroup_file_write+0x7c/0x180 [c000000018d97cf0] [c0000000000e76b8] .vfs_write+0xe8/0x1b0 [c000000018d97d90] [c0000000000e810c] .sys_write+0x4c/0x90 [c000000018d97e30] [c00000000000852c] syscall_exit+0x0/0x40 Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: don't clear PF_VCPU in schedulerLaurent Vivier2007-10-22
| | | | | | | | KVM clears it by itself now, and for s390 this is plain wrong. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Avi Kivity <avi@qumranet.com>
* kernel/sched.c: remove bogus comment from account_user_timeMichael Neuling2007-10-19
| | | | | | | hardirq_offset is no longer needed. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Adrian Bunk <bunk@kernel.org>
* Fix misspellings of "system", "controller", "interrupt" and "necessary".Robert P. J. Day2007-10-19
| | | | | | | | Fix the various misspellings of "system", controller", "interrupt" and "[un]necessary". Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
* Hook up group scheduler with control groupsSrivatsa Vaddagiri2007-10-19
| | | | | | | | | | | | | | | | Enable "cgroup" (formerly containers) based fair group scheduling. This will let administrator create arbitrary groups of tasks (using "cgroup" pseudo filesystem) and control their cpu bandwidth usage. [akpm@linux-foundation.org: fix cpp condition] Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hotplug cpu: migrate a task within its cpusetCliff Wickman2007-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have been running on that cpu. Currently, such a task is migrated: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any cpu which is both online and among that task's cpus_allowed It is typical of a multithreaded application running on a large NUMA system to have its tasks confined to a cpuset so as to cluster them near the memory that they share. Furthermore, it is typical to explicitly place such a task on a specific cpu in that cpuset. And in that case the task's cpus_allowed includes only a single cpu. This patch would insert a preference to migrate such a task to some cpu within its cpuset (and set its cpus_allowed to its entire cpuset). With this patch, migrate the task to: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any online cpu within the task's cpuset 3) to any cpu which is both online and among that task's cpus_allowed In order to do this, move_task_off_dead_cpu() must make a call to cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will not block. (name change - per Oleg's suggestion) Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set the cpuset mutex during the whole migrate_live_tasks() and migrate_dead_tasks() procedure. [akpm@linux-foundation.org: build fix] [pj@sgi.com: Fix indentation and spacing] Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Use helpers to obtain task pid in printksPavel Emelyanov2007-10-19
| | | | | | | | | | | | | | | | The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix tsk->exit_state usageEugene Teo2007-10-19
| | | | | | | | | | | | tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD. A non-zero test is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing tsk->exit_state is sufficient. Signed-off-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix cpusets update_cpumaskPaul Menage2007-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks: - collect batches of tasks under tasklist_lock and then call set_cpus_allowed() on them outside the lock (since this can sleep). - add a simple generic priority heap type to allow efficient collection of batches of tasks to be processed without duplicating or missing any tasks in subsequent batches. - make "cpus" file update a no-op if the mask hasn't changed - fix race between update_cpumask() and sched_setaffinity() by making sched_setaffinity() post-check that it's not running on any cpus outside cpuset_cpus_allowed(). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset sched_load_balance flagPaul Jackson2007-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new per-cpuset flag called 'sched_load_balance'. When enabled in a cpuset (the default value) it tells the kernel scheduler that the scheduler should provide the normal load balancing on the CPUs in that cpuset, sometimes moving tasks from one CPU to a second CPU if the second CPU is less loaded and if that task is allowed to run there. When disabled (write "0" to the file) then it tells the kernel scheduler that load balancing is not required for the CPUs in that cpuset. Now even if this flag is disabled for some cpuset, the kernel may still have to load balance some or all the CPUs in that cpuset, if some overlapping cpuset has its sched_load_balance flag enabled. If there are some CPUs that are not in any cpuset whose sched_load_balance flag is enabled, the kernel scheduler will not load balance tasks to those CPUs. Moreover the kernel will partition the 'sched domains' (non-overlapping sets of CPUs over which load balancing is attempted) into the finest granularity partition that it can find, while still keeping any two CPUs that are in the same shed_load_balance enabled cpuset in the same element of the partition. This serves two purposes: 1) It provides a mechanism for real time isolation of some CPUs, and 2) it can be used to improve performance on systems with many CPUs by supporting configurations in which load balancing is not done across all CPUs at once, but rather only done in several smaller disjoint sets of CPUs. This mechanism replaces the earlier overloading of the per-cpuset flag 'cpu_exclusive', which overloading was removed in an earlier patch: cpuset-remove-sched-domain-hooks-from-cpusets See further the Documentation and comments in the code itself. [akpm@linux-foundation.org: don't be weird] Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Uninline find_task_by_xxx set of functionsPavel Emelyanov2007-10-19
| | | | | | | | | | | | | | | | | | | | | | | | The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: changes to show virtual ids to userPavel Emelyanov2007-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Task Control Groups: example CPU accounting subsystemPaul Menage2007-10-19
| | | | | | | | | | | | | | | | | | | | | This example demonstrates how to use the generic cgroup subsystem for a simple resource tracker that counts, for the processes in a cgroup, the total CPU time used and the %CPU used in the last complete 10 second interval. Portions contributed by Balbir Singh <balbir@in.ibm.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-schedLinus Torvalds2007-10-18
|\ | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: reduce schedstat variable overhead a bit sched: add KERN_CONT annotation sched: cleanup, make struct rq comments more consistent sched: cleanup, fix spacing sched: fix return value of wait_for_completion_interruptible()
| * sched: reduce schedstat variable overhead a bitKen Chen2007-10-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint. Most of the fields probably don't need to be full 64-bit on 64-bit arch. Rolling over 4 billion events will most like take a long time and user space tool can be made to accommodate that. I'm proposing kernel to cut back most of variable width on 64-bit system. (note, the following patch doesn't affect 32-bit system). Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: add KERN_CONT annotationIngo Molnar2007-10-18
| | | | | | | | | | | | | | printk: add the KERN_CONT annotation (which is empty string but via which checkpatch.pl can notice that the lacking KERN_ level is fine). Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: cleanup, make struct rq comments more consistentIngo Molnar2007-10-18
| | | | | | | | | | | | | | | | cleanup, make struct rq comments more consistent. found via scripts/checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: cleanup, fix spacingIngo Molnar2007-10-18
| | | | | | | | | | | | | | | | | | cleanup: fix sysctl_sched_features initialization spacing, and fix sd_alloc_ctl_cpu_table() prototype spacing. found via scripts/checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: fix return value of wait_for_completion_interruptible()Andi Kleen2007-10-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent wait_for_completion() cleanups: commit 8cbbe86dfcfd68ad69916164bdc838d9e09adca8 Author: Andi Kleen <ak@suse.de> Date: Mon Oct 15 17:00:14 2007 +0200 sched: cleanup: refactor common code of sleep_on / wait_for_completion Refactor common code of sleep_on / wait_for_completion broke the return value of wait_for_completion_interruptible(). Previously it returned 0 on success, now -1. Fix that. Problem found by Geert Uytterhoeven. [ mingo: fixed whitespace damage ] Reported-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | Add scaled time to taskstats based process accountingMichael Neuling2007-10-18
|/ | | | | | | | | | | | | | | | This adds items to the taststats struct to account for user and system time based on scaling the CPU frequency and instruction issue rates. Adds account_(user|system)_time_scaled callbacks which architectures can use to account for time using this mechanism. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-schedLinus Torvalds2007-10-17
|\ | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: fix new task startup crash sched: fix !SYSFS build breakage sched: fix improper load balance across sched domain sched: more robust sd-sysctl entry freeing
| * sched: fix new task startup crashSrivatsa Vaddagiri2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Child task may be added on a different cpu that the one on which parent is running. In which case, task_new_fair() should check whether the new born task's parent entity should be added as well on the cfs_rq. Patch below fixes the problem in task_new_fair. This could fix the put_prev_task_fair() crashes reported. Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Reported-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: fix improper load balance across sched domainKen Chen2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We recently discovered a nasty performance bug in the kernel CPU load balancer where we were hit by 50% performance regression. When tasks are assigned to a subset of CPUs that span across sched_domains (either ccNUMA node or the new multi-core domain) via cpu affinity, kernel fails to perform proper load balance at these domains, due to several logic in find_busiest_group() miss identified busiest sched group within a given domain. This leads to inadequate load balance and causes 50% performance hit. To give you a concrete example, on a dual-core, 2 socket numa system, there are 4 logical cpu, organized as: CPU0 attaching sched-domain: domain 0: span 0003 groups: 0001 0002 domain 1: span 000f groups: 0003 000c CPU1 attaching sched-domain: domain 0: span 0003 groups: 0002 0001 domain 1: span 000f groups: 0003 000c CPU2 attaching sched-domain: domain 0: span 000c groups: 0004 0008 domain 1: span 000f groups: 000c 0003 CPU3 attaching sched-domain: domain 0: span 000c groups: 0008 0004 domain 1: span 000f groups: 000c 0003 If I run 2 tasks with CPU affinity set to 0x5. There are situation where cpu0 has run queue length of 2, and cpu2 will be idle. The kernel load balancer is unable to balance out these two tasks over cpu0 and cpu2 due to at least three logics in find_busiest_group() that heavily bias load balance towards power saving mode. e.g. while determining "busiest" variable, kernel only set it when "sum_nr_running > group_capacity". This test is flawed that "sum_nr_running" is not necessary same as sum-tasks-allowed-to-run-within-the sched-group. The end result is that kernel "think" everything is balanced, but in reality we have an imbalance and thus causing one CPU to be over-subscribed and leaving other idle. There are two other logic in the same function will also causing similar effect. The nastiness of this bug is that kernel not be able to get unstuck in this unfortunate broken state. From what we've seen in our environment, kernel will stuck in imbalanced state for extended period of time and it is also very easy for the kernel to stuck into that state (it's pretty much 100% reproducible for us). So proposing the following fix: add addition logic in find_busiest_group to detect intrinsic imbalance within the busiest group. When such condition is detected, load balance goes into spread mode instead of default grouping mode. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: more robust sd-sysctl entry freeingMilton Miller2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It occurred to me this morning that the procname field was dynamically allocated and needed to be freed. I started to put in break statements when allocation failed but it was approaching 50% error handling code. I came up with this alternative of looping while entry->mode is set and checking proc_handler instead of ->table. Alternatively, the string version of the domain name and cpu number could be stored the structs. I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation counts after taking a cpuset exclusive and back. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock()Oleg Nesterov2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of task_rq_lock(rq->idle), rq->idle can't change its task_rq(). This makes the code a bit more symmetrical with migrate_dead_tasks()'s path which uses spin_lock_irq/spin_unlock_irq. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | do CPU_DEAD migrating under read_lock(tasklist) instead of ↵Oleg Nesterov2007-10-17
|/ | | | | | | | | | | | | | | | | | | | | | | write_lock_irq(tasklist) Currently move_task_off_dead_cpu() is called under write_lock_irq(tasklist). This means it can't use task_lock() which is needed to improve migrating to take task's ->cpuset into account. Change the code to call move_task_off_dead_cpu() with irqs enabled, and change migrate_live_tasks() to use read_lock(tasklist). This all is a preparation for the futher changes proposed by Cliff Wickman, see http://marc.info/?t=117327786100003 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset: remove sched domain hooks from cpusetsPaul Jackson2007-10-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the cpuset hooks that defined sched domains depending on the setting of the 'cpu_exclusive' flag. The cpu_exclusive flag can only be set on a child if it is set on the parent. This made that flag painfully unsuitable for use as a flag defining a partitioning of a system. It was entirely unobvious to a cpuset user what partitioning of sched domains they would be causing when they set that one cpu_exclusive bit on one cpuset, because it depended on what CPUs were in the remainder of that cpusets siblings and child cpusets, after subtracting out other cpu_exclusive cpusets. Furthermore, there was no way on production systems to query the result. Using the cpu_exclusive flag for this was simply wrong from the get go. Fortunately, it was sufficiently borked that so far as I know, almost no successful use has been made of this. One real time group did use it to affectively isolate CPUs from any load balancing efforts. They are willing to adapt to alternative mechanisms for this, such as someway to manipulate the list of isolated CPUs on a running system. They can do without this present cpu_exclusive based mechanism while we develop an alternative. There is a real risk, to the best of my understanding, of users accidentally setting up a partitioned scheduler domains, inhibiting desired load balancing across all their CPUs, due to the nonobvious (from the cpuset perspective) side affects of the cpu_exclusive flag. Furthermore, since there was no way on a running system to see what one was doing with sched domains, this change will be invisible to any using code. Unless they have real insight to the scheduler load balancing choices, they will be unable to detect that this change has been made in the kernel's behaviour. Initial discussion on lkml of this patch has generated much comment. My (probably controversial) take on that discussion is that it has reached a rough concensus that the current cpuset cpu_exclusive mechanism for defining sched domains is borked. There is no concensus on the replacement. But since we can remove this mechanism, and since its continued presence risks causing unwanted partitioning of the schedulers load balancing, we should remove it while we can, as we proceed to work the replacement scheduler domain mechanisms. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>