aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
* [PATCH] cond_resched(): fix bogus might_sleep() warningIngo Molnar2005-07-07
| | | | | | | | | | The BKS might be reacquired before we have dropped PREEMPT_ACTIVE, which could trigger a second could trigger a second cond_resched() call. Bug found by Hirofumi Ogawa. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mostly_read data sectionChristoph Lameter2005-07-07
| | | | | | | | | | | | | | | | | | | | | Add a new section called ".data.read_mostly" for data items that are read frequently and rarely written to like cpumaps etc. If these maps are placed in the .data section then these frequenly read items may end up in cachelines with data is is frequently updated. In that case all processors in an SMP system must needlessly reload the cachelines again and again containing elements of those frequently used variables. The ability to share these cachelines will allow each cpu in an SMP system to keep local copies of those shared cachelines thereby optimizing performance. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Christoph Lameter <christoph@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pm: clean up process.cPavel Machek2005-07-07
| | | | | | | | | freezeable() already tests for TRACED/STOPPED processes, no need to do it twice. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: fix error handlingPavel Machek2005-07-07
| | | | | | | | | Fix error handling and whitespace in swsusp.c. swsusp_free() was called when there was nothing allocating, leading to oops. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pm: Fix resume from initrdPavel Machek2005-07-07
| | | | | | | | | Move device name resolution code around so that it is not called from resume-from-initrd. name_to_dev_t may be unavailable at that point. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kprobes: fix namespace problem and sparc64 buildRusty Lynch2005-07-05
| | | | | | | | | | | | The following renames arch_init, a kprobes function for performing any architecture specific initialization, to arch_init_kprobes in order to cleanup the namespace. Also, this patch adds arch_init_kprobes to sparc64 to fix the sparc64 kprobes build from the last return probe patch. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] irqpollAlan Cox2005-06-29
| | | | | | | | | | | | | | | | Anyone reporting a stuck IRQ should try these options. Its effectiveness varies we've found in the Fedora case. Quite a few systems with misdescribed IRQ routing just work when you use irqpoll. It also fixes up the VIA systems although thats now fixed with the VIA quirk (which we could just make default as its what Redmond OS does but Linus didn't like it historically). A small number of systems have jammed IRQ sources or misdescribes that cause an IRQ that we have no handler registered anywhere for. In those cases it doesn't help. Signed-off-by: Alan Cox <number6@the-village.bc.nu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ITIMER_REAL: fix possible deadlock and raceOleg Nesterov2005-06-29
| | | | | | | | | | | | | | | | | | | | As Steven Rostedt pointed out, there are 2 problems with ITIMER_REAL timers. 1. do_setitimer() does not call del_timer_sync() in case when the timer is not pending (it_real_value() returns 0). This is wrong, the timer may still be running, and it can rearm itself. 2. It calls del_timer_sync() with tsk->sighand->siglock held. This is deadlockable, because timer's handler needs this lock too. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Using msleep() instead of HZLuca Falavigna2005-06-29
| | | | | | | | | | Use msleep() in a few places. Signed-off-by: Luca Falavigna <dktrkranz@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Tweak idle thread setup semanticsIngo Molnar2005-06-28
| | | | | | | | | | | | | | This patch tweaks idle thread setup semantics a bit: instead of setting NEED_RESCHED in init_idle(), we do an explicit schedule() before calling into cpu_idle(). This patch, while having no negative side-effects, enables wider use of cond_resched()s. (which might happen in the stock kernel too, but it's particulary important for voluntary-preempt) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: fix sparse warningsAlexey Dobriyan2005-06-28
| | | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Return probe redesign: architecture independent changesRusty Lynch2005-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following is the second version of the function return probe patches I sent out earlier this week. Changes since my last submission include: * Fix in ppc64 code removing an unneeded call to re-enable preemption * Fix a build problem in ia64 when kprobes was turned off * Added another BUG_ON check to each of the architecture trampoline handlers My initial patch description ==> From my experiences with adding return probes to x86_64 and ia64, and the feedback on LKML to those patches, I think we can simplify the design for return probes. The following patch tweaks the original design such that: * Instead of storing the stack address in the return probe instance, the task pointer is stored. This gives us all we need in order to: - find the correct return probe instance when we enter the trampoline (even if we are recursing) - find all left-over return probe instances when the task is going away This has the side effect of simplifying the implementation since more work can be done in kernel/kprobes.c since architecture specific knowledge of the stack layout is no longer required. Specifically, we no longer have: - arch_get_kprobe_task() - arch_kprobe_flush_task() - get_rp_inst_tsk() - get_rp_inst() - trampoline_post_handler() <see next bullet> * Instead of splitting the return probe handling and cleanup logic across the pre and post trampoline handlers, all the work is pushed into the pre function (trampoline_probe_handler), and then we skip single stepping the original function. In this case the original instruction to be single stepped was just a NOP, and we can do without the extra interruption. The new flow of events to having a return probe handler execute when a target function exits is: * At system initialization time, a kprobe is inserted at the beginning of kretprobe_trampoline. kernel/kprobes.c use to handle this on it's own, but ia64 needed to do this a little differently (i.e. a function pointer is really a pointer to a structure containing the instruction pointer and a global pointer), so I added the notion of arch_init(), so that kernel/kprobes.c:init_kprobes() now allows architecture specific initialization by calling arch_init() before exiting. Each architecture now registers a kprobe on it's own trampoline function. * register_kretprobe() will insert a kprobe at the beginning of the targeted function with the kprobe pre_handler set to arch_prepare_kretprobe (still no change) * When the target function is entered, the kprobe is fired, calling arch_prepare_kretprobe (still no change) * In arch_prepare_kretprobe() we try to get a free instance and if one is available then we fill out the instance with a pointer to the return probe, the original return address, and a pointer to the task structure (instead of the stack address.) Just like before we change the return address to the trampoline function and mark the instance as used. If multiple return probes are registered for a given target function, then arch_prepare_kretprobe() will get called multiple times for the same task (since our kprobe implementation is able to handle multiple kprobes at the same address.) Past the first call to arch_prepare_kretprobe, we end up with the original address stored in the return probe instance pointing to our trampoline function. (This is a significant difference from the original arch_prepare_kretprobe design.) * Target function executes like normal and then returns to kretprobe_trampoline. * kprobe inserted on the first instruction of kretprobe_trampoline is fired and calls trampoline_probe_handler() (no change here) * trampoline_probe_handler() consumes each of the instances associated with the current task by calling the registered handler function and marking the instance as unused until an instance is found that has a return address different then the trampoline function. (change similar to my previous ia64 RFC) * If the task is killed with some left-over return probe instances (meaning that a target function was entered, but never returned), then we just free any instances associated with the task. (Not much different other then we can handle this without calling architecture specific functions.) There is a known problem that this patch does not yet solve where registering a return probe flush_old_exec or flush_thread will put us in a bad state. Most likely the best way to handle this is to not allow registering return probes on these two functions. (Significant change) This patch series applies to the 2.6.12-rc6-mm1 kernel, and provides: * kernel/kprobes.c changes * i386 patch of existing return probes implementation * x86_64 patch of existing return probe implementation * ia64 implementation * ppc64 implementation (provided by Ananth) This patch implements the architecture independant changes for a reworking of the kprobes based function return probes design. Changes include: * Removing functions for querying a return probe instance off a stack address * Removing the stack_addr field from the kretprobe_instance definition, and adding a task pointer * Adding architecture specific initialization via arch_init() * Removing extern definitions for the architecture trampoline functions (this isn't needed anymore since the architecture handles the initialization of the kprobe in the return probe trampoline function.) Signed-off-by: Rusty Lynch <rusty.lynch@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kprobes: fix single-step out of line - take2Ananth N Mavinakayanahalli2005-06-27
| | | | | | | | | | | Now that PPC64 has no-execute support, here is a second try to fix the single step out of line during kprobe execution. Kprobes on x86_64 already solved this problem by allocating an executable page and using it as the scratch area for stepping out of line. Reuse that. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Update cfq io scheduler to time sliced designJens Axboe2005-06-27
| | | | | | | | | | | | | | This updates the CFQ io scheduler to the new time sliced design (cfq v3). It provides full process fairness, while giving excellent aggregate system throughput even for many competing processes. It supports io priorities, either inherited from the cpu nice value or set directly with the ioprio_get/set syscalls. The latter closely mimic set/getpriority. This import is based on my latest from -mm. Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Merge Christoph's freeze cleanup patchLinus Torvalds2005-06-25
|\
| * [PATCH] Cleanup patch for process freezingChristoph Lameter2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Establish a simple API for process freezing defined in linux/include/sched.h: frozen(process) Check for frozen process freezing(process) Check if a process is being frozen freeze(process) Tell a process to freeze (go to refrigerator) thaw_process(process) Restart process frozen_process(process) Process is frozen now 2. Remove all references to PF_FREEZE and PF_FROZEN from all kernel sources except sched.h 3. Fix numerous locations where try_to_freeze is manually done by a driver 4. Remove the argument that is no longer necessary from two function calls. 5. Some whitespace cleanup 6. Clear potential race in refrigerator (provides an open window of PF_FREEZE cleared before setting PF_FROZEN, recalc_sigpending does not check PF_FROZEN). This patch does not address the problem of freeze_processes() violating the rule that a task may only modify its own flags by setting PF_FREEZE. This is not clean in an SMP environment. freeze(process) is therefore not SMP safe! Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Use ALIGN to remove duplicate codeNick Wilson2005-06-25
| | | | | | | | | | | | | | | | This patch makes use of ALIGN() to remove duplicate round-up code. Signed-off-by: Nick Wilson <njw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] remove redundant NULL check before before kfree() in kernel/sysctl.cJesper Juhl2005-06-25
| | | | | | | | | | | | Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] kernel/timer: fix msleep_interruptible() commentDomen Puncer2005-06-25
| | | | | | | | | | | | | | | | | | | | The comment for msleep_interruptible() is wrong, as it will ignore wait-queue events, but will wake up early for signals. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] kexec code cleanupManeesh Soni2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | o Following patch provides purely cosmetic changes and corrects CodingStyle guide lines related certain issues like below in kexec related files o braces for one line "if" statements, "for" loops, o more than 80 column wide lines, o No space after "while", "for" and "switch" key words o Changes: o take-2: Removed the extra tab before "case" key words. o take-3: Put operator at the end of line and space before "*/" Signed-off-by: Maneesh Soni <maneesh@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] kdump: Use real pt_regs from exceptionAlexander Nyberg2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Makes kexec_crashdump() take a pt_regs * as an argument. This allows to get exact register state at the point of the crash. If we come from direct panic assertion NULL will be passed and the current registers saved before crashdump. This hooks into two places: die(): check the conditions under which we will panic when calling do_exit and go there directly with the pt_regs that caused the fatal fault. die_nmi(): If we receive an NMI lockup while in the kernel use the pt_regs and go directly to crash_kexec(). We're probably nested up badly at this point so this might be the only chance to escape with proper information. Signed-off-by: Alexander Nyberg <alexn@telia.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] kdump: Access dump file in elf format (/proc/vmcore)Vivek Goyal2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: "Vivek Goyal" <vgoyal@in.ibm.com> o Support for /proc/vmcore interface. This interface exports elf core image either in ELF32 or ELF64 format, depending on the format in which elf headers have been stored by crashed kernel. o Added support for CONFIG_VMCORE config option. o Removed the dependency on /proc/kcore. From: "Eric W. Biederman" <ebiederm@xmission.com> This patch has been refactored to more closely match the prevailing style in the affected files. And to clearly indicate the dependency between /proc/kcore and proc/vmcore.c From: Hariprasad Nellitheertha <hari@in.ibm.com> This patch contains the code that provides an ELF format interface to the previous kernel's memory post kexec reboot. Signed off by Hariprasad Nellitheertha <hari@in.ibm.com> Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Retrieve elfcorehdr address from command lineVivek Goyal2005-06-25
| | | | | | | | | | | | | | | | | | This patch adds support for retrieving the address of elf core header if one is passed in command line. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] kdump: Routines for copying dump pagesVivek Goyal2005-06-25
| | | | | | | | | | | | | | | | | | | | | | This patch provides the interfaces necessary to read the dump contents, treating it as a high memory device. Signed off by Hariprasad Nellitheertha <hari@in.ibm.com> Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Kdump: Export crash notes section address through sysfsVivek Goyal2005-06-25
| | | | | | | | | | | | | | | | | | o Following patch exports kexec global variable "crash_notes" to user space through sysfs as kernel attribute in /sys/kernel. Signed-off-by: Maneesh Soni <maneesh@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Kexec on panic vmlinux initrd fixVivek Goyal2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a minor bug fix in kexec to resolve the problem of loading panic kernel with initrd. o Problem: Loading a capture kenrel fails if initrd is also being loaded. This has been observed for vmlinux image for kexec on panic case. o This patch fixes the problem. In segment location and size verification logic, minor correction has been done. Segment memory end (mend) should be mstart + memsz - 1. This one byte offset was source of failure for initrd loading which was being loaded at hole boundary. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] kexec: add kexec syscallsEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces the architecture independent implementation the sys_kexec_load, the compat_sys_kexec_load system calls. Kexec on panic support has been integrated into the core patch and is relatively clean. In addition the hopefully architecture independent option crashkernel=size@location has been docuemented. It's purpose is to reserve space for the panic kernel to live, and where no DMA transfer will ever be setup to access. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Alexander Nyberg <alexn@telia.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: voluntary kernel preemptionIngo Molnar2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new preemption model: 'Voluntary Kernel Preemption'. The 3 models can be selected from a new menu: (X) No Forced Preemption (Server) ( ) Voluntary Kernel Preemption (Desktop) ( ) Preemptible Kernel (Low-Latency Desktop) we still default to the stock (Server) preemption model. Voluntary preemption works by adding a cond_resched() (reschedule-if-needed) call to every might_sleep() check. It is lighter than CONFIG_PREEMPT - at the cost of not having as tight latencies. It represents a different latency/complexity/overhead tradeoff. It has no runtime impact at all if disabled. Here are size stats that show how the various preemption models impact the kernel's size: text data bss dec hex filename 3618774 547184 179896 4345854 424ffe vmlinux.stock 3626406 547184 179896 4353486 426dce vmlinux.voluntary +0.2% 3748414 548640 179896 4476950 445016 vmlinux.preempt +3.5% voluntary-preempt is +0.2% of .text, preempt is +3.5%. This feature has been tested for many months by lots of people (and it's also included in the RHEL4 distribution and earlier variants were in Fedora as well), and it's intended for users and distributions who dont want to use full-blown CONFIG_PREEMPT for one reason or another. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] enable PREEMPT_BKL on !PREEMPT+SMP tooIngo Molnar2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only sane way to clean up the current 3 lock_kernel() variants seems to be to remove the spinlock-based BKL implementations altogether, and to keep the semaphore-based one only. If we dont want to do that for whatever reason then i'm afraid we have to live with the current complexity. (but i'm open for other cleanup suggestions as well.) To explore this possibility we'll (at a minimum) have to know whether the semaphore-based BKL works fine on plain SMP too. The patch below enables this. The patch may make sense in isolation as well, as it might bring performance benefits: code that would formerly spin on the BKL spinlock will now schedule away and give up the CPU. It might introduce performance regressions as well, if any performance-critical code uses the BKL heavily and gets overscheduled due to the semaphore. I very much hope there is no such performance-critical codepath left though. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] consolidate PREEMPT options into kernel/Kconfig.preemptIngo Molnar2005-06-25
| | | | | | | | | | | | | | | | | | | | | | This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL preemption options into kernel/Kconfig.preempt. This, besides reducing source-code, also enables more centralized tweaking of preemption related options. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Dynamic sched domains: cpuset changesDinakar Guniguntala2005-06-25
| | | | | | | | | | | | | | | | | | | | Adds the core update_cpu_domains code and updated cpusets documentation Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Dynamic sched domains: sched changesDinakar Guniguntala2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following patches add dynamic sched domains functionality that was extensively discussed on lkml and lse-tech. I would like to see this added to -mm o The main advantage with this feature is that it ensures that the scheduler load balacing code only balances against the cpus that are in the sched domain as defined by an exclusive cpuset and not all of the cpus in the system. This removes any overhead due to load balancing code trying to pull tasks outside of the cpu exclusive cpuset only to be prevented by the tasks' cpus_allowed mask. o cpu exclusive cpusets are useful for servers running orthogonal workloads such as RT applications requiring low latency and HPC applications that are throughput sensitive o It provides a new API partition_sched_domains in sched.c that makes dynamic sched domains possible. o cpu_exclusive cpusets sets are now associated with a sched domain. Which means that the users can dynamically modify the sched domains through the cpuset file system interface o ia64 sched domain code has been updated to support this feature as well o Currently, this does not support hotplug. (However some of my tests indicate hotplug+preempt is currently broken) o I have tested it extensively on x86. o This should have very minimal impact on performance as none of the fast paths are affected Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] Changing RT priority without CAP_SYS_NICEOlivier Croquette2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Presently, a process without the capability CAP_SYS_NICE can not change its own policy, which is OK. But it can also not decrease its RT priority (if scheduled with policy SCHED_RR or SCHED_FIFO), which is what this patch changes. The rationale is the same as for the nice value: a process should be able to require less priority for itself. Increasing the priority is still not allowed. This is for example useful if you give a multithreaded user process a RT priority, and the process would like to organize its internal threads using priorities also. Then you can give the process the highest priority needed N, and the process starts its threads with lower priorities: N-1, N-2... The POSIX norm says that the permissions are implementation specific, so I think we can do that. In a sense, it makes the permissions consistent whatever the policy is: with this patch, process scheduled by SCHED_FIFO, SCHED_RR and SCHED_OTHER can all decrease their priority. From: Ingo Molnar <mingo@elte.hu> cleaned up and merged to -mm. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: micro-optimize task requeueing in schedule()Chen Shang2005-06-25
| | | | | | | | | | | | | | | | micro-optimize task requeueing in schedule() & clean up recalc_task_prio(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: relax pinned balancingNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The maximum rebalance interval allowed by the multiprocessor balancing backoff is often not large enough to handle corner cases where there are lots of tasks pinned on a CPU. Suresh reported: I see system livelock's if for example I have 7000 processes pinned onto one cpu (this is on the fastest 8-way system I have access to). After this patch, the machine is reported to go well above this number. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: consolidate sbe sbfNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consolidate balance-on-exec with balance-on-fork. This is made easy by the sched-domains RCU patches. As well as the general goodness of code reduction, this allows the runqueues to be unlocked during balance-on-fork. schedstats is a problem. Maybe just have balance-on-event instead of distinguishing fork and exec? Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: RCU domainsNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the problems with the multilevel balance-on-fork/exec is that it needs to jump through hoops to satisfy sched-domain's locking semantics (that is, you may traverse your own domain when not preemptable, and you may traverse others' domains when holding their runqueue lock). balance-on-exec had to potentially migrate between more than one CPU before finding a final CPU to migrate to, and balance-on-fork needed to potentially take multiple runqueue locks. So bite the bullet and make sched-domains go completely RCU. This actually simplifies the code quite a bit. From: Ingo Molnar <mingo@elte.hu> schedstats RCU fix, and a nice comment on for_each_domain, from Ingo. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: multilevel sbe sbfNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fundamental problem that Suresh has with balance on exec and fork is that it only tries to balance the top level domain with the flag set. This was worked around by removing degenerate domains, but is still a problem if people want to start using more complex sched-domains, especially multilevel NUMA that ia64 is already using. This patch makes balance on fork and exec try balancing over not just the top most domain with the flag set, but all the way down the domain tree. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: remove degenerate domainsSuresh Siddha2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove degenerate scheduler domains during the sched-domain init. For example on x86_64, we always have NUMA configured in. On Intel EM64T systems, top most sched domain will be of NUMA and with only one sched_group in it. With fork/exec balances(recent Nick's fixes in -mm tree), we always endup taking wrong decisions because of this topmost domain (as it contains only one group and find_idlest_group always returns NULL). We will endup loading HT package completely first, letting active load balance kickin and correct it. In general, this patch also makes sense with out recent Nick's fixes in -mm. From: Nick Piggin <nickpiggin@yahoo.com.au> Modified to account for more than just sched_groups when scanning for degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL rather than keep a single degenerate domain around (this happens when you run with maxcpus=1). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: null domainsNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | Fix the last 2 places that directly access a runqueue's sched-domain and assume it cannot be NULL. That allows the use of NULL for domain, instead of a dummy domain, to signify no balancing is to happen. No functional changes. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: cleanup context switch lockingNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of requiring architecture code to interact with the scheduler's locking implementation, provide a couple of defines that can be used by the architecture to request runqueue unlocked context switches, and ask for interrupts to be enabled over the context switch. Also replaces the "switch_lock" used by these architectures with an oncpu flag (note, not a potentially slow bitflag). This eliminates one bus locked memory operation when context switching, and simplifies the task_running function. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: uninline task_timesliceIngo Molnar2005-06-25
| | | | | | | | | | | | | | | | | | | | | | "Chen, Kenneth W" <kenneth.w.chen@intel.com> uninline task_timeslice() - reduces code footprint noticeably, and it's slowpath code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: schedstats update for balance on forkNick Piggin2005-06-25
| | | | | | | | | | | | | | | | Add SCHEDSTAT statistics for sched-balance-fork. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: balance on forkNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reimplement the balance on exec balancing to be sched-domains aware. Use this to also do balance on fork balancing. Make x86_64 do balance on fork over the NUMA domain. The problem that the non sched domains aware blancing became apparent on dual core, multi socket opterons. What we want is for the new tasks to be sent to a different socket, but more often than not, we would first load up our sibling core, or fill two cores of a single remote socket before selecting a new one. This gives large improvements to STREAM on such systems. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: no aggressive idle balancingNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | Remove the very aggressive idle stuff that has recently gone into 2.6 - it is going against the direction we are trying to go. Hopefully we can regain performance through other methods. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: tweak affine wakeupsNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time regressions here... make sure we don't don't move tasks the wrong way in an imbalance condition. Also, remove the cache coldness requirement from the calculation - this seems to induce sharp cutoff points where behaviour will suddenly change on some workloads if the load creeps slightly over or under some point. It is good for periodic balancing because in that case have otherwise have no other context to determine what task to move. But also make a minor tweak to "wake balancing" - the imbalance tolerance is now set at half the domain's imbalance, so we get the opportunity to do wake balancing before the more random periodic rebalancing gets preformed. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: balance timersNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | Do CPU load averaging over a number of different intervals. Allow each interval to be chosen by sending a parameter to source_load and target_load. 0 is instantaneous, idx > 0 returns a decaying average with the most recent sample weighted at 2^(idx-1). To a maximum of 3 (could be easily increased). So generally a higher number will result in more conservative balancing. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: less aggressive idle balancingNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | Remove the special casing for idle CPU balancing. Things like this are hurting for example on SMT, where are single sibling being idle doesn't really warrant a really aggressive pull over the NUMA domain, for example. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: add debuggingNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | These conditions should now be impossible, and we need to fix them if they happen. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | [PATCH] sched: fix SMT scheduling problemsNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SMT balancing has a couple of problems. Firstly, active_load_balance is too complex - basically it should be a dumb helper for when the periodic balancer has determined there is an imbalance, but gets stuck because the task is running. So rip out all its "smarts", and just make it move one task to the target CPU. Second, the busy CPU's sched-domain tree was being used for active balancing. This means that it may not see that nr_balance_failed has reached a critical level. So use the target CPU's sched-domain tree for this. We can do this because we hold its runqueue lock. Lastly, reset nr_balance_failed to a point where we allow cache hot migration. This will help ensure active load balancing is successful. Thanks to Suresh Siddha for pointing out these issues. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>