aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* [PATCH] ppc64: kexec support for ppc64R Sharada2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements the kexec support for ppc64 platforms. A couple of notes: 1) We copy the pages in virtual mode, using the full base kernel and a statically allocated stack. At kexec_prepare time we scan the pages and if any overlap our (0, _end[]) range we return -ETXTBSY. On PowerPC 64 systems running in LPAR (logical partitioning) mode, only a small region of memory, referred to as the RMO, can be accessed in real mode. Since Linux runs with only one zone of memory in the memory allocator, and it can be orders of magnitude more memory than the RMO, looping until we allocate pages in the source region is not feasible. Copying in virtual means we don't have to write a hash table generation and call hypervisor to insert translations, instead we rely on the pinned kernel linear mapping. The kernel already has move to linked location built in, so there is no requirement to load it at 0. If we want to load something other than a kernel, then a stub can be written to copy a linear chunk in real mode. 2) The start entry point gets passed parameters from the kernel. Slaves are started at a fixed address after copying code from the entry point. All CPUs get passed their firmware assigned physical id in r3 (most calling conventions use this register for the first argument). This is used to distinguish each CPU from all other CPUs. Since firmware is not around, there is no other way to obtain this information other than to pass it somewhere. A single CPU, referred to here as the master and the one executing the kexec call, branches to start with the address of start in r4. While this can be calculated, we have to load it through a gpr to branch to this point so defining the register this is contained in is free. A stack of unspecified size is available at r1 (also common calling convention). All remaining running CPUs are sent to start at absolute address 0x60 after copying the first 0x100 bytes from start to address 0. This convention was chosen because it matches what the kernel has been doing itself. (only gpr3 is defined). Note: This is not quite the convention of the kexec bootblock v2 in the kernel. A stub has been written to convert between them, and we may adjust the kernel in the future to allow this directly without any stub. 3) Destination pages can be placed anywhere, even where they would not be accessible in real mode. This will allow us to place ram disks above the RMO if we choose. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: R Sharada <sharada@in.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ppc64 kexec: native hash clearR Sharada2005-06-25
| | | | | | | | | | Add code to clear the hash table and invalidate the tlb for native (SMP, non-LPAR) mode. Supports 16M and 4k pages. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: R Sharada <sharada@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: kexec ppc supportEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I have tweaked this patch slightly to handle an empty list of pages to relocate passed to relocate_new_kernel. And I have added ppc_md.machine_crash_shutdown. To keep up with the changes in the generic kexec infrastructure. From: Albert Herranz <albert_herranz@yahoo.es> The following patch adds support for kexec on the ppc32 platform. Non-OpenFirmware based platforms are likely to work directly without additional changes on the kernel side. The kexec-tools userland package may need to be slightly updated, though. For OpenFirmware based machines, additional work is still needed on the kernel side before kexec support is ready. Benjamin Herrenschmidt is kindly working on that part. In order for a ppc platform to use the kexec kernel services it must implement some ppc_md hooks. Otherwise, kexec will be explicitly disabled, as suggested by benh. There are 3+1 new ppc_md hooks that a platform supporting kexec may implement. Two of them are mandatory for kexec to work. See include/asm-ppc/machdep.h for details. - machine_kexec_prepare(image) This function is called to make any arrangements to the image before it is loaded. This hook _MUST_ be provided by a platform in order to activate kexec support for that platform. Otherwise, the platform is considered to not support kexec and the kexec_load system call will fail (that makes all existing platforms by default non-kexec'able). - machine_kexec_cleanup(image) This function is called to make any cleanups on image after the loaded image data it is freed. This hook is optional. A platform may or may not provide this hook. - machine_kexec(image) This function is called to perform the _actual_ kexec. This hook _MUST_ be provided by a platform in order to activate kexec support for that platform. If a platform provides machine_kexec_prepare but forgets to provide machine_kexec, a kexec will fall back to a reboot. A ready-to-use machine_kexec_simple() generic function is provided to, hopefully, simplify kexec adoption for embedded platforms. A platform may call this function from its specific machine_kexec hook, like this: void myplatform_kexec(struct kimage *image) { machine_kexec_simple(image); } - machine_shutdown() This function is called to perform any machine specific shutdowns, not already done by drivers. This hook is optional. A platform may or may not provide this hook. An example (trimmed) platform specific module for a platform supporting kexec through the existing machine_kexec_simple follows: /* ... */ #ifdef CONFIG_KEXEC int myplatform_kexec_prepare(struct kimage *image) { /* here, we can place additional preparations */ return 0; /* yes, we support kexec */ } void myplatform_kexec(struct kimage *image) { machine_kexec_simple(image); } #endif /* CONFIG_KEXEC */ /* ... */ void __init platform_init(unsigned long r3, unsigned long r4, unsigned long r5, unsigned long r6, unsigned long r7) { /* ... */ #ifdef CONFIG_KEXEC ppc_md.machine_kexec_prepare = myplatform_kexec_prepare; ppc_md.machine_kexec = myplatform_kexec; #endif /* CONFIG_KEXEC */ /* ... */ } The kexec ppc kernel support has been heavily tested on the GameCube Linux port, and, as reported in the fastboot mailing list, it has been tested too on a Moto 82xx ppc by Rick Richardson. Signed-off-by: Albert Herranz <albert_herranz@yahoo.es> Signed-off-by: Eric Biederman <ebiederm@xmission.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] crashdump: x86_64: crashkernel optionEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | This is the x86_64 implementation of the crashkernel option. It reserves a window of memory very early in the bootup process, so we never use it for anything but the kernel to switch to when the running kernel panics. In addition to reserving this memory a resource structure is registered so looking at /proc/iomem it is clear what happened to that memory. ISSUES: Is it possible to implement this in a architecture generic way? What should be done with architectures that always use an iommu and thus don't report their RAM memory resources in /proc/iomem? Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86_64 kexec implementationEric W. Biederman2005-06-25
| | | | | | | | | | | | | | This is the x86_64 implementation of machine kexec. 32bit compatibility support has been implemented, and machine_kexec has been enhanced to not care about the changing internal kernel paget table structures. From: Alexander Nyberg <alexn@dsv.su.se> build fix Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86_64: factor out apic shutdown codeEric W. Biederman2005-06-25
| | | | | | | | | Factor out the apic and smp shutdown code from machine_restart so it can be called by in the kexec reboot path as well. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] crashdump: x86 crashkernel optionEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | This is the x86 implementation of the crashkernel option. It reserves a window of memory very early in the bootup process, so we never use it for anything but the kernel to switch to when the running kernel panics. In addition to reserving this memory a resource structure is registered so looking at /proc/iomem it is clear what happened to that memory. ISSUES: Is it possible to implement this in a architecture generic way? What should be done with architectures that always use an iommu and thus don't report their RAM memory resources in /proc/iomem? Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86 shutdown APICs during crash_shutdownEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | | | | In the case of a crash/panic an architecture specific function machine_crash_shutdown is called. This patch adds to the x86 machine_crash function the standard kernel code for shutting down apics. Every line of code added to that function increases the risk that we will call code after a kernel panic that is not safe. This patch should not make it to the stable kernel without a being reviewed a lot more. It is unclear how much a hardned kernel can take when it comes to misconfigured apics. So since a normal kernel has problems this patch does a clean shutdown. It is my expectation this patch will be dropped from future generations of the kexec work. But for the moment it is a crutch to keep from breaking everything. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: snapshot registers during crash shutdownEric W. Biederman2005-06-25
| | | | | | | | | | | | | | After the kernel panics if we wish to generate an entire machine core file it is very nice to know the register state at the time the machine crashed. After long discussion it was realized that if you are going to be saving the information anyway it is reasonable to store the information in a format that it will be used and recognized in so the register state is stored in the standard ELF note format. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] crashdump: x86: add NMI handler to capture other CPUsEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | One of the dangers when switching from one kernel to another is what happens to all of the other cpus that were running in the crashed kernel. In an attempt to avoid that problem this patch adds a nmi handler and attempts to shoot down the other cpus by sending them non maskable interrupts. The code then waits for 1 second or until all known cpus have stopped running and then jumps from the running kernel that has crashed to the kernel in reserved memory. The kernel spin loop is used for the delay as that should behave continue to be safe even in after a crash. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86 kexec coreEric W. Biederman2005-06-25
| | | | | | | | This is the i386 implementation of kexec. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: factor out apic shutdown codeEric W. Biederman2005-06-25
| | | | | | | | | | | | Factor out the apic and smp shutdown code from machine_restart so it can be called by in the kexec reboot path as well. By switching to the bootstrap cpu by default on reboot I can delete/simplify some motherboard fixups well. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Kexec on panic vmlinux initrd fixVivek Goyal2005-06-25
| | | | | | | | | | | | | | | | | This is a minor bug fix in kexec to resolve the problem of loading panic kernel with initrd. o Problem: Loading a capture kenrel fails if initrd is also being loaded. This has been observed for vmlinux image for kexec on panic case. o This patch fixes the problem. In segment location and size verification logic, minor correction has been done. Segment memory end (mend) should be mstart + memsz - 1. This one byte offset was source of failure for initrd loading which was being loaded at hole boundary. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: add kexec syscallsEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | | This patch introduces the architecture independent implementation the sys_kexec_load, the compat_sys_kexec_load system calls. Kexec on panic support has been integrated into the core patch and is relatively clean. In addition the hopefully architecture independent option crashkernel=size@location has been docuemented. It's purpose is to reserve space for the panic kernel to live, and where no DMA transfer will ever be setup to access. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Alexander Nyberg <alexn@telia.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86_64: add CONFIG_PHYSICAL_STARTEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | For one kernel to report a crash another kernel has created we need to have 2 kernels loaded simultaneously in memory. To accomplish this the two kernels need to built to run at different physical addresses. This patch adds the CONFIG_PHYSICAL_START option to the x86_64 kernel so we can do just that. You need to know what you are doing and the ramifications are before changing this value, and most users won't care so I have made it depend on CONFIG_EMBEDDED bzImage kernels will work and run at a different address when compiled with this option but they will still load at 1MB. If you need a kernel loaded at a different address as well you need to boot a vmlinux. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: reserve Bootmem fix for booting nondefault location kernelVivek Goyal2005-06-25
| | | | | | | | | | | | | This patch fixes a problem with reserving memory during boot up of a kernel built for non-default location. Currently boot memory allocator reserves the memory required by kernel image, boot allocaotor bitmap etc. It assumes that kernel is loaded at 1MB (HIGH_MEMORY hard coded to 1024*1024). But kernel can be built for non-default locatoin, hence existing hardcoding will lead to reserving unnecessary memory. This patch fixes it. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: add CONFIG_PYSICAL_STARTEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | For one kernel to report a crash another kernel has created we need to have 2 kernels loaded simultaneously in memory. To accomplish this the two kernels need to built to run at different physical addresses. This patch adds the CONFIG_PHYSICAL_START option to the x86 kernel so we can do just that. You need to know what you are doing and the ramifications are before changing this value, and most users won't care so I have made it depend on CONFIG_EMBEDDED bzImage kernels will work and run at a different address when compiled with this option but they will still load at 1MB. If you need a kernel loaded at a different address as well you need to boot a vmlinux. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86_64: vmlinux: fix physical addressesEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | The vmlinux on x86_64 does not report the correct physical address of the kernel. Instead in the physical address field it currently reports the virtual address of the kernel. This is patch is a bug fix that corrects vmlinux to report the proper physical addresses. This is potentially a help for crash dump analysis tools. This definitiely allows bootloaders that load vmlinux as a standard ELF executable. Bootloaders directly loading vmlinux become of practical importance when we consider the kexec on panic case. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: vmlinux: fix physical addressesEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | | The vmlinux on i386 does not report the correct physical address of the kernel. Instead in the physical address field it currently reports the virtual address of the kernel. This is patch is a bug fix that corrects vmlinux to report the proper physical addresses. This is potentially a help for crash dump analysis tools. This definitiely allows bootloaders that load vmlinux as a standard ELF executable. Bootloaders directly loading vmlinux become of practical importance when we consider the kexec on panic case. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: vmlinux: fix physical addressesEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | In vmlinux.lds.h the code is carefull to define every section so vmlinux properly reports the correct physical load address of code, as well as it's virtual address. The new SECURITY_INIT definition fails to follow that convention and and causes incorrect physical address to appear in the vmlinux if there are any security initcalls. This patch updates the SECURITY_INIT to follow the convention in the rest of the file. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86_64: restore apic virtual wire mode on shutdownEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | When coming out of apic mode attempt to set the appropriate apic back into virtual wire mode. This improves on previous versions of this patch by by never setting bot the local apic and the ioapic into veritual wire mode. This code looks at data from the mptable to see if an ioapic has an ExtInt input to make this decision. A future improvement is to figure out which apic or ioapic was in virtual wire mode at boot time and to remember it. That is potentially a more accurate method, of selecting which apic to place in virutal wire mode. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: resture apic virtual wire mode on shutdownEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | When coming out of apic mode attempt to set the appropriate apic back into virtual wire mode. This improves on previous versions of this patch by by never setting bot the local apic and the ioapic into veritual wire mode. This code looks at data from the mptable to see if an ioapic has an ExtInt input to make this decision. A future improvement is to figure out which apic or ioapic was in virtual wire mode at boot time and to remember it. That is potentially a more accurate method, of selecting which apic to place in virutal wire mode. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86_64: add i8259 shutdown methodEric W. Biederman2005-06-25
| | | | | | | | | | From: Eric W. Biederman <ebiederm@xmission.com The following patch simply adds a shutdown method to the x86_64 i8259 code. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: i8259 shutdown: disable interruptsEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | From: Eric W. Biederman <ebiederm@xmission.com> This patch disables interrupt generation from the legacy pic on reboot. Now that there is a sys_device class it should not be called while drivers are still using interrupts. There is a report about this breaking ACPI power off on some systems. http://bugme.osdl.org/show_bug.cgi?id=4041 However the final comment seems to exonerate this code. So until I get more information I believe that was a false positive. Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86_64: e820 64bit fixEric W. Biederman2005-06-25
| | | | | | | | | | From: Eric W. Biederman <ebiederm@xmission.com> It is ok to reserve resources > 4G on x86_64 struct resource is 64bit now :) Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: local apic fixEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | From: "Maciej W. Rozycki" <macro@linux-mips.org> Fix a kexec problem whcih causes local APIC detection failure. The problem is detect_init_APIC() is called early, before the command line have been processed. Therefore "lapic" (and "nolapic") have not been seen, yet. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kexec: x86: rename APIC_MODE_EXINTEric W. Biederman2005-06-25
| | | | | | | | | | | | | | | | | | From: "Maciej W. Rozycki" <macro@linux-mips.org> Rename APIC_MODE_EXINT to APIC_MODE_EXTINT - I think it should be named after what the mode is called in documentation. From: "Eric W. Biederman" <ebiederm@lnxi.com> I have reduced this patch to just the name change in the header. And integrated the changes into the patches that add those lines. Otherwise I ran into some ugly dependencies. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: voluntary kernel preemptionIngo Molnar2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new preemption model: 'Voluntary Kernel Preemption'. The 3 models can be selected from a new menu: (X) No Forced Preemption (Server) ( ) Voluntary Kernel Preemption (Desktop) ( ) Preemptible Kernel (Low-Latency Desktop) we still default to the stock (Server) preemption model. Voluntary preemption works by adding a cond_resched() (reschedule-if-needed) call to every might_sleep() check. It is lighter than CONFIG_PREEMPT - at the cost of not having as tight latencies. It represents a different latency/complexity/overhead tradeoff. It has no runtime impact at all if disabled. Here are size stats that show how the various preemption models impact the kernel's size: text data bss dec hex filename 3618774 547184 179896 4345854 424ffe vmlinux.stock 3626406 547184 179896 4353486 426dce vmlinux.voluntary +0.2% 3748414 548640 179896 4476950 445016 vmlinux.preempt +3.5% voluntary-preempt is +0.2% of .text, preempt is +3.5%. This feature has been tested for many months by lots of people (and it's also included in the RHEL4 distribution and earlier variants were in Fedora as well), and it's intended for users and distributions who dont want to use full-blown CONFIG_PREEMPT for one reason or another. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] enable PREEMPT_BKL on !PREEMPT+SMP tooIngo Molnar2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | The only sane way to clean up the current 3 lock_kernel() variants seems to be to remove the spinlock-based BKL implementations altogether, and to keep the semaphore-based one only. If we dont want to do that for whatever reason then i'm afraid we have to live with the current complexity. (but i'm open for other cleanup suggestions as well.) To explore this possibility we'll (at a minimum) have to know whether the semaphore-based BKL works fine on plain SMP too. The patch below enables this. The patch may make sense in isolation as well, as it might bring performance benefits: code that would formerly spin on the BKL spinlock will now schedule away and give up the CPU. It might introduce performance regressions as well, if any performance-critical code uses the BKL heavily and gets overscheduled due to the semaphore. I very much hope there is no such performance-critical codepath left though. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] consolidate PREEMPT options into kernel/Kconfig.preemptIngo Molnar2005-06-25
| | | | | | | | | | | This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL preemption options into kernel/Kconfig.preempt. This, besides reducing source-code, also enables more centralized tweaking of preemption related options. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Dynamic sched domains: ia64 changesDinakar Guniguntala2005-06-25
| | | | | | | | | | ia64 changes similar to kernel/sched.c. Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Dynamic sched domains: cpuset changesDinakar Guniguntala2005-06-25
| | | | | | | | | | Adds the core update_cpu_domains code and updated cpusets documentation Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Dynamic sched domains: sched changesDinakar Guniguntala2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following patches add dynamic sched domains functionality that was extensively discussed on lkml and lse-tech. I would like to see this added to -mm o The main advantage with this feature is that it ensures that the scheduler load balacing code only balances against the cpus that are in the sched domain as defined by an exclusive cpuset and not all of the cpus in the system. This removes any overhead due to load balancing code trying to pull tasks outside of the cpu exclusive cpuset only to be prevented by the tasks' cpus_allowed mask. o cpu exclusive cpusets are useful for servers running orthogonal workloads such as RT applications requiring low latency and HPC applications that are throughput sensitive o It provides a new API partition_sched_domains in sched.c that makes dynamic sched domains possible. o cpu_exclusive cpusets sets are now associated with a sched domain. Which means that the users can dynamically modify the sched domains through the cpuset file system interface o ia64 sched domain code has been updated to support this feature as well o Currently, this does not support hotplug. (However some of my tests indicate hotplug+preempt is currently broken) o I have tested it extensively on x86. o This should have very minimal impact on performance as none of the fast paths are affected Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Changing RT priority without CAP_SYS_NICEOlivier Croquette2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Presently, a process without the capability CAP_SYS_NICE can not change its own policy, which is OK. But it can also not decrease its RT priority (if scheduled with policy SCHED_RR or SCHED_FIFO), which is what this patch changes. The rationale is the same as for the nice value: a process should be able to require less priority for itself. Increasing the priority is still not allowed. This is for example useful if you give a multithreaded user process a RT priority, and the process would like to organize its internal threads using priorities also. Then you can give the process the highest priority needed N, and the process starts its threads with lower priorities: N-1, N-2... The POSIX norm says that the permissions are implementation specific, so I think we can do that. In a sense, it makes the permissions consistent whatever the policy is: with this patch, process scheduled by SCHED_FIFO, SCHED_RR and SCHED_OTHER can all decrease their priority. From: Ingo Molnar <mingo@elte.hu> cleaned up and merged to -mm. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: micro-optimize task requeueing in schedule()Chen Shang2005-06-25
| | | | | | | | micro-optimize task requeueing in schedule() & clean up recalc_task_prio(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: relax pinned balancingNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | The maximum rebalance interval allowed by the multiprocessor balancing backoff is often not large enough to handle corner cases where there are lots of tasks pinned on a CPU. Suresh reported: I see system livelock's if for example I have 7000 processes pinned onto one cpu (this is on the fastest 8-way system I have access to). After this patch, the machine is reported to go well above this number. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: consolidate sbe sbfNick Piggin2005-06-25
| | | | | | | | | | | | | | | | Consolidate balance-on-exec with balance-on-fork. This is made easy by the sched-domains RCU patches. As well as the general goodness of code reduction, this allows the runqueues to be unlocked during balance-on-fork. schedstats is a problem. Maybe just have balance-on-event instead of distinguishing fork and exec? Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: RCU domainsNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | One of the problems with the multilevel balance-on-fork/exec is that it needs to jump through hoops to satisfy sched-domain's locking semantics (that is, you may traverse your own domain when not preemptable, and you may traverse others' domains when holding their runqueue lock). balance-on-exec had to potentially migrate between more than one CPU before finding a final CPU to migrate to, and balance-on-fork needed to potentially take multiple runqueue locks. So bite the bullet and make sched-domains go completely RCU. This actually simplifies the code quite a bit. From: Ingo Molnar <mingo@elte.hu> schedstats RCU fix, and a nice comment on for_each_domain, from Ingo. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: multilevel sbe sbfNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | The fundamental problem that Suresh has with balance on exec and fork is that it only tries to balance the top level domain with the flag set. This was worked around by removing degenerate domains, but is still a problem if people want to start using more complex sched-domains, especially multilevel NUMA that ia64 is already using. This patch makes balance on fork and exec try balancing over not just the top most domain with the flag set, but all the way down the domain tree. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: remove degenerate domainsSuresh Siddha2005-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove degenerate scheduler domains during the sched-domain init. For example on x86_64, we always have NUMA configured in. On Intel EM64T systems, top most sched domain will be of NUMA and with only one sched_group in it. With fork/exec balances(recent Nick's fixes in -mm tree), we always endup taking wrong decisions because of this topmost domain (as it contains only one group and find_idlest_group always returns NULL). We will endup loading HT package completely first, letting active load balance kickin and correct it. In general, this patch also makes sense with out recent Nick's fixes in -mm. From: Nick Piggin <nickpiggin@yahoo.com.au> Modified to account for more than just sched_groups when scanning for degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL rather than keep a single degenerate domain around (this happens when you run with maxcpus=1). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: null domainsNick Piggin2005-06-25
| | | | | | | | | | | | | Fix the last 2 places that directly access a runqueue's sched-domain and assume it cannot be NULL. That allows the use of NULL for domain, instead of a dummy domain, to signify no balancing is to happen. No functional changes. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: cleanup context switch lockingNick Piggin2005-06-25
| | | | | | | | | | | | | | | | Instead of requiring architecture code to interact with the scheduler's locking implementation, provide a couple of defines that can be used by the architecture to request runqueue unlocked context switches, and ask for interrupts to be enabled over the context switch. Also replaces the "switch_lock" used by these architectures with an oncpu flag (note, not a potentially slow bitflag). This eliminates one bus locked memory operation when context switching, and simplifies the task_running function. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: uninline task_timesliceIngo Molnar2005-06-25
| | | | | | | | | | | "Chen, Kenneth W" <kenneth.w.chen@intel.com> uninline task_timeslice() - reduces code footprint noticeably, and it's slowpath code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: sched tuningNick Piggin2005-06-25
| | | | | | | | Do some basic initial tuning. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: schedstats update for balance on forkNick Piggin2005-06-25
| | | | | | | | Add SCHEDSTAT statistics for sched-balance-fork. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: balance on forkNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | Reimplement the balance on exec balancing to be sched-domains aware. Use this to also do balance on fork balancing. Make x86_64 do balance on fork over the NUMA domain. The problem that the non sched domains aware blancing became apparent on dual core, multi socket opterons. What we want is for the new tasks to be sent to a different socket, but more often than not, we would first load up our sibling core, or fill two cores of a single remote socket before selecting a new one. This gives large improvements to STREAM on such systems. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: no aggressive idle balancingNick Piggin2005-06-25
| | | | | | | | | | Remove the very aggressive idle stuff that has recently gone into 2.6 - it is going against the direction we are trying to go. Hopefully we can regain performance through other methods. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: tweak affine wakeupsNick Piggin2005-06-25
| | | | | | | | | | | | | | | | | | Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time regressions here... make sure we don't don't move tasks the wrong way in an imbalance condition. Also, remove the cache coldness requirement from the calculation - this seems to induce sharp cutoff points where behaviour will suddenly change on some workloads if the load creeps slightly over or under some point. It is good for periodic balancing because in that case have otherwise have no other context to determine what task to move. But also make a minor tweak to "wake balancing" - the imbalance tolerance is now set at half the domain's imbalance, so we get the opportunity to do wake balancing before the more random periodic rebalancing gets preformed. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: balance timersNick Piggin2005-06-25
| | | | | | | | | | | | | Do CPU load averaging over a number of different intervals. Allow each interval to be chosen by sending a parameter to source_load and target_load. 0 is instantaneous, idx > 0 returns a decaying average with the most recent sample weighted at 2^(idx-1). To a maximum of 3 (could be easily increased). So generally a higher number will result in more conservative balancing. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: less aggressive idle balancingNick Piggin2005-06-25
| | | | | | | | | | Remove the special casing for idle CPU balancing. Things like this are hurting for example on SMT, where are single sibling being idle doesn't really warrant a really aggressive pull over the NUMA domain, for example. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>