aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
* sysctl: remove binary sysctl support where it clearly doesn't workEric W. Biederman2007-10-18
| | | | | | | | | | | These functions are all wrapper functions for the proc interface that are needed for them to work correctly. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sysctl: Factor out sysctl_data.Eric W. Biederman2007-10-18
| | | | | | | | | | | | | | | | | | | | There as been no easy way to wrap the default sysctl strategy routine except for returning 0. Which is not always what we want. The few instances I have seen that want different behaviour have written their own version of sysctl_data. While not too hard it is unnecessary code and has the potential for extra bugs. So to make these situations easier and make that part of sysctl more symetric I have factord sysctl_data out of do_sysctl_strategy and exported as a function everyone can use. Further having sysctl_data be an explicit function makes checking for badly formed sysctl tables much easier. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sysctl core: Stop using the unnecessary ctl_table typedefEric W. Biederman2007-10-18
| | | | | | | | | | | | | | | In sysctl.h the typedef struct ctl_table ctl_table violates coding style isn't needed and is a bit of a nuisance because it makes it harder to recognize ctl_table is a type name. So this patch removes it from the generic sysctl code. Hopefully I will have enough energy to send the rest of my patches will follow and to remove it from the rest of the kernel. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpu hotplug: cpu: deliver CPU_UP_CANCELED only to NOTIFY_OKed callbacks with ↵Akinobu Mita2007-10-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CPU_UP_PREPARE The functions in a CPU notifier chain is called with CPU_UP_PREPARE event before making the CPU online. If one of the callback returns NOTIFY_BAD, it stops to deliver CPU_UP_PREPARE event, and CPU online operation is canceled. Then CPU_UP_CANCELED event is delivered to the functions in a CPU notifier chain again. This CPU_UP_CANCELED event is delivered to the functions which have been called with CPU_UP_PREPARE, not delivered to the functions which haven't been called with CPU_UP_PREPARE. The problem that makes existing cpu hotplug error handlings complex is that the CPU_UP_CANCELED event is delivered to the function that has returned NOTIFY_BAD, too. Usually we don't expect to call destructor function against the object that has failed to initialize. It is like: err = register_something(); if (err) { unregister_something(); return err; } So it is natural to deliver CPU_UP_CANCELED event only to the functions that have returned NOTIFY_OK with CPU_UP_PREPARE event and not to call the function that have returned NOTIFY_BAD. This is what this patch is doing. Otherwise, every cpu hotplug notifiler has to track whether notifiler event is failed or not for each cpu. (drivers/base/topology.c is doing this with topology_dev_map) Similary this patch makes same thing with CPU_DOWN_PREPARE and CPU_DOWN_FAILED evnets. Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* param_sysfs_builtin memchr argument fixDave Young2007-10-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If memchr argument is longer than strlen(kp->name), there will be some weird result. It will casuse duplicate filenames in sysfs for the "nousb". kernel warning messages are as bellow: sysfs: duplicate filename 'usbcore' can not be created WARNING: at fs/sysfs/dir.c:416 sysfs_add_one() [<c01c4750>] sysfs_add_one+0xa0/0xe0 [<c01c4ab8>] create_dir+0x48/0xb0 [<c01c4b69>] sysfs_create_dir+0x29/0x50 [<c024e0fb>] create_dir+0x1b/0x50 [<c024e3b6>] kobject_add+0x46/0x150 [<c024e2da>] kobject_init+0x3a/0x80 [<c053b880>] kernel_param_sysfs_setup+0x50/0xb0 [<c053b9ce>] param_sysfs_builtin+0xee/0x130 [<c053ba33>] param_sysfs_init+0x23/0x60 [<c024d062>] __next_cpu+0x12/0x20 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa30>] kernel_init+0x0/0xb0 [<c052a856>] do_initcalls+0x46/0x1e0 [<c01bdb12>] create_proc_entry+0x52/0x90 [<c0158d4c>] register_irq_proc+0x9c/0xc0 [<c01bda94>] proc_mkdir_mode+0x34/0x50 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa92>] kernel_init+0x62/0xb0 [<c0104f83>] kernel_thread_helper+0x7/0x14 ======================= kobject_add failed for usbcore with -EEXIST, don't try to register things with the same name in the same directory. [<c024e466>] kobject_add+0xf6/0x150 [<c053b880>] kernel_param_sysfs_setup+0x50/0xb0 [<c053b9ce>] param_sysfs_builtin+0xee/0x130 [<c053ba33>] param_sysfs_init+0x23/0x60 [<c024d062>] __next_cpu+0x12/0x20 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa30>] kernel_init+0x0/0xb0 [<c052a856>] do_initcalls+0x46/0x1e0 [<c01bdb12>] create_proc_entry+0x52/0x90 [<c0158d4c>] register_irq_proc+0x9c/0xc0 [<c01bda94>] proc_mkdir_mode+0x34/0x50 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa92>] kernel_init+0x62/0xb0 [<c0104f83>] kernel_thread_helper+0x7/0x14 ======================= Module 'usbcore' failed to be added to sysfs, error number -17 The system will be unstable now. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix discrepancy between VDSO based gettimeofday() and sys_gettimeofday().Tony Breeds2007-10-18
| | | | | | | | | | | | | | | | | | | | On platforms that copy sys_tz into the vdso (currently only x86_64, soon to include powerpc), it is possible for the vdso to get out of sync if a user calls (admittedly unusual) settimeofday(NULL, ptr). This patch adds a hook for architectures that set CONFIG_GENERIC_TIME_VSYSCALL to ensure when sys_tz is updated they can also updatee their copy in the vdso. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Cc: Andi Kleen <ak@suse.de> Cc: Tony Luck <tony.luck@intel.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove struct task_struct::io_waitAlexey Dobriyan2007-10-18
| | | | | | | | | | | | Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b during 2.6.9 development. Commit introduced io_wait field which remained write-only than and still remains write-only. Also garbage collect macros which "use" io_wait. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Hibernation: Enter platform hibernation state in a consistent wayRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | | | | | | | | Make hibernation_platform_enter() execute the enter-a-sleep-state sequence instead of the mixed shutdown-with-entering-S4 thing. Replace the shutting down of devices done by kernel_shutdown_prepare(), before entering the ACPI S4 sleep state, with suspending them and the shutting down of sysdevs with calling device_power_down(PMSG_SUSPEND) (just like before entering S1 or S3, but the target state is now S4).  Also, disable the nonboot CPUs before entering the sleep state (S4), which generally always is a good idea. This is known to fix the "double disk spin down during hibernation" on some machines, eg. HPC nx6325 (ref. http://lkml.org/lkml/2007/8/7/316 and the following thread).  Moreover, it has been reported to make /sys/class/rtc/rtc0/wakealarm work correctly with hibernation for some users. It also generally causes the hibernation state (ACPI S4) to be entered faster. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Hibernation: Check if ACPI is enabled during restore in the right placeRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | | | | | | | | The following scenario leads to total confusion of the platform firmware on some boxes (eg. HPC nx6325): * Hibernate with ACPI enabled * Resume passing "acpi=off" to the boot kernel To prevent this from happening it's necessary to check if ACPI is enabled (and enable it if that's not the case) _right_ _after_ control has been transfered from the boot kernel to the image kernel, before device_power_up() is called (ie. with interrupts disabled).  Enabling ACPI after calling device_power_up() turns out to be insufficient. For this reason, introduce new hibernation callback ->leave() that will be executed before device_power_up() by the restored image kernel.  To make it work, it also is necessary to move swsusp_suspend() from swsusp.c to disk.c (it's name is changed to "create_image", which is more up to the point). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Hibernation: Arbitrary boot kernel support - generic codeRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | Add the bits needed for supporting arbitrary boot kernels to the common hibernation code. To support arbitrary boot kernels, make it possible to replace the 'struct new_utsname' and the kernel version in the hibernation image header by some architecture specific data that will be used to verify if the image is valid and to restore the image. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* serial: turn serial console suspend a boot rather than compile time optionAndres Salomon2007-10-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there's a CONFIG_DISABLE_CONSOLE_SUSPEND that allows one to stop the serial console from being suspended when the rest of the machine goes to sleep. This is incredibly useful for debugging power management-related things; however, having it as a compile-time option has proved to be incredibly inconvenient for us (OLPC). There are plenty of times that we want serial console to not suspend, but for the most part we'd like serial console to be suspended. This drops CONFIG_DISABLE_CONSOLE_SUSPEND, and replaces it with a kernel boot parameter (no_console_suspend). By default, the serial console will be suspended along with the rest of the system; by passing 'no_console_suspend' to the kernel during boot, serial console will remain alive during suspend. For now, this is pretty serial console specific; further fixes could be applied to make this work for things like netconsole. Signed-off-by: Andres Salomon <dilinger@debian.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@suspend2.net> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* freezer: measure freezing timeRafael J. Wysocki2007-10-18
| | | | | | | | | Measure the time of the freezing of tasks, even if it doesn't fail. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* freezer: be more verboseRafael J. Wysocki2007-10-18
| | | | | | | | | | | Increase the freezer's verbosity a bit, so that it's easier to read problem reports related to it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* unexport pm_power_off_prepareAdrian Bunk2007-10-18
| | | | | | | | | This patch removes the unused EXPORT_SYMBOL(pm_power_off_prepare). Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* freezer: do not send signals to kernel threadsRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | | | The freezer should not send signals to kernel threads, since that may lead to subtle problems. In particular, commit b74d0deb968e1f85942f17080eace015ce3c332c has changed recalc_sigpending_tsk() so that it doesn't clear TIF_SIGPENDING. For this reason, if the freezer continues to send fake signals to kernel threads and the freezing of kernel threads fails, some of them may be running with TIF_SIGPENDING set forever. Accordingly, recalc_sigpending_tsk() shouldn't set the task's TIF_SIGPENDING flag if TIF_FREEZE is set. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* freezer: prevent new tasks from inheriting TIF_FREEZE setRafael J. Wysocki2007-10-18
| | | | | | | | | | | | Tasks should go to the refrigerator only if explicitly requested to do that by the freezer and not as a result of inheriting the TIF_FREEZE flag set from the parent. Make it happen. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* freezer: do not sync filesystems from freeze_processesRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | | The syncing of filesystems from within the freezer is generally not needed. Also, if there's an ext3 filesystem loopback-mounted from a FUSE one, the syncing results in writes to it and deadlocks. Similarly, it will deadlock if FUSE implements sync. Change freeze_processes() so that it doesn't execute sys_sync() and make the suspend and hibernation code path sync filesystems independently of the freezer. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PM: Rename hibernation_ops to platform_hibernation_opsRafael J. Wysocki2007-10-18
| | | | | | | | | | | | Rename 'struct hibernation_ops' to 'struct platform_hibernation_ops' in analogy with 'struct platform_suspend_ops'. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PM: Rework struct hibernation_opsRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | | During hibernation we also need to tell the ACPI core that we're going to put the system into the S4 sleep state. For this reason, an additional method in 'struct hibernation_ops' is needed, playing the role of set_target() in 'struct platform_suspend_operations'. Moreover, the role of the .prepare() method is now different, so it's better to introduce another method, that in general may be different from .prepare(), that will be used to prepare the platform for creating the hibernation image (.prepare() is used anyway to notify the platform that we're going to enter the low power state after the image has been saved). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PM: Make suspend_ops staticRafael J. Wysocki2007-10-18
| | | | | | | | | | | The variable suspend_ops representing the set of global platform-specific suspend-related operations, used by the PM core, need not be exported outside of kernel/power/main.c .  Make it static. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PM: Rework struct platform_suspend_opsRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | | There is no reason why the .prepare() and .finish() methods in 'struct platform_suspend_ops' should take any arguments, since architectures don't use these methods' argument in any practically meaningful way (ie. either the target system sleep state is conveyed to the platform by .set_target(), or there is only one suspend state supported and it is indicated to the PM core by .valid(), or .prepare() and .finish() aren't defined at all).  There also is no reason why .finish() should return any result. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PM: Rename struct pm_ops and related thingsRafael J. Wysocki2007-10-18
| | | | | | | | | | | | | | | | The name of 'struct pm_ops' suggests that it is related to the power management in general, but in fact it is only related to suspend.  Moreover, its name should indicate what this structure is used for, so it seems reasonable to change it to 'struct platform_suspend_ops'.  In that case, the name of the global variable of this type used by the PM core and the names of related functions should be changed accordingly. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* make kernel/power/main.c:suspend_enter() staticAdrian Bunk2007-10-18
| | | | | | | | | | suspend_enter() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86: C1E late detection fix. Really switch off lapic timerThomas Gleixner2007-10-17
| | | | | | | | | | | Doh, I completely missed that devices marked DUMMY are not running the set_mode function. So we force broadcasting, but we keep the local APIC timer running. Let the clock event layer mark the device _after_ switching it off. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-schedLinus Torvalds2007-10-17
|\ | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: fix new task startup crash sched: fix !SYSFS build breakage sched: fix improper load balance across sched domain sched: more robust sd-sysctl entry freeing
| * sched: fix new task startup crashSrivatsa Vaddagiri2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Child task may be added on a different cpu that the one on which parent is running. In which case, task_new_fair() should check whether the new born task's parent entity should be added as well on the cfs_rq. Patch below fixes the problem in task_new_fair. This could fix the put_prev_task_fair() crashes reported. Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Reported-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: fix !SYSFS build breakageDhaval Giani2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_SYSFS is not set, CONFIG_FAIR_USER_SCHED fails to build with kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1488): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1490): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1480): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1494): undefined reference to `kernel_subsys' This patch fixes this build error. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: fix improper load balance across sched domainKen Chen2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We recently discovered a nasty performance bug in the kernel CPU load balancer where we were hit by 50% performance regression. When tasks are assigned to a subset of CPUs that span across sched_domains (either ccNUMA node or the new multi-core domain) via cpu affinity, kernel fails to perform proper load balance at these domains, due to several logic in find_busiest_group() miss identified busiest sched group within a given domain. This leads to inadequate load balance and causes 50% performance hit. To give you a concrete example, on a dual-core, 2 socket numa system, there are 4 logical cpu, organized as: CPU0 attaching sched-domain: domain 0: span 0003 groups: 0001 0002 domain 1: span 000f groups: 0003 000c CPU1 attaching sched-domain: domain 0: span 0003 groups: 0002 0001 domain 1: span 000f groups: 0003 000c CPU2 attaching sched-domain: domain 0: span 000c groups: 0004 0008 domain 1: span 000f groups: 000c 0003 CPU3 attaching sched-domain: domain 0: span 000c groups: 0008 0004 domain 1: span 000f groups: 000c 0003 If I run 2 tasks with CPU affinity set to 0x5. There are situation where cpu0 has run queue length of 2, and cpu2 will be idle. The kernel load balancer is unable to balance out these two tasks over cpu0 and cpu2 due to at least three logics in find_busiest_group() that heavily bias load balance towards power saving mode. e.g. while determining "busiest" variable, kernel only set it when "sum_nr_running > group_capacity". This test is flawed that "sum_nr_running" is not necessary same as sum-tasks-allowed-to-run-within-the sched-group. The end result is that kernel "think" everything is balanced, but in reality we have an imbalance and thus causing one CPU to be over-subscribed and leaving other idle. There are two other logic in the same function will also causing similar effect. The nastiness of this bug is that kernel not be able to get unstuck in this unfortunate broken state. From what we've seen in our environment, kernel will stuck in imbalanced state for extended period of time and it is also very easy for the kernel to stuck into that state (it's pretty much 100% reproducible for us). So proposing the following fix: add addition logic in find_busiest_group to detect intrinsic imbalance within the busiest group. When such condition is detected, load balance goes into spread mode instead of default grouping mode. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: more robust sd-sysctl entry freeingMilton Miller2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It occurred to me this morning that the procname field was dynamically allocated and needed to be freed. I started to put in break statements when allocation failed but it was approaching 50% error handling code. I came up with this alternative of looping while entry->mode is set and checking proc_handler instead of ->table. Alternatively, the string version of the domain name and cpu number could be stored the structs. I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation counts after taking a cpuset exclusive and back. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | security/ cleanupsAdrian Bunk2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch contains the following cleanups that are now possible: - remove the unused security_operations->inode_xattr_getsuffix - remove the no longer used security_operations->unregister_security - remove some no longer required exit code - remove a bunch of no longer used exports Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ifdef struct task_struct::securityAlexey Dobriyan2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | For those who don't care about CONFIG_SECURITY. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock()Oleg Nesterov2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of task_rq_lock(rq->idle), rq->idle can't change its task_rq(). This makes the code a bit more symmetrical with migrate_dead_tasks()'s path which uses spin_lock_irq/spin_unlock_irq. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | do CPU_DEAD migrating under read_lock(tasklist) instead of ↵Oleg Nesterov2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | write_lock_irq(tasklist) Currently move_task_off_dead_cpu() is called under write_lock_irq(tasklist). This means it can't use task_lock() which is needed to improve migrating to take task's ->cpuset into account. Change the code to call move_task_off_dead_cpu() with irqs enabled, and change migrate_live_tasks() to use read_lock(tasklist). This all is a preparation for the futher changes proposed by Cliff Wickman, see http://marc.info/?t=117327786100003 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | module: return error when mod_sysfs_init() failedAkinobu Mita2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | load_module() returns zero when mod_sysfs_init() fails, then the module loading will succeed accidentally. This patch makes load_module() return error correctly in that case. Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Compile handle_percpu_irq even for uniprocessor kernelsRalf Baechle2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compiling handle_percpu_irq only on uniprocessor generates an artificial special case so a typical use like: set_irq_chip_and_handler(irq, &some_irq_type, handle_percpu_irq); needs to be conditionally compiled only on SMP systems as well and an alternative UP construct is usually needed - for no good reason. This fixes uniprocessor configurations for some MIPS SMP systems. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | change inotifyfs magic as the same magic is used for futexfsAndrey Mirkin2007-10-17
| | | | | | | | | | | | | | | | | | | | | | Right now futexfs and inotifyfs have one magic 0xBAD1DEA, that looks a little bit confusing. Use 0xBAD1DEA as magic for futexfs and 0x2BAD1DEA as magic for inotifyfs. Signed-off-by: Andrey Mirkin <major@openvz.org> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Use KMEM_CACHE macro to create the nsproxy cachePavel Emelyanov2007-10-17
| | | | | | | | | | | | | | | | | | | | | | The blessed way for standard caches is to use it. Besides, this may give this cache a better alignment. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | user.c: #ifdef ->mq_bytesAlexey Dobriyan2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | For those who deselect POSIX message queues. Reduces SLAB size of user_struct from 64 to 32 bytes here, SLUB size -- from 40 bytes to 32 bytes. [akpm@linux-foundation.org: fix build] Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | user.c: deinlineAlexey Dobriyan2007-10-17
| | | | | | | | | | | | | | | | Save some space because uid_hash_find() has 3 callsites. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | constify string/array kparam tracking structuresJan Beulich2007-10-17
| | | | | | | | | | | | | | | | | | .. in an effort to make read-only whatever can be made, so that CONFIG_DEBUG_RODATA can catch as many issues as possible. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | make kernel/profile.c:time_hook staticAdrian Bunk2007-10-17
| | | | | | | | | | | | | | | | {,un}register_timer_hook() is the API that should be used. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | kernel/sys_ni.c: add dummy sys_ni_syscall() prototypeAdrian Bunk2007-10-17
| | | | | | | | | | | | | | | | | | kernel/sys_ni.c can't #include <linux/syscalls.h> due to cond_syscall(), but let's tell gcc to not warn with -Wmissing-prototypes. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Move PREEMPT_NOTIFIERS into an always-included KconfigAvi Kivity2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kconfig.preempt is not included on some archs (for example, m68k). On those archs, the Kconfig machinery complains that KVM selects an undefined symbol PREEMPT_NOTIFIERS (which lives in Kconfig.preempt). So move the offending symbol into a Kconfig file which is included by everyone. Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Avi Kivity <avi@qumranet.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Shrink task_struct if CONFIG_FUTEX=nAlexey Dobriyan2007-10-17
| | | | | | | | | | | | | | | | | | | | robust_list, compat_robust_list, pi_state_list, pi_state_cache are really used if futexes are on. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | add-vmcore: add a prefix "VMCOREINFO_" to the vmcoreinfo macrosKen'ichi Ohmichi2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | Add a prefix "VMCOREINFO_" to the vmcoreinfo macros. Old vmcoreinfo macros were defined as generic names SYMBOL/SIZE/OFFSET /LENGTH/CONFIG, and it is impossible to grep for them. So these names should be changed. This discussion is the following: http://www.ussg.iu.edu/hypermail/linux/kernel/0709.1/0415.html Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | add-vmcore: add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_dataKen'ichi Ohmichi2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [2/3] Add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data. The dump filetering command 'makedumpfile'(v1.1.6 or before) had assumed the above values, and it was not good from the reliability viewpoint. So makedumpfile v1.2.0 came to need these values and I created the patch to let the kernel output them. makedumpfile site: https://sourceforge.net/projects/makedumpfile/ Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | add-vmcore: cleanup the coding style according to Andrew's commentsKen'ichi Ohmichi2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [1/3] Cleanup the coding style according to Andrew's comments: http://lists.infradead.org/pipermail/kexec/2007-August/000522.html - vmcoreinfo_append_str() should have suitable __attribute__s so that the compiler can check its use. - vmcoreinfo_max_size should have size_t. - Use get_seconds() instead of xtime.tv_sec. - Use init_uts_ns.name.release instead of UTS_RELEASE. Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Add vmcoreinfoKen'ichi Ohmichi2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch set frees the restriction that makedumpfile users should install a vmlinux file (including the debugging information) into each system. makedumpfile command is the dump filtering feature for kdump. It creates a small dumpfile by filtering unnecessary pages for the analysis. To distinguish unnecessary pages, it needs a vmlinux file including the debugging information. These days, the debugging package becomes a huge file, and it is hard to install it into each system. To solve the problem, kdump developers discussed it at lkml and kexec-ml. As the result, we reached the conclusion that necessary information for dump filtering (called "vmcoreinfo") should be embedded into the first kernel file and it should be accessed through /proc/vmcore during the second kernel. (http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.0/1806.html) Dan Aloni created the patch set for the above implementation. (http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.1/1053.html) And I updated it for multi architectures and memory models. (http://lists.infradead.org/pipermail/kexec/2007-August/000479.html) Signed-off-by: Dan Aloni <da-x@monatomic.org> Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Bernhard Walle <bwalle@suse.de> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | do_sigaction: don't worry about signal_pending()Oleg Nesterov2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_sigaction() returns -ERESTARTNOINTR if signal_pending(). The comment says: * If there might be a fatal signal pending on multiple * threads, make sure we take it before changing the action. I think this is not needed. We should only worry about SIGNAL_GROUP_EXIT case, bit it implies a pending SIGKILL which can't be cleared by do_sigaction. Kill this special case. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | exec: RT sub-thread can livelock and monopolize CPU on execOleg Nesterov2007-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | de_thread() yields waiting for ->group_leader to be a zombie. This deadlocks if an rt-prio execer shares the same cpu with ->group_leader. Change the code to use ->group_exit_task/notify_count mechanics. This patch certainly uglifies the code, perhaps someone can suggest something better. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>