aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
* schedstat: consolidate per-task cpu runtime statsKen Chen2008-12-18
| | | | | | | | | | | | | | | Impact: simplify code When we turn on CONFIG_SCHEDSTATS, per-task cpu runtime is accumulated twice. Once in task->se.sum_exec_runtime and once in sched_info.cpu_time. These two stats are exactly the same. Given that task->se.sum_exec_runtime is always accumulated by the core scheduler, sched_info can reuse that data instead of duplicate the accounting. Signed-off-by: Ken Chen <kenchen@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched, cpuacct: export percpu cpuacct cgroup statsKen Chen2008-12-16
| | | | | | | | | | | | This patch export per-cpu CPU cycle usage for a given cpuacct cgroup. There is a need for a user space monitor daemon to track group CPU usage on per-cpu base. It is also useful for monitoring CFS load balancer behavior by tracking per CPU group usage. Signed-off-by: Ken Chen <kenchen@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched, cpuacct: refactoring cpuusage_read / cpuusage_writeKen Chen2008-12-16
| | | | | | | | | | | | | | | | | | | | | | Impact: micro-optimize the code on 64-bit architectures In the thread regarding to 'export percpu cpuacct cgroup stats' http://lkml.org/lkml/2008/12/7/13 akpm pointed out that current cpuacct code is inefficient. This patch refactoring the following: * make cpu_rq locking only on 32-bit * change iterator to each_present_cpu instead of each_possible_cpu to make it hotplug friendly. It's a bit of code churn, but I was rewarded with 160 byte code size saving on x86-64 arch and zero code size change on i386. Signed-off-by: Ken Chen <kenchen@google.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: optimize update_curr()Peter Zijlstra2008-12-16
| | | | | | | | | | Impact: micro-optimization Skip the hard work when there is none. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix wakeup preemption clockMike Galbraith2008-12-16
| | | | | | | | | | Impact: sharpen the wakeup-granularity to always be against current scheduler time It was possible to do the preemption check against an old time stamp. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: add missing arch_update_cpu_topology() callHeiko Carstens2008-12-12
| | | | | | | | | | | | | | | | | | arch_reinit_sched_domains() used to call arch_update_cpu_topology() via arch_init_sched_domains(). This call got lost with e761b7725234276a802322549cee5255305a0930 ("cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)". So we might end up with outdated and missing cpus in the cpu core maps (architecture used to call arch_reinit_sched_domains if cpu topology changed). This adds a call to arch_update_cpu_topology in partition_sched_domains which gets called whenever scheduling domains get updated. Which is what is supposed to happen when cpu topology changes. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: let arch_update_cpu_topology indicate if topology changedHeiko Carstens2008-12-12
| | | | | | | | | Change arch_update_cpu_topology so it returns 1 if the cpu topology changed and 0 if it didn't change. This will be useful for the next patch which adds a call to this function in partition_sched_domains. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge commit 'v2.6.28-rc8' into sched/coreIngo Molnar2008-12-12
|\
| * Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds2008-12-10
| |\ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: CPU remove deadlock fix
| | * sched: CPU remove deadlock fixBrian King2008-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix possible deadlock in CPU hot-remove path This patch fixes a possible deadlock scenario in the CPU remove path. migration_call grabs rq->lock, then wakes up everything on rq->migration_queue with the lock held. Then one of the tasks on the migration queue ends up calling tg_shares_up which then also tries to acquire the same rq->lock. [c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0 [c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c [c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc [c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4 [c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0 [c000000058eab640] c00000000007ada4 .complete+0x54/0x80 [c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8 [c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0 [c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4 [c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108 [c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8 [c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50 [c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c [c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc [c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114 [c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40 Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | fix mapping_writably_mapped()Hugh Dickins2008-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped test in 2.6.7! Bad bad bug, good good find. The i_mmap_writable count must be incremented for VM_SHARED (just as i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock) when dup_mmap() copies the vma for fork: it has its own more optimal version of __vma_link_file(), and I missed this out. So the count was later going down to 0 (dangerous) when one end unmapped, then wrapping negative (inefficient) when the other end unmapped. The only impact on x86 would have been that setting a mandatory lock on a file which has at some time been opened O_RDWR and mapped MAP_SHARED (but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN when it should succeed, or succeed when it should fail. But those architectures which rely on flush_dcache_page() to flush userspace modifications back into the page before the kernel reads it, may in some cases have skipped the flush after such a fork - though any repetitive test will soon wrap the count negative, in which case it will flush_dcache_page() unnecessarily. Fix would be a two-liner, but mapping variable added, and comment moved. Reported-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | KSYM_SYMBOL_LEN fixesHugh Dickins2008-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked to my 966c8c12dc9e77f931e2281ba25d2f0244b06949 sprint_symbol(): use less stack exposing a bug in slub's list_locations() - kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was beyond the end of page provided. The 100 slop which list_locations() allows at end of page looks roughly enough for all the other stuff it might print after the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than before. Latencytop and ftrace and are using KSYM_NAME_LEN buffers where they need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them. [akpm@linux-foundation.org: ftrace.h needs module.h] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc Miles Lane <miles.lane@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | relayfs: fix infinite loop with splice()Tom Zanussi2008-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running kmemtraced, which uses splice() on relayfs, causes a hard lock on x86-64 SMP. As described by Tom Zanussi: It looks like you hit the same problem as described here: commit 8191ecd1d14c6914c660dfa007154860a7908857 splice: fix infinite loop in generic_file_splice_read() relay uses the same loop but it never got noticed or fixed. Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Tested-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | [PATCH] fix broken timestamps in AVC generated by kernel threadsAl Viro2008-12-09
| | | | | | | | | | | | | | | | | | Timestamp in audit_context is valid only if ->in_syscall is set. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | [patch 1/1] audit: remove excess kernel-docRandy Dunlap2008-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Delete excess kernel-doc notation in kernel/auditsc.c: Warning(linux-2.6.27-git10//kernel/auditsc.c:1481): Excess function parameter or struct member 'tsk' description in 'audit_syscall_entry' Warning(linux-2.6.27-git10//kernel/auditsc.c:1564): Excess function parameter or struct member 'tsk' description in 'audit_syscall_exit' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | [PATCH] return records for fork() both to child and parentAl Viro2008-12-09
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | [PATCH] Audit: make audit=0 actually turn off auditEric Paris2008-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently audit=0 on the kernel command line does absolutely nothing. Audit always loads and always uses its resources such as creating the kernel netlink socket. This patch causes audit=0 to actually disable audit. Audit will use no resources and starting the userspace auditd daemon will not cause the kernel audit system to activate. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | Merge branch 'for-linus' of ↵Linus Torvalds2008-12-05
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: [PATCH] fix bogus argument of blkdev_put() in pktcdvd [PATCH 2/2] documnt FMODE_ constants [PATCH 1/2] kill FMODE_NDELAY_NOW [PATCH] clean up blkdev_get a little bit [PATCH] Fix block dev compat ioctl handling [PATCH] kill obsolete temporary comment in swsusp_close()
| | * | [PATCH] kill obsolete temporary comment in swsusp_close()Al Viro2008-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | it had been put there to mark the call of blkdev_put() that needed proper argument propagated to it; later patch in the same series had done just that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds2008-12-05
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: time: catch xtime_nsec underflows and fix them posix-cpu-timers: fix clock_gettime with CLOCK_PROCESS_CPUTIME_ID
| | * | | time: catch xtime_nsec underflows and fix themjohn stultz2008-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix time warp bug Alex Shi, along with Yanmin Zhang have been noticing occasional time inconsistencies recently. Through their great diagnosis, they found that the xtime_nsec value used in update_wall_time was occasionally going negative. After looking through the code for awhile, I realized we have the possibility for an underflow when three conditions are met in update_wall_time(): 1) We have accumulated a second's worth of nanoseconds, so we incremented xtime.tv_sec and appropriately decrement xtime_nsec. (This doesn't cause xtime_nsec to go negative, but it can cause it to be small). 2) The remaining offset value is large, but just slightly less then cycle_interval. 3) clocksource_adjust() is speeding up the clock, causing a corrective amount (compensating for the increase in the multiplier being multiplied against the unaccumulated offset value) to be subtracted from xtime_nsec. This can cause xtime_nsec to underflow. Unfortunately, since we notify the NTP subsystem via second_overflow() whenever we accumulate a full second, and this effects the error accumulation that has already occured, we cannot simply revert the accumulated second from xtime nor move the second accumulation to after the clocksource_adjust call without a change in behavior. This leaves us with (at least) two options: 1) Simply return from clocksource_adjust() without making a change if we notice the adjustment would cause xtime_nsec to go negative. This would work, but I'm concerned that if a large adjustment was needed (due to the error being large), it may be possible to get stuck with an ever increasing error that becomes too large to correct (since it may always force xtime_nsec negative). This may just be paranoia on my part. 2) Catch xtime_nsec if it is negative, then add back the amount its negative to both xtime_nsec and the error. This second method is consistent with how we've handled earlier rounding issues, and also has the benefit that the error being added is always in the oposite direction also always equal or smaller then the correction being applied. So the risk of a corner case where things get out of control is lessened. This patch fixes bug 11970, as tested by Yanmin Zhang http://bugzilla.kernel.org/show_bug.cgi?id=11970 Reported-by: alex.shi@intel.com Signed-off-by: John Stultz <johnstul@us.ibm.com> Acked-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | posix-cpu-timers: fix clock_gettime with CLOCK_PROCESS_CPUTIME_IDPetr Tesarik2008-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since CLOCK_PROCESS_CPUTIME_ID is in fact translated to -6, the switch statement in cpu_clock_sample_group() must first mask off the irrelevant bits, similar to cpu_clock_sample(). Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> -- posix-cpu-timers.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
| * | | | Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds2008-12-05
| |\ \ \ \ | | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: check_hung_task(): unsigned sysctl_hung_task_warnings cannot be less than 0 documentation: local_ops fix on_each_cpu
| | * | | check_hung_task(): unsigned sysctl_hung_task_warnings cannot be less than 0Roel Kluin2008-12-03
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix warnings-limit cutoff check for debug feature unsigned sysctl_hung_task_warnings cannot be less than 0 Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | taint: add missing commentArjan van de Ven2008-12-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The description for 'D' was missing in the comment... (causing me a minute of WTF followed by looking at more of the code) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | epoll: introduce resource usage limitsDavide Libenzi2008-12-01
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been thought that the per-user file descriptors limit would also limit the resources that a normal user can request via the epoll interface. Vegard Nossum reported a very simple program (a modified version attached) that can make a normal user to request a pretty large amount of kernel memory, well within the its maximum number of fds. To solve such problem, default limits are now imposed, and /proc based configuration has been introduced. A new directory has been created, named /proc/sys/fs/epoll/ and inside there, there are two configuration points: max_user_instances = Maximum number of devices - per user max_user_watches = Maximum number of "watched" fds - per user The current default for "max_user_watches" limits the memory used by epoll to store "watches", to 1/32 of the amount of the low RAM. As example, a 256MB 32bit machine, will have "max_user_watches" set to roughly 90000. That should be enough to not break existing heavy epoll users. The default value for "max_user_instances" is set to 128, that should be enough too. This also changes the userspace, because a new error code can now come out from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already listed, so that should be ok. [akpm@linux-foundation.org: use get_current_user()] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: <stable@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Reported-by: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | sched: idle_balance() does not call load_balance_newidle()Vaidyanathan Srinivasan2008-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix SD_BALANCE_NEWIDLEand broaden its use load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE is set at higher level domain (3-CPU) and not in low level domain (2-MC). pulled_task is initialised to -1 and checked for non-zero which is always true if the lowest level sched_domain does not have SD_BALANCE_NEWIDLE flag set. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | sched: fix sd_parent_degenerate on non-numa smp machineKen Chen2008-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: optimize the sched domains tree some more The addition of SD_SERIALIZE flag added to SD_NODE_INIT prevented top level dummy numa sched_domain to be properly degenerated on non-numa smp machine. The reason is that in sd_parent_degenerate(), it found that the child and parent does not have comon sched_domain flags due to SD_SERIALIZE. However, for non-numa smp box, the top level is a dummy with a single sched_group. Filter out SD_SERIALIZE if it is on non-numa machine to properly degenerate top level node sched_domain. this will cut back some of the sd domain walk in the load balancer code. Signed-off-by: Ken Chen <kenchen@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | Merge branch 'sched/urgent' into sched/coreIngo Molnar2008-12-08
|\| |
| * | Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds2008-11-30
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: prevent divide by zero error in cpu_avg_load_per_task, update sched, cpusets: fix warning in kernel/cpuset.c sched: prevent divide by zero error in cpu_avg_load_per_task
| | * | sched: prevent divide by zero error in cpu_avg_load_per_task, updateIngo Molnar2008-11-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Regarding the bug addressed in: 4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task Linus points out that the fix is not complete: > There's nothing that keeps gcc from deciding not to reload > rq->nr_running. > > Of course, in _practice_, I don't think gcc ever will (if it decides > that it will spill, gcc is likely going to decide that it will > literally spill the local variable to the stack rather than decide to > reload off the pointer), but it's a valid compiler optimization, and > it even has a name (rematerialization). > > So I suspect that your patch does fix the bug, but it still leaves the > fairly unlikely _potential_ for it to re-appear at some point. > > We have ACCESS_ONCE() as a macro to guarantee that the compiler > doesn't rematerialize a pointer access. That also would clarify > the fact that we access something unsafe outside a lock. So make sure our nr_running value is immutable and cannot change after we check it for nonzero. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | sched, cpusets: fix warning in kernel/cpuset.cIngo Molnar2008-11-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | this warning: kernel/cpuset.c: In function ‘generate_sched_domains’: kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function triggers because GCC does not recognize that ndoms stays uninitialized only if doms is NULL - but that flow is covered at the end of generate_sched_domains(). Help out GCC by initializing this variable to 0. (that's prudent anyway) Also, this function needs a splitup and code flow simplification: with 160 lines length it's clearly too long. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds2008-11-30
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: irq.h: fix missing/extra kernel-doc genirq: __irq_set_trigger: change pr_warning to pr_debug irq: fix typo x86: apic honour irq affinity which was set in early boot genirq: fix the affinity setting in setup_irq genirq: keep affinities set from userspace across free/request_irq()
| | * \ \ Merge commit 'v2.6.28-rc6' into irq/urgentIngo Molnar2008-11-23
| | |\ \ \ | | | | |/ | | | |/|
| | * | | genirq: __irq_set_trigger: change pr_warning to pr_debugMark Nelson2008-11-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0c5d1eb77a8be917b638344a22afe1398236482b (genirq: record trigger type) caused powerpc platforms that had no set_type() function in their struct irq_chip to spew out warnings about "No set_type function for IRQ...". This warning isn't necessarily justified though because the generic powerpc platform code calls set_irq_type() (which in turn calls __irq_set_trigger) with information from the device tree to establish the interrupt mappings, regardless of whether the PIC can actually set a type. A platform's irq_chip might not have a set_type function for a variety of reasons, for example: the platform may have the type essentially hard-coded, or as in the case for Cell interrupts are just messages past around that have no real concept of type, or the platform could even have a virtual PIC as on the PS3. Signed-off-by: Mark Nelson <markn@au1.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | irq: fix typoIngo Molnar2008-11-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: build fix fix build failure on UP. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | genirq: fix the affinity setting in setup_irqThomas Gleixner2008-11-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The affinity setting in setup irq is called before the NO_BALANCING flag is checked and might therefore override affinity settings from the calling code with the default setting. Move the NO_BALANCING flag check before the call to the affinity setting. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | genirq: keep affinities set from userspace across free/request_irq()Thomas Gleixner2008-11-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: preserve user-modified affinities on interrupts Kumar Galak noticed that commit 18404756765c713a0be4eb1082920c04822ce588 (genirq: Expose default irq affinity mask (take 3)) overrides an already set affinity setting across a free / request_irq(). Happens e.g. with ifdown/ifup of a network device. Change the logic to mark the affinities as set and keep them intact. This also fixes the unlocked access to irq_desc in irq_select_affinity() when called from irq_affinity_proc_write() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds2008-11-30
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: consistent alignement for lockdep info
| | * | | | lockdep: consistent alignement for lockdep infoLi Zefan2008-11-21
| | | |/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: prettify /proc/lockdep_info Just feel odd that not all lines of lockdep info are aligned. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds2008-11-30
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ftrace: prevent recursion tracing, doc: update mmiotrace documentation x86, mmiotrace: fix buffer overrun detection function tracing: fix wrong position computing of stack_trace
| | * | | | ftrace: prevent recursionLai Jiangshan2008-11-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: prevent unnecessary stack recursion if the resched flag was set before we entered, then don't reschedule. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | | x86, mmiotrace: fix buffer overrun detectionPekka Paalanen2008-11-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix mmiotrace overrun tracing When ftrace framework moved to use the ring buffer facility, the buffer overrun detection was broken after 2.6.27 by commit | commit 3928a8a2d98081d1bc3c0a84a2d70e29b90ecf1c | Author: Steven Rostedt <rostedt@goodmis.org> | Date: Mon Sep 29 23:02:41 2008 -0400 | | ftrace: make work with new ring buffer | | This patch ports ftrace over to the new ring buffer. The detection is now fixed by using the ring buffer API. When mmiotrace detects a buffer overrun, it will report the number of lost events. People reading an mmiotrace log must know if something was missed, otherwise the data may not make sense. Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | | function tracing: fix wrong position computing of stack_traceLiming Wang2008-11-21
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: make output of stack_trace complete if buffer overruns When read buffer overruns, the output of stack_trace isn't complete. When printing records with seq_printf in t_show, if the read buffer has overruned by the current record, then this record won't be printed to user space through read buffer, it will just be dropped in this printing. When next printing, t_start should return the "*pos"th record, which is the one dropped by previous printing, but it just returns (m->private + *pos)th record. Here we use a more sane method to implement seq_operations which can be found in kernel code. Thus we needn't initialize m->private. About testing, it's not easy to overrun read buffer, but we can use seq_printf to print more padding bytes in t_show, then it's easy to check whether or not records are lost. This commit has been tested on both condition of overrun and non overrun. Signed-off-by: Liming Wang <liming.wang@windriver.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | remove __ARCH_WANT_COMPAT_SYS_PTRACEChristoph Hellwig2008-11-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All architectures now use the generic compat_sys_ptrace, as should every new architecture that needs 32bit compat (if we'll ever get another). Remove the now superflous __ARCH_WANT_COMPAT_SYS_PTRACE define, and also kill a comment about __ARCH_SYS_PTRACE that was added after __ARCH_SYS_PTRACE was already gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | cpuinit fixes in kernel/*Al Viro2008-11-30
| |/ / / | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | sched: add uid information to sched_debug for CONFIG_USER_SCHEDArun R Bharadwaj2008-12-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: extend information in /proc/sched_debug This patch adds uid information in sched_debug for CONFIG_USER_SCHED Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | sched: move double_unlock_balance() higherAlexey Dobriyan2008-11-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move double_lock_balance()/double_unlock_balance() higher to fix the following with gcc-3.4.6: CC kernel/sched.o In file included from kernel/sched.c:1605: kernel/sched_rt.c: In function `find_lock_lowest_rq': kernel/sched_rt.c:914: sorry, unimplemented: inlining failed in call to 'double_unlock_balance': function body not available kernel/sched_rt.c:1077: sorry, unimplemented: called from here make[2]: *** [kernel/sched.o] Error 1 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | Merge branch 'sched/urgent' into sched/coreIngo Molnar2008-11-28
|\ \ \ \ | | |_|/ | |/| |
| * | | sched: prevent divide by zero error in cpu_avg_load_per_taskSteven Rostedt2008-11-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix divide by zero crash in scheduler rebalance irq While testing the branch profiler, I hit this crash: divide error: 0000 [#1] PREEMPT SMP [...] RIP: 0010:[<ffffffff8024a008>] [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f [...] Call Trace: <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21 [<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6 [<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f [<ffffffff8026d7c2>] __do_softirq+0xa8/0x232 [<ffffffff8020ea7c>] call_softirq+0x1c/0x3e [<ffffffff8021047a>] do_softirq+0x94/0x1cd [<ffffffff8026d5eb>] irq_exit+0x6b/0x10e [<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff [<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20 The code for cpu_avg_load_per_task has: if (rq->nr_running) rq->avg_load_per_task = rq->load.weight / rq->nr_running; The runqueue lock is not held here, and there is nothing that prevents the rq->nr_running from going to zero after it passes the if condition. The branch profiler simply made the race window bigger. This patch saves off the rq->nr_running to a local variable and uses that for both the condition and the division. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>