aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
* x86_64: Report the pending irq if available in smp_affinityAndi Kleen2007-07-21
| | | | | | | | | | Otherwise smp_affinity would only update after the next interrupt on x86 systems. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* NTP: move the cmos update code into ntp.cThomas Gleixner2007-07-21
| | | | | | | | | | | | | | | | i386 and sparc64 have the identical code to update the cmos clock. Move it into kernel/time/ntp.c as there are other architectures coming along with the same requirements. [akpm@linux-foundation.org: build fixes] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hrtimer: speedup hrtimer_enqueueIngo Molnar2007-07-21
| | | | | | | | | | Speedup hrtimer_enqueue by evaluating the rbtree insertion result. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* highres: improve debug outputIngo Molnar2007-07-21
| | | | | | | | | | Add some more debug information to the hrtimer and clock events code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tick management: spread timer interruptjohn stultz2007-07-21
| | | | | | | | | | | | | | After discussing w/ Thomas over IRC, it seems the issue is the sched tick fires on every cpu at the same time, causing extra lock contention. This smaller change, adds an extra offset per cpu so the ticks don't line up. This patch also drops the idle latency from 40us down to under 20us. Signed-off-by: john stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* clockevents: fix device replacementThomas Gleixner2007-07-21
| | | | | | | | | | | When a device is replaced by a better rated device, then the broadcast mode needs to be evaluated again. When the new device has no requirement for broadcasting, then the broadcast bits for the CPU must be cleared. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* clockevents: fix resume logicThomas Gleixner2007-07-21
| | | | | | | | | | | | | | | | | | | | We need to make sure, that the clockevent devices are resumed, before the tick is resumed. The current resume logic does not guarantee this. Add CLOCK_EVT_MODE_RESUME and call the set mode functions of the clock event devices before resuming the tick / oneshot functionality. Fixup the existing users. Thanks to Nigel Cunningham for tracking down a long standing thinko, which affected the jinxed VAIO. [akpm@linux-foundation.org: xen build fix] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "sys_time() speedup"Linus Torvalds2007-07-20
| | | | | | | | | | | | | | | | | | This basically reverts commit 4e44f3497d41db4c3b9051c61410dee8ae4fb49c, while waiting for it to be re-done more completely. There are cases of people mixing "time()" with higher-resolution time sources, and we need to take the nanosecond offsets into account. Ingo has a patch that does that, but it's still under some discussion. In the meantime, just revert back to the old simple situation of just doing the whole exact timesource calculations. But rather than using do_gettimeofday(), use the internal nanosecond resolution getnstimeofday(), which at least avoids one unnecessary conversion (since we really don't care about whether the fractional seconds are nanoseconds or microseconds - we'll just throw them away). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'release' of ↵Linus Torvalds2007-07-20
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] Prevent people from directly including <asm/rwsem.h>. [IA64] remove time interpolator [IA64] Convert to generic timekeeping/clocksource [IA64] refresh some config files for 64K pagesize [IA64] Delete iosapic_free_rte() [IA64] fallocate system call [IA64] Enable percpu vector domain for IA64_DIG [IA64] Enable percpu vector domain for IA64_GENERIC [IA64] Support irq migration across domain [IA64] Add support for vector domain [IA64] Add mapping table between irq and vector [IA64] Check if irq is sharable [IA64] Fix invalid irq vector assumption for iosapic [IA64] Use dynamic irq for iosapic interrupts [IA64] Use per iosapic lock for indirect iosapic register access [IA64] Cleanup lock order in iosapic_register_intr [IA64] Remove duplicated members in iosapic_rte_info [IA64] Remove block structure for locking in iosapic.c
| * Pull ia64-clocksource into release branchTony Luck2007-07-20
| |\
| | * [IA64] remove time interpolatorBob Picco2007-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove time_interpolator code (This is generic code, but only user was ia64. It has been superseded by the CONFIG_GENERIC_TIME code). Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Peter Keilty <peter.keilty@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | | FRV: Fix linkage problemsDavid Howells2007-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it possible to use __start_notes and __stop_notes without getting a GPREL overflow error from the FRV linker. Small variables that would otherwise be in .data or .bss may, depending on the arch, be placed in special sections (.sdata or .sbss) that permit single instruction references on fixed instruction width machines. __start_notes and __stop_notes aren't really char variables, and certainly don't refer to data in .data or .bss. Making them type "void" fools the compiler into not assuming anything about them. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: Remove slab destructors from kmem_cache_create().Paul Mundt2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
* | | [PATCH] sched: implement cpu_clock(cpu) high-speed time sourceIngo Molnar2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the cpu_clock(cpu) interface for kernel-internal use: high-speed (but slightly incorrect) per-cpu clock constructed from sched_clock(). This API, unused at the moment, will be used in the future by blktrace, by the softlockup-watchdog, by printk and by lockstat. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | [PATCH] sched: fix the all pinned logic in load_balance_newidle()Suresh Siddha2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | nr_moved is not the correct check for triggering all pinned logic. Fix the all pinned logic in the case of load_balance_newidle(). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | [PATCH] sched: fix newly idle load balance in case of SMTSuresh Siddha2007-07-19
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the presence of SMT, newly idle balance was never happening for multi-core and SMP domains (even when both the logical siblings are idle). If thread 0 is already idle and when thread 1 is about to go to idle, newly idle load balance always think that one of the threads is not idle and skips doing the newly idle load balance for multi-core and SMP domains. This is because of the idle_cpu() macro, which checks if the current process on a cpu is an idle process. But this is not the case for the thread doing the load_balance_newidle(). Fix this by using runqueue's nr_running field instead of idle_cpu(). And also skip the logic of 'only one idle cpu in the group will be doing load balancing' during newly idle case. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | kernel/sysctl.c: finish off the warning commentsAndrew Morton2007-07-19
| | | | | | | | | | | | | | | | I've been chasing these comments around this file all week. Hopefully we're straight now. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lguest: the host codeRusty Russell2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the code for the "lg.ko" module, which allows lguest guests to be launched. [akpm@linux-foundation.org: update for futex-new-private-futexes] [akpm@linux-foundation.org: build fix] [jmorris@namei.org: lguest: use hrtimers] [akpm@linux-foundation.org: x86_64 build fix] Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lguest: export symbols for lguest as a moduleRusty Russell2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lguest does some fairly lowlevel things to support a host, which normal modules don't need: math_state_restore: When the guest triggers a Device Not Available fault, we need to be able to restore the FPU __put_task_struct: We need to hold a reference to another task for inter-guest I/O, and put_task_struct() is an inline function which calls __put_task_struct. access_process_vm: We need to access another task for inter-guest I/O. map_vm_area & __get_vm_area: We need to map the switcher shim (ie. monitor) at 0xFFC01000. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | timer.c: cleanup recently introduced whitespace damageThomas Gleixner2007-07-19
| | | | | | | | | | | | Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | timekeeping: fixup shadow variable argumentThomas Gleixner2007-07-19
| | | | | | | | | | | | | | | | | | | | clocksource_adjust() has a clock argument, which shadows the file global clock variable. Fix this up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockdep debugging: give stacktrace for init_errorJohannes Berg2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When I started adding support for lockdep to 64-bit powerpc, I got a lockdep_init_error and with this patch was able to pinpoint why and where to put lockdep_init(). Let's support this generally for others adding lockdep support to their architecture. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockstat: better class name representationPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | optionally add class->name_version and class->subclass to the class name Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockstat: measure lock bouncingPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __acquire | lock _____ | \ | __contended | | | wait | _______/ |/ | __acquired | __release | unlock We measure acquisition and contention bouncing. This is done by recording a cpu stamp in each lock instance. Contention bouncing requires the cpu stamp to be set on acquisition. Hence we move __acquired into the generic path. __acquired is then used to measure acquisition bouncing by comparing the current cpu with the old stamp before replacing it. __contended is used to measure contention bouncing (only useful for preemptable locks) [akpm@linux-foundation.org: cleanups] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockdep: various fixesPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | - update the copyright notices - use the default hash function - fix a thinko in a BUILD_BUG_ON - add a WARN_ON to spot inconsitent naming - fix a termination issue in /proc/lock_stat [akpm@linux-foundation.org: cleanups] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockstat: hook into spinlock_t, rwlock_t, rwsem and mutexPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | Call the new lockstat tracking functions from the various lock primitives. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockstat: human readability tweaksPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Present all this fancy new lock statistics information: *warning, _wide_ output ahead* (output edited for purpose of brevity) # cat /proc/lock_stat lock_stat version 0.1 ----------------------------------------------------------------------------------------------------------------------------------------------------------------- class name contentions waittime-min waittime-max waittime-total acquisitions holdtime-min holdtime-max holdtime-total ----------------------------------------------------------------------------------------------------------------------------------------------------------------- &inode->i_mutex: 14458 6.57 398832.75 2469412.23 6768876 0.34 11398383.65 339410830.89 --------------- &inode->i_mutex 4486 [<ffffffff802a08f9>] pipe_wait+0x86/0x8d &inode->i_mutex 0 [<ffffffff802a01e8>] pipe_write_fasync+0x29/0x5d &inode->i_mutex 0 [<ffffffff802a0e18>] pipe_read+0x74/0x3a5 &inode->i_mutex 0 [<ffffffff802a1a6a>] do_lookup+0x81/0x1ae ................................................................................................................................................................. &inode->i_data.tree_lock-W: 491 0.27 62.47 493.89 2477833 0.39 468.89 1146584.25 &inode->i_data.tree_lock-R: 65 0.44 4.27 48.78 26288792 0.36 184.62 10197458.24 -------------------------- &inode->i_data.tree_lock 46 [<ffffffff80277095>] __do_page_cache_readahead+0x69/0x24f &inode->i_data.tree_lock 31 [<ffffffff8026f9fb>] add_to_page_cache+0x31/0xba &inode->i_data.tree_lock 0 [<ffffffff802770ee>] __do_page_cache_readahead+0xc2/0x24f &inode->i_data.tree_lock 0 [<ffffffff8026f6e4>] find_get_page+0x1a/0x58 ................................................................................................................................................................. proc_inum_idr.lock: 0 0.00 0.00 0.00 36 0.00 65.60 148.26 proc_subdir_lock: 0 0.00 0.00 0.00 3049859 0.00 106.81 1563212.42 shrinker_rwsem-W: 0 0.00 0.00 0.00 5 0.00 1.73 3.68 shrinker_rwsem-R: 0 0.00 0.00 0.00 633 2.57 246.57 10909.76 'contentions' and 'acquisitions' are the number of such events measured (since the last reset). The waittime- and holdtime- (min, max, total) numbers are presented in microseconds. If there are any contention points, the lock class is presented in the block format (as i_mutex and tree_lock above), otherwise a single line of output is presented. The output is sorted on absolute number of contentions (read + write), this should get the worst offenders presented first, so that: # grep : /proc/lock_stat | head will quickly show who's bad. The stats can be reset using: # echo 0 > /proc/lock_stat [bunk@stusta.de: make 2 functions static] [akpm@linux-foundation.org: fix printk warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockstat: core infrastructurePeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the core lock statistics code. Lock statistics provides lock wait-time and hold-time (as well as the count of corresponding contention and acquisitions events). Also, the first few call-sites that encounter contention are tracked. Lock wait-time is the time spent waiting on the lock. This provides insight into the locking scheme, that is, a heavily contended lock is indicative of a too coarse locking scheme. Lock hold-time is the duration the lock was held, this provides a reference for the wait-time numbers, so they can be put into perspective. 1) lock 2) ... do stuff .. unlock 3) The time between 1 and 2 is the wait-time. The time between 2 and 3 is the hold-time. The lockdep held-lock tracking code is reused, because it already collects locks into meaningful groups (classes), and because it is an existing infrastructure for lock instrumentation. Currently lockdep tracks lock acquisition with two hooks: lock() lock_acquire() _lock() ... code protected by lock ... unlock() lock_release() _unlock() We need to extend this with two more hooks, in order to measure contention. lock_contended() - used to measure contention events lock_acquired() - completion of the contention These are then placed the following way: lock() lock_acquire() if (!_try_lock()) lock_contended() _lock() lock_acquired() ... do locked stuff ... unlock() lock_release() _unlock() (Note: the try_lock() 'trick' is used to avoid instrumenting all platform dependent lock primitive implementations.) It is also possible to toggle the two lockdep features at runtime using: /proc/sys/kernel/prove_locking /proc/sys/kernel/lock_stat (esp. turning off the O(n^2) prove_locking functionaliy can help) [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: nuke unneeded ifdefs] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockdep: reduce the ifdefferyPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | Move code around to get fewer but larger #ifdef sections. Break some in-function #ifdefs out into their own functions. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | lockdep: sanitise CONFIG_PROVE_LOCKINGPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | Ensure that all of the lock dependency tracking code is under CONFIG_PROVE_LOCKING. This allows us to use the held lock tracking code for other purposes. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Add /sys/kernel/notesRoland McGrath2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the /sys/kernel/notes magic file. Reading this delivers the contents of the kernel's .notes section. This lets userland easily glean any detailed information about the running kernel's build that was stored there at compile time. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <ak@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | kernel/relay.c: make functions staticAdrian Bunk2007-07-19
| | | | | | | | | | | | | | Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | coredump masking: add an interface for core dump filterKawai, Hidehiro2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds an interface to set/reset flags which determines each memory segment should be dumped or not when a core file is generated. /proc/<pid>/coredump_filter file is provided to access the flags. You can change the flag status for a particular process by writing to or reading from the file. The flag status is inherited to the child process when it is created. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | coredump masking: reimplementation of dumpable using two flagsKawai, Hidehiro2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes mm_struct.dumpable to a pair of bit flags. set_dumpable() converts three-value dumpable to two flags and stores it into lower two bits of mm_struct.flags instead of mm_struct.dumpable. get_dumpable() behaves in the opposite way. [akpm@linux-foundation.org: export set_dumpable] Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | coredump masking: bound suid_dumpable sysctlKawai, Hidehiro2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch series is version 5 of the core dump masking feature, which controls which VMAs should be dumped based on their memory types and per-process flags. I adopted most of Andrew's suggestion at the previous version. He also suggested using system call instead of /proc/<pid>/ interface, I decided to use the latter continuously because adding new system call with pid argument will give a big impact on the kernel. You can access the per-process flags via /proc/<pid>/coredump_filter interface. coredump_filter represents a bitmask of memory types, and if a bit is set, VMAs of corresponding memory type are written into a core file when the process is dumped. The bitmask is inherited from the parent process when a process is created. The original purpose is to avoid longtime system slowdown when a number of processes which share a huge shared memory are dumped at the same time. To achieve this purpose, this patch series adds an ability to suppress dumping anonymous shared memory for specified processes. In this version, three other memory types are also supported. Here are the coredump_filter bits: bit 0: anonymous private memory bit 1: anonymous shared memory bit 2: file-backed private memory bit 3: file-backed shared memory The default value of coredump_filter is 0x3. This means the new core dump routine has the same behavior as conventional behavior by default. In this version, coredump_filter bits and mm.dumpable are merged into mm.flags, and it is accessed by atomic bitops. The supported core file formats are ELF and ELF-FDPIC. ELF has been tested, but ELF-FDPIC has not been built and tested because I don't have the test environment. This patch limits a value of suid_dumpable sysctl to the range of 0 to 2. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: variable length argument supportOllie Wild2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from the old mm into the new mm. We create the new mm before the binfmt code runs, and place the new stack at the very top of the address space. Once the binfmt code runs and figures out where the stack should be, we move it downwards. It is a bit peculiar in that we have one task with two mm's, one of which is inactive. [a.p.zijlstra@chello.nl: limit stack size] Signed-off-by: Ollie Wild <aaw@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Cc: Hugh Dickins <hugh@veritas.com> [bunk@stusta.de: unexport bprm_mm_init] Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | audit: rework execve auditPeter Zijlstra2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The purpose of audit_bprm() is to log the argv array to a userspace daemon at the end of the execve system call. Since user-space hasn't had time to run, this array is still in pristine state on the process' stack; so no need to copy it, we can just grab it from there. In order to minimize the damage to audit_log_*() copy each string into a temporary kernel buffer first. Currently the audit code requires that the full argument vector fits in a single packet. So currently it does clip the argv size to a (sysctl) limit, but only when execve auditing is enabled. If the audit protocol gets extended to allow for multiple packets this check can be removed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ollie Wild <aaw@google.com> Cc: <linux-audit@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | use the new percpu interface for shared dataFenghua Yu2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently most of the per cpu data, which is accessed by different cpus, has a ____cacheline_aligned_in_smp attribute. Move all this data to the new per cpu shared data section: .data.percpu.shared_aligned. This will seperate the percpu data which is referenced frequently by other cpus from the local only percpu data. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | jprobes: make jprobes a little safer for usersMichael Ellerman2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I realise jprobes are a razor-blades-included type of interface, but that doesn't mean we can't try and make them safer to use. This guy I know once wrote code like this: struct jprobe jp = { .kp.symbol_name = "foo", .entry = "jprobe_foo" }; And then his kernel exploded. Oops. This patch adds an arch hook, arch_deref_entry_point() (I don't like it either) which takes the void * in a struct jprobe, and gives back the text address that it represents. We can then use that in register_jprobe() to check that the entry point we're passed is actually in the kernel text, rather than just some random value. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | PM: Integrate beeping flag with existing acpi_sleep flagsPavel Machek2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | Move "debug during resume from s2ram" into the variable we already use for real-mode flags to simplify code. It also closes nasty trap for the user in acpi_sleep_setup; order of parameters actually mattered there, acpi_sleep=s3_bios,s3_mode doing something different from acpi_sleep=s3_mode,s3_bios. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | PM: Optional beeping during resume from suspend to RAMNigel Cunningham2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | Add a feature allowing the user to make the system beep during a resume from suspend to RAM, on x86_64 and i386. This is useful for the users with broken resume from RAM, so that they can verify if the control reaches the kernel after a wake-up event. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | PM: Introduce pm_power_off_prepareRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the pm_power_off_prepare() callback that can be registered by the interested platforms in analogy with pm_idle() and pm_power_off(), used for preparing the system to power off (needed by ACPI). This allows us to drop acpi_sysclass and device_acpi that are only defined in order to register the ACPI power off preparation callback, which is needed by pm_power_off() registered in a much different way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | PM: Reduce code duplication between main.c and user.cRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | The SNAPSHOT_S2RAM ioctl code is outdated and it should not duplicate the suspend code in kernel/power/main.c. Fix that. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | PM: prevent frozen user mode helpers from failing the freezing of tasksRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present, if a user mode helper is running while usermodehelper_pm_callback() is executed, the helper may be frozen and the completion in call_usermodehelper_exec() won't be completed until user space processes are thawed. As a result, the freezing of kernel threads may fail, which is not desirable. Prevent this from happening by introducing a counter of running user mode helpers and allowing usermodehelper_pm_callback() to succeed for action = PM_HIBERNATION_PREPARE or action = PM_SUSPEND_PREPARE only if there are no helpers running. [Namely, usermodehelper_pm_callback() waits for at most RUNNING_HELPERS_TIMEOUT for the number of running helpers to become zero and fails if that doesn't happen.] Special thanks to Uli Luckas <u.luckas@road.de>, Pavel Machek <pavel@ucw.cz> and Oleg Nesterov <oleg@tv-sign.ru> for reviewing the previous versions of this patch and for very useful comments. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Uli Luckas <u.luckas@road.de> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | PM: disable usermode helper before hibernation and suspendRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | Use a hibernation and suspend notifier to disable the user mode helper before a hibernation/suspend and enable it after the operation. [akpm@linux-foundation.org: build fix] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | PM: introduce hibernation and suspend notifiersRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it possible to register hibernation and suspend notifiers, so that subsystems can perform hibernation-related or suspend-related operations that should not be carried out by device drivers' .suspend() and .resume() routines. [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Freezer: remove redundant check in try_to_freeze_tasksRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | We don't need to check if todo is positive before calling time_after() in try_to_freeze_tasks(), because if todo is zero at this point, the loop will be broken anyway due to the while () condition being false. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Freezer: return int from freeze_processesRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | Make try_to_freeze_tasks() and freeze_processes() return -EBUSY on failure instead of the number of unfrozen tasks (none of the callers actually uses this number). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Freezer: use __set_current_state in refrigeratorRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | Use __set_current_state() as appropriate in refrigerator() instead of accessing current->state directly. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Freezer: avoid freezing kernel threads prematurelyRafael J. Wysocki2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel threads should not have TIF_FREEZE set when user space processes are being frozen, since otherwise some of them might be frozen prematurely. To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks() check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags when user space processes are to be frozen. Namely, when user space processes are being frozen, we only should set TIF_FREEZE for tasks that have p->mm different from NULL and don't have PF_BORROWED_MM set in p->flags. For this reason task_lock() must be used to prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p). Also, we need to prevent the following scenario from happening: * daemonize() is called by a task spawned from a user space code path * freezer checks if the task has p->mm set and the result is positive * task enters exit_mm() and clears its TIF_FREEZE * freezer sets TIF_FREEZE for the task * task calls try_to_freeze() and goes to the refrigerator, which is wrong at that point This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and p->mm are examined and release it after TIF_FREEZE is set for p (or it turns out that TIF_FREEZE should not be set). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>