litmus-rt.git - The LITMUS^RT kernel.

	Commit message (Collapse)	Author	Age
*	char drivers: RAM oops/panic logger	Marco Stornelli	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ramoops, like mtdoops, can log oops/panic information but in RAM. It can be used with persistent RAM for systems without flash support. In addition, for this systems, with this driver, it's no more needed add to the kernel the mtd subsystem with advantage in footprint. It can be used in a very easy way with persistent RAM for systems without flash support. For these systems, with this driver, it is no longer required to cinlude mtd subsystem with an advantage in footprint. In addition, you can save flash space and store this information only in RAM. Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com> Cc: Simon Kagstrom <simon.kagstrom@netinsight.net> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc; Anders Grafstrom <anders.grafstrom@netinsight.net> Cc: Yuasa Yoichi <yuasa@linux-mips.org> Cc: Jamie Lokier <jamie@shareable.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: handle run_to_completion properly in deliver_recv_msg()	Jiri Kosina	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \|	If run_to_completion flag is set, it means that we are running in a single-threaded mode, and thus no locks are held. This fixes a deadlock when IPMI notifier is being called during panic. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Corey Minyard <minyard@acm.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: update driver to use dev_printk and its constructs	Myron Stowe	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update core IPMI driver printk()'s with dev_printk(), and its constructs, to provide additional device topology information. An example of the additional device topology for a PNP device - ipmi_si 00:02: probing via ACPI ipmi_si 00:02: [io 0x0ca2-0x0ca3] regsize 1 spacing 1 irq 0 ipmi_si 00:02: Found new BMC (man_id: 0x00000b, prod_id: 0x0000, ... ipmi_si 00:02: IPMI kcs interface initialized and for a PCI device - ipmi_si 0000:01:04.6: probing via PCI ipmi_si 0000:01:04.6: PCI INT A -> GSI 21 (level, low) -> IRQ 21 ipmi_si 0000:01:04.6: [mem 0xf1ef0000-0xf1ef00ff] regsize 1 spaci... ipmi_si 0000:01:04.6: IPMI kcs interface initialized [minyard@acm.org: rework to fix rejects, extended it a bit] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Myron Stowe <myron.stowe@hp.com> Signed-off-by: Corey Minyard <minyard@acm.org> Cc: Zhao Yakui <yakui.zhao@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: convert tracking of the ACPI device pointer to a PNP device	Myron Stowe	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert PNP patch (git 9e368fa011d4e0aa050db348d69514900520e40b) to maintain a pointer to a PNP device, 'pnp_dev', instead of the ACPI device, 'acpi_dev', that is currently being tracked with PNP based IPMI device discovery. Signed-off-by: Myron Stowe <myron.stowe@hp.com> Acked-by: Zhao Yakui <yakui.zhao@intel.com> Acked-by: Corey Minyard <minyard@acm.org> Cc: Len Brown <lenb@kernel.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: change timeout and event poll to one second	Corey Minyard	2010-05-27
\| \| \| \| \| \| \| \| \| \|	The timeouts in IPMI are in the 1-5 second range in message handling, so a 1 second timeout is a reasonable thing to do. This should help with reducing power consumption on idle systems. Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: attempt to register multiple SIs of the same type	Matthew Garrett	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \|	Some odd systems may have multiple BMCs, and we want to be able to support them. Let's make the assumption that if a system legitimately has multiple BMCs then each BMC's SI will be of the same type, and also that we won't see multiple SIs of the same type unless we have multiple BMCs. If these hold true then we should register all SIs of the same type. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: reduce polling	Matthew Garrett	2010-05-27
\| \| \| \| \| \| \| \| \| \|	We can reasonably alter the poll rate depending on whether we're performing a transaction or merely waiting for an event. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: reduce polling when interrupts are available	Matthew Garrett	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we're not currently in the middle of a transaction, and if we have interrupts, there's no real reason to poll the controller more frequently than the core IPMI code does. Set the interrupt_disabled flag appropriately as the interrupt state changes, and make the timeout code reset itself only if the transaction is incomplete or we have no interrupts. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: change device discovery order	Matthew Garrett	2010-05-27
\| \| \| \| \| \| \| \| \| \| \|	The ipmi spec provides an ordering for si discovery. Change the driver to match, with the exception of preferring smbios to SPMI as HPs (at least) contain accurate information in the former but not the latter. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: only register one si per bmc	Matthew Garrett	2010-05-27
\| \| \| \| \| \| \| \| \| \| \|	Only register one si per bmc. Use any user-provided devices first, followed by the first device with an irq, followed by the first device discovered. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: split device discovery and registration	Matthew Garrett	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \|	The ipmi spec indicates that we should only make use of one si per bmc, so separate device discovery and registration to make that possible. [thenzl@redhat.com: fix mutex use] Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipmi: change addr_source to an enum rather than strings	Matthew Garrett	2010-05-27
\| \| \| \| \| \| \| \| \| \|	Switch from a char* to an enum to identify the address source of SIs, making it easier to handle them appropriately during registration. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipc/sem.c: use ERR_CAST	Julia Lawall	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipc/sem.c: update description of the implementation	Manfred Spraul	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ipc/sem.c begins with a 15 year old description about bugs in the initial implementation in Linux-1.0. The patch replaces that with a top level description of the current code. A TODO could be derived from this text: The opengroup man page for semop() does not mandate FIFO. Thus there is no need for a semaphore array list of pending operations. If - this list is removed - the per-semaphore array spinlock is removed (possible if there is no list to protect) - sem_otime is moved into the semaphores and calculated on demand during semctl() then the array would be read-mostly - which would significantly improve scaling for applications that use semaphore arrays with lots of entries. The price would be expensive semctl() calls: for(i=0;i<sma->sem_nsems;i++) spin_lock(sma->sem_lock); <do stuff> for(i=0;i<sma->sem_nsems;i++) spin_unlock(sma->sem_lock); I'm not sure if the complexity is worth the effort, thus here is the documentation of the current behavior first. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipc/sem.c: cacheline align the ipc spinlock for semaphores	Manfred Spraul	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Cacheline align the spinlock for sysv semaphores. Without the patch, the spinlock and sem_otime [written by every semop that modified the array] and sem_base [read in the hot path of try_atomic_semop()] can be in the same cacheline. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipc/sem.c: move wake_up_process out of the spinlock section	Manfred Spraul	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The wake-up part of semtimedop() consists out of two steps: - the right tasks must be identified. - they must be woken up. Right now, both steps run while the array spinlock is held. This patch reorders the code and moves the actual wake_up_process() behind the point where the spinlock is dropped. The code also moves setting sem->sem_otime to one place: It does not make sense to set the last modify time multiple times. [akpm@linux-foundation.org: repair kerneldoc] [akpm@linux-foundation.org: fix uninitialised retval] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ipc/sem.c: optimize update_queue() for bulk wakeup calls	Manfred Spraul	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following series of patches tries to fix the spinlock contention reported by Chris Mason - his benchmark exposes problems of the current code: - In the worst case, the algorithm used by update_queue() is O(N^2). Bulk wake-up calls can enter this worst case. The patch series fix that. Note that the benchmark app doesn't expose the problem, it just should be fixed: Real world apps might do the wake-ups in another order than perfect FIFO. - The part of the code that runs within the semaphore array spinlock is significantly larger than necessary. The patch series fixes that. This change is responsible for the main improvement. - The cacheline with the spinlock is also used for a variable that is read in the hot path (sem_base) and for a variable that is unnecessarily written to multiple times (sem_otime). The last step of the series cacheline-aligns the spinlock. This patch: The SysV semaphore code allows to perform multiple operations on all semaphores in the array as atomic operations. After a modification, update_queue() checks which of the waiting tasks can complete. The algorithm that is used to identify the tasks is O(N^2) in the worst case. For some cases, it is simple to avoid the O(N^2). The patch adds a detection logic for some cases, especially for the case of an array where all sleeping tasks are single sembuf operations and a multi-sembuf operation is used to wake up multiple tasks. A big database application uses that approach. The patch fixes wakeup due to semctl(,,SETALL,) - the initial version of the patch breaks that. [akpm@linux-foundation.org: make do_smart_update() static] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	idr: fix backtrack logic in idr_remove_all	Imre Deak	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently idr_remove_all will fail with a use after free error if idr::layers is bigger than 2, which on 32 bit systems corresponds to items more than 1024. This is due to stepping back too many levels during backtracking. For simplicity let's assume that IDR_BITS=1 -> we have 2 nodes at each level below the root node and each leaf node stores two IDs. (In reality for 32 bit systems IDR_BITS=5, with 32 nodes at each sub-root level and 32 IDs in each leaf node). The sequence of freeing the nodes at the moment is as follows: layer 1 -> a(7) 2 -> b(3) c(5) 3 -> d(1) e(2) f(4) g(6) Until step 4 things go fine, but then node c is freed, whereas node g should be freed first. Since node c contains the pointer to node g we'll have a use after free error at step 6. How many levels we step back after visiting the leaf nodes is currently determined by the msb of the id we are currently visiting: Step 1. node d with IDs 0,1 is freed, current ID is advanced to 2. msb of the current ID bit 1. This means we need to step back 1 level to node b and take the next sibling, node e. 2-3. node e with IDs 2,3 is freed, current ID is 4, msb is bit 2. This means we need to step back 2 levels to node a, freeing node b on the way. 4-5. node f with IDs 4,5 is freed, current ID is 6, msb is still bit 2. This means we again need to step back 2 levels to node a and free c on the way. 6. We should visit node g, but its pointer is not available as node c was freed. The fix changes how we determine the number of levels to step back. Instead of deducting this merely from the msb of the current ID, we should really check if advancing the ID causes an overflow to a bit position corresponding to a given layer. In the above example overflow from bit 0 to bit 1 should mean stepping back 1 level. Overflow from bit 1 to bit 2 should mean stepping back 2 levels and so on. The fix was tested with IDs up to 1 << 20, which corresponds to 4 layers on 32 bit systems. Signed-off-by: Imre Deak <imre.deak@nokia.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: <stable@kernel.org> [2.6.34.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	cpuhotplug: do not need cpu_hotplug_begin() when CONFIG_HOTPLUG_CPU=n	Lai Jiangshan	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since when CONFIG_HOTPLUG_CPU=n, get_online_cpus() do nothing, so we don't need cpu_hotplug_begin() either. This patch moves cpu_hotplug_begin()/cpu_hotplug_done() into the code block of CONFIG_HOTPLUG_CPU=y. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fault-injection: add CPU notifier error injection module	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I used this module to test the series of modification to the cpu notifiers code. Example1: inject CPU offline error (-1 == -EPERM) # modprobe cpu-notifier-error-inject cpu_down_prepare_error=-1 # echo 0 > /sys/devices/system/cpu/cpu1/online bash: echo: write error: Operation not permitted Example2: inject CPU online error (-2 == -ENOENT) # modprobe cpu-notifier-error-inject cpu_up_prepare_error=-2 # echo 1 > /sys/devices/system/cpu/cpu1/online bash: echo: write error: No such file or directory [akpm@linux-foundation.org: fix Kconfig help text] Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for raid5. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	s390: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for s390. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ehca: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for ehca. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Hoang-Nam Nguyen <hnguyen@de.ibm.com> Cc: Christoph Raisch <raisch@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	iucv: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for iucv. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	slab: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for slab. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	kernel/: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for kernel/*.c Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	topology: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for topology. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: convert cpu notifier to return encapsulate errno value	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \|	By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for msr, cpuid, and therm_throt. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	notifier: change notifier_from_errno(0) to return NOTIFY_OK	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This changes notifier_from_errno(0) to be NOTIFY_OK instead of NOTIFY_STOP_MASK \| NOTIFY_OK. Currently, the notifiers which return encapsulated errno value have to do something like this: err = do_something(); // returns -errno if (err) return notifier_from_errno(err); else return NOTIFY_OK; This change makes the above code simple: err = do_something(); // returns -errno return return notifier_from_errno(err); Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	cpu-hotplug: return better errno on cpu hotplug failure	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, onlining or offlining a CPU failure by one of the cpu notifiers error always cause -EINVAL error. (i.e. writing 0 or 1 to /sys/devices/system/cpu/cpuX/online gets EINVAL) To get better error reporting rather than always getting -EINVAL, This changes cpu_notify() to return -errno value with notifier_to_errno() and fix the callers. Now that cpu notifiers can return encapsulate errno value. Currently, all cpu hotplug notifiers return NOTIFY_OK, NOTIFY_BAD, or NOTIFY_DONE. So cpu_notify() can returns 0 or -EPERM with this change for now. (notifier_to_errno(NOTIFY_OK) == 0, notifier_to_errno(NOTIFY_DONE) == 0, notifier_to_errno(NOTIFY_BAD) == -EPERM) Forthcoming patches convert several cpu notifiers to return encapsulate errno value with notifier_from_errno(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	cpu-hotplug: introduce cpu_notify(), __cpu_notify(), cpu_notify_nofail()	Akinobu Mita	2010-05-27
\| \| \| \| \| \| \| \| \| \|	No functional change. These are just wrappers of raw_cpu_notifier_call_chain. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	kcore: add _text to KCORE_TEXT	Wu Fengguang	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extend KCORE_TEXT to cover the pages between _text and _stext, to allow examining some important page table pages. `readelf -a` output on x86_64 before and after patch: Type Offset VirtAddr PhysAddr before LOAD 0x00007fff8100c000 0xffffffff81009000 0x0000000000000000 after LOAD 0x00007fff81003000 0xffffffff81000000 0x0000000000000000 The newly covered pages are: 0xffffffff81000000 <startup_64> etc. 0xffffffff81001000 <init_level4_pgt> 0xffffffff81002000 <level3_ident_pgt> 0xffffffff81003000 <level3_kernel_pgt> 0xffffffff81004000 <level2_fixmap_pgt> 0xffffffff81005000 <level1_fixmap_pgt> 0xffffffff81006000 <level2_ident_pgt> 0xffffffff81007000 <level2_kernel_pgt> 0xffffffff81008000 <level2_spare_pgt> Before patch, /proc/kcore shows outdated contents for the above page table pages, for example: (gdb) p level3_ident_pgt $1 = {<text variable, no debug info>} 0xffffffff81002000 <level3_ident_pgt> (gdb) p/x ((pud_t )&level3_ident_pgt)@512 $2 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} while the real content is: root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem 1002000 6063 0100 0000 0000 8067 0000 0000 0000 1002010 0000 0000 0000 0000 0000 0000 0000 0000 * 1003000 That is, on a x86_64 box with 2GB memory, we can see first-1GB / full-2GB identity mapping before/after patch: (gdb) p/x ((pud_t )&level3_ident_pgt)@512 before $1 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} <repeats 510 times>} Obviously the content before patch is wrong. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: remove obsolete comments	Amerigo Wang	2010-05-27
\| \| \| \| \| \| \| \| \|	A quick test shows these comments are obsolete, so just remove them. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: cleanup: remove unused assignments	Dan Carpenter	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \|	I removed 3 unused assignments. The first two get reset on the first statement of their functions. For "err" in root.c we don't return an error and we don't use the variable again. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: turn signal_struct->count into "int nr_threads"	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	No functional changes, just s/atomic_t count/int nr_threads/. With the recent changes this counter has a single user, get_nr_threads() And, none of its callers need the really accurate number of threads, not to mention each caller obviously races with fork/exit. It is only used to report this value to the user-space, except first_tid() uses it to avoid the unnecessary while_each_thread() loop in the unlikely case. It is a bit sad we need a word in struct signal_struct for this, perhaps we can change get_nr_threads() to approximate the number of threads using signal->live and kill ->nr_threads later. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	keyctl_session_to_parent(): use thread_group_empty() to check singlethreadness	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	No functional changes. keyctl_session_to_parent() is the only user of signal->count which needs the correct value. Change it to use thread_group_empty() instead, this must be strictly equivalent under tasklist, and imho looks better. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc_sched_show_task(): use get_nr_threads()	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Trivial, use get_nr_threads() helper to read signal->count which we are going to change. Like other callers, proc_sched_show_task() doesn't need the exactly precise nr_threads. David said: : Note that get_nr_threads() isn't completely equivalent (it can return 0 : where proc_sched_show_task() will display a 1). But I don't think this : should be a problem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: get_nr_threads() doesn't need ->siglock any longer	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that task->signal can't go away get_nr_threads() doesn't need ->siglock to read signal->count. Also, make it inline, move into sched.h, and convert 2 other proc users of signal->count to use this (now trivial) helper. Henceforth get_nr_threads() is the only valid user of signal->count, we are ready to turn it into "int nr_threads" or, perhaps, kill it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	check_unshare_flags: kill the bogus CLONE_SIGHAND/sig->count check	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the task is multithreaded to ensure unshare_thread() will fail. Not only this is a bit strange way to return the error, this is absolutely meaningless. If signal->count > 1 then sighand->count must be also > 1, and unshare_sighand() will fail anyway. In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not look right. Fortunately this code doesn't really work anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exit: move taskstats_tgid_free() from __exit_signal() to free_signal_struct()	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move taskstats_tgid_free() from __exit_signal() to free_signal_struct(). This way signal->stats never points to nowhere and we can read ->stats lockless. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	kill the obsolete thread_group_cputime_free() helper	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \|	Kill the empty thread_group_cputime_free() helper. It was needed to free the per-cpu data which we no longer have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exit: __exit_signal: use thread_group_leader() consistently	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Cleanup: - Add the boolean, group_dead = thread_group_leader(), for clarity. - Do not test/set sig == NULL to detect the all-dead case, use this boolean. - Pass this boolen to __unhash_process() and use it instead of another thread_group_leader() call which needs ->group_leader. This can be considered as microoptimization, but hopefully this also allows us do do other cleanups later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	signals: kill the awful task_rq_unlock_wait() hack	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that task->signal can't go away we can revert the horrible hack added by ad474caca3e2a0550b7ce0706527ad5ab389a4d4 ("fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock"). And we can do more cleanups sched_stats.h/posix-cpu-timers.c later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	signals: clear signal->tty when the last thread exits	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the last thread exits signal->tty is freed, but the pointer is not cleared and points to nowhere. This is OK. Nobody should use signal->tty lockless, and it is no longer possible to take ->siglock. However this looks wrong even if correct, and the nice OOPS is better than subtle and hard to find bugs. Change __exit_signal() to clear signal->tty under ->siglock. Note: __exit_signal() needs more cleanups. It should not check "sig != NULL" to detect the all-dead case and we have the same issues with signal->stats. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	signals: make task_struct->signal immutable/refcountable	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have a lot of problems with accessing task_struct->signal, it can "disappear" at any moment. Even current can't use its ->signal safely after exit_notify(). ->siglock helps, but it is not convenient, not always possible, and sometimes it makes sense to use task->signal even after this task has already dead. This patch adds the reference counter, sigcnt, into signal_struct. This reference is owned by task_struct and it is dropped in __put_task_struct(). Perhaps it makes sense to export get/put_signal_struct() later, but currently I don't see the immediate reason. Rename __cleanup_signal() to free_signal_struct() and unexport it. With the previous changes it does nothing except kmem_cache_free(). Change __exit_signal() to not clear/free ->signal, it will be freed when the last reference to any thread in the thread group goes away. Note: - when the last thead exits signal->tty can point to nowhere, see the next patch. - with or without this patch signal_struct->count should go away, or at least it should be "int nr_threads" for fs/proc. This will be addressed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fork/exit: move tty_kref_put() outside of __cleanup_signal()	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tty_kref_put() has two callsites in copy_process() paths, 1. if copy_process() suceeds it is called before we copy signal->tty from parent 2. otherwise it is called from __cleanup_signal() under bad_fork_cleanup_signal: label In both cases tty_kref_put() is not right and unneeded because we don't have the balancing tty_kref_get(). Fortunately, this is harmless because this can only happen without CLONE_THREAD, and in this case signal->tty must be NULL. Remove tty_kref_put() from copy_process() and __cleanup_signal(), and change another caller of __cleanup_signal(), __exit_signal(), to call tty_kref_put() by hand. I hope this change makes sense by itself, but it is also needed to make ->signal refcountable. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alan Cox <alan@linux.intel.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ia64: ptrace_attach_sync_user_rbs: avoid "task->signal != NULL" checks	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Preparation to make task->signal immutable, no functional changes. It doesn't matter which pointer we check under tasklist to ensure the task was not released, ->signal or ->sighand. But we are going to make ->signal refcountable, change the code to use ->sighand. Note: this code doesn't need this check and tasklist_lock at all, it should be converted to use lock_task_sighand(). And, the code under SIGNAL_STOP_STOPPED check looks wrong. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Roland McGrath <roland@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	posix-cpu-timers: avoid "task->signal != NULL" checks	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Preparation to make task->signal immutable, no functional changes. posix-cpu-timers.c checks task->signal != NULL to ensure this task is alive and didn't pass __exit_signal(). This is correct but we are going to change the lifetime rules for ->signal and never reset this pointer. Change the code to check ->sighand instead, it doesn't matter which pointer we check under tasklist, they both are cleared simultaneously. As Roland pointed out, some of these changes are not strictly needed and probably it makes sense to revert them later, when ->signal will be pinned to task_struct. But this patch tries to ensure the subsequent changes in fork/exit can't make any visible impact on posix cpu timers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exit: avoid sig->count in __exit_signal() to detect the group-dead case	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Change __exit_signal() to check thread_group_leader() instead of atomic_dec_and_test(&sig->count). This must be equivalent, the group leader must be released only after all other threads have exited and passed __exit_signal(). Henceforth sig->count is not actually used, except in fs/proc for get_nr_threads/etc. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exit: avoid sig->count in de_thread/__exit_signal synchronization	Oleg Nesterov	2010-05-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	de_thread() and __exit_signal() use signal_struct->count/notify_count for synchronization. We can simplify the code and use ->notify_count only. Instead of comparing these two counters, we can change de_thread() to set ->notify_count = nr_of_sub_threads, then change __exit_signal() to dec-and-test this counter and notify group_exit_task. Note that __exit_signal() checks "notify_count > 0" just for symmetry with exit_notify(), we could just check it is != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>